Topic Model Deployment Framework and Gibbs Sampling-Based Quadratic Topics Approximation Method for Large Text Analytics

dc.contributor.author Wambugu, Geoffrey Mariga
dc.date.accessioned 2019-07-25T13:17:12Z
dc.date.available 2019-07-25T13:17:12Z
dc.date.issued 2019-07-25
dc.identifier.uri http://hdl.handle.net/123456789/5185
dc.description Doctor of Philosophy in Information Technology en_US
dc.description.abstract Probabilistic topic modelling provides computational methods for text analytics. Latent Dirichlet Allocation (LDA) is a popular Bayesian model that aids in discovering hidden thematic structure in large text datasets. Application of LDA requires that the Dirichlet prior parameters alpha and beta, and the number of topics, be specified. The performance of many machine learning methods depends critically on parameter settings. Currently, parameter estimation is based on Markov Chain Monte Carlo algorithms that must be run for many iterations before convergence. Further, deployment of topic models requires the definition of constituent constructs and their relationships. This thesis consists of two parts. In the first part, a Topic Model Deployment Framework (TMDF) is developed. TMDF identifies all components that are required for the deployment of topic models. The second part of the thesis deals with the development of the Quadratic Topics Approximation Method (QTAM) for fast estimation of the optimal number-of-topics input parameter. QTAM models minimum perplexity against the number of topics for any given dataset. Both the framework and the method are validated for application in large unstructured text data analytics. A Python experimental environment was set up on a Google Cloud Compute Engine custom machine and used to study parameter behaviour by manipulating selected existing datasets. Asymptotic time analyses as well as descriptive statistics were used to validate QTAM. A critical review of the literature and an expert opinion survey were used to identify, develop and validate TMDF. Results indicate an improvement in inferential speed for the number of topics (K) and the hyperparameter alpha, thereby enhancing LDA application. This implies that QTAM is more efficient in estimating the number-of-topics parameter than other existing methods currently used for developing text applications, especially those with large data processing requirements. Validation results also indicate that TMDF can be used in the deployment of topic models. Further study may be carried out to determine the best combination of pre-processing steps and topic model parameters, which were related to but outside the scope of this thesis. Another direction for further study is to conduct replication experiments to further strengthen the results of the new method with the aim of establishing it as an industry standard. en_US
dc.description.sponsorship Dr. George Okeyo, PhD, JKUAT, Kenya; Prof. Stephen Kimani, PhD, JKUAT, Kenya en_US
dc.language.iso en en_US
dc.publisher JKUAT-COPAS en_US
dc.subject Large Text Analytics en_US
dc.subject Quadratic Topics Approximation Method en_US
dc.subject Gibbs Sampling-Based en_US
dc.title Topic Model Deployment Framework and Gibbs Sampling-Based Quadratic Topics Approximation Method for Large Text Analytics en_US
dc.type Thesis en_US
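
The abstract above describes QTAM as modelling minimum perplexity against the number of topics. The following is a minimal sketch of that quadratic-approximation idea only, not the thesis's actual implementation: it fits a parabola to hypothetical (K, perplexity) pairs with NumPy and takes the vertex of the fitted curve as the estimated optimal K. The candidate K grid, the perplexity values and the function name estimate_optimal_k are illustrative assumptions.

# Illustrative sketch of the quadratic approximation described in the abstract:
# fit p(K) ~= a*K^2 + b*K + c to observed minimum perplexities and take the
# vertex K* = -b / (2a) as the estimated optimal number of topics.
# The K grid and perplexity values below are hypothetical placeholders.
import numpy as np

def estimate_optimal_k(ks, perplexities):
    """Fit a quadratic to (K, perplexity) pairs and return the K at its minimum."""
    a, b, c = np.polyfit(ks, perplexities, deg=2)
    if a <= 0:
        raise ValueError("Fitted curve has no minimum; widen the candidate K grid.")
    return -b / (2.0 * a)

if __name__ == "__main__":
    ks = np.array([10, 20, 40, 80, 160])                # candidate topic counts (hypothetical)
    perplexities = np.array([950, 870, 820, 845, 930])  # minimum perplexity per K (hypothetical)
    print(f"Estimated optimal number of topics: {estimate_optimal_k(ks, perplexities):.0f}")

In practice the perplexity values would come from LDA runs (for example, held-out perplexity reported by a Gibbs-sampling LDA implementation) at each candidate K; that step is omitted here to keep the sketch library-agnostic.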

