Topic Model Deployment Framework and Gibbs Sampling-Based Quadratic Topics Approximation Method for Large Text Analytics


dc.contributor.author Wambugu, Geoffrey Mariga
dc.date.accessioned 2019-07-25T13:17:12Z
dc.date.available 2019-07-25T13:17:12Z
dc.date.issued 2019-07-25
dc.identifier.uri http://hdl.handle.net/123456789/5185
dc.description Doctor of Philosophy in Information Technology en_US
dc.description.abstract Probabilistic topic modelling provides computational methods for text analytics. Latent Dirichlet Allocation (LDA) is a popular Bayesian model that aids in discovering hidden thematic structure in large text datasets. Applying LDA requires that the Dirichlet prior parameters alpha and beta, and the number of topics, be specified. The performance of many machine learning methods depends critically on parameter settings. Currently, parameter estimation is based on Markov Chain Monte Carlo algorithms that must be run for many iterations before convergence. Further, deployment of topic models requires definition of the constituent constructs and their relationships. This thesis consists of two parts. In the first part, a Topic Model Deployment Framework (TMDF) is developed. TMDF identifies all components required for the deployment of topic models. The second part of the thesis develops the Quadratic Topics Approximation Method (QTAM) for fast estimation of the optimal number-of-topics input parameter. QTAM models minimum perplexity against the number of topics for any given dataset. Both the framework and the method are validated for application in large unstructured text data analytics. A Python experimental environment was set up on a custom machine in the Google Cloud compute engine and used to study parameter behaviour by manipulating selected existing datasets. Asymptotic time analyses as well as descriptive statistics were used to validate QTAM. A critical review of the literature and an expert opinion survey were used to identify, develop and validate TMDF. Results indicate an improvement in inferential speed for the number of topics (K) and the hyperparameter alpha, thereby enhancing LDA application. This implies that QTAM estimates the number-of-topics parameter more efficiently than other existing methods currently used for developing text applications, especially those with large data processing requirements.
Validation results also indicate that TMDF can be used in the deployment of topic models. Further study may be carried out to determine the best combination of pre-processing steps and topic model parameters, which was outside the scope of this thesis but otherwise related. Another direction for further study is to conduct replication experiments to further strengthen the results of the new method, with the aim of establishing it as an industry standard. en_US
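The abstract describes QTAM as modelling minimum perplexity against the number of topics K. As an illustration only, not the thesis's actual algorithm, the core idea can be sketched by fitting a quadratic to (K, perplexity) pairs and taking the parabola's vertex as an estimate of the optimal K. All data values below are invented for the sketch.

```python
import numpy as np

# Hypothetical (K, minimum-perplexity) measurements for a corpus.
# In practice each perplexity value would come from an LDA run at that K;
# these numbers are made up purely to illustrate the quadratic fit.
ks = np.array([5, 10, 20, 40, 80], dtype=float)
perplexity = np.array([950.0, 780.0, 700.0, 720.0, 810.0])

# Fit perplexity(K) ~ a*K^2 + b*K + c by least squares.
a, b, c = np.polyfit(ks, perplexity, deg=2)

# For an upward-opening parabola (a > 0) the vertex -b/(2a) is the
# minimum, i.e. the estimated optimal number of topics.
k_opt = -b / (2 * a)
print(f"estimated optimal K ≈ {k_opt:.1f}")
```

The appeal of such an approximation, as the abstract notes, is speed: a handful of LDA runs plus a closed-form vertex is far cheaper than sweeping every candidate K to convergence.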
dc.description.sponsorship Dr. George Okeyo, PhD JKUAT, Kenya Prof. Stephen Kimani, PhD JKUAT, Kenya  en_US
dc.language.iso en en_US
dc.publisher JKUAT-COPAS en_US
dc.subject Large Text Analytics en_US
dc.subject Quadratic Topics Approximation Method en_US
dc.subject Gibbs Sampling-Based en_US
dc.title Topic Model Deployment Framework and Gibbs Sampling-Based Quadratic Topics Approximation Method for Large Text Analytics en_US
dc.type Thesis en_US

