Abstract:
Probabilistic topic modelling provides computational methods for text analytics. Latent Dirichlet Allocation (LDA) is a popular Bayesian model that aids in discovering hidden thematic structure in large text datasets. Applying LDA requires specifying the Dirichlet prior parameters alpha and beta and the number of topics. The performance of LDA, like that of many machine learning methods, depends critically on these parameter settings. Currently, parameter estimation is based on Markov chain Monte Carlo (MCMC) algorithms that must run for many iterations before converging. Further, deploying topic models requires defining their constituent constructs and the relationships among them. This thesis consists of two parts. In the first part, a topic model deployment framework (TMDF) is developed. TMDF identifies all components required for the deployment of topic models. The second part develops the Quadratic Topics Approximation Method (QTAM) for fast estimation of the optimal number of topics. QTAM models minimum perplexity against the number of topics for any given dataset. Both the framework and the method are validated for application in large-scale unstructured text data analytics. A Python experimental environment was set up on a Google Cloud Compute Engine custom machine and used to study parameter behaviour by manipulating selected existing datasets. Asymptotic time analyses and descriptive statistics were used to validate QTAM. A critical review of the literature and an expert opinion survey were used to identify, develop and validate TMDF. Results indicate improved inferential speed for the number of topics (K) and the hyperparameter alpha, thereby enhancing the application of LDA. This implies that QTAM estimates the number of topics more efficiently than existing methods currently used to develop text applications, especially those with large data processing requirements.
Validation results also indicate that TMDF can be used in the deployment of topic models. Further study may be carried out to determine the best combination of pre-processing steps and topic model parameters, a question related to but outside the scope of this thesis. Another direction is to conduct replication experiments to strengthen the results of the new method, with the aim of establishing it as an industry standard.
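
The idea of relating perplexity to the number of topics through a quadratic can be illustrated with a minimal sketch. The abstract does not give QTAM's actual formulation, so everything below is an assumption made for illustration only: perplexity is assumed to have been measured at a handful of candidate values of K, a second-degree polynomial is fitted with NumPy, and the vertex of the fitted curve is taken as the estimate. The function name `estimate_num_topics` and the sample values are hypothetical and are not results from the thesis.

```python
import numpy as np


def estimate_num_topics(topic_counts, perplexities):
    """Estimate K from a quadratic fit of perplexity against K.

    Illustrative assumption, not the thesis's QTAM definition:
    fit perplexity ~ a*K^2 + b*K + c and return the K at the vertex.
    """
    k = np.asarray(topic_counts, dtype=float)
    p = np.asarray(perplexities, dtype=float)
    if k.size < 3:
        raise ValueError("need at least three (K, perplexity) pairs")

    # Least-squares fit; coefficients are returned highest degree first.
    a, b, _c = np.polyfit(k, p, deg=2)
    if a <= 0:
        # A concave or flat fit has no interior minimum.
        raise ValueError("fitted quadratic has no minimum")

    # Vertex of the parabola, clamped to the sampled range of K
    # so the estimate is never an extrapolation.
    k_star = float(-b / (2.0 * a))
    k_star = min(max(k_star, float(k.min())), float(k.max()))
    return int(round(k_star))


if __name__ == "__main__":
    # Synthetic, made-up measurements for demonstration only.
    sampled_k = [10, 20, 30, 40, 50, 60, 70, 80]
    sampled_perplexity = [(k - 40) ** 2 + 1000 for k in sampled_k]
    print(estimate_num_topics(sampled_k, sampled_perplexity))
```

Under these assumptions only a few LDA models need to be trained to obtain the sampled perplexities, which is where a speed-up over an exhaustive search across K would come from.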