Abstract:
The exponential growth of published information presents significant challenges in managing and extracting meaningful insights from unstructured data. Named Entity Recognition (NER) plays a critical role in transforming such data into structured, machine-readable formats by identifying named entities and categorizing them into predefined semantic classes. Extracting meaningful entities from informal, unstructured text, especially in low-resource settings, remains a major challenge. This research aimed to develop and evaluate a memory-based NER framework tailored for Kikuyu, a low-resource Bantu language. A k-nearest neighbor (KNN) approach was implemented with TiMBL (Tilburg Memory-Based Learner) to classify entities, using manually annotated data comprising 17k words from the Kikuyu Bible corpus. The experimental methodology incorporated quality control measures such as inter-annotator agreement and cross-validation to ensure reliability. Experimental results demonstrated promising performance, achieving 72.54% precision, 72.67% recall, and a 72.5% F-score. These findings underscore the viability of memory-based learning for NER in resource-scarce languages and contribute a novel annotated Kikuyu corpus and framework for future linguistic and Natural Language Processing (NLP) applications.
Keywords: Named Entity Recognition, Memory-based learning algorithms, Semantic Web problem, Question-Answering Systems, Natural Language Processing.