Graph-Based Feature Selection Model for Genes’ Phenotype Prediction


dc.contributor.author Mugwika, Consolata Gakii
dc.date.accessioned 2022-10-07T08:11:12Z
dc.date.available 2022-10-07T08:11:12Z
dc.date.issued 2022-10-07
dc.identifier.citation MugwikaCG2022 en_US
dc.identifier.uri http://localhost/xmlui/handle/123456789/5938
dc.description Doctor of Philosophy in Information Technology en_US
dc.description.abstract High-throughput sequencing technologies generate large volumes of data, effectively ushering the life sciences into the big data realm. Data generated using these technologies is often noisy and high-dimensional, so several preprocessing steps are required before computational analysis. Dimensionality reduction methods typically evaluate each feature individually rather than considering the interactions or dependencies between features. These relationships are important because they reflect the functional/phenotypic aspects of living systems. The aim of this study was to develop a graph-based network feature selection model for gene-phenotype prediction in high-dimensional RNAseq data. Three datasets were used: RNAseq data from the antennae of Glossina morsitans morsitans, Small Cell Lung Cancer (SCLC) and Non-small Cell Lung Cancer (NSCLC). Pre-processing involved quality checking, adapter trimming, contamination removal and quality filtering. Differential expression analysis was performed, and genes were considered differentially expressed and retained for further analysis if the p-value adjusted for false discovery rate (FDR) was less than 0.05. Feature selection was performed using Principal Component Analysis (PCA), Recursive Feature Elimination (RFE) and a graph-based approach. Equal Frequency Discretization (EFD) was used to transform the selected features from continuous numerical attributes into discrete values. Association rules were generated using a minimum support value between 0.5 and 0.9, a minimum confidence value of 0.9 and a lift of ≥2. Features from the three feature selection techniques were classified using three classifiers, namely Naïve Bayes, Sequential Minimal Optimization (SMO) and Multilayer Perceptron. 
Results from the quality trimming showed that the window-based algorithm performed better than the other two approaches, with the percentage of surviving reads ranging between 83.39% and 90.87%. Mapping results showed that the Burrows-Wheeler algorithm performed better than Bowtie2 in terms of alignment across all the samples, with accuracy values between 93% and 97.97%. During differential gene (feature) expression analysis, 2,097 low-count features were filtered out, leaving a final tally of 10,921 features. Three global networks with 2,110 nodes and 4,783 edges, 990 nodes and 3,154 edges, and 876 nodes and 3,676 edges were generated from the three datasets used in this study. The resulting networks were further filtered, and the final reduced networks had 51 nodes and 148 edges, 134 nodes and 396 edges, and 81 nodes and 169 edges, respectively. The proposed graph-based feature selection approach provided 15 and 36 non-redundant rules, respectively, from the two datasets at a support of 0.5, a confidence value of 0.9 and a lift of 2. The PCA and RFE feature selection methods did not generate any rules at a support of 0.5. The lower support values provided by the RFE feature selection approach imply that the features selected by this method were negatively correlated. For the PCA-based feature selection, support ranged between 0.405 and 0.425, which was lower than the support of the rules generated by the graph-based feature selection approach. The results of classification before and after feature selection showed a reduction in classifier model building time with minimal effect on accuracy. This study demonstrates that a graph-based feature selection approach combined with association rule mining can be very useful in associating genes of known function with genes of unknown function for phenotype prediction based on gene expression levels. en_US
dc.description.sponsorship Dr. Richard Rimiru, PhD, JKUAT, Kenya; Dr. Paul O. Mireji, PhD, BioRI-KALRO, Kenya en_US
dc.language.iso en en_US
dc.publisher JKUAT-COPAS en_US
dc.subject Graph-Based en_US
dc.subject Selection Model en_US
dc.subject Genes’ Phenotype Prediction en_US
dc.title Graph-Based Feature Selection Model for Genes’ Phenotype Prediction en_US
dc.type Thesis en_US
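
The abstract mentions Equal Frequency Discretization and association rules filtered by support, confidence and lift. The following is a minimal sketch of how those quantities can be computed, not the thesis implementation: the function names, the toy expression values and the gene labels (`gene_a`, `gene_b`) are all hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

Item = Tuple[str, int]  # (hypothetical gene name, bin index)


def equal_frequency_bins(values: Sequence[float], n_bins: int = 3) -> List[int]:
    """Assign each value a bin index so that bins hold roughly equal counts.

    Values are ranked, and the rank decides the bin. Tied values may
    therefore land in different bins in this simplified version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = min(rank * n_bins // len(values), n_bins - 1)
    return bins


def rule_metrics(
    transactions: Sequence[FrozenSet[Item]],
    antecedent: FrozenSet[Item],
    consequent: FrozenSet[Item],
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(antecedent <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift


# Toy expression levels for two hypothetical genes across six samples.
gene_a = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
gene_b = [5.0, 6.0, 7.0, 50.0, 60.0, 70.0]

bins_a = equal_frequency_bins(gene_a, n_bins=2)
bins_b = equal_frequency_bins(gene_b, n_bins=2)

# One transaction per sample: the set of (gene, bin) items observed in it.
transactions = [
    frozenset({("gene_a", a), ("gene_b", b)}) for a, b in zip(bins_a, bins_b)
]

# Rule: gene_a in the high bin -> gene_b in the high bin.
metrics = rule_metrics(
    transactions,
    antecedent=frozenset({("gene_a", 1)}),
    consequent=frozenset({("gene_b", 1)}),
)
print(metrics)
```

In this toy case the rule holds in half of the samples and never fails when its antecedent is present, so it would pass thresholds like those quoted in the abstract (support 0.5, confidence 0.9, lift 2).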

