INVESTIGATING THE SEQUENCE ELEMENTS THAT AFFECT THE TRANSLATION EFFICIENCY OF THE FUNGAL PATHOGEN HISTOPLASMA CAPSULATUM
by Annika Viswesh
Category: STEM
Abstract – Histoplasmosis is caused by the dimorphic switch of the fungus Histoplasma capsulatum. Predicting the translational efficiency (TE) of Histoplasma capsulatum genes will lead to techniques that can regulate its protein production and thereby help in the treatment of histoplasmosis. However, which sequence elements in the mRNA determine TE in Histoplasma is not well understood. The 5' untranslated regions (UTRs) of 4981 genes common to 4 strains of Histoplasma were explored to identify the correlation between TE and the following sequence elements: the five longest upstream open reading frames (uORFs) with start codon ATG, the length of the 5' UTR, the energy of the constrained RNA secondary structure, and the CG-to-ATG ratio. The analysis used Wilcoxon tests, normal distribution plots, and the area under the receiver operating characteristic (ROC) curve. Subsequently, using all of these sequence elements as features, four computational models were developed with different machine learning algorithms to predict TE. The results demonstrate that the maximum length of a uORF with start codon ATG and the CG-to-ATG ratio correlate best with TE, with the highest areas under the curve (AUC) among all sequence elements at 0.74 and 0.79, respectively. In addition, the computational model built with Random Forest outperformed the other models, predicting TE with an AUC of 0.85. This research helped identify a set of sequence elements that affect TE in Histoplasma capsulatum and also showed that computational models can be c