Purdue University Graduate School
Junhui_Wang_dissertation.pdf (3.72 MB)

SYSTEMATICALLY LEARNING OF INTERNAL RIBOSOME ENTRY SITE AND PREDICTION BY MACHINE LEARNING

thesis
posted on 2019-05-15, 19:48 authored by Junhui Wang

Internal ribosome entry sites (IRES) are segments of mRNA found in untranslated regions that can recruit the ribosome and initiate translation independently of the more widely used 5’ cap-dependent translation initiation mechanism. IRES play an important role in conditions where 5’ cap-dependent translation initiation has been blocked or repressed. They have been found to play important roles in viral infection, cellular apoptosis, and response to other external stimuli. It has been suggested that about 10% of mRNAs, both viral and cellular, can utilize IRES. However, due to the limitations of the bicistronic assay, which is the gold standard for identifying IRES, relatively few IRES have been definitively described and functionally validated compared to the potential overall population. Viral and cellular IRES may be mechanistically different, but this is difficult to analyze because the mechanistic differences are still not clearly defined. Identifying additional IRES is an important step towards better understanding IRES mechanisms. Development of a new bioinformatics tool that can accurately predict IRES from sequence would be a significant step forward in identifying IRES-based regulation and in elucidating IRES mechanisms. This dissertation systematically studies the features that can distinguish IRES from non-IRES sequences. Sequence features such as kmer words, and structural features such as the predicted minimum free energy (MFE) of folding, QMFE, and sequence/structure triplets, are evaluated as possible discriminative features. These potential features are incorporated into an IRES classifier based on XGBoost, a machine learning model, to classify novel sequences as belonging to the IRES or non-IRES group. The XGBoost model performs better than previous predictors, with higher accuracy and lower computational time. The number of features in the model has been greatly reduced, compared to previous predictors, by adding global kmer and structural features.
The trained XGBoost model has been implemented as the first high-throughput bioinformatics tool for IRES prediction, IRESpy. This website provides a public tool for all IRES researchers and can be used in other genomics applications such as gene annotation and analysis of differential gene expression.
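
As a rough illustration of the "kmer words" sequence features mentioned in the abstract, the sketch below counts overlapping k-mers in an RNA sequence and turns them into a fixed-length frequency vector. It is not taken from the dissertation or from IRESpy: the function name `kmer_frequencies`, the choice of relative frequencies, and the example sequence are assumptions made only for illustration. A vector like this could, in principle, be one of the inputs to a gradient-boosted classifier such as XGBoost.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k: int = 2) -> Dict[str, float]:
    """Return the relative frequency of every possible RNA k-mer in `sequence`.

    Hypothetical helper for illustration; not the dissertation's implementation.
    """
    alphabet = "ACGU"
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input as RNA
    # Count overlapping windows of length k, skipping windows with ambiguous bases.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set(alphabet)
    )
    total = sum(counts.values())
    # Emit all 4**k words in a fixed order so every sequence yields the same columns.
    return {
        "".join(word): (counts["".join(word)] / total if total else 0.0)
        for word in product(alphabet, repeat=k)
    }


# Made-up example sequence, used only to show the call.
features = kmer_frequencies("AUGGCUAUGC", k=2)
```

Emitting every possible word, including those with zero counts, keeps the feature columns aligned across sequences, which is what a tabular classifier expects.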

Degree Type

  • Doctor of Philosophy

Department

  • Biological Sciences

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Dr. Michael Gribskov

Additional Committee Member 2

Dr. Richard J. Kuhn

Additional Committee Member 3

Dr. Daisuke Kihara

Additional Committee Member 4

Dr. Barbara L. Golden
