Language of Proteins

Source: Figure after © 2010 PJ Russell, iGenetics 3rd ed.

Drug discovery and development is a time- and labor-intensive process that next-generation protein sequencing techniques could help accelerate.

Recent breakthroughs in Natural Language Processing (NLP) and the training of large transformer models have paved the way for new types of domain-specific deep language models. In the biological domain, predicting the functional type of a protein could, among other things, facilitate the development of safer and more effective drugs and treatments.

Proteins can be represented as sequences of tokens, where each token is an amino acid (e.g., GMASKAGSVLGKITKIALGAL). Peptides are short amino-acid chains, i.e. proteins of relatively small length.
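
To make the token view concrete, here is a minimal sketch that maps each residue of the example sequence to an integer ID. The vocabulary layout (`TOKEN_TO_ID`, with ID 0 reserved for unknown residues) and the `tokenize` helper are illustrative assumptions, not the tokenizers used by ProSE or ESM.

```python
# The 20 standard amino acids, written as single-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: ID 0 is reserved for unknown or non-standard residues.
UNKNOWN_ID = 0
TOKEN_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence: str) -> list[int]:
    """Map each residue in a protein sequence to an integer token ID."""
    return [TOKEN_TO_ID.get(residue, UNKNOWN_ID) for residue in sequence.upper()]


peptide = "GMASKAGSVLGKITKIALGAL"
token_ids = tokenize(peptide)
print(len(token_ids), token_ids[:5])  # 21 [6, 11, 1, 16, 9]
```

Pre-trained protein language models ship their own vocabularies and special tokens, so in practice the model's own tokenizer should be used rather than a hand-built mapping like this one.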

This was a FourthBrain group capstone project. Three datasets, one for each of the following types of proteins, were provided by InstaDeep:

- Anticancer Peptides (ACP)
- Antimicrobial Peptides (AMP)
- DNA-Binding Proteins (DBP)

For each of these datasets, we developed a system that predicts, from a sequence of amino acids, whether the sequence belongs to the corresponding class (ACP, AMP, or DBP):

1. Embeddings – we used the following pre-trained protein language models to get embeddings for our sequences:
   - ProSE (Protein Sequence Embeddings)
   - ESM (Evolutionary Scale Modeling)
2. Modelling – we tested a variety of classical ML models on the computed embeddings to predict the labels for the sequences.
3. Results – the results of our modelling are reported in the results directory; based on them, we selected a model for each task for deployment.
4. Deployment – our application was deployed using FastAPI and an HTML template, and containerized using Docker. For our Demo Day, we deployed the Docker container on an AWS EC2 instance.
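
The embedding-then-classify approach described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the embeddings are random placeholders standing in for ProSE or ESM output (one fixed-length vector per sequence), and the embedding size, PCA dimension, and choice of Logistic Regression are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: 200 sequences, each already reduced to one 128-dimensional
# embedding vector. Real embeddings would come from ProSE or ESM instead.
n_sequences, embedding_dim = 200, 128
labels = rng.integers(0, 2, size=n_sequences)  # 1 = positive class (e.g. ACP)
embeddings = rng.normal(size=(n_sequences, embedding_dim))
embeddings[labels == 1] += 0.5  # shift positives so the toy task is learnable

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.25, stratify=labels, random_state=0
)

# Scale, reduce dimensionality with PCA, then fit a classical classifier.
classifier = make_pipeline(
    StandardScaler(),
    PCA(n_components=16),
    LogisticRegression(max_iter=1000),
)
classifier.fit(X_train, y_train)

test_accuracy = classifier.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Wrapping the scaler, PCA, and classifier in one pipeline means the preprocessing is fitted on the training split only, which avoids leaking information from the held-out sequences.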

All code can be found in the notebooks and Python scripts linked above [1, 2, 4] and in the scripts folder.

Our full presentation deck can be found here.

Tools/techniques used: Python, Jupyter Notebook, Seaborn, Matplotlib, Scikit-learn, Pandas, pipelines, PCA, FastAPI, hyperparameter tuning, Docker
Algorithms used: Logistic Regression, SVM, Random Forest, XGBoost
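
As a rough sketch of how several of these algorithms could be compared with hyperparameter tuning, the example below runs a small grid search per model on synthetic data. The parameter grids, the F1 scoring metric, and the 3-fold cross-validation are assumptions chosen for illustration. XGBoost is left out only to keep the example dependent on scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for precomputed sequence embeddings and binary labels.
y = rng.integers(0, 2, size=120)
X = rng.normal(size=(120, 32))
X[y == 1] += 0.75  # shift positives so the toy task is learnable

# Hypothetical candidate models with deliberately small parameter grids.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "svm": (SVC(), {"C": [0.1, 1.0], "kernel": ["linear", "rbf"]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

# Tune each model with cross-validated grid search and keep its best result.
results = {}
for name, (estimator, param_grid) in candidates.items():
    search = GridSearchCV(estimator, param_grid, cv=3, scoring="f1")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Report the models from best to worst mean cross-validated F1.
for name, (score, params) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: mean CV F1 = {score:.3f}, best params = {params}")
```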

GitHub Repository for this project.