Contributor
Elisa Gómez de Lope

DeepChem meets Hugging Face pLM ESM-2: Predicting protein binding sites. A tutorial.


Mentors
Rakshit Kumar Singh
Organization
DeepChem
Technologies
python
Topics
machine learning, Drug Discovery, protein language model, protein binding sites
Protein-ligand interactions play a crucial role in many biological processes. Accurately predicting protein binding sites, the residues where ligand molecules interact, is essential for drug discovery and for understanding protein function. The approaches currently integrated in DeepChem, such as (geometric) deep learning and molecular fingerprints, focus mainly on predicting protein-ligand binding affinity; they require complex input data, are computationally expensive, or both.

This project focuses on integrating ESM-2 (a state-of-the-art protein language model developed by Meta AI and distributed through Hugging Face) with DeepChem for protein binding site prediction. The integration will enable the DeepChem library to use the protein representations learned by ESM-2. A detailed tutorial will provide not only a streamlined workflow for binding site prediction, but also a guide that helps researchers apply protein language models to further tasks within DeepChem. The project broadens DeepChem's toolbox, making it more versatile for the drug discovery community and ultimately accelerating progress in this area.

The main deliverables are:
- A function for feature extraction using the pre-trained ESM-2 model.
- Integration of the extracted features with binding site information.
- A DeepChem model architecture for protein binding site prediction.
- Reports on model performance and evaluation metrics.
- A tutorial documenting the complete workflow, including code, data pre-processing steps, model training details, and visualizations of predicted binding sites (if possible).

If time allows, one of two extensions will be developed: an additional tutorial showcasing ESM-2 protein representations applied to a different task (e.g., generating peptide binders for target proteins), or a LoRA wrapper for the ESM-2 model with a corresponding tutorial explaining its functionality. Alternatively, both will be documented as open issues on the DeepChem GitHub repository for future development.
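
As an illustration of the deliverable on integrating extracted features with binding site information, here is a minimal sketch in plain Python. It is not the project's implementation: the function names `binding_site_labels` and `pair_features_with_labels` are hypothetical, binding site positions are assumed to be 1-based residue numbers, and the feature matrix is assumed to carry one extra row at each end for the special start and end tokens that the ESM-2 tokenizer adds. The placeholder features stand in for real per-residue embeddings.

```python
from typing import List, Sequence, Tuple


def binding_site_labels(sequence: str, binding_positions: Sequence[int]) -> List[int]:
    """Return one 0/1 label per residue (1 = binding site).

    Assumption: positions are 1-based residue numbers.
    """
    labels = [0] * len(sequence)
    for pos in binding_positions:
        if not 1 <= pos <= len(sequence):
            raise ValueError(
                f"position {pos} is outside the sequence (length {len(sequence)})"
            )
        labels[pos - 1] = 1
    return labels


def pair_features_with_labels(
    token_features: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[Tuple[Sequence[float], int]]:
    """Pair each residue's feature vector with its label.

    Assumption: the first and last rows belong to special tokens
    (start and end of sequence) and are dropped before pairing.
    """
    residue_features = token_features[1:-1]
    if len(residue_features) != len(labels):
        raise ValueError(
            f"{len(residue_features)} residue rows but {len(labels)} labels"
        )
    return list(zip(residue_features, labels))


# Toy example: an 8-residue sequence with binding residues 2, 3 and 7.
sequence = "MKTAYIAK"
labels = binding_site_labels(sequence, [2, 3, 7])

# Placeholder 2-dimensional features, one row per token including the
# two special tokens; real rows would come from the language model.
token_features = [[0.0, 0.0] for _ in range(len(sequence) + 2)]
pairs = pair_features_with_labels(token_features, labels)

print(labels)  # [0, 1, 1, 0, 0, 0, 1, 0]
```

Framing the task as per-residue binary classification keeps the labels the same length as the sequence, so each embedding row lines up with exactly one label once the special-token rows are removed.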