Contributor: Abhilash Dhal

Automated Curation and Harmonization of cBioPortal Clinical Metadata using Sentence Transformers

Mentors: Sehyun Oh, Jonathan Davenport, Michele W.
Organization: cBioPortal for Cancer Genomics
Technologies: python
Topics: machine learning, bioinformatics

Omics data repositories often contain heterogeneous data from multiple studies and diverse sources. This lack of structure in the metadata hinders the development of new algorithms and the application of machine learning or deep learning to cross-study datasets. Some work has already been done under this project: manual review of the metadata schema, consolidation of similar or identical information spread across the schema, and incorporation of ontologies where possible. This manual harmonization of cBioPortal’s key clinical metadata across the whole data repository, not just at the study level, together with the incorporation of ontology terms, has improved the AI/ML-readiness of the cBioPortal data. We want to take this work further by harmonizing new and incoming data into the format of the established data dictionary in an automated fashion, with minimal manual curation. For this purpose, we will explore advanced natural language processing techniques, particularly sentence transformers, to automate the curation of clinical metadata within the cBioPortal platform. Additionally, we will create an interactive dashboard for visualizing and potentially editing the automated harmonization results, enhancing user accessibility and control over the curation process.
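
To make the intended matching step concrete, here is a minimal sketch of how an incoming attribute name could be mapped to the closest data dictionary term by embedding similarity. It is not the project's implementation. The function names, the example terms, and the 0.5 threshold are all hypothetical, and the character n-gram embedding is only a toy stand-in so the example runs without downloading a model. In practice the `embed` argument would be replaced by a sentence-transformer encoder (for example the `encode` method of a `SentenceTransformer` model), with the cosine function adapted to dense vectors.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def char_ngram_embed(text: str, n: int = 3) -> Vector:
    """Toy stand-in for a sentence embedding: character n-gram counts.

    Assumption: a real pipeline would use a sentence-transformer model here.
    """
    # Lowercase and turn every non-alphanumeric character into a space.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    # Collapse repeated spaces and pad so word boundaries form n-grams too.
    padded = " " + " ".join(cleaned.split()) + " "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {gram: float(count) for gram, count in grams.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_dictionary(
    incoming: List[str],
    dictionary: List[str],
    embed: Callable[[str], Vector] = char_ngram_embed,
    threshold: float = 0.5,  # hypothetical cutoff; would need tuning
) -> Dict[str, Tuple[Optional[str], float]]:
    """Map each incoming attribute to its most similar dictionary term.

    Attributes whose best score is below `threshold` map to None, which
    flags them for manual review instead of forcing a match.
    """
    if not dictionary:
        raise ValueError("dictionary must contain at least one term")
    dict_vectors = [(term, embed(term)) for term in dictionary]
    result: Dict[str, Tuple[Optional[str], float]] = {}
    for name in incoming:
        vector = embed(name)
        best_term, best_score = max(
            ((term, cosine(vector, dv)) for term, dv in dict_vectors),
            key=lambda pair: pair[1],
        )
        matched = best_term if best_score >= threshold else None
        result[name] = (matched, best_score)
    return result


if __name__ == "__main__":
    # Hypothetical dictionary terms and incoming column names.
    terms = ["age at diagnosis", "sex", "cancer type"]
    columns = ["AGE_AT_DIAGNOSIS", "Cancer_Type", "zzzz"]
    for column, (term, score) in map_to_dictionary(columns, terms).items():
        print(f"{column!r} -> {term!r} (score={score:.2f})")
```

Returning `None` for low-scoring attributes keeps a human in the loop: those are the cases a review dashboard would surface for manual editing.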