Automated Curation and Harmonization of cBioPortal Clinical Metadata using Sentence Transformers
- Mentors
- Sehyun Oh, Jonathan Davenport, Michele W.
- Organization
- cBioPortal for Cancer Genomics
- Technologies
- python
- Topics
- machine learning, bioinformatics
Omics data repositories often contain heterogeneous data from multiple studies
and diverse sources. This lack of structure in the metadata is challenging
for the development of new algorithms and application of machine learning
or deep learning to cross-study datasets. Some work has already been done
under this project: manual review of the metadata schema, consolidation of
similar or identical information spread across the schema, and incorporation
of ontologies where possible. This manual harmonization of cBioPortal’s key
clinical metadata across the whole data repository, not just at the study
level, together with the incorporation of ontology terms, has improved the
AI/ML-readiness of the cBioPortal data.
We want to take the current work further by harmonizing new and incoming data into the format of the established data dictionary in an automated fashion, with minimal manual curation. For this purpose, we will explore advanced natural language processing techniques, particularly sentence transformers, to automate the curation of clinical metadata within the cBioPortal platform. Additionally, we will create an interactive dashboard for visualizing and potentially editing the automated harmonization results, enhancing user accessibility and control over the curation process.
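
As a rough illustration of the matching step described above, the sketch below maps an incoming attribute name to its nearest data dictionary term by cosine similarity over embeddings. It is a minimal sketch under stated assumptions, not the project's implementation: the dictionary terms, the function names (`embed`, `cosine`, `harmonize`), and the threshold value are all hypothetical, and a simple character-trigram encoder stands in for a sentence-transformer model so that the example runs without external dependencies.

```python
import math
from collections import Counter

# Hypothetical slice of the curated data dictionary (illustrative names only).
DICTIONARY_TERMS = ["age at diagnosis", "sex", "overall survival status", "cancer type"]


def embed(text):
    """Stand-in encoder: bag of character trigrams.

    A sentence-transformer model would replace this function; the rest of
    the matching logic only needs one vector per string.
    """
    padded = f"  {text.lower().replace('_', ' ')} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def harmonize(column_name, terms=DICTIONARY_TERMS, threshold=0.3):
    """Return (best_term, score); best_term is None when the score is below
    the (assumed) threshold, flagging the column for manual review."""
    query = embed(column_name)
    best_term, best_score = max(
        ((term, cosine(query, embed(term))) for term in terms),
        key=lambda pair: pair[1],
    )
    return (best_term if best_score >= threshold else None), best_score


if __name__ == "__main__":
    for name in ["AGE_AT_DIAGNOSIS", "OS_STATUS", "xyzzy"]:
        print(name, "->", harmonize(name))
```

Returning `None` for low-scoring names keeps uncertain matches out of the automated path, so they could instead be surfaced for review, for example in the proposed dashboard. The threshold here is arbitrary and would need tuning against curated examples.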