Contributor
Malay Joshi

Extract important information from scientific papers


Mentors
MagdalenaZ
Organization
Genome Assembly and Annotation
Technologies
python, nlp
Topics
NER
During GSoC 2021, a Named Entity Recognition (NER) system based on BioBERT and RegEx/string matching was developed to recognize and extract the following data classes: Mutation, Gene, Gene-Var combo, Strains, Variants, Variation type, and Functional effect. The system still fails to recognize many entities of these predefined classes, because the model was trained on a dataset with little training data in natural-language form and the RegEx/string-matching rules are not general enough. In addition, because of a weak entity normalization approach, many extracted entities are dropped at the final output stage of the pipeline. This proposal aims to improve the entity detection capabilities of the NER system in three ways. First, by integrating additional RegEx/string-matching rules into the current pipeline. Second, by combining other training datasets with the existing IDP4 dataset and then extending the combined dataset using active learning to capture more data classes in natural-language form. Third, by re-training the current BioBERT model using a modified approach and making the entity standardization approach more general and scalable. Beyond improving recognition of the existing data classes, the proposal also aims to extend the NER system to extract data classes related to CRISPR-Cas9 experiments.
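To illustrate the kind of RegEx rule and entity normalization discussed above, here is a minimal sketch in Python. It is not the project's actual code: the pattern, the `extract_mutations` function, and the choice of one-letter notation as the normalized form are all assumptions made for illustration. It covers only simple amino acid substitutions written as reference residue, position, substituted residue.

```python
import re

# Three-letter to one-letter amino acid codes, used to normalize mentions.
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

AA1 = "".join(sorted(AA3_TO_1.values()))  # one-letter alphabet
AA3 = "|".join(AA3_TO_1)                  # alternation of three-letter codes

# Hypothetical rule: <ref residue><position><alt residue>, in either
# one-letter form (D614G) or three-letter form (Asp614Gly).
MUTATION_RE = re.compile(
    rf"\b(?:(?P<ref1>[{AA1}])(?P<pos1>\d+)(?P<alt1>[{AA1}])"
    rf"|(?P<ref3>{AA3})(?P<pos3>\d+)(?P<alt3>{AA3}))\b"
)


def extract_mutations(text):
    """Return (surface form, normalized form) pairs found in text."""
    results = []
    for m in MUTATION_RE.finditer(text):
        if m.group("ref1"):
            # Already in one-letter notation.
            norm = f"{m.group('ref1')}{m.group('pos1')}{m.group('alt1')}"
        else:
            # Map three-letter codes to one-letter codes.
            ref = AA3_TO_1[m.group("ref3")]
            alt = AA3_TO_1[m.group("alt3")]
            norm = f"{ref}{m.group('pos3')}{alt}"
        results.append((m.group(0), norm))
    return results


if __name__ == "__main__":
    print(extract_mutations("The D614G and Asp614Gly substitutions in spike"))
```

Normalizing both spellings to the same string means two mentions of one mutation can be merged at the output stage rather than treated as unrelated entities. A rule this narrow misses deletions, insertions, and free-text descriptions, which is where a trained model such as BioBERT would be expected to help.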