Reducing FOSSology's false-positive copyrights
- Mentors
- Gaurav Mishra, Anupam Ghosh, HastagAB, Kaushlendra
- Organization
- FOSSology
- Technologies
- Python, spaCy, regex
- Topics
- machine learning, natural language processing, Named Entity Recognition, Text Categorization
This project’s goal is to improve the accuracy of FOSSology’s copyright
detection system using machine learning. This functionality was originally
implemented in 2021 to improve on the two-step process of copyright
detection used by most copyright detection software: regex matching
followed by human review.
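To make the regex step of that two-step process concrete, here is a minimal sketch of a first pass that flags candidate lines. The pattern, the `find_candidates` helper, and the sample lines are all hypothetical and are not FOSSology's actual rules; they only illustrate why a keyword-style match produces false positives that a human then has to clear.

```python
import re

# Hypothetical first-pass pattern (NOT FOSSology's real regex):
# flag any line that mentions a copyright marker.
COPYRIGHT_PATTERN = re.compile(r"copyright|\(c\)|©", re.IGNORECASE)


def find_candidates(lines):
    """Return the lines the regex flags as possible copyright statements."""
    return [line for line in lines if COPYRIGHT_PATTERN.search(line)]


# Made-up sample input: one real statement, one false positive, one miss.
sample = [
    "Copyright (C) 2021 Jane Doe <jane@example.com>",
    "This function copies the copyright header into the output file.",
    "int main(void) { return 0; }",
]

candidates = find_candidates(sample)
for line in candidates:
    print(line)
```

Both of the first two sample lines are flagged, although only the first is a real copyright statement. The second is the kind of false positive that would otherwise be left for manual review.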
The machine learning approach uses Named Entity Recognition (NER) and
Part of Speech (POS) tagging to determine which statements are genuine
copyright statements and which are not.
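As a rough illustration of how NER and POS information could separate the two cases, the sketch below applies a toy rule to tokens that have already been tagged. It is an assumption for illustration only, not the project's trained model: the `looks_like_copyright` function, the rule itself, and the hand-tagged examples are hypothetical. The tags follow spaCy's naming (entity labels such as `PERSON`, `ORG`, `DATE`; POS tags such as `VERB`), and with spaCy the triples could be built from `token.text`, `token.pos_`, and `token.ent_type_`.

```python
from typing import List, Tuple

# Each token is (text, pos_tag, entity_label). The tags here are written
# by hand; in practice they would come from an NLP pipeline such as spaCy.
Token = Tuple[str, str, str]

# Entity labels that could name a copyright holder (assumed for this sketch).
HOLDER_LABELS = {"PERSON", "ORG"}


def looks_like_copyright(tokens: List[Token]) -> bool:
    """Toy rule: a holder entity is present and no verb is present."""
    has_holder = any(ent in HOLDER_LABELS for _, _, ent in tokens)
    # A verb suggests ordinary prose that merely mentions "copyright".
    has_verb = any(pos == "VERB" for _, pos, _ in tokens)
    return has_holder and not has_verb


# Hand-tagged example of a genuine statement.
real_statement = [
    ("Copyright", "NOUN", ""),
    ("2021", "NUM", "DATE"),
    ("Jane", "PROPN", "PERSON"),
    ("Doe", "PROPN", "PERSON"),
]

# Hand-tagged example of a regex false positive.
false_positive = [
    ("This", "DET", ""),
    ("function", "NOUN", ""),
    ("copies", "VERB", ""),
    ("the", "DET", ""),
    ("copyright", "NOUN", ""),
    ("header", "NOUN", ""),
]

print(looks_like_copyright(real_statement))
print(looks_like_copyright(false_positive))
```

A fixed rule like this breaks quickly on real data (for example, "All rights reserved" contains a verb), which is one reason to learn the decision from labelled examples instead.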
I’ll be working on improving all parts of the project, starting with the
dataset and preprocessing, then moving on to the NER hypothesis, the
machine learning model, and the final integration.