Contributor
Abdelrahman Jamal

Reducing FOSSology's false positive copyrights


Mentors
Gaurav Mishra, Anupam Ghosh, HastagAB, Kaushlendra
Organization
FOSSology
Technologies
Python, spaCy, Regex
Topics
machine learning, natural language processing, Named Entity Recognition, Text Categorization
This project’s goal is to improve the accuracy of FOSSology’s copyright detection using machine learning. The functionality was originally implemented in 2021 to improve on the two-step process used by most copyright detection software: a Regex pass followed by human review. The machine learning approach uses Named Entity Recognition (NER) and Part of Speech (POS) tagging to determine which statements contain a copyright and which do not. I’ll be working on improving all parts of the project, starting with the dataset and preprocessing, then moving on to the NER hypothesis, the machine learning model used, and the final integration.
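
To illustrate why a Regex-only first pass produces false positives, here is a minimal sketch in Python. The pattern, the sample text, and the helper names (`find_candidates`, `looks_like_statement`) are all hypothetical and are not taken from FOSSology's code. The year check in the second step is only a crude stand-in for the NER/POS model described above, which would look for entities such as dates and copyright holders rather than a fixed pattern.

```python
import re

# Hypothetical first-pass pattern (an assumption, not FOSSology's regex):
# flag any line mentioning "copyright" or "(c)", case-insensitively.
CANDIDATE_RE = re.compile(r"copyright|\(c\)", re.IGNORECASE)

# Crude stand-in for the ML step (an assumption): a real statement usually
# names a four-digit year. The actual project uses NER and POS tagging.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def find_candidates(text):
    """Return the lines a naive regex pass would flag as copyright statements."""
    return [line.strip() for line in text.splitlines() if CANDIDATE_RE.search(line)]


def looks_like_statement(line):
    """Placeholder filter standing in for the NER/POS-based classifier."""
    return bool(YEAR_RE.search(line))


SAMPLE = """\
Copyright (c) 2021 Jane Doe
if (copyright_holder == NULL) return;
This function copies the buffer.
"""

# Step 1: the regex flags both the real statement and a line of source code.
candidates = find_candidates(SAMPLE)

# Step 2: the filter keeps only the candidate that resembles a real statement.
confirmed = [line for line in candidates if looks_like_statement(line)]

print(candidates)
print(confirmed)
```

In this sketch the second line of the sample is a false positive: it matches the keyword but is ordinary source code. Traditionally a human reviewer discards such lines; the project aims to have a trained model do that filtering instead.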