Contributor
Abdelrahman Jamal

AI-Powered Software License Identification


Mentors
Anupam Ghosh, HastagAB, Kaushlendra, Vasudev, SinghShreya
Organization
FOSSology
Technologies
Python, PyTorch, pandas, Transformers (Hugging Face)
Topics
machine learning, natural language processing, Large Language Models, Open-Source Licenses, Copyrights
One of FOSSology's primary features is extracting licenses and license text from files. Traditional methods such as text comparison, regular expressions, and SPDX identifiers can produce false positives that often require human review. This project aims to leverage recent advances in Large Language Models (LLMs), which can process text at a near-human level or beyond. In the domain of copyrights and licenses, models like Gemini-Pro, ChatGPT-3.5, and GPT-4 have demonstrated exceptional accuracy in recognizing licenses and converting them into formats like SPDX. Research across various fields, including medicine and coding, indicates that smaller, domain-specific LLMs can outperform larger, general-purpose models in their respective areas.

For this project, we'll fine-tune a relatively small LLM, in the 2-7 billion parameter range, to optimize performance in license identification and other applications. Candidate models include Gemma (2B or 7B), Mistral (7B), LLaMA-2 (7B), Phi-2 (2.7B), and others. The work involves selecting the best-suited model, using existing datasets for fine-tuning, experimenting with different fine-tuning techniques depending on the available data and computational resources, and finally deploying the refined model.
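To make the false-positive problem with pattern-based detection concrete, here is a minimal Python sketch. It is an illustration only and not FOSSology's actual scanner: the regular expression, the `find_spdx_ids` helper, and both sample strings are assumptions made up for this example.

```python
import re

# Hypothetical pattern: captures the identifier that follows an SPDX tag,
# e.g. "SPDX-License-Identifier: MIT". Not taken from FOSSology's code.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")


def find_spdx_ids(text):
    """Return every identifier that appears after an SPDX tag in `text`."""
    return SPDX_TAG.findall(text)


# A genuine license declaration in a source file header.
header = "// SPDX-License-Identifier: GPL-2.0-only\nint main(void) { return 0; }\n"

# Prose that merely mentions a tag; it does not license the file as MIT.
readme = 'To relicense, replace the line "SPDX-License-Identifier: MIT" in each file.\n'

print(find_spdx_ids(header))  # ['GPL-2.0-only']
print(find_spdx_ids(readme))  # ['MIT']  (a false positive)
```

The pattern cannot tell a declaration from a sentence that talks about one, so the second result would need human review. Distinguishing these cases from context is the kind of judgment the fine-tuned model is meant to supply.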