AI-Powered Software License Identification
- Mentors
- Anupam Ghosh, HastagAB, Kaushlendra, Vasudev, SinghShreya
- Organization
- FOSSology
- Technologies
- Python, PyTorch, pandas, Transformers (Hugging Face)
- Topics
- machine learning, natural language processing, Large Language Models, Open-Source Licenses, Copyrights
One of FOSSology's primary features is extracting licenses and license text from files. Traditional methods such as text comparison, regular expressions, and SPDX identifier matching can produce false positives, which often require human review.
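To make the false-positive problem concrete, here is a minimal Python sketch of two of the traditional techniques named above. The `SPDX_TAG` pattern follows the common `SPDX-License-Identifier:` tag convention but, as a simplification, captures only a single identifier rather than a full SPDX expression. `GPL_KEYWORD`, the helper names, and the sample strings are hypothetical illustrations, not FOSSology's actual rules.

```python
import re

# Matches the short-form tag, e.g. "SPDX-License-Identifier: MIT".
# Simplification: captures one identifier, not compound expressions.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+\-]+)")

# Hypothetical keyword rule, for illustration only.
GPL_KEYWORD = re.compile(r"\bGPL\b")


def find_spdx_ids(text: str) -> list[str]:
    """Return every identifier declared through an SPDX tag."""
    return SPDX_TAG.findall(text)


def mentions_gpl(text: str) -> bool:
    """Naive keyword check: fires whenever the word 'GPL' appears."""
    return bool(GPL_KEYWORD.search(text))


# Hypothetical inputs.
tagged_source = "// SPDX-License-Identifier: Apache-2.0\nint main() { return 0; }"
negated_comment = "# This module is NOT under the GPL; see LICENSE for the MIT terms."

tagged_ids = find_spdx_ids(tagged_source)          # explicit tag is found
false_positive = mentions_gpl(negated_comment)     # keyword fires despite "NOT"
untagged_ids = find_spdx_ids(negated_comment)      # no tag, so nothing is found
```

The keyword rule reports a GPL hit on a comment that explicitly denies GPL licensing, and the tag matcher finds nothing in untagged text. Both cases are the kind of result that ends up in front of a human reviewer.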
This project aims to leverage recent advances in Large Language Models (LLMs), which can process text at a near-human level or beyond. In the domain of copyrights and licenses, models such as Gemini Pro, ChatGPT-3.5, and GPT-4 have demonstrated exceptional accuracy in recognizing licenses and converting them into formats like SPDX.
Research across various fields, including medicine and coding, indicates that smaller, domain-specific LLMs can outperform larger, general-purpose models in their respective areas.
For this project, we'll fine-tune a relatively small LLM, in the 2-7 billion parameter range, to perform well at license identification and related applications. Candidate models include Gemma (2B or 7B), Mistral (7B), LLaMA-2 (7B), and Phi-2 (2.7B), among others.
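As a rough guide to what the 2-7 billion parameter range means for hardware, the sketch below estimates the memory needed just to hold a model's weights at different numeric precisions. It uses the nominal parameter counts from the model names rather than exact published counts, and the function and table names are illustrative. Fine-tuning needs additional memory for gradients, optimizer state, and activations, so these figures are a lower bound only.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal sizes taken from the model names (assumption: not exact counts).
NOMINAL_PARAMS = {
    "Gemma 2B": 2e9,
    "Phi-2 2.7B": 2.7e9,
    "Gemma 7B": 7e9,
    "Mistral 7B": 7e9,
    "LLaMA-2 7B": 7e9,
}


def weight_memory_gib(num_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS.items():
        row = ", ".join(
            f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name:12s} {row}")
```

A nominal 7B model needs about 13 GiB for fp16 weights alone, which is one reason the choice of fine-tuning technique depends on the available compute.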
This project involves selecting the best-suited model, utilizing existing datasets for fine-tuning, experimenting with different fine-tuning techniques based on data and computational resources, and finally, deploying the refined model.
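One way to approach the model-selection step is to score each candidate on the same held-out set of labelled license texts and keep the most accurate one. The sketch below assumes each candidate is wrapped as a function from text to an SPDX identifier. The two predictors and the three-example held-out set are toy stand-ins invented for illustration; they are not real models or real evaluation data.

```python
from typing import Callable

Predictor = Callable[[str], str]      # license text -> predicted SPDX identifier
Example = tuple[str, str]             # (license text, expected SPDX identifier)


def accuracy(predict: Predictor, examples: list[Example]) -> float:
    """Fraction of examples whose predicted identifier matches exactly."""
    if not examples:
        raise ValueError("need at least one labelled example")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def pick_best(
    candidates: dict[str, Predictor], examples: list[Example]
) -> tuple[str, dict[str, float]]:
    """Score every candidate on the same examples and return the top name."""
    scores = {name: accuracy(fn, examples) for name, fn in candidates.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores


# Toy stand-ins for fine-tuned models (hypothetical).
def always_mit(text: str) -> str:
    return "MIT"


def keyword_baseline(text: str) -> str:
    if "Apache License" in text:
        return "Apache-2.0"
    if "GNU GENERAL PUBLIC LICENSE" in text:
        return "GPL-3.0-only"
    return "MIT"


# Hypothetical held-out set.
held_out = [
    ("Permission is hereby granted, free of charge, to any person", "MIT"),
    ("Licensed under the Apache License, Version 2.0", "Apache-2.0"),
    ("GNU GENERAL PUBLIC LICENSE Version 3", "GPL-3.0-only"),
]

best_name, all_scores = pick_best(
    {"always_mit": always_mit, "keyword_baseline": keyword_baseline}, held_out
)
```

In practice, each predictor would call a fine-tuned checkpoint. Exact-match accuracy is only one possible metric, and per-license precision and recall may be more informative when the label distribution is skewed.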