Improve ScanCode license detection accuracy by leveraging the ClearlyDefined dataset of scans
- Mentors
- Philippe Ombredanne, Arnav Mandal
- Organization
- AboutCode.org
ScanCode license detection combines multiple techniques to detect licenses accurately: automatons, inverted indexes, and multiple sequence alignments. Even so, detection is not always accurate enough. The goal of this project is to improve the accuracy of license detection by leveraging the ClearlyDefined dataset, where ScanCode is used to scan millions of packages.
The cases where this project proposes to improve license detection accuracy include, but are not limited to:
- when multiple licenses are detected with a low score and some of the detections are incorrect.
- when unknown licenses are not detected correctly.
- when text or code identical to a license tag is matched, resulting in false positives.
- when license references such as "see license in file LICENSE.txt" are reported as unknown license references.
This project aims to write tools and create models that analyze the accuracy of license detection at scale and identify areas where accuracy could be improved. These tools and models would be reusable to assist in the semi-automated review of scan results. The project will also create new license detection rules semi-automatically to fix the detected anomalies.
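As a rough illustration of what such an analysis tool might look like, the sketch below flags some of the cases listed above in a batch of scan records and counts how often each one occurs. It is only a sketch under stated assumptions: the record layout (`matches`, `license_expression`, `score`) is a simplified, hypothetical shape rather than the actual ScanCode or ClearlyDefined output format, the threshold `LOW_SCORE` is an arbitrary value chosen for the example, and the helper names `flag_file` and `summarize` are invented here. The license keys `unknown` and `unknown-license-reference` are assumed to follow ScanCode's naming.

```python
from collections import Counter

# Assumed threshold for a "low" match score; not a ScanCode default.
LOW_SCORE = 80.0


def flag_file(file_record, low_score=LOW_SCORE):
    """Return a sorted list of anomaly labels for one scanned file.

    `file_record` is a hypothetical, simplified dict with a "matches" list;
    each match has a "license_expression" string and a numeric "score".
    """
    matches = file_record.get("matches", [])
    flags = set()

    # Several low-score matches in one file: some are likely incorrect.
    low = [m for m in matches if m["score"] < low_score]
    if len(low) > 1:
        flags.add("multiple-low-score")

    for m in matches:
        expr = m["license_expression"]
        if expr == "unknown-license-reference":
            # e.g. "see license in file LICENSE.txt"
            flags.add("unknown-reference")
        elif "unknown" in expr:
            flags.add("unknown-license")

    return sorted(flags)


def summarize(file_records):
    """Count how many files carry each anomaly label."""
    counts = Counter()
    for file_record in file_records:
        counts.update(flag_file(file_record))
    return counts


# Made-up sample records for illustration only.
sample = [
    {"path": "a.c", "matches": [
        {"license_expression": "mit", "score": 45.0},
        {"license_expression": "bsd-new", "score": 30.0},
    ]},
    {"path": "README", "matches": [
        {"license_expression": "unknown-license-reference", "score": 100.0},
    ]},
    {"path": "LICENSE", "matches": [
        {"license_expression": "apache-2.0", "score": 100.0},
    ]},
]

print(summarize(sample))
```

Files that receive a label would be candidates for semi-automated review, and frequent labels could point at areas where new detection rules are worth drafting. A real tool would need to read the actual scan output and would probably rely on more than a fixed score threshold.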