The Great Library of Source Code
Software Heritage is an archival project for source code and its development history. Its long-term mission is to collect, preserve, and share our entire Software Commons, that is, the body of knowledge expressed as publicly available software source code. Software Heritage archives both source code and the associated development history, as captured by modern version control systems and package managers. The data model is a Merkle DAG, in which all source code artifacts (file contents, directories, commits, etc.) are thoroughly deduplicated, reducing storage requirements. The Software Heritage archive is already the largest of its kind, having archived more than 9 billion unique source code files and 2 billion unique commits from more than 150 million software projects. The archive periodically crawls forges like GitHub and GitLab.com, distributions like Debian, and package managers like PyPI or NPM. The archive is accessible via a Web UI as well as a Web API. It serves a variety of use cases, ranging from preservation of our cultural heritage for posterity to scientific research on "big code" analysis, and from business needs of tracking software provenance to educational purposes in computer science curricula.
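To illustrate how a Merkle DAG deduplicates artifacts, here is a minimal sketch of a content-addressed store. It is a toy model under stated assumptions, not Software Heritage's actual schema or identifier scheme: the class `MerkleStore`, its method names, the serialization format, and the choice of SHA-256 are all hypothetical and chosen only for illustration.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical; not the real archive schema)."""

    def __init__(self):
        # Maps object id (hex digest) -> serialized payload.
        self.objects = {}

    def _put(self, kind, payload):
        # The id depends only on the object's kind and content, so an
        # identical object is stored once no matter how often it is added.
        digest = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.objects.setdefault(digest, payload)
        return digest

    def add_file(self, data):
        return self._put("content", data)

    def add_directory(self, entries):
        # entries maps entry name -> child object id. Sorting makes the
        # serialization, and therefore the id, independent of insertion order.
        lines = [f"{name}\t{child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())

    def add_commit(self, root_dir, parents, message):
        # A commit points at a root directory and at its parent commits.
        lines = [f"tree {root_dir}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self._put("commit", "\n".join(lines).encode())


store = MerkleStore()
readme = store.add_file(b"hello\n")
license_id = store.add_file(b"MIT\n")

# The same tree seen in two different projects yields the same id.
dir_a = store.add_directory({"README": readme, "LICENSE": license_id})
dir_b = store.add_directory({"LICENSE": license_id, "README": readme})
assert dir_a == dir_b

commit = store.add_commit(dir_a, [], "initial import")
print(len(store.objects))  # 4: two files, one directory, one commit
```

Because each node's identifier is derived from its content and the identifiers of its children, a file or directory shared by many projects is stored once and referenced from every place it appears.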
Software Heritage 2021 Projects
Advanced search features for Software Heritage Archive
Software Heritage is on a mission to collect, preserve, and share all the publicly available software with its source code and development history....
Software Heritage Code Scanner Improvements for Production Environments
Software Heritage has the biggest open archive of publicly available source code; it captures software projects from various forges and all of...