The Great Library of Source Code
Software Heritage is an archival project for source code and its development history. Its long-term mission is to collect, preserve, and share our entire Software Commons, that is, the body of knowledge expressed as publicly available software source code.
Software Heritage archives both source code and the associated development history, as captured by modern version control systems. The data model is a Merkle DAG in which all source code artifacts (file contents, directories, commits, etc.) are thoroughly deduplicated, reducing storage requirements.
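To illustrate how a Merkle DAG deduplicates artifacts, here is a minimal Python sketch. It is a toy model, not the archive's implementation: the `MerkleStore` class, its method names, the serialization format, and the use of SHA-256 are all assumptions made for this example and do not reproduce Software Heritage's actual identifier scheme or storage layout. The idea it shows is that every object is keyed by a hash of its own content plus the identifiers of its children, so identical files, directories, and histories are stored only once.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps an object's hex digest to its serialized bytes.
        self.objects = {}

    def _put(self, kind, payload):
        # The identifier depends only on the object's kind and content,
        # so storing the same object twice yields the same key.
        data = kind.encode() + b"\0" + payload
        key = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(key, data)
        return key

    def add_file(self, content):
        return self._put("content", content)

    def add_directory(self, entries):
        # entries maps a name to the identifier of a file or directory.
        # Sorting makes the identifier independent of insertion order.
        payload = b"".join(
            name.encode() + b"\0" + child.encode() + b"\n"
            for name, child in sorted(entries.items())
        )
        return self._put("directory", payload)

    def add_commit(self, directory, parents, message):
        # A commit points at one root directory and zero or more parents.
        lines = ["tree " + directory] + ["parent " + p for p in parents]
        payload = ("\n".join(lines) + "\n\n" + message).encode()
        return self._put("commit", payload)


store = MerkleStore()
readme = store.add_file(b"hello\n")
licence = store.add_file(b"MIT\n")
root = store.add_directory({"README": readme, "LICENSE": licence})
first = store.add_commit(root, [], "initial import")

# A second project with identical files maps to the same identifiers,
# so nothing new is stored.
same_root = store.add_directory(
    {"LICENSE": store.add_file(b"MIT\n"), "README": store.add_file(b"hello\n")}
)

# A fork that changes only README adds three objects (file, directory,
# commit) and reuses the existing LICENSE object.
new_readme = store.add_file(b"hello, world\n")
fork_root = store.add_directory({"README": new_readme, "LICENSE": licence})
fork = store.add_commit(fork_root, [first], "update README")

print(same_root == root, len(store.objects))  # prints: True 7
```

In this sketch, the more projects share files and history (forks and vendored copies, for example), the fewer new objects each additional project contributes to the store.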
The Software Heritage archive is already the largest of its kind, having archived more than 5 billion unique source code files and more than 1 billion unique commits from more than 80 million software projects. The archive periodically crawls forges like GitHub and GitLab.com, distributions like Debian, and package managers like PyPI. It is accessible via a Web UI as well as a Web API.
The archive serves a variety of use cases, ranging from preservation of our cultural heritage for posterity to scientific research on "big code" analysis, and from business needs of tracking software provenance to educational purposes in computer science curricula.
Software Heritage 2019 Projects
Graph compression on the development history of software
Software Heritage is an ambitious research project whose goal is to collect, preserve in the very long term, and share the whole publicly accessible...
Increase archive coverage
As Software Heritage works on archiving and sharing source code, one of the major tasks is to ingest the latest source code...
Software Heritage - Web UI Improvements - kalpitk
Improve the Web UI of the archive. Software Heritage can be accessed through a beautiful and rich Web UI, developed in Django. Since the web portal is...