The Great Library of Source Code

Software Heritage is an archival project for source code and its development history. Its long-term mission is to collect, preserve, and share our entire Software Commons, that is, the body of knowledge expressed as publicly available software source code. Software Heritage archives both source code and the associated development history, as captured by modern version control systems and package managers. The data model is a Merkle DAG in which all source code artifacts (file contents, directories, commits, etc.) are thoroughly deduplicated, reducing storage requirements. The Software Heritage archive is already the largest of its kind, having archived more than 9 billion unique source code files and 2 billion unique commits from more than 150 million software projects. The archive periodically crawls forges like GitHub and GitLab.com, distributions like Debian, and package managers like PyPI or NPM. The archive is accessible via a Web UI as well as a Web API. It serves a range of use cases: preserving our cultural heritage for posterity, scientific research on "big code" analysis, business needs such as tracking software provenance, and educational purposes in computer science curricula.
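To illustrate how a Merkle DAG deduplicates artifacts, here is a minimal sketch in Python. It is not Software Heritage's actual implementation: the `MerkleStore` class, its method names, the serialization format, and the choice of SHA-256 are all assumptions made for illustration. The idea it shows is that every object is identified by a hash of its content, and directories and commits refer to their children by those identifiers, so identical files or directories are stored only once no matter how many projects contain them.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps object id -> (kind, payload). Identical objects share one entry.
        self.objects = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # The id depends only on the object's kind and content.
        digest = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.objects.setdefault(digest, (kind, payload))
        return digest

    def add_file(self, data: bytes) -> str:
        return self._put("file", data)

    def add_directory(self, entries: dict) -> str:
        # entries maps a name to the id of a file or sub-directory.
        # Sorting makes the id independent of insertion order.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("dir", "\n".join(lines).encode())

    def add_commit(self, root_dir: str, parents: list, message: str) -> str:
        lines = [f"tree {root_dir}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self._put("commit", "\n".join(lines).encode())


# Two unrelated projects that ship the same license file.
store = MerkleStore()
license_text = b"Example license text\n"

license_a = store.add_file(license_text)
dir_a = store.add_directory({
    "LICENSE": license_a,
    "main.py": store.add_file(b"print('a')\n"),
})
commit_a = store.add_commit(dir_a, [], "initial commit of project a")

license_b = store.add_file(license_text)
dir_b = store.add_directory({
    "LICENSE": license_b,
    "main.py": store.add_file(b"print('b')\n"),
})
commit_b = store.add_commit(dir_b, [], "initial commit of project b")

# The shared license is stored once: 3 files + 2 directories + 2 commits.
print(license_a == license_b, len(store.objects))
```

Because a directory's identifier is derived from its children's identifiers, any change to a file changes the identifiers of every directory and commit above it, while untouched subtrees keep their identifiers and are reused as-is.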


Technologies

  • python
  • postgresql
  • django
  • git
  • elasticsearch


Software Heritage 2021 Projects

  • Kumar Shivendu
    Advanced search features for Software Heritage Archive
    Software Heritage is on a mission to collect, preserve, and share all the publicly available software with its source code and development history....
  • danseraf
    Software Heritage Code Scanner Improvements for Production Environments
    Software Heritage has the biggest open archive of publicly available source code; it captures software projects from various forges and all of...