Software Heritage is on a mission to collect, preserve, and share all publicly available software in source code form, together with its development history. The archive periodically crawls GitHub, GitLab, Debian, PyPI, and other sources. It has preserved more than 11 billion unique source code files and 2.3 billion commits, covering more than 165 million software projects.

The archive has a search feature to find repositories based on the repository URL or the metadata. This metadata includes the package name, description, license, etc.

TLDR:

I made this search more expressive by adding advanced search features: filters, sorting options, and a search query language (DSL) with optional autocompletion.

Overview:

  • Introduced new fields in the search service (based on Elasticsearch), ingested data from other Software Heritage (swh) services through their RPC APIs or the journal service (Kafka), and built the filtering and sorting features on top of those fields.
  • Designed a grammar and built a parser and a translator (with Tree-sitter). The translator traverses the AST to turn custom query language (DSL) queries into Elasticsearch queries.
  • Implemented autocompletion for the query language in the Web UI, using a WebAssembly (wasm) build of the same parser.
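
To illustrate the translation step, here is a minimal Python sketch of walking a small filter AST and emitting an Elasticsearch query body. It is not the project's actual translator: the `Filter` and `BoolExpr` node types, the operator set, and the field names `origin` and `visits` are assumptions made for this example, and the AST is built by hand rather than produced by the Tree-sitter parser. Only the output shapes (`bool`, `match`, `term`, `range`) come from the standard Elasticsearch query DSL.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class Filter:
    """Hypothetical leaf node: <field> <op> <value>."""
    field: str
    op: str  # one of ":", "=", ">", ">=", "<", "<="
    value: Any


@dataclass
class BoolExpr:
    """Hypothetical inner node joining two sub-expressions."""
    op: str  # "and" or "or"
    left: "Node"
    right: "Node"


Node = Union[Filter, BoolExpr]

# Map comparison operators to Elasticsearch range keywords.
RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def to_es(node: Node) -> Dict[str, Any]:
    """Recursively translate an AST node into an Elasticsearch query dict."""
    if isinstance(node, BoolExpr):
        children = [to_es(node.left), to_es(node.right)]
        if node.op == "and":
            return {"bool": {"must": children}}
        if node.op == "or":
            return {"bool": {"should": children, "minimum_should_match": 1}}
        raise ValueError(f"unknown boolean operator: {node.op!r}")

    if node.op in RANGE_OPS:
        return {"range": {node.field: {RANGE_OPS[node.op]: node.value}}}
    if node.op == "=":
        # Exact match on the stored value.
        return {"term": {node.field: node.value}}
    if node.op == ":":
        # Full-text match on an analyzed field.
        return {"match": {node.field: node.value}}
    raise ValueError(f"unknown filter operator: {node.op!r}")


# Hand-built AST for a query along the lines of: origin : django and visits > 5
ast = BoolExpr("and", Filter("origin", ":", "django"), Filter("visits", ">", 5))
query = to_es(ast)
```

Here `query` is a `bool` query whose `must` list holds a `match` clause on `origin` and a `range` clause on `visits`. Keeping the translator as a pure function over AST nodes means the same grammar can drive both the backend translation and, separately, editor features such as autocompletion.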

Organization

Student

Kumar Shivendu

Mentors

  • Valentin Lorentz
  • Vincent Sellier

2021