Atarashi currently has four active agent types that use nine different algorithms to measure the similarity between license statements. An evaluation of all the agents and their variants shows that both their speed and their accuracy need improvement. I’ll implement basic Lucene-style indexing and percolate search on top of the different agents to speed them up without substantially changing their original algorithms.
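As a rough illustration of the indexing idea, the sketch below builds a small inverted index over license texts and uses it to shortlist candidate licenses before a slower similarity agent runs. It is a minimal sketch under stated assumptions, not Atarashi's implementation and not full percolation: the `LicenseIndex` class, its methods, and the shortened license texts are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LicenseIndex:
    """Hypothetical inverted index: token -> names of licenses containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, name, text):
        # Record each distinct token of the license text once.
        for token in set(tokenize(text)):
            self.postings[token].add(name)

    def candidates(self, text, top_k=3):
        """Return up to top_k license names sharing the most tokens with text."""
        hits = Counter()
        for token in set(tokenize(text)):
            for name in self.postings.get(token, ()):
                hits[name] += 1
        return [name for name, _ in hits.most_common(top_k)]


# Shortened, illustrative license snippets (not the full license texts).
index = LicenseIndex()
index.add("MIT", "Permission is hereby granted free of charge to any person obtaining a copy")
index.add("GPL", "This program is free software you can redistribute it and/or modify it "
                 "under the terms of the GNU General Public License")
index.add("Apache", "Licensed under the Apache License Version 2.0")

shortlist = index.candidates("Permission is hereby granted, free of charge", top_k=2)
```

Only the shortlisted licenses would then be passed to the agent's original algorithm, so the expensive comparison runs on a few candidates instead of the whole license list.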

A Python library for extracting code comments needs to be created, with additional features to support both Atarashi and FOSSology. It will be packaged and published to PyPI for easy installation and use.
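To give a sense of what comment extraction involves, here is a minimal sketch that handles Python source only, using the standard library's `tokenize` module. The function name `extract_comments` is a hypothetical illustration, not the planned library's API, and a real extractor would need to cover many more languages.

```python
import io
import tokenize


def extract_comments(source):
    """Return (line_number, comment_text) pairs for each comment in Python source."""
    comments = []
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        # COMMENT tokens exclude '#' characters that appear inside strings.
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.string.lstrip("#").strip()))
    return comments


sample = (
    "# SPDX-License-Identifier: MIT\n"
    "x = 1  # inline note\n"
    "s = '# not a comment'\n"
)
found = extract_comments(sample)
```

Using a real tokenizer rather than a regular expression avoids mistaking a `#` inside a string literal for a comment, which matters when the extracted text is fed to a license scanner.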

Implementing any machine learning or deep learning model requires a dataset of SPDX open-source licenses. No such dataset is available online, so we need to create our own to fit our requirements.

I’ll also work on improving the semanticTextSim agent by adding a new component to it. It currently has a doc2vec implementation, and I plan to introduce BERT embeddings for measuring semantic text similarity.
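Whichever model produces the embeddings, a common way to compare two of them is cosine similarity. The sketch below shows only that comparison step, assuming the embeddings are already available as plain lists of floats. The toy vectors are made up for illustration and do not come from doc2vec or BERT, and `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def most_similar(query_vec, license_vecs):
    """Return the license name whose vector is closest to query_vec."""
    return max(license_vecs, key=lambda name: cosine_similarity(query_vec, license_vecs[name]))


# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
license_vecs = {"MIT": [1.0, 0.0], "GPL-3.0-only": [0.0, 1.0]}
best = most_similar([0.9, 0.1], license_vecs)
```

Because the comparison depends only on the vectors, the embedding model can be swapped without changing this step.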





  • Anupam Ghosh
  • Gaurav Mishra
  • Aman Jain