Contributor: Fabian Billert

EDGAR-crawler 2.0: Enhancing Information Extraction of Company Reports

Mentors: Lefteris Loukas, Ion Androutsopoulos
Organization: Open Technologies Alliance - GFOSS
Technologies: Python, Transformers, Regex, Gradio
Topics: natural language processing, fintech, Large Language Models, Information Extraction

This project aims to expand the information extraction capabilities of EDGAR-crawler, a tool that lets users download the different types of company reports published on EDGAR, the platform managed by the SEC. More specifically, the project will add support for the 10-Q and 8-K report types. It will also explore the possibility of using large language models to automatically generate regular expressions, which would support the current workflow in two ways. First, when an already supported extraction mechanism encounters a report with structural errors, an automatic regex generator could fix the problem on the fly. Second, this technology would allow many more report types to be added to the crawler far more quickly than would be possible by hand.
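The fallback idea described above can be illustrated with a minimal Python sketch. It is not taken from the EDGAR-crawler code base: the default pattern, the sample report text, and the names `try_pattern`, `find_heading`, and `fake_generator` are all hypothetical, and the language model call is replaced by a stub that returns fixed candidate patterns. The sketch only shows the control flow: try the existing regex first and, if it fails, validate the generated candidates before trusting any of them.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical default pattern for an "Item 2." heading in a 10-Q report.
DEFAULT_PATTERN = r"^\s*Item\s+2\.\s+Management"


def try_pattern(pattern: str, text: str) -> Optional[re.Match]:
    """Apply a pattern to the text; an invalid regex counts as a miss."""
    try:
        return re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    except re.error:
        return None


def find_heading(
    text: str,
    default_pattern: str,
    generate_candidates: Callable[[str], Iterable[str]],
) -> Optional[re.Match]:
    """Try the default pattern, then fall back to generated candidates."""
    match = try_pattern(default_pattern, text)
    if match:
        return match
    # Fallback: ask the generator for alternatives and keep the first
    # candidate that both compiles and matches.
    for candidate in generate_candidates(text):
        match = try_pattern(candidate, text)
        if match:
            return match
    return None


def fake_generator(text: str) -> Iterable[str]:
    """Stand-in for a language model call (assumption, not a real API)."""
    return [
        "Item 2 (",  # deliberately invalid regex, should be skipped
        r"^\s*Item\s*2\s*[-.:]?\s*Management",
    ]


# Sample text with a malformed heading ("ITEM 2 -" instead of "Item 2.").
report = "PART I\nITEM 2 - Management's Discussion and Analysis\nRevenue grew."
match = find_heading(report, DEFAULT_PATTERN, fake_generator)
print(match.group(0) if match else "no match")  # prints "ITEM 2 - Management"
```

Checking each generated pattern before use matters because a model may return expressions that do not compile or do not match the report at all. A real implementation would likely need stricter checks, for example confirming that the extracted section has a plausible length.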