Contributor
Harshit Joshi

Extracting data from PDF invoices and bills for financial accounting


Mentors
Thomas Levine, Manuel Riel, Pieter Willem Moerenhout
Organization
Debian Project

This project aims to develop a complete workflow for handling bills: discovering them (in a directory, in a mail folder, or via a browser plugin that extracts them from web pages), storing them (in a document management system, folder or Git repository), extracting the relevant data (bill date, currency and amount) and saving that data (in a format such as cXML) in the same document management system. A GUI window may be needed to help the tool 'learn' how to read a PDF: it would remember where each data field sits in the PDF and automatically extract the same fields the next time it sees a bill from the same vendor.
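
The extraction step could be sketched roughly as follows. This is only an illustration, not the project's implementation: it assumes the PDF has already been converted to plain text by some other tool, and it assumes the 'learned' knowledge about a vendor can be stored as a set of regular expressions. The vendor name, the template layout and the names TEMPLATES, find_vendor and extract_fields are all hypothetical.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical per-vendor templates: what the tool might "remember" after
# a user has shown it once where the fields are on this vendor's bills.
TEMPLATES = {
    "Example Hosting Ltd": {
        "date": r"Invoice date:\s*(\d{4})-(\d{2})-(\d{2})",
        "total": r"Total:\s*([A-Z]{3})\s*([0-9]+\.[0-9]{2})",
    },
}


def find_vendor(text):
    """Return the first known vendor whose name appears in the text."""
    for vendor in TEMPLATES:
        if vendor in text:
            return vendor
    return None


def extract_fields(text):
    """Extract date, currency and amount using the vendor's template.

    Returns None when the vendor is unknown or a field cannot be found,
    which is the point where a GUI could ask the user for help.
    """
    vendor = find_vendor(text)
    if vendor is None:
        return None
    template = TEMPLATES[vendor]
    date_match = re.search(template["date"], text)
    total_match = re.search(template["total"], text)
    if not date_match or not total_match:
        return None
    year, month, day = (int(part) for part in date_match.groups())
    return {
        "vendor": vendor,
        "date": date(year, month, day),
        "currency": total_match.group(1),
        # Decimal avoids floating-point rounding in accounting data.
        "amount": Decimal(total_match.group(2)),
    }


# Made-up invoice text standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = """Example Hosting Ltd
Invoice date: 2017-03-14
Total: EUR 42.50
"""

result = extract_fields(SAMPLE_TEXT)
print(result)
```

In this sketch a bill from an unknown vendor simply yields None; in the workflow described above, that would be the case where the user teaches the tool a new template.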