Contributor
kshitijsoni

GSoC 2022 Proposal - Extract text from tables in Scientific Papers by Kshitij Soni


Mentors
MagdalenaZ
Organization
Genome Assembly and Annotation
Technologies
python, sql, biology
Topics
machine learning, genomics, biology, nlp, OCR
PyTesseract is really helpful. The first time I tried it, I used it to detect a short piece of text, and the result was satisfying. When I then used it to detect text in a table, however, it failed to perform. This project aims to develop machine learning algorithms (preferably CNN and CNNA architectures) to identify tables and figures in PDFs, extract the text from those tables and figures, and finally format it in a standard manner.
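As a rough illustration of the last step only (formatting extracted text in a standard manner), the sketch below groups OCR word boxes into table rows and writes them out as CSV. It assumes that detection and OCR have already happened and that each word arrives as a dictionary with `left`, `top`, and `text` keys, similar to the fields `pytesseract.image_to_data` reports. The function names, the `row_tolerance` parameter, and the sample boxes are hypothetical and made up for this example; they are not part of the proposal.

```python
import csv
import io


def group_into_rows(boxes, row_tolerance=10):
    """Group word boxes into rows by vertical position (hypothetical helper)."""
    rows = []
    for box in sorted(boxes, key=lambda b: b["top"]):
        # Keep the box in the current row if its top edge is within
        # row_tolerance pixels of the row's first box; otherwise start a new row.
        if rows and abs(box["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, order the cells left to right and keep only the text.
    return [[b["text"] for b in sorted(row, key=lambda b: b["left"])] for row in rows]


def rows_to_csv(rows):
    """Serialize the rows as CSV text (one possible 'standard' format)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample boxes standing in for OCR output on a 2x2 table.
boxes = [
    {"left": 200, "top": 12, "text": "length"},
    {"left": 10, "top": 10, "text": "gene"},
    {"left": 10, "top": 40, "text": "geneA"},
    {"left": 200, "top": 42, "text": "1200"},
]

table = group_into_rows(boxes)
csv_text = rows_to_csv(table)
print(csv_text)
```

Grouping by a fixed pixel tolerance is a deliberately simple heuristic. It would likely break on skewed scans, multi-line cells, or merged cells, which is part of why a learned table-structure model is the preferred direction here.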