Spiders can break when the target site changes its page layout, which invalidates the XPath and CSS extractors written for the old layout. Often, however, the information on a page stays the same and only its form or layout changes. This project would use snapshotted versions of a target page, together with the data previously extracted from that page, to automatically infer rules for scraping the new layout. “Scrapely” is an existing tool that might be instrumental in this project.
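
To make the idea concrete, here is a minimal sketch of one way such rule inference could work, using only the Python standard library. It is not Scrapely's API and not a proposed design. The function names (`infer_rules`, `apply_rules`), the sample pages, and the rule format (a path of tag/class pairs) are all hypothetical. It assumes the same item can be re-fetched under the new layout and that each previously extracted value appears verbatim as a single text node.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextParser(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of (tag, class) steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append((tuple(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def infer_rules(new_layout_html, known_data):
    """Map each field to the first path where its known value appears."""
    rules = {}
    for path, text in text_nodes(new_layout_html):
        for field, value in known_data.items():
            if text == value and field not in rules:
                rules[field] = path
    return rules


def apply_rules(html, rules):
    """Extract the first text node found at each field's inferred path."""
    extracted = {}
    for path, text in text_nodes(html):
        for field, rule_path in rules.items():
            if path == rule_path and field not in extracted:
                extracted[field] = text
    return extracted


# Hypothetical data: values the old extractors produced from the snapshot.
known_data = {"title": "Blue Widget", "price": "9.99"}
# The same item under the new layout, and a different item under that layout.
new_page_a = '<main><h2 class="name">Blue Widget</h2><p class="cost">9.99</p></main>'
new_page_b = '<main><h2 class="name">Red Widget</h2><p class="cost">4.50</p></main>'

rules = infer_rules(new_page_a, known_data)
print(apply_rules(new_page_b, rules))
```

Exact text matching is the weakest part of this sketch: values that are reformatted, split across elements, or repeated on the page would need fuzzier matching, which is the kind of problem a tool like Scrapely is aimed at.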


Viral Mehta


  • asadurski
  • Cathal Garvey