Apache Nutch currently has the concept of a WebGraph, which builds web graphs, performs a stable, convergent link analysis, and updates the crawldb with the resulting scores. The main purpose of building a new Graph Generator tool for Nutch is to create a substantiated ‘deep’ graph that enables true traversal, which could change how Nutch crawl data is interpreted. This will involve storing the crawl data as RDF datasets in the form of serialized N-Quads statements. The resulting graph can then be used to execute queries on the crawled webpages. Graph generation will be achieved using the Apache TinkerPop ScriptInputFormat and ScriptOutputFormat.
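
To make the N-Quads idea concrete, here is a minimal sketch of how page-to-page links might be written out as statements. It is not the project's implementation: the `to_nquad` and `serialize_outlinks` helpers, the `linksTo` predicate, the graph name, and the sample URLs are all hypothetical, and the code assumes every value is already a valid IRI.

```python
def to_nquad(subject: str, predicate: str, obj: str, graph: str) -> str:
    """Format four IRIs as one N-Quads statement: <s> <p> <o> <g> ."""
    return f"<{subject}> <{predicate}> <{obj}> <{graph}> ."


def serialize_outlinks(outlinks: dict, predicate: str, graph: str) -> list:
    """Emit one statement per (source page, outlink) pair."""
    return [
        to_nquad(source, predicate, target, graph)
        for source, targets in outlinks.items()
        for target in targets
    ]


# Hypothetical crawl data: each page maps to the pages it links to.
outlinks = {
    "http://example.org/": [
        "http://example.org/about",
        "http://example.org/news",
    ],
}

# Hypothetical vocabulary term and named graph, chosen only for illustration.
LINKS_TO = "http://example.org/vocab#linksTo"
CRAWL_GRAPH = "http://example.org/crawl/2017"

statements = serialize_outlinks(outlinks, LINKS_TO, CRAWL_GRAPH)
print("\n".join(statements))
```

Each output line is a self-contained statement, so a dataset in this form can be split and processed line by line.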

Student

Omkar Reddy Gojala

Mentors

  • Lewis McGibbney

2017