Apache Nutch currently has the concept of a WebGraph, which builds web graphs, performs a stable convergent link analysis, and updates the crawldb with the resulting scores. The main purpose of building a new Graph Generator tool for Nutch is to create a substantiated ‘deep’ graph that enables true traversal, which could substantially change how Nutch crawl data is interpreted. This will involve storing the crawl data as RDF datasets in the form of serialized N-Quads statements. The resulting graph can then be used to execute queries over the crawled web pages. Graph generation will be achieved using Apache TinkerPop’s ScriptInputFormat and ScriptOutputFormat.
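
To make the N-Quads storage idea concrete, here is a minimal sketch of how a single outlink might be written as one statement. It is illustrative only and is not part of Nutch or TinkerPop: the function name `to_nquad`, the predicate IRI `http://example.org/nutch#linksTo`, and the graph IRI `http://example.org/crawl/1` are all assumptions made for this example, and the sketch handles IRI terms only (no literals, blank nodes, or escaping).

```python
def to_nquad(subject, predicate, obj, graph):
    """Serialize one IRI-only statement as a single N-Quads line.

    An N-Quads statement is: subject predicate object graph-label, ending in ' .'
    Hypothetical helper for illustration; it performs no IRI validation.
    """
    terms = (subject, predicate, obj, graph)
    return " ".join(f"<{term}>" for term in terms) + " ."


# Assumed vocabulary and graph label, invented for this example.
LINKS_TO = "http://example.org/nutch#linksTo"
CRAWL_GRAPH = "http://example.org/crawl/1"

line = to_nquad(
    "http://nutch.apache.org/",
    LINKS_TO,
    "http://tinkerpop.apache.org/",
    CRAWL_GRAPH,
)
print(line)
```

The fourth term names the graph a statement belongs to, so under this assumption each crawl could be kept as its own named graph and queried separately.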


Omkar Reddy Gojala


  • Lewis McGibbney