Contributor
Omkar Reddy Gojala

NUTCH-2369 Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph.


Mentors
Lewis McGibbney
Organization
Apache Software Foundation

Apache Nutch currently has the concept of a WebGraph, which builds web graphs, performs a stable, convergent link analysis, and updates the crawldb with the resulting scores. The main purpose of the new Graph Generator tool is to create a substantiated ‘deep’ graph that enables true traversal, which could change how Nutch crawl data is interpreted. This will involve storing the crawl data as RDF datasets in the form of serialized N-Quads statements. The resulting graph can then be queried to answer questions about the crawled web pages. Graph generation will be achieved using the Apache TinkerPop ScriptInputFormat and ScriptOutputFormat classes.
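
As a rough illustration of what "serialized N-Quads statements" could look like for crawl data, the sketch below turns one page's outlinks into N-Quads lines (subject, predicate, object, graph label, terminated by a period). It is not the tool's implementation: the predicate IRI `http://example.org/nutch#outlink`, the graph IRI, and the function names are all hypothetical, chosen only for this example, and the proposal itself does not specify a vocabulary.

```python
# Minimal sketch: serialize a page's outlinks as N-Quads lines.
# All names and IRIs below are assumptions made for illustration.

# Hypothetical predicate; the proposal does not define a vocabulary.
OUTLINK = "http://example.org/nutch#outlink"

# Characters that may not appear unescaped inside an N-Quads IRI reference.
_FORBIDDEN_IN_IRI = set('<>"{}|^`\\')


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, rejecting characters N-Quads forbids."""
    if any(ch in _FORBIDDEN_IN_IRI or ord(ch) <= 0x20 for ch in value):
        raise ValueError(f"not usable as an N-Quads IRI: {value!r}")
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslash, quote, and newlines."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def quad(subject: str, predicate: str, obj: str, graph: str) -> str:
    """Join four already-serialized terms into one N-Quads statement."""
    return f"{subject} {predicate} {obj} {graph} ."


def outlinks_to_nquads(page_url, outlinks, graph_url):
    """Emit one statement per outlink, all labeled with the same graph IRI."""
    return [
        quad(iri(page_url), iri(OUTLINK), iri(target), iri(graph_url))
        for target in outlinks
    ]


if __name__ == "__main__":
    for line in outlinks_to_nquads(
        "http://a.example/",
        ["http://b.example/", "http://c.example/page"],
        "http://example.org/crawl/1",  # hypothetical graph label for one crawl
    ):
        print(line)
```

Using the fourth term as a per-crawl graph label is one possible design choice: it would let statements from different crawls live in the same dataset while remaining distinguishable at query time.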