Contributor
Krishnan R

Big Data Tools for Physics Analysis


Mentors
Danilo Piparo, Prasanth Kothuri, Enric Tejedor, Kacper Surdy
Organization
CERN-HSF

Jupyter Notebook is an interactive computing environment for creating notebooks that combine code, output, plots, widgets and explanatory text. It offers a convenient platform for interactive data analysis, scientific computing and rapid prototyping of code. Apache Spark is a framework for large-scale cluster computing in Big Data contexts, and a powerful tool for complex, computation-intensive tasks. This project aims to leverage these existing big data tools in an interactive scientific analysis environment. Currently, Spark jobs can be launched from a Jupyter notebook using the PySpark module. However, to find out what is happening to a running job, the user has to connect separately to the Spark server. This project aims to develop a plugin that monitors jobs sent from a notebook application, from within the notebook itself. The plugin will have features to monitor tasks, stop ongoing jobs and detect errors. ROOT is a data analysis framework widely used in the scientific community at CERN. The plugin will be used to monitor the processing of ROOT objects with Spark, as well as distributed machine learning tasks.
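
To make the three planned features (monitoring tasks, stopping jobs, detecting errors) more concrete, here is a minimal sketch of the per-job state a notebook-side monitor might keep. It is an illustration under stated assumptions, not the plugin's design: the `JobState` and `JobMonitor` classes, the event names (`job_start`, `task_end`, `job_end`) and their fields are all hypothetical, and the events are fed in by hand rather than received from Spark.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Progress of one Spark job as seen from the notebook (hypothetical model)."""
    job_id: int
    total_tasks: int
    completed_tasks: int = 0
    failed_tasks: int = 0
    status: str = "RUNNING"  # RUNNING, SUCCEEDED, FAILED or CANCELLED
    errors: list = field(default_factory=list)


class JobMonitor:
    """Aggregates job and task events into per-job state.

    The event dictionaries used here are an assumption made for
    illustration; they are not the format of any real Spark API.
    """

    def __init__(self):
        self.jobs = {}

    def handle(self, event):
        kind = event["event"]
        job_id = event["job_id"]
        if kind == "job_start":
            self.jobs[job_id] = JobState(job_id, event["total_tasks"])
            return
        job = self.jobs.get(job_id)
        if job is None:
            return  # ignore events for jobs whose start was never seen
        if kind == "task_end":
            if event.get("error"):
                # Error detection: count the failure and keep the message.
                job.failed_tasks += 1
                job.errors.append(event["error"])
            else:
                job.completed_tasks += 1
        elif kind == "job_end":
            job.status = "SUCCEEDED" if event["succeeded"] else "FAILED"

    def cancel(self, job_id):
        """Mark a running job as cancelled.

        This only updates local state; a real plugin would also have to
        ask Spark to stop the job.
        """
        job = self.jobs.get(job_id)
        if job is not None and job.status == "RUNNING":
            job.status = "CANCELLED"

    def progress(self, job_id):
        """Fraction of the job's tasks that completed successfully."""
        job = self.jobs[job_id]
        return job.completed_tasks / job.total_tasks if job.total_tasks else 0.0


# Hand-written events standing in for what a running cluster would report.
monitor = JobMonitor()
for event in [
    {"event": "job_start", "job_id": 0, "total_tasks": 4},
    {"event": "task_end", "job_id": 0},
    {"event": "task_end", "job_id": 0},
    {"event": "task_end", "job_id": 0, "error": "executor lost"},
    {"event": "job_start", "job_id": 1, "total_tasks": 2},
    {"event": "task_end", "job_id": 1},
    {"event": "task_end", "job_id": 1},
    {"event": "job_end", "job_id": 1, "succeeded": True},
]:
    monitor.handle(event)

monitor.cancel(0)  # the user stops job 0 from the notebook
```

Keeping the state in plain Python objects would let a notebook front end render progress bars or error messages from it. How the events actually reach the notebook is left open here. On the stopping side, PySpark does provide `SparkContext.cancelJobGroup`, which could be one way to implement it.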