Distributed Big Data Analysis with TDataFrame
- Mentors
- Enrico Guiraud, Enric Tejedor Saavedra, Diogo Castro, Prasanth Kothuri, Danilo Piparo, Javier Cervantes
- Organization
- CERN-HSF
The main objective of this project is to make it easier for researchers and developers to submit distributed jobs that analyze datasets using TDataFrame from the ROOT library together with a distributed computing framework such as Apache Spark. The project proposes a Python library with tidy abstractions for performing distributed analysis and for selecting an appropriate distributed environment (for example, Apache Spark).
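To make the idea of a backend-selecting abstraction concrete, here is a minimal sketch in plain Python. It is not the project's actual API: every name in it (`split_ranges`, `LocalBackend`, `ThreadBackend`, `make_backend`, `distributed_sum`) is hypothetical, and a thread pool stands in for a real cluster so that the example runs without ROOT or Spark installed. It only illustrates the general pattern of splitting a dataset into entry ranges, mapping an analysis function over each range, and reducing the partial results through an interchangeable backend.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_ranges(n_entries, n_partitions):
    """Split [0, n_entries) into contiguous (start, end) ranges."""
    n_partitions = max(1, min(n_partitions, n_entries))
    step, extra = divmod(n_entries, n_partitions)
    ranges, start = [], 0
    for i in range(n_partitions):
        # The first `extra` ranges take one additional entry each.
        end = start + step + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


class LocalBackend:
    """Hypothetical backend: runs every range sequentially in-process."""

    def map_reduce(self, mapper, reducer, ranges):
        return reduce(reducer, map(mapper, ranges))


class ThreadBackend:
    """Hypothetical backend: a thread pool standing in for a cluster.

    A Spark-based backend could expose the same method and delegate to
    something like sc.parallelize(ranges).map(mapper).reduce(reducer);
    that variant is an assumption and is not shown here.
    """

    def __init__(self, workers=2):
        self.workers = workers

    def map_reduce(self, mapper, reducer, ranges):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return reduce(reducer, pool.map(mapper, ranges))


# Registry used to pick a distributed environment by name.
BACKENDS = {"local": LocalBackend, "threads": ThreadBackend}


def make_backend(name, **options):
    """Instantiate the backend registered under `name`."""
    return BACKENDS[name](**options)


def distributed_sum(values, backend, n_partitions=4):
    """Sum `values` by mapping over entry ranges and reducing the partials."""
    ranges = split_ranges(len(values), n_partitions)

    def mapper(entry_range):
        start, end = entry_range
        return sum(values[start:end])

    return backend.map_reduce(mapper, lambda a, b: a + b, ranges)
```

With this sketch, switching environments is a one-argument change: `distributed_sum(data, make_backend("local"))` and `distributed_sum(data, make_backend("threads", workers=4))` run the same analysis function on different backends.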
Jupyter notebooks have also become popular for numerical and graphical analysis tasks, so a new Jupyter extension will be implemented as part of this project. The extension gives users a graphical interface for selecting the parameters used to launch a distributed job. It also lets users select the notebook cells from which the analysis functions for the datasets are constructed.