Leverage Spark Connect for interactive data analysis in Jupyter Notebooks
- Mentors
- Diogo Castro, Enric Tejedor, Luca Canali
- Organization
- CERN-HSF
- Technologies
- Python, TypeScript, Spark, JupyterLab
- Topics
- web, Data Engineering
CERN uses a service called SWAN (Service for Web Analysis) to perform analyses on scientific data, which is built on top of Jupyter Notebook. Currently, a notebook kernel connects to a Spark cluster through SWAN's own open-source SparkConnector extension. Due to a current Spark limitation, multiple notebooks cannot share the same set of Spark resources. In addition, spawning the Spark resources can take a while, which is inconvenient for users.
An effort is currently underway to adopt Spark Connect, a client-server architecture that allows multiple notebooks to connect to a previously instantiated Spark session and submit computations to it.
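In Spark Connect's client-server model, clients address the server with a `sc://` connection string (optional parameters such as an auth token are appended as `/;key=value` pairs). A minimal sketch of building such a string; the helper name, default hostname, and defaults are assumptions for illustration:

```python
def spark_connect_url(host, port=15002, token=None):
    """Build a Spark Connect connection string (hypothetical helper).

    Spark Connect endpoints are addressed as sc://host:port; 15002 is the
    server's default port. Optional parameters such as a bearer token are
    appended as /;key=value pairs.
    """
    url = f"sc://{host}:{port}"
    if token:
        url += f"/;token={token}"
    return url

# A notebook would then connect with PySpark (requires pyspark >= 3.4 with
# the connect extras installed; not run here):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(
#       spark_connect_url("spark-cluster.example.cern.ch")  # hypothetical host
#   ).getOrCreate()
```

Because the notebook only holds a lightweight gRPC client, several notebooks pointing at the same connection string share the server-side session and its resources.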
To make allocating Spark resources easier for users, this project proposes the development of a JupyterLab extension. The extension shall provide a friendly interface that allows users to instantiate one or more Spark sessions to which notebooks can connect, configure the proper credentials and authentication, and make the connection accessible to the notebook code. The connection and session will persist across multiple kernels and kernel restarts.