Spark3D: Extend Apache Spark to support 3D Spatial Datasets
- Mentors: Christian Arnault, Julien Peloton
- Organization: CERN-HSF
High Energy Physics and Astrophysics experiments generate large amounts of 3D data, and processing it efficiently requires state-of-the-art tools. Projects such as SpatialHadoop and GeoSpark have already done a lot of work on processing 2D data, but very few frameworks handle 3D data. The idea is to follow in the footsteps of GeoSpark and provide a way to load, process and analyse 3D data sets economically and efficiently by leveraging the distributed computation capabilities of Spark. Spark3D would provide a set of out-of-the-box 3D Spatial RDDs (3D SRDDs) to partition the data across machines. Ultimately, Spark3D would be available as an open-source library that works with all recent versions of Spark (2.0+), has user-friendly APIs (in Scala, Java and Python), works out of the box on top of all major storage platforms (HDFS, S3, Cassandra, etc.) and supports all major file formats (CSV, Parquet, JSON, Avro, etc.), including popular scientific file formats such as FITS.
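To make the idea of partitioning 3D data across machines more concrete, here is a minimal sketch in plain Python of one possible scheme: splitting a bounding box into eight octants and grouping points by the octant they fall in. This is an illustration only, not Spark3D's API or its actual partitioning algorithm; the names `octant_index` and `partition_points`, the bounding box and the sample points are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]
Box3D = Tuple[Point3D, Point3D]  # (min corner, max corner)


def octant_index(point: Point3D, box: Box3D) -> int:
    """Return the octant (0-7) of `box` that contains `point`.

    Bit 0 is set when x lies in the upper half of the box,
    bit 1 for y, and bit 2 for z.
    """
    lo, hi = box
    index = 0
    for axis in range(3):
        mid = (lo[axis] + hi[axis]) / 2.0
        if point[axis] >= mid:
            index |= 1 << axis
    return index


def partition_points(points: List[Point3D], box: Box3D) -> Dict[int, List[Point3D]]:
    """Group points by octant, as a stand-in for a spatial partitioner."""
    partitions: Dict[int, List[Point3D]] = defaultdict(list)
    for p in points:
        partitions[octant_index(p, box)].append(p)
    return dict(partitions)


# Hypothetical data: a 10 x 10 x 10 box and five sample points.
box = ((0.0, 0.0, 0.0), (10.0, 10.0, 10.0))
points = [
    (1.0, 1.0, 1.0),
    (9.0, 1.0, 1.0),
    (1.0, 9.0, 9.0),
    (9.0, 9.0, 9.0),
    (2.0, 2.0, 2.0),
]
partitions = partition_points(points, box)
```

In a Spark setting, an index of this kind could plausibly serve as the key of a pair RDD, so that nearby points land in the same partition. A real implementation would also need to handle skewed data, for example by subdividing dense octants recursively.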