Contributor
Zahra Nabila

ScaleBugs: Reproducible Scalability Bugs


Mentors
Haryadi Gunawi, Cindy Rubio-González
Organization
UC OSPO
Technologies
Java, Docker, Python, Linux shell
Topics
reproducibility, scalable systems, bug patterns, bug dataset
High-performance computing (HPC) data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. These are bugs that depend on the scale of a run, so their symptoms may be observable only in large-scale deployments and not in small or medium ones. A symptom may not appear until roughly 1,000 nodes are deployed, which makes scalability bugs challenging to reproduce and fix. The goal of this project is to build a dataset of reproducible scalability bugs. Reproducing these bugs requires studying each one in detail to understand its root cause, behavior, and impact on system performance, and then crafting workloads that trigger the relevant functionality of the system under different configurations (e.g., different numbers of nodes). A dataset of reproducible bugs can help in developing better testing and validation strategies for scalable systems, ensuring that they work as expected at different scales. It also provides a resource for researchers and developers to study and address scalability bugs, contributing to the reliability and efficiency of scalable distributed systems.
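The idea of running one workload under different numbers of nodes and watching for a scale-dependent symptom can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: `toy_membership_update`, the quadratic cost model, the operation `budget`, and the chosen scales are all hypothetical stand-ins for a real system under test and a real symptom such as a timeout.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    nodes: int        # cluster size used for this run
    operations: int   # work performed by the toy workload
    symptom: bool     # True if the work exceeded the budget


def toy_membership_update(nodes: int) -> int:
    """Hypothetical stand-in for a system under test.

    Every node re-checks every other node, so the total work
    grows quadratically with the cluster size.
    """
    operations = 0
    for _ in range(nodes):
        operations += nodes - 1
    return operations


def sweep(scales, budget):
    """Run the same workload at each scale and record whether the
    (assumed) operation budget was exceeded."""
    results = []
    for n in scales:
        ops = toy_membership_update(n)
        results.append(RunResult(n, ops, ops > budget))
    return results


def first_failing_scale(results):
    """Return the smallest tested scale that shows the symptom, or None."""
    for r in results:
        if r.symptom:
            return r.nodes
    return None


if __name__ == "__main__":
    # Hypothetical scales and budget, chosen only for illustration.
    for result in sweep([10, 100, 1000], budget=500_000):
        print(result)
```

In this toy model the 10-node and 100-node runs stay well under the budget and only the 1,000-node run exceeds it, mirroring how a scalability bug can stay hidden in small and medium deployments. In a real reproduction the cost model would be replaced by an actual deployment (for example, containers on a test cluster) and a measured symptom.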