Contributor
Goodness Ay

ScaleBugs: Reproducible Scalability Bugs


Mentors
Haryadi Gunawi, Cindy Rubio-González
Organization
UC OSPO
Technologies
Python, Java, Docker, Linux shell
Topics
distributed systems, debugging, reproducibility, scalability
Scalable systems lay the foundation of the modern information industry; HPC data centers commonly run clusters of hundreds to thousands of nodes. The use of such "extreme-scale" distributed systems has given rise to a new class of bug: scalability bugs. As the name suggests, a scalability bug may manifest only at a certain scale of deployment, so its symptoms can be observable in large-scale runs but not in small or medium-scale ones. For example, Cassandra-6127 is a scalability bug in the popular distributed database Cassandra: it causes unnecessary CPU usage, but the symptom is not observed unless roughly 1,000 nodes are deployed. This illustrates the main challenge of studying scalability bugs: they are extremely difficult to reproduce without deploying the system at large scale.

The goal of this project is to build a dataset of reproducible scalability bugs. Achieving this will involve analyzing bug reports from popular distributed systems (e.g., Cassandra, HDFS, Ignite, Kafka) and determining whether each reported bug depends on the scale of the run, such as the number of nodes, the size of files, or the number of requests. Identified bugs will be packaged into bug artifacts containing the buggy and fixed versions of the system, a runtime environment that ensures reproducibility, and a workload shell script that triggers the affected system functionality and demonstrates the bug's symptoms under different configurations. For example, a successful reproduction should show the performance drop as the number of nodes increases.
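
To illustrate how such a reproduction might be driven, below is a minimal Python sketch that runs a per-bug workload script at increasing cluster sizes and records the observed symptom metric for each run. The script name (workload.sh), the node counts, and the metric output format are hypothetical placeholders, not part of the project's actual artifacts.

#!/usr/bin/env python3
"""Sketch of a reproduction driver: exercise the bug-triggering workload
at increasing cluster sizes and record a symptom metric per run.
The workload script, node counts, and output format are assumptions."""
import csv
import subprocess

NODE_COUNTS = [4, 16, 64, 256, 1024]   # scales at which to exercise the system
WORKLOAD = "./workload.sh"             # hypothetical per-bug workload script

def run_at_scale(nodes: int) -> float:
    """Launch the workload at the given cluster size and return the
    observed symptom metric (e.g., CPU seconds consumed per node)."""
    out = subprocess.run(
        [WORKLOAD, "--nodes", str(nodes)],
        capture_output=True, text=True, check=True,
    )
    # Assume the workload script prints the metric value on its last line.
    return float(out.stdout.strip().splitlines()[-1])

def main() -> None:
    with open("scaling_results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["nodes", "metric"])
        for n in NODE_COUNTS:
            metric = run_at_scale(n)
            writer.writerow([n, metric])
            print(f"{n} nodes -> {metric}")

if __name__ == "__main__":
    main()

A driver along these lines would make the scale dependence visible directly: plotting the recorded metric against the node count should show the symptom appearing only at the larger scales, which is exactly what a successful reproduction needs to demonstrate.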