CREB project

In large-scale distributed systems, node crashes are inevitable and can happen at any time. Distributed systems are therefore usually designed to tolerate node crashes through crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoff in Cassandra. However, faults in these mechanisms and their implementations can introduce intricate crash recovery bugs and lead to severe consequences.

CREB documents 103 Crash REcovery Bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For each bug, we analyze its root cause, triggering conditions, impact, and fix. CREB can serve as a basis for future work on finding and fixing crash recovery bugs. We have made the collected bugs and our analysis available online for future studies.

Please use the data only for teaching and research purposes. Should you have any questions, please contact Wensheng Dou (wsdou@otcaix.iscas.ac.cn) and Yu Gao (gaoyu15@otcaix.iscas.ac.cn).

Please leave your contact information before downloading.

An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems [PDF] [Slide]
Yu Gao, Wensheng Dou, Feng Qin, Chushu Gao, Dong Wang, Jun Wei, Ruirui Huang, Li Zhou, Yongming Wu
26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018).
