2024 Faculty Courses School of Computing Department of Computer Science Graduate major in Computer Science
Fault Tolerant Distributed Algorithms
- Academic unit or major
- Graduate major in Computer Science
- Instructor(s)
- Francois Pierre Andre Bonnet / Xavier Defago
- Class Format
- Lecture (Face-to-face)
- Media-enhanced courses
- -
- Day of week/Period
(Classrooms) - 7-8 Mon / 7-8 Thu
- Class
- -
- Course Code
- CSC.T527
- Number of credits
- 200
- Course offered
- 2024
- Offered quarter
- 3Q
- Syllabus updated
- Mar 14, 2025
- Language
- English
Syllabus
Course overview and goals
The course aims to develop a thorough understanding of fault-tolerance in distributed systems. Due to their nature, distributed systems are inherently vulnerable to failures if not designed properly. At any time, a subset of the processes in a distributed system may fail by crashing or could be compromised and behave in a treacherous way (e.g., Byzantine failures). It is hence essential to design distributed systems and applications in such a way that they can adequately cope with failures. The lecture will present focus on how to deal with these issues.
Course description and aims
By studying relevant methods and algorithms in details, the student will acquire a deep understanding of the issues at hand and the basic mechanisms to deal with such failures. Although the course will focus on the theory of such systems, it will also systematically draw links with practical applications, making it valuable to both theoreticians and practitioners.
Keywords
Distributed algorithms, message-passing, synchrony models, agreement, replication, fault-tolerance, Byzantine agreement, self-stabilization, blockchain, randomized algorithms
Competencies
- Specialist skills
- Intercultural skills
- Communication skills
- Critical thinking skills
- Practical and/or problem-solving skills
Class flow
Typical classes will alternate between slide-based presentations, interactive discussions, class exercises. Active contribution to class discussions is strongly encouraged.
Course schedule/Objectives
Course schedule | Objectives | |
---|---|---|
Class 1 | Introduction, models & definitions | Revision of basic concepts of distributed algorithms (models, synchrony, causality) |
Class 2 | Synchronous consensus | instructed in class. |
Class 3 | Asynchronous consensus, FLP impossibility proof | instructed in class. |
Class 4 | Consensus in the presence of partial synchrony | instructed in class. |
Class 5 | Asynchronous consensus with unreliable failure detectors | instructed in class. |
Class 6 | Eventual leader election, Paxos | instructed in class. |
Class 7 | Byzantine consensus (I) | instructed in class. |
Class 8 | Byzantine consensus (II) | instructed in class. |
Class 9 | Randomized consensus | instructed in class. |
Class 10 | State-machine replication | instructed in class. |
Class 11 | Distributed transactions, distributed ledger, blockchain mechanisms | instructed in class. |
Class 12 | Self-stabilization (I) | instructed in class. |
Class 13 | Self-stabilization (II) | instructed in class. |
Class 14 | Q&A + final test | instructed in class. |
Study advice (preparation and review)
To enhance effective learning, students are encouraged to spend approximately 100 minutes preparing for class and another 100 minutes reviewing class content afterwards (including assignments) for each class.
They should do so by referring to textbooks and other course material.
Textbook(s)
Course materials:
Slide copies, additional article copies, ...made available for download from the course webpage.
Reference books, course materials, etc.
Reference Books:
1. Michel Raynal, "Fault-tolerant message-passing distributed systems," Springer, 2018. https://www.springer.com/gp/book/9783319941400
2. Ajay Kshemkalyani, Mukesh Singhal, "Distributed computing: principles, algorithms, and systems," Cambridge Uni. Press, 2011.
3. Wan Fokkink, "Distributed algorithms: an intuitive approach ," MIT Press, 2013.
4. Vijay K. Garg, "Elements of distributed computing," IEEE, 2002.
5. Gerard Tel, "Introduction to distributed algorithms (2nd ed.)," Cambridge Univ. Press, 2000.
6. Shlomi Dolev, "Self-Stabilization," MIT Press, 2000. https://mitpress.mit.edu/books/self-stabilization
7. Karine Altisen, Stéphane Devismes, Swan Dubois, Franck Petit, "Introduction to Distributed Self-Stabilizing Algorithms", Morgan & Claypool, 2019.
Evaluation methods and criteria
Homework assignments and contribution to class discussion, assignments, reports (60%); and examination (40%).
Examination will assess the understanding of basic concepts of fault-tolerant distributed algorithms (problems, algorithms, and methodology) and reasoning (correctness and complexity).
Related courses
- CSC.T438 : Distributed Algorithms
- CSC.T524 : Dependable Computing
- MCS.T213 : Introduction to Algorithms and Data Structures
Prerequisites
Required knowledge:
Prior to taking this course, the student must have previously acquired,
through lectures or self-study, background knowledge on basic concepts
of fault-free distributed algorithms such as taught in the following
course:
- CSC.T438 Distributed algorithm
Other
Related course:
In the field of fault-tolerant and dependable computing systems, this course is complementary with:
- CSC.T524 Dependable Computing
- CSC.T438 Distributed Algorithms