Books are an essential driver of one's growth. Never stop reading! If you're in software architecture and distributed systems, here are 12 books I would like to read myself. Go through the list and tell me if you want to add or replace anything!
1. Distributed Systems (3rd Edition, 2017)
Authors: Maarten van Steen, Andrew S. Tanenbaum
Key Themes: Fundamentals of distributed computing, communication, synchronization, consistency, fault tolerance.
Why Read It: This academic classic provides a robust theoretical foundation. It covers the major challenges in designing large, fault-tolerant systems and examines the underlying mechanisms of modern distributed architectures.
2. Implementing Domain-Driven Design (2013)
Author: Vaughn Vernon
Key Themes: Domain-driven design (DDD), strategic design, bounded contexts, aggregates, ubiquitous language.
Why Read It: A modern deep dive into Eric Evans’ original DDD concepts. Demonstrates practical techniques for modeling complex software systems, which is crucial for scalable architectures—distributed or otherwise.
3. Balancing Coupling in Software Design (2024)
Author: Vlad Khononov
Key Themes: Managing coupling and cohesion in software, understanding dependencies, trade-offs in modularity, decoupling strategies.
Why Read It: Offers a fresh perspective on balancing coupling and cohesion to build maintainable, scalable systems. Practical guidance helps navigate trade-offs in real-world software design, emphasizing long-term adaptability and robustness.
4. Chaos Engineering: System Resiliency in Practice (2020)
Authors: Casey Rosenthal, Nora Jones
Key Themes: Resilience testing, failure injection, risk management, continuous verification of distributed systems.
Why Read It: Explores how intentionally “breaking things” in controlled experiments can reveal weaknesses and help build bulletproof systems. Ideal if you’re interested in resilience and fault tolerance beyond typical high-level theory.
5. Systems Performance: Enterprise and the Cloud (2nd Edition, 2020)
Author: Brendan Gregg
Key Themes: Performance tuning, benchmarking, Linux kernel internals, observability, distributed tracing.
Why Read It: An authoritative resource on diagnosing performance bottlenecks in modern systems. Gregg’s methodologies and instrumentation practices are invaluable when scaling distributed services.
6. Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services (2018)
Author: Brendan Burns
Key Themes: Container-based architectures, cluster scheduling, sharding, service discovery, load balancing.
Why Read It: Provides patterns drawn from real-world usage, particularly around Kubernetes, but goes beyond microservices. Focuses on core distributed concepts—like replication, queue-based load leveling, and orchestrating containerized workloads.
7. Release It! (2nd Edition, 2018)
Author: Michael T. Nygard
Key Themes: Production readiness, circuit breakers, bulkheads, observability, risk management.
Why Read It: A must-read on building resilient applications that can handle real-world failures. Introduces “stability patterns” and “anti-patterns” gleaned from extensive in-the-trenches experience.
8. Enterprise Integration Patterns (2003)
Authors: Gregor Hohpe, Bobby Woolf
Key Themes: Messaging systems, routing, transformations, asynchronous communication, event-driven architecture.
Why Read It: The classic reference for designing robust message-based systems. Although it’s older, the patterns remain integral to modern distributed architectures—especially for event-oriented solutions that don’t revolve solely around microservices.
9. The Art of Scalability (2nd Edition, 2015)
Authors: Martin L. Abbott, Michael T. Fisher
Key Themes: Scaling technology, scaling organizations, the “Scale Cube,” performance architecture.
Why Read It: Addresses both the technical and managerial challenges of operating large systems. Offers a structured approach to analyzing and remediating scale bottlenecks at different layers of the stack.
10. Building Multi-tenant SaaS Architectures: Principles and Best Practices (O’Reilly, 2023/2024)
Author: Tod Golding
Key Themes: Multi-tenant fundamentals (silo vs. pooled models), data isolation and security, cost optimization, automated tenant onboarding, monitoring and observability, scaling strategies, and compliance requirements.
Why Read It: Offers clear guidance on how to design, build, and operate shared-infrastructure SaaS platforms that support multiple customers efficiently. Covers everything from database partitioning and identity management to DevOps workflows and cost management—ensuring a secure, compliant, and scalable multi-tenant environment.
11. Building Secure & Reliable Systems (2020)
Authors: Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield
Key Themes: Security best practices, reliability engineering, risk management, secure-by-design principles.
Why Read It: A follow-up of sorts to Google’s SRE-related works, focusing specifically on designing resilient systems that are also hardened against security threats. Ties in nicely with production-readiness and system reliability.
12. Scalability Rules: 50 Principles for Scaling Web Sites and Applications (2011)
Authors: Martin L. Abbott, Michael T. Fisher
Key Themes: Pragmatic scaling principles, capacity planning, caching strategies, parallelism, concurrency.
Why Read It: Written by the same authors as The Art of Scalability, it offers actionable, distilled guidelines in a format that’s easy to reference. Each “rule” serves as a best-practice template for tackling growth challenges.
Let me know if you would like to add anything here!
System Design Course
Looking to advance your system design skills further? I've got a Business Oriented System Design Course to help you! Cohort #3 is running now, so you can sign up for the next one, which starts at the end of January. Details are on this page: https://vvsevolodovich.dev/business-oriented-system-design-course/