Architecture Weekly #16

Architecture Weekly Issue #16. Articles, books, and playlists on architecture and related topics. Every record has a complexity indicator: 🀟 means hardcore, 👷‍♂️ is technically applicable right away, 🍼 is an introduction to the topic or an overview. Now on Telegram as well.

WARNING 🇺🇦

It has already been two and a half months of the crazy, inhuman, unjustified war of Russia against Ukraine. We condemn this war and want it to stop ASAP. We continue this newsletter so you can advance your skills and help the millions of Ukrainian people in any way possible.

Fault Injection for Reliability Testing 👷‍♂️

We already shared an article about Chaos Engineering at Netflix. DoorDash goes further and introduces a tool that can analyze microservices and inject expected failures in order to test the output and measure resiliency. Details inside.

Using Fault Injection Testing to Improve DoorDash Reliability
When failure is inevitable, building fault tolerance with fault injection testing ensures that failures do not bring the platform down with them
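
To make the idea concrete, here is a minimal sketch of fault injection in Python. It is not DoorDash's tool: `with_fault_injection`, `fetch_menu`, and `get_menu_with_fallback` are hypothetical names invented for illustration. The wrapper makes a configurable share of calls to a dependency fail, so a test can check that the caller degrades gracefully.

```python
import random


class InjectedFault(ConnectionError):
    """Raised in place of a real downstream failure."""


def with_fault_injection(call, failure_rate, rng):
    """Wrap `call` so that a fraction of invocations fail on purpose."""
    def wrapped(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so failure_rate=1.0 always fails
        # and failure_rate=0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_menu(store_id):
    # Hypothetical downstream dependency.
    return {"store_id": store_id, "items": ["pizza", "salad"]}


def get_menu_with_fallback(fetch, store_id):
    """Caller under test: it should degrade gracefully instead of crashing."""
    try:
        return fetch(store_id)
    except ConnectionError:
        return {"store_id": store_id, "items": [], "degraded": True}


# Every call fails, so the fallback path is always exercised.
always_failing = with_fault_injection(
    fetch_menu, failure_rate=1.0, rng=random.Random(0)
)
result = get_menu_with_fallback(always_failing, store_id=42)
```

Passing the random generator in explicitly keeps the experiment reproducible, which matters when an injected failure has to be replayed while debugging.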

Load-balanced Brooklin Mirror Maker 🀟

Having multiple Kafka clusters in multiple data centers can be challenging from the replication perspective. LinkedIn was using Kafka MirrorMaker for it but experienced scaling issues. So they introduced their own open-source mirroring solution: Brooklin. Read below about the difficulties they faced next and how they overcame them.

Load-balanced Brooklin Mirror Maker: Replicating large-scale Kafka clusters at LinkedIn
At LinkedIn, Apache Kafka is used heavily to store all kinds of data, such as member activity, log storage, metrics storage, and a multitude of inter-service messaging. LinkedIn maintains multiple data centers with multiple Kafka clusters per data center, each of which contains an independent set of…

Operation-Based SLOs 🍼

Service Level Objectives are a measure of how a service behaves in terms of latency, availability, etc. This is a good SRE practice. However, in the microservice world a product usually consists of multiple services, so it is hard to connect product metrics to the SLOs of individual microservices. That's why Zalando introduced business SLOs or, as they call them, Operation-Based SLOs. Read further.

Zalando Engineering Blog - Operation-Based SLOs
Zalando developed a new type of SLOs to monitor the critical aspects of its business which is based on Operations. This blog post describes how that framework works, and how it contributes to...
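
As a rough illustration of what an SLO tied to a business operation could look like, here is a small Python sketch. It is not Zalando's framework: the `OperationResult` record, the `place_order` operation name, and the thresholds are assumptions made up for the example. The indicator is computed per operation across all requests, regardless of which microservices served them.

```python
from dataclasses import dataclass


@dataclass
class OperationResult:
    operation: str     # business operation, e.g. "place_order" (hypothetical)
    success: bool      # did the operation succeed from the customer's view?
    latency_ms: float  # end-to-end latency measured at the edge


def operation_sli(results, operation, latency_threshold_ms):
    """Share of requests for `operation` that succeeded within the threshold."""
    relevant = [r for r in results if r.operation == operation]
    if not relevant:
        return None  # no traffic, nothing to measure
    good = sum(
        1 for r in relevant
        if r.success and r.latency_ms <= latency_threshold_ms
    )
    return good / len(relevant)


def meets_slo(sli, target):
    """An SLO is met when the measured indicator reaches the target."""
    return sli is not None and sli >= target


results = [
    OperationResult("place_order", True, 120.0),
    OperationResult("place_order", True, 300.0),
    OperationResult("place_order", False, 90.0),   # failed request
    OperationResult("place_order", True, 150.0),
    OperationResult("view_product", True, 40.0),   # different operation
]
sli = operation_sli(results, "place_order", latency_threshold_ms=500.0)
slo_met = meets_slo(sli, target=0.99)
```

Three of the four `place_order` requests are good, so the indicator is 0.75 and a 99% target is missed.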

Combining Event Driven Architecture with Microservices Style 👷‍♂️

A huge article by IBM from 2020 on the architectural considerations of both styles: complexities, an architecture blueprint, patterns, the technology stack, and deployment practices.

IBM Developer

Fault Tolerance via Idempotence 🀟

A complex white paper that introduces a language for reasoning about process failures and idempotence, and then proves that idempotence leads to correct transactional behavior in a distributed system. A lot of monads inside.

Fault Tolerance via Idempotence - Microsoft Research
Writing applications for distributed systems is challenging because of the pitfalls of distribution such as process failures, communication failures, asynchrony and concurrency. Abstractions such as distributed transactions and workflows address some of these pitfalls, but many challenges remain. On…
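
The paper works with a formal, monadic language, which the sketch below does not attempt to reproduce. It only illustrates the underlying property in plain Python: an operation keyed by a client-supplied identifier can be retried after a lost response without being applied twice. `PaymentService`, `charge`, and `charge_with_retry` are hypothetical names.

```python
class PaymentService:
    """Toy service whose `charge` operation is idempotent per key."""

    def __init__(self, balance):
        self.balance = balance
        self._processed = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, amount):
        # A repeated key means a retry: return the stored result
        # instead of applying the side effect again.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.balance -= amount
        result = {"charged": amount, "balance": self.balance}
        self._processed[idempotency_key] = result
        return result


def charge_with_retry(service, idempotency_key, amount, attempts=3):
    """Simulate a client that never sees the response and keeps retrying."""
    result = None
    for _ in range(attempts):
        result = service.charge(idempotency_key, amount)
    return result


service = PaymentService(balance=100)
outcome = charge_with_retry(service, "order-1", 30)
```

Even though `charge` is called three times, the balance is reduced only once, so a retrying client and a failing network cannot double-charge.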

Books for Great Software Architects 🍼

This series of articles describes an architecture learning path based on great books. It provides a clear sequence of materials and helps to avoid reading the same material twice. It is not finished yet, but it looks good. Also, we recommend adding Learning Domain-Driven Design by Vladik Khononov to the DDD thread.

How Scentbird moved to a new payment service 👷‍♂️

An exciting story written by my pal Andrew Rebrov, the CTO of Scentbird, on how they figured out what they expect from a subscription service, how much it costs to build one, and how to migrate the users. A must-read.

No plan survives contact with the enemy: How we moved to a new payment service
For me, January of this year was not only a series of holidays, but also an occasion to celebrate one year since we completed our transition to a new payment gateway. As a perfume subscription service, this is a critical system, and if it has problems, then this affects everyone

Warp: Lightweight Multi-Key Transactions for Key-Value Stores 🀟

Murat continues to review computer science papers. This time it is about distributed transactions for NoSQL systems. There are interesting decisions around a chained communication pattern and optimistic concurrency control. It looks like 2PC, but the paper's authors report good performance (75% of a non-transactional solution).

Warp: Lightweight Multi-Key Transactions for Key-Value Stores
This paper introduces a simple yet powerful idea to provide efficient multi-key transactions with ACID semantics on top of a sharded NoSQL ...
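
For readers new to optimistic concurrency control, here is a single-process Python sketch of the general idea. It is not Warp's protocol (there is no sharding and no commit chain), and `VersionedStore` and `Transaction` are hypothetical names. A transaction remembers the version of every key it reads, buffers its writes, and commits only if none of those versions has changed in the meantime.

```python
class VersionedStore:
    """In-memory key-value store that keeps a version per key."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, read_versions, writes):
        # Validation: abort if any key read by the transaction has changed.
        for key, version in read_versions.items():
            if self.read(key)[0] != version:
                return False
        # Apply buffered writes, bumping each key's version.
        for key, value in writes.items():
            version, _ = self.read(key)
            self._data[key] = (version + 1, value)
        return True


class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> version seen at first read
        self.writes = {}         # buffered writes, invisible until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        version, value = self.store.read(key)
        self.read_versions.setdefault(key, version)
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.read_versions, self.writes)


store = VersionedStore()
seed = Transaction(store)
seed.put("a", 10)
seed.put("b", 0)
seeded = seed.commit()

t1 = Transaction(store)
t1.put("b", t1.get("a"))   # t1 wants to copy a into b

t2 = Transaction(store)
t2.put("a", t2.get("a") - 5)
t2_committed = t2.commit()  # t2 changes a first

t1_committed = t1.commit()  # t1 read a stale version of a, so it aborts
```

No locks are held while the transactions run. The conflict is detected only at commit time, and the losing transaction is expected to retry.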

SoundCloud Chronicles the End of the Public API Strangler 🍼

A short story about an 8-year-long migration to a fully-fledged Backend For Frontend inside SoundCloud. The Strangler pattern combined with telemetry is a safe approach. But it also brings non-obvious risks: an unhealthy codebase for many years, security concerns, and added complexity for feature development.

SoundCloud Chronicles the End of the Public API Strangler
SoundCloud has successfully completed their migration journey using the Strangler pattern from a monolith application to a fully-fledged BFF.

Delivering Large-Scale Platform Reliability 🍼

The Roblox team describes what they do for reliability. First, it is based on measurements across the whole product lifecycle, from CI to client experience. They also have architectural reviews and aggressive low-latency policies for clients. Note how they pay attention to the SLAs of internal dependencies. The Monthly Reliability Report is a perfect instrument for sharing information with the whole team. All of this looks very open and promising after their history of major downtime.

Delivering Large-Scale Platform Reliability - Roblox Blog
Showcasing a quality-oriented process for achieving higher reliability in microservices acting together to improve the platform.

The newsletter is supported by 5 premium subscribers. It helps to pay for the hosting and mailing services, but doesn't cover them completely, and we still need at least another 5. If you liked the newsletter, please consider supporting it as well by becoming a premium subscriber.

Brought to you by Vladimir @vvsevolodovich Ivanov and Ilya @puzan Zonov