Architecture Weekly Issue #172. Articles, books, and playlists on architecture and related topics. Split by sections, highlighted with complexity: 🤟 means hardcore, 👷‍♂️ is technically applicable right away, 🍼 is an introduction to the topic or an overview. Now on Telegram and Substack as well.

Highlights

How getsupplied.ai adopted Infrastructure-as-Code 👷‍♂️

Going with Infrastructure-as-Code from day 1 is a mistake. But once you see a clear business need, it becomes a blessing. At getsupplied.ai we just migrated to Infrastructure-as-Code and solved several problems at once.

How getsupplied.ai adopted Infrastructure-as-a-Code

#iaac #terraform #casestudy

It's The End Of Observability As We Know It 🍼

Charity Majors delivers another insightful piece in which she demonstrates how, by combining observability data and LLMs, you get more than mere hints: you can analyze entire production problems at a fraction of the cost. The question is: how will this really shape our future?

It's The End Of Observability As We Know It (And I Feel Fine)
The history of observability tools over the past decade has been about a pretty simple concept, but LLMs bring the death of that paradigm.

#observability #ai

Coordinated Progress Series 🤟

Orchestration, Choreography, and Durable Workflows are more relevant today than ever before. Jack Vanlightly wrote an awesome 4-part series that gives you a complete understanding of distributed process execution, and it would be a shame to pass it by. Don't make this mistake!

Coordinated Progress – Part 1 – Seeing the System: The Graph - Jack Vanlightly
At some point, we've all sat in an architecture meeting where someone asks, "Should this be an event? An RPC? A queue?", or "How do we tie this process together across our microservices? Should it be event-driven? Maybe a workflow orchestration?" Cue a flurry of opinions, whiteboard arrows, and

#distributedsystems

Wanna know how to Design Systems that deliver business value?

Business Oriented System Design Course Cohort #6 is officially open!

Looking for a way to advance your career? Feel you have outgrown mere feature development but lack the skills to design complete systems? Want to make a business impact? 10 hours of content-packed lectures, engaging practice, and a final work you will be proud to showcase, plus a Credly (by Pearson) digital certificate proving your experience. More than 70 engineers have already passed the course with amazing feedback and advanced their careers. The new cohort starts on the 23rd of July. Details, feedback, and enrollment are here. Only 3 places left!

Follow-Up

High Availability Load Balancers with Maglev 👷‍♂️

For Cloudflare, load balancing is the backbone of their entire business, which makes them experts in zero downtime. Find out why the Maglev algorithm was their approach of choice.

High Availability Load Balancers with Maglev
We own and operate physical infrastructure for our backend services. We need an effective way to route arbitrary TCP and UDP traffic between services and also from outside these data centers.
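
For readers who have not met Maglev before, here is a minimal sketch of the lookup-table population step as described in Google's Maglev paper. It is an illustration only, not Cloudflare's implementation: the function names, the SHA-256 based hash, and the small default table size are all assumptions made for this example.

```python
import hashlib


def _hash(value: str, seed: str) -> int:
    # Deterministic 64-bit hash. Using SHA-256 here is an assumption for illustration.
    digest = hashlib.sha256((seed + value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_lookup_table(backends, size=251):
    """Fill a Maglev-style lookup table.

    `size` must be a prime number larger than len(backends), so that every
    (offset, skip) pair below walks through all slots exactly once.
    """
    if not backends:
        raise ValueError("at least one backend is required")
    # Each backend gets its own permutation of slots, defined by offset and skip.
    offsets = [_hash(b, "offset") % size for b in backends]
    skips = [_hash(b, "skip") % (size - 1) + 1 for b in backends]
    next_index = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        # Backends take turns, each claiming its next preferred free slot.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_index[i] * skips[i]) % size
            while table[slot] is not None:
                next_index[i] += 1
                slot = (offsets[i] + next_index[i] * skips[i]) % size
            table[slot] = backend
            next_index[i] += 1
            filled += 1
            if filled == size:
                return table


def pick_backend(table, connection_key: str) -> str:
    # The same connection key always lands on the same slot, hence the same backend.
    return table[_hash(connection_key, "conn") % len(table)]
```

Because backends claim slots in turns, each one ends up with an almost equal share of the table, and removing a backend tends to reshuffle only a small part of the remaining slots. Production deployments use a much larger prime table size than the one assumed here.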

#performance #availability

Avoid Hot Keys in Aurora DSQL 👷‍♂️

On instance-based Postgres, such as RDS, the latest row value is usually available in memory. That is not the case with Aurora DSQL, as the value has to be propagated to replicas. As a result, even counting page visits becomes a performance bottleneck. Marc Bowes explains what is happening and gives advice in this great blog post.

Aurora DSQL Best Practices: Avoid Hot keys
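
To make the page-visit example concrete, here is a minimal sketch of one common mitigation for a hot key: a sharded counter. It is not necessarily the approach the blog post recommends, and it does not talk to a real database. The `ShardedCounter` class, the shard count, and the in-memory dictionary standing in for a table are all assumptions made for illustration.

```python
import random

NUM_SHARDS = 8  # hypothetical value; tune it to the expected write concurrency


class ShardedCounter:
    """Spreads increments for one logical counter across several rows."""

    def __init__(self, num_shards=NUM_SHARDS, rng=None):
        self.num_shards = num_shards
        # Stands in for a table keyed by (counter_name, shard).
        self.rows = {}
        self.rng = rng or random.Random()

    def increment(self, name, amount=1):
        # Each write picks a random shard, so concurrent writers
        # rarely update the same row.
        key = (name, self.rng.randrange(self.num_shards))
        self.rows[key] = self.rows.get(key, 0) + amount

    def total(self, name):
        # Reads pay the price instead: they sum every shard of the counter.
        return sum(self.rows.get((name, s), 0) for s in range(self.num_shards))


counter = ShardedCounter()
for _ in range(100):
    counter.increment("page:/home")
print(counter.total("page:/home"))
```

The trade-off is that writes get cheaper while reads get more expensive, which suits counters that are updated far more often than they are displayed.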

#db #aws #auroradsql

The Million-Dollar Problem of Slow Microservices Testing 👷‍♂️

A great socio-technical question backed by financial calculations: I can't dream of a better start to an article. And while the profound question of microservices testing is answered with Tenant-Based Environments, I still wonder: can you solve 80% of the problem with Testcontainers?

The Million-Dollar Problem of Slow Microservices Testing
By shifting integration tests from the slow outer loop into the rapid inner loop, organizations can fundamentally transform their development process.

#microservices #qa

Scaling AWS DynamoDB 👷‍♂️

DynamoDB is the serverless key-value database offering from AWS. It claims to scale indefinitely while providing high durability and 99.99% availability. It is quite interesting how AWS achieves this under the hood. That's why I am sharing this article. Grab a read!

Software Architecture Deep Dive - Scaling AWS Dynamo DB
How DynamoDB, key-value schemaless cloud-native data store scales: Architecture and Design Lessons

#db #aws

NATS ADRs repo 🍼

NATS is a simple, secure and high performance open source data layer for cloud native applications, IoT messaging, and microservices architectures. But we are here not for the product but for its ADR repo: people frequently ask for real-life examples, and typically they are either artificial or not available due to NDAs. Well, grab real-life examples and learn!

GitHub - nats-io/nats-architecture-and-design: Architecture and Design Docs
Architecture and Design Docs . Contribute to nats-io/nats-architecture-and-design development by creating an account on GitHub.

#adr

Google's Post Mortem for the incident that took down half of the internet

On June 12th, 2025, a large number of major companies experienced an outage, Cloudflare included. The cause was the deployment of new functionality in Google Cloud to which substandard quality controls had been applied. An important lesson, though!

Google Cloud Service Health

Big thanks to Nikita, Constantin, Anatoly, Oleksandr, Dima, Pavel B, Pavel, Robert, Roman, Iyri, Andrey, Lidia, Vladimir, August, Roman, Egor, Roman, Evgeniy, Nadia, Daria, Dzmitry, Mikhail, Nikita, Dmytro, Denis and Mikhail for supporting the newsletter on Patreon!