Architecture Weekly #166

Architecture Weekly Issue #166. Articles, books, and playlists on architecture and related topics. Split into sections and marked by complexity: 🤟 means hardcore, 👷‍♂️ is technically applicable right away, 🍼 is an introduction to the topic or an overview. Now on Telegram and Substack as well.

System Design Course Cohort #5 is closed, but you can join the waiting list for Cohort #6 here.

Highlights

Understanding Distributed Consensus with Paxos 🤟

Paxos is a consensus algorithm. It may appear highly complex, but at its core it's just two methods, get and set, implemented for a group of 3 nodes. This awesome post guides you step by step through the whole problem of getting and setting a value in a distributed system in the face of timeouts and partitions. An absolute must-read.
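To make the "just two methods" idea concrete, here is a minimal single-value Paxos sketch in Python. It is not taken from the linked post: the names `Acceptor` and `propose`, the `up` flag that simulates a crashed or partitioned node, and the in-process calls standing in for network messages are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = 0                     # highest ballot promised so far
    accepted_ballot: int = 0              # ballot of the accepted value, 0 if none
    accepted_value: Optional[str] = None
    up: bool = True                       # False simulates a crash or partition (assumption)

    def prepare(self, ballot: int) -> Optional[Tuple[int, Optional[str]]]:
        # Phase 1: promise to ignore lower ballots and report any accepted value.
        if not self.up or ballot <= self.promised:
            return None
        self.promised = ballot
        return (self.accepted_ballot, self.accepted_value)

    def accept(self, ballot: int, value: str) -> bool:
        # Phase 2: accept unless a higher ballot was promised in the meantime.
        if not self.up or ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted_ballot = ballot
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], ballot: int, value: str) -> Optional[str]:
    """Run one round; return the chosen value, or None if no majority answered."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(ballot) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # If some acceptor already accepted a value, adopt the one with the highest ballot.
    _, previous_value = max(promises, key=lambda p: p[0])
    if previous_value is not None:
        value = previous_value
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None


nodes = [Acceptor(), Acceptor(), Acceptor()]
nodes[2].up = False                                 # one node is unreachable
first = propose(nodes, ballot=1, value="blue")      # succeeds with 2 of 3 nodes
nodes[2].up = True
second = propose(nodes, ballot=2, value="green")    # must adopt the already chosen value
stale = propose(nodes, ballot=1, value="red")       # stale ballot is rejected
```

In this sketch a "set" is a call to `propose`, and a "get" can be approximated by running another round with a higher ballot, since that round adopts whatever value a majority has already accepted.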

Distributed consensus

#distributedsystems

Choose Boring Technology 🍼

Technology is indeed fun, even if it's not art anymore. However, technology should primarily solve business problems. Apparently, if you stick to a narrow set of technologies, you take on less risk and spend more time building the product rather than fighting the tech. So, choose boring technology!

Choose Boring Technology

How to be old, for young people.

#philosophy

Understanding transaction visibility in PostgreSQL clusters with read replicas 🤟

See why your read replica might disagree with the primary! 🔍 AWS’s latest post breaks down the “Long Fork” quirk in PostgreSQL clusters—where replicas can show commits in a different order than the primary—and why it doesn’t risk data loss. You’ll get a plain‑English tour of snapshot vs. WAL timing, learn how future Commit Sequence Numbers aim to fix it, and pick up quick tips to keep your apps safe until the patch lands.
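As a rough illustration of what "a different order" means here, the sketch below checks two observations for a long fork. It assumes a simplified model in which each reader's snapshot is represented as the set of committed transaction IDs it can see; the function name `is_long_fork` and the transaction labels are hypothetical and do not come from the AWS post.

```python
from typing import Set


def is_long_fork(seen_a: Set[str], seen_b: Set[str]) -> bool:
    """Return True if no single commit order can explain both snapshots.

    Under one agreed commit order, every snapshot is a prefix of that order,
    so any two snapshots must be nested. If each one contains a transaction
    the other is missing, the two readers disagree about the order.
    """
    return not (seen_a <= seen_b or seen_b <= seen_a)


# Two independent transactions, T1 and T2, commit at about the same time.
primary_reader = {"T1"}        # sees T1 but not yet T2
replica_reader = {"T2"}        # sees T2 but not yet T1
later_reader = {"T1", "T2"}    # sees both, consistent with either order

fork = is_long_fork(primary_reader, replica_reader)
no_fork = is_long_fork(primary_reader, later_reader)
```

Both transactions are still durably committed in this scenario, which matches the point that the anomaly concerns visibility order rather than data loss.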

Understanding transaction visibility in PostgreSQL clusters with read replicas | Amazon Web Services

On April 29, 2025, Jepsen published a report about transaction visibility behavior in Amazon RDS for PostgreSQL Multi-AZ clusters. We appreciate Jepsen’s thorough analysis and would like to provide additional context about this behavior, which exists both in Amazon RDS and community PostgreSQL. In this post, we dive into the specifics of the issue to provide further clarity, discuss what classes of architectures it might affect, share workarounds, and highlight our ongoing commitment to improving community PostgreSQL in all areas, including correctness.

Amazon Web Services

#db

Follow-Up

How to avoid Single Point of Failure? 🍼

If a component goes down and the whole system stops functioning, that component is a single point of failure. Such points are a big risk for availability, and system architects should avoid them. This post is a good starting point for understanding what SPoFs are all about.
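One common way to remove a single point of failure is redundancy with failover. The sketch below is a minimal illustration of that idea, not code from the linked post: `call_with_failover` and the two stand-in replicas are hypothetical names, and treating `ConnectionError` as the failure signal is an assumption.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each redundant replica in turn and return the first successful result."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as err:
            # This replica is down; remember why and move on to the next one.
            last_error = err
    raise RuntimeError("all replicas failed") from last_error


def broken_replica() -> str:
    raise ConnectionError("replica-1 is unreachable")


def healthy_replica() -> str:
    return "response from replica-2"


result = call_with_failover([broken_replica, healthy_replica])
```

With a single replica in the list, that replica is the SPoF again, so the benefit comes from the second independent instance rather than from the retry loop itself.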

System Design: How to Avoid Single Point of Failures?

A Single Point of Failure (SPOF) is a component in your system whose failure can bring down the entire system, causing downtime, potential data loss, and unhappy users.

AlgoMaster Newsletter | Ashish Pratap Singh

#reliability

Postgres as a Graph Database 👷‍♂️

Turn your everyday Postgres into a mini graph powerhouse! Supabase’s new post shows how the pgRouting extension lets you run classic graph tricks—shortest paths, critical‑path scheduling, even smart server‑to‑server routing—without leaving SQL. It’s a quick, code‑packed tour that proves you don’t need Neo4j to think in nodes and edges.
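For readers who want a feel for what a shortest-path query computes, here is a plain Python version of Dijkstra's algorithm over rows shaped like an edge table. This is only a sketch: the `(source, target, cost)` row layout mirrors the kind of columns an edge table typically has, the edges are treated as directed, and the function name and sample data are made up for illustration. The post itself does this work in SQL.

```python
import heapq
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (source, target, cost), treated as directed


def shortest_path(edges: Iterable[Edge], start: Hashable, goal: Hashable) -> Optional[Tuple[float, List[Hashable]]]:
    """Return (total_cost, path) from start to goal, or None if goal is unreachable."""
    graph: Dict[Hashable, List[Tuple[Hashable, float]]] = defaultdict(list)
    for source, target, cost in edges:
        graph[source].append((target, cost))

    best = {start: 0.0}          # cheapest known cost to reach each node
    previous = {}                # back-pointers used to rebuild the path
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue             # stale queue entry, a cheaper route was already found
        for neighbor, cost in graph[node]:
            candidate = dist + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))

    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return best[goal], path[::-1]


rows = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0)]
route = shortest_path(rows, start=1, goal=3)
```

Here the two-hop route through node 2 beats the direct edge because its total cost is lower.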

Postgres as a Graph Database: (Ab)using pgRouting

Learn how to use pgRouting as a lightweight graph database solution in Postgres.

Supabase

#postgresql #db

Monarch: Google’s Planet-Scale In-Memory Time Series Database 👷‍♂️

Borgmon was the initial system at Google responsible for monitoring the behavior of internal applications and infrastructure. Each team had to deploy and maintain its own instance of Borgmon, which required specialized knowledge of the tool. In 2010 Google moved from Borgmon to Monarch, an in-memory time series database that now handles all internal monitoring across the globe. Read the paper to understand its distributed architecture and scale.

#observability #distributedsystems

How Cursor Works 👷‍♂️

AI IDEs have truly blown up. What's even more interesting is how they work under the hood: do they really just send a piece of code and a prompt to an LLM? This post shows how it happens.

How Cursor (AI IDE) Works

Turning LLMs into coding experts and how to take advantage of them.

Shrivu’s Substack | Shrivu Shankar

#ai

How Forethought saves over 66% in costs for generative AI models 👷‍♂️

Forethought’s engineering team shows how moving their fleet of fine‑tuned, customer‑specific generative‑AI models from EKS to Amazon SageMaker multi‑model endpoints chopped hosting costs by 66%, while SageMaker’s smart model loading keeps latency sub‑second. The article walks through the old vs. new stacks, shares real $/hour numbers, and offers tips for anyone wrangling lots of small LLMs on shared GPUs.

How Forethought saves over 66% in costs for generative AI models using Amazon SageMaker - Forethought AI Engineering

Welcome to the Forethought AI engineering team’s blog! We are a group of software engineers, data scientists, and machine learning experts who are committed to building innovative solutions to improve the efficiency and effectiveness of customer service teams.

#ai #sagemaker

Big thanks to Nikita, Constantin, Anatoly, Oleksandr, Dima, Pavel B, Pavel, Robert, Roman, Iyri, Andrey, Lidia, Vladimir, August, Roman, Egor, Roman, Evgeniy, Nadia, Daria, Dzmitry, Mikhail, Nikita, Dmytro, Denis and Mikhail for supporting the newsletter on Patreon!