About me

I’m a fractional CTO, software engineer and hands-on engineering manager. I have always been passionate about building highly scalable, secure and robust applications.

What this guide will cover

Why do you need to scale?
What do you need to scale?
Complete scaling techniques list
Process of designing any system
Design a few apps together

Why scaling?

Your business is growing, and there are so many users on your site that things are starting to slow down. The amount of data your business needs to store is increasing, and servers cannot handle the load.

During the holidays or other events, the load on your application or website can increase more than ten times. If your app has vulnerabilities or weaknesses, many users will find them.

For SaaS platforms, scaling problems can lead to lower revenue than expected. You may not be able to onboard new clients quickly or significantly expand your client base.

Usually, a bad user experience causes you to lose clients. Messages, notifications, or emails are not delivered to end users. Visitors must repeat certain steps to complete the business flow.

You start losing important data. Lost user invoices, consent forms, or transactions can become legal issues.

Pages load slowly, network connections time out, and your servers struggle under heavy load.

You need to keep everything running and ensure the user experience is fast and smooth – speed and smoothness are features.

Amazon found that every 100 ms of latency cost them 1% in sales!

“A one-second delay in page response can result in a 7% reduction in conversions.”

along with

“47% of consumers expect a web page to load in two seconds or less.”

What do you need to scale?

From a high-level view it’s always about CPU, memory or I/O, because all apps and services live on hardware, whether in the cloud or in your server room.

But to understand where losses or slowdowns happen, you have to know what is going on in your system in as much detail as possible, from the user input in the browser or application all the way to the database query and back.

You have tweaked every possible component in your code and server settings, but it’s still not enough. Let’s talk about techniques that will allow you to scale your app even further.

From here on we will talk about concepts only: no technology specifics, no tooling, just fundamental concepts that can be implemented in any system.

The cube of scalability is our Holy Grail. It nicely and simply describes how to achieve “nirvana” - infinite scaling. But it mostly covers the application level. Based on that, I split all techniques into three parts: application level, auxiliary level and data level.

Application level - holds your entire business logic; usually it’s the Ruby or PHP scripts running on your server.

Auxiliary level - a kind of helper in the scaling world. It’s not tied to your business logic or data.

Data level - usually holds the state of your application, which might be a database or another kind of storage.

Let’s go through a complete list of techniques that covers all aspects of the scaling cube and even more.

Vertical scaling

Vertical scaling is the most commonly used and the simplest technique: just add more power to your machines.

Database ran out of disk space? - Add more gigabytes.

Application server became slow? - Upgrade from medium to large instances.

The effect of this technique is temporary, but it is an easy way to buy time to implement more fundamental solutions.

Horizontal scaling

Horizontal scaling means running your logic on multiple machines with a load balancer in front of them. It gives you extra capacity.

Let’s say your app has started dropping requests or has simply become slow, and upgrading to a bigger instance is too pricey or your app is already on the biggest instance.

You can run your app on multiple machines during rush hours and balance the load between them, which solves the issue with slow requests. As an extra plus, it makes your system more fault tolerant. One instance died? - Not a problem, you have another instance that will take the load, and your team will have time to understand why it happened and fix it.
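
To make this concrete, here is a minimal sketch in Python. It assumes a simple round-robin strategy and hypothetical instance names (`app-1`, `app-2`); a real load balancer would also run health checks over the network. It only shows requests being spread across instances and a dead instance being skipped.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: hands out healthy instances in turn."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._ring = cycle(self.instances)

    def mark_down(self, instance):
        self.healthy.discard(instance)

    def mark_up(self, instance):
        if instance in self.instances:
            self.healthy.add(instance)

    def pick(self):
        # Try each instance at most once per call.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances left")


balancer = RoundRobinBalancer(["app-1", "app-2"])
first_two = [balancer.pick(), balancer.pick()]      # alternates between instances
balancer.mark_down("app-1")                         # one instance died
after_failure = [balancer.pick(), balancer.pick()]  # traffic goes to the survivor
```

Because the balancer may send consecutive requests from one user to different instances, this only works if the app keeps no per-user state on the instance itself.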

This technique has a few requirements for your backend application:

Your app should be stateless: a user request should be executed the same way on server “a” and on server “b”.

Try to eliminate components shared between multiple servers. Otherwise they become a common point of failure and a possible bottleneck in the future. A bottleneck is a component which slows the whole system down. But do we really need to process everything in real time?

Postponed execution

Postponed execution is a really nice and elegant technique with a low entry level.

As I mentioned before, you don’t need to do everything in real time. You can create a task and execute it later.

Sending an invoice can be done a few minutes after the user clicks the payment button.

When should your post be visible on Instagram? Immediately? I don’t think so.

This technique allows you to offload non-critical functionality so that you can process requests faster.
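
A minimal sketch of the idea in Python, assuming an in-process `queue.Queue` and a single background thread as stand-ins for a durable task queue and a separate worker process. The names `handle_payment` and `send_invoice` are hypothetical. The request handler only enqueues the invoice task and returns immediately.

```python
import queue
import threading

tasks = queue.Queue()
sent_invoices = []


def handle_payment(order_id):
    """Request handler: do the critical work, defer the rest."""
    # Critical path: (hypothetically) charge the card here.
    tasks.put(("send_invoice", order_id))  # postponed, not executed now
    return "payment accepted"


def worker():
    """Background worker: drains the queue until it sees the stop marker."""
    while True:
        task = tasks.get()
        if task is None:
            break
        name, order_id = task
        if name == "send_invoice":
            sent_invoices.append(order_id)  # stand-in for real invoice sending


response = handle_payment(42)   # returns before any invoice is sent
thread = threading.Thread(target=worker)
thread.start()
tasks.put(None)                 # stop marker, only needed for this demo
thread.join()
```

In a real system the queue should survive restarts, otherwise a crash loses the postponed work.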

Asynchronous processing

Similar to postponed execution, but done in parallel: the same task is split across several machines.

Need to send 5 mln notifications or millions of emails? Just split the task and send the parts to multiple machines.

This technique requires some modification to your infrastructure, and you will have to be very careful about failures. From my experience, even a simple task might have invisible bugs which can break something. Once I sent 8 emails instead of 1 to a few users because of a uid collision.
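
A minimal sketch in Python, with threads standing in for separate machines (an assumption for the sake of a runnable example). The batch is split into chunks, and each send is guarded by a hypothetical idempotency key `(campaign_id, user_id)`, so a task that is delivered twice does not produce a second email.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

delivered = []            # stand-in for the real email/push gateway
_already_sent = set()     # idempotency keys of messages that were sent
_lock = threading.Lock()


def send_once(campaign_id, user_id):
    """Send at most one message per (campaign, user) pair."""
    key = (campaign_id, user_id)
    with _lock:
        if key in _already_sent:
            return False  # duplicate task, skip it
        _already_sent.add(key)
    delivered.append(user_id)
    return True


def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_campaign(campaign_id, user_ids, workers=4, chunk_size=1000):
    """Process chunks in parallel and return how many messages were sent."""
    def process(chunk):
        return sum(send_once(campaign_id, uid) for uid in chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process, chunks(user_ids, chunk_size)))
```

With real machines the set of idempotency keys has to live in shared storage; here it is a plain in-memory set.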

Functional separation

Functional separation involves isolating different functionalities onto separate machines, allowing each to be optimised independently. This technique enhances maintainability, performance, and fault tolerance. It also supports concurrent development by enabling teams to work on different parts of the system simultaneously.

Authentication Service: Manages user login and registration.
Content Management Service: Handles content creation and storage.
Recommendation Service: Provides personalised user recommendations.

Service-oriented architecture

This is one of the most powerful architectures. The entry level is a bit higher than previous options, but it offers a lot of flexibility in development, deployment, and testing.

The idea is simple: imagine you have a website that serves articles, allows users to follow others, has authentication, and includes a commenting system. You can split all of these into separate services. Your system will become more fault-tolerant. For example, if the weather service on a website like Yahoo goes down, everything else will remain fully functional.

However, if your team lacks experience with SOA, it’s better to seek advice from a company. Also, talk with product managers and other stakeholders to clearly define the business value and strategic goals of each service.

Service-oriented architecture illustration

Adds communication overhead
Splits logic and data
Allows concurrent deploys and development
Makes the system more fault tolerant
Allows individual scaling of components

Parallel execution

This is rocket science. This technique is similar to an asynchronous queue, but it is done in real time by splitting a request and processing it on several machines.

I know of no more than 2-3 companies that use it in the business logic layer. Google is one of them: when you hit the search page, your request goes to dozens or even hundreds of servers. The request is split, the servers process the parts, and then the results are collected and returned to the user in real time!

If you have used MapReduce, you probably know what I am talking about. If you know how to build this, do it; if not, think about other options.
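
A toy scatter-gather sketch in Python. The index partitions, documents and scores are made up for illustration, and threads stand in for the servers. The query is sent to every partition at once (the map step), and the partial results are merged into one answer (the reduce step).

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical index partitions; in production each would live on its own server.
PARTITIONS = [
    {"scaling guide": 0.9, "cooking tips": 0.1},
    {"database scaling": 0.8, "travel blog": 0.2},
    {"scaling teams": 0.7, "gardening": 0.3},
]


def search_partition(partition, term):
    """Map step: find matching documents inside one partition."""
    return [(score, doc) for doc, score in partition.items() if term in doc]


def search(term, limit=2):
    """Scatter the query to every partition in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partial = pool.map(lambda p: search_partition(p, term), PARTITIONS)
        hits = [hit for part in partial for hit in part]
    # Reduce step: keep only the best results overall.
    return [doc for _, doc in heapq.nlargest(limit, hits)]
```

The hard parts that this sketch leaves out are the ones that make it rocket science: timeouts, slow or failed partitions, and partial answers.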

Fat client

So we have covered the possible ways to scale your application level; now let’s talk about your clients. Fat client simply means moving as much as possible to the client side.

Instead of server-side rendering, move everything to the client. Send lightweight JSON or the result of a GraphQL query and let the client render views from that data. Your client can call each service asynchronously to provide a better user experience. This trend seems to be reversing at the moment, though, and static content is on the rise.

A fat client can save you a lot of resources. Let me explain why by comparing an Amazon c5.large instance with an iPhone X. Amazon c5.large - 2 cores, 4 GB of RAM; iPhone - 6 cores, 4 GB of RAM. Yes, it is a different processor architecture, but it is comparably powerful.

You can even move the load balancer to the client side; as far as I know, X (formerly Twitter) is doing it.

Caching

Everyone knows about caching. I will just add a few words about it.

From what I have seen in some companies, the cache is slowing their system down. Let’s imagine that going to the cache takes 50 ms and going to the database takes 200 ms. On a hit you are 4 times faster. But on a miss you pay for both: 250 ms instead of 200 ms, plus you store useless results in memory. With these numbers the cache only pays off when the hit rate is above 25%; below that, your system becomes slower than it would be without a cache.

Always calculate the hit rate. And your system should work without the cache: if it is not able to start without it, something is wrong in the architecture.
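
The arithmetic is easy to check. The sketch below uses the example figures from this section (50 ms for the cache, 200 ms for the database) and assumes that every request checks the cache first and falls back to the database on a miss. It computes the average latency per request and the hit rate at which the cache stops paying off.

```python
def expected_latency_ms(hit_rate, cache_ms=50, db_ms=200):
    """Average latency when every request checks the cache first."""
    hit_cost = cache_ms
    miss_cost = cache_ms + db_ms   # wasted lookup plus the real query
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost


def break_even_hit_rate(cache_ms=50, db_ms=200):
    """Hit rate at which the cached path costs the same as the database alone.

    Solves cache_ms + (1 - h) * db_ms == db_ms for h.
    """
    return cache_ms / db_ms
```

For the example figures, `break_even_hit_rate()` is 0.25 and `expected_latency_ms(0.25)` is 200 ms, the same as going straight to the database.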

We have already covered the application level and the auxiliary level. Now let’s move on to the data level.

Sharding

Sharding is, in a way, horizontal scaling of the database.

Let’s say we have a database where we store subscribers to our newsletters: 10 million subscribers, plus their profiles, preferences, etc. It doesn’t fit in one database. Let’s shard it by primary id or by some hash: move 5 mln subscribers to one database and 5 mln to another.

But from time to time you will have to do a manual migration and will likely have to freeze your system for a while. To be better prepared, have a look at virtual sharding.
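
A minimal sketch of hash-based shard routing in Python. The database names are hypothetical, and SHA-256 is just one reasonable choice of a stable hash (Python’s built-in `hash()` for strings changes between processes, so it is not suitable here).

```python
import hashlib

SHARDS = ["subscribers_db_0", "subscribers_db_1"]   # hypothetical database names


def shard_for(subscriber_id, shards=SHARDS):
    """Pick a shard from a stable hash of the subscriber id."""
    digest = hashlib.sha256(str(subscriber_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Note the weakness: the result depends on `len(shards)`, so adding a third database changes the shard of most subscribers. That is exactly the manual migration mentioned above.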

Sharding :: Virtual sharding

The idea of this technique is to prepare your system for high load in advance. As we discussed at the beginning, even a lost consent form might be a critical issue for the business.

Let’s say we are building a social network, messenger or dating app. You know that at the beginning you will have 10,000 people. After the first marketing campaign you expect 150,000, and so on. To be prepared for such a volume of data, shard your database up front: create 500 logical databases on one physical database server. Once one of them is full, move it to its own physical server.
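
A minimal sketch of the bookkeeping in Python, assuming integer user ids and hypothetical server names. The rule that maps a user to one of the 500 virtual shards never changes; only the table that says where each virtual shard physically lives is updated.

```python
VIRTUAL_SHARDS = 500

# Every virtual shard starts on the same physical server (hypothetical names).
placement = {shard: "db-server-1" for shard in range(VIRTUAL_SHARDS)}


def virtual_shard(user_id):
    """The user -> virtual shard rule never changes."""
    return user_id % VIRTUAL_SHARDS


def server_for(user_id):
    """Look up which physical server currently hosts the user's shard."""
    return placement[virtual_shard(user_id)]


def move_shard(shard, new_server):
    """Relocate one virtual shard; only the placement map changes."""
    # A real migration would copy the shard's data before flipping the pointer.
    placement[shard] = new_server
```

Moving one virtual shard touches roughly 1/500 of the data, which is much cheaper than re-sharding everything.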

Sharding :: Central dispatcher

This pattern allows you to control your shards. In a way it works as a proxy between your app and databases.

For example, on one of your shards you spotted 1 mln bot or scam accounts and deleted them. After that you can tell the central dispatcher to put new users there.

A central dispatcher adds quite some complexity and coupling to a single component, but if you have a specific need - why not.
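
A minimal sketch of such a dispatcher in Python. The class name, the shard names and the “least loaded shard” rule are my assumptions for illustration; the point is that the user-to-shard mapping is an explicit directory which you control, not a formula.

```python
class CentralDispatcher:
    """Keeps an explicit user -> shard directory and decides where new users go."""

    def __init__(self, shards):
        self.load = {shard: 0 for shard in shards}   # users per shard
        self.directory = {}                          # user_id -> shard

    def register(self, user_id):
        """Place a new user on the least loaded shard (existing users stay put)."""
        if user_id in self.directory:
            return self.directory[user_id]
        shard = min(self.load, key=self.load.get)
        self.directory[user_id] = shard
        self.load[shard] += 1
        return shard

    def locate(self, user_id):
        return self.directory[user_id]

    def remove(self, user_id):
        """Forget a user, for example a deleted bot account."""
        shard = self.directory.pop(user_id)
        self.load[shard] -= 1
```

After the bot accounts are removed from a shard, its load drops and `register` starts sending new users there. The price is that every lookup now goes through this one component.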

Replication

We’ve talked about heavy writes and storing data, but web projects usually have many more read queries than write queries. The well-known approach here is replication.

There are a lot of cases; let’s take a simple one. You publish an article and store it in a database. How many users will request it? Ten, a thousand or a million?

A lot of databases have native tools for this. The principle is as follows: you have the main database (master) and its replicas (slaves). Write queries go to the master and read queries go to the replicas.
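
A toy read/write router in Python to illustrate the principle. Dictionaries stand in for the databases, and the copy to the replicas happens immediately, which is a simplification: real replication is usually asynchronous, so replicas can lag behind.

```python
import itertools


class ReplicatedDatabase:
    """Routes writes to the main database and spreads reads over the replicas."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self.reads_per_replica = [0] * replica_count

    def write(self, key, value):
        self.primary[key] = value
        # Simplification: copy to every replica at once.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        index = next(self._next_replica)      # round-robin over replicas
        self.reads_per_replica[index] += 1
        return self.replicas[index].get(key)
```

To serve more reads you add replicas; the main database still takes every write.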

Partitioning

Functional or logical separation of your data.

Or, as some people from the microservices world call it, “Polyglot Persistence”.

Articles are stored in database 1 and comments are stored in database 2. This pattern also allows you to choose the best-fitting technology for each use case.

Denormalisation

This technique is heavily used in our system. It does not require introducing any new tools or services. Denormalisation of a database is a normal process in any web project.

If you are reading 10 times more than you are writing, optimise your data so that it can be read more easily and efficiently.
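
A small illustration in Python, using a hypothetical comment counter. The normalised read has to scan the comments every time; the denormalised version keeps a ready-made `comment_count` on the article and pays a little extra on each write.

```python
articles = {1: {"title": "Scaling 101", "comment_count": 0}}
comments = []   # normalised source of truth: (article_id, text)


def add_comment(article_id, text):
    """Write path: store the comment and update the denormalised counter."""
    comments.append((article_id, text))
    articles[article_id]["comment_count"] += 1


def comment_count_slow(article_id):
    """Normalised read: scans every comment."""
    return sum(1 for aid, _ in comments if aid == article_id)


def comment_count_fast(article_id):
    """Denormalised read: a single lookup."""
    return articles[article_id]["comment_count"]
```

The trade-off is that the two copies must be kept in sync on every write.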

Redundancy

Redundancy is similar to denormalisation, except that you do not change the form of the data.

Let’s say your latest articles always get a high volume of views - store them in a separate table with 10 rows. When another article comes out, store it in your main storage and also update the small table, which keeps reads of the latest articles fast.
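
A minimal sketch in Python, where a dictionary stands in for the main storage and a bounded `deque` stands in for the small “latest” table (both are assumptions for illustration). Every new article is written to both places in the same form.

```python
from collections import deque


class ArticleStore:
    """Main storage plus a tiny redundant copy of the newest articles."""

    def __init__(self, latest_size=10):
        self.articles = {}                        # main storage: id -> article
        self.latest = deque(maxlen=latest_size)   # small "latest" table

    def publish(self, article_id, title):
        article = {"id": article_id, "title": title}
        self.articles[article_id] = article   # write to the main storage
        self.latest.appendleft(article)       # same data, same form, second place

    def latest_articles(self):
        """Newest first, served without touching the main storage."""
        return list(self.latest)
```

When the small table is full, the oldest entry drops out of it automatically but stays in the main storage.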

How to design?

BUSINESS LOGIC We describe the business logic of the future system, including the ways it may evolve, and outline key features and functionality.
THE NUMBERS We calculate the volume of stored data and its growth rate, and choose the critical path: storing, writing or reading data?
DEGRADATION We determine the acceptable degree of degradation of the system.
DATA We draw the data flow and decide which of the techniques described above we will use.
SCHEME We design the data storage scheme.

Let’s design some apps

Job applications

Requirements

Business logic

Read fresh vacancies
Read vacancies from archive
Recruiters can publish and update vacancies

Numbers

Each vacancy ~15-30 KB
Store all vacancies since the year 2000
Each day 5k vacancies: ~2 mln per year (~40 GB), ~20 mln per 10 years (~400 GB)
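
The estimate can be reproduced with a few lines of Python. The 20 KB average size is my assumption, taken from inside the 15-30 KB range; the results come out slightly below the rounded figures in the list.

```python
VACANCIES_PER_DAY = 5_000
AVG_VACANCY_KB = 20   # assumed average, within the 15-30 KB range


def vacancies(years):
    """Number of vacancies published over the given number of years."""
    return VACANCIES_PER_DAY * 365 * years


def storage_gb(years):
    """Storage needed for those vacancies, in decimal gigabytes."""
    return vacancies(years) * AVG_VACANCY_KB / 1_000_000   # KB -> GB
```

`vacancies(1)` gives 1,825,000 (about 2 mln) and `storage_gb(10)` gives 365 GB (about 400 GB), so the whole archive fits on a single modern server.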

Degradation degree

Data

Reads much more often than writes
A lot of views go to the latest vacancies
The majority of views go to vacancies from the last week

Design

Sharding?

Data not equal
No

Redundancy

Write to two databases (hot and cold)

Cache hot database
Partitioning in archive database

By date (example)

Dating app

Requirements

Business logic

Fill profile (profiles have common structure)
Email and password for login
Users can see profiles of others

Numbers

150-250 mln users
Each profile ~10 KB (~2,000 GB in total)
5 billion hits per day

Degradation degree

Data

Reads much more often than writes
Equal size of profiles
Each profile has unique id
No leaders

Design

Caching?

No leaders - caching will be useless
No

Replication?

5 billion hits / (24 * 60 * 60) ≈ 58k reads per second on average

Sharding?

Sharding.
Which key?
We have unique id in data requirements
But in the functional requirements we have login by email and password.
So use the first two characters of the email.
Central dispatcher (if needed)
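
A minimal sketch of that shard key in Python. The number of shards (16) and the use of CRC32 to turn the two characters into a shard number are my assumptions; the only idea taken from the design above is that the key comes from the first two characters of the email.

```python
import zlib


def shard_key(email):
    """Shard key: the first two characters of the normalised email."""
    return email.strip().lower()[:2]


def shard_for_login(email, shard_count=16):
    """Map the two-character key onto one of `shard_count` shards."""
    key = shard_key(email)
    # CRC32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % shard_count
```

At login time we know only the email, and that is enough to find the right shard. Two-letter prefixes are not evenly distributed, though, which is one reason to keep the central dispatcher as an option.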

Friend feed (aka X)

Requirements

Business logic:

Unlimited number of friends or followers
Infinite feed (store all entities)

Numbers

On average 100 friends
3 posts per day
1 post ~1kb
100 mln users per day, each user produces 100 hits, 10 bln requests per day
30mln posts per day, 10bln rows per year

Degradation degree

Posts might be shown with a delay
Order might not exactly match the timeline

Data

99% of reads go to fresh posts
Users with million friends or followers

Design

How to store posts?
- Sharding
- Starting with virtual sharding
Storing just ids
Users with a large number of followers
- queue, postponed execution
Fetching actual posts?
- Fat client
- Caching
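
To show how these pieces fit together, here is a toy sketch in Python. In-memory dictionaries stand in for the sharded storage, and the threshold of 1000 followers for a “big” account is my assumption. Feeds store only post ids, posts of big accounts are fanned out later from a queue, and the actual posts are fetched by id when the feed is read.

```python
from collections import defaultdict, deque

FANOUT_LIMIT = 1000            # assumed threshold for "big" accounts

posts = {}                     # post_id -> text (post storage)
followers = defaultdict(set)   # author -> set of followers
feeds = defaultdict(list)      # user -> list of post ids, newest first
deferred = deque()             # fan-out jobs postponed for big accounts


def fan_out(author, post_id):
    """Put the post id into the feed of every follower."""
    for follower in followers[author]:
        feeds[follower].insert(0, post_id)   # feeds store ids only


def publish(author, post_id, text):
    posts[post_id] = text
    if len(followers[author]) > FANOUT_LIMIT:
        deferred.append((author, post_id))   # handled later by a worker
    else:
        fan_out(author, post_id)


def run_deferred():
    """Worker step: process the postponed fan-out jobs."""
    while deferred:
        fan_out(*deferred.popleft())


def read_feed(user, limit=20):
    """Resolve the stored ids into actual posts."""
    return [posts[post_id] for post_id in feeds[user][:limit]]
```

This is where the allowed degradation is used: a post from a big account appears with a delay, and the order in the feed follows delivery time rather than the exact timeline.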

Ultimate Tech Scaling Guide
