Processing a Billion Jobs: Fault Tolerance Simulation

nand · May 5, 2021

Problem Statement

Rather than going into industry jargon, I would like to keep the problem statement as generic as possible.

You are given an undirected graph (network) with many nodes and links, where each node and link has different properties. We have to simulate the impact of one or multiple link failures on the overall network.

Use Cases

Simple Graph
Complex Graph

P.S.: The above images are taken directly from the internet. Please don’t sue me :)

Real life use cases would be to simulate:

  • Impact of road blockage(s) in a traffic network
  • Impact of rail line failure(s) in a railway network
  • Impact of fiber optic cable failure(s) in a telecom network
  • Impact of power cable failure(s) in a power network
  • Impact of the capture of a few base camps by enemy forces in an army deployment network

Basically, when we are building a road network in a certain area or state, we would not want one or two road blockages to stop traffic between two places.

Domino Effect

When a link A is unavailable, all its traffic could be redirected to another link, B. Link B may or may not be able to handle this new traffic, which could lead to the failure of link B as well. This could eventually turn into a domino effect, disrupting the entire network.

Think of the infamous Suez Canal blockage by a single ship, which wrecked the entire global supply chain.

Scale Challenge

Take a network with N links on which we are simulating k simultaneous link failures.
The number of cases scales as O(N^k).

TL;DR: skip the maths below if it doesn’t interest you.

links = {l1, l2, l3, ……, ln}

Possible combinations for one link failure = [{l1}, {l2}, {l3}, ……, {ln}]
== n cases
Possible combinations for two link failures = [{l1, l2}, {l1, l3}, ……, {ln-1, ln}]
== n * (n-1) / 2 cases

i.e. total combinations for k link failures = C(n, k) = n * (n-1) * … * (n-k+1) / k!, which grows as O(n^k)

So let’s say it’s a 1000-link network
Total cases for one link failure = 1000
Total cases for two link failures = 1000 * 999 / 2 ≈ 5 * 10⁵ == half a million
Total cases for three link failures = 1000 * 999 * 998 / 6 ≈ 1.7 * 10⁸ == hundreds of millions, and by four simultaneous failures the count is already in the tens of billions
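
To make the scale concrete, here is a minimal Python sketch (using the hypothetical 1000-link network from the example above) that counts the failure combinations exactly and enumerates them lazily:

```python
from itertools import combinations
from math import comb

n_links = 1000                              # the hypothetical 1000-link network above
links = [f"l{i}" for i in range(1, n_links + 1)]

# Exact counts of k simultaneous link failures
for k in (1, 2, 3):
    print(f"k={k}: {comb(n_links, k):,} combinations")
# k=1: 1,000
# k=2: 499,500
# k=3: 166,167,000

# Enumerate lazily: never materialise the k=3 list in memory
two_link_failures = combinations(links, 2)  # yields ('l1', 'l2'), ('l1', 'l3'), ...
print(next(two_link_failures))              # ('l1', 'l2')
```

Enumerating lazily matters: materialising all three-link combinations as a list would exhaust memory long before any simulation runs.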

Architecture

Since each node and link is unique in itself, the failure simulation of each link (and each combination of links) is unique and independent as well.

Architecture

Components

Combination Service
This service creates all possible combinations based on the parameters of the simulation. All combinations are independent in nature; they are pushed to a queue and their state is created in the database.

The queue and database being external resources, every communication with them incurs network latency.

Processing time for each combination includes:

  • Network call latency = ~2ms
  • Queue publish latency = ~50ms
  • Cassandra read/write latency = ~10ms

On average, every combination would take around 100ms. Sequentially, 1 billion combinations would take 10⁸ seconds == ~3 years.

P.S. If this time frame is acceptable to you, you live a great life :)

Batching
A higher batch size doesn’t increase latency linearly, so a batch of 1,000 messages doesn’t incur 1,000x the latency; it stays under ~100ms. The same goes for DB writes.

Along with batching, a single queue message can itself contain multiple combinations. This saves us network calls and API requests, but the consumer picks up this collection as a single unit and has to process it as such, which requires additional handling at the consumer end.

So a batch of 1,000 would reduce the time almost linearly, from 10⁸ seconds down to roughly 10⁵ seconds.
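
As a rough sketch of what the combination service’s publish path could look like, assuming the queue is Amazon SQS (the queue technology isn’t named here) and that several combinations are packed into each message body:

```python
import json
from itertools import combinations, islice

import boto3

# Assumption: the queue is Amazon SQS; QUEUE_URL is a placeholder.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/simulation-jobs"
sqs = boto3.client("sqs")

COMBOS_PER_MESSAGE = 100   # pack many combinations into one message body
MESSAGES_PER_BATCH = 10    # SQS allows at most 10 entries per SendMessageBatch call

def chunked(iterable, size):
    """Yield successive lists of `size` items from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def publish_combinations(links, k):
    combos = combinations(links, k)
    # One queue message carries COMBOS_PER_MESSAGE combinations ...
    messages = ({"combos": chunk} for chunk in chunked(combos, COMBOS_PER_MESSAGE))
    # ... and each network call carries up to 10 such messages.
    for batch in chunked(messages, MESSAGES_PER_BATCH):
        entries = [
            {"Id": str(j), "MessageBody": json.dumps(body)}
            for j, body in enumerate(batch)
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
```

With 100 combinations per message body and up to 10 messages per SendMessageBatch call, a single network round trip covers 1,000 combinations, which is where the roughly 1000x reduction comes from.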

Multi-Threading
Given how I/O-intensive our process is, we should let other threads run while one thread is waiting on I/O. Both the queue and the DB can support tens of thousands of concurrent connections in parallel with minimal impact on performance. We should leverage that.

But there’s a limit on the number of threads that can run on a host. Scaling vertically could help raise this limit.

A lot of performance tuning is required based on the host’s memory, the cores available and the resources required by each thread.

This tuning could reduce the time further, from ~10⁵ seconds to a few hundred seconds.
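
A minimal sketch of the threading idea, using Python’s ThreadPoolExecutor with hypothetical placeholders for the queue and DB calls:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def submit_batch(batch_id, batch):
    """Hypothetical I/O-bound unit of work: publish one batch to the queue
    and record its state in the database. Both calls mostly wait on the
    network, so many of them can overlap on a single host."""
    # publish_to_queue(batch)             # e.g. the SQS publish sketched earlier
    # write_state_to_db(batch_id, batch)  # e.g. a DynamoDB/Cassandra write
    return batch_id

def submit_all(batches, max_workers=64):
    # max_workers is a tuning knob, bounded by host memory, cores and the
    # number of concurrent connections the queue/DB will tolerate.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(submit_batch, i, b) for i, b in enumerate(batches)]
        for future in as_completed(futures):
            future.result()   # re-raises any exception from the worker thread
```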

Simulator

This service picks a message from the queue, pulls its state from Cassandra, processes it, pushes the result back to the queue and updates the state in Cassandra.

This black box takes the network and runs some proprietary algorithms, based on flow networks (see the Flow network article on Wikipedia) and graph traversals, to determine the impact of the link failure(s) on the whole network.

On each instance, 2n simulator workers (processes) run in parallel, where n is the number of virtual cores.

After a job is completed, the worker picks another job from the queue. This runs in a `while true` loop.

After job completion, the combination service needs to be informed about the status of jobs to monitor overall progress. Calling the combination service synchronously would overwhelm it and, in turn, increase the wait time for completion. So instead, results are pushed into another queue topic, of which the combination service is a consumer, making the whole process asynchronous.
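
Here is a hedged sketch of what one simulator worker’s loop could look like, again assuming SQS for both the job and result queues; the simulation and state calls are left as commented placeholders for the proprietary pieces described above:

```python
import json

import boto3

# Assumption: both queues are SQS; the URLs below are placeholders.
sqs = boto3.client("sqs")
JOBS_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/simulation-jobs"
RESULTS_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/simulation-results"

def worker_loop():
    while True:   # each of the 2n workers per instance runs this loop
        resp = sqs.receive_message(
            QueueUrl=JOBS_QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,      # long polling: cheap while the queue is empty
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            # state = load_state(job)                 # read from Cassandra/DynamoDB
            # result = run_flow_simulation(job)       # proprietary flow-network algorithm
            # save_result(job, result)                # write state back
            result = {"job": job, "status": "done"}   # stand-in result for the sketch
            # Report completion asynchronously on a separate results queue,
            # which the combination service consumes to track progress.
            sqs.send_message(QueueUrl=RESULTS_QUEUE_URL, MessageBody=json.dumps(result))
            sqs.delete_message(QueueUrl=JOBS_QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```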

Auto Scaler

This scales the number of simulator instances up and down based on the total number of messages in the queue. AWS provides auto-scaling groups (ASGs) which can scale up or down based on certain monitors. The issue here is that our metric is queue size, which an ASG doesn’t track natively. So we have a Lambda function running which calculates the size of the queue and, based on that, increases or decreases the *desired instances* configuration.

Based on this, the ASG scales up or down. This way, when the queue size is 0, desired instances is 0 as well. Hence no idle cost.

Lambda Function

This constantly reads the total number of messages available in the queue and updates the auto scaling group’s desired instances configuration.
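
A sketch of such a Lambda handler, assuming an SQS queue; the queue URL, ASG name and scaling ratio are placeholder assumptions:

```python
import boto3

# Placeholders: queue URL, ASG name and jobs-per-instance ratio are assumptions.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/simulation-jobs"
ASG_NAME = "simulator-asg"
JOBS_PER_INSTANCE = 500    # rough throughput target per instance, a tuning knob
MAX_INSTANCES = 200

sqs = boto3.client("sqs")
asg = boto3.client("autoscaling")

def handler(event, context):
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Empty queue -> desired capacity 0 -> no idle cost.
    desired = min(MAX_INSTANCES, -(-backlog // JOBS_PER_INSTANCE))  # ceiling division
    asg.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
    return {"backlog": backlog, "desired": desired}
```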

Cassandra

Our primary requirement is high read/write throughput, and we can let go of ACID properties. Eventual consistency is something we can live with as long as it favors performance. Hence Cassandra is a perfect candidate.
It promises millisecond latency irrespective of the size of the database and is easy to scale horizontally. Learn more about Cassandra.

To avoid managing this ourselves, we again looked at AWS. It provides DynamoDB as a managed service with pay-per-use pricing. That’s perfect, as we do have long idle periods between jobs.
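
For illustration, a small sketch of how job state could be written to DynamoDB with boto3; the table name and schema are assumptions, not something prescribed here:

```python
import boto3

# Assumption: a DynamoDB table named "simulation-jobs-state" (placeholder),
# keyed by a combination id and holding each job's status.
table = boto3.resource("dynamodb").Table("simulation-jobs-state")

def record_pending_jobs(combo_ids):
    # batch_writer buffers and flushes puts in groups of 25 behind the scenes,
    # keeping the per-item cost well below one network round trip each.
    with table.batch_writer() as writer:
        for combo_id in combo_ids:
            writer.put_item(Item={"combo_id": combo_id, "status": "PENDING"})

def mark_done(combo_id, result_summary):
    table.update_item(
        Key={"combo_id": combo_id},
        UpdateExpression="SET #s = :s, result_summary = :r",
        ExpressionAttributeNames={"#s": "status"},   # "status" is a reserved word
        ExpressionAttributeValues={":s": "DONE", ":r": result_summary},
    )
```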

Redis
There is certain data that is shared among jobs, like the network configuration. Since such data is used by all jobs, we needed a cache to avoid making repeated DB calls.

This cache had to be distributed in nature: with thousands of instances running in parallel, it can’t live in any one instance’s memory.

We also needed a collection of data structures in this cache, as the data was quite heterogeneous.

Redis was the perfect candidate for this task.
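
A minimal cache-aside sketch with redis-py; the endpoint, key names, TTL and DB helper are assumptions for illustration:

```python
import json

import redis

# Assumption: a shared ElastiCache/Redis endpoint (placeholder host) holding
# read-mostly data such as the network configuration used by every job.
r = redis.Redis(host="simulation-cache.example.internal", port=6379, decode_responses=True)

def get_network_config(network_id, ttl_seconds=3600):
    key = f"network-config:{network_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: fall back to the database, then populate the cache so the
    # thousands of parallel workers don't all hit the DB for the same data.
    config = load_network_config_from_db(network_id)   # hypothetical DB helper
    r.set(key, json.dumps(config), ex=ttl_seconds)
    return config

def load_network_config_from_db(network_id):
    # Placeholder standing in for the DynamoDB/Cassandra read.
    return {"network_id": network_id, "links": [], "nodes": []}
```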

Those were the major components. This is a simplistic architecture for such a broad, generic problem.

There could be other solutions as well, like using Spark Streaming with Kafka. Ease of implementation and management, and simplicity, were the major requirements in my head while designing this.

I hope this was useful. Thank you for reading.

Long live open source. Long live sharing.
