Distributed web service architectures are widely used these days. Horizontal scaling seems to be the answer for providing scalability and availability.
With ASGs (Auto Scaling Groups) provided by AWS, it is very easy to manage the cluster and scale it up or down. An ELB (Elastic Load Balancer) automatically balances the load among all the nodes in the cluster.
Now that all nodes are capable of doing the same piece of work, we need to ensure that multiple nodes do not end up doing the same work. This can happen when identical concurrent requests are distributed across nodes, or when a scheduled process (a cron job) running on every machine gets triggered at the same time.
This is where distributed locks are used.
As beautifully explained in this article http://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html , distributed locks are used primarily for efficiency (saving you from unnecessarily doing the same work twice) or correctness (ensuring the same data is not modified at the same time). In short, they help you avoid race conditions.
Hence comes the mighty Redis, an in-memory database best used for caching and as a broker in messaging queues. Being an in-memory data store, its read and write speeds are very high, which makes it well suited to fast-changing data.
Since Redis is single-threaded, every operation is atomic.
The INCR operation in Redis is a good fit for acquiring a lock on a key. INCR is an atomic operation that returns the incremented value, which saves us from doing a separate check-then-set ("lock if not already locked") and thus ensures that different processes cannot take the lock on the same key at the same time. The lock code looks something like:
if redis.INCR("lock_key") > 1: return "Lock already acquired"
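Fleshed out a little, a minimal sketch of this pattern using the redis-py client might look like the following. It assumes a Redis server on localhost, and work() is just a placeholder for whatever task must not run twice:

import redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis instance

def do_exclusive_work():
    # INCR is atomic: only the caller that sees the counter become 1 owns the lock.
    if r.incr("lock_key") > 1:
        return "Lock already acquired"
    try:
        work()  # placeholder for the task that must not run concurrently
    finally:
        r.delete("lock_key")  # release the lock so other nodes can proceed

Deleting the key in a finally block resets the counter once the work completes, so the next caller can acquire the lock again.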
That's easy and solves the distributed lock problem. Awesome, huh?
Not yet.
Redis has an inherent problem with replication and persistence. Redis replication is asynchronous, so if the master fails before the write is propagated to the slave, the slave gets promoted to master with no record of the lock. Inconsistency.
Managing a Redis cluster yourself is a tricky job. Hence comes ElastiCache.
ElastiCache handles failed nodes, cluster management, and replication on its own, providing high availability, and it is easily scalable.
Thanks for reading. Post a comment if you have further suggestions.