Cassandra: Why and when to use

nand
Apr 19, 2018

Cassandra is a NoSQL database.

It is:
Open source
Distributed
Highly available
Fault tolerant
And, most important of all, eventually consistent

Because of eventual consistency, a write does not have to wait for every replica to be updated, so Cassandra provides high write throughput with low latency.

Distributed
Data is split and distributed across multiple machines.

Highly Available
The database stays available for reads and writes even if some of the nodes are down. This is achieved by adding redundancy: each piece of data is stored on more than one node.

No matter how sophisticated the hardware is, there is always a chance that it will fail.

Eventual Consistency (details)
Cassandra has configurable consistency: we can make it strongly consistent or eventually consistent. That is, we can configure how many replicas need to be contacted before the client is sent an acknowledgement.

Consistency levels include ONE, QUORUM, ALL and ANY (ANY applies to writes only).

N = the number of nodes that store replicas of the data

W = the number of replicas that need to acknowledge the receipt of the update before the update completes

R = the number of replicas that are contacted when a data object is accessed through a read operation

If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency.
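The W + R > N rule can be checked with a few lines of arithmetic. The following is a minimal sketch, not Cassandra code: the function names are made up for illustration, and it assumes QUORUM means a simple majority of the N replicas. ANY is left out because a write at that level can be acknowledged without any replica having stored it yet.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (assumed meaning of QUORUM)."""
    return n // 2 + 1


def replicas_for(level: str, n: int) -> int:
    """Number of replicas that must respond at a consistency level.

    Illustrative helper only; covers ONE, QUORUM and ALL.
    """
    table = {"ONE": 1, "QUORUM": quorum(n), "ALL": n}
    return table[level.upper()]


def is_strongly_consistent(w: int, r: int, n: int) -> bool:
    """True when the read set and the write set must overlap (W + R > N)."""
    return w + r > n


if __name__ == "__main__":
    n = 3  # number of replicas, i.e. the replication factor
    for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
        w = replicas_for(write_level, n)
        r = replicas_for(read_level, n)
        print(write_level, read_level, is_strongly_consistent(w, r, n))
```

With N = 3, writing and reading at ONE gives 1 + 1 = 2, which is not greater than 3, so a read may miss the latest write. Writing and reading at QUORUM gives 2 + 2 = 4 > 3, so every read touches at least one replica that has the latest write.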

Here’s a great article on eventual consistency by Amazon’s CTO.

The higher the consistency level, the slower the operation, because more replicas have to respond before it completes.

In terms of the CAP theorem, Cassandra is AP (available and partition tolerant), with eventual consistency.

By tuning W and R, we can control read and write latency.

Data Model

Cassandra is very different from SQL databases when it comes to data modelling.
It is a NoSQL database, so it has a flexible schema.

A keyspace is like a database; the replication factor is defined at this level.
A replication factor of 1 means that there is only one copy of each row, on a single node. A replication factor of 2 means two copies, where each copy is on a different node. All replicas are equally important; there is no primary or master replica.

The replica placement strategy is also defined at the keyspace level. It determines which nodes the replicas are placed on.

When creating a keyspace, you must define the replica placement strategy and the number of replicas you want.
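As a rough illustration of those two settings, the sketch below builds the CQL statement that creates a keyspace. The helper name and the keyspace name "demo" are made up for this example; it assumes SimpleStrategy, the single-datacenter placement strategy, and only produces the statement text without connecting to a cluster.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build a CREATE KEYSPACE statement using SimpleStrategy.

    Hypothetical helper for illustration; it returns the CQL text only.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


if __name__ == "__main__":
    # Placement strategy and replica count are both part of the definition.
    print(create_keyspace_cql("demo", 3))
```

The resulting statement can be run in cqlsh or through a driver. Multi-datacenter clusters would normally use NetworkTopologyStrategy instead, which this sketch does not cover.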

A column family is like a table (and is called a table in CQL). A keyspace can have multiple column families, just as a database can have many tables.

The primary key is a column or group of columns (a composite key) that uniquely identifies a row, such as user_id or a UUID.

A composite key is a primary key made up of multiple columns, such as (phone, email, city); together these three columns form a single key.

The partition key is used to distribute data across the cluster, i.e. to partition the data. For example, if you partition user data by city, all users belonging to the same city will reside on the same node.
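The city example can be sketched in a few lines. This is a simplified model, not how Cassandra is implemented: the node names and sample users are invented, MD5 stands in for Cassandra's own partitioner hash, and a modulo over the node list stands in for the token ranges that nodes actually own. The point it shows is that rows with the same partition key always map to the same node.

```python
import hashlib

# Hypothetical three-node cluster.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: str, nodes=NODES) -> str:
    """Map a partition key to a node by hashing it (simplified stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")  # use 8 bytes of the hash as a token
    return nodes[token % len(nodes)]


def place_users(users):
    """Group user ids by the node that their city (the partition key) maps to."""
    placement = {}
    for user_id, city in users:
        placement.setdefault(node_for(city), []).append(user_id)
    return placement


if __name__ == "__main__":
    # Made-up sample rows: (user_id, city)
    users = [("u1", "Pune"), ("u2", "Delhi"), ("u3", "Pune"), ("u4", "Delhi")]
    for node, user_ids in place_users(users).items():
        print(node, user_ids)
```

Because the node is chosen from the city alone, u1 and u3 always land together, as do u2 and u4. A partition key with few distinct values, such as city, can also concentrate a lot of data on a single node, which is worth keeping in mind when choosing one.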

These keys play a very important role when defining the data model.
