Zero Data Loss

Suchak Jani
11 min read · Jun 17, 2018


Data is at the heart of all Information Technology.

We will try to discuss “Zero Data Loss” as objectively as possible. I have used various sources, with links.

ACID

Databases store data. Early on, ACID (Atomicity, Consistency, Isolation, Durability) became a key factor in data storage and retrieval.

Much time and energy was spent building these large databases.

Oracle, DB2 and SQL Server were the de facto ways of storing data.

Deploying in a single site made sense, and many times we needed custom hardware, for example writing to custom block storage instead of a simple commodity file system.

Monolith

For Monolith applications, ACID databases made sense. We had a stateless application which talked to an ACID database on a single site. All was well.

Internet scale

And then came internet scale. We no longer were getting data from a single site in a single system; we now had users logging in from the whole world, and the systems had many sites to serve those users.

Microservices

Monolith worked great for a long time, and then the internet exploded. Something else was needed.

Microservices came in and helped with the “internet scale” complexity.

Companies like Netflix, Amazon, Facebook, WhatsApp etc. had to scale. And in many use cases, these companies did not need ACID-level databases. Eventual consistency would be all right in their use cases.

Financial services

With the success of companies like Amazon, we in Financial Services also wanted to avail of the benefits of Microservices.

Also, many of us in Financial Services now need to think at IoT (Internet of Things) scale.

Some use cases are all right for Microservices.

For example, a non-critical complex web service could be written using a Microservices architecture.

Thus we could have many teams working on many applications at the same time. An issue with one service would not affect other services. One thing we do need is good interface contracts between dependent services.

No data loss

Certain financial messages cannot be lost. For example, if I sell something to a client (say X) and X pays me electronically, the financial institution cannot turn around and tell me the payment was lost and that I must ask X to send it again. If the bank does this one too many times, I will lose confidence in the financial institution.

This is not a text message or a Netflix stream or a Facebook post.

And I am not talking about eventual consistency here. I mean no data loss.

Even eventual consistency would be funny in this case: “do not worry, the money will eventually arrive” does not inspire a lot of confidence in the financial institution.

Replication

Replication is a widely used way to ensure no data loss.

Kafka, for example, has a replication factor (replication-factor) of N, where data is copied N times. Cassandra has something called data replication.

Below is a logical diagram of what happens when we do data replication.

Data Store Replication
  1. The “Save Service” that needs to save a message sends the message to the “data store”, which could be Kafka or a NoSQL database or a NewSQL database. The important point to note here is that we have a replication factor of 3.
  2. The data store then saves the message on a node.
  3. It then replicates the message on another node.
  4. It then replicates the message on yet another node.
  5. It replies to the save service with a “success”.
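The five steps above can be sketched as a toy simulation. The `DataStore` class and its `save` method here are illustrative stand-ins, not any real client API:

```python
# Toy simulation of the replication flow above: the data store only
# acknowledges a write after the message is copied to every replica node.

class DataStore:
    def __init__(self, replication_factor=3):
        # each "node" is just an in-memory list standing in for a replica
        self.nodes = [[] for _ in range(replication_factor)]

    def save(self, message):
        # steps 2-4: write the message to every replica node
        for node in self.nodes:
            node.append(message)
        # step 5: only now do we report success to the save service
        return "success"

store = DataStore(replication_factor=3)
ack = store.save("payment-123")
print(ack)                                           # success
print(all("payment-123" in n for n in store.nodes))  # True
```

The point of the sketch is the ordering: the acknowledgement is the last step, after all replicas hold the message.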

No multiple sources of failure

Replication implies write to multiple nodes.

Simple logic dictates that we cannot be expected to cope with multiple simultaneous sources of failure.

In the above diagram, the data store replies with success only after writing to three nodes.

If this were AWS and we needed this kind of durability, we would have the nodes in multiple AZs (Availability Zones) in a region. All AZs going down at the same time is rare.

Hence replication may just work as a no-data-loss mechanism, yes?

Maybe?

Let’s dive a bit deeper and think about writes to disk.

Why is writing to disk important?

Replication saves data on multiple nodes. If the replicated nodes go down without saving data to disk, then replication is meaningless.

In the image above, if nodes 1, 2 and 3 all die without writing to disk, messages will be lost. Of course, this is rare.

To be clear, in the above example, note that we may not have flushed the records to disk at point 5.

Why would the data not have been written to disk?

The OS regularly batches (pages) data and writes the batch to disk. Depending on the batching of writes at the OS level and the “data store” parameters, we may still have data only in memory.
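To make the distinction concrete, here is a minimal Python sketch of a write that is only claimed durable after an explicit flush and `os.fsync`. The `durable_write` helper is hypothetical, for illustration only:

```python
import os
import tempfile

def durable_write(path, data):
    # a plain write() lands in the process buffer / OS page cache;
    # flush() + os.fsync() forces it onto the disk before we return
    with open(path, "w") as f:
        f.write(data)
        f.flush()              # process buffer -> OS page cache
        os.fsync(f.fileno())   # OS page cache -> disk

fd, path = tempfile.mkstemp()
os.close(fd)
durable_write(path, "payment-123")
with open(path) as f:
    print(f.read())            # payment-123
os.remove(path)
```

Without the `fsync`, the data may sit in the page cache for some time; a crash in that window loses an "acknowledged" write.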

Example: Kafka Write to Disk

Kafka writes all data to disk as soon as the message comes in.

However, at the OS level, the OS may decide to batch up writes, in which case we have not actually written to disk. Read here.

Now there is a way to force this using fsync, and the Kafka parameters for this are the “log.flush.*” broker parameters: http://kafka.apache.org/documentation.html#brokerconfigs

However, from Stack Overflow: “Kafka, in general, recommends you not to set these and use replication for durability and allow the operating system’s background flush capabilities as it is more efficient”, read here.

So forcing fsync is not efficient. The answer is to have more replication, say N=10 :-) and thus we will need more nodes and more hardware.
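For illustration, a broker configuration sketch using the real “log.flush.*” parameters; as the quote above says, Kafka’s documentation recommends leaving these at their defaults and relying on replication instead:

```properties
# Force an fsync after every message (durability over throughput).
log.flush.interval.messages=1
# Or: fsync at most this many milliseconds after a message arrives.
log.flush.interval.ms=1000
```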

What about NoSQL/NewSQL Write to Disk?

NoSQL/NewSQL disk writes are not much different from the Kafka example above.

What I mean is, even if the NoSQL/NewSQL database writes each record to disk, the OS may still batch the writes.

And for forced flushes, many NoSQL/NewSQL databases have parameters to force data to disk.

Single Site

If we are in a single site (AWS region) and we have replicated to various locations in that site (AWS Availability Zones), we should be ok?

Can we claim no data loss? It depends!

What is the probability of an earthquake or a nuke affecting all locations in the site?

If the above disaster happens, there will be data loss, even after the data store has said it has written to multiple locations, and even if the data store has written each record.

Why? Because the OS may not yet have flushed those writes to disk, as batching writes is more efficient.

Multi Site write (Actual)

Writing to more than one site (AWS region) could be a solution.

Multi Site Write

As seen above, we cover the probability of a single site going down, as we have more than one site where the data has been written. The write is not successful until we have written to both sites.

The data store does not say “success” on a write until it has written to two sites (AWS regions). In the example above, we write to Singapore and New York.
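A toy sketch of that acknowledgement rule, assuming hypothetical `write_to_site` and `save` helpers (not a real data-store API):

```python
# Each "site" is an in-memory list standing in for a remote region.
sites = {"singapore": [], "new-york": []}

def write_to_site(site, message):
    # stand-in for a cross-region write call
    sites[site].append(message)
    return True

def save(message):
    # only acknowledge once EVERY site has accepted the write
    if all(write_to_site(s, message) for s in sites):
        return "success"
    return "failure"

print(save("payment-123"))  # success
```

In a real system, each `write_to_site` call would be a network round trip, which is exactly where the issues below come in.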

The chance that both have an earthquake or a nuke attack at the same time should be rare.

Now, can we at last claim no data loss? It depends!

Look at live network lag here.

Issue 1:

Assume for every write we have to talk to both Singapore and N. Virginia. These sites are geographically very far apart, and hence any disaster, natural or otherwise, should be very far away from at least one of them.

In AWS, Singapore is “ap-southeast-1", and N. Virginia is “us-east-1”.

The network lag is approximately 228.93 ms. Again, look here.

The distance between Singapore and N. Virginia is around 15,687 km as per Google. The speed of light is 299,792,458 m/s as per Google.

Speed = Distance/Time-taken

Time-taken = Distance/Speed

Time-taken = ((15687 * 1000)/299792458) *1000 ms

≈ 52.326 ms

The above would be the best case: light in a vacuum, travelling in a straight line. But since we have undersea cables and lots of network hops, the best we actually get on earth is the roughly 228.93 ms measured by Amazon.
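The same back-of-envelope calculation, as code:

```python
# Ideal one-way lag, assuming light in a vacuum travelling in a straight line.
SPEED_OF_LIGHT_M_PER_S = 299_792_458

def ideal_lag_ms(distance_km):
    # time = distance / speed, converted to milliseconds
    return (distance_km * 1000) / SPEED_OF_LIGHT_M_PER_S * 1000

print(round(ideal_lag_ms(15687), 3))  # 52.326  (Singapore -> N. Virginia)
```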

We will pay a penalty in network lag, and it may impact our throughput.

Why? The speed of light is not changing anytime soon, and the study of quantum entanglement is in its infancy.

Also, when we have IoT scale to think about, we do not want to go around the world for every single write.

Issue 2:

With such long distances, we have to cope with network spikes. This means we will need a timeout-type circuit breaker.

In a timeout, we do not know if the message reached the other site or not.

Some people suggest doing a verify call to the other site after a timeout, but even that call could time out due to a network spike.

How many verifies are ok? At some point in time, we have to acknowledge the message.
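A sketch of that dilemma, with a deterministic stand-in for the flaky cross-site verify call (all names here are illustrative):

```python
def make_flaky(outcomes):
    # deterministic stand-in for a cross-site verify call: each entry in
    # outcomes is True (message reached) or "timeout" (network spike)
    it = iter(outcomes)
    def verify():
        result = next(it)
        if result == "timeout":
            raise TimeoutError("network spike")
        return result
    return verify

def verify_with_retries(verify, max_retries=3):
    for _ in range(max_retries):
        try:
            return verify()   # confirmed the message reached the other site
        except TimeoutError:
            continue          # the verify itself timed out; try again
    return None               # still unknown: at some point we must decide

# every verify times out: after 3 tries we still do not know
print(verify_with_retries(make_flaky(["timeout"] * 3)))    # None
# the second verify gets through
print(verify_with_retries(make_flaky(["timeout", True])))  # True
```

The `None` case is the point: retries bound the uncertainty, they do not remove it.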

Issue 3:

The same issue of the OS batching the actual disk writes applies here.

Even if the application thinks it has written to both sites, the OS may not have done so.

That may be just around the time a nuke or an earthquake hits a site.

Multi Site write (just for the sake of it)

Write to multiple sites which are actually very near each other and then claim we have multi-site write. A nuke or an earthquake will affect both sites equally, as they are near each other.

It has all the same issues as single site write.

Say we have two sites 40 km apart. Let’s calculate.

The distance between our two sites is around 40 km. The speed of light is 299,792,458 m/s as per Google.

Speed = Distance/Time-taken

Time-taken = Distance/Speed

Time-taken = ((40 * 1000)/299792458) *1000 ms

≈ 0.133 ms

The above would be the best case: light in a vacuum, travelling in a straight line.

But since here we have cables and network hops, the actual lag will be higher.

Let’s call it “x” and approximate it by comparison with the values we got before.

Singapore to N. Virginia: the ideal lag as per the speed of light was around 52.326 ms, and what Amazon actually gave us was 228.93 ms.

Here, for 40 km, the ideal lag as per the speed of light in a vacuum is 0.133 ms. What is x, the actual lag?

52.326 ~ 228.93

0.133 ~ x

x = (228.93 * 0.133) / 52.326 ≈ 0.582 ms
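The same proportional estimate, as code:

```python
# Scale the ideal (vacuum) lag by the ratio actually measured
# between Singapore and N. Virginia.
MEASURED_MS = 228.93   # observed lag, Singapore <-> N. Virginia
IDEAL_MS = 52.326      # speed-of-light lag for the same distance

def estimated_actual_lag_ms(ideal_ms):
    return ideal_ms * (MEASURED_MS / IDEAL_MS)

print(round(estimated_actual_lag_ms(0.133), 3))  # 0.582  (for ~40 km)
```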

So we should get around 0.582 ms, which is pretty good I guess; however, we are not safe from disasters (natural or man-made), as the distance is not much.

RPO/RTO (Multisite)

Recovery Point Objective (RPO) and Recovery Time Objective (RTO)

Read here for more details.

Many times it is claimed that a system has an RTO of zero and an RPO of zero.

In simple words, RPO of zero means “no data loss” and RTO of zero means “no time loss” in case of any major problem.

An RTO of (maybe not exactly) zero may be achieved by having two active sites (AWS regions).

There will be some way to load-balance traffic to both sites, and in case of a disaster, detected by heartbeat failure on one site, all traffic is sent to the other site. AWS load balancers with Route 53 are one simple way to achieve this.

The RTO time taken is the same as the time to detect heartbeat failure.

Hence the RTO time is around the heartbeat interval multiplied by the number of unanswered heartbeats that indicate a failure. Say we heartbeat every second and 3 unanswered heartbeats tell us the site is down; our RTO is approximately 1 second * 3 = 3 seconds.
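That estimate as a trivial helper (illustrative only):

```python
# Detection time is roughly the heartbeat interval times the number
# of missed beats that declare the site down.
def rto_seconds(heartbeat_interval_s, missed_beats_threshold):
    return heartbeat_interval_s * missed_beats_threshold

print(rto_seconds(1, 3))  # 3
```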

An RPO of zero means data is fully in sync between the two sites (AWS regions).

This is not possible as we have the network lag between the two sites.

Let’s say we code our application so that we write to both sites before we reply. This is the same as the Multi Site write (Actual) described above, and thus has the same issues.

Jepsen

Have we not read the Jepsen analyses? Maybe we should.

They have been testing databases for a long time now. Any claim made by a particular data store is tested.

Have a look here and a video here.

Shockingly, data stores lose data even after acknowledging a data save.

For example, look at the Jepsen reports of two of the most popular databases, Cassandra and Redis.

Quote from the Cassandra test and link below

No. Cassandra lightweight transactions are not even close to correct. Depending on throughput, they may drop anywhere from 1–5% of acknowledged writes–and this doesn’t even require a network partition to demonstrate. It’s just a broken implementation of Paxos. In addition to the deadlock bug, these Jepsen tests revealed #6012 (Cassandra may accept multiple proposals for a single Paxos round) and #6013 (unnecessarily high false negative probabilities).

Quote from the Redis test and link below

These results are catastrophic. In a partition which lasted for roughly 45% of the test, 45% of acknowledged writes were thrown away. To add insult to injury, Redis preserved all the failed writes in place of the successful ones.

No Data Loss

As we can see from the above examples, “system down time” and “no data loss” always need to be seen in context.

Anyone who claims they have achieved an RTO of zero and an RPO of zero must be asked to explain “the context” in which the claims are true.

Even big firms have melt-downs.

Look at this link and the comments there for WhatsApp:

http://downdetector.co.uk/problems/whatsapp

This site seems to track downtime for many apps:

http://downdetector.co.uk/

In a financial company this would not be so nice, as it would not instil the much-coveted “confidence”.

What do we engineers do ?

A bunch of things to consider:

  1. The very first thing we do is ask for, and clearly understand, the system requirements via various use cases. Each flow within the system may have different requirements for data loss.
  2. Explain to stakeholders: “designing for multiple sources of failure is technically possible, but comes at a cost”. Be open and scientifically objective, rather than subjective.
  3. There is a money cost every time we replicate. Within a site, the cost would be that of RAM, RAID SSD disks, redundant network cards and the like. In a multi-site environment, it would be the cost of the same things as in a single site, plus the cost of a reliable inter-site network.
  4. There is also a system cost every time we replicate, in terms of network lag and timeout/failure scenarios, which will affect the overall message processing time. Is the system able to take that kind of hit? In other words, would a slow but consistent system fulfil the use case?
  5. Explain to stakeholders: “this is not a new problem”, i.e. the issues we discussed here also apply to established “data stores” like DB2, Oracle and SQL Server.

Conclusion

These are just some thoughts, please feel free to add corrections, comments or suggestions.

Thanks for your time.


Written by Suchak Jani

Principal Architect at Mastercard
