Incident Summary

All Applications in production were impaired due to a race condition bug in Sparta leading to a cluster-wide outage. All transactions on the Zeta Platform were failing.

Incident Started : 12:43 PM 9th Oct 2020 IST
Incident Detected : 12:48 PM 9th Oct 2020 IST
Incident Resolved : 11:45 PM 9th Oct 2020 IST
Impact duration : ~5 Hours 42 Minutes

SEV-1 between 12:43 PM to 02:45 PM;
SEV-0 between 07:45 PM to 11:25 PM;

Resolution

During the Sev1 Incident in the Afternoon We,
- Performed a rolling restart of Sparta, Helios, Proteus followed by forced restarts of all the Application which had null allocation and Dropped messages. This enabled us to get our Payment stack fully functional.
- We identified a problem with Sparta and started rolling the deployment of the patch. [We had to decide between rollback of Sparta or to patch; We chose to patch]

During the Sev0 Incident,
- We blocked all traffic and declared a SEV-0. We regained visibility of the issue and started the rollout of another patch for Sparta as a short term fix

Root Cause

A bug in Sparta service;

A race condition bug in Allocation.java and excessive logging in AllocationWorker.java in Sparta.

Trigger: At Oct 9, 2020, @ 12:43:08.850 two services got disconnected from one instance of Sparta and reconnected to another instance. This is expected to be a benign operation. But it led to cluster destabilization.

Analysis: A detailed analysis of the events is available here.

Detailed Root Cause Analysis

TLDR; of RCA

We discovered a race condition in a critical service that led to a frozen production cluster.
Our logging and monitor infrastructure got under severe stress due to full cluster outage.
We had to wait until we could recover the logging infrastructure.
We had to apply two patches to the critical service to fully resolve the issue and thus had to restart the full cluster twice.
This is an extremely rare incident and is likely never going to reoccur.
We are moving to a new production environment architecture by the end of October 2020. This single point of failures would be eliminated in this new architecture.

The Detailed Story

In a microservices architecture, it is common to have a service that provides service discovery and traffic shaping services. In our world, we call that service Sparta. Sparta not only provides information about various instances of services running in the cluster but also provides a mechanism to distribute load across these services. We use a distributed hash map with consistent hashing using the resource identifiers as keys to determine the instance to service a specific request. This is akin to how redis and memcached distribute the keys across their cluster. The approach requires a lightweight client component in each of the services that need to make requests to other services. We call this client Spartan. Spartan relies on the topology information it receives from Sparta to determine the destination for each request. For a given topology, the service instance that is expected to handle a request is unique and deterministic. Sparta orchestrates the load allocation of all instances as the topology changes due to the introduction or removal of instances into the cluster. Our applications rely on the load allocation characteristics to simplify data caching and to implement reactive programming patterns with minimal overheads. Given the reliance our applications place on this unique load distribution choice, our clusters are not tolerant to split-brain scenarios. The information exchange protocol between Sparta and Spartan has a strong defense against split-brain problems. When Spartan cannot confidently assert that its state is consistent with that of Sparta, it evicts itself out of the ring and tries to rejoin the ring from an empty state. This behavior makes the availability and consistency of Sparta very critical for the operation of the cluster. This is similar to how Zookeeper or etcd is put to use in the more well-known open-source world. A failure of such services could have catastrophic consequences on the cluster. Such services could often turn out to be a single point of failure. They are designed and operated with extreme caution. We have been operating the Sparta cluster with relatively insignificant incidents for over 11 years¹ now.

We discovered a race condition in Sparta that led to a frozen production cluster.

Over the last couple of years at Zeta, we have been running 200+ microservices within a single Sparta cluster. We have added these services organically and the growth was accommodated without any hiccups. However, if we were to bring up the same sized cluster from a clean state, we have to resort to a slow start and ramp-up procedures. Generally, the introduction of a new service is a noisy process with a lot of information exchange across services that form the cluster and services that use the cluster. The noise multiplies with each additional concurrent instance being started². Therefore, it is not practicable to bring up the entire 200 instances cluster in simultaneously. Given how rare it is to ever bring the entire cluster, we did not automate this. Bringing up a cluster requires our engineers to orchestrate the process³. Having to stagger the deployment of these services spreads the deployment over a period of 40-60 minutes.

We had to restart the full cluster⁴.

Sometimes when things are to go wrong, anything can go wrong! The logging and metrics pipeline is critical to understand the status of the cluster. We use AWS Kinesis to gather logs and metrics from all the services and provide a single consolidated view to our production engineers. When some of the critical services went silent, the number of log lines due to errors and warnings thrashed the Kinesis pipeline. Kinesis buffered enough log lines that it would take 45 min to read them⁵. We would be quite handicapped to understand what’s happening in our cluster without these. Adding capacity wouldn’t change much as Kinesis can’t provide more throughput for data already ingested into the existing shards. While we added additional shards for future logs and metric data, we had no way but to wait for about 40 minutes to understand the consolidated status of the cluster.

The observability of our cluster is severely impaired.

Running a payment platform taught us the criticality of incident management quite early on. We have a sophisticated setup to help us deal with incidents. Our video wall with a dozen 24” monitors would show several facets of the cluster in real-time, if we were on our production operations floor. Thanks to COVID-19 now we are all working from home. Almost all of the team involved in managing the incident has just their laptop monitor to go by. For a SEV-0 it’s too little to work with.

We are under severe constraints.

At the first instance of the issue observed at 12:48 PM on the 9th Oct 2020 IST, we recovered the cluster through a predefined process. However, after the cluster is fully recovered, we noticed that the initial cluster freeze was triggered as the spartan clients saw that the Sparta server sent a load allocation version less than what the client was expecting. Not dealing with this appropriately could lead to a similar situation in the future. We didn’t have appropriate loglines at the server to identify why it could have happened. We had to choose between a rollback or a patch. We couldn’t rely on rollback as the Sparta service is rarely changed and the last change was a few months ago when the cluster size was smaller. The running version had changes required for the larger cluster.

I was personally acting as the incident commander. Based on what I noticed, I wasn’t sure if the bug didn’t exist in the older version. As we identified one issue that could cause this situation, I decided to patch. The patch was meant to be rolled out slowly on a stable cluster. We didn’t anticipate any issue but to identify and recover if the server goes wrong. However, the recovery attempt had a tight loop to recover from a situation where the server had a lower allocation version than what the client expected. After about 30 minutes of safe rollout across all instances in the Sparta cluster, the services broke. The tight recovery loop aggravated the problem as it logged heavily. The clients started thrashing the Sparta servers, increasing the total sessions at Sparta by 4x. Each client was logging excessively about them not being able to establish a session with Sparta. This led to a complete cluster collapse.

I misjudged the consequence of a patch.

All of the above led to a near-disaster situation. Of course, nothing here impacted data or any of our data infrastructure components. However, our applications were not able to process any requests.

As new incoming requests were further aggravating the pressure on the logging systems, we had to stop all traffic.

As a team worked on recovering the logging and monitoring infrastructure, another team worked on the issue with the patch. We could get to the root cause and were able to deliver another patch.

Incident Timeline

Date: 9th October 2020

Time in IST	Events
12:48 PM	Health check alerts went off for over two dozen services.
12:55 PM	Our first response operations based on standard runbooks were initiated.
01:19 PM	It is observed that the issue is further spreading across the cluster. The incident is upgraded to SEV-1. Initiated the full cluster recovery procedure.
01:30 PM	Resumed payment processing with a partial capacity
02:45 PM	Payments are fully functional; (No dropped messages since 02:33 PM;)
05:41 PM	We have identified a problem with Sparta and started rolling the deployment of the patch. [We had to decide between rollback of Sparta or to patch; We chose to patch]
07:10 PM	We completed the patch rollout without any observed issues.
07:45 PM	The cluster went bad again and the choked the logs and metrics pipelines; We lost visibility into the cluster.
08:35 PM	We blocked all traffic and declared a SEV-0
10:10 PM	We regained visibility and started the rollout of another patch
10:30 PM	The rollout of a patch to Sparta is completed
11:15 PM	We are back to capacity
11:25 PM	Resumed processing transactions
11:45 PM	Recovered to full capacity

Verification and Observations after the fix

From 12:00 AM to 3:30 AM on 10 Oct 2020, we performed various maneuvers at the backend to verify that the current deployment is stable. Our current setup is stable.

We declared that everything is back to normal by 3:30 AM on 10 Oct 2020.

Short Term Solution

Fix the bug in Sparta.

We have applied the required patches in production
We will merge to master and clean-up excessive login and land this to production this week.

Long Term Solution

Sparta should not be a single point of failure for the entire cluster.

We have initiated a project called Olympus 2.0 that is meant to address this requirement. The project is a major architecture upgrade of our production environment. We have been working on it for close to a year now and we are scheduled to migrate all services from the old architecture to Olympus 2.0 architecture by the end of Oct 2020. Continuing with this plan will ensure that we will not have this service impacting the full cluster.

Incidents of this nature are never acceptable. I understand the dire consequences it may have had for our customers. I sincerely apologize for the same.

We have been working on various projects that would substantially alter the resilience and reliability characteristics of our services. Several of the enhancements are in practice for our newer products. We will ensure we migrate all our relatively old services also to this new model very soon.

Sincerely,
Ramki Gaddipati
CTO & Co-founder, Zeta

^{1^{Sparta and Spartan were inherited from Flock’s stack. The microservices orchestration was developed in 2009 when the now popular orchestration models are not yet widely discussed in the public domain.}}

^{2^{Our consistent hash ring contains 1000 nodes per instance. We call each such load defining node a bucket. The topology information is propagated with buckets as the constituent units. In a full-mesh of 200 services, (200*1000) * (200*1000) = 40 billion points of information is exchanged. While there are multiple optimizations in place, a full cluster start with a full mesh of services is quite a busy, noisy operation.}}

^{3^{About a year ago we decided to move to a substantially enhanced deployment and production environment architecture. Among many enhancements, the cluster operations are fully automated and driven through GitOps practices.}}

^{4^{Some of our newly launched products are out of this cluster. Ideally, they shouldn’t have been impacted. However, several of our customers rely on SMS-based OTPs or server-push-based Swipe2Pay as the second-factor. The delivery of these notifications relied on services running in our old cluster. So these customers were also impacted.}}

^{5^{We reached the maximum read throughput of allocated shards. Kinesis throttles all further read volume even if we were to increase the reader’s capacity.}}

Engineering @Zeta