INC-3418-SEV0-Race condition in Sparta led to a frozen production cluster

Posted on October 12, 2020 by Ramki Gaddipati

Incident Summary

All Applications in production were impaired due to a race condition bug in Sparta leading to a cluster-wide outage. All transactions on the Zeta Platform were failing.

Incident Started : 12:43 PM 9th Oct 2020 IST
Incident Detected : 12:48 PM 9th Oct 2020 IST
Incident Resolved : 11:45 PM 9th Oct 2020 IST
Impact duration : ~5 Hours 42 Minutes

SEV-1 from 12:43 PM to 02:45 PM;
SEV-0 from 07:45 PM to 11:25 PM

Resolution

During the SEV-1 incident in the afternoon, we:
- Performed a rolling restart of Sparta, Helios, and Proteus, followed by forced restarts of all the applications that had null allocations and dropped messages. This got our payment stack fully functional again.
- Identified a problem with Sparta and started a rolling deployment of the patch. [We had to decide between rolling back Sparta and patching it; we chose to patch.]

During the SEV-0 incident:
- We blocked all traffic and declared a SEV-0. We regained visibility into the issue and started the rollout of another patch for Sparta as a short-term fix.

Root Cause

A bug in Sparta service;

A race condition bug in Allocation.java and excessive logging in AllocationWorker.java in Sparta.

Trigger: On Oct 9, 2020, at 12:43:08.850, two services got disconnected from one instance of Sparta and reconnected to another instance. This is expected to be a benign operation, but it led to cluster destabilization.

Analysis: A detailed analysis of the events is available here.

Detailed Root Cause Analysis

TL;DR of RCA

We discovered a race condition in a critical service that led to a frozen production cluster.
Our logging and monitoring infrastructure came under severe stress due to the full cluster outage.
We had to wait until we could recover the logging infrastructure.
We had to apply two patches to the critical service to fully resolve the issue and thus had to restart the full cluster twice.
This is an extremely rare incident and is likely never going to reoccur.
We are moving to a new production environment architecture by the end of October 2020. This single point of failure will be eliminated in the new architecture.

The Detailed Story

In a microservices architecture, it is common to have a service that provides service discovery and traffic shaping services. In our world, we call that service Sparta. Sparta not only provides information about various instances of services running in the cluster but also provides a mechanism to distribute load across these services. We use a distributed hash map with consistent hashing using the resource identifiers as keys to determine the instance to service a specific request. This is akin to how redis and memcached distribute the keys across their cluster. The approach requires a lightweight client component in each of the services that need to make requests to other services. We call this client Spartan. Spartan relies on the topology information it receives from Sparta to determine the destination for each request. For a given topology, the service instance that is expected to handle a request is unique and deterministic. Sparta orchestrates the load allocation of all instances as the topology changes due to the introduction or removal of instances into the cluster. Our applications rely on the load allocation characteristics to simplify data caching and to implement reactive programming patterns with minimal overheads. Given the reliance our applications place on this unique load distribution choice, our clusters are not tolerant to split-brain scenarios. The information exchange protocol between Sparta and Spartan has a strong defense against split-brain problems. When Spartan cannot confidently assert that its state is consistent with that of Sparta, it evicts itself out of the ring and tries to rejoin the ring from an empty state. This behavior makes the availability and consistency of Sparta very critical for the operation of the cluster. This is similar to how Zookeeper or etcd is put to use in the more well-known open-source world. A failure of such services could have catastrophic consequences on the cluster. 
Such services could often turn out to be a single point of failure. They are designed and operated with extreme caution. We have been operating the Sparta cluster with relatively insignificant incidents for over 11 years¹ now.
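To make the load allocation idea concrete, here is a minimal sketch of a consistent hash ring in Python. It is not Sparta's implementation: the class name `HashRing`, the use of SHA-256 to place buckets, and the instance names are all assumptions made for illustration. It only shows the two properties described above: for a given topology the owner of a resource identifier is deterministic, and adding an instance moves only the keys that the new instance takes over.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a stable ring position (first 8 bytes of SHA-256; an assumption)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring; each instance owns several buckets (virtual nodes)."""

    def __init__(self, instances, buckets_per_instance=1000):
        # Place every bucket of every instance on the ring, sorted by position.
        self._ring = sorted(
            (_point(f"{inst}#{b}"), inst)
            for inst in instances
            for b in range(buckets_per_instance)
        )
        self._points = [p for p, _ in self._ring]

    def owner(self, resource_id: str) -> str:
        """Return the instance owning the first bucket clockwise from the key."""
        i = bisect.bisect_right(self._points, _point(resource_id)) % len(self._ring)
        return self._ring[i][1]


# Hypothetical instance names and resource identifiers.
ring = HashRing(["svc-a-1", "svc-a-2", "svc-a-3"])
bigger = HashRing(["svc-a-1", "svc-a-2", "svc-a-3", "svc-a-4"])
keys = [f"account-{n}" for n in range(2000)]
moved = [k for k in keys if ring.owner(k) != bigger.owner(k)]
print(f"{len(moved)} of {len(keys)} keys changed owner after adding one instance")
```

Every client that holds the same topology computes the same owner, which is why a client with a stale or inconsistent topology is dangerous: it would route requests to an instance that no longer believes it owns them.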

We discovered a race condition in Sparta that led to a frozen production cluster.

Over the last couple of years at Zeta, we have been running 200+ microservices within a single Sparta cluster. We have added these services organically and the growth was accommodated without any hiccups. However, if we were to bring up a cluster of the same size from a clean state, we would have to resort to slow-start and ramp-up procedures. Generally, the introduction of a new service is a noisy process with a lot of information exchange across services that form the cluster and services that use the cluster. The noise multiplies with each additional concurrent instance being started². Therefore, it is not practicable to bring up the entire 200-instance cluster simultaneously. Given how rare it is to ever bring up the entire cluster, we did not automate this. Bringing up a cluster requires our engineers to orchestrate the process³. Having to stagger the deployment of these services spreads the deployment over a period of 40-60 minutes.

We had to restart the full cluster⁴.

Sometimes when things are to go wrong, anything can go wrong! The logging and metrics pipeline is critical to understanding the status of the cluster. We use AWS Kinesis to gather logs and metrics from all the services and provide a single consolidated view to our production engineers. When some of the critical services went silent, the number of log lines due to errors and warnings thrashed the Kinesis pipeline. Kinesis buffered enough log lines that it would take 45 minutes to read them⁵. Without these logs, we were severely handicapped in understanding what was happening in our cluster. Adding capacity wouldn’t change much, as Kinesis can’t provide more throughput for data already ingested into the existing shards. While we added additional shards for future logs and metric data, we had no option but to wait for about 40 minutes to understand the consolidated status of the cluster.
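A rough back-of-the-envelope sketch of why adding capacity did not help. The backlog size and shard count below are made-up numbers, not the real figures from the incident; the per-shard read rate of 2 MB per second is the documented Kinesis limit for standard (shared-throughput) consumers. The function name `drain_minutes` is hypothetical.

```python
# Documented per-shard read limit for shared-throughput consumers (about 2 MB/s).
SHARD_READ_BYTES_PER_SEC = 2_000_000


def drain_minutes(backlog_bytes, shards, per_shard_rate=SHARD_READ_BYTES_PER_SEC):
    """Minutes needed to read a backlog that already sits in `shards` shards.

    The number of readers does not appear here: once the per-shard read limit
    is reached, extra readers are throttled rather than served faster.
    """
    return backlog_bytes / (shards * per_shard_rate) / 60


# Illustrative numbers only: 21.6 GB buffered across 4 existing shards.
print(drain_minutes(21_600_000_000, shards=4))
```

Shards added after the fact only receive new records, so they do not shorten the time needed to read what is already buffered in the old shards.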

The observability of our cluster is severely impaired.

Running a payment platform taught us the criticality of incident management quite early on. We have a sophisticated setup to help us deal with incidents. Our video wall with a dozen 24” monitors would show several facets of the cluster in real-time, if we were on our production operations floor. Thanks to COVID-19 now we are all working from home. Almost all of the team involved in managing the incident has just their laptop monitor to go by. For a SEV-0 it’s too little to work with.

We are under severe constraints.

At the first instance of the issue, observed at 12:48 PM on the 9th Oct 2020 IST, we recovered the cluster through a predefined process. However, after the cluster was fully recovered, we noticed that the initial cluster freeze was triggered when the Spartan clients saw that the Sparta server had sent a load allocation version lower than what the clients were expecting. Not dealing with this appropriately could lead to a similar situation in the future. We didn’t have appropriate log lines at the server to identify why it could have happened. We had to choose between a rollback and a patch. We couldn’t rely on a rollback, as the Sparta service is rarely changed and the last change was a few months ago when the cluster size was smaller. The running version had changes required for the larger cluster.
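The client-side behavior that turned this into a freeze can be sketched as follows. This is an illustration of the rule described in this post (a client that cannot trust its state evicts itself and rejoins from an empty state), not Spartan's actual code; the class `SpartanState` and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpartanState:
    """Hypothetical client-side view of the load allocation."""

    allocation_version: int = 0
    topology: dict = field(default_factory=dict)
    in_ring: bool = True

    def apply_update(self, version: int, topology: dict) -> str:
        """Apply a server update, or evict if the server is behind what we have seen."""
        if version < self.allocation_version:
            # The server claims an older allocation than the client already holds,
            # so the client can no longer assert that its state is consistent.
            self.evict()
            return "evicted"
        self.allocation_version = version
        self.topology = dict(topology)
        return "applied"

    def evict(self) -> None:
        """Leave the ring and drop all state; a rejoin starts from empty."""
        self.in_ring = False
        self.allocation_version = 0
        self.topology = {}


client = SpartanState()
print(client.apply_update(7, {"bucket-1": "svc-a-1"}))  # normal update
print(client.apply_update(6, {"bucket-1": "svc-a-2"}))  # server went backwards
```

When many clients hit the second case at once, they all leave the ring together, which is what a frozen cluster looks like from the outside.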

I was personally acting as the incident commander. Based on what I noticed, I wasn’t sure that the bug did not also exist in the older version. As we had identified one issue that could cause this situation, I decided to patch. The patch was meant to be rolled out slowly on a stable cluster. We didn’t anticipate any issue; the patch was meant to identify the problem and recover if the server went wrong. However, the recovery attempt had a tight loop to recover from a situation where the server had a lower allocation version than what the client expected. After about 30 minutes of safe rollout across all instances in the Sparta cluster, the services broke. The tight recovery loop aggravated the problem as it logged heavily. The clients started thrashing the Sparta servers, increasing the total sessions at Sparta by 4x. Each client was logging excessively about not being able to establish a session with Sparta. This led to a complete cluster collapse.
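For contrast with a tight recovery loop, here is a minimal sketch of a reconnect loop with capped exponential backoff and jitter. This is a common general technique, shown only as an illustration; it is not what either Sparta patch did, and the function names and parameters are assumptions.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=60.0, rng=random):
    """Yield jittered delays: uniform in [0, min(cap, base * 2**n)] for attempt n."""
    for n in range(attempts):
        yield rng.uniform(0.0, min(cap, base * 2 ** n))


def reconnect(connect, attempts=8, sleep=time.sleep, rng=random):
    """Call connect() until it succeeds, sleeping a jittered delay after each failure."""
    for delay in backoff_delays(attempts, rng=rng):
        try:
            return connect()
        except ConnectionError:
            # One failure, one wait. A tight loop would retry (and log) immediately.
            sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts")
```

The jitter spreads retries from many clients over time, so a server that is already struggling is not hit by all of them at the same instant, and each failed attempt produces one log line per wait instead of a continuous stream.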

I misjudged the consequence of a patch.

All of the above led to a near-disaster situation. Of course, nothing here impacted data or any of our data infrastructure components. However, our applications were not able to process any requests.

As new incoming requests were further aggravating the pressure on the logging systems, we had to stop all traffic.

As a team worked on recovering the logging and monitoring infrastructure, another team worked on the issue with the patch. We could get to the root cause and were able to deliver another patch.

Incident Timeline

Date: 9th October 2020

Time in IST	Events
12:48 PM	Health check alerts went off for over two dozen services.
12:55 PM	Our first response operations based on standard runbooks were initiated.
01:19 PM	It is observed that the issue is further spreading across the cluster. The incident is upgraded to SEV-1. Initiated the full cluster recovery procedure.
01:30 PM	Resumed payment processing with a partial capacity
02:45 PM	Payments are fully functional; (No dropped messages since 02:33 PM;)
05:41 PM	We have identified a problem with Sparta and started rolling the deployment of the patch. [We had to decide between rollback of Sparta or to patch; We chose to patch]
07:10 PM	We completed the patch rollout without any observed issues.
07:45 PM	The cluster went bad again and choked the logs and metrics pipelines; we lost visibility into the cluster.
08:35 PM	We blocked all traffic and declared a SEV-0
10:10 PM	We regained visibility and started the rollout of another patch
10:30 PM	The rollout of a patch to Sparta is completed
11:15 PM	We are back to capacity
11:25 PM	Resumed processing transactions
11:45 PM	Recovered to full capacity

Verification and Observations after the fix

From 12:00 AM to 3:30 AM on 10 Oct 2020, we performed various maneuvers at the backend to verify that the current deployment is stable. Our current setup is stable.

We declared that everything is back to normal by 3:30 AM on 10 Oct 2020.

Short Term Solution

Fix the bug in Sparta.

We have applied the required patches in production
We will merge to master, clean up the excessive logging, and land this in production this week.

Long Term Solution

Sparta should not be a single point of failure for the entire cluster.

We have initiated a project called Olympus 2.0 that is meant to address this requirement. The project is a major architecture upgrade of our production environment. We have been working on it for close to a year now and we are scheduled to migrate all services from the old architecture to Olympus 2.0 architecture by the end of Oct 2020. Continuing with this plan will ensure that we will not have this service impacting the full cluster.

Incidents of this nature are never acceptable. I understand the dire consequences it may have had for our customers. I sincerely apologize for the same.

We have been working on various projects that would substantially alter the resilience and reliability characteristics of our services. Several of the enhancements are in practice for our newer products. We will ensure we migrate all our relatively old services also to this new model very soon.

Sincerely,
Ramki Gaddipati
CTO & Co-founder, Zeta

¹ Sparta and Spartan were inherited from Flock’s stack. The microservices orchestration was developed in 2009, when the now-popular orchestration models were not yet widely discussed in the public domain.

² Our consistent hash ring contains 1000 nodes per instance. We call each such load-defining node a bucket. The topology information is propagated with buckets as the constituent units. In a full mesh of 200 services, (200*1000) * (200*1000) = 40 billion points of information are exchanged. While there are multiple optimizations in place, a full cluster start with a full mesh of services is quite a busy, noisy operation.
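The arithmetic in this footnote, spelled out as a quick check (the variable names are ours):

```python
instances = 200              # services in the full mesh, from the footnote
buckets_per_instance = 1000  # nodes per instance on the hash ring

buckets = instances * buckets_per_instance  # 200,000 buckets in total
points_exchanged = buckets * buckets        # the footnote's product
print(f"{points_exchanged:,}")              # 40,000,000,000
```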

³ About a year ago we decided to move to a substantially enhanced deployment and production environment architecture. Among many enhancements, the cluster operations are fully automated and driven through GitOps practices.

⁴ Some of our newly launched products are out of this cluster. Ideally, they shouldn’t have been impacted. However, several of our customers rely on SMS-based OTPs or server-push-based Swipe2Pay as the second factor. The delivery of these notifications relied on services running in our old cluster. So these customers were also impacted.

⁵ We reached the maximum read throughput of allocated shards. Kinesis throttles all further read volume even if we were to increase the reader’s capacity.

Failure of MasterCard transactions

Posted on September 30, 2017 by Ramki Gaddipati

My sincere apologies to all users who are affected by the current outage on Mastercard transactions.

Zeta is connected to the Mastercard network through a prominent card processing service provider, which unfortunately is facing a serious outage that has lasted for over 28 hours now. The service provider is also unable to provide a disaster recovery path due to reasons unknown to us. This is absolutely unanticipated and unacceptable. However, at this moment, we can do little as there is no other practical alternative to reach the Mastercard network.

We are closely monitoring the development at our service provider’s end and will keep you posted on the progress.

The issue is really painful to all of us at Zeta. There could not have been a worse time for this, especially given the festive season. We are deeply sorry for the inconvenience it is causing you. We always strive to make your Zeta experience outstanding; please do not allow this to disappoint you.

We are grateful for your patience and support.
Wish you a joyful Dussehra.

Best wishes,
Ramki
CTO, Co-Founder

Passwords as Second Factor: To mitigate risks of password data compromise

Posted on May 20, 2017 by Ramki Gaddipati

Password-based authentication is known to have many weaknesses. The weaknesses are chiefly attributable to common human behavior such as reusing the same password across services, using passwords that can be easily remembered, etc. There is a lot of advice in information security books on how to safeguard password data so that even if a service is compromised the attacker will not be able to retrieve passwords of its users and then use them elsewhere. Some of the good services follow the textbook advice of using key derivation functions and enforcing password complexity. However, it has been known for some time now that password complexity and key derivation functions are not sufficient defenses. Large databases of passwords used by users across services are easily available to attackers. This really simplifies the password matching. It is also observed that about 13,000-15,000 patterns represent close to 100% of the passwords used by users. In one of the Fortune 100 companies, 47% of the passwords used matched the top 5 patterns. So, system architects cannot assume that:

Users will choose strong passwords
They are immune to password compromises on other sites
If their data is compromised, attackers will not be able to derive user passwords

Therefore, many services started adopting second-factor authentication. This additional factor acts as a defense when a user’s password is compromised. However, the attacker will be able to confirm the password of the user before the second factor kicks in. Also, the second factor does not eliminate or reduce the risks due to compromised data of the service.

At Zeta, we handle very sensitive information for our users. It is crucial to protect access to this information. Given the limitations of passwords discussed above, we realized that we have to go beyond the traditional practices to protect our user data.

It is common to use a one-time token (TOTP, HOTP, SMS/EMAIL OTP) as a second factor after the user is authenticated using a password. We believe OTPs are not widely used as the first factor because:

of the low level of randomness they offer
people historically used passwords as the first factor and continued with it
it introduces a dependency on additional devices and alternative channels of communication
the potential for abuse of Email/SMS communication by attackers
they are not suitable for use as the only factor of authentication

The advantage of short-lived tokens delivered to phone or Email is that the service will be able to enforce a certain minimum level of randomness. Such tokens are also not vulnerable to compromise of user’s data at other services. Given the ubiquity of SMS-capable mobile phones and accessibility of email on the web, we felt we could rely on SMS and email delivery. Therefore, we devised a mechanism using short-lived OTPs as first factor and relatively static passwords as the second factor.

In our approach, a user is partially authenticated with an OTP. After OTP authentication, we provide a cryptographically secure random salt to the user agent to perform client-side key derivation with the salt and the password provided by the user and use the derived hash as the user’s password on the server. This avoids the inherent weaknesses of the passwords used by users.

[The following content assumes prior knowledge of SHA-256, HMAC, scrypt and ECDSA]

Our Scheme

1. When a user enters her identity (usually an email address or a phone number) to log in, the user agent generates a public and private key pair using a secure random generator on the device and communicates the identity and public key to the server, requesting an authentication session.
2. After successful initiation of the authentication session, the user is prompted for an OTP delivered by Email/SMS or a TOTP generated using a software/hardware token.
3. When the user provides the OTP, the user agent makes a request to validate this first factor provided by the user. This request for validation is digitally signed using the private key generated in step 1.
4. On successful verification of the OTP, the first-factor validator sends a signed request to a second-factor validator. This request includes the public key shared by the user agent.
5. The second-factor authenticator passes a user-specific 128-bit salt (UserSpecificSalt) to the user agent. The user agent is expected to prompt the user for the password. After accepting the password, the user agent generates an HMAC-SHA256 of the password using UserSpecificSalt as the key. A validation request is made to the second-factor validator with the computed HMAC as the user’s password. [Given that UserSpecificSalt is cryptographically secure, the derived HMAC is also equally random, irrespective of the password used by the user]
6. The second-factor validator considers the HMAC provided by the client as the user’s secret and performs scrypt using another user-specific 256-bit key as salt (ScryptSalt). (The derived hash from scrypt is stored, and authentication is performed against this hash for all requests.)
7. Once authenticated, the second-factor validator issues a digitally signed certificate for the public key presented by the user agent in step 1, associating the public key with the user. The authentication session initiated in step 1 ends. The resultant certificate can be used to establish a data and transaction session.
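The password-handling part of the scheme (the HMAC on the user agent and the scrypt on the second-factor validator) can be sketched with Python's standard library. This is a simplified illustration, not our production code: the function names are hypothetical, the scrypt cost parameters `n`, `r`, and `p` are assumptions that this post does not specify, and the key pair, signatures, and certificate issuance are left out.

```python
import hashlib
import hmac
import secrets


def client_derive(password: str, user_specific_salt: bytes) -> bytes:
    """User agent side: HMAC-SHA256 of the password, keyed with UserSpecificSalt."""
    return hmac.new(user_specific_salt, password.encode("utf-8"), hashlib.sha256).digest()


def server_hash(client_secret: bytes, scrypt_salt: bytes) -> bytes:
    """Validator side: scrypt over the HMAC received from the user agent."""
    # n, r, p are illustrative cost parameters, not values from the post.
    return hashlib.scrypt(client_secret, salt=scrypt_salt, n=2**14, r=8, p=1, dklen=32)


def verify(client_secret: bytes, scrypt_salt: bytes, stored: bytes) -> bool:
    """Constant-time comparison against the stored scrypt output."""
    return hmac.compare_digest(server_hash(client_secret, scrypt_salt), stored)


# Enrolment: both salts come from a cryptographically secure source.
user_specific_salt = secrets.token_bytes(16)  # 128-bit UserSpecificSalt
scrypt_salt = secrets.token_bytes(32)         # 256-bit ScryptSalt
stored = server_hash(client_derive("correct horse", user_specific_salt), scrypt_salt)

# Login: the server only ever receives the 32-byte HMAC, never the password.
print(verify(client_derive("correct horse", user_specific_salt), scrypt_salt, stored))
```

Because the HMAC key is a random per-user salt, the value that reaches the server is a 256-bit string that is not reusable at any other service, even if the user typed the same weak password there.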

Note:

Salts and keys are generated using the most secure source of random data available in the host environment.
UserSpecificSalt, ScryptSalt and the result of scrypt are encrypted using AES 256 and stored in separate data zones.
ECDSA with 256-bit private keys is used for all signatures and certificates. The private keys used by first-factor validator and second-factor validator for signing certificates are backed by hardware security modules.
Password check attempts of a user are rate limited to 5 in a window of 5 hours.
TLS 1.2 with Diffie-Hellman is used for data exchange between all entities involved.
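The rate limit in the notes above (5 password check attempts in a window of 5 hours) could be enforced along these lines. This is a minimal in-memory sketch; the class name `AttemptLimiter` is hypothetical, and treating the window as a sliding window is an assumption, since the note does not say whether the window is sliding or fixed.

```python
import time
from collections import defaultdict, deque


class AttemptLimiter:
    """Allow at most `max_attempts` checks per user within a sliding time window."""

    def __init__(self, max_attempts=5, window_seconds=5 * 3600, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window = window_seconds
        self.clock = clock
        self._attempts = defaultdict(deque)  # user id -> timestamps of recent attempts

    def allow(self, user_id: str) -> bool:
        """Record and permit an attempt, or refuse it if the user is over the limit."""
        now = self.clock()
        recent = self._attempts[user_id]
        # Forget attempts that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_attempts:
            return False
        recent.append(now)
        return True
```

With at most 5 guesses every 5 hours, an attacker who has already passed the OTP step still cannot run an online dictionary attack against the password at any useful speed.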

Summary

The above scheme ensures that:

A reusable/static password cannot be used directly on Zeta to verify if it is the user’s chosen password.
Irrespective of the complexity of the password chosen by the user, the password used for authentication at the server is cryptographically secure.
Zeta servers never see the password of the user.
No information stored on any of the data centers is sufficient to arrive at the user’s passphrase or to impersonate the user.
Compromise of all data from all servers is also insufficient to arrive at user’s chosen password.

As with almost all client-server authentication schemes, a compromised or non-compliant user agent or a compromised network will make the scheme vulnerable to a variety of threats. However, the impact of any such exploit will be limited to the user accessing the systems from such a compromised environment.

Please share your comments on our approach.

Cloudbleed

Posted on February 24, 2017 by Ramki Gaddipati

We use Cloudflare for DDoS mitigation and certain other benefits it offers. Like millions of websites that rely on Cloudflare, we are also susceptible to #Cloudbleed. Ever since the details emerged from Cloudflare, we have been analyzing the impact this may have on our services.

So far these are our observations:

We have received a confirmation from Cloudflare that our domain names are not found in any of the crawler caches they could look into.
We have verified that the access security mechanisms adopted by our end-user product applications can defend against the sort of data leaks possible due to the Cloudbleed vulnerability.
We assessed that the risk of any privacy breach is also extremely low.

However, some of our web properties are susceptible to a breach from such an exploit. We currently assess that the probability of such a breach is negligible. We are further investigating and analyzing the matter. We will keep all our users posted as we discover anything relevant or when we conclude our investigation.

From what we understand thus far, there is no cause of concern for your data, passwords or access security to your accounts.

=====

Given the negligible probability of our services being impacted, we have concluded that we will not be doing anything specifically for this vulnerability. We have a roll-out of a new multi-factor authentication model planned for all of our web properties. This is expected to reset all current auth tokens and schemes. Therefore, we expect that any unknown minor impact this breach might have had will subside in a short period of time.

[Updated: 5th March]

SecureShield: Campaign for Secure Transactions

Posted on November 26, 2016 by Ramki Gaddipati

As Indians, we are desensitized to certain social evils. Corruption. Dowry. Black Money. We see them often, and yet we may not really process them, until they injure us or our loved ones directly. I feel there’s another social evil, one that’s heavily hushed and downplayed by the industry and media: terrible practices in electronic payment security.

When I talk about this to my friends who use cards, I see the same level of indifference one might see about corruption. But once in a while, I talk to someone who had to request a chargeback. Their story, of course, is very different. They quickly relate the pain of investigative questions that they were subjected to and the countless, diligent follow-ups they had to make before the money reappeared in their account. And it is far worse if they’d been travelling or if that was the only money they had.

At Zeta, we take security very seriously. We don’t allow users to add money to their Zeta account without a second-factor authentication (usually, OTP sent by the user’s bank like ICICI or SBI). And yet, every day, we get half a dozen messages from our users telling us that someone cheated them of their PIN, OTP and card details, charged their cards and added funds to some Zeta account. We process dozens of chargeback requests a day.

Honestly, far more people pay using cards than the number of people who use Zeta. If we see so many cases of card payment fraud on a daily basis, we can only imagine how many people lose money every day because of this.

In my previous post on payment security, I have described in detail card payment insecurities and how the prevailing practices are fundamentally flawed. For example, it is not uncommon for banks to say the following:

[Image: campaign_for_secure_transactions_-_google_docs]

But every user knows that they cannot do a card transaction online or at a store without giving the card or card information to the merchant. The user has limited control over where he is entering the PIN, OTP or other details.

This must change. By default, users should have secure transactions. Nothing that the user is asked to share should make him vulnerable to fraud.

We built Zeta Super Card with this fundamental rule. We have now made it available for everyone.

Zeta Super Card with SecureShield

Zeta Super Card is a prepaid card that you can load at will. It is available in digital form the moment you install our Android or iOS apps. You can request a plastic card on our website or through our apps. You can use this card online for eCommerce transactions or at any store with a swipe machine that accepts MasterCard. Although it works at any card-accepting merchant like any normal debit or credit card, it is the SecureShield feature of the Super Card that makes it incomparably more secure than traditional magnetic stripe or chip cards with PIN.

With SecureShield, we have brought security into your hands and delivered unprecedented controls that prevent card fraud without compromising your convenience. It acts like a remote control for your card. You can turn your cards on and off anytime. When the card is off, no transactions will go through! You can turn it on whenever you want to pay.

SuperPIN

You can use dynamic SuperPIN in place of traditional 4 digit PINs. This ensures that even if the waiter at the restaurant speaks/shouts out your PIN or a CCTV camera watched you enter the PIN, you needn’t be worried. SuperPIN is valid only for 2 minutes and exactly for one transaction. SuperPIN will be available on your phone even when you are offline. You can safely transact any time at all shops that accept cards.
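One well-known way to build a short-lived, single-use PIN that also works offline is a time-based one-time code in the style of RFC 6238 (TOTP), where the phone and the server derive the same code from a shared secret and the current time. The sketch below illustrates that general idea only; it is an assumption, not a description of how SuperPIN is implemented, and the names `super_pin` and `SuperPinVerifier`, the 4-digit length, and the use of HMAC-SHA1 are all hypothetical choices.

```python
import hashlib
import hmac
import struct


def super_pin(secret: bytes, unix_time: float, step: int = 120, digits: int = 4) -> str:
    """Derive a code from the current 2-minute window (RFC 6238 style truncation)."""
    counter = int(unix_time // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


class SuperPinVerifier:
    """Server side: accept a code once per time window, then refuse reuse."""

    def __init__(self, secret: bytes, step: int = 120):
        self.secret = secret
        self.step = step
        self._used_windows = set()

    def verify(self, pin: str, unix_time: float) -> bool:
        window = int(unix_time // self.step)
        if window in self._used_windows:
            return False  # this window's code was already spent on a transaction
        if not hmac.compare_digest(pin, super_pin(self.secret, unix_time, self.step)):
            return False
        self._used_windows.add(window)
        return True
```

In this sketch a code overheard at a shop is useless once the transaction completes or the 2-minute window ends, whichever comes first.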

LocationShield

With LocationShield on, you can be sure that no fraudster can transact using your card details even if you inadvertently shared them over phone or email. The system will allow transaction only from machines close to you. If you are in Bangalore and if a fraudster in Mumbai gets your card details and OTP or PIN, he will still not be able to transact using them unless he is also in Bangalore. Irrespective of whether the fraudster is trying to do an e-commerce transaction or doing a transaction with skimmed card on a POS terminal, the transaction will be rejected and you will be notified!

Swipe2Pay

If you transact on ecommerce sites, you would be happy to know that you need not wait for OTP and enter your passwords. By now, you might know the vulnerabilities in that process. With Swipe2Pay, you just need to swipe on the secure dialog presented by Zeta on your phone to complete ecommerce transactions. It is fast, convenient and insanely secure.

Traditional PIN

If you still want to use a traditional 4 digit PIN, you can do so with an increased level of security offered by quick and instant PIN change mechanism available in your app.

SecureShield Tracking Score

SecureShield also keeps track of the security strength of your settings. You can tweak them as you like and know how secure your card is at any point in time.

Some of these security features are firsts of their kind in the world. But more importantly, they are made to be friendly and easy-to-use, not only for people like our friends but for people like our parents and grandparents, who are not technically inclined.

You must start using Zeta Super Card over any other card if you care for the safety of your funds. You should also give it to your mom, dad, grandparents or anyone else whose financial security you care about. Get your cards today!

Join the campaign of secure electronic transactions! #SecurityFirst @ZetaIndia

Today, Payments are generally insecure

Posted on October 23, 2016 by Ramki Gaddipati

Have you ever been asked by a waiter at a restaurant for your card PIN? Has he ever written it down on a piece of paper? Have you ever noticed a CCTV camera watching you enter the PIN at a shop? Have you ever used an app to buy something, and got redirected to your bank’s web page to enter the PIN? Have you ever given permission to an app to read all your SMS messages, some of which may contain payment OTPs?

By now, you might have guessed that your PIN and OTPs are easily compromised.

The many technological advancements since the time card-based payment systems were introduced have made the systems defenseless. Fundamentally, card-based systems require every machine and all intermediaries involved in the transaction to be secure. There are way too many of those machines and intermediaries across the world to ascertain the security of each of them. Many people who use those machines, like merchants, have little idea about their role in securing the entire system. Any breach at any one machine can cause unrecoverable damage to several users. The card details, once captured, can be reused several times for many transactions.

To overcome some of the limitations of the payment cards, a PIN is mandated for every transaction. A 4 digit PIN offers protection against many trivial threats. But given the reusable nature of the PIN and also the poor choice of PIN made by users, it is not a strong enough defense.

It is hard for the user to know if the card machine at a retail store or the ATM on which she is entering the PIN is tampered with. Banks providing these machines can’t practically keep a watch, as several machines are deployed in unmonitored physical environments.

In case of personal devices like phones, users will have far better visibility and control on tampering. However, most of these devices and the software that’s running on them are also vulnerable to a range of threats. Android users can easily notice that many programs can read the SMS messages on their phone. These programs can also export the OTPs in messages to a fraudster’s machine without user’s knowledge. Also, almost all mobile apps that accept card and netbanking payments can read and store the passwords, OTPs and PINs entered by the user. It may appear to the user that she is entering the PIN on the bank’s payment page. It is nearly impossible for the user to know if her password or PIN is captured and stored by the app.

It is not easy to trace the fraud back to the fraudulent app or merchant. The information captured by one app can be used by anyone from anywhere in the world. If one ATM is compromised, the cards used on that machine can be replicated and used on any other ATM. These vulnerabilities are just the low hanging targets for a fraudster.

To overcome vulnerabilities in operating systems, applications, hardware components and to defend against powerful machines and programs that can exploit the minutest of the vulnerabilities, it is essential to make security a foundational aspect of the system. The payment system providers should be fanatic about security.

Secure Foundation for a Modern Payment System

When we started work on Zeta in April 2015, we had the opportunity to consider a great expanse of devices and applications the modern connected world is currently using and is likely to use. We studied the state of the art secure protocols, services and algorithms and designed several core components of our payment system quite differently from that of legacy bank or card systems. We spent the first 3 months of our time in arriving at a secure architecture at the core. We used asymmetric encryption and signature algorithms to defend our systems against large scale fraud.

We arrived at the following ground rules to secure transactions against many of today’s threats:

3rd party devices: System should not rely on any non-personal devices of the user for transaction security. The card reading machines, POS machines, ATMs or any other machines that don’t belong to either the transacting user or Zeta should be assumed to be vulnerable.
User provided data: Users are prone to use easily retrievable personal data for passwords, pins or for such other key material. System should not solely rely on such information to secure transactions.
Data Persistence and Transmission: Every transaction must require information that is never persisted on any device involved in the transaction. Information transmitted for a transaction should not be reusable to do any other transaction.
Visible Information: If information is visible to user, we assume that it can be shared or can be stolen through techniques like social engineering. No data the user may be able to share during the course of a transaction or otherwise should be sufficient to do a new transaction.
Data Location: No amount of data at rest on any one machine, or if possible in any one location (data center), should be sufficient to complete a transaction.
Hardware Security Modules: Always use hardware security modules to secure the key material. On users’ devices that don’t support access to secure elements, use encrypted stores for key material.
Channel Sensitivity: System should strive to use the highest entropy algorithms and keys suitable for each transaction channel.

These rules guide the protocols and algorithms we use on Zeta. They influence many user interactions on our mobile and web applications as well.

Securing Card Payments

We provide Super Cards to many of our users. These are traditional plastic cards with magnetic stripe, provided to make payments to millions of card accepting merchants. This is our approach to make Zeta backward compatible with the present day payment systems. However, we didn’t want to limit ourselves to the insecure practices widely prevalent across industry. Using the ground rules discussed, we have adopted a different security model for Super Card transactions.

In-Store Payments

Dynamic OTP on the phone can be used as PIN

When a user transacts using her Super Card at a retail store, instead of a 4 digit static PIN, she can use the dynamically generated Zeta Code as her PIN. Using the Zeta Code ensures that even if the card reading machine or any of the intermediaries are compromised, no fraudulent party can make use of the PIN for another transaction.

Online Payments

User need not enter OTP for online payments

When a user starts a payment on an ecommerce website, she will be greeted by our custom SecureCode* page. In parallel, she is prompted with a dialog on her phone offering an option to ‘swipe to pay’. If the user wants to confirm the payment, she swipes the slider on the screen and enters her PIN through a custom secure keyboard provided on the phone. The payment page on the browser proceeds to the merchant website with payment success. The user does not enter any input on the SecureCode page in the browser. As the user need not read and enter the input, we can now use significantly more secure algorithms like digital signatures to represent authorisation for the payment. This approach also offers a better user experience and quicker payment completion.

If the user is offline on her phone when she is on our SecureCode page, she can generate a Zeta Code on her phone and enter that on the SecureCode page. Each such 9 digit Zeta Code is valid only for 2 minutes. This eliminates the need for static passwords, PINs and SMS messages and thus their vulnerabilities.
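As a rough illustration of how a short-lived numeric code like this can work, the sketch below derives a 9 digit code from a shared secret and the current 2 minute window, in the style of HMAC-based one-time password schemes. The post does not describe how Zeta Codes are actually generated, so the function names, the shared secret and the use of HMAC-SHA256 are all assumptions made for this example.

```python
import hashlib
import hmac
import struct

CODE_DIGITS = 9          # the post describes a 9 digit Zeta Code
VALIDITY_SECONDS = 120   # ...that is valid for only 2 minutes


def time_window(unix_time: float) -> int:
    """Map a timestamp to the index of the 2 minute window it falls in."""
    return int(unix_time) // VALIDITY_SECONDS


def generate_code(secret: bytes, unix_time: float) -> str:
    """Derive a 9 digit code from a shared secret (assumed) and the time window."""
    counter = struct.pack(">Q", time_window(unix_time))
    digest = hmac.new(secret, counter, hashlib.sha256).digest()
    # Pick 8 bytes at a digest-dependent offset, then reduce to 9 decimal digits.
    offset = digest[-1] & 0x0F
    value = int.from_bytes(digest[offset:offset + 8], "big")
    return str(value % 10 ** CODE_DIGITS).zfill(CODE_DIGITS)


def verify_code(secret: bytes, code: str, unix_time: float) -> bool:
    """Recompute the code for the current window and compare in constant time."""
    expected = generate_code(secret, unix_time)
    return hmac.compare_digest(expected.encode(), code.encode())
```

In this sketch the windows are aligned to the clock, so a code generated late in a window expires sooner than 2 minutes. A captured code is of no use once its window has passed, which is the property the paragraph above relies on.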

Security in Mobile App Interactions

We believe relying on user’s personal device is better for security than relying on any 3rd-party device. However, it is important to defend the information on the device not only against the possible loopholes of the hardware and the software, but also from the user’s ignorance. Failing to do so could allow for exploits in the form of malware and social engineering.

Therefore, when defining the security model of Zeta apps we assumed that on users’ phones:

Keyboards are compromised
SMSes are compromised
Data at rest is vulnerable
(Even though stored in application-specific stores provided by the OS)
Text labels on the screen are compromised

We built a model that can maintain the security and integrity of the transaction against such compromised components. A specific example is the Zeta Code that is generated on the phone to make a payment.

To generate a Zeta Code, the user has to press and hold on an area on the screen until a ring is fully drawn.
When the device is online, the code is generated against a digitally signed request from the user. We use ECDSA to produce these signatures.
The private keys used for signing the code generation request are encrypted and stored on disk. On phones that don’t have a secure element backed user authentication mechanism, the encryption key is derived from the user’s PIN. On phones that have a secure element, the keys are generated using high entropy sources available on the phone and are stored in secure element backed storage+.
When we ask for PIN, we take user input through a custom keyboard that obfuscates every key press. The labels of the on-screen keyboard do not represent the actual data captured on each key press. Thus, the PIN of the user is virtually never captured in the form user remembers or reproduces it. Also, the derived input from those key presses never leaves the phone.
Once the code is fetched from the server, although the code appears like text, it is rendered on screen as an image. No malware or screen reader in the device will be able to trivially parse the text on the screen.
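One of the steps above says that, on phones without a secure element, the key that encrypts the signing key is derived from the user’s PIN. The sketch below shows one common way to do such a derivation, using PBKDF2 from the Python standard library. The post does not name the derivation function Zeta uses, so the choice of PBKDF2, the iteration count, the salt handling and the function names are assumptions for illustration only.

```python
import hashlib
import os

ITERATIONS = 200_000   # assumed work factor, not taken from the post
KEY_LENGTH = 32        # bytes; enough for a 256-bit symmetric key
SALT_LENGTH = 16       # bytes


def new_salt() -> bytes:
    """Create a random per-device salt; it is stored next to the encrypted key."""
    return os.urandom(SALT_LENGTH)


def derive_key(pin: str, salt: bytes) -> bytes:
    """Stretch the PIN into a symmetric key using PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", pin.encode("utf-8"), salt, ITERATIONS, dklen=KEY_LENGTH
    )
```

A 4 digit PIN has only 10,000 possible values, so key stretching slows down offline guessing but cannot prevent it. That is consistent with the ground rule that user provided data should never be the sole protection for a transaction.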

Summing up the Card Payment Security

We must acknowledge that the card based payments can be made significantly more secure than how they currently are. MasterCard, Visa and Rupay have provisions for better security, but many of the prevailing implementations do not make good use of these provisions. We hope all of the industry players will adopt more secure options in the wake of the recently observed 3.2 million debit card compromise.

Although the recent breach is related to in-store and ATM transactions, we also hope that it serves as a wake-up call to reconsider some of the insecure practices related to online card payments as well.

We must constantly strive to earn the confidence that our users place in us. In the payment industry, it is extremely hard to win back a user’s trust after it is lost. We at Zeta will be happy to help everyone in the industry in this endless and rewarding journey of building and running secure payment systems.

* SecureCode is a trademark of MasterCard.
+ Our Android/Windows apps currently do not recognise secure element. The data is protected using keys derived from user’s PIN.

Digital Meal Vouchers

Posted on July 24, 2016 by Ramki Gaddipati

Maintaining an account of some value and supporting debits and credits on that account is an elementary exercise that a computer science student picks up in a Database course. Hundreds of accounting systems have been built across the world with about as many different implementations of these basic rules, and a proposal for a brand new accounting system could be considered as reinventing the wheel. But new accounting systems were built — Bitcoin, dozens of Altcoins, interbank distributed ledger systems like Ripple — achieving the same purpose of maintaining a ledger of accounts, but in fundamentally different ways. Across the three different systems of Bitcoin, Ripple and an Islamic Accounting System, the definitions of a ledger, a transaction and the transaction rules have very little in common. Their governing principles are so distinct that they needed fundamentally different systems.

Compliance with unique laws and regulations demands unique software systems.

The Meal Vouchers as defined in the Income Tax Act Rules, 1962 and the various prevalent interpretations of the same led us to build a one of its kind system to hold funds and enable transactions. The prevalent rules state that an organisation can provide tax benefit on an employee’s meal expenses if and only if the value is provided in the form of a ‘paid voucher’. Such a voucher should also be non-transferable and only a fixed value of Rs. 50 per meal may be given.

At Zeta, one of our objectives is to eliminate paper transactions. We tried to identify an available electronic system that can adhere to the requirements of the Income Tax Rules for Meal Vouchers. We investigated stored value electronic cards like Smart/Chip Cards and NFC Cards, conventional mobile wallets, prepaid and core banking systems and the cards backed by such systems, and concluded that none of them could comply with any acceptable definition of a ‘voucher’. None of these systems hold value in discrete units of Rs. 50 each, as a paper voucher does.

Interpretations that an electronic payment card complies with the definition of the voucher in the Income Tax Rules have been vehemently contested, and a majority of reputed auditors have taken a position to the contrary. The Income Tax Rules published in 2009 solidified the distinction between a voucher and an electronic payment card by providing a different set of rules for electronic cards from that of vouchers. The current rules do not even recognise electronic cards. If electronic cards do not meet the requirements of a meal voucher, then any other traditional mobile wallet or NFC card will also not meet those requirements, as the principles of accounting and transactions are the same among all the systems.

From our discussions with several auditors and compliance officers of various companies, we learned that they want a real paper voucher like solution to meet this requirement. So we compiled all the laws defining and governing electronic documents, negotiable instruments and payment systems in India and created a Digital Voucher system that forms a crucial building block of the solution that complies with the Income Tax Rules.

General, legal, Wikipedia and other dictionaries define a voucher as a written document with a certain value that can be exchanged for goods or services. To digitise a written document and to provide the credibility offered by a signature while being compliant with the definition of a written document, we had to rely on the Electronic Documents and Digital Signature definitions as specified in the Information Technology Act, 2000. To define an instrument that can represent a promise of certain value, we had to consult the Negotiable Instruments (Amendment) Act, 2015. To define a method of exchange of such documents and to issue documents representing a value, we had to refer to the Payment and Settlement Systems Act, 2007 and the Guidelines to Issue Prepaid Instruments published by the Reserve Bank of India.

By drawing from all the relevant laws of India, we created a Digital Voucher – a digitally signed, electronic document bearing a unique identifier and representing a specific value in Indian Rupees, issued by a Bank to a beneficiary unambiguously identified through a phone number, email address or any other such reliable, unique identifier.

We have built a system that mints Digital Vouchers on demand, using hardware security modules to create secure digital signatures as defined by Certifying Authority of India. We built a wallet that can hold these vouchers of a beneficiary. We developed a transaction system that can transfer vouchers from a beneficiary’s wallet to that of a merchant’s after due authentication and authorisation in compliance with the Reserve Bank of India’s guidelines. We created a settlement system that can interoperate with IMPS and NEFT systems of NPCI and also with card associations like MasterCard.

Using this unique system of Digital Vouchers, we built a Meal Voucher system wherein no voucher issued is worth more than Rs. 50 and every voucher is non-transferable. We added multiple subsystems which identify the nature of the business of the merchant involved in a transaction and disallow a meal voucher transaction if the merchant is not a recognised food seller. We built a system to allow full and partial refunds that work with vouchers. As network connectivity proved unreliable even in the metros of the country, we also devised a mechanism to enable offline transactions using these vouchers.
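The two rules in the paragraph above, a cap of Rs. 50 per voucher and redemption only at food sellers by the named beneficiary, can be sketched as follows. This is a toy model and not Zeta’s implementation: the class and function names, the merchant category labels and the use of random UUIDs as identifiers are assumptions, and digital signing of the vouchers is left out entirely.

```python
import uuid
from dataclasses import dataclass
from typing import List

MAX_VOUCHER_VALUE = 50  # rupees, the per-voucher cap described in the post

# Hypothetical category labels; the post does not list the real ones.
FOOD_SELLER_CATEGORIES = frozenset({"restaurant", "cafeteria", "grocery"})


@dataclass(frozen=True)
class MealVoucher:
    voucher_id: str    # unique identifier of the voucher
    beneficiary: str   # e.g. a phone number or email address
    value: int         # rupees, never more than MAX_VOUCHER_VALUE


def mint_meal_vouchers(beneficiary: str, amount: int) -> List[MealVoucher]:
    """Split an amount into vouchers of at most Rs. 50 each."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    values = [MAX_VOUCHER_VALUE] * (amount // MAX_VOUCHER_VALUE)
    if amount % MAX_VOUCHER_VALUE:
        values.append(amount % MAX_VOUCHER_VALUE)
    return [MealVoucher(uuid.uuid4().hex, beneficiary, v) for v in values]


def can_redeem(voucher: MealVoucher, payer: str, merchant_category: str) -> bool:
    """Allow redemption only by the beneficiary and only at a food seller."""
    return (
        voucher.beneficiary == payer
        and merchant_category in FOOD_SELLER_CATEGORIES
    )
```

For example, minting Rs. 120 for one employee yields three vouchers of Rs. 50, Rs. 50 and Rs. 20, each bound to that employee.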

These unique characteristics of our system make our Digital Vouchers as usable as paper vouchers:

Without Internet Connectivity
Without Power
Without Smart Devices
Without a steep learning curve

In addition to retaining the advantages of paper, our system provides compliance controls that are unimaginable with paper: the details of the beneficiary of a voucher are embedded at the time of creation of the voucher, and in every transaction the beneficiary is authenticated with multiple factors or the vouchers can’t leave the user’s wallet. This makes our vouchers non-transferable, mitigating one of the major weaknesses of paper vouchers.

To summarise, Zeta Meal Vouchers are digital vouchers generated for each employee and are immutable and non-transferable. Each voucher bears a value of Rs. 50 or less per meal. Vouchers are distributed to the employee’s wallet on the cloud. The employee can access and use them for food through the Mobile App, Super Card, Super Tags and other means after due authentication.

To deliver the most visible benefit of convenience with our Digital Vouchers, we had to tame the devil in the details of compliance. In our interactions with various companies, we realised that different companies have different views and policies for the meal voucher benefit to employees. Several companies never adopted paper vouchers because they believed that paper vouchers are non-compliant: they are not non-transferable and their usage is not truly restricted.

We made the entire system completely configurable per company. Companies can choose whether the usage of the meal vouchers has to be restricted to specific vendors in their office, vendors they have tie-ups with, vendors Zeta has tie-ups with, merchants recognised as food sellers on card networks, Zeta certified food sellers, all recognised food sellers, or any combination thereof. Companies can restrict the days and times during which the meal vouchers are usable, set the maximum permissible value of a transaction and decide how to treat the unutilised funds in the meal vouchers at the end of the financial year. That is, we have given companies all possible levers and knobs to define the system as per their policy. Better yet, the system can interoperate with several payroll software packages, and the value of meal vouchers to be disbursed in any given month can be computed and administered seamlessly.
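A per-company policy of this kind can be pictured as a small configuration object checked on every transaction. The sketch below is only an assumption about how such levers might be modelled: the field names, the merchant group labels and the `is_allowed` function are hypothetical and do not come from Zeta’s system.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import FrozenSet


@dataclass(frozen=True)
class CompanyPolicy:
    allowed_merchant_groups: FrozenSet[str]  # e.g. office vendors, certified sellers
    allowed_weekdays: FrozenSet[int]         # 0 = Monday ... 6 = Sunday
    usable_from: time                        # start of the daily usage window
    usable_until: time                       # end of the daily usage window
    max_transaction_value: int               # rupees


def is_allowed(policy: CompanyPolicy, merchant_group: str,
               amount: int, when: datetime) -> bool:
    """Check one meal voucher transaction against the company's policy."""
    return (
        merchant_group in policy.allowed_merchant_groups
        and when.weekday() in policy.allowed_weekdays
        and policy.usable_from <= when.time() <= policy.usable_until
        and 0 < amount <= policy.max_transaction_value
    )


# Hypothetical policy: weekdays only, 8 AM to 10 PM, at most Rs. 500 per transaction.
example_policy = CompanyPolicy(
    allowed_merchant_groups=frozenset({"office_vendor", "zeta_certified"}),
    allowed_weekdays=frozenset(range(5)),
    usable_from=time(8, 0),
    usable_until=time(22, 0),
    max_transaction_value=500,
)
```

Keeping the policy as plain data makes it straightforward to store one record per company and to combine restrictions without changing the transaction code.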

Each of the subsystems that compose our Digital Voucher System, like Minting, Wallet, Transaction Processing, Core Accounting system, Card Network Gateway, Merchant Curation, etc., deserves an exclusive post to discuss the unique advantages it brings and the deviations we had to make from traditional approaches to achieve them.

NOTE: Voucher is only a foundational element in compliance with the Income Tax Rules, 1962 Section 3 (7) (viii). We don’t intend to favor any interpretation of the rule over others. Our objective is to provide all the necessary ingredients to companies to compose a solution as per their policies. Over 200 companies have chosen to use this solution with at least 2 dozen unique policies that are different in many important ways.

Trusted Contacts

Posted on July 17, 2016 by Ramki Gaddipati

Protecting access to sensitive information and providing strong authentication for all financial transactions is expected of any financial app. However the hardware limitations, platform semantics and the lack of knowledge of most users make the hard problem of securing access harder.

Authentication is usually done based on either what the user knows (password, PIN, etc.), what the user has (token, chip card, etc.), or what the user is (biometric data). The password based model that is still being used by many financial apps and services is fraught with security vulnerabilities. Services often rely on additional, second factor authentication in the form of a one time password (OTP) sent through SMS. This model is reasonably secure in the desktop browser world, but it is broken in the Android world, since most apps and most human attackers can access the data stored on an Android device, such as the SMS messages.

We believe that the password model, the OTP model and models that combine the two are unreliable for a financial application. Biometrics based approaches are impractical for a large number of Android users. We couldn’t always rely on what the user knows, as it was impractical to expect all users to be cautious both in choosing what they know and in not sharing it with others. We brainstormed a few approaches that rely on what the user has and what the user is. Trusted Contacts is one such model that relies on both what the user has and what the user is in an interesting and simple way.

Trusted Contacts relies on humans as the challengers and the verifiers of identity. Humans naturally use biometrics and other ambient signals to authenticate people in everyday interactions. This process is too complex for a machine to emulate and is practically infeasible. With Trusted Contacts, we wanted to rely on this natural power of humans and convert the result of such authentication to a value which is easily representable, transmittable and has sufficient entropy. Such a value can then be presented to the system and be used for authentication by the user who was challenged.

A user first sets up his Trusted Contacts by nominating a set of at least 3 contacts identified by their mobile numbers. The nominated contacts need not be users of our system. The contacts are informed about their nomination and their role in securing the user’s account. They are told that when the user attempts to re-login from a clean state they will receive a code from Zeta. The contacts are requested to ascertain the identity of the user before they share the code they received.

Select a subset of contacts you Trust to Setup "Trusted Contacts"

When an existing user attempts to log in, he is challenged with an OTP based first factor. After successful authentication with an OTP, if the user has his Trusted Contacts set up, he is challenged to provide authentication codes from at least 2 of the nominated contacts. The user indicates the contacts he would like to rely on in this particular instance. Zeta sends codes to those contacts along with a brief reminder about their role in this exercise. The user is expected to get in touch with his trusted contacts and receive the codes. If the user provides the right codes, he is given access to his account with Zeta.

Select the Trusted Contacts to authenticate with

Enter code given by Trusted Contacts to complete authentication

During authentication, the interaction between User and his contacts is unknown to the system. The system does not provide any hints other than the name (and possibly a picture, in the future) of the contact as known to Zeta, on how to contact and/or how to establish identity. Authentication codes are securely generated 4 digit random numbers and are specific to the current authentication session of the user. As of now, the authentication codes are sent to contacts by SMS. If the contact is a Zeta user as well, we intend to use secure channels to communicate authentication codes. We intend to provide hints to the contacts on how they could challenge the user, so that they are alerted to the sensitivity of the role and can fulfil it meaningfully.
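The flow described so far (at least 3 nominated contacts, securely generated 4 digit codes per session, and codes from at least 2 contacts required) can be sketched as below. This is a minimal illustration under stated assumptions: the function names and in-memory dictionaries are hypothetical, and delivery of the codes by SMS, session expiry and retry limits are deliberately left out.

```python
import hmac
import secrets
from typing import Dict, Iterable, List

MIN_NOMINATED = 3       # contacts a user must nominate at setup
MIN_CODES_REQUIRED = 2  # contacts whose codes are needed to log in


def nominate(contacts: Iterable[str]) -> List[str]:
    """Validate the nominated contacts (identified by mobile number)."""
    unique = sorted(set(contacts))
    if len(unique) < MIN_NOMINATED:
        raise ValueError("at least 3 distinct contacts are required")
    return unique


def issue_codes(selected_contacts: Iterable[str]) -> Dict[str, str]:
    """Generate one random 4 digit code per selected contact for this session."""
    selected = sorted(set(selected_contacts))
    if len(selected) < MIN_CODES_REQUIRED:
        raise ValueError("at least 2 contacts must be selected")
    return {contact: f"{secrets.randbelow(10_000):04d}" for contact in selected}


def verify_codes(issued: Dict[str, str], presented: Dict[str, str]) -> bool:
    """Succeed only if correct codes from at least 2 contacts are presented."""
    matches = sum(
        1
        for contact, code in presented.items()
        if contact in issued
        and hmac.compare_digest(issued[contact].encode(), code.encode())
    )
    return matches >= MIN_CODES_REQUIRED
```

Because the codes are drawn fresh for each authentication session, a code shared in an earlier session gives an attacker nothing to reuse.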

If the contacts play their role reasonably well, we believe this model will be highly reliable. Given the required human interaction involved, it will also be a very unattractive target for computerised attacks. While the contacts are also likely to have Android phones, we have tried our best to make the auth code SMS messages not usable by malware and spyware.

The success of this feature, and thereby the reliability of the security it can offer, hinges a good deal on user education and the overall user experience. We have done several iterations to arrive at the current interactions. We do believe that there is a lot of room for improvement in the overall Trusted Contacts experience. However, we think it is sufficiently ready for initial adopters and to gather feedback. We would love to hear what you think!

NOTE: Trusted Contacts in isolation is not a sufficient security measure or a complete authentication approach. It is only one of the elements of a more elaborate system and application security model and relies quite heavily on other modules of the model to provide the security we rely upon.

First Post

Posted on January 29, 2016 by Ramki Gaddipati

When I started conceptualising Zeta, it appeared like a great startup opportunity. Once Bhavin and I started working on it, the canvas began to expand everyday. A great many doors opened to us, each one enabling us in new ways, and soon I stopped seeing this as merely an opportunity. A few months into Zeta, I realized that this is my life’s work.

The impact that Zeta can have in the lives of millions of Indians is massive, and we at Zeta want to make the journey itself count, and not just its end. Irrespective of commercial success, it is going to be a memorable one. 🙂

People at Zeta have built custom secure kernels to run our servers, a one of its kind non-repudiable transaction system, algorithms to optimally deliver location based services in a bandwidth constrained country, and a disintermediated transaction system that works completely offline. However, we believe this is just the beginning. We are building the foundation for what we believe could be a platform that will help India and the world transact digitally, easily and securely.

Zeta is a technology company. Not e-commerce, not banking, not travel, not food, not entertainment and not many other things. We are just a serious technology company. As such, we attack problems at the fundamentals. We work on cryptography, complex systems and algorithms. We build simple and elegant software that solves large, everyday problems. Some of these problems have been untouched for decades, and some have been attempted by hundreds of companies with no appreciable result. We believe that solutions to such problems require not only great technology but also great empathy for the users and intimate knowledge of building beautiful and highly usable interfaces. At Zeta, we have gathered some of the most talented people in the country to build products, systems and interfaces. We constantly seek to bring great people into the team who will make us learn more, do more, strive more.

The nuts and bolts of serious technology are usually hidden from the users. I hope that this blog will be a place where we write about the technology we are building, using and learning. Here we would like to discuss some of the intricacies of our products, hear feedback, collaborate with our users and engage with the engineering community at large.

Wow, it feels real special to be writing this first post on zeta.tech blog. 🙂

Wish us good luck! 🙂

Architect and co-founder of Zeta,
Ramki

Engineering @Zeta

Author: Ramki Gaddipati