Engineering @Zeta

Cipher CIAM for a reasonably complex enterprise AAAC scenario

Cipher is Zeta’s mobile-first advanced Access Control Server (ACS) that offers seamless and secure payments with best-in-class success rates. Superior integration, risk-based authentication, and superior OTP experience guarantee increased customer stickiness and higher profits. Cipher also offers powerful APIs, advanced SDKs, and innovative features like Swipe2Pay and SuperPIN to provide an unparalleled payment experience.

Cipher does not require or store any payment sensitive data, like Card PAN, Expiry, CVV to perform authentication. It requires minimal data depending on the various modes of authentication you may want to enable.

Cipher CIAM (Customer Identity & Access Management) is a point for multiple Sign In and SSO use cases externally and within Zeta. Cipher CIAM is OAuth 2.0 and Open ID Connect 1.0 compliant. The primary objective of Cipher CIAM is to reduce the complexity of access management by 10X. In this blog, we’ll be covering 3 key concepts:

  • Access management in a role-based world
  • Cipher CIAM’s approach to access management
  • Access management with Cipher CIAM

Without Cipher CIAM

Let’s look at a scenario without Cipher. Consider a hypothetical organization named Theta (a series A funded B2B startup). The company has a prospective client who is interested in their fully cooked product. With a team of talented engineers to handle all the coding requirements, the only thing needed from an authorization standpoint is ensuring access to the repository rests only with the engineering team. They need to ensure that the product can be accessed only by the prospective customer in a secure way. Additionally, if there are exits/additions to the engineering team, the admin i.e founders in this scenario need to ensure access to the repository has been revoked/granted respectively.

Roles within Theta can be divided into admin and developer. Furthermore, roles within the product are divided into admin and users.

As Theta grows in size, they have teams of HR, IT, finance, engineering, and an expanding customer base. The functions within the company have changed over time. HR requiring payroll access translates to the need for an authorization structure. The authorization scope is now isolated as there are groups of customers each requiring a tenant. Each team is further divided into specific roles i.e HR Admin, HR Update, and HR viewer.

Theta decides to launch another product with a new sales and engineering team. This product comes with a unique set of customers. As the company continues its expansion, the roles start getting increasingly complex.

System integrators and mobile app development partners enter the foray at some point, requiring access management. At this stage, Theta delivers products to customers who in turn deliver them to their end-users. Additionally, development partners need to be given access to APIs, specific SDKs, tokens, etc., thus increasing the load on the authentication and authorization system.

Fast forward a few years, Theta decides to go public and begins the process of setting up a team in the US. The company, at this point, has 200+ roles!

Using Cipher CIAM

Enter, Cipher CIAM(Customer Identity and Access Management) approach (previously called Cipher SSO).

Here, we look at how Cipher is about to solve the complexities in roles every time a new set of customers are onboarded.

The improvised journey ….

This is how Theta’s access management would look like on Cipher. For every business unit and functional unit whose access controls need to be assigned together, Theta can create a Sandbox. Within the Sandbox, they can further create object types or actions and assign specific roles to people. And, this doesn’t stop here! Customers too can create a Sandbox to manage employees (like Sodexo).

Here is a list of Pros and Cons of doing authorization and access control through the Cipher Model.

Pros

  • Managing roles is isolated within a container.
  • Apps/Products need not be re-written with growth in every new dimension.
  • Flexibility to design authorization and access control in isolations.

Cons

  • Relatively higher learning curve.
  • Not suitable for simple AAAC situations (ex: Theta in Series A and B times).
  • “Roles bloat” is possible if the sandbox is not used with careful thought.

Sample Use cases used for narration:

  1. HR wants new employees to get access to the payroll system faster as the investment declaration deadline is approaching
  2. IT Security team wants ex-employee access revoked on last day

Use case 2 — Company functions grow

  1. As a small startup grows and hires a Finance Controller, manager wants to ensure only his team can access fin data and not everyone in Tech team

Use case 3 — MNC

  1. India and US admins are able to provision artifacts on their time zones with no need for email/coordination etc

Thank You

Speaker: Bharathi Shekar , Bharathi Shekar

Edited by: Phani Marupaka

Tuning Cipher to 1M TPS for ease of scaling of online authentications (Part 2)

We went through the overall idea behind the demo in Part 1. Part 2 will inform us about how we achieved the feat, highlighting the technologies that we used along the way. Heads up, it is going to be engineering focussed. 

Quick recap: Cipher demonstrated the successful handling of a chart-busting 1 Million Transactions Per Second of online authentication requests utilizing a prudent cost of $200/hour of AWS cloud infrastructure. The simulated authentications are 240x higher than the authentication output put together by all of the big players in India during online payment transactions. 

Things we did engineered a nutshell

  1. Optimized systems for a minimum unit for processing a desirable number of online authentications 
  1. Built the infrastructure 
  2. Setup the simulation environment and node distribution mechanisms
  3. Processed successful authentications for the transactions
  4. Scaled the minimum unit for the desired load capacity of 1 Million TPS

Infrastructure used

  • Nginx – A web server used as an ingress controller (to accept HTTP requests).
  • EKS (Elastic Kubernetes Service) – An AWS managed service to run Kubernetes.
  • Prometheus – An open-source monitoring system with a dimensional data model, flexible query language, efficient time-series database, and modern alerting approach.
  • Grafana – An open-source analytics & monitoring solution for databases and microservices.
  • Metrics server – A cluster-wide aggregator of resource usage data for auto-scaling horizontal pods. 
  • Microservices
    • Edith
    • Cerberus
    • MasterCard Cipher
    • We’ve presented JVM stats, HTTP requests & connection stats from each of these apps via metrics endpoint and made this a scrape target in Prometheus.
    • We’ve also enabled the JMX port on each of these microservices to monitor them in real-time.

Node distribution

‘Node’ is a container of a single server where multiple microservices can be run and monitored to successfully process a definite amount of authentication requests. We configured each of these nodes at 64GB RAM & 16 GB processor and deployed them on Kubernetes. 

Each node had several pods running within it, which hosted the microservices required for authenticating online transactions. Amongst the microservices, there are three notable ones. 

  1. Mastercard Connector – To handle the ACS protocol specific nuances and serve as a connector for the Mastercard card scheme
  2. Edith – To orchestrate the intricate authentication plans
  3. Cerberus – To serve as the Identity Provider and the core authentication engine

Simulation environment

Although processing 1 Million TPS was the objective, we also had to generate that much amount of transaction load for doing that. So, we used Gatling as the load generator. We let the Gatling Master spread across four zones, spread across in Mumbai and Singapore, for running 50 tasks of individual test scripts to authenticate bank transactions coming in from card networks. 

The Gatling injector simulated the interactions that a card network usually has with its ACS provider while authenticating online transactions. It also reproduced the cardholder interactions required for authentication. The system-level simulations are – 

  1. VEReqs on behalf of the card network
  2. PAReqs on behalf of the card network
  3. Challenge-response submissions as performed by cardholders to prove their authenticity

The simulator routed these requests to auto-scalable clusters of Cipher microservices, which were configured with dummy BINs. 

Authentication flow

These are the steps for a successful authentication flow. 

  1. The simulator verifies (with VEReq) if the card under authentication is enrolled with Cipher. If it is, the simulator initiates the pair authentication request (PAReq) and gets the authentication URL from Cipher.
  2. Gatling simulates the Swipe to Pay (S2P) interaction, generating the necessary credentials required for completing the authentication. 
  3. Gatling submits the S2P challenge and receives the authentication response.
  4. It redirects the control back to the card network module that’s simulated.

Minimum unit

A minimum unit consisted of a specific number of instances of every microservice. The function of this minimum unit was to handle a definite number of transactions. 

To come up with this unit, we tested each microservice individually to get a measure of the number of authentication requests it could serve in a given time period with specific resources allocated to it. The minimum unit that we came up with successfully handled 20K authentication requests.

1 minimum unit = 3 units of Edith, 2 units of Mastercard Connector, 2 units of Cerberus

Here is the resource utilization data when Cipher was comfortably serving 20K authentications. 

Scaling of the minimum unit

Now that we had a stable unit that could serve 20K TPS comfortably, we scaled it to accommodate additional authentication requests. We used this unit as the base reference for serving transaction requests from one issuer. 

With this logic in place, we scaled the minimum unit up to 50 such units, handling 20K TPS each, to serve 50 issuers, leading up to 1 Million TPS. The 4 Gatling servers simulated the load for 500 distinct BINs spread across 50 issuers, each having 1,00,000 cards. The entire setup, consisting of 4 Kubernetes clusters, spanned across 2 AWS regions with two availability zones, respectively.

With AWS, the nodes aren’t scalable within minutes. But since we had to attain that, we warmed up the nodes by scaling up the pods. We used 350 of the pods within 100 nodes to handle the heavy load of 1 Million authentication requests. When the load was less, we scaled down the nodes by cutting down the pods based on the resource utilization factors using Horizontal Pod Autoscaler. 

Footnote:

  1. The authentication’s write I/O was done asynchronously considering the purpose of the live demonstration.
  2. Three requests effectively make one authentication.

If you’d like to check out the test data that we used for simulating authentications, it is available for reference here.

Conclusion

It took our team of 8 people only 9 days to successfully deliver the desired results. We’re glad to have made some key decisions during this process that allowed us to quickly achieve what we wanted – like choosing AWS, Kubernetes, and Gatling Enterprise. We hope to work on several such innovative initiatives that redefine the future of payments in our country and beyond. Thanks for reading and hope you found this helpful!

Credits

Developers who made this feat possible – Mrinal Trivedi, Amit Raj, Ramki, Dipit Grover, Amit G, Shubham Jha, Vivekanand G, Mohd. Tanveer, Shaik Idris

Author – Preethi Shreeya, Phani Marupaka

Achieving 1 Million TPS with Zeta’s Cipher – A new benchmark in the Payments Industry (Part 1)

Introduction

We marked 22nd Jan 2020 as a historical day of significance for Zeta and the future of payments! We were able to successfully showcase 1 Million Transactions per Second (1M TPS), illustrating the scalability and elasticity with which banks can handle online transactions.

Background

In December 2019, we launched Cipher, a cloud-based Authentication as a Service solution that provides an easy way for the issuing banks to participate in online transactions, adhering to the 3D-Secure protocol as laid down by EMV Co. 

With Cipher, the issuing banks can provide modern, cutting-edge, and secure card authentication services to their cardholders. To unveil the robustness of the service, i.e., the amount of authentication load that can be handled at the highest success rate (99.9%), we planned to simulate 1 million online authentications. 

To put things in context, currently, the maximum online transaction volume across India is approximately 4160 per second, considering the transactions processed across all of the electronic payment channels like UPI, Visa, Mastercard, Rupay, NEFT, and IMPS. You can find the reference calculation here

When we say Cipher can handle 1 million authentication requests, it is 240 times more than the number of payments being processed in India by all of the big players put together.  

This estimate gives us an idea of the incomparable scale at which Cipher can handle authentications, relieving issuers and merchants from the persistent problems of scalability and reliability during flash sales and other peak load scenarios. Besides, We have built Cipher in such a way that it is elastic, so no resource gets wasted. When there is an increase in the request volume, the systems self-provision the resources from cloud providers and relinquish the same when the request volume decreases.

Importance

Why is this important to us? The first reason is that we want to enable issuers to authenticate as many transactions as possible, with the highest success rate and reduced revenue loss. Naturally, issuers will be able to handle any amount of eCommerce sales at all scales that merchants can imagine. When Cipher can do all of the heavy liftings when it comes to online authentications, issuers have the comfort of focusing on their core bank offerings rather than worrying about server overload and authentication failures.

The second reason is more personal. We want to convey that ‘scale’ is a solved problem with Zeta. As more and more commerce progresses online, it is natural to expect that the payment authentication solutions will run at this internet-scale. 

We believe that all of the consumer-facing banking and payment services should be as scalable as Google Search at peak loads. We want to exemplify this with our Cipher demonstration. We want to assure the banks that they don’t have to worry about scale anymore.  

Demo Details

In the dashboard, the number on the left indicates authentication requests being handled by Cipher in reqps. The middle number shows the maximum number of requests at the current computing capacity in the selected time window (5 minutes). Towards the right, we have the number of pods (systems) of Cipher that are up and running to handle the required load. In the central region, we have the success rate of the authentication requests.

As one can notice, during the video

  1. The number of requests handled per second increases from 18 reqps to 1 Million reqps in about 3 minutes.
  2. The success rate of handling the authentication requests stays constant at 100% throughout the demo.
  3. When the load is lightened, the number of computing pods required for handling 1 Million TPS shrinks back to a smaller number showcasing how Cipher auto-scales.

As Cipher’s systems are horizontally scalable, they are limited by the computing and storage units built into them. So, by any means, this high load of 1 million TPS does not indicate the maximum ‘capacity’ of the systems. Cipher can extend to the internet-scale and shrink to a single node footprint, based on the needs. We think we demonstrated that alright. 🙂 

Scope Inclusions

In the above demonstration, we have simulated:

  • Mastercard Card Network to demonstrate the authentication initiation requests (as per the 3DS authentication protocol) that get dispatched to the Cipher system.
  • Swipe2pay authentications – to simulate user engagement while fulfilling the authenticating challenges.

Scope Exclusions

  • User interactions for authentications to do away with the additional time
  • Risk and fraud checks

Key metrics

  1. Resources used (Cipher)
    1. Units: 350 server instances (all microservices incl.); (Kube cluster running on 200 m5.xlarge EC2 instances)
    2. Memory: 8 * 360 GBs
    3. CPU: 4 * 360 vCPUs
    4. Bandwidth (if available)
  2. Resources used (Load Generator)
    1. Units: 50 * c5.4xlarge EC2 instances
    2. Memory: 32 * 50 GBs
    3. CPU: 16 * 50 vCPU
    4. Bandwidth (if available)
  3. Verification mechanism used
    1. JSON web signature.
  4. Test Data Structure
    1. Banks: 50
    2. Card BINs: 500
    3. Cards per BIN: 10000
  5. Average response time per request: 100ms
  6. Hops per each request within Cipher cluster: 
    1. VEReq: 1 hop
    2. PAReq: 2 hops
    3. Submit Swipe2Pay challenge: 3hops 

Conclusion

We honestly believe that the 1 million TPS demonstration has set a new benchmark for payment authentication solutions. We are also delighted that we were able to pull this off in a short period of 9 days. Although we didn’t have enough time to explore a lot of other frameworks for additional optimizations, we think we did great in the time we had! 

It’s also worth mentioning that our systems are capable of scaling more than 1 Million TPS when necessary. We hope that cloud technologies get widely adopted by issuers in the Payments Industry to be able to improve their offerings for their customers significantly.

We have a part two which delves into the core engineering aspects of the demo for the relevant audience. Thank you!

Part 2 of the blog

Credits

Developers who made this feat possible – Mrinal Trivedi, Amit Raj, Ramki, Dipit Grover, Amit G, Shubham Jha, Vivekanand G, Mohd. Tanveer, Shaik Idris

Author – Preethi Shreeya, Phani Marupaka