Engineering @Zeta

Program Management: A Beginner’s Guide

Rohit Kamat, Program Manager at Zeta, shares his experience as a program manager and key learnings from his time at Zeta, covering topics such as stakeholder management, defect management, and sprint and scrum planning.

Stakeholder Management

When working on a project, managing your stakeholders is just as important as managing your deliverables. Unless you identify, analyze, plan for, and review your stakeholders, you will not know who these people are, what they are responsible for, and what they expect from the project and from you.

Just as you do not want to over-assign tasks to the people working on your projects, it is important not to burn yourself out by talking to too many people. As a program manager, you must know how to prioritize stakeholders and derive the most value from your interactions with them.

There are multiple approaches to identifying and prioritizing stakeholders; let us take a look at two of them:

  • Power Interest Grid
  • Stakeholder Salience Model

Power Interest Grid

This chart maps the authority or power of stakeholders against how much interest they take in the project. Based on where a stakeholder falls on these two axes, we can decide what type of information they need and what type of engagement we want to have with them.
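
The grid translates directly into four engagement strategies. Below is a minimal sketch of that mapping, with hypothetical stakeholders and scores; the quadrant labels follow the common reading of the grid.

```python
# Illustrative Power-Interest Grid classification.
# Stakeholder names and their power/interest scores are hypothetical.

def engagement(power: int, interest: int, threshold: int = 5) -> str:
    """Map power and interest scores (1-10) to an engagement strategy."""
    if power > threshold and interest > threshold:
        return "Manage closely"      # key players
    if power > threshold:
        return "Keep satisfied"      # high power, low interest
    if interest > threshold:
        return "Keep informed"       # low power, high interest
    return "Monitor"                 # low power, low interest

stakeholders = {
    "Product sponsor": (9, 8),
    "Compliance lead": (8, 3),
    "Support team": (3, 9),
    "Facilities": (2, 2),
}

for name, (power, interest) in stakeholders.items():
    print(f"{name}: {engagement(power, interest)}")
```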

Stakeholder Salience Model

This chart maps the types of stakeholders against power, legitimacy, and urgency.

Explainer video — https://zeta.ap.panopto.com/Panopto/Pages/Embed.aspx?id=de7b601b-1906-45e3-8481-ad21007ef87f

Key Learnings from Sprint and Scrum Planning

We know how important sprint and scrum planning is in any organization. What we don’t talk about is its implementation. Poorly planned and executed sprints could hamper progress, rather than streamline it. For example, having someone who is not part of the team plan a sprint could be a recipe for disaster.

Explainer video — https://zeta.ap.panopto.com/Panopto/Pages/Embed.aspx?id=d114e9e9-c3b1-4f95-8c22-ad21007f4b01

Below are some key learnings from sprint planning:

  • To change the system you have to be part of the system.
  • Reflect → Tune → Adjust. Revisit your scrum practices often and make changes as needed. If something works well, figure out why it has worked well and how it can help other aspects of the scrum. If something is not working well, fix it.
  • The key to early and continuous delivery is communication and transparency.
  • Your scrum or sprints do not have to align with those of your clients. Do not force this.
  • Do not divide scrum teams into smaller teams.
  • Achieving a velocity target should not be the objective of a scrum; the objective should be to find new and different ways of achieving it.
  • Rituals do not make you agile.

Defect Management

How well you manage defects defines your last-mile delivery success. Listed below are some ways you can track defects.

  • JIRA Dashboards help you list and track defects.
  • Triages help you understand the priority of defects.

Explainer video — https://zeta.ap.panopto.com/Panopto/Pages/Embed.aspx?id=a0b1ac42-a7df-40e3-8b14-ad21007f4f9d

Conclusion

Program management is a critical role in any organization, requiring you to interact with stakeholders across multiple disciplines. Your ability to obtain the required information from these stakeholders while keeping them informed about progress is essential to the smooth functioning of a team and to ensuring deadlines are met.

Speaker: Rohit Kamat

Blogged by: Hemchander Gunashekar

Wallet 2.0

Wallets are among the most basic things people have carried since early times: we needed somewhere to keep our money for daily needs, since one cannot visit an ATM for every single purchase. The digital version, the e-wallet, aggregates payment-related instruments for ease of use and lets customers pay for purchases digitally.

Wallet 1.0

Let’s discuss the basic model of Wallet 1.0. It provides an aggregation over a set of Supercards and accounts. Supercards are debit or credit cards managed by Zeta and can be either physical or virtual.

The wallet is also associated with accounts (or cloud cards); examples of cloud cards include the meal, cash, and communication cards you can see in the Zeta app. Debits and credits occur at these accounting instruments, and for Wallet 1.0 these accounts are hosted in the Aura** domain.

Coming to payment plan generation: when a customer initiates a transaction with their Supercard, the wallet linked to that Supercard is identified first. Consider a transaction of $100 made with a Supercard at McDonald’s, where the wallet has three linked accounts corresponding to the meal, communication, and cash cards. To decide which of the user’s accounts to debit, by how much, and in which order, the wallet generates a payment plan. Based on the parameters of the incoming payment request and the applicable account selection rules, eligible accounts are filtered. An account can be filtered out for reasons such as exceeding the maximum daily or monthly spend limit, exceeding the maximum transaction limit, or because the payment only allows an account in a particular currency to be used. In our example, let’s say the communication account was filtered out due to such a restriction.

After this filtering, the accounts are ordered based on the suggestion strategy. This ordering decides how one or more accounts are debited for the payment. While purchasing food items, for instance, it might be preferable to debit the meal card first and then take the remaining balance (if required) from the cash card. So, for the transaction at McDonald’s, if our meal card does not hold sufficient funds, the generated plan may suggest debiting $90 from the meal account and $10 from the cash account.

Once this is done, a final ordered sequence of accounts, along with the proposed debit amount, is presented to the payment engine, which further acts based on this plan.
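
Below is a minimal sketch of this plan-generation flow. The account data, the category-based selection rule, and the meal-first suggestion strategy are simplified, hypothetical stand-ins for the real rules described above, not Zeta’s actual implementation.

```python
# Sketch of Wallet 1.0 payment plan generation: filter eligible accounts using
# selection rules, order them by a suggestion strategy, then split the amount.
# Account data, the category rule, and the meal-first strategy are hypothetical.

ACCOUNTS = [
    {"type": "meal",          "balance": 90,  "categories": {"food"}},
    {"type": "communication", "balance": 500, "categories": {"telecom"}},
    {"type": "cash",          "balance": 300, "categories": {"food", "retail", "telecom"}},
]

# Suggestion strategy for food merchants: prefer the meal account, then cash.
FOOD_FIRST = {"meal": 0, "cash": 1, "communication": 2}

def generate_payment_plan(accounts, amount, merchant_category):
    # 1. Account selection: drop accounts the rules make ineligible
    #    (here: the account does not support the merchant category).
    eligible = [a for a in accounts if merchant_category in a["categories"]]

    # 2. Ordering: apply the suggestion strategy.
    eligible.sort(key=lambda a: FOOD_FIRST[a["type"]])

    # 3. Split the payment amount across the ordered accounts.
    plan, remaining = [], amount
    for account in eligible:
        if remaining == 0:
            break
        debit = min(account["balance"], remaining)
        if debit > 0:
            plan.append((account["type"], debit))
            remaining -= debit
    if remaining:
        raise ValueError("Insufficient funds across eligible accounts")
    return plan

# The McDonald's example: $100 at a food merchant.
print(generate_payment_plan(ACCOUNTS, 100, "food"))  # [('meal', 90), ('cash', 10)]
```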

Why Wallet 2.0?

  • Wallet is an aggregator of supercards only rather than a generic pool of payment instruments
  • Violating separation of concerns
  • Wallet shouldn’t be responsible for aggregating payment instruments
  • Wallet is performing payment instrument authentication
  • Wallet’s payment accounts are only from Aura** domain
  • Wallet is tightly coupled with #Zetauser
  • No wallet product specification exists

Wallet 2.0 Design

A wallet product acts as an umbrella or specification for wallets; a wallet is a materialization of the wallet product and inherits its properties from it.

The wallet product definition provides the capability to manage wallets — their selection rules, suggestion strategies, and other flags and attributes — at the top level.

Similarly, a payment account product governs the definition/schema for payment accounts. Each payment account product is linked with a payment account provider, which acts as an interface to the actual account provider (like Google Pay or Paytm).

Operations such as fetching the account balance are performed through this interface.

Coming to the payment side of the picture: say a user on BigBasket saves a VISA card; a corresponding payment account is created for it. If they then add a MasterCard, another payment account is created and added. The payment account products here could be modeled as payment types such as VISA and MasterCard, since the semantics of authentication and the transaction workflow depend on them. Now, when the balance of a VISA payment account is needed, the wallet reaches out to the Payment Account Provider — Aura, if we are maintaining the account here at Zeta, or Shadowcard, which reaches out to the VISA network to fetch the balance (detailed example).
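
As a rough sketch of this provider abstraction, here is how the interface might look in code. The class names, method signature, and in-memory ledger are hypothetical; AuraProvider and ShadowcardProvider merely stand in for the two cases described above.

```python
# Sketch of the Payment Account Provider abstraction in Wallet 2.0.
# Class names, the method signature, and the in-memory ledger are hypothetical.
from abc import ABC, abstractmethod

class PaymentAccountProvider(ABC):
    """Interface between a payment account and the system that actually holds it."""

    @abstractmethod
    def get_balance(self, account_id: str) -> int:
        ...

class AuraProvider(PaymentAccountProvider):
    """The account is hosted in Zeta's Aura ledger, so the balance lookup is local."""

    def __init__(self, ledger: dict):
        self.ledger = ledger

    def get_balance(self, account_id: str) -> int:
        return self.ledger[account_id]

class ShadowcardProvider(PaymentAccountProvider):
    """The account lives with an external network (e.g. VISA); the balance is fetched remotely."""

    def get_balance(self, account_id: str) -> int:
        raise NotImplementedError("would call out to the card network here")

# A payment account product (e.g. 'VISA') is bound to one provider;
# the wallet always asks the provider, never the backing system directly.
visa_provider: PaymentAccountProvider = AuraProvider({"acc-123": 2500})
print(visa_provider.get_balance("acc-123"))  # 2500
```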

Putting it all together: when BigBasket asks for a payment suggestion for a user, the configured selection rules and suggestion strategies are applied to that wallet’s payment accounts, with the help of the Payment Account Providers, to generate a payment plan that is passed back to BigBasket.

Thank you

Notes:

** The Aura cluster is the core digital accounting system, often termed the heart of Zeta’s financial systems. Any financial transaction to be effected is accounted for and recorded in Aura. It helps track and manage the financial information of Zeta’s business clients, and it has the ultimate authority in approving or rejecting a financial transaction.

Speakers — Siddharth Sharma & Praveen K L

Edited by Phani Marupaka

Tuning Cipher to 1M TPS for ease of scaling of online authentications (Part 2)

We went through the overall idea behind the demo in Part 1. Part 2 explains how we achieved the feat, highlighting the technologies we used along the way. Heads up: it is going to be engineering-focused.

Quick recap: Cipher demonstrated the successful handling of a chart-busting 1 Million Transactions Per Second of online authentication requests, at a prudent cost of $200/hour of AWS cloud infrastructure. The simulated authentication volume is roughly 240x the combined authentication throughput of all the big players in India during online payment transactions.

What we engineered, in a nutshell

  1. Optimized the systems into a minimum unit for processing a desirable number of online authentications
  2. Built the infrastructure
  3. Set up the simulation environment and node distribution mechanisms
  4. Processed successful authentications for the transactions
  5. Scaled the minimum unit to the desired load capacity of 1 Million TPS

Infrastructure used

  • Nginx – A web server used as an ingress controller (to accept HTTP requests).
  • EKS (Elastic Kubernetes Service) – An AWS managed service to run Kubernetes.
  • Prometheus – An open-source monitoring system with a dimensional data model, flexible query language, efficient time-series database, and modern alerting approach.
  • Grafana – An open-source analytics & monitoring solution for databases and microservices.
  • Metrics server – A cluster-wide aggregator of resource usage data, used for horizontal pod autoscaling.
  • Microservices
    • Edith
    • Cerberus
    • MasterCard Cipher
    • We’ve exposed JVM stats and HTTP request & connection stats from each of these apps via a metrics endpoint and made each one a scrape target in Prometheus (a sketch of this pattern follows the list).
    • We’ve also enabled the JMX port on each of these microservices to monitor them in real-time.
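
The services themselves are JVM applications (hence the JMX port), but the metrics-endpoint pattern is stack-agnostic. Here is a minimal Python sketch of the idea using the prometheus_client library; the metric names and port are hypothetical.

```python
# Minimal sketch of exposing a /metrics endpoint that Prometheus can scrape.
# Metric names and the port are hypothetical; Cipher's services are JVM apps
# exposing equivalent JVM, HTTP, and connection stats, plus a JMX port.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

HTTP_REQUESTS = Counter("http_requests_total", "HTTP requests served", ["status"])
OPEN_CONNECTIONS = Gauge("open_connections", "Currently open connections")

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://<pod-ip>:9100/metrics
    while True:
        HTTP_REQUESTS.labels(status="200").inc()     # count a served request
        OPEN_CONNECTIONS.set(random.randint(0, 50))  # placeholder connection gauge
        time.sleep(0.01)
```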

Node distribution

A ‘node’ is a single server on which multiple microservices can be run and monitored to process a definite number of authentication requests. We configured each of these nodes with 64 GB of RAM and 16 vCPUs and deployed them on Kubernetes.

Each node had several pods running within it, which hosted the microservices required for authenticating online transactions. Amongst the microservices, there are three notable ones. 

  1. Mastercard Connector – To handle the ACS protocol-specific nuances and serve as a connector for the Mastercard card scheme
  2. Edith – To orchestrate the intricate authentication plans
  3. Cerberus – To serve as the Identity Provider and the core authentication engine

Simulation environment

Although processing 1 Million TPS was the objective, we first had to generate that much transaction load. We used Gatling as the load generator, with the Gatling Master spread across four zones in Mumbai and Singapore, running 50 tasks of individual test scripts that authenticate bank transactions coming in from the card networks.

The Gatling injector simulated the interactions that a card network usually has with its ACS provider while authenticating online transactions. It also reproduced the cardholder interactions required for authentication. The system-level simulations are – 

  1. VEReqs (verify enrollment requests) on behalf of the card network
  2. PAReqs (payer authentication requests) on behalf of the card network
  3. Challenge-response submissions as performed by cardholders to prove their authenticity

The simulator routed these requests to auto-scalable clusters of Cipher microservices, which were configured with dummy BINs. 

Authentication flow

These are the steps for a successful authentication flow (a sketch of one simulated iteration follows the list).

  1. The simulator verifies (with VEReq) whether the card under authentication is enrolled with Cipher. If it is, the simulator initiates the payer authentication request (PAReq) and gets the authentication URL from Cipher.
  2. Gatling simulates the Swipe to Pay (S2P) interaction, generating the necessary credentials required for completing the authentication. 
  3. Gatling submits the S2P challenge and receives the authentication response.
  4. It redirects control back to the simulated card network module.
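
As a hedged illustration, here is what one simulated authentication iteration could look like in code. The endpoint paths, payloads, and base URL are hypothetical stand-ins, not Cipher’s actual API, and the real load was driven by Gatling scenarios rather than Python.

```python
# Sketch of one simulated authentication flow (VEReq -> PAReq -> S2P challenge).
# The base URL, endpoint paths, and payloads are hypothetical placeholders.
import requests

CIPHER = "https://cipher.example.test"  # hypothetical base URL

def authenticate(card_number: str) -> bool:
    # 1. VEReq: check whether the card is enrolled with Cipher.
    vereq = requests.post(f"{CIPHER}/acs/vereq", json={"pan": card_number})
    if vereq.json().get("enrolled") != "Y":
        return False

    # 2. PAReq: initiate payer authentication and obtain the authentication URL.
    pareq = requests.post(f"{CIPHER}/acs/pareq", json={"pan": card_number})
    auth_url = pareq.json()["auth_url"]

    # 3. Simulate the cardholder's Swipe to Pay (S2P) interaction and submit the challenge.
    challenge = requests.post(auth_url, json={"s2p_credential": "simulated"})
    return challenge.json().get("status") == "AUTHENTICATED"

if __name__ == "__main__":
    print(authenticate("4111111111111111"))
```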

Minimum unit

A minimum unit consisted of a specific number of instances of every microservice. The function of this minimum unit was to handle a definite number of transactions. 

To come up with this unit, we tested each microservice individually to get a measure of the number of authentication requests it could serve in a given time period with specific resources allocated to it. The minimum unit that we arrived at successfully handled 20K authentication requests per second.

1 minimum unit = 3 units of Edith, 2 units of Mastercard Connector, 2 units of Cerberus

Here is the resource utilization data when Cipher was comfortably serving 20K authentications per second.

Scaling of the minimum unit

Now that we had a stable unit that could serve 20K TPS comfortably, we scaled it to accommodate additional authentication requests. We used this unit as the base reference for serving transaction requests from one issuer. 

With this logic in place, we scaled the minimum unit up to 50 such units, each handling 20K TPS, to serve 50 issuers, leading up to 1 Million TPS. The 4 Gatling servers simulated the load for 500 distinct BINs spread across 50 issuers, each issuer having 1,00,000 cards. The entire setup, consisting of 4 Kubernetes clusters, spanned two AWS regions, with two availability zones in each.
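
As a back-of-envelope check of this scaling logic, using only the figures quoted in the text (20K TPS per minimum unit, the 3/2/2 pod composition above, and the 1 Million TPS target):

```python
# Back-of-envelope scaling math from the minimum unit, using figures from the text.
MIN_UNIT_TPS = 20_000                  # one minimum unit serves ~20K authentications/sec
MIN_UNIT_PODS = {"edith": 3, "mastercard_connector": 2, "cerberus": 2}

TARGET_TPS = 1_000_000
units = TARGET_TPS // MIN_UNIT_TPS     # 50 units, i.e. one per simulated issuer

pods_per_service = {svc: count * units for svc, count in MIN_UNIT_PODS.items()}
print(units)                           # 50
print(pods_per_service)                # {'edith': 150, 'mastercard_connector': 100, 'cerberus': 100}
print(sum(pods_per_service.values()))  # 350 pods, consistent with the pod count mentioned below
```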

With AWS, nodes cannot be scaled up within minutes, so we warmed up the nodes in advance and handled the scaling at the pod level. We used 350 pods across 100 nodes to handle the heavy load of 1 Million authentication requests per second. When the load was lower, we scaled down by cutting back pods based on resource utilization, using the Horizontal Pod Autoscaler.

Footnote:

  1. The authentication’s write I/O was performed asynchronously, given the purpose of the live demonstration.
  2. Three requests effectively make one authentication.

If you’d like to check out the test data that we used for simulating authentications, it is available for reference here.

Conclusion

It took our team of 8 people only 9 days to successfully deliver the desired results. We’re glad to have made some key decisions during this process that allowed us to quickly achieve what we wanted – like choosing AWS, Kubernetes, and Gatling Enterprise. We hope to work on several such innovative initiatives that redefine the future of payments in our country and beyond. Thanks for reading and hope you found this helpful!

Credits

Developers who made this feat possible – Mrinal Trivedi, Amit Raj, Ramki, Dipit Grover, Amit G, Shubham Jha, Vivekanand G, Mohd. Tanveer, Shaik Idris

Author – Preethi Shreeya, Phani Marupaka

Achieving 1 Million TPS with Zeta’s Cipher – A new benchmark in the Payments Industry (Part 1)

Introduction

22nd Jan 2020 marked a day of historic significance for Zeta and the future of payments! We successfully showcased 1 Million Transactions per Second (1M TPS), illustrating the scalability and elasticity with which banks can handle online transactions.

Background

In December 2019, we launched Cipher, a cloud-based Authentication-as-a-Service solution that provides an easy way for issuing banks to participate in online transactions while adhering to the 3-D Secure protocol laid down by EMVCo.

With Cipher, issuing banks can provide modern, cutting-edge, and secure card authentication services to their cardholders. To demonstrate the robustness of the service, i.e., the amount of authentication load that can be handled at the highest success rate (99.9%), we planned to simulate 1 million online authentications per second.

To put things in context, the maximum online transaction volume across India is currently approximately 4,160 per second, considering the transactions processed across all of the electronic payment channels such as UPI, Visa, Mastercard, RuPay, NEFT, and IMPS. You can find the reference calculation here.

When we say Cipher can handle 1 million authentication requests per second, that is roughly 240 times the number of payments processed per second in India by all of the big players put together (1,000,000 ÷ 4,160 ≈ 240).

This estimate gives an idea of the scale at which Cipher can handle authentications, relieving issuers and merchants of the persistent scalability and reliability problems seen during flash sales and other peak-load scenarios. Besides, we have built Cipher to be elastic, so no resource gets wasted: when the request volume increases, the systems self-provision resources from the cloud provider and relinquish them when the volume decreases.

Importance

Why is this important to us? The first reason is that we want to enable issuers to authenticate as many transactions as possible, with the highest success rate and reduced revenue loss. Naturally, issuers will be able to handle eCommerce sales at any scale that merchants can imagine. With Cipher doing all of the heavy lifting for online authentications, issuers have the comfort of focusing on their core banking offerings rather than worrying about server overload and authentication failures.

The second reason is more personal. We want to convey that ‘scale’ is a solved problem with Zeta. As more and more commerce progresses online, it is natural to expect that the payment authentication solutions will run at this internet-scale. 

We believe that all of the consumer-facing banking and payment services should be as scalable as Google Search at peak loads. We want to exemplify this with our Cipher demonstration. We want to assure the banks that they don’t have to worry about scale anymore.  

Demo Details

In the dashboard, the number on the left indicates the authentication requests being handled by Cipher in requests per second (reqps). The middle number shows the maximum number of requests handled at the current computing capacity within the selected time window (5 minutes). Towards the right, we have the number of Cipher pods (systems) that are up and running to handle the required load. In the central region, we have the success rate of the authentication requests.

As one can notice during the video:

  1. The number of requests handled per second increases from 18 reqps to 1 Million reqps in about 3 minutes.
  2. The success rate of handling the authentication requests stays constant at 100% throughout the demo.
  3. When the load is lightened, the number of computing pods required for handling 1 Million TPS shrinks back to a smaller number showcasing how Cipher auto-scales.

Since Cipher’s systems are horizontally scalable, they are limited only by the computing and storage units available to them. So, by no means does this high load of 1 million TPS indicate the maximum ‘capacity’ of the systems. Cipher can extend to internet scale and shrink to a single-node footprint, based on need. We think we demonstrated that alright. 🙂

Scope Inclusions

In the above demonstration, we have simulated:

  • Mastercard Card Network to demonstrate the authentication initiation requests (as per the 3DS authentication protocol) that get dispatched to the Cipher system.
  • Swipe2pay authentications – to simulate user engagement while fulfilling the authentication challenges.

Scope Exclusions

  • User interactions for authentication (excluded to do away with the additional time they would add)
  • Risk and fraud checks

Key metrics

  1. Resources used (Cipher)
    1. Units: 350 server instances (all microservices included); Kube cluster running on 200 m5.xlarge EC2 instances
    2. Memory: 8 * 360 GBs
    3. CPU: 4 * 360 vCPUs
    4. Bandwidth (if available)
  2. Resources used (Load Generator)
    1. Units: 50 * c5.4xlarge EC2 instances
    2. Memory: 32 * 50 GBs
    3. CPU: 16 * 50 vCPU
    4. Bandwidth (if available)
  3. Verification mechanism used
    1. JSON web signature.
  4. Test Data Structure
    1. Banks: 50
    2. Card BINs: 500
    3. Cards per BIN: 10000
  5. Average response time per request: 100 ms (see the back-of-envelope concurrency estimate after this list)
  6. Hops per each request within Cipher cluster: 
    1. VEReq: 1 hop
    2. PAReq: 2 hops
    3. Submit Swipe2Pay challenge: 3 hops
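
As a hedged back-of-envelope estimate (not a figure quoted in the demo), Little’s Law relates the 1 Million requests per second and the 100 ms average response time above to the number of requests in flight at any instant:

```python
# Little's Law: average in-flight requests = throughput x average latency.
# Both inputs come from the metrics above; the concurrency figure is our own estimate.
throughput_rps = 1_000_000   # requests per second at peak
avg_latency_s = 0.100        # 100 ms average response time per request

in_flight = throughput_rps * avg_latency_s
print(f"~{in_flight:,.0f} requests in flight at any instant")  # ~100,000
```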

Conclusion

We honestly believe that the 1 million TPS demonstration has set a new benchmark for payment authentication solutions. We are also delighted that we were able to pull this off in a short period of 9 days. Although we didn’t have enough time to explore a lot of other frameworks for additional optimizations, we think we did great in the time we had! 

It’s also worth mentioning that our systems are capable of scaling more than 1 Million TPS when necessary. We hope that cloud technologies get widely adopted by issuers in the Payments Industry to be able to improve their offerings for their customers significantly.

We have a part two which delves into the core engineering aspects of the demo for the relevant audience. Thank you!

Part 2 of the blog

Credits

Developers who made this feat possible – Mrinal Trivedi, Amit Raj, Ramki, Dipit Grover, Amit G, Shubham Jha, Vivekanand G, Mohd. Tanveer, Shaik Idris

Author – Preethi Shreeya, Phani Marupaka