Optimizing Event parsing data pipeline

From my experience working at Simpl.

This article was also published on Simpl’s engineering blog; for reference: Optimizing Event parsing data pipeline.

The Data Platform team at Simpl is committed to delivering high-quality data with optimal latency to support critical business operations. Our ecosystem runs multiple batch and near-real-time data pipelines that serve as the foundation for downstream teams, including Analytics, Data Science, Fraud Detection, Risk Management, and Business.

In this article, we will discuss a critical data pipeline that serves as the backbone of our data flow: its past, present, and future.

At Simpl, we have built our infrastructure around an event-driven architecture, which gives us the flexibility and scalability we need. To give a flavour of what happens behind the scenes, consider a real-world scenario:

A customer opens and browses a merchant’s mobile application (like Zomato, Swiggy, or Myntra) and proceeds to place an order. During checkout, they can select Simpl’s “pay later” or “pay-in-3” option to complete their purchase.

Throughout this user’s purchase journey, a lot of action happens in the background to manage the Simpl payment experience.

Pre-Transaction Phase:

Transaction Initiation:

Transaction Processing:

Post-Transaction Activities:

Each of these interactions generates specific event data that helps us monitor the entire payment ecosystem and ensure a seamless user experience.

{
  "event_id": "017309f7-c126-44f8-9333-1896e5ecd2df",
  "event_timestamp": "2025-06-01T07:14:13.185474522Z",
  "event_name": "UserTransactionCompletedEvent",
  "user": {
    "id": "17ztr78q-9ad5-753f-aa98-f1e6g5543c98",
    "phone_number": "9999999999"
  },
  "transaction_details": {
    "total_amount": 56000,
    "merchant": "Myntra",
    "payment_status": "successful",
    "failure_reason": null
  },
  "payment_mode": "simpl-pay-later",
  "version": "1.0.0"
}

The event data is then relayed to downstream systems and pipelines that perform ETL operations and enrich the silver-layer tables maintained by the Data Platform team, which we will discuss next.
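As a rough illustration, below is a minimal PySpark sketch of what such a parsing step could look like, assuming Spark on Databricks. The schema mirrors the sample event above; the input path and target table name are hypothetical, not our actual ones.

# Minimal parsing sketch (PySpark); paths and table names are illustrative
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("event-parsing-sketch").getOrCreate()

# Schema derived from the sample UserTransactionCompletedEvent above;
# event_timestamp is kept as a string here and can be cast downstream
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_timestamp", StringType()),
    StructField("event_name", StringType()),
    StructField("user", StructType([
        StructField("id", StringType()),
        StructField("phone_number", StringType()),
    ])),
    StructField("transaction_details", StructType([
        StructField("total_amount", LongType()),
        StructField("merchant", StringType()),
        StructField("payment_status", StringType()),
        StructField("failure_reason", StringType()),
    ])),
    StructField("payment_mode", StringType()),
    StructField("version", StringType()),
])

# Raw events landed as JSON strings (bronze layer; path is a placeholder)
raw = spark.read.text("s3://bronze/events/user_transactions/")

# Parse the JSON payload, flatten the nested fields, and append to a silver table
silver = (
    raw.select(F.from_json("value", event_schema).alias("e"))
       .select(
           "e.event_id",
           "e.event_timestamp",
           "e.event_name",
           F.col("e.user.id").alias("user_id"),
           F.col("e.transaction_details.total_amount").alias("total_amount"),
           F.col("e.transaction_details.merchant").alias("merchant"),
           F.col("e.transaction_details.payment_status").alias("payment_status"),
           "e.payment_mode",
       )
)
silver.write.mode("append").saveAsTable("silver.user_transaction_events")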

High-level overview & components of the pipeline

[Image: Kafka topics viewed in Kafdrop]

[Image: Airflow DAG]

[Image: Databricks workflow]

At a 50,000-foot view this may seem like a simple event parsing pipeline, yet the same pipeline used to run for over 8 hours to enrich the silver layer. The team worked on optimising the entire pipeline and brought the runtime down to an hour.
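To make the orchestration concrete, here is a minimal Airflow DAG sketch of how these components could be wired together, assuming the Databricks provider for Airflow. The DAG id, schedule, connection id, and job id are illustrative placeholders, not the actual pipeline’s configuration.

# Minimal orchestration sketch; all identifiers below are placeholders
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="event_parsing_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",  # illustrative cadence
    catchup=False,
) as dag:
    # Trigger the Databricks workflow that parses raw events into silver tables
    run_event_parsing = DatabricksRunNowOperator(
        task_id="run_event_parsing_job",
        databricks_conn_id="databricks_default",
        job_id=12345,  # hypothetical Databricks job id
    )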

The Past: Legacy Pipeline Architecture

[Image: Legacy pipeline architecture]

Pain Points

The complete pipeline required approximately 8 to 9 hours to process a full day’s worth of data. As our data volume grew over time, the runtime increased proportionally, significantly impacting downstream systems that depended on timely data availability.

The Present: Optimized Pipeline Architecture

[Image: Optimized pipeline architecture]

We addressed these pain points through a pipeline redesign over time.

[Image: Kafka consumer infrastructure]
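As an illustration of continuous consumption, here is a minimal Spark Structured Streaming sketch that lands raw Kafka events into a bronze location, assuming Spark with the Kafka source and Delta on Databricks. The broker, topic, checkpoint, and output paths are placeholders.

# Minimal streaming ingestion sketch; brokers, topic, and paths are placeholders
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-consumer-sketch").getOrCreate()

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker-1:9092")
         .option("subscribe", "user-transaction-events")
         .option("startingOffsets", "latest")
         .load()
)

# Keep the raw payload plus Kafka metadata; parsing into the silver schema
# happens downstream, decoupled from ingestion
landed = events.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("raw_event"),
    "topic", "partition", "offset", "timestamp",
)

(
    landed.writeStream
          .format("delta")
          .option("checkpointLocation", "s3://bronze/_checkpoints/user_transactions/")
          .trigger(availableNow=True)  # run as an incremental batch; adjust as needed
          .start("s3://bronze/events/user_transactions/")
)

One likely benefit of such a setup is that downstream jobs only need to process newly landed data rather than replaying a full day’s worth of events.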

What’s Next: Future Optimization Roadmap

Key Takeaways

This journey highlights several key principles for large-scale data pipeline optimization:

“Excellence is never an accident. It is always the result of high intention, sincere effort, and intelligent execution”