How to Design an Event-Driven Analytics Pipeline — A System Design Deep Dive
Interview prep series: real problems, real trade-offs, no hand-waving.
The Problem Worth Solving
Here's a scenario straight from a fintech startup I worked at.
We had field agents responsible for onboarding customers, managing investments, and handling deposits and withdrawals. Management wanted to shift from a fixed salary model to fixed + commission — where commission depended on each agent's monthly performance.
The immediate instinct was: just write a SQL query. And that's exactly what we did, initially. But it quickly fell apart:
The query joined across 7+ tables representing different workflow states
It had to be re-run every time someone asked "how is agent X doing?"
Any change to the pay structure meant rewriting the query
It didn't scale — every CXO-level ad hoc report became a fire drill
The underlying issue wasn't the SQL. It was that the system was storing events but not state — and retrieving current state required replaying all those events manually every single time.
This is the problem the design solves.
Defining the Requirements
Before jumping to architecture, get the requirements sharp.
Functional:
Track every meaningful user action across all front-ends (web app, customer app, agent app)
Generate scheduled reports (daily customer activity, monthly agent performance)
Support on-demand report generation via API
Archive old data cost-effectively
Non-Functional:
The analytics pipeline must not slow down the core application
Reports can tolerate some delay — this is not a real-time dashboard
The system should be pluggable — drop it into any service without re-architecting that service
Data older than 12 months should move to cheaper storage automatically
Notice what's not in the requirements: low latency reads, strong consistency, sub-second aggregations. This is an eventually consistent, batch-oriented analytics system. That shapes every decision downstream.
Estimation
Quick back-of-envelope to understand the data shape:
Assume ~50 agent actions per agent per day, ~200 agents → ~10,000 events/day
Each event ~2KB (enriched JSON with context) → ~20MB/day at raw staging
Monthly reports cover ~300,000 events → manageable for Lambda batch processing
After 12 months: ~7GB of raw event data → S3 archival is cost-effective at this scale
At this scale, you don't need a distributed streaming platform. Kafka is present here, but for a different reason — we'll get to that.
The Architecture: Three-Part System
The design splits into three loosely coupled parts. This is the plug-and-play property: each part can fail or be replaced without breaking the others.
Part 1 — Event Capture and Staging
Every front-end (web, mobile, agent) fires events on meaningful user actions. The back-end receives these events, enriches them based on event type (adds agent ID, customer context, timestamp, session metadata), and writes them to the existing RDBMS.
Why RDBMS at this stage? Because each event is:
Written once — no updates, no reads until the cron picks it up
Structured — enrichment at write time means the schema is predictable
Transactional — you want writes to succeed or fail atomically
A cron job runs at a configured interval, reads the unprocessed events, copies them to AWS S3, and on success marks them for deletion. A second pass deletes the marked records. This two-step (mark → delete) pattern is intentional — if the S3 write fails mid-batch, nothing gets deleted. You get at-least-once delivery to S3, which is fine for analytics.
Data lifecycle: After 12 months in S3, events are moved to local NAS (or cold storage like S3 Glacier). This keeps active storage costs low without destroying the audit trail.
[Web App / Mobile / Agent App]
|
▼
[Back-end Service] → enriches event → [RDBMS Staging Table]
|
[Cron Job]
|
[AWS S3 Bucket]
|
(after 12 months) → [NAS / Cold Storage]
Key interview insight: The cron-based approach is deliberately simple. In a higher-volume system you'd replace it with a CDC (Change Data Capture) mechanism using Debezium or similar — but at this scale, simplicity wins.
Part 2 — S3 Processing and Report Generation
Once data lands in S3, report generation becomes a data transformation problem. Python scripts and AWS Lambda read from S3, apply aggregations, and write results (CSV, JSON) back to a designated S3 output directory.
Lambda works well here because:
Reports are batch jobs — they run on a schedule, not continuously
Individual report jobs are stateless and bounded in time
Lambda scales to run multiple reports in parallel without managing servers
Generated reports are stored in S3 and a notification (email with a download link) is sent to the requester. The report is never pushed directly — only a link. This decouples generation latency from delivery.
Part 3 — The Controller (On-Demand API + Scheduling)
This is the operability layer — and it's often missing from system design answers.
The controller is a small Java module that:
Schedules when to move data from staging to S3 (configurable intervals)
Schedules when to trigger which reports (daily, weekly, monthly)
Exposes a REST API for on-demand report generation
The on-demand API is fire-and-forget. The client calls POST /reports/generate with parameters, gets back a 202 Accepted immediately. The request is pushed to a Kafka topic. A consumer picks it up, processes the report, writes to S3, and emails the link.
Why Kafka here, not a simple task queue or direct Lambda invocation?
Durability: Kafka is log-based — messages are persisted and replayable if the consumer crashes mid-processing
Backpressure: If report demand spikes, Kafka buffers the requests; consumers process at their own pace
Replayability: If a report generation bug is found, you can replay the topic to regenerate affected reports
[Client] → POST /reports/generate
|
[REST Controller] → 202 Accepted
|
[Kafka Topic: report-requests]
|
[Report Consumer]
|
[S3 Output] → [Email Notification]
Trade-Off Discussion
These are the questions an interviewer is likely to probe:
Why not stream events directly to S3 instead of staging in RDBMS first?
You could use Kinesis Firehose or Kafka directly. But at this scale, staging in RDBMS first keeps the architecture within the team's existing operational expertise. The cron approach is easy to monitor, debug, and replay if something goes wrong. Streaming adds operational overhead that isn't justified until volume demands it.
Why not use a dedicated analytics DB (Redshift, BigQuery) instead of S3 + Lambda?
For ad hoc SQL queries and dashboards, a warehouse would be better. But the requirement here is structured reports on a schedule, not exploratory analytics. S3 + Lambda is cheaper, has no idle cost, and fits the access pattern exactly.
What happens if the cron fails halfway through a batch?
The mark-then-delete pattern handles this. Undeleted records will be picked up in the next cron run and re-written to S3 (idempotent if keyed properly). At-least-once delivery is acceptable for analytics.
What if report volume grows 100x?
Replace the cron with CDC + Kafka Streams for near-real-time event flow. Move report generation from Lambda to a Spark or Flink job on EMR for large aggregations. The S3-centric design makes this migration possible without changing the data model.
What This Design Gets Right
Three things worth remembering from this design:
Decoupling at every boundary. The front-ends don't know about S3. The cron doesn't know about report formats. The controller doesn't know about Lambda. Each part can be replaced independently.
Matching storage to access pattern. RDBMS for write-once structured events. S3 for batch reads and cost-effective archival. Kafka for durable async processing. No single store is doing all three jobs.
Operability is part of the design. The controller module — configurable schedules, on-demand API, email notifications — is what turns a pipeline into a product. Systems that work but can't be operated in production are incomplete designs.
Interview Cheat Sheet
| Question | Answer |
|---|---|
| Why stage in RDBMS before S3? | Write-once, structured, transactional — fits the access pattern; simpler than streaming at this scale |
| Why Kafka for on-demand reports? | Durable, replayable, handles backpressure — fire-and-forget API needs async processing guarantee |
| Why Lambda for report generation? | Stateless batch jobs, pay-per-use, no idle cost — matches the workload perfectly |
| How do you handle cron failure? | Mark-for-deletion before delete — unprocessed records are re-picked on next run |
| How do you scale this? | CDC + Kafka Streams replaces cron; Spark/Flink replaces Lambda at high volume |
| What's the data lifecycle? | Hot (RDBMS staging) → Warm (S3, 12 months) → Cold (NAS/Glacier) |
Closing Thought
The best system design answers don't start with "let's use Kafka and microservices." They start with the problem, reason through the constraints, and arrive at the simplest architecture that meets the requirements — with a clear articulation of what you'd change if those constraints shifted.
This pipeline is deliberately simple. And that's the point.
Found this useful? I write about distributed systems, fintech engineering, and senior-level interview prep at aliasgarmkantawala.hashnode.dev. Drop a reaction if you'd like more designs in this series.
