Kafka · ClickHouse

Kafka + ClickHouse High-Performance Analytics in Practice: Performance Challenges and Solutions Amid Data Growth

RaytonX

As our business continued to grow, user scale and data volumes expanded exponentially. The original monolithic synchronous write approach began to expose significant performance bottlenecks. This post shares how we evolved from a single MongoDB architecture to a Kafka + ClickHouse analytics platform in production, and the lessons we learned along the way.

1. Business Background and Context

Our system is responsible for recording and analyzing user behavior and business data. As data volumes increased, simultaneous writes and analytical queries placed growing pressure on the same MongoDB cluster, resulting in issues such as:

  • Longer API response times that degraded user experience.
  • Analytical queries blocking online writes, causing occasional request timeouts.
  • Increasing complexity when scaling storage to keep up with growth.

To address these challenges, we needed an architecture that could support high-concurrency writes while also powering complex analytics workloads.

2. Overview of the Original Architecture

In the early stage, our approach was quite simple:

  • All data was synchronously written to MongoDB.
  • The admin dashboard queried MongoDB directly using aggregation pipelines.
  • Both write and query workloads shared the same database.

This solution was efficient and cost-effective at first, enabling rapid development. However, as daily active users and data volumes grew, the architecture reached its limits.
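To make the original query path concrete, here is a sketch of the kind of aggregation pipeline the dashboard ran directly against MongoDB. The collection and field names (`userEvents`, `eventType`, `createdAt`) are illustrative, not taken from our actual schema:

```typescript
// A hypothetical dashboard query in the original architecture:
// daily event counts per event type, run straight against the
// same MongoDB cluster that served online writes.
const dailyEventCounts = [
  // Restrict the scan to the reporting window.
  { $match: { createdAt: { $gte: new Date("2024-01-01") } } },
  // Bucket events by calendar day and event type.
  {
    $group: {
      _id: {
        day: { $dateToString: { format: "%Y-%m-%d", date: "$createdAt" } },
        eventType: "$eventType",
      },
      count: { $sum: 1 },
    },
  },
  // Order the buckets chronologically for the chart.
  { $sort: { "_id.day": 1 } },
];

// With the Node.js driver this would run as:
//   db.collection("userEvents").aggregate(dailyEventCounts)
console.log(dailyEventCounts.length); // 3 pipeline stages
```

Every such query competed with write traffic for the same CPU and I/O, which is exactly the contention described in the next section.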

3. Manifestations of Performance Bottlenecks

The main bottlenecks were:

  1. Increased Write Latency

    • Under high-concurrency writes, pressure on MongoDB nodes grew significantly.
    • Index updates coupled with writes led to steadily rising response times.
  2. Declining Query Throughput

    • Aggregation queries consumed large amounts of CPU and I/O, blocking writes.
    • Query performance fluctuated unpredictably during peak traffic.
  3. Limited Scalability

    • As data volumes increased, sharding and scaling MongoDB became more complex.
    • Operational costs kept rising.

4. Goals of the Refactoring

Given these challenges, we set out clear goals:

  • Reduce API latency by decoupling writes and analytics.
  • Improve analytical capabilities by adopting a storage engine optimized for large-scale aggregation.
  • Ensure data consistency across multiple storage systems.

5. New Architecture Overview

To address the issues above, we designed a new architecture:

  • Kafka Integration

    • Backend APIs switched from synchronous writes to asynchronous writes to Kafka.
    • This decouples production and consumption and helps smooth traffic spikes.
  • Dual Writes to MongoDB and ClickHouse

    • Kafka consumers ingest data from the queue.
    • Data is persisted in both MongoDB (business storage) and ClickHouse (analytics storage).
  • Dedicated Analytics API

    • We built a standalone analytics API service using NestJS.
    • This service exclusively powers high-performance queries for the dashboard.

With this design, writes and queries are fully decoupled, significantly improving scalability and stability.
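The flow above can be sketched end to end. This is a minimal in-memory simulation of the decoupled write path; the array standing in for the topic would be a real Kafka topic (via a client such as kafkajs), and the two store arrays would be actual MongoDB and ClickHouse writers. All names here are illustrative:

```typescript
// In-memory sketch of the new write path: the API produces to a
// queue and returns immediately; a consumer drains the queue and
// dual-writes each event to both storage systems.

interface UserEvent {
  eventId: string;
  eventType: string;
  ts: number;
}

// Stand-in for a Kafka topic.
const topic: UserEvent[] = [];

// API layer: instead of a synchronous MongoDB write, the handler
// just produces the event (producer.send(...) in the real system).
function handleApiRequest(event: UserEvent): void {
  topic.push(event);
}

// Stand-ins for the two storage systems.
const mongoStore: UserEvent[] = [];      // business storage
const clickhouseStore: UserEvent[] = []; // analytics storage

// Consumer: drains the topic and persists each event twice.
function runConsumerOnce(): void {
  while (topic.length > 0) {
    const event = topic.shift()!;
    mongoStore.push(event);      // write to MongoDB
    clickhouseStore.push(event); // batch insert into ClickHouse
  }
}

handleApiRequest({ eventId: "e1", eventType: "click", ts: 1 });
handleApiRequest({ eventId: "e2", eventType: "view", ts: 2 });
runConsumerOnce();
console.log(mongoStore.length, clickhouseStore.length); // 2 2
```

The key property the sketch shows is that `handleApiRequest` never blocks on storage: a spike in traffic lengthens the queue rather than the API response time.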

6. Principles Behind Technology Selection

Throughout the architecture design process, we followed several core principles:

  • Decoupling: Kafka ensures production and consumption are fully separated.
  • Scalability: Both Kafka and ClickHouse are horizontally scalable.
  • Low Latency: Asynchronous writes dramatically reduce API response time.
  • Consistency: Idempotent consumers and scheduled reconciliation checks between the two stores help ensure data integrity.
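The idempotency principle deserves a concrete illustration, since Kafka delivers at-least-once by default and redeliveries are routine. Below is a simplified dedup guard keyed by event ID; in production the processed-ID state would live in a durable store rather than process memory, and the field names are hypothetical:

```typescript
// Sketch of an idempotent consumer: redelivered messages with an
// already-seen eventId are skipped so that dual writes to MongoDB
// and ClickHouse never produce duplicate rows.

interface Event {
  eventId: string;
  payload: string;
}

const processed = new Set<string>(); // durable keyed store in production
const sink: Event[] = [];            // stand-in for the downstream stores

function consume(event: Event): boolean {
  if (processed.has(event.eventId)) {
    return false; // duplicate delivery: do not write again
  }
  processed.add(event.eventId);
  sink.push(event);
  return true;
}

consume({ eventId: "e1", payload: "a" });
consume({ eventId: "e1", payload: "a" }); // redelivery, ignored
console.log(sink.length); // 1
```

Because the guard makes each write a no-op on repeat, the consumer can safely reprocess a partition after a crash or rebalance without corrupting either store.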

7. Conclusion

Evolving from a monolithic MongoDB architecture to a distributed Kafka + ClickHouse platform was a key milestone in addressing data growth and performance bottlenecks. By decoupling writes and queries, we not only improved system performance but also laid a solid foundation for future scalability and maintainability.