In our project, we needed to sync business data from MongoDB into ClickHouse to support multidimensional, low-latency queries and analytics. Given the client’s requirements for stability and ease of maintenance in the message queue layer, we ultimately chose RabbitMQ to build the data pipeline, along with in-house tools to support both historical and real-time data sync.
1. Data Pipeline Goals
Our primary objectives included:
- Supporting historical data synchronization from MongoDB;
- Enabling real-time ingestion and processing of new data;
- Ensuring high availability, traceability, and recoverability;
- Supporting dimension enrichment for multidimensional analytics.
Since the legacy business system relied heavily on MongoDB and wasn’t fully event-driven, we designed a hybrid sync approach combining partial historical sync with real-time updates.
2. Data Sync Pipeline Design
1. Data Ingestion Strategy
Historical Data Sync
- We pulled historical data from MongoDB in segmented batches keyed on the `createdAt` field;
- The scripts supported paginated processing and cursor-based reading for scalability.
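The batching described above can be sketched as a small window generator; each window becomes a bounded `createdAt` range query. This is an illustrative sketch, not our production script, and the filter shape shown in the comment assumes `createdAt` is stored as a native date.

```python
from datetime import datetime, timedelta

def time_windows(start: datetime, end: datetime, step: timedelta):
    """Yield half-open [lo, hi) windows covering [start, end)."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi

# Each window becomes one bounded MongoDB filter, e.g.:
#   {"createdAt": {"$gte": lo, "$lt": hi}}
# and the matching documents are streamed with a cursor and an explicit
# batch size, rather than skip/limit pagination, which degrades on large
# collections because the server must walk all skipped documents.
```

Because the windows are half-open and contiguous, every document is read exactly once even when the job is restarted from a window boundary.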
Change Data Capture (CDC)
- For modules that support async events, we published RabbitMQ messages from the business system;
- For immutable or hard-to-refactor modules, we used `updatedAt` for periodic incremental sync;
- Each collection had a corresponding consumer service subscribed to a dedicated queue.
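The consumer side of this setup can be sketched as follows. The queue-naming convention and the message envelope (`collection`, `op`, `doc` keys) are assumptions made for the example; the real schema may differ.

```python
import json

def queue_for(collection: str) -> str:
    # One dedicated queue per MongoDB collection (illustrative naming scheme).
    return f"sync.{collection}"

def handle_message(body: bytes) -> dict:
    """Decode and validate one CDC message published by the business system.

    Assumed envelope (hypothetical):
      {"collection": "...", "op": "upsert" | "delete", "doc": {...}}
    Messages that fail validation are rejected so the broker can route
    them to the dead-letter queue instead of silently dropping them.
    """
    msg = json.loads(body)
    if msg.get("op") not in ("upsert", "delete"):
        raise ValueError(f"unknown op: {msg.get('op')!r}")
    return msg
```

For the polling-based modules, each pass simply queries `{"updatedAt": {"$gt": last_checkpoint}}` and advances the checkpoint to the maximum `updatedAt` seen in the batch.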
2. Data Writing Strategy
- Defined a field mapping config between MongoDB and ClickHouse for each business table;
- Used JSON configs to handle type conversions (e.g., dates, booleans);
- Applied a `CacheService` for dimension enrichment, such as department or business unit names;
- Leveraged ClickHouse's batch `INSERT INTO` for optimal write performance;
- Failed records were routed to a dead-letter queue for review.
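The mapping-driven conversion step can be illustrated like this. The field names, target columns, and type rules here are made up for the example; in our setup the equivalent configuration lives in JSON files per business table.

```python
from datetime import datetime, timezone

# Hypothetical mapping for one collection: source field -> target column + type.
ORDER_MAPPING = {
    "_id":       {"target": "order_id",   "type": "string"},
    "createdAt": {"target": "created_at", "type": "datetime"},
    "isPaid":    {"target": "is_paid",    "type": "bool"},
}

def convert(value, type_name: str):
    if type_name == "datetime":
        # Normalize ISO strings or native dates to UTC datetimes.
        if isinstance(value, str):
            value = datetime.fromisoformat(value)
        return value.astimezone(timezone.utc)
    if type_name == "bool":
        return 1 if value else 0  # ClickHouse UInt8 convention
    return str(value)

def to_row(doc: dict, mapping: dict) -> dict:
    """Project a MongoDB document onto a ClickHouse row per the mapping."""
    return {
        rule["target"]: convert(doc[field], rule["type"])
        for field, rule in mapping.items()
        if field in doc
    }
```

Converted rows are buffered and flushed with a single batch `INSERT INTO`; any document that fails conversion is published to the dead-letter queue with its error instead of aborting the batch.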
3. ClickHouse Table Design
1. Raw Fact Tables (fact_xx)
- Auto-generated DDLs via scripts based on field mappings;
- Most tables used `ReplacingMergeTree` with `updated_at` as the version column for deduplication;
- Tables requiring historical versions used plain `MergeTree`.
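A minimal sketch of the DDL generation mentioned above, assuming the mapping already carries ClickHouse types (table and column names are illustrative):

```python
def fact_table_ddl(table: str, columns: dict, order_by: str,
                   version_col: str = "updated_at") -> str:
    """Render a ReplacingMergeTree DDL from a field mapping.

    `columns` maps column name -> ClickHouse type string.
    ReplacingMergeTree(version_col) keeps, for each ORDER BY key, the row
    with the highest version during background merges -- which is exactly
    the deduplication behavior the fact tables rely on.
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ENGINE = ReplacingMergeTree({version_col})\n"
        f"ORDER BY {order_by}"
    )
```

One caveat worth noting: `ReplacingMergeTree` deduplicates only at merge time, so queries that must see exactly one row per key need `FINAL` or the `argMax`-based views described below.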
2. Materialized Aggregation Tables (mv_xx)
- Aggregation states (e.g., `argMaxState`) were used for efficient dimension-aware materialization.
3. Final Views for Business Queries (latest_xx)
- Views merged aggregation states (e.g., `argMaxMerge`) to ensure accurate, real-time results.
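The `mv_xx` / `latest_xx` pairing can be illustrated with a hypothetical `fact_orders` table (all table and column names below are invented for the example). `argMaxState(col, updated_at)` stores a partial aggregation state capturing the value of `col` at the latest `updated_at`; `argMaxMerge` collapses those states at query time, so the view always reflects the newest version of each row.

```python
# Generated DDL pair for one business table (illustrative schema).
MV_DDL = """
CREATE MATERIALIZED VIEW mv_orders
ENGINE = AggregatingMergeTree
ORDER BY order_id
AS SELECT
  order_id,
  argMaxState(status, updated_at) AS status_state,
  maxState(updated_at)            AS updated_at_state
FROM fact_orders
GROUP BY order_id
"""

LATEST_VIEW_DDL = """
CREATE VIEW latest_orders AS
SELECT
  order_id,
  argMaxMerge(status_state)  AS status,
  maxMerge(updated_at_state) AS updated_at
FROM mv_orders
GROUP BY order_id
"""
```

The design choice here is to pay the aggregation cost incrementally at insert time (the materialized view updates as rows arrive) and keep the final merge cheap, so business queries against `latest_xx` stay fast even as the fact tables grow.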
Conclusion
Without introducing Kafka, we built a lightweight real-time analytics pipeline using MongoDB, RabbitMQ, and ClickHouse. This setup balances system stability, flexible data control, and high-performance analytics — meeting both the agility and reliability demands of our data infrastructure.