In our project, we needed to sync business data from MongoDB into ClickHouse to support multidimensional, low-latency queries and analytics. Given the client’s requirements for stability and ease of maintenance in the message queue layer, we ultimately chose RabbitMQ to build the data pipeline, along with in-house tools to support both historical and real-time data sync.
1. Data Pipeline Goals
Our primary objectives included:
- Supporting historical data synchronization from MongoDB;
- Enabling real-time ingestion and processing of new data;
- Ensuring high availability, traceability, and recoverability;
- Supporting dimension enrichment for multidimensional analytics.
Since the legacy business system relied heavily on MongoDB and wasn’t fully event-driven, we designed a hybrid sync approach combining partial historical sync with real-time updates.
2. Data Sync Pipeline Design
1. Data Ingestion Strategy
Historical Data Sync
- We pulled historical data from MongoDB in segmented batches keyed on the `createdAt` field;
- The scripts supported paginated processing and cursor-based reading for scalability.
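The batching described above can be sketched as a small window generator; each window becomes a bounded `createdAt` range query. This is an illustrative sketch, not our production script, and the filter shape shown in the comment assumes `createdAt` is stored as a native date.

```python
from datetime import datetime, timedelta

def time_windows(start: datetime, end: datetime, step: timedelta):
    """Yield half-open [lo, hi) windows covering [start, end)."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi

# Each window becomes one bounded MongoDB filter, e.g.:
#   {"createdAt": {"$gte": lo, "$lt": hi}}
# and the matching documents are streamed with a cursor and an explicit
# batch size, rather than skip/limit pagination, which degrades on large
# collections because the server must walk all skipped documents.
```

Because the windows are half-open and contiguous, every document is read exactly once even when the job is restarted from a window boundary.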
Change Data Capture (CDC)
- For modules that support async events, we published RabbitMQ messages from the business system;
- For immutable or hard-to-refactor modules, we used `updatedAt` for periodic incremental sync;
- Each collection had a corresponding consumer service subscribed to a dedicated queue.
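The consumer side of this setup can be sketched as follows. The queue-naming convention and the message envelope (`collection`, `op`, `doc` keys) are assumptions made for the example; the real schema may differ.

```python
import json

def queue_for(collection: str) -> str:
    # One dedicated queue per MongoDB collection (illustrative naming scheme).
    return f"sync.{collection}"

def handle_message(body: bytes) -> dict:
    """Decode and validate one CDC message published by the business system.

    Assumed envelope (hypothetical):
      {"collection": "...", "op": "upsert" | "delete", "doc": {...}}
    Messages that fail validation are rejected so the broker can route
    them to the dead-letter queue instead of silently dropping them.
    """
    msg = json.loads(body)
    if msg.get("op") not in ("upsert", "delete"):
        raise ValueError(f"unknown op: {msg.get('op')!r}")
    return msg
```

For the polling-based modules, each pass simply queries `{"updatedAt": {"$gt": last_checkpoint}}` and advances the checkpoint to the maximum `updatedAt` seen in the batch.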
2. Data Writing Strategy
- Defined a field mapping config between MongoDB and ClickHouse for each business table;
- Used JSON configs to handle type conversions (e.g., dates, booleans);
- Applied a `CacheService` for dimension enrichment, such as department or business unit names;
- Leveraged ClickHouse's batch `INSERT INTO` for optimal write performance;
- Failed records were routed to a dead-letter queue for review.
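The mapping-driven conversion step can be illustrated like this. The field names, target columns, and type rules here are made up for the example; in our setup the equivalent configuration lives in JSON files per business table.

```python
from datetime import datetime, timezone

# Hypothetical mapping for one collection: source field -> target column + type.
ORDER_MAPPING = {
    "_id":       {"target": "order_id",   "type": "string"},
    "createdAt": {"target": "created_at", "type": "datetime"},
    "isPaid":    {"target": "is_paid",    "type": "bool"},
}

def convert(value, type_name: str):
    if type_name == "datetime":
        # Normalize ISO strings or native dates to UTC datetimes.
        if isinstance(value, str):
            value = datetime.fromisoformat(value)
        return value.astimezone(timezone.utc)
    if type_name == "bool":
        return 1 if value else 0  # ClickHouse UInt8 convention
    return str(value)

def to_row(doc: dict, mapping: dict) -> dict:
    """Project a MongoDB document onto a ClickHouse row per the mapping."""
    return {
        rule["target"]: convert(doc[field], rule["type"])
        for field, rule in mapping.items()
        if field in doc
    }
```

Converted rows are buffered and flushed with a single batch `INSERT INTO`; any document that fails conversion is published to the dead-letter queue with its error instead of aborting the batch.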
3. ClickHouse Table Design
1. Raw Fact Tables (fact_xx)
- Auto-generated DDLs via scripts based on field mappings;
- Most tables used `ReplacingMergeTree` with `updated_at` as the version column for deduplication;
- Tables requiring historical versions used plain `MergeTree`.
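A minimal sketch of the DDL generation mentioned above, assuming the mapping already carries ClickHouse types (table and column names are illustrative):

```python
def fact_table_ddl(table: str, columns: dict, order_by: str,
                   version_col: str = "updated_at") -> str:
    """Render a ReplacingMergeTree DDL from a field mapping.

    `columns` maps column name -> ClickHouse type string.
    ReplacingMergeTree(version_col) keeps, for each ORDER BY key, the row
    with the highest version during background merges -- which is exactly
    the deduplication behavior the fact tables rely on.
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ENGINE = ReplacingMergeTree({version_col})\n"
        f"ORDER BY {order_by}"
    )
```

One caveat worth noting: `ReplacingMergeTree` deduplicates only at merge time, so queries that must see exactly one row per key need `FINAL` or the `argMax`-based views described below.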
2. Materialized Aggregation Tables (mv_xx)
- Aggregation states (e.g., `argMaxState`) were used for efficient dimension-aware materialization.
3. Final Views for Business Queries (latest_xx)
- Views merged aggregation states (e.g., `argMaxMerge`) to ensure accurate, real-time results.
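The `mv_xx` / `latest_xx` pairing can be illustrated with a hypothetical `fact_orders` table (all table and column names below are invented for the example). `argMaxState(col, updated_at)` stores a partial aggregation state capturing the value of `col` at the latest `updated_at`; `argMaxMerge` collapses those states at query time, so the view always reflects the newest version of each row.

```python
# Generated DDL pair for one business table (illustrative schema).
MV_DDL = """
CREATE MATERIALIZED VIEW mv_orders
ENGINE = AggregatingMergeTree
ORDER BY order_id
AS SELECT
  order_id,
  argMaxState(status, updated_at) AS status_state,
  maxState(updated_at)            AS updated_at_state
FROM fact_orders
GROUP BY order_id
"""

LATEST_VIEW_DDL = """
CREATE VIEW latest_orders AS
SELECT
  order_id,
  argMaxMerge(status_state)  AS status,
  maxMerge(updated_at_state) AS updated_at
FROM mv_orders
GROUP BY order_id
"""
```

The design choice here is to pay the aggregation cost incrementally at insert time (the materialized view updates as rows arrive) and keep the final merge cheap, so business queries against `latest_xx` stay fast even as the fact tables grow.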
Conclusion
Without introducing Kafka, we built a lightweight real-time analytics pipeline using MongoDB, RabbitMQ, and ClickHouse. This setup balances system stability, flexible data control, and high-performance analytics — meeting both the agility and reliability demands of our data infrastructure.