In a gradual Shopify-to-Medusa migration, the data synchronization layer is the foundation of the entire first phase. Before replacing any business module, products, orders, and customers must be continuously synchronized from Shopify into your own infrastructure.
The entry point of this synchronization pipeline is the webhook.
Whether a product is updated, an order is created, or inventory changes, Shopify pushes an HTTP request to your service. The endpoint looks like any other API route, but a production-ready synchronization system requires much more than exposing a simple /webhook route.
This article walks through the complete architecture of a reliable webhook pipeline, covering request verification, event persistence, asynchronous processing, idempotency, retry mechanisms, dead-letter queues, state tracking, and event replay.
Challenges
A production-grade webhook pipeline must answer several important questions:
- How do we verify that a request actually comes from Shopify?
- How can we respond within Shopify's timeout requirements?
- What happens if Shopify delivers the same event multiple times?
- How should failed workers be retried automatically?
- What should happen after all retry attempts are exhausted?
- How can we track the processing status of every event?
- How can production failures be recovered precisely without replaying everything?
Overall Architecture
The core principle is simple:
Webhook reception and business processing must be separated.
The webhook receiver is responsible for accepting events reliably, while asynchronous workers are responsible for processing them. Event persistence and message queues bridge these two responsibilities.
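Shopify expects a webhook endpoint to return a 2xx response (typically HTTP 200) within 5 seconds. Otherwise, the delivery is considered failed and is retried automatically. Shopify's documentation has described the retry schedule as up to 19 attempts over 48 hours, while more recent versions describe 8 attempts over 4 hours, so check the current documentation for the exact numbers.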
If business logic such as database queries, field mapping, or downstream synchronization is executed directly inside the webhook endpoint, response latency becomes unpredictable and downstream failures immediately affect Shopify's retry mechanism.
Instead, the receiver does only the minimum: accept the event safely, enqueue it, and return immediately.
Shopify
│
▼
Webhook API
│
HMAC Verification + Idempotency Check
│
Persist Raw Event (pending)
│
Enqueue BullMQ Job
│
Return HTTP 200 Immediately
│
▼
Async Worker
│
├────────────────────────┐
Success                  Failure
│                        │
done                     Exponential Backoff Retry
                         │
                         Retry Limit Exceeded
                         │
                         Dead Letter Queue
                         │
                         Status → failed
                         Trigger Alert
Step 1: Verify Request Authenticity
Every Shopify webhook includes an HMAC signature in the X-Shopify-Hmac-SHA256 header.
After receiving a request, the server recomputes the HMAC-SHA256 digest of the request body using the configured shared secret, base64-encodes it, and compares it with the header value, ideally with a constant-time comparison. If they do not match, the request is rejected immediately, preventing forged requests from entering the system.
One common pitfall is that HMAC verification must be calculated from the raw request body. If the framework parses the JSON payload before calculating the signature, the generated digest will differ from Shopify's original signature, causing legitimate requests to fail verification.
In NestJS, this typically requires preserving the raw request body instead of relying solely on the default JSON parser.
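The verification step can be sketched as follows. This is a minimal sketch using only Node's built-in crypto module; the function name `verifyShopifyHmac` and its parameters are illustrative assumptions, not part of any Shopify SDK.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/**
 * Returns true when `headerHmac` (the X-Shopify-Hmac-SHA256 value) equals the
 * base64-encoded HMAC-SHA256 of the raw, unparsed request body.
 * `secret` is the shared secret configured for the app (assumed to be
 * supplied by the caller, for example from an environment variable).
 */
function verifyShopifyHmac(
  rawBody: Buffer,
  headerHmac: string | undefined,
  secret: string,
): boolean {
  if (!headerHmac) return false;

  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(headerHmac, "base64");

  // timingSafeEqual throws when lengths differ, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

In NestJS, one way to obtain the unparsed bytes is to pass `rawBody: true` to `NestFactory.create` and read `req.rawBody` in the controller. Treat that as one option to check against the NestJS version in use rather than the only approach.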
Step 2: Perform Idempotency Checks
After HMAC verification succeeds, the next step is idempotency validation.
Shopify webhooks follow an at-least-once delivery model rather than exactly-once delivery. Network interruptions, temporary service outages, or Shopify's own retry strategy may cause the same event to be delivered multiple times.
Without idempotency protection, the same order could be processed twice or the same product could be updated repeatedly, resulting in duplicate records or inconsistent state.
Each webhook contains a unique identifier in the X-Shopify-Webhook-Id header. Before persisting the event, the receiver checks whether this ID already exists in the database.
- If it exists, the event has already been received and the service returns HTTP 200 immediately.
- If it does not exist, processing continues.
Performing idempotency checks at the receiver layer is more efficient than doing so inside workers. Duplicate events are filtered before entering the queue, avoiding unnecessary queue storage and worker execution.
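A receiver-side duplicate check might look like the sketch below. The `SeenEventStore` interface and the in-memory implementation are hypothetical stand-ins. In a real deployment the same contract would more likely be backed by a database table with a unique constraint on the webhook ID, so that two concurrent deliveries of the same event cannot both pass a separate "check, then insert" sequence.

```typescript
/** Hypothetical storage contract for webhook IDs that have been received. */
interface SeenEventStore {
  /** Records the ID atomically. Resolves to false if it was already present. */
  insertIfAbsent(webhookId: string): Promise<boolean>;
}

/** In-memory stand-in, for illustration and tests only. */
class InMemorySeenEventStore implements SeenEventStore {
  private readonly seen = new Set<string>();

  async insertIfAbsent(webhookId: string): Promise<boolean> {
    if (this.seen.has(webhookId)) return false;
    this.seen.add(webhookId);
    return true;
  }
}

type ReceiveResult = "accepted" | "duplicate";

/**
 * Decides whether a delivery should continue through the pipeline.
 * A duplicate is still acknowledged with HTTP 200 by the caller;
 * it is simply not persisted or enqueued again.
 */
async function acceptOnce(
  store: SeenEventStore,
  webhookId: string,
): Promise<ReceiveResult> {
  const isNew = await store.insertIfAbsent(webhookId);
  return isNew ? "accepted" : "duplicate";
}
```

Combining the existence check and the insert into one atomic operation is a design choice of this sketch, not something the pipeline above prescribes.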
Step 3: Persist the Raw Event
After passing idempotency validation, the entire webhook event is persisted to the database.
Typical fields include:
- Event ID (Shopify Webhook ID)
- Event Topic (products/update, orders/create, etc.)
- Shop Domain
- Raw Payload (stored without modification)
- Processing Status (pending)
- Received Timestamp
- Retry Count
- Error Message
The raw payload should always be stored completely.
The database is not merely acting as temporary storage—it serves as a durable event log.
Failed events can be replayed, historical events can be inspected during troubleshooting, and newly introduced business logic can process historical events without requiring Shopify to resend them.
The queue is responsible for scheduling, while the database is responsible for durability. Even if queued jobs are lost, events can always be recreated from the persisted event log.
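The fields listed above can be expressed as a record type. This is a sketch only: the field names, the `toEventRecord` helper, and the full status union are assumptions chosen for illustration, and the header keys are assumed to be lower-cased, as Node's HTTP server delivers them.

```typescript
type EventStatus = "pending" | "processing" | "retrying" | "done" | "failed";

/** One row of the event log. Field names are illustrative, not a fixed schema. */
interface WebhookEventRecord {
  eventId: string; // value of X-Shopify-Webhook-Id
  topic: string; // value of X-Shopify-Topic, e.g. "products/update"
  shopDomain: string; // value of X-Shopify-Shop-Domain
  rawPayload: string; // request body exactly as received
  status: EventStatus;
  receivedAt: Date;
  retryCount: number;
  errorMessage: string | null;
}

/** Builds the initial `pending` record from request headers and the raw body. */
function toEventRecord(
  headers: Record<string, string | undefined>,
  rawBody: Buffer,
  now: Date = new Date(),
): WebhookEventRecord {
  const eventId = headers["x-shopify-webhook-id"];
  const topic = headers["x-shopify-topic"];
  const shopDomain = headers["x-shopify-shop-domain"];
  if (!eventId || !topic || !shopDomain) {
    throw new Error("Missing required Shopify webhook headers");
  }
  return {
    eventId,
    topic,
    shopDomain,
    // Stored as received: no parsing, no re-serialization.
    rawPayload: rawBody.toString("utf8"),
    status: "pending",
    receivedAt: now,
    retryCount: 0,
    errorMessage: null,
  };
}
```

Keeping the payload as the original string rather than a parsed object means later replays see the same bytes that Shopify sent.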
Step 4: Enqueue and Return Immediately
Once persistence succeeds, a BullMQ job is created containing only the event ID, and the webhook endpoint immediately returns HTTP 200.
The receiver is designed to complete its work within 100 milliseconds, providing a significant safety margin below Shopify's timeout requirement.
Only the event ID is stored in the queue rather than the full payload. When processing begins, the worker loads the original payload from the database. This keeps queue messages lightweight and ensures all processing is based on the same persisted event record.
Separate queues are maintained for products, orders, and customers, each with dedicated workers.
This isolation provides several advantages:
- Webhook response time remains independent of downstream services.
- Workers can scale horizontally without affecting the receiver.
- Traffic spikes are naturally absorbed by the queue.
- Failures in one business domain do not block processing in others.
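Putting the receiver steps together, a framework-independent sketch might look like this. The `JobQueue` interface is a hypothetical stand-in shaped after the `add(name, data)` method of a BullMQ `Queue`. `ReceiverDeps`, `domainForTopic`, and the 401/400 status codes for rejected requests are assumptions made for illustration.

```typescript
/** Minimal queue contract; a BullMQ Queue offers a comparable add(name, data). */
interface JobQueue {
  add(name: string, data: { eventId: string }): Promise<unknown>;
}

type Domain = "products" | "orders" | "customers";

/** Maps a topic such as "orders/create" to its queue, or undefined if unhandled. */
function domainForTopic(topic: string): Domain | undefined {
  const prefix = topic.split("/")[0];
  return prefix === "products" || prefix === "orders" || prefix === "customers"
    ? prefix
    : undefined;
}

interface ReceiverDeps {
  verify(rawBody: Buffer, hmac: string | undefined): boolean;
  /** Persists the event as `pending`; resolves to false when the ID already exists. */
  persistIfNew(eventId: string, topic: string, rawBody: Buffer): Promise<boolean>;
  queues: Record<Domain, JobQueue>;
}

/** Returns the HTTP status code the endpoint should respond with. */
async function receiveWebhook(
  deps: ReceiverDeps,
  headers: Record<string, string | undefined>,
  rawBody: Buffer,
): Promise<number> {
  if (!deps.verify(rawBody, headers["x-shopify-hmac-sha256"])) return 401;

  const eventId = headers["x-shopify-webhook-id"];
  const topic = headers["x-shopify-topic"];
  if (!eventId || !topic) return 400;

  const isNew = await deps.persistIfNew(eventId, topic, rawBody);
  if (!isNew) return 200; // duplicate delivery: acknowledge and stop

  const domain = domainForTopic(topic);
  if (domain) {
    // Only the event ID goes on the queue; the worker reloads the payload.
    await deps.queues[domain].add(topic, { eventId });
  }
  // Topics outside the three domains are persisted but not enqueued here.
  return 200;
}
```

Because every dependency is injected, the same function can be called from a NestJS controller or exercised in a unit test with in-memory fakes.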
Step 5: Process Events Asynchronously
Workers consume events from the queue and execute the actual business logic.
Typical processing includes:
- Loading the persisted payload
- Transforming and mapping fields
- Writing data into the target database
- Updating processing status
A typical lifecycle looks like this:
pending
│
▼
processing
│
▼
done
Because workers are completely decoupled from the receiver, they can be deployed and scaled independently.
If order volume increases dramatically, only order workers need additional capacity. Likewise, fixing or redeploying a worker does not interrupt webhook reception.
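The worker side of the lifecycle can be sketched as a single function that a queue processor would call with the event ID from the job. The `EventStore` interface, the `Handler` type, and `processEvent` are hypothetical names; retry handling is deliberately left to the queue.

```typescript
type WorkerStatus = "pending" | "processing" | "done";

interface StoredEvent {
  eventId: string;
  topic: string;
  rawPayload: string;
  status: WorkerStatus;
}

/** Hypothetical persistence contract backed by the event log table. */
interface EventStore {
  load(eventId: string): Promise<StoredEvent | undefined>;
  setStatus(eventId: string, status: WorkerStatus): Promise<void>;
}

/** Business logic for one topic: map fields and write to the target database. */
type Handler = (payload: unknown, event: StoredEvent) => Promise<void>;

async function processEvent(
  store: EventStore,
  eventId: string,
  handler: Handler,
): Promise<void> {
  const event = await store.load(eventId);
  if (!event) throw new Error(`Event ${eventId} not found in the event log`);
  if (event.status === "done") return; // already handled, nothing to do

  await store.setStatus(eventId, "processing");
  // If the handler throws, the error propagates to the caller so that the
  // queue's retry policy can decide what happens next.
  await handler(JSON.parse(event.rawPayload), event);
  await store.setStatus(eventId, "done");
}
```

Loading the payload by ID inside the worker keeps the job small and guarantees that retries and replays operate on the same persisted record.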
Step 6: Retry Strategy and Dead Letter Queue
Worker failures are inevitable in production environments.
Temporary database outages, third-party API timeouts, unexpected edge cases, and software bugs can all cause processing failures.
Exponential Backoff Retry
Transient failures should be retried automatically using exponential backoff.
Each retry waits progressively longer than the previous one, preventing continuous pressure on downstream systems and giving dependent services time to recover.
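To make the schedule concrete, the delay before each retry can be computed as a base delay doubled per attempt, with an upper cap. The function name, the 1-second base, and the 60-second cap below are illustrative assumptions, not values taken from the pipeline described here.

```typescript
/**
 * Delay in milliseconds before retry number `attempt` (1-based):
 * baseMs, 2 * baseMs, 4 * baseMs, ... limited to capMs.
 */
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60_000,
): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// With the defaults, the first five retries wait 1s, 2s, 4s, 8s and 16s.
const exampleDelays = [1, 2, 3, 4, 5].map((n) => backoffDelayMs(n));
console.log(exampleDelays); // [ 1000, 2000, 4000, 8000, 16000 ]
```

With BullMQ, a hand-written function like this is often unnecessary: job options accept `attempts` together with `backoff: { type: "exponential", delay }`, which produces a comparable doubling schedule. Adding random jitter to each delay is a common refinement that this sketch omits.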
However, not every failure should be retried.
Network timeouts and temporary infrastructure failures are often recoverable, while validation errors, malformed payloads, or business rule violations will never succeed regardless of how many retries are attempted.
Permanent failures should be sent directly to the dead-letter queue.
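One way to encode the distinction is a dedicated error type that processing code throws for failures that can never succeed. The class name `PermanentError` and the helper `nextStepAfterFailure` are hypothetical and shown only to illustrate the decision.

```typescript
/** Thrown for failures that no retry can fix, such as a malformed payload. */
class PermanentError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PermanentError";
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

type NextStep = "retry" | "dead-letter";

/**
 * Decides what to do after a failed attempt.
 * `attemptsMade` counts attempts so far, including the one that just failed.
 */
function nextStepAfterFailure(
  error: unknown,
  attemptsMade: number,
  maxAttempts: number,
): NextStep {
  if (error instanceof PermanentError) return "dead-letter";
  return attemptsMade < maxAttempts ? "retry" : "dead-letter";
}
```

BullMQ ships a comparable mechanism: throwing its `UnrecoverableError` from a processor fails the job without further retries, so a custom class like this one could be mapped onto it.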
Dead Letter Queue
The dead-letter queue (DLQ) is the final safety net rather than a dumping ground.
When retry attempts exceed the configured limit, the event is moved into the DLQ and an alert is triggered.
Engineers can inspect the original payload and failure history, fix the underlying issue, and replay the event back into the normal processing pipeline.
A DLQ containing messages indicates that manual investigation is required. Alerting ensures these failures never become silent data loss.
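The behaviour described above can be sketched with a small in-memory class. Everything here is a hypothetical stand-in: a real DLQ would live in the queue system or the database, and `AlertFn` would wrap whatever alerting channel the team uses.

```typescript
interface FailedEvent {
  eventId: string;
  topic: string;
  rawPayload: string;
  errorHistory: string[]; // one message per failed attempt, oldest first
}

interface DeadLetterEntry extends FailedEvent {
  failedAt: Date;
}

type AlertFn = (message: string) => void;

class DeadLetterQueue {
  private readonly entries = new Map<string, DeadLetterEntry>();
  private readonly alert: AlertFn;

  constructor(alert: AlertFn) {
    this.alert = alert;
  }

  /** Stores the event with its failure history and raises an alert. */
  add(event: FailedEvent, now: Date = new Date()): void {
    this.entries.set(event.eventId, { ...event, failedAt: now });
    const lastError =
      event.errorHistory[event.errorHistory.length - 1] ?? "unknown error";
    this.alert(`Event ${event.eventId} (${event.topic}) dead-lettered: ${lastError}`);
  }

  /** Removes and returns an entry so it can be re-enqueued after a fix. */
  take(eventId: string): DeadLetterEntry | undefined {
    const entry = this.entries.get(eventId);
    this.entries.delete(eventId);
    return entry;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Alerting inside `add` means an event cannot enter the DLQ without someone being notified, which is the property that prevents silent data loss.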
Step 7: State Management and Event Replay
Every event is managed through an explicit state machine.
pending
│
▼
processing
├────────────► done
│
▼
retrying
│
├────────────► processing
│
▼
failed (DLQ)
Workers update event status throughout the processing lifecycle, making the entire pipeline observable.
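The diagram above translates directly into a transition table that workers can consult before writing a new status. The names `ALLOWED_TRANSITIONS`, `canTransition`, and `transition` are illustrative assumptions.

```typescript
type EventStatus = "pending" | "processing" | "retrying" | "done" | "failed";

/** Allowed next states for each state, mirroring the diagram. */
const ALLOWED_TRANSITIONS: Record<EventStatus, readonly EventStatus[]> = {
  pending: ["processing"],
  processing: ["done", "retrying"],
  retrying: ["processing", "failed"],
  done: [],
  // Terminal in this table; replay is a separate, deliberate operator action.
  failed: [],
};

function canTransition(from: EventStatus, to: EventStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

/** Returns the new status, or throws if the move is not in the table. */
function transition(from: EventStatus, to: EventStatus): EventStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal status transition: ${from} -> ${to}`);
  }
  return to;
}
```

Rejecting illegal moves at write time keeps the status column trustworthy, which the observability queries listed next depend on.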
At any point, operators can identify:
- Events currently being processed
- Successfully completed events
- Failed events
- Retry history
- Error messages
- Affected business data
Event replay is built on top of this state model.
When a production issue is fixed—for example, after correcting a processing bug—failed events can be selected and re-enqueued without affecting events that have already completed successfully.
This enables precise recovery instead of relying on risky full-scale reprocessing.
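A selective replay could be sketched as follows. The in-memory `rows` array stands in for a database query, and `ReplayFilter`, `selectForReplay`, and `replayFailed` are hypothetical names. Resetting a failed event to pending with a cleared retry count is an assumption about how replay interacts with the state model.

```typescript
interface EventRow {
  eventId: string;
  topic: string;
  status: "pending" | "processing" | "retrying" | "done" | "failed";
  retryCount: number;
  errorMessage: string | null;
}

interface ReplayFilter {
  topicPrefix?: string; // e.g. "orders/"
  errorContains?: string; // e.g. a fragment of the fixed bug's error message
}

/** Picks failed events that match the filter; other statuses are never selected. */
function selectForReplay(rows: EventRow[], filter: ReplayFilter): EventRow[] {
  return rows.filter((row) => {
    if (row.status !== "failed") return false;
    if (filter.topicPrefix && !row.topic.startsWith(filter.topicPrefix)) return false;
    if (filter.errorContains) {
      const message = row.errorMessage ?? "";
      if (message.indexOf(filter.errorContains) === -1) return false;
    }
    return true;
  });
}

/** Resets matching events to pending and re-enqueues their IDs. */
function replayFailed(
  rows: EventRow[],
  filter: ReplayFilter,
  enqueue: (eventId: string) => void,
): string[] {
  const replayed: string[] = [];
  for (const row of selectForReplay(rows, filter)) {
    row.status = "pending";
    row.retryCount = 0;
    row.errorMessage = null;
    enqueue(row.eventId);
    replayed.push(row.eventId);
  }
  return replayed;
}
```

For example, after fixing a mapping bug that only affected orders, an operator could call `replayFailed(rows, { topicPrefix: "orders/" }, enqueue)` and leave failed product events, and every completed event, untouched.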
Observability is not a luxury for production systems. The ability to identify and recover from failures within minutes fundamentally changes operational efficiency.
Conclusion
The seven stages of this pipeline all address the same fundamental goal:
Ensure every Shopify event is received safely, processed reliably, and recoverable when failures occur.
The receiver acknowledges events quickly without performing heavy business logic. Idempotency filtering prevents duplicate processing at the entry point. Raw events are persisted as a durable event log. Message queues decouple reception from processing. Retry mechanisms and dead-letter queues prevent data loss. State management and event replay make the entire pipeline observable and recoverable.
A webhook is more than an HTTP endpoint—it is an event-driven data pipeline. Only by designing reliability into every stage of reception, persistence, scheduling, and processing can a gradual Shopify-to-Medusa migration operate safely and consistently in production.