Why BullMQ
In the Node.js ecosystem, BullMQ has become one of the most widely adopted production-ready job queue solutions. Built on Redis, it provides robust queue management, Worker concurrency control, retry strategies, and lifecycle event hooks.
For our Shopify-to-Medusa integration, BullMQ offers several key advantages:
- Persistent jobs: Tasks are stored in Redis and survive service restarts.
- Flexible retry policies: Different retry strategies can be configured for different failure scenarios.
- Built-in failure handling: Failed jobs are tracked natively, making it easy to integrate with a Dead Letter Queue (DLQ).
- Lifecycle events: Every stage of a Worker's execution exposes events that can be connected to status management and monitoring systems.
Within NestJS, @nestjs/bullmq integrates seamlessly with the framework, allowing Queues and Workers (Processors) to be managed as injectable services with a clean project structure.
Queue Isolation: Separate Event Types, Separate Queues
A straightforward approach is to put every Shopify webhook into a single queue. While this works for small systems, it introduces a fundamental problem: failures in one event type can block the processing of all others.
For example, suppose product synchronization encounters an unexpected edge case and starts generating failed jobs that are continuously retried. If order and customer events share the same queue, they end up waiting behind those failed jobs, delaying unrelated business processes.
Instead, we create dedicated queues for each event type:
webhook-products → ProductsProcessor
webhook-orders → OrdersProcessor
webhook-customers → CustomersProcessor
Each queue operates independently, ensuring that failures in one data domain do not impact the others.
This design also enables independent configuration of concurrency, retry policies, and priorities based on the characteristics of each workload.
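A per-queue settings table is one way to express this. The sketch below reuses the queue names above and assumes webhook topics of the form "orders/create"; the QueueSettings shape, the resolveQueue helper, and every numeric value are hypothetical placeholders rather than recommended settings.

```typescript
// Hypothetical routing table: one queue per Shopify event domain.
// The numeric values are placeholders for illustration only.
type Domain = "products" | "orders" | "customers";

interface QueueSettings {
  queueName: string;
  concurrency: number; // parallel jobs per Worker
  attempts: number;    // total attempts, including the first run
}

const QUEUE_SETTINGS: Record<Domain, QueueSettings> = {
  products:  { queueName: "webhook-products",  concurrency: 10, attempts: 5 },
  orders:    { queueName: "webhook-orders",    concurrency: 3,  attempts: 5 },
  customers: { queueName: "webhook-customers", concurrency: 5,  attempts: 3 },
};

// Resolve a webhook topic such as "orders/create" to its dedicated queue.
// Returns undefined for topics that no queue handles.
function resolveQueue(topic: string): QueueSettings | undefined {
  const [domain = ""] = topic.split("/");
  return Object.prototype.hasOwnProperty.call(QUEUE_SETTINGS, domain)
    ? QUEUE_SETTINGS[domain as Domain]
    : undefined;
}
```

Keeping the settings in one table makes the differences between workloads visible at a glance, and an unknown topic fails fast instead of landing in the wrong queue.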
Choosing the right concurrency level
Higher concurrency does not always mean better performance. The optimal value depends on the capacity of downstream systems.
If the downstream dependency is PostgreSQL, excessive concurrency may exhaust the database connection pool. If the Worker communicates with Medusa APIs, high concurrency may trigger rate limits.
A practical starting point is to estimate concurrency based on database connection limits and API rate limits, then fine-tune it through load testing.
Order processing usually involves more complex business logic and database operations, so a conservative concurrency setting is appropriate. Product synchronization is generally more independent and can safely run with higher concurrency. This level of workload-specific optimization would not be possible with a single shared queue.
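The starting estimate can be written down as a small calculation. This is a rough sketch, not a sizing formula from BullMQ: the CapacityInputs fields are hypothetical, and the API bound assumes that N concurrent jobs issue roughly N × requestsPerJob / avgJobSeconds requests per second.

```typescript
// Hypothetical inputs describing downstream capacity.
interface CapacityInputs {
  dbPoolSize: number;           // connections available to this Worker process
  dbConnectionsPerJob: number;  // connections one job holds while running
  apiRequestsPerSecond: number; // downstream API rate limit
  apiRequestsPerJob: number;    // API calls a single job makes
  avgJobSeconds: number;        // average job duration
}

// Take the tighter of the two bounds, and never go below 1.
function estimateConcurrency(c: CapacityInputs): number {
  // Each running job occupies dbConnectionsPerJob connections from the pool.
  const dbBound = Math.floor(c.dbPoolSize / c.dbConnectionsPerJob);
  // Keep N * apiRequestsPerJob / avgJobSeconds at or below the rate limit.
  const apiBound = Math.floor(
    (c.apiRequestsPerSecond * c.avgJobSeconds) / c.apiRequestsPerJob,
  );
  return Math.max(1, Math.min(dbBound, apiBound));
}

// Example: a pool of 20 connections (2 per job) allows 10 jobs, but a limit
// of 10 requests/s with 4 requests per 2-second job allows only 5.
const startingConcurrency = estimateConcurrency({
  dbPoolSize: 20,
  dbConnectionsPerJob: 2,
  apiRequestsPerSecond: 10,
  apiRequestsPerJob: 4,
  avgJobSeconds: 2,
});
```

The result is only a starting value; load testing decides the final setting.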
Retry Strategy: Exponential Backoff Instead of Fixed Intervals
BullMQ automatically retries failed jobs. The retry strategy has a significant impact on how the system behaves during failures.
Why exponential backoff?
With fixed retry intervals, a downstream service outage causes all failed jobs to retry at the same frequency, continuously bombarding an already overloaded system.
Exponential backoff gradually increases the delay between retries, giving downstream services time to recover before receiving additional requests.
A typical configuration includes:
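- Maximum attempts: 3 to 5. BullMQ's attempts option counts the first run as well as the retries. Too few attempts reduce the chance of recovering from transient failures, while too many hide systemic problems and keep jobs stuck in retry loops.
- Initial delay: A few seconds, long enough for a brief outage to clear without leaving recoverable jobs waiting unnecessarily.
- Backoff multiplier: 2, doubling the waiting time after each failure. BullMQ's built-in exponential strategy doubles the delay; a different multiplier requires a custom backoff strategy.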
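To see what such a configuration means in practice, the sketch below computes the resulting delays. The retryOptions object follows the shape of BullMQ's attempts and backoff job options, while retryDelayMs and backoffSchedule are hypothetical helpers that assume the delay before the n-th retry is the base delay multiplied by 2^(n - 1). The values 5 and 2000 are illustrative.

```typescript
// Shape follows BullMQ's per-job retry options (attempts + backoff).
const retryOptions = {
  attempts: 5, // total attempts, so at most 4 retries
  backoff: { type: "exponential", delay: 2000 }, // base delay in milliseconds
};

// Delay before the n-th retry, assuming baseDelay * 2^(n - 1).
function retryDelayMs(baseDelayMs: number, retryNumber: number): number {
  return Math.round(baseDelayMs * Math.pow(2, retryNumber - 1));
}

// All delays a job would wait through before its final attempt.
function backoffSchedule(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(retryDelayMs(baseDelayMs, retry));
  }
  return delays;
}

const delays = backoffSchedule(retryOptions.attempts, retryOptions.backoff.delay);
// delays is [2000, 4000, 8000, 16000]: 30 seconds of waiting in total
```

Summing the schedule is a quick check on whether the total retry window is long enough to cover a typical downstream outage.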
Which errors should be retried?
This is one of the most overlooked aspects of retry design.
Transient failures should be retried:
- Database connection timeouts
- HTTP 503 responses from downstream APIs
- Temporary network interruptions
These issues are environmental and often resolve themselves after a short period.
Deterministic failures should not be retried:
- Payload parsing errors
- Missing required fields
- Business rule validation failures
Retrying these errors produces the same result every time while consuming unnecessary resources. They should be routed directly to the Dead Letter Queue for manual investigation.
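In BullMQ, a Worker can classify exceptions as it processes a job. For deterministic failures, the processor can throw a dedicated exception type (BullMQ provides UnrecoverableError for this purpose), which fails the job immediately instead of scheduling another attempt. A custom failure handler then routes the job to the DLQ.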
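The classification step might look like the following sketch. PermanentJobError, parseWebhookPayload, and classifyFailure are hypothetical names; the local error class stands in for BullMQ's UnrecoverableError so that the example runs without Redis. Treating every unrecognized error as retryable is an assumption, and it is safe only because the attempt limit bounds the retries.

```typescript
type PermanentReason = "parse" | "missing-field" | "business-rule";

// Marks failures that would produce the same result on every retry.
class PermanentJobError extends Error {
  readonly permanent = true;
  readonly reason: PermanentReason;
  constructor(reason: PermanentReason, message: string) {
    super(message);
    this.name = "PermanentJobError";
    this.reason = reason;
  }
}

// Validate the raw webhook body; any problem found here is deterministic.
function parseWebhookPayload(raw: string): { id: string | number } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new PermanentJobError("parse", "payload is not valid JSON");
  }
  const id = (parsed as { id?: unknown } | null)?.id;
  if (typeof id !== "string" && typeof id !== "number") {
    throw new PermanentJobError("missing-field", "payload has no id");
  }
  return { id };
}

// Permanent errors go straight to the DLQ; everything else is retried.
function classifyFailure(err: unknown): "retry" | "dead-letter" {
  const permanent =
    typeof err === "object" &&
    err !== null &&
    (err as { permanent?: unknown }).permanent === true;
  return permanent ? "dead-letter" : "retry";
}
```

The check uses a marker property rather than instanceof, so it keeps working when an error crosses a module boundary or is compiled for an older JavaScript target.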
Dead Letter Queue: A Safety Net, Not a Trash Bin
Once a job exceeds its maximum retry count, BullMQ marks it as failed. On top of this mechanism, we implement a dedicated Dead Letter Queue workflow.
Routing failed messages
When a Worker determines that a job has permanently failed, a failed event handler records essential information into a dedicated DLQ table, including:
- Original event ID
- Failure reason
- Retry history
- Final failure timestamp
This table serves as the primary source for troubleshooting.
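A minimal sketch of the record-building step follows. FailedJobView lists only the job fields the sketch reads (attemptsMade and opts.attempts mirror BullMQ's job properties), while the eventId payload field, the DlqRecord shape, and the toDlqRecord helper are hypothetical. The sketch assumes the handler may also run for attempts that will still be retried, so it writes a record only when no attempts remain or the failure is marked permanent. It stores the attempt count as a compact form of the retry history.

```typescript
// Minimal view of the fields read from a failed job.
interface FailedJobView {
  id?: string;
  data: { eventId: string };   // hypothetical payload shape
  attemptsMade: number;        // attempts used so far
  opts: { attempts?: number }; // configured total attempts
}

interface DlqRecord {
  eventId: string;
  failureReason: string;
  attemptsMade: number;
  failedAt: string; // ISO 8601 timestamp
}

// Returns a DLQ record when the job will not run again, otherwise null.
function toDlqRecord(
  job: FailedJobView,
  error: Error,
  now: Date,
  permanent = false,
): DlqRecord | null {
  const maxAttempts = job.opts.attempts ?? 1;
  const exhausted = job.attemptsMade >= maxAttempts;
  if (!exhausted && !permanent) return null; // another attempt is still pending
  return {
    eventId: job.data.eventId,
    failureReason: error.message,
    attemptsMade: job.attemptsMade,
    failedAt: now.toISOString(),
  };
}
```

Passing the current time in as a parameter keeps the function deterministic and easy to test.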
Alerting
Every new DLQ entry should trigger an alert immediately.
A silent DLQ is dangerous because failed messages can accumulate unnoticed.
Alerts can be delivered through Slack or email, providing information such as the event type and a summary of the failure reason.
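As an illustration, an alert message could be assembled as shown below. The DlqAlertInput shape, the formatDlqAlert helper, and the 200-character cut-off are all hypothetical; delivery to Slack or email is left out because it depends on the integration in use.

```typescript
interface DlqAlertInput {
  eventType: string;
  eventId: string;
  failureReason: string;
}

// Build a one-line alert, truncating long failure reasons to a summary.
function formatDlqAlert(entry: DlqAlertInput, maxReasonLength = 200): string {
  const reason =
    entry.failureReason.length > maxReasonLength
      ? entry.failureReason.slice(0, maxReasonLength) + "..."
      : entry.failureReason;
  return `[DLQ] ${entry.eventType} event ${entry.eventId} failed: ${reason}`;
}
```

Truncating the reason keeps stack traces out of chat channels while the full detail remains in the DLQ table.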
Manual recovery
After investigating and fixing the root cause, engineers can inspect the failed message through an administrative interface or scripts and re-submit it to the appropriate processing queue.
The replayed job starts with a fresh retry count and follows the normal processing pipeline again.
State Management: Tracking Every Message
BullMQ maintains runtime job states such as waiting, active, completed, and failed. However, these states live inside Redis and may disappear after completed jobs are cleaned up, making them unsuitable for long-term auditing.
Instead, we treat the webhook_logs table as the single source of truth.
Rather than mirroring BullMQ's runtime states, this table records business-level processing history, preserving complete lifecycle information even after Redis jobs have been removed.
The state transitions are straightforward:
- Queued: pending
- Processing started: processing
- Successfully completed: done
- Exceeded retry limit: failed
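These transitions can be enforced with a small state machine, sketched below. The WebhookStatus type and the transition helper are hypothetical names. Two assumptions go beyond the list: a job that is being retried stays in processing until it finishes or runs out of attempts, and a failed record may return to pending when it is replayed manually.

```typescript
type WebhookStatus = "pending" | "processing" | "done" | "failed";

// Allowed next states for each current state.
const ALLOWED_TRANSITIONS: Record<WebhookStatus, WebhookStatus[]> = {
  pending: ["processing"],
  processing: ["done", "failed"],
  done: [],            // terminal
  failed: ["pending"], // assumption: only a manual replay re-queues a failure
};

// Return the new status, or throw if the change is not allowed.
function transition(current: WebhookStatus, next: WebhookStatus): WebhookStatus {
  if (ALLOWED_TRANSITIONS[current].indexOf(next) === -1) {
    throw new Error(`Illegal status transition: ${current} -> ${next}`);
  }
  return next;
}
```

Rejecting illegal transitions at write time catches bugs such as a late Worker callback overwriting a record that has already completed.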
The processing state provides another valuable capability: detecting zombie jobs.
If a message remains in processing for an unusually long period (for example, over ten minutes), it may indicate that the Worker crashed or terminated unexpectedly before completing the job.
A scheduled task can periodically scan these records and recover abnormal jobs automatically.
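The scan itself reduces to a filter over the status records. In this sketch the WebhookLogRow shape and the findZombies helper are hypothetical, the ten-minute default comes from the example above, and the rows are passed in as an array rather than queried from webhook_logs.

```typescript
interface WebhookLogRow {
  eventId: string;
  status: "pending" | "processing" | "done" | "failed";
  updatedAt: Date; // last status change
}

// Rows stuck in "processing" for longer than the threshold.
function findZombies(
  rows: WebhookLogRow[],
  now: Date,
  thresholdMs: number = 10 * 60 * 1000,
): WebhookLogRow[] {
  return rows.filter(
    (row) =>
      row.status === "processing" &&
      now.getTime() - row.updatedAt.getTime() > thresholdMs,
  );
}
```

The threshold should sit comfortably above the longest legitimate job duration, otherwise slow jobs are mistaken for crashed ones.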
BullMQ manages runtime execution, while the database records business processing history. Together, they provide complete visibility into every message throughout its lifecycle.
Message Replay: Precise Recovery Instead of Full Resynchronization
Replay capability is the final piece of a reliable asynchronous processing system.
When is replay needed?
One common scenario is fixing a bug in the Worker logic and replaying messages that previously failed because of it.
Another scenario is recovering data after a database issue by replaying events from a specific time window.
Replay starts from the database state records.
Messages marked as failed are queried, their original payloads are loaded, and they are pushed back into the appropriate BullMQ queue with their status reset to pending.
Because the original webhook payload is permanently stored when received, replay does not depend on Shopify sending the event again or on any external system.
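Selecting what to replay can be sketched as a pure planning step. The StoredWebhookLog and ReplayPlan shapes and the planReplay helper are hypothetical, and the queue name is derived from the topic prefix on the assumption that topics look like "orders/create". In a real system, each planned job would then be added to its BullMQ queue and the listed records reset to pending.

```typescript
type LogStatus = "pending" | "processing" | "done" | "failed";

interface StoredWebhookLog {
  eventId: string;
  topic: string;    // e.g. "orders/create"
  status: LogStatus;
  payload: string;  // raw webhook body saved at receipt
  receivedAt: Date;
}

interface ReplayJob {
  queueName: string;
  data: { eventId: string; payload: string };
}

interface ReplayPlan {
  jobs: ReplayJob[];        // jobs to push back into the queues
  resetToPending: string[]; // event IDs whose status should be reset
}

// Pick failed events received in [from, to) and plan their re-submission.
function planReplay(logs: StoredWebhookLog[], from: Date, to: Date): ReplayPlan {
  const plan: ReplayPlan = { jobs: [], resetToPending: [] };
  for (const log of logs) {
    const receivedMs = log.receivedAt.getTime();
    const inWindow = receivedMs >= from.getTime() && receivedMs < to.getTime();
    if (log.status !== "failed" || !inWindow) continue;
    const [domain = "unknown"] = log.topic.split("/");
    plan.jobs.push({
      queueName: `webhook-${domain}`,
      data: { eventId: log.eventId, payload: log.payload },
    });
    plan.resetToPending.push(log.eventId);
  }
  return plan;
}
```

Separating planning from execution allows a dry run: the plan can be reviewed before anything is re-queued.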
Idempotency
Replay essentially processes the same event more than once, so Worker logic must be idempotent.
Processing the same message multiple times should produce the same final result as processing it once.
In practice, this typically means using upsert operations instead of plain inserts, with the event ID or another business key serving as the idempotency identifier.
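The sketch below shows the upsert idea with an in-memory table in place of the database. ProductEvent, ProductRow, and upsertProduct are hypothetical names, and productId serves as the business key. Against PostgreSQL the same effect is usually achieved with INSERT ... ON CONFLICT ... DO UPDATE.

```typescript
interface ProductEvent {
  eventId: string;
  productId: string; // business key used for idempotency
  title: string;
}

interface ProductRow {
  productId: string;
  title: string;
}

// In-memory stand-in for a table with a unique key on productId.
type ProductTable = Record<string, ProductRow>;

// Insert the row if it is new, overwrite it if it already exists.
function upsertProduct(table: ProductTable, event: ProductEvent): void {
  table[event.productId] = { productId: event.productId, title: event.title };
}

// Processing the same event twice leaves the table unchanged.
const table: ProductTable = {};
const event: ProductEvent = { eventId: "evt_1", productId: "p_100", title: "Mug" };
upsertProduct(table, event);
const afterFirst = JSON.stringify(table);
upsertProduct(table, event); // replay of the same event
const afterReplay = JSON.stringify(table);
// afterFirst and afterReplay are identical, and the table holds one row
```

A plain insert would either fail on the second run or create a duplicate row, which is exactly what replay has to avoid.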
Conclusion
This BullMQ-based asynchronous processing architecture breaks reliability into several independently verifiable mechanisms:
- Queue isolation prevents failures from propagating across different business domains.
- Exponential backoff gives transient failures an opportunity to recover automatically.
- Deterministic errors bypass retries and enter the DLQ immediately.
- DLQ alerts ensure failures never go unnoticed.
- Database-backed state management provides complete processing visibility.
- Message replay enables precise and controlled recovery.
Together, these mechanisms solve a more important problem than simply ensuring messages can be processed: they ensure that when processing fails, the system can reliably detect the failure, recover from it, and continue operating.
In production environments, that recoverability is what ultimately determines the trustworthiness of an asynchronous processing system.