MessageBus – How We Improved Managing Our Dead Letter Queue

Written by Martin Führlinger, Lead Engineer Backend

Introduction

In my previous posts about the message bus, I wrote about using RabbitMQ for decoupling our services and how we defined our message content, followed by an explanation of how we keep our receivers fast and resilient. Most recently in this series, I described how we handle dead letters with the dead letter exchange. That post already gave some insight into how we manually clean up the dead letter queue, and it hinted at a more sophisticated way of handling dead letters. Some time has passed since then, and I am happy to announce that we have improved this handling quite a bit.

RabbitMQ Management Interface

The RabbitMQ management interface offers only a very basic overview of some RabbitMQ statistics. For queued dead letters it shows just the headers and the encoded payload, without any ordering or grouping, so it is pretty hard to find out which messages are currently in the dead letter queue. Requeueing a dead letter into a specific queue is not possible in that interface either. So we decided to improve that and write our own small service to deal with dead letters.

Dead Letter Service

As mentioned in the last blog post, we wrote a simple script that requeues the messages of the dead letter queue. But because backend developers usually try to automate things, we wrote this new service to get rid of the script and the manual step.

This service basically just implements another consumer, listening to the dead letter queue and storing all received dead letters in a MongoDB database. Besides storing the message payload and all necessary metadata like the headers or the routing key, we also store:

  • a unique ID for that message, containing the message ID and the queue it failed in
  • a dead flag (true/false)
  • the date and time when it was marked as dead
  • a redelivery count

As the message ID and the failing queue do not change when republishing a message, they can be used to identify a message in our database.
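
To make the identification scheme concrete, here is a minimal sketch of how such a record could be created or updated. It is not the service's actual code: the field names, the `:` separator in the ID and the plain Hash standing in for the MongoDB collection are all assumptions for illustration.

```ruby
# Hypothetical in-memory stand-in for the MongoDB collection: id => record.
# Field names are illustrative, not the service's actual schema.

# The unique ID combines the message ID and the queue the message failed in.
# The ":" separator is an assumption.
def dead_letter_id(message_id, queue)
  "#{message_id}:#{queue}"
end

# Creates the record on first sight, or updates the existing one when the
# same message dies again in the same queue.
def record_dead_letter(store, message_id:, queue:, routing_key:, headers:, payload:, now: Time.now.utc)
  id = dead_letter_id(message_id, queue)
  record = store[id] ||= { id: id, message_id: message_id, queue: queue, redelivery_count: 0 }
  record.merge!(
    routing_key: routing_key,
    headers: headers,
    payload: payload,
    dead: true,     # the message is (again) considered dead
    died_at: now    # when it was marked as dead
  )
end
```

Because the ID stays stable across republishing, a message that fails again updates its existing record instead of creating a duplicate, and the redelivery count survives.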

Retrying

A dead letter is automatically retried by this service, which means it is pushed a few times to the particular queue in which it failed. Pushing to that specific queue is necessary, because publishing the message again with the same topic would cause all consumers of that topic to receive it again (see using RabbitMQ for decoupling our services about topic/routing_key usage). If the message still cannot be consumed, it is negatively acknowledged and ends up in the dead letter queue again, where our service receives it once more. To push to a queue directly, without using the topic, you need to publish through the default exchange (default_exchange in the code) instead of the exchange you usually use for receiving messages (in our case, the one named production exchange).

require "march_hare"

def exchange
  # default direct exchange which can
  # route to all queues via
  # routing_key == queue_name
  @exchange ||= channel.default_exchange
end

def channel
  # created lazily and reused for all publishes
  @channel ||= connection.create_channel
end

def connection
  # config holds the connection settings of the service
  @connection ||= MarchHare.connect(
    host:      config[:host],
    ...
  )
end
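
Building on the exchange method above, the retry itself could look roughly like the following sketch. The limit of 3, the record fields and the decision to clear the dead flag when republishing are assumptions, since the post only says a message is pushed "a few times". The `:routing_key` and `:properties` options follow March Hare's `publish` as I understand it and should be checked against the version in use.

```ruby
MAX_REDELIVERIES = 3 # assumed limit, not taken from the actual service

# `exchange` is anything that responds to publish(payload, options),
# for example the default exchange returned by the `exchange` method.
# `record` is a hypothetical Hash describing one stored dead letter.
# Returns true if the message was republished, false if the limit is reached.
def retry_dead_letter(exchange, record, max: MAX_REDELIVERIES)
  return false if record[:redelivery_count] >= max

  exchange.publish(
    record[:payload],
    routing_key: record[:queue], # default exchange: routing key == queue name
    properties: { message_id: record[:message_id], headers: record[:headers] }
  )
  record[:redelivery_count] += 1
  # Assumption: the flag is cleared here and set again if the message
  # comes back through the dead letter queue.
  record[:dead] = false
  true
end
```

Passing the exchange in as an argument keeps the retry logic independent of a live RabbitMQ connection, which makes it easy to exercise with a stub.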

Querying

Storing dead messages in a database also enables us to query them by various attributes. For example, we can list all dead messages from a single queue or topic, or check which messages died within the last 3 days. You can imagine that a variety of interesting groupings and filters are possible. With that in mind, we implemented two views of the data. The first imitates the sidekiq-cleaner and resque-cleaner views: it shows the number of dead messages grouped by queue and time period.
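
As an illustration of this kind of grouping, the following sketch counts dead messages per queue and time period over an in-memory list of records. The real service queries MongoDB. The record fields and the chosen periods are assumptions.

```ruby
# Assumed time periods in seconds; each period includes the shorter ones.
PERIODS = { "1h" => 3600, "1d" => 86_400, "3d" => 3 * 86_400 }.freeze

# records: Array of Hashes with :queue, :dead and :died_at (a Time).
# Returns { queue => { period label => number of dead messages } }.
def dead_counts_by_queue_and_period(records, now: Time.now.utc)
  counts = Hash.new { |hash, queue| hash[queue] = Hash.new(0) }
  records.select { |r| r[:dead] }.each do |r|
    age = now - r[:died_at] # age in seconds
    PERIODS.each do |label, seconds|
      counts[r[:queue]][label] += 1 if age <= seconds
    end
  end
  counts
end
```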

The second view lists all queues and their respective dead and undead (successfully redelivered) message counts.
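
The numbers behind this second view can be derived in the same spirit. Again, this is only a sketch over in-memory records with assumed field names, not the actual MongoDB query.

```ruby
# records: Array of Hashes with :queue and :dead.
# Returns { queue => { dead: n, redelivered: m } }.
def dead_and_redelivered_counts(records)
  records.group_by { |r| r[:queue] }.transform_values do |queue_records|
    dead = queue_records.count { |r| r[:dead] }
    { dead: dead, redelivered: queue_records.size - dead }
  end
end
```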

Clicking on the counts opens the corresponding index page listing all dead letters which are in that queue.

This page also enables us to requeue a single message. Clicking on the ID of an entry, which is a combination of the message ID (a UUID) and the queue the message was queued in, opens the detail page of the dead letter.

This page shows the detailed information about the dead letter: the headers, the routing key, the flags and, maybe most important, the payload of the message, which is usually the main reason why a message cannot be processed and is rejected.

Further Improvements

The current implementation already helps a lot in day-to-day business, but of course there are many possible improvements. As already mentioned, a more complex check for the retry would make sense, for example, to allow retrying messages of some queues more often than others. The following are just some of the many possible improvements:

  • More flexible retry conditions
  • Delayed retry using an asynchronous worker (e.g. sidekiq)
  • Automatic cleanup of redelivered messages (e.g. delete all successfully redelivered messages after a certain amount of time)
  • Bulk retry (e.g. retry all dead messages of a specific queue or topic)
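
To give an idea of what the first and last item could look like, here is a purely hypothetical sketch. None of this exists in the service yet. The queue name, the limits and the record fields are made up for illustration.

```ruby
# Hypothetical per-queue retry limits; queues not listed use the default.
RETRY_LIMITS = { "payments.process" => 10 }.freeze
DEFAULT_RETRY_LIMIT = 3

# A record may be retried while its redelivery count is below the
# limit configured for its queue.
def retry_allowed?(record, limits: RETRY_LIMITS, default: DEFAULT_RETRY_LIMIT)
  record[:redelivery_count] < limits.fetch(record[:queue], default)
end

# Selects all dead messages of one queue that may still be retried,
# as a starting point for a bulk retry.
def bulk_retry_candidates(records, queue:)
  records.select { |r| r[:dead] && r[:queue] == queue && retry_allowed?(r) }
end
```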

Conclusion

Introducing this service enabled us to check how many messages haven’t been delivered over time and to inspect the content of these messages. It also allows us to requeue single dead letters with a few clicks instead of using a script. Looking toward the future, the new service is a great foundation for further improvements.
