From the title of this article, you might be wondering, "Why the heck would someone purposely inject a latency factor into a queuing system?!" After all, isn't the point of queues to get data to an application or component as quickly as possible? Most of the time, the answer to that question is yes. However, as you'll see in this article, there are some special situations where manually creating a latency factor is desirable.
As is usually the case with large projects, everything went as well as something that large can from design until the first "beta". However, it was with the introduction of production-level data that we discovered a severe limitation with our design and the use of queuing systems in general.
Figure 1: This is an example of a vicious cycle where a continually failing, transactionable, high-priority message causes a bottleneck in a queued system.
Take a look at the problem that we faced in Figure 1 as you follow this pseudo codepath.
- To start, an NT Service retrieves a message from a transactional queue.
- An error occurs (such as a record deadlock) that prevents the successful processing of the message.
- As a result of this failure, the NT Service simply performs a rollback.
- Once the rollback is called, MSMQ re-queues the message.
This codepath worked well throughout most of testing. In the real world, however, because the queue is transactional, the rolled-back message would return to the top of the queue, the needed record would still be locked, and the message would fail and be re-queued all over again.
This obviously resulted in several instances of these "trapped" messages causing a bottleneck in the message queue and effectively shutting down most of the system (or at least resulting in unacceptable performance levels).
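The cycle can be reproduced with a small simulation. This is only an illustrative sketch in Python, not the original NT Service or the MSMQ API: the `drain` function, the message names and the `locked_ids` set are hypothetical stand-ins, and the model assumes (as described above) that a rolled-back message returns to the head of the queue.

```python
from collections import deque

def drain(queue, locked_ids, max_attempts):
    """Process messages head-first; a failed message is rolled back to the head.

    Hypothetical model of the cycle in Figure 1, not real MSMQ code.
    """
    processed = []
    attempts = 0
    while queue and attempts < max_attempts:
        attempts += 1
        msg = queue.popleft()        # receive inside a transaction
        if msg in locked_ids:        # e.g. the record it needs is still locked
            queue.appendleft(msg)    # rollback: the message returns to the head
            continue
        processed.append(msg)        # commit
    return processed, attempts

queue = deque(["A", "B", "C"])
processed, attempts = drain(queue, locked_ids={"A"}, max_attempts=1000)
print(processed, attempts)  # [] 1000
```

As long as the record needed by "A" stays locked, every attempt is spent on "A", and "B" and "C" are never reached.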
A (Generic) Solution
What was needed was a way to build a latency factor into the system so that these messages would not disrupt the flow of messages throughout. This was accomplished with the introduction of a database table and a new message queue. Let's take a look at the current (simplified for this article) system.
- The first change was that now when a failure occurs (such as a record deadlock), the NT Service (shown in Figure 1) instantiates a special "requeueMessage" component. This component is defined as "requires new transaction". We'll get to why shortly.
- After this component is instantiated, it is initialized by way of being passed the original message that could not be processed.
- The component then creates a new MSMQ message, copying the original message's contents and setting the new message's AppSpecific property to an id that uniquely identifies the original message.
- The new message is then queued into a queue called ErrorRecovery. Why not simply insert the original message into this ErrorRecovery queue? Because that message is part of a transaction that is going to be rolled back. Therefore, you would lose that message as soon as the NT Service rolled back the original transaction. This is also why the requeueMessage component must be defined as "requires new transaction". That way it remains unaffected by the original transaction and its imminent rollback.
- Now that the new message is in the ErrorRecovery queue, the requeueMessage component creates a row in an errorRecovery table in the database. Columns such as original message id and reason for failure are set. In addition, a very important column called "status" is set to 0.
- The requeueMessage component then commits its own transaction.
- From there, the NT Service rolls back and the original message gets placed back in the message queue.
Checkpoint: At this point, we have the original message back in its original queue, we have an error record in the database (pointing to the message id of the original message) and we have a message in an error message queue.
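To make the checkpoint concrete, here is a minimal sketch of the steps so far. It is a simulation under stated assumptions, not the original component: the queues are plain Python deques rather than MSMQ queues, the database is an in-memory SQLite table, and the names `requeue_message`, `main_queue`, `error_recovery_queue` and the column names are hypothetical. The independent `commit()` inside the function stands in for "requires new transaction".

```python
import sqlite3
from collections import deque

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE errorRecovery (msg_id TEXT PRIMARY KEY, reason TEXT, status INTEGER)"
)

main_queue = deque([{"id": "m1", "body": "payload"}])   # stand-in for the original queue
error_recovery_queue = deque()                          # stand-in for ErrorRecovery

def requeue_message(original, reason):
    """Hypothetical stand-in for the requeueMessage component."""
    # Copy the contents and tag the copy with the original message's id
    # (the article uses the AppSpecific property for this).
    copy = {"app_specific": original["id"], "body": original["body"]}
    error_recovery_queue.append(copy)
    db.execute(
        "INSERT INTO errorRecovery (msg_id, reason, status) VALUES (?, ?, 0)",
        (original["id"], reason),
    )
    db.commit()  # committed on its own, so the caller's rollback cannot undo it

msg = main_queue.popleft()                # the service receives the message
try:
    raise RuntimeError("record deadlock")  # simulated processing failure
except RuntimeError as exc:
    requeue_message(msg, str(exc))
    main_queue.appendleft(msg)             # rollback: the original goes back

row = db.execute("SELECT msg_id, reason, status FROM errorRecovery").fetchone()
print(list(main_queue), list(error_recovery_queue), row)
```

After this runs, all three pieces of the checkpoint are in place: the original message is back in its queue, a copy sits in the error queue, and the table holds a row with status 0.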
- Now we really don't want that original message in the queue any more. However, we had no choice about it being returned to the queue: the queue is transactional and we rolled back, so MSMQ requeued it. What we really want is for an "error recovery" NT Service processing the ErrorRecovery queue to decide whether a message should be requeued and when (hence the latency factor). Therefore we change the original NT Service to read the error table after receiving each message from the message queue. If a record is not found, then obviously this is the first trip for the message. However, if a record is found, then the status is checked. If the status is equal to 0, the status is set to 1 and the transaction is committed without anything else being done, which removes the message from the original queue. If the status is set to 99 (which you'll see in the next step), the message is reprocessed.
- At this point, the only thing left to do is to code the "error recovery" NT Service. However, this NT Service doesn't start by reading messages from the message queue. Instead, it reads records from the error table. In fact, it reads a view that selects only records whose status is set to 1. Once it finds a record whose status is equal to 1, it changes the status to 99 (so that the original NT Service will reprocess the message, as described in the previous step), updates the database, locates (and removes) the message in the ErrorRecovery queue and, finally, requeues the message into the original queue.
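The status handling in the last two steps amounts to a small state machine (0, then 1, then 99). The sketch below simulates it with an in-memory SQLite table and Python deques in place of MSMQ. All function, queue and column names are hypothetical, and it assumes the requeued copy carries the original message's id so that the table lookup still matches.

```python
import sqlite3
from collections import deque

FAILED, PARKED, RETRY = 0, 1, 99   # status values used in the article

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE errorRecovery (msg_id TEXT PRIMARY KEY, status INTEGER)")
db.execute("INSERT INTO errorRecovery VALUES ('m1', 0)")
db.commit()
main_queue = deque([{"id": "m1", "body": "payload"}])
error_recovery_queue = deque([{"app_specific": "m1", "body": "payload"}])

def main_service_step(process):
    """Original service: consult the error table before processing."""
    msg = main_queue.popleft()
    row = db.execute(
        "SELECT status FROM errorRecovery WHERE msg_id = ?", (msg["id"],)
    ).fetchone()
    if row is not None and row[0] == FAILED:
        # Status 0: mark as 1 and commit, which drops the message from this queue.
        db.execute(
            "UPDATE errorRecovery SET status = ? WHERE msg_id = ?", (PARKED, msg["id"])
        )
        db.commit()
        return "parked"
    # In this sketch every other case (no record, or status 99) is processed.
    process(msg)
    return "processed"

def recovery_service_step():
    """Error recovery service: release one parked record back to the main queue."""
    row = db.execute(
        "SELECT msg_id FROM errorRecovery WHERE status = ? LIMIT 1", (PARKED,)
    ).fetchone()
    if row is None:
        return False
    msg_id = row[0]
    db.execute("UPDATE errorRecovery SET status = ? WHERE msg_id = ?", (RETRY, msg_id))
    db.commit()
    for copy in list(error_recovery_queue):
        if copy["app_specific"] == msg_id:
            error_recovery_queue.remove(copy)
            main_queue.append({"id": msg_id, "body": copy["body"]})
            break
    return True

handled = []
first = main_service_step(handled.append)    # "parked"
released = recovery_service_step()           # True
second = main_service_step(handled.append)   # "processed"
print(first, released, second, handled)
```

The delay between the first and second calls to `main_service_step` is entirely under the control of the recovery service, which is where the latency comes from.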
Now what have we accomplished here? We have successfully designed an intelligent means of handling errors such that MSMQ doesn't simply stuff them back into the queue, thereby causing a bottleneck. At this point, the next step becomes very problem domain-specific with regard to how the ErrorRecovery queue works. For example, you might want to add a column to the error table that specifies how many times a given message has failed. Then you could write the code so that when a certain "retry limit" has been reached, the message gets moved to a "dead letter queue" where you're basically stating that the message is undeliverable. Another possibility is that you can assign more status codes and use them to inject an ever-increasing amount of latency for a failing message. For example, let's say a message fails once. You might set its status to 2. You would then have your error recovery NT Service read the error records sorted by status so that you have your own very simple, yet functional prioritization scheme.
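As one possible illustration of the retry-limit idea, the sketch below splits error records into those still eligible for retry (ordered so that messages with fewer failures are served first) and those that should go to a dead letter queue. The `failures` field, the `RETRY_LIMIT` value and the `triage` function are hypothetical; the article does not prescribe a specific schema or limit.

```python
RETRY_LIMIT = 3  # hypothetical limit; choose a value that fits your problem domain

def triage(records, retry_limit=RETRY_LIMIT):
    """Split error records into (retry order, dead letters) by failure count."""
    dead = [r for r in records if r["failures"] >= retry_limit]
    live = sorted(
        (r for r in records if r["failures"] < retry_limit),
        key=lambda r: r["failures"],   # fewest failures first
    )
    return live, dead

records = [
    {"msg_id": "a", "failures": 2},
    {"msg_id": "b", "failures": 5},
    {"msg_id": "c", "failures": 1},
]
live, dead = triage(records)
print([r["msg_id"] for r in live], [r["msg_id"] for r in dead])  # ['c', 'a'] ['b']
```

Sorting by failure count plays the same role as sorting by status in the scheme described above: a message that keeps failing waits longer each time.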
One last thing to add is that our version of this error recovery NT Service only handled one record at a time. In other words, the whole point was to manually create latency. Therefore, we certainly didn't want this Service processing as many records as fast as it could. Your mileage will vary depending on your specific situation, but we had the Service "wake up" every 5 seconds and then process a single record. This, in itself, might be enough to provide you all the latency you need to avoid the bottlenecks described in this article.
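The "wake up, process one record, go back to sleep" behavior can be sketched as a simple polling loop. This is an assumption-laden illustration rather than the original Service: `run_recovery_loop` and its parameters are hypothetical, and the `sleep` function is injectable so the demo below runs without actually waiting.

```python
import time

def run_recovery_loop(process_one, interval_seconds=5.0, max_cycles=None, sleep=time.sleep):
    """Wake up every interval_seconds and handle exactly one record per wake-up.

    process_one is a caller-supplied callable; max_cycles=None runs forever.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        sleep(interval_seconds)   # the deliberate latency
        process_one()             # one record only, never a batch
        cycles += 1
    return cycles

# Demo with a fake sleep so nothing actually blocks.
naps = []
pending = ["r1", "r2", "r3"]
done = []
cycles = run_recovery_loop(
    lambda: done.append(pending.pop(0)) if pending else None,
    interval_seconds=5.0,
    max_cycles=2,
    sleep=naps.append,
)
print(cycles, done, pending, naps)  # 2 ['r1'r2'] is not literal output; see values below
```

With `max_cycles=2`, the demo leaves `done` as `['r1', 'r2']`, `pending` as `['r3']` and `naps` as `[5.0, 5.0]`: two wake-ups, one record each. In a real service you would pass the default `time.sleep` and a `process_one` that performs the recovery step.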
Obviously, this is a very advanced topic and extremely difficult to convey in a single article. Therefore, if you find you're having difficulty implementing this strategy, feel free to contact me. Time permitting, I'll help you as much as I can.