During the night between Friday the 29th and Saturday the 30th of May we detected a delay between the targeted processing time and the actual time the job was queued for processing.
Our investigation following the incident has determined that the problem only affected customers with non-standard processing flows. The majority of our customers are on the standard setup and did not experience any issues.
To understand the root cause, you need to know two things. First, all job queuing happens in two groups, "standard" and "non-standard". Second, all jobs in a group are queued sequentially, and the order is randomized between runs.
The problem originated in the handling of custom processing scripts used in non-standard setups. That is why only customers in the non-standard group were affected.
Some of the custom processing scripts generate an excessive amount of IO - file operations - on the storage server. This caused a delay, and since jobs are queued sequentially, any job following a delayed job is also delayed.
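The knock-on effect of one slow job in a sequential queue can be illustrated with a minimal sketch. The function name `queue_times` and the durations are hypothetical and chosen only for illustration; they are not taken from our actual system.

```python
from itertools import accumulate


def queue_times(durations):
    """Return the time at which each job is queued when jobs run one after another.

    `durations` holds how long each job occupies the queue (arbitrary time units).
    """
    if not durations:
        return []
    # A job is queued only once every job before it has finished,
    # so its queue time is the sum of all earlier durations.
    return [0] + list(accumulate(durations))[:-1]


# Four jobs that each take 1 unit: queued at 0, 1, 2, 3.
normal = queue_times([1, 1, 1, 1])

# The second job is slowed down by heavy IO and takes 10 units.
# Every job behind it is pushed back by the same 9 units: 0, 1, 11, 12.
delayed = queue_times([1, 10, 1, 1])
```

The slow job itself is queued on time; it is the jobs behind it that are queued late.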
On the 30th, the amount of IO reached a limit on the storage server, causing abnormal delays.
We randomize the running order so that a single failing job cannot prevent other pending jobs from running. If the jobs always ran in the same order, no job after the failing one would ever run. We, of course, want to ensure that all jobs run, but the side effect is that the running order does not take delays into account. This is not a problem when everything is within normal parameters, but during the incident it caused additional delays for some papers, because they ended up at the "wrong" end of the queue.
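The reasoning behind the randomized order can be sketched as follows. This is a simplified model, not our production code: the function `run_once`, the job names, and the assumption that a failing job stops the rest of the run are all hypothetical and used only for illustration.

```python
import random


def run_once(jobs, failing, rng=None):
    """Process jobs one after another; the run stops at the failing job.

    If `rng` is given, the order is shuffled for this run.
    Returns the jobs that completed in this run.
    """
    order = list(jobs)
    if rng is not None:
        rng.shuffle(order)
    completed = []
    for job in order:
        if job == failing:
            break  # assumption: a failing job blocks everything behind it
        completed.append(job)
    return completed


jobs = ["a", "b", "c", "d"]

# Fixed order: "b" always fails in second place, so "c" and "d" never run.
fixed = set()
for _ in range(50):
    fixed.update(run_once(jobs, failing="b"))

# Randomized order: which jobs sit behind "b" changes from run to run,
# so over repeated runs the healthy jobs all get a turn.
rng = random.Random(0)
shuffled = set()
for _ in range(50):
    shuffled.update(run_once(jobs, failing="b", rng=rng))
```

Note that the shuffle treats all jobs alike. It has no notion of which jobs are already late, which is why a delayed paper can land at the back of the queue again.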
On Saturday morning we alleviated the problem by moving some of the processing away from the contended resource.
We have made a series of improvements to the non-standard setup over the last couple of weeks to improve it for all customers using it. On Monday the 1st of June we updated all custom processing scripts to limit the amount of IO they generate. This will completely remove any risk of hitting the IO ceiling.
To eliminate any chance of a recurrence, we will add another level of monitoring and logging to the system that queues jobs for processing. This will cover both standard and non-standard flows.
Thankfully, all papers were processed before the morning rush, and in the end the end-user impact was very low.