I’m very pleased to announce that the backlog is finally beginning to clear! We’ve spent much of the last week analysing exactly what was going on in order to pinpoint the cause of the slowdown.
There were actually several problems, which initially sent us off on the wrong track. We have a submission queue which holds all of the individual submission jobs to be executed. As the number of members increases, the number of jobs in the queue goes up. As the number of Pro members increases, it REALLY goes up, because Pro members submit a lot more jobs and, of course, since Release 2 there are now 4 submission tools. The result was that our queue had been steadily growing and its performance had been steadily declining!
The problem was that, as long as the engine was able to process all the jobs on schedule, we never actually noticed that performance was dropping. It was only when we crossed the critical tipping point where the workload exceeded the capacity (which happened on the 7th of February) that the engine began to fall behind and we started investigating.
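To make that tipping point concrete, here’s a minimal sketch of the dynamic (all figures are invented for illustration and have nothing to do with our real engine): per-job processing slows as the queue grows, so effective capacity quietly erodes, and the day the incoming workload crosses it, the backlog takes off.

```python
# Illustrative only: invented figures, not our actual queue engine.
def simulate(days=25, base_capacity=1000.0, slowdown=0.0005):
    """Model a queue whose capacity shrinks as the backlog grows."""
    backlog = 0.0
    for day in range(1, days + 1):
        workload = 600 + day * 25  # submissions grow with membership
        # A bigger queue makes each job slower to process.
        capacity = base_capacity / (1 + slowdown * backlog)
        backlog = max(0.0, backlog + workload - capacity)
        print(f"day {day:2d}: workload={workload:4d} "
              f"capacity={capacity:6.0f} backlog={backlog:7.0f}")

simulate()
```

Run it and you’ll see the backlog sit at zero for a couple of weeks, then take off once workload passes capacity; and because a bigger backlog further reduces capacity, the fall-behind accelerates. That feedback loop is exactly why the decline went unnoticed until the tipping point.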
The additional servers that we added at the weekend did not help, because they were thwarted by a bottleneck in the database. Unfortunately, we wasted a lot of time setting up servers and tweaking things to try to improve performance before we realised that the major bottleneck was in fact in the database. Thankfully, optimising a database is a lot easier than setting up new hardware!
We’ve made some changes which have had a significant performance impact – enough to push capacity back above the workload, so the backlog can now begin to clear. However, an important lesson has been learned here, and we are going to use the opportunity to continue improving performance. We have at least 4 more areas where we think performance can be improved further, and with such a major backlog on the queue, we have a ton of test data to work with!
The old saying ‘a chain is only as strong as its weakest link’ really applies here – we may well be able to get further performance gains from better servers or a larger pool of IP addresses, but if another area is causing a greater bottleneck, those gains cannot be realised. Now that we’ve solved the biggest issue, we can continue testing and move up the chain, improving as we go!
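For what it’s worth, the weakest-link idea fits in a few lines (the stage names and figures below are made up, not our real numbers): a pipeline’s end-to-end throughput is capped by its slowest stage, so speeding up anything else buys nothing until the bottleneck moves.

```python
# Hypothetical per-stage capacities in jobs/hour.
stages = {"servers": 1500, "ip_pool": 1200, "database": 400}

# The pipeline runs only as fast as its slowest stage.
print(min(stages.values()), min(stages, key=stages.get))   # 400 database

stages["servers"] *= 2                                     # non-bottleneck fix
print(min(stages.values()))                                # still 400

stages["database"] = 2000                                  # fix the bottleneck
print(min(stages.values()), min(stages, key=stages.get))   # 1200 ip_pool
```

Fixing the database moves the limit on to the next-weakest link, which is exactly the sense in which we can keep testing and move up the chain.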
I’d like to thank you all for your support and patience over the last week or so!