Detailed postmortem of service incident November 1st 2014

2020-09-15 11:29:58 UTC

This report expands on the preliminary postmortem we wrote on Saturday the 1st of November shortly after the incident was resolved. It adds information about smaller incidents on Saturday the 25th of October and Sunday the 2nd of November that we now know were directly related to the main incident.

Saturday the 25th of October

During peak traffic hours Saturday morning our device front end performed poorly, causing apps to fail with a "missing internet" message. Despite the issues, Saturday ended up setting a new traffic high, reaching 15.2TB of outbound traffic.

Careful analysis by our team did not reveal any single cause, nor could our provider offer an explanation. Although we were unable to pinpoint an exact cause, the symptoms indicated a performance issue. To prevent a repeat occurrence we had our supplier add an additional server to our front end pool.

Friday morning the 31st of October we observed that our servers were able, for a short period, to saturate the uplink to the internet. We addressed the bandwidth deficit by adding an additional 1Gb link the same afternoon.

Saturday the 1st of November

About 7:00 the first indications that something was afoot appeared in our performance monitoring. At that time end users were not affected.

Around 8:00 the problem escalated, and between 8:00 and 11:00 services were severely affected. During that time both Visiolink employees and our supplier's experts worked frantically to discover the cause. Despite a huge bandwidth surplus, servers doing little work and a database almost idling, all the symptoms indicated a capacity problem.

While our supplier worked to understand why end users were not getting served, we disabled as many services as possible, still working from the hypothesis that this had to be a capacity issue since those were the symptoms we saw. No matter what we tried we did not see any improvement, which disproved that hypothesis. Another puzzle was that there was no relation between the complexity of the running scripts and the response times. One moment the simplest script would take several seconds to complete; the next, the response would be instant.

The first valuable clue was that not all the front end machines were affected. That was very puzzling because the machines are configured to be exactly alike to ensure that they behave the same way.

Around 10:45 our supplier realized that a link between two switches was getting fully saturated. This was affecting both storage and database access for several of our front end machines, which explained the lack of meaningful patterns in the response times: they did not depend on what the server was doing, but on the availability of the link. It also explained why not all machines were equally affected, since not all machines accessed resources through the saturated link.

At 10:55 all front end servers were moved to the same switch, removing the bottleneck. The effect was dramatic: within minutes all services began operating within normal parameters and the throughput more than doubled, reaching the levels we normally see during peak hours on Saturdays.

During the day the traffic has been higher than normal, and we are now within a few percent of delivering all the newspaper data we normally deliver during a Saturday.

Since Saturday we have learned that the measures we took to address the problems on the 25th actually made the situation worse the following week. The reason was that the server our supplier added during the week was added to the wrong switch, in effect stressing the already congested link even more. Seeing the effect requires a very high level of traffic and a pattern of traffic that stresses the caches of the front end. This did not happen until Saturday morning.

The misconfiguration did not happen recently. It was introduced when we started upgrading our storage setup 18 months ago. Only very recently has the traffic increased to a level where it became noticeable.

Sustained traffic above 3.5Gb was needed to trigger the issue, a level we have only reached a few times, typically Saturday morning.
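As an illustration of what "sustained traffic above 3.5Gb" could mean in monitoring terms, the following is a minimal sketch that finds periods where throughput samples stay above a threshold for several consecutive samples. It assumes the figure means gigabits per second; the function name, the sample values and the `min_samples` setting are hypothetical and not taken from our actual monitoring.

```python
from typing import List, Sequence, Tuple


def sustained_periods(
    samples_gbps: Sequence[float],
    threshold_gbps: float = 3.5,   # threshold from the report, assumed to be Gb/s
    min_samples: int = 3,          # hypothetical: how many samples count as "sustained"
) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of at least
    min_samples consecutive samples above threshold_gbps."""
    periods: List[Tuple[int, int]] = []
    run_start = None
    for i, value in enumerate(samples_gbps):
        if value > threshold_gbps:
            if run_start is None:
                run_start = i          # a new run above the threshold begins
        else:
            if run_start is not None and i - run_start >= min_samples:
                periods.append((run_start, i - 1))
            run_start = None           # run ended (or never started)
    # Handle a run that is still in progress at the end of the data.
    if run_start is not None and len(samples_gbps) - run_start >= min_samples:
        periods.append((run_start, len(samples_gbps) - 1))
    return periods


# Hypothetical samples: a short spike (ignored) followed by a sustained period.
example = [3.0, 3.6, 3.7, 3.2, 3.6, 3.8, 3.9, 4.0, 3.1]
print(sustained_periods(example))
```

Requiring several consecutive samples rather than a single one separates short spikes from the kind of prolonged load that exposed the congested link.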

Sunday the 2nd of November

Due to the corrected internal network setup our device front end performed noticeably better than normal. It turned out that the suboptimal setup had been working as an unintended traffic throttle.

With the bottleneck removed we managed to saturate 3 of 5 1Gb links during a period with many newsstand deliveries. That caused traffic to slow down between 6:10 and 7:20. End users will mainly have experienced downloads as slower than normal.


Ultra short term

By adjusting the rate at which we send out newsstand notifications we have lowered the peaks, ensuring the normal reading experience will not be affected by apps fetching data in the background.
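The idea of lowering peaks by pacing notifications can be sketched as follows. This is only an illustration under assumptions: the report does not describe how the rate was adjusted, and the function name, batch size and interval below are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def notification_schedule(
    device_ids: Sequence[str],
    per_batch: int,            # hypothetical: notifications sent per batch
    interval_seconds: int,     # hypothetical: pause between batches
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (delay_in_seconds, batch) pairs so that notifications are
    spread over time instead of being sent all at once."""
    if per_batch <= 0:
        raise ValueError("per_batch must be positive")
    for n, start in enumerate(range(0, len(device_ids), per_batch)):
        batch = list(device_ids[start:start + per_batch])
        yield n * interval_seconds, batch


# Five hypothetical devices, two notifications every 30 seconds.
for delay, batch in notification_schedule(["a", "b", "c", "d", "e"], 2, 30):
    print(delay, batch)
```

A smaller batch or a longer interval stretches the resulting background downloads over a longer period, which lowers the peak at the cost of some devices receiving the issue later.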

Short term

Wednesday the 5th of November we start moving all traffic related to automatic downloads and newsstand off the device front end to a setup dedicated to that traffic. This will allow us to fully control the amount of bandwidth used for background downloads. We want to make sure we can deliver newsstand traffic in a timely fashion without impacting the experience of the readers. Any changes will of course be done server side and be completely transparent to the end users.
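One common way to cap the bandwidth used by a class of traffic is a token bucket. The sketch below is an assumption made for illustration, not a description of the dedicated setup; the class name, rate and burst size are hypothetical.

```python
class TokenBucket:
    """Minimal token bucket: allows up to burst_bytes at once and
    rate_bytes_per_s on average. Time is passed in explicitly so the
    behaviour is easy to reason about and test."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes      # start with a full bucket
        self.last = 0.0                # timestamp of the last refill

    def allow(self, nbytes: float, now: float) -> bool:
        """Return True if nbytes may be sent at time now (in seconds)."""
        elapsed = max(0.0, now - self.last)
        # Refill according to elapsed time, never above the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Hypothetical limits: 100 bytes/s on average, bursts of up to 200 bytes.
bucket = TokenBucket(rate_bytes_per_s=100, burst_bytes=200)
print(bucket.allow(200, now=0.0))   # the initial burst fits
print(bucket.allow(1, now=0.0))     # bucket is empty, request is refused
print(bucket.allow(100, now=1.0))   # one second later 100 bytes are available
```

Applied to background downloads only, a limit of this kind leaves the remaining capacity available to readers who are actively using the apps.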

Middle term

To eliminate any risk of a recurrence our supplier will replace the switches with new 10Gb equipment, and we will move from a 2 switch setup to a 4 switch setup. Used correctly the current equipment is fully capable, but replacing it allows us to add capacity for future use. Adding an extra set of switches also gives more redundancy.

The main challenge for us in this series of issues has been the fact that the root cause was in a part of the setup that is the sole responsibility of our supplier. Even though the setup was clearly broken, it had been running rock solid for a very long time and would have continued to do so if the traffic had not increased beyond the limit of the 1Gbit link.

Our supplier will still have full responsibility for the network performance, but going forward we will make sure that there are no more hidden performance bottlenecks, and we will have access to direct measurements of the performance of all potential bottlenecks ourselves.

So far our supplier has added performance graphs to all our network ports, and we are in the process of examining the many hundreds of measurements already in place, looking for areas with missing coverage.
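Per-port graphs of this kind are typically derived from byte counters read at regular intervals. The following minimal sketch shows the calculation under assumptions: the function name, the counter width and the example numbers are hypothetical and do not describe our supplier's tooling.

```python
def link_utilization(
    prev_octets: int,
    curr_octets: int,
    interval_seconds: float,
    link_bps: float,
    counter_bits: int = 64,    # hypothetical counter width
) -> float:
    """Return link utilization as a fraction (1.0 means fully saturated),
    computed from two readings of a cumulative byte counter."""
    # The modulo handles a counter that wrapped around between readings.
    delta_octets = (curr_octets - prev_octets) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / interval_seconds
    return bits_per_second / link_bps


# Hypothetical readings 300 seconds apart on a 1Gb link.
utilization = link_utilization(0, 37_500_000_000, 300, 1_000_000_000)
print(f"{utilization:.0%}")
```

Tracking this value for every inter-switch link, not only the uplinks, is what would have made a saturated internal link visible.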

Long term

We have for a while been working on designing a new setup. The main purpose of the design is to ensure very high availability of the most recent papers by moving their delivery to servers in different locations. One of the side effects will be that the main hosting will have to carry less of the load. We are planning to begin testing the new setup during Q1 2015.

For any questions or concerns please do not hesitate to contact
