On April 14th, 2016, we received reports of errors when end users tried to download the latest available publication on iOS devices. The symptoms were rendering errors on the publication’s front page and blank pages when swiping through the publication. Android seemed unaffected, or at least affected to a much lesser degree. Desktop HTML users reported problems to a lesser extent. This report outlines the events surrounding the incident and the actions taken on Visiolink’s side to avoid reoccurrences.
Upon receiving the first report around 6:30, our team performed our standard checks. Nothing out of the ordinary was found, and the problem was believed to be isolated to that customer and handled accordingly. Soon after, further reports arrived and the problem was escalated to a suspected system-wide incident.
The standard checks had not revealed anything out of the ordinary, and a more thorough inspection confirmed that all frontend servers were performing within normal parameters and handling the expected number of requests. Furthermore, the amount of traffic passing through the network switches was in line with expectations for a Thursday morning.
During the investigation, we noticed that one of our database servers was lagging behind the master server. As a routine measure, we diverted all traffic away from the lagging server.
At that time, the theory that best fitted the symptoms was a network problem outside our hosting center, but no such problem could be found, and the theory did not explain why the problem seemed to be isolated to iOS-based apps. We would expect a network problem to affect all clients equally.
Having tried all other avenues of investigation, we returned to the problem of the lagging database server. If the lag grows too big, inconsistencies start to appear between the backend servers and the frontend that serves the clients. Such inconsistencies usually disappear as the lagging server catches up and the system is designed to allow for a small amount of lag.
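The idea of tolerating a small amount of lag while diverting traffic away from a server that has fallen too far behind can be sketched roughly as follows. This is an illustration only, not Visiolink's actual routing logic: the function name, the server names, and the 30-second threshold are all assumptions made for the example.

```python
def servers_within_tolerance(lag_seconds_by_server, max_lag_seconds):
    """Return the database servers whose lag is within the allowed tolerance.

    lag_seconds_by_server maps a (hypothetical) server name to how many
    seconds it is behind the master server.
    """
    return sorted(
        name
        for name, lag in lag_seconds_by_server.items()
        if lag <= max_lag_seconds
    )


# Hypothetical measurements: db-2 has fallen far behind the master.
lag = {"db-1": 2.0, "db-2": 5400.0, "db-3": 0.5}
usable = servers_within_tolerance(lag, max_lag_seconds=30.0)
print(usable)  # ['db-1', 'db-3']
```

Servers that drop out of the returned list would receive no traffic until they have caught up again.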
While the server team worked on the database problem, an iOS app was examined. It was determined that errors in the file that contains the article content and the publication’s metadata could result in the symptoms we were seeing. Furthermore, it was determined that those errors would not prevent the app from downloading the pages; they only affected how the pages were displayed.
Combining this with the server team’s investigation, we concluded that the problem was caused by errors in cached metadata served by the frontend servers. Solving the problem simply required us to reset the cache.
Our alarm system has two stages that are monitored: the alarm itself and a corresponding "recovery" alarm.
The alarm system monitoring the database sent two alarms simultaneously on Wednesday evening: one warning about high load on the database server and one about the database server lagging behind. Shortly thereafter, a recovery alarm was sent. The person monitoring the alarms did not realize that we never received a recovery message for the lag issue. That allowed the lag to build slowly through the night.
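One way to make a missing recovery message visible is to track every alarm separately until its own recovery arrives. The sketch below is a minimal illustration under stated assumptions, not a description of our monitoring system: the class, the check names, and the assumption that the recovery received belonged to the load alarm are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmTracker:
    """Tracks which alarms are still waiting for a recovery message."""

    open_alarms: set = field(default_factory=set)

    def on_alarm(self, check: str) -> None:
        # An alarm stays open until a recovery for the same check arrives.
        self.open_alarms.add(check)

    def on_recovery(self, check: str) -> None:
        # A recovery only closes its own alarm, never a sibling alarm.
        self.open_alarms.discard(check)

    def unrecovered(self) -> list:
        return sorted(self.open_alarms)


tracker = AlarmTracker()
tracker.on_alarm("db-high-load")        # hypothetical check names
tracker.on_alarm("db-replication-lag")
tracker.on_recovery("db-high-load")     # assumed: recovery was for the load alarm
print(tracker.unrecovered())  # ['db-replication-lag']
```

Listing the alarms that are still open, rather than relying on a person to pair alarm and recovery messages by eye, would have shown that the lag alarm never recovered.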
The root cause of the lag issue was a faulty configuration that ended up generating excessive updates to the database.
Most of our alarms repeat periodically until the problem is handled. The alarm warning about the lag issue did not, since it was considered less critical when the alarms were configured. This has now been changed.
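The repeat-until-handled behaviour can be illustrated with a small sketch. The data structure, the function name, and the 15-minute interval are assumptions chosen for the example and do not reflect the actual alarm configuration.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    name: str
    last_notified: float  # seconds on an arbitrary clock
    handled: bool = False


def alarms_to_resend(alarms, now, repeat_interval):
    """Return unhandled alarms whose last notification is at least repeat_interval old."""
    return [
        alarm
        for alarm in alarms
        if not alarm.handled and now - alarm.last_notified >= repeat_interval
    ]


# Hypothetical state: the load alarm was handled, the lag alarm was not.
alarms = [
    Alarm("db-high-load", last_notified=0.0, handled=True),
    Alarm("db-replication-lag", last_notified=0.0),
]
due = alarms_to_resend(alarms, now=1800.0, repeat_interval=900.0)
print([alarm.name for alarm in due])  # ['db-replication-lag']
```

With such a rule applied to every alarm, an unhandled lag alarm would keep resurfacing instead of being sent only once.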
The symptoms persisted after the underlying database problem was handled because faulty data ended up being permanently cached. To prevent this, we will change the caching policy to ensure that even supposedly static content will be updated.
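A common way to ensure that cached content is eventually refreshed is to give every entry a time-to-live (TTL). The following is a minimal sketch of that general technique, assuming a 300-second TTL and a hypothetical `fetch` callback. It is not the caching policy actually deployed on the frontend servers.

```python
import time


class TTLCache:
    """Cache whose entries expire, so stale or faulty data is eventually refetched."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the example can control time
        self._entries = {}           # key -> (value, stored_at)

    def get(self, key, fetch):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]          # still fresh: serve from cache
        value = fetch(key)           # expired or missing: go back to the origin
        self._entries[key] = (value, now)
        return value


# Simulated clock and origin store, both hypothetical.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])
origin = {"metadata": "faulty"}

first = cache.get("metadata", origin.get)   # fetched from origin: "faulty"
origin["metadata"] = "fixed"
second = cache.get("metadata", origin.get)  # within TTL: still "faulty"
fake_now[0] = 301.0
third = cache.get("metadata", origin.get)   # TTL expired: "fixed"
```

The trade-off is extra load on the origin for content that rarely changes, in exchange for a bound on how long faulty data can be served.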
The issue causing the database lag has been resolved, and a guard has been put in place that prevents reoccurrences.