Postmortem: System disturbances April 10 2013

2020-09-15 11:30:04 UTC

Please find a full postmortem of today's service disruption below. All systems are now fully operational - for further questions or concerns, please contact 

07:36 First warning from a front end server

07:40 Primary storage fails completely

07:45 Secondary storage is deployed in read-only mode

07:48 Services on device are partially restored - 1 of 4 machines fails to recover automatically

07:50 The primary storage fails to reboot - secondary storage is switched to read-write mode (and becomes primary storage)

07:50 E-pages is fully restored for catalogues outside a paywall

07:50 fails to recover - leaving E-pages behind the paywall inaccessible

08:45 The final device machine was manually recovered

08:53 We see random errors on device and are investigating

08:59 The device errors are caused by a malfunctioning data sync. This is regrettably due to human error during the earlier recovery.

09:03 is fully recovered

09:35 We have reverted device to a 07:30 snapshot, and all papers within the affected time frame have been reprocessed to ensure all data is up to date. Device is fully operational and there are no further errors.

09:37 SQL is seeing excessive load because caches went cold during the downtime

09:40 SQL-traffic is normalizing but response times are still elevated

09:43 is warning about reduced performance - we determine the cause to be the slower SQL server

09:48 is running normally 

09:50 SQL load is completely normalized
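
For reference, the elapsed time between entries in the timeline above can be worked out as follows. This is a minimal sketch: the timestamps are copied from the timeline, while the dictionary keys and the `minutes_between` helper are hypothetical names used only for illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all on the same day, HH:MM).
# The keys are illustrative labels, not official event names.
EVENTS = {
    "first_warning": "07:36",
    "primary_storage_failed": "07:40",
    "final_machine_recovered": "08:45",
    "sql_load_normalized": "09:50",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Minutes from the storage failure until SQL load was back to normal.
total = minutes_between(
    EVENTS["primary_storage_failed"], EVENTS["sql_load_normalized"]
)
print(total)
```

By this measure, 130 minutes passed between the primary storage failure at 07:40 and the last timeline entry at 09:50.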


Steps forward

Normally we expect to be fully operational in a matter of minutes, even after a storage failure. Today several things went wrong, preventing a full recovery despite storage recovering within a normal time frame. We are going through each specific issue with our provider to identify causes and take steps to prevent a repetition.

Furthermore, we have been testing a new storage setup for a couple of weeks, after evaluating different solutions since January. The planning so far has targeted a late Q2 rollout. We will determine whether we can push some of this work forward, but will of course not jeopardize our testing procedures. The new setup handles failure in a way that is completely transparent to the servers depending on the storage. This eliminates most of the recovery steps we need to go through today and thus greatly reduces the risk of failures during recovery.
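
To illustrate what "transparent" failure handling can mean for a server that depends on storage, here is a minimal sketch. It does not describe the actual new setup, whose design is not covered in this postmortem. The class names `FailoverStorage`, `InMemoryBackend`, and `StorageUnavailable` are hypothetical, and the sketch covers reads only.

```python
class StorageUnavailable(Exception):
    """Raised when a storage backend cannot serve a request."""


class InMemoryBackend:
    """Stand-in for a real storage backend (hypothetical, for illustration)."""

    def __init__(self, data, healthy=True):
        self.data = dict(data)
        self.healthy = healthy

    def read(self, key):
        if not self.healthy:
            raise StorageUnavailable("backend down")
        return self.data[key]


class FailoverStorage:
    """Tries each backend in order, so callers never handle failover themselves."""

    def __init__(self, backends):
        self.backends = list(backends)

    def read(self, key):
        last_error = None
        for backend in self.backends:
            try:
                return backend.read(key)
            except StorageUnavailable as exc:
                # Remember the failure and move on to the next backend.
                last_error = exc
        raise StorageUnavailable("all storage backends failed") from last_error


# The primary is down, yet the caller still gets its data from the secondary.
primary = InMemoryBackend({"page": "v1"}, healthy=False)
secondary = InMemoryBackend({"page": "v1"})
storage = FailoverStorage([primary, secondary])
print(storage.read("page"))
```

In this sketch the calling code only ever talks to `storage`, so a failed primary requires no manual switch on the caller's side.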
