Please find a full postmortem of today's service disruption below. All systems are now fully operational. For further questions or concerns, please contact [email protected]
07:36 First warning from a front end server
07:40 Primary storage fails completely
07:45 Secondary storage is deployed in read-only mode
07:48 Services on device are partially restored - 1 of 4 machines fails to recover automatically
07:50 The primary storage fails to reboot - secondary storage is switched to readwrite (and becomes primary storage)
07:50 E-pages is fully restored for catalogues outside a paywall
07:50 payment.e-pages.dk fails to recover - leaving E-pages behind the paywall inaccessible
08:45 The final device machine is manually recovered
08:53 We see random errors on device and are investigating
08:59 The device errors are caused by a malfunctioning data sync. This is regrettably due to human error during the earlier recovery.
09:03 payment.e-pages.dk is fully recovered
09:35 We have reverted device to a 07:30 snapshot, and all papers within the affected time frame have been reprocessed to ensure all data is up to date. Device is fully operational and there are no further errors.
09:37 SQL is seeing excessive load because caches went cold during the downtime
09:40 SQL traffic is normalizing but response times are still elevated
09:43 payment.e-pages.dk is warning about reduced performance - we determine the cause to be the slower SQL server
09:48 payment.e-pages.dk is running normally
09:50 SQL load is completely normalized
Normally we expect to be fully operational in a matter of minutes, even after a storage failure. Today several things went wrong, preventing a full recovery despite storage recovering within a normal time frame. We are going through each specific issue with our provider to identify causes and take steps to prevent a repetition.
Furthermore, we have been testing a new storage setup for a couple of weeks now, after having evaluated different solutions since January. The planning so far has targeted a late Q2 rollout. We will determine if we can push some of this work forward, but will of course not jeopardize our testing procedures. The new setup handles failure in a way that is completely transparent to the servers depending on the storage. This eliminates most of the recovery steps we need to go through today and thus greatly reduces the risk of failures during recovery.