Created Saturday, Mar 25th 2023 01:09Z, last updated Saturday, Mar 25th 2023 01:28Z
Yesterday, Mar 24th 2023, at about 19:00Z I noticed that the file system of the main server had become faulty; it eventually crashed, and The Aviation Herald went offline for about 5.5 hours as a result.

In the meantime I was able to establish that a hard drive, part of a RAID5 array, had failed in a very weird manner: the RAID controller did not detect the fault and kept that drive, with its invalid data, in the array, which caused all these file system faults. I removed that drive and put a new one in. With everything prepared for a disaster recovery (restoring the server entirely from scratch from backup and accepting the loss of the day's data), I booted the server from its internal RAID system as a test.

The secondary server had also gone down as a result, which I cannot explain. I brought it up again and, with all the file system checks running, it was back online after an hour and resumed its tasks.

And the main server booted as if nothing had happened, only slower than normal as the RAID controller was already busy rebuilding the data to restore redundancy. All data seemed to be there. However, I couldn't get the network to respond: no data seemed to go out or come in, and at first I had no explanation. I checked whether one of the network-related files might have become corrupted, but everything looked good. After several hours of such checks on the server, which had been the most likely source of the failure, I finally decided to declare the network switch suspect and connected the server's LAN cable to a different switch, and voila: the server came online on the LAN. In the end I rebooted the network switch and everything worked again as before, including the server. I conclude that, in the confusion caused by the failed file system, the server must have confused the network switch to the extent that the switch crashed internally, which also took the secondary server offline. (After a reboot the secondary server was able to come online again, unlike the main server, which could not reconnect until the switch had been rebooted.)

After the failure at about 19:00Z on Mar 24th 2023, the server came back online at about 00:30Z on Mar 25th. Fortunately, no data have been lost. In the meantime the data on the new hard drive have been rebuilt and full redundancy has been restored. Phew!

One good thing came out of this, too: a scheduled maintenance can now be cancelled. I had planned to upgrade the server's memory, and that has now been done as the server was down anyway.