Created Thursday, Apr 17th 2025 11:28Z, last updated Saturday, May 10th 2025 19:58Z
Today (Apr 17th 2025) the server hosting the avherald.com domain unfortunately crashed in a very unusual way at 02:51Z. When I was alerted to the outage and tried to reboot the server, it would not come up and told me that my main password was invalid, so I could not even log in to repair the server (I am currently in England and had to reboot the server in Salzburg remotely). In the end I needed to boot the server remotely from a USB stick into a repair utility, which successfully repaired the file system (which had been damaged by the crash). After that my password worked again, and I was able to bring the server up at 11:04Z. By then I had already been considering how to get to Salzburg as quickly as possible, thinking through a possible disaster recovery, checking the last backups, and the like.

Again, there was no data loss involved.

I have no evidence that the crashes of Mar 5th and Apr 17th were caused by attacks against the server; it rather appears that a hardware fault may be involved. Hence I shall schedule a downtime in May, when I am back in Salzburg on my regular schedule, to replace the components that could be at fault (based on the history of problems so far, my primary suspects are the DDR RAM and the RAID controller).

My apologies for the outage. I cannot even promise that we won't see another outage before I am back in Salzburg on my regular schedule.

Update Apr 27th 2025 15:49 UTC: As expected there were a few more crashes in the meantime; I rebooted the server quickly each time. While I still have no evidence of an attack and have more indications of a technical fault, I cannot rule out a denial of service attack that leaves no evidence. I have therefore decided to update our Apache web server so that it works with the latest SSL software (there was a remote possibility of a denial of service attack against the SSL software in use), which forces us to go through a complete recompile and reconfiguration of the server. The new server software is already working fine on its testbed.

I expect that I will need to shut the public server down for about an hour to perform the software update. The update should begin shortly, still today (Apr 27th 2025).

Update Apr 27th 2025 16:21 UTC: The software update has been completed; the server is up and running again and open for public access.

Update Apr 28th 2025 07:07 UTC: The server crashed again during the night; I have just brought it back up. This now definitely rules out an attack. The only suspect now is the RAID controller, which will be replaced in about two weeks when I am back in Salzburg.

Update May 3rd 2025: In the very early hours of May 1st 2025 the server crashed twice within a few hours. By sheer luck, however, I was able to catch the first crash just as it developed and to collect evidence. The evidence clearly ruled out a hardware fault and identified a third-party software library as being at fault: it caused the file system to degrade, go into read-only mode (causing the web server to throw errors) and finally fail with a "Bus Error" and an "I/O Error" a few minutes later. I therefore spent all of May 1st 2025 replacing that library, finding all applications that used it, and making sure that they all work with the new library. This, fingers crossed, seems to have sorted the issue. It is still too early to declare this episode of repeated outages over, though I am very hopeful and by now even quite confident. Keep your fingers crossed, please, that this is indeed the end of the crashes!

Update May 9th 2025: Good news first: there will be a scheduled downtime for maintenance tomorrow at about 20:00Z to replace the RAID controller.

I had changed the logging strategy because I suspected that crucial evidence was being lost when the file system dropped to read-only, and because I had noticed on May 1st that all network connections remained fully alive. Coincidentally, however, the server stayed up and running after the software changes of May 1st until May 5th; then the crashes began to resurface.

Today there were two more crashes in short succession. This time I could collect all the evidence needed to finally and fully confirm that the RAID controller is at fault. The crashes were initiated by write errors to the HDDs, which prompted the file system to become read-only and the kernel to attempt a reset of the controller (the controller complained that there was no reason for a reset). The controller had no faults logged in its internal memory.
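
For readers curious how a file system that has dropped to read-only can be spotted, here is a minimal sketch in Python. It is not the monitoring used on the avherald.com server; it assumes a Linux system whose mount table follows the usual /proc/mounts layout (device, mount point, type, comma-separated options), and the function name and the sample mount table are hypothetical.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains the 'ro' flag.

    mounts_text is expected to look like the contents of /proc/mounts
    (an assumption of this sketch): one mount per line, with the fields
    device, mount point, file system type and comma-separated options.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Compare whole options, so 'errors=remount-ro' does not count as 'ro'.
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample data, for illustration only.
SAMPLE = """\
/dev/md0 / ext4 rw,relatime 0 0
/dev/md1 /var ext4 ro,relatime,errors=remount-ro 0 0
proc /proc proc rw,nosuid 0 0
"""

if __name__ == "__main__":
    print(read_only_mounts(SAMPLE))
```

On a live Linux machine the same function could be fed the text of /proc/mounts. Since a read-only file system cannot store new log entries locally, any alert would have to be sent over the network, which in the crashes described above remained alive.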

Update May 10th 2025 about 15:30Z: The server is up and running again with the new RAID controller in place. Following a server crash shortly after I arrived in Salzburg, I decided to carry out the replacement ahead of the originally scheduled time. All post-maintenance checks are now completed. Crossing fingers again!
