30 Sep - 1 Oct Outages [resolved]

Link: https://mesonet.agron.iastate.edu

Between about 9:10 AM 30 September and 11:00 AM 1 October 2021, the IEM website had various outages and service degradation due to a wild set of circumstances. This news item will attempt to summarize the fun in hopes that I can remember this situation in the future!

Just after 9 AM on 30 Sept, I started to notice that the IEM webfarm was pushing out an inordinate amount of bandwidth relative to the number of hits it was serving at the time. Coincidentally, I was doing a major forklift of the webfarm to run a new Python software stack, and so I assumed that I was introducing problems with this new deployment. I did a whole bunch of iterations on this stack and kept assuming that I had something fouled up.

Fast forward to about 8 PM on 30 Sept, and I decided to pull the plug on this new software stack and revert to what was working before. After I reverted, the same troubles were still present. I then audited some more things and realized that the troubles had started about 30 minutes before I attempted the new software stack installation that morning. Oh dear!

Googling, I realized that at about 9:10 AM, the DST Root X3 root certificate used in the Let's Encrypt signing chain had expired. The IEM uses a Let's Encrypt certificate, but I thought it had been updated to use the new signing chain. The IEM certificate itself was in good shape, but it became clear that some web clients were not happy about this situation.
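To make the chain issue concrete, here is a minimal Python sketch of the kind of check a strict client effectively performs: if any certificate it sees in the presented chain is past its expiry date, the handshake is rejected, even when the site's own certificate is fine. The `expired_entries` helper and the chain list are hypothetical illustrations, not the IEM's actual configuration. Only the DST Root X3 expiry timestamp is the real published value; the other dates are placeholders.

```python
import ssl

def expired_entries(chain, now):
    """Return the names of chain entries whose notAfter time is at or before `now`.

    chain: list of (name, not_after) pairs, with not_after in the
           "Mon DD HH:MM:SS YYYY GMT" format used in certificates.
    now:   seconds since the epoch (UTC).
    """
    return [
        name
        for name, not_after in chain
        if ssl.cert_time_to_seconds(not_after) <= now
    ]

# Hypothetical chain as an old client might see it. Placeholder dates are
# marked; the DST Root X3 date is its real expiry time.
chain = [
    ("leaf (placeholder date)", "Dec 15 00:00:00 2021 GMT"),
    ("intermediate (placeholder date)", "Sep 15 00:00:00 2025 GMT"),
    ("DST Root X3", "Sep 30 14:01:15 2021 GMT"),
]

before = ssl.cert_time_to_seconds("Sep 30 14:00:00 2021 GMT")
after = ssl.cert_time_to_seconds("Sep 30 15:00:00 2021 GMT")

print(expired_entries(chain, before))
print(expired_entries(chain, after))
```

Newer TLS libraries can generally find an alternate path to a root that is still valid and ignore the expired one, which would explain why most clients saw no problem while a small set of old ones failed.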

I looked more closely at Apache and realized there were thousands of connections per second that were failing SSL handshakes, with the TCP socket connection being left open in some unknown error state. Numerous hours of debugging and tweaking the SSL certificate chain ensued without any success. Looking more closely at these failing connections revealed they were coming from the same software platform running a 10+ year old SSL software stack that did not like having the expired CA in the chain. The current belief is that there are about 1,000 of these platforms in the wild, perhaps running some sort of kiosk.
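The "same software platform" observation boils down to grouping the failed handshakes by client and seeing one group dominate. Here is a minimal Python sketch of that tally. The record format and the platform labels are hypothetical stand-ins for whatever identifying details the real server logs provide.

```python
from collections import Counter

def tally_failures(records):
    """Count failed TLS handshakes per client platform.

    records: iterable of (platform, handshake_ok) pairs, a hypothetical
             pre-parsed form of the server's connection logs.
    """
    failures = Counter()
    for platform, handshake_ok in records:
        if not handshake_ok:
            failures[platform] += 1
    return failures

# Made-up sample data for illustration only.
sample = [
    ("legacy-kiosk", False),
    ("legacy-kiosk", False),
    ("modern-browser", True),
    ("legacy-kiosk", False),
    ("modern-browser", True),
]

top_platform, top_count = tally_failures(sample).most_common(1)[0]
print(top_platform, top_count)
```

When a single platform accounts for nearly all of the failures, the problem is more likely on the client side than in the server deployment.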

This morning (1 Oct), I decided to abandon Let's Encrypt and get an InCommon-signed SSL certificate provided by Iowa State University in hopes of fixing this. After some iterations with the SSL chain, a stable combination was found that allowed the misbehaving software platform to successfully make HTTPS requests against the IEM server without generating a denial-of-service level of client traffic.

Woof, let us not do this again! Thanks for your patience and usage of the website.