The flow of NWS data stopped due to a problem at the NWS at about 8:30 AM on 29 March 2023. Will update this news item once the upstream issue is fixed. This impacts a large number of IEM products :)
Update 10 AM 29 March: The NWS data flow resumed at about 9:15 AM and thankfully no data loss appears to have happened with products queued for transmission.
A water leak drowned a rack of IEM servers and there's much pain to be had at the moment. I am unsure yet how I can get things back going, but am trying! Thanks for your patience.
Update 9 PM 24 March: Oye, what a horrible mess. Firstly, thanks for your patience today and for the help the folks within Agronomy IT provided me. Most things should be running OK now, but there are about four datasets and services that are degraded. Those include MRMS data, mtarchive stuff, an auxiliary mtarchive data service, and some local research data service. Two network switches were shorted out and I am down a bunch of compute capacity at the moment. In general, please do reach out to me if you find something that is not working. I have deployed a boatload of band-aids over the past 11 hours, so we shall see how many of these hold up.
Update 8 AM 27 March: Recovery efforts will continue today and there will likely be some brief outages of this website and related services as repairs continue and temporary workarounds are undone. Will itemize progress here today. Thanks again for your patience.
Update 12 PM 27 March: The MRMS service has been recovered. There was some unavoidable data loss during the 24th, but NOAA now has an AWS Archive that is much more usable than mine anyway!
Update 10 PM 27 March: The MTArchive service has almost been fully recovered and should have a small hole repaired by mid-day tomorrow. It also had some unavoidable data loss during the outage on the 24th.
Update 11 PM 27 March: The SMOS dataflow has been fixed.
Iowa State University experienced a network routing issue between about 7:45 AM and 1:15 PM on 8 March 2023. The primary impact for the IEM was inability to reach some internet data sources to retrieve data for processing and archival. The website and NWS data flow operated normally during this time. Will continue to review to see if there are any data holes that can be repaired.
The National Weather Service (NWS) is having an ongoing data flow issue preventing the relay of observational datasets and a few others. This issue started around 6 AM 15 October 2022. I'll update this news item once the outage is resolved.
Updated: Data flow was restored just after 8:15 AM. It appears missed data was retransmitted, so no data loss.
The National Weather Service is experiencing an ongoing data flow outage impacting things like IEM NEXRAD composites, etc. No ETR at the moment and will update the news item as things progress.
Updated 8:40 PM: The outage lasted between about 3 PM and 5:15 PM CST. Depressingly, it appears all data was lost for that period. Will see what can be done though.
Final Update: Sadly, the data from this outage is generally unrecoverable. Sigh.
Iowa State University had a Domain Name Service (DNS) outage lasting between about 5 and 6 PM on 10 January 2022. This outage broke a number of data flows processed by the IEM, but services to external users should have still been available. Not much can be done in these situations other than thank you for your patience!
Since yesterday (5 January 2022), the NWS has been experiencing a significant outage of its College Park, MD datacenter. This outage is causing various NCEP model datasets and others to not be available for processing. It is difficult to quantify the troubles this is causing IEM processing, but please be aware if you see various products in an error state :( The current NWS ETR is tomorrow (7 January 2022) morning.
Updated 6 Jan 2 PM: The NOMADS NCEP site is working and I have moved some processing over to it, so more data is flowing now than before. The NWS has updated that the College Park outage is now hoped to be fixed by Saturday morning.
Updated 10 Jan 9 AM: Everything should be back to normal now.
There was a partial outage of the IEM website between about 7:30 and 8:15 PM on 23 September 2021. A likely innocent web API user sent a few thousand simultaneous requests at a brittle website component that ended up exhausting resources for the Fast-CGI PHP server. I have added some additional code to help monitor this and keep the brittle resource from being overwhelmed. Sorry about the outage and thanks for the patience!
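As a rough illustration of the kind of guard described above, here is a minimal Python sketch that caps the number of simultaneous requests reaching a fragile component and counts the ones it turns away. This is not the actual IEM code; the class name `ConcurrencyGuard` and the `max_in_flight` parameter are hypothetical, and a real deployment would more likely enforce the limit in the web server or FastCGI configuration.

```python
import threading


class ConcurrencyGuard:
    """Cap simultaneous calls into a fragile backend (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.rejected = 0  # simple counter that monitoring could poll

    def run(self, handler, *args):
        """Run handler if a slot is free, otherwise count it and return None."""
        # Non-blocking acquire: shed load instead of queueing more work
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            return None
        try:
            return handler(*args)
        finally:
            # Always free the slot, even if the handler raised
            self._slots.release()
```

A caller would treat a `None` result as "busy, try again later" (for example by returning an HTTP 503), so this sketch assumes handlers never legitimately return `None`.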
There are two maintenance outages scheduled for 24 May 2021 as summer time means disruptive changes are done!
ISU Network Engineers plan to do network updates between 5-7 AM with possible network outages during this time. The primary impact for the IEM would be you can't reach the website and real-time data ingest may be delayed or fail.
The central storage mtarchive uses will be down for most of the day for yearly updates and patching. I have some workarounds in place to prevent data loss, but various services that use that archive will be failing. This is not too impactful for the IEM nor its services.
I'll update this news item on Tuesday as these windows are cleared and services get restored. Thanks for your patience.
Updated 9:30 AM: The network outage lasted between 5:30 and 6:10 AM this morning with no known impacts other than some lost data from live platforms, like web cams. The mtarchive downtime is now ongoing and will update this news item later today. Thanks for your patience.
Updated 11 AM: The mtarchive outage is over, but a filesystem scrub operation is ongoing and typically takes a few days to complete. The operation causes some performance issues in the interim.
There was a significant National Weather Service network outage last night that impacted a number of data flows into the IEM. I am still attempting to assess the extent of the various data holes, but am on the case! Will update this news item once more is known.
Updated 10 AM: Not much for details has been released for this outage. There are various data holes that are likely not repairable as there is no upstream source. Oh well.
There was an outage of the NWS NOAAPort broadcast (which the IEM uses as the data source for many NWS products) between about 2 and 4 AM this morning (20 March). Unfortunately, the NWS indicates that all data during this period was lost, so alas. I will try some tricks to fill in various holes left, but some of them are not fixable. Sorry for the outage.
The IEM processes one minute interval data from ASOS sites via two sources. The near real-time source is the MADIS One Minute ASOS feed which provides data at five minute intervals (confusing eh?). The slightly delayed feed comes from NCEI. Both feeds have been somewhat struggling over the past month, but the MADIS one has been mostly down.
A contact I have at MADIS said that the issue is upstream of them and they continue to actively request a resolution. I'll update this news item once I hear more information about this data feed.
Update 10 Dec 2020: The MADIS data flow appears to be more stable now, but the folks at NCEP IDP are unsure if the upstream issue has been fully resolved or not. They continue to monitor.
Update 14 Dec 2020: The MADIS data flow appears to have stopped again, no updates as to what is happening.
Update 29 Jan 2021: The NCEP IDP folks believe this issue to be resolved.
Nothing short of a crazy day today. I'll update this news item frequently as I make progress restoring the IEM and filling the data void left today due to the Ames/ISU power outage from the Derecho of 2020!
11:21 PM: Power is back and services are getting back on their feet. Lots of issues yet and data holes to fill.
Updated 11 Aug 12:04 AM: The IEM should be functioning again for realtime requests. The AFOS (text product) database hole should now be filled. This is the first step in repairing other datasets of this genre. The extent of the outage was from about 11 AM this morning until 10 PM this evening.
Updated 11 Aug 12:22 AM: Some ISU ITS managed virtual machines remain offline and there are cooling capacity issues in the central data center. This marginally impacts the IEM, but some archiving is not working at the moment due to it.
Updated 5:30 AM: No power at my house is making for a slow go with getting more of the data holes repaired. Will do what I can and keep this news item updated.
Updated 10:50 AM: Maybe I now have the hole in the VTEC (NWS Watch Warning Advisory) database and Local Storm Report (LSR) database filled. It is difficult to tell with my poor network working conditions at the moment.
Updated 8:45 PM: The logistical nightmare continues and repairs to the holes are slow going. Perhaps I have the RIDGE imagery backfilled now and will soon have the NEXRAD composites repaired. It is hopeless for me at the moment to catch up on emails and twitter mentions. Thanks for your patience.
Updated 12 August 4:30 AM: I believe the hole with the ASOS/METAR data should now be filled. I am getting questions about peak gusts in Iowa for the event and I can not really answer them until I get these holes repaired. Filling the RWIS one may not be possible as I am unsure if the DOT, who is located in Ames, was able to collect data during the Ames power outage. I will reach out to them soon.
Updated 8:05 AM: Repaired NEXRAD storm attribute archive hole.
Updated 10:35 PM: Not much for updates to share. I continue to struggle with logistics due to no power and Internet at home. Thankfully, power and cooling at work have been stable.
Update 13 Aug 12:00 PM: I have power and internet at home now, so can get after this task some more!
Update 13 Aug 4:00 PM: A previous backfill of VTEC / Storm Based Warning products did not go as planned. This was cleaned up and should be all square now.
Update 16 August 2:20 PM: I have now repaired the hole with the SPC Day 1 outlooks and MCDs.
One of the servers in the IEM computing cluster suffered a fatal power supply failure on the evening of 7 April 2020. Sadly, it happened just after I went to bed and I slept through the alarms until I awoke at 2:30 AM (coronavirus quarantine and working from home has my sleep schedule all wonky). I trucked it into work and got it repaired shortly after 4 AM. I am still assessing the data holes that may be present from this outage and will update the news item with what I figure out. Sorry for the troubles!
Update 10 AM: The missed RIDGE imagery has been generally fixed with the most commonly used products processed.
The IEM website has suffered a number of puzzling outages over the past two days. I think I have finally isolated the problem to an Apache web server configuration issue. I recently migrated the web farm to Red Hat Enterprise Linux 8 and failed to apply the "event" MPM configuration vs the default of "prefork". Hopefully the bumpiness is behind us now and things will be back to performant normal!
Thanks for your patience!
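One way to catch this sort of drift is to check which MPM Apache reports after a migration. The sketch below is only an illustration, not something from the IEM setup: it parses the "Server MPM:" line that `httpd -V` prints, and the function name `parse_server_mpm` is made up for this example.

```python
import re


def parse_server_mpm(httpd_v_output):
    """Return the MPM name found in `httpd -V` output, or None if absent."""
    # `httpd -V` includes a line such as "Server MPM:     event"
    match = re.search(r"^Server MPM:\s*(\S+)", httpd_v_output, re.MULTILINE)
    return match.group(1).lower() if match else None
```

The text to parse could come from `subprocess.run(["httpd", "-V"], capture_output=True, text=True).stdout` on a host where the `httpd` binary is on the path, and a monitoring check could then alert when the result is not the expected "event".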
An important disk unexpectedly ran out of inodes and caused a strange cascading failure that knocked out the IEM website between about 5 AM and 8:30 AM on 16 December 2019. I thought I had things under control shortly after the issue started, but the cascading failure caused the NAT gateway to not properly operate and things went downhill for a Monday morning after that. Will boggle this all and try to figure out how to keep it from happening again! Thanks for your patience.
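Inode exhaustion is easy to miss because ordinary free-space checks still look healthy. As a minimal sketch of one way to watch for it (an assumption about what such monitoring could look like, not the IEM's actual check), Python's `os.statvfs` exposes the total and free inode counts on Unix-like systems. The function name `inode_usage_percent` is hypothetical.

```python
import os


def inode_usage_percent(path):
    """Return percent of inodes in use on the filesystem holding path."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        # Some filesystems do not report a fixed inode count
        return None
    used = stats.f_files - stats.f_ffree
    return 100.0 * used / stats.f_files
```

A cron job could call this for each important mount point and send an alert when the value crosses some threshold, such as 90 percent.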
One of the back end file servers failed horribly just after 7:30 PM this evening (1 December 2019) taking with it a fair amount of data and processing services. Am attempting to get bandages put on the various data flows before assessing if there is any hope in the morning for fixing it. Will update this news item tomorrow with an update. Thanks for your patience.
Update 8:30 AM 2 Dec 2019: Thanks to some help from a colleague onsite last night, we were able to cold restart a file server and get it back online in a degraded state. This morning, I replaced the failed hard drive and the system is back rebuilding redundancy. Everything should be functioning normally, but if you see trouble, please let me know!
An important backend network file service jammed up early this morning and led to various cascade failures of IEM services. We should be back to normal now and will be reviewing any data holes for repair. Sorry for the troubles and thanks for the patience.
The backend database server for the IEM was updated yesterday to PostgreSQL version 12.0 and PostGIS 3.0beta4. While the upgrade went smoothly, there are some performance issues being had currently. I'll update this news item once the issues have been resolved. Thanks for your patience.
Update 11 AM 13 Oct 2019: Had some trouble this morning with the secondary database being overwhelmed with connections. I am still reorganizing the backend databases to work around some troubles found with the recent upgrade. The fun never stops with this and I am thankful for your patience.
Update 9 PM 15 Oct 2019: Things have generally been stable, but am still not running the configuration I would like. Am waiting for some new upstream packages to be made available and will see if the performance issues go away with them.
The IEM website was degraded from about 3 AM till 8 AM this morning due to an ugly cascading failure. The current theory is that a Red Hat Enterprise Linux 7 bug caused an NFS client with a failed memory DIMM to lock up an important NFS server. This server then jammed until the client was physically restored by me. The fun never stops. Hopefully this won't happen again! Thanks for your patience.
ISU Networks, Operations, and Communications will be doing building network switch replacement between 3 and 7 AM on 26 September 2019. A number of network outages lasting up to an hour are expected during this window. I'll update this news item once the work is completed. Thanks for your patience.
Update 26 Sep, 7:11 AM: We are getting back to normal and will be attempting to repair various data holes created while the network was out. Please let me know of any issues you are seeing.
At about 12:30 PM CDT on 20 July 2019, much of Ames and Iowa State University lost power. This took out all IEM computing resources :) At about 2:03 PM, power was restored and about 2:30 PM I was able to get all IEM computing back on its feet. There are still some sick data flows and services and will update this news item as things are repaired. Thanks for your patience.
Update 8:30 PM: After great gnashing of teeth, everything should be back to normal now. A data hole still exists that I will repair over the coming days. Thanks for your patience.
Update 22 July, 8:00 AM: Issues continue to be found and fixed, but please let me know if there is something still broken. At this point, figuring out what is broken is the toughest portion of the battle.
There will be a brief outage of Internet for the IEM starting at about 6 AM on 3 July 2019. The outage will hopefully resolve some bandwidth issues that have been plaguing the web farm and IEM services for a number of months now. Should be back up within 20 minutes. Will update the news item once completed. Thanks for your patience.
Update 6:21 AM: The outage was from about 5:57 AM till 6:08 AM. Sadly, the underlying issue was not resolved with this outage.
I just woke up and am trying to collect information, but ISU suffered some sort of power dump overnight and I have a mess to clean up. Will post updates here as I get things repaired!
Update 9:30 AM: The servers are all back on their feet, but I have a small data hole to plug with the HADS and METAR data. Will update this news item again once that is repaired.
Update 3 PM: The SHEF and METAR holes should be repaired and no known issues remaining.
The Iowa DOT provided RWIS data has not updated since Sunday (8 Apr 2019) morning. There is some issue with the data flow on their end that is being actively worked on. Will update this news item once the feed has been resolved.
Update 9 PM: Data flow was fixed about 3 PM this afternoon.
The Mesonet Level II service provides open access to the National Weather Service "Level II" NEXRAD data. This service is being moved on 8 January 2018 to a new virtual home in the data center.
Firstly, there will be an outage of some duration starting around 8 AM 8 Jan 2018 as a physical move happens and then network reconfiguration happens. Hopefully this outage can be limited in time. I will update this news item once done.
Secondly, the service will be available from a different IPv4 and IPv6 address. The domain name will not change, but perhaps some folks have IP based firewalls in place. The new IP addresses will be
188.8.131.52 and 2610:130:108:480::3.
This service will also default to HTTPS going forward.
Hopefully all of these changes are transparent to users. Famous last words! :)
Updated 8 Jan, 10:45 AM: Some network complications were found prior to the move starting, so the old service remains running for now until attempt number two is hopefully made later today. Thanks for your patience.
Updated 8 Jan, 2:45 PM: I believe we are up and stable with the new setup. If you are having trouble accessing, please let me know!
There is an ongoing outage of data from the National Weather Service. Will update this news item once it is resolved. No word from the NWS what the issue is or when it will be fixed.
Update 2:20 PM: The NWS reports the issue to be resolved, but there was a significant hole that does not appear to be possible to backfill.
The data flow from the NWS stopped at about 11:30 AM 17 July 2017. No ETA on a fix. This impacts lots of IEM services. Will update the news item once it is repaired.
Resolved as of 12:07 PM.
The flow of National Weather Service data has been down for the past few hours. This is a nationwide outage of their satellite system, so there are lots of folks in this same boat. Will update this message once it is resolved upstream.
Update 3:30 PM: Data is flowing again and the extent of the outage was from approximately 12 to 3 PM. I suspect some data is lost forever.
There has been a fire at some location important to the relay of METARs to the world. This relay is down until further notice, so it appears many METAR / Airport / ASOS+AWOS sites will be unavailable in the interim! Will update this NEWS item with any further details I get.
Here's a plot of my monitoring showing the downturn in available sites:
Updated 9 PM: Most sites are back now, but am unsure of when the full restoration will happen. There was a problem with a telco location in Omaha that caused this outage.
Iowa State University lost Internet access Sunday, 20 Nov 2016, between 6:10 and 8:21 PM CST. An outage of this duration does cause data loss for the IEM project, but I will make an attempt to repair some of the holes caused. No word on why the border routers failed.
ISU Network folks will be doing maintenance on Sunday, 25 September 2016. They expect a number of outages during the day as they replace routers, etc. So availability to the IEM will be up and down during the day with actual outages not expected to last for too long.
Update 10 PM 26 September: So firstly, an apology for the prolonged issues that occurred. It did not help the situation that I was traveling and could not connect myself to the IEM servers during the outage. Full network service was restored around noon today. On Sunday, the 1 hour initial outage in the morning was known and expected. What happened after about 11 AM was not. There was some router config issue that was preventing returning traffic from the IEM from reaching the clients during this time. When these types of issues happen, I post updates to my @akrherz Twitter account.
There was an ISU Internet outage this morning that prevented most folks from accessing the IEM website between 5:00 and 5:27 AM this morning (5 April 2016).
The IEM was unavailable this morning (3 March 2016) between 4:30 and 5:03 AM due to scheduled network maintenance.
There was a rather large METAR data outage last night due to issues at the Federal Aviation Administration (FAA). The FAA collects the airport weather station data (in METAR format) and disseminates it to the National Weather Service, from whom the IEM collects the data. It is doubtful that I'll be able to repair the hole in the archive from this outage.
The outage lasted from just before midnight this morning to about 5:30 AM. Here's a plot from my monitoring showing global METAR station counts with the number of stations reporting within the past hour plotted.
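The metric described above, the number of stations reporting within the past hour, can be sketched in a few lines of Python. This is only an illustration of the idea and not the monitoring code behind the plot; the function name `count_recent_stations` and the dictionary layout (station identifier mapped to the time of its latest report) are assumptions.

```python
from datetime import datetime, timedelta, timezone


def count_recent_stations(last_report, now, window=timedelta(hours=1)):
    """Count stations whose latest report falls within the window before now."""
    cutoff = now - window
    return sum(1 for ts in last_report.values() if cutoff <= ts <= now)


# Made-up example data: station identifier -> time of latest report (UTC)
now = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
latest = {
    "AAA": now - timedelta(minutes=10),
    "BBB": now - timedelta(minutes=55),
    "CCC": now - timedelta(hours=3),  # stale, so not counted
}
print(count_recent_stations(latest, now))
```

Computing this count on a regular schedule and plotting it over time makes an upstream outage show up as a sharp drop.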
The IEM website was very slow or unavailable for a period between about 3:20 and 4:30 PM on Monday, 27 April 2015. This was due to a cascade failure as a backup database server flooded a file server with IO requests and that slowed down another process that reads data from that server. Oye. The primary database server is about an order of magnitude faster than the backup server, so write loads the primary server generates sometimes slow down the backup server.
I am moving the backup database instance to a different disk system to prevent this from happening in the future. Thanks for your patience.
The IEM website was very slow or unresponsive between 1-2 PM on Thursday, 16 April 2015. I was sitting in a meeting at the time and did not immediately notice that bad things were happening. The issue was a cascading failure of my tilecache service due to a failure to generate N0R RADAR Composites during this time. The lack of current radar data was causing most of the incoming requests to bypass a caching layer and instead hit the mapserver backend, which was quickly overwhelmed with work.
I have made some changes to hopefully prevent this from happening again. Thanks for your patience.
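One common mitigation for this failure pattern is a cache that keeps serving the last good result and backs off for a while when the backend cannot produce a fresh one, so that requests do not all fall through to the backend at once. The Python sketch below illustrates that general technique under stated assumptions; it is not the change actually made to the tilecache service, and the class name `StaleTolerantCache` along with its `ttl` and `retry_after` parameters are hypothetical.

```python
import time


class StaleTolerantCache:
    """Cache that backs off after failures and serves stale values (sketch)."""

    def __init__(self, render, ttl=300.0, retry_after=30.0, clock=time.monotonic):
        self._render = render            # expensive backend call; may raise
        self._ttl = ttl                  # seconds a value counts as fresh
        self._retry_after = retry_after  # seconds to wait after a failure
        self._clock = clock
        self._store = {}                 # key -> (value, time stored)
        self._next_try = {}              # key -> earliest time to retry

    def get(self, key):
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh hit, backend untouched
        if now >= self._next_try.get(key, float("-inf")):
            try:
                value = self._render(key)
                self._store[key] = (value, now)
                return value
            except Exception:
                # Back off so repeated requests do not pile onto the backend
                self._next_try[key] = now + self._retry_after
        # Serve the stale value if one exists, otherwise None
        return cached[0] if cached is not None else None
```

The trade-off is that users may briefly see an older image, which is usually preferable to an overwhelmed backend taking the whole site down.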
There are unknown issues with ISU's network this morning causing trouble with various IEM services. I will update this news item as I find things out. The most noticeable impact was for users being unable to connect to the Level II radar server.
Update 8 PM: It took a while to get everything back on the network after ISU had DHCP server issues overnight. We should be back at full strength now. Thanks for your patience.
ISU lost Internet access for a number of minutes a bit after 2 PM today. The local IT folks say a soon to be replaced router failed.
Internet for the entire ISU campus was out between 2:47 and 3:22 PM today (10 October). No word on what the issue was or if we are stable now.
4:15 PM update: A network misconfiguration was made, resulting in the campus wide outage. We should be stable now though.
Our local network/Internet was down between 10:05 and 10:12 AM this morning. Unsure what is going on other than a local building router is sick.
Our upstream source of NWS RADAR data is currently off the Internets, so we are having an outage of RADAR products. Have not heard any details on what the issue is other than network outage!
Update 2:30 PM: The upstream source returned at 2:26 PM and I have mostly repaired the missed data during the outage.
There were two internet outages this morning as ISU continues work on the network backbone upgrades. There is a light at the end of the tunnel and the hope is that the work is mostly done now.
There was a brief Internet outage between 12:20 and 12:30 PM today as some local network issue occurred. I assume this is related to continued work on local upgrades.
We lost Internet connectivity around 10:10 AM. It is not clear what is currently broken, but I have hacked around it by hard coding a network routing path. Unsure how stable this config is, but things are working again at the moment (10:45 AM).
A routine ISU network maintenance event this morning for the Agronomy building did not go as hoped and resulted in a prolonged outage between about 6:50 and 7:30 AM this morning.