
CHPC OUTAGE: Jan 15, 2021 - network interruption this morning; additional outages between 1-2pm Today

Posted January 15th, 2021

Update 3:22 pm

CHPC has completed the vendor-recommended changes and access has been restored. The issues of the last two days, caused by inconsistencies in the two redundant circuits (details below), have been resolved.
 
Note that we are still working with vendors regarding issues seen over the holidays.  They have identified bugs in firmware and are working with us to get new software.  We are in a stable workaround situation at the moment and have been trying to bring services back online (such as the redundant circuit) while we wait for the fix.
 
We thank you for your patience as we continue to troubleshoot the new network equipment and ask that you report any access issues to helpdesk@chpc.utah.edu. The CHPC team will continue to perform quality assurance tests on all of the services that experienced problems.
 
Details: Two days ago, we brought back up a redundant circuit that we had taken down to isolate problems during troubleshooting. After that change, some services started to have intermittent, random connectivity issues, so we brought the circuit back down this morning to help mitigate them. Although we had previously tested bringing the redundant circuits up and down without issues, when we brought the circuit down this morning, services began dropping or timing out at random across the entire infrastructure, not just a few services. We brought the circuit back up, which helped the majority of the services but not all. We kept digging and contacted the vendor, who helped us isolate the problems to two inconsistencies between the routers. We repaired these inconsistencies by removing part of the configuration on one unit and reinstating the same configuration. We also added some additional configuration. These changes have reset and normalized the respective configuration databases.

 


CHPC has experienced two unplanned network interruptions this morning. 

The first was from approximately 7:00-8:45am as a result of removing a redundant link in order to address issues being observed on several isolated systems. As this change was not expected to cause an outage, we did not schedule a time or announce the change.

The second, at 11:45 am, lasted a couple of minutes and was due to a configuration change made while troubleshooting the first outage.

The CHPC networking team is in contact with the vendors of the networking hardware and is planning to make a vendor-recommended change between 1-2pm today. This change may temporarily break all connectivity.

We will send a follow-up message once we are done with the changes.

Last Updated: 12/17/24