CHPC Downtime - OS kernel updates on clusters
Posted: September 11, 2019
CHPC is scheduling a downtime on the clusters to update the kernel. Red Hat has indicated that the updated kernel version should resolve an issue we have observed recently: when a disruption between the file servers and a node prevents file I/O from completing, the node is left in a bad state (jobs hang or remain in the completing “CG” state at the end of a job) and must be rebooted.
In addition to the kernel update, we will be updating the NVIDIA drivers on kingspeak, notchpeak and redwood.
Part 1: Compute and interactive nodes on ember and notchpeak
Wednesday, September 25, 2019, starting at 7:30am
Reservations are in place to drain the batch queues by this time. We expect to have the clusters back in service in the afternoon.
We will then deploy this update on the remaining clusters, provided no problems arise as a result of the update.
Part 2: Compute and interactive nodes on lonepeak, kingspeak, tangent, ash, and redwood, including the atmos and meteo nodes
Tuesday, October 8, 2019, starting at 7:30am
Reservations to drain the batch queues will be put in place as we approach this date.
Please let us know if you have any questions or concerns.