2012 CHPC Downtimes and History

Unexpected Power Outage in Komas Data Center - September 27th approximately 10:00 a.m.

Posted: September 27, 2012

Duration: Unknown

Systems Affected/Downtime Timelines:
* Ember Cluster: ~10:00 a.m. - unknown
* Updraft Cluster: ~10:00 a.m. - ~4:20 p.m.
* Sanddunearch Cluster: ~10:00 a.m. - ~4:20 p.m.

Around 10 a.m. on September 27th, a breaker tripped in the CHPC Komas data center, cutting half of the power to the room and affecting our main HPC clusters, including Ember, Updraft and Sanddunearch. Maintenance was being performed on the UPS at the time; it was not expected to cause any problems.

Sanddunearch and Updraft came back up at about 4:15 p.m., but Ember is still having issues. CHPC staff are still troubleshooting as of 5:25 p.m.


Short GPU node outage on ember cluster (em513-em524) Monday, August 13, from 3 - 4 p.m.

Posted: August 6, 2012

Duration: Monday, August 13th, 2012 from 3 p.m. until 4 p.m.

Systems Affected/Downtime Timelines: GPU nodes on ember cluster: em513 through em524

We will be updating CUDA to version 4.2 on all 12 GPU nodes.

Updates were completed at about 4 p.m. Users of the GPU nodes should consider recompiling their code against CUDA 4.2.


EMERGENCY Power Outage affecting CHPC File Servers: starting immediately until repairs are effected

Posted: June 25, 2012

Systems Affected/Downtime Timelines:

All CHPC File Services will be down until repairs are completed.

The SSB machine room requires an emergency power outage to effect repairs to the flywheel generator. CHPC is taking all file systems offline during this power outage to protect data.

Scheduling on the HPC clusters will be paused, but the clusters will remain up.

The CHPC VM farms will also be taken offline because they are located in the SSB machine room.


CHPC Major Downtime: Tuesday June 19th, 2012 beginning at 7:00 AM - EXTENDED due to File Server hardware failure

Posted: June 5, 2012

Event date: June 19, 2012

Duration: From 7 a.m. June 19th until 8:30 p.m. June 20th

Systems Affected/Downtime Timelines:

During this downtime, maintenance will be performed in the datacenters, requiring many systems to be down most of the day. Tentative timeline:

  • HPC Clusters: 7:00 a.m. 6/19 - 8:30 p.m. 6/20
  • CHPC File Services: 8:00 a.m. 6/19 - 7:00 p.m. 6/20
  • Intermittent network outages: 8:00 - 9:00 a.m.


CHPC *MAJOR* Downtime scheduled for March 13, 2012 - /scratch/ibrix file systems to be PURGED!! Please plan ahead.

Posted: February 16, 2012

Event date: March 13, 2012

Duration: Varied by service

Systems Affected/Downtime Timelines:

  • Intermittent network outages between 7:45 a.m. and 9 a.m.
  • The HPC clusters will be down most of the day to effect data center maintenance.
  • All /scratch/ibrix filesystems will be purged. Users have been given early warning and ample time to clean their files from /scratch/ibrix/chpc_gen, /scratch/ibrix/icse_cap and /scratch/ibrix/icse_perf. Expect these filesystems to remain down until Friday March 16th.
  • Groups that have been contacted by CHPC about migration of home directories will not have access to their home directories until the final rsync is completed. We will contact each of these groups with expectations.

  • Desktops will see intermittent network outages in the early morning which should be completed by 9:00 a.m. at the latest.
  • Virtual Machines supported by CHPC will experience intermittent outages from 7:45 - 9:00 a.m.
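
Before a purge like the one described above, it can help to know how much data you still hold under a scratch directory so you can plan where to copy it. The following is a minimal sketch in Python, not a CHPC-provided tool: the function name summarize_tree is hypothetical, and it assumes you have read access to the directory you pass in.

```python
import os
import stat


def summarize_tree(root):
    """Return (file_count, total_bytes) for regular files under root.

    Hypothetical helper for illustration only. Symbolic links are not
    followed, and files that disappear or cannot be inspected during
    the walk are skipped rather than treated as errors.
    """
    file_count = 0
    total_bytes = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue
            if stat.S_ISREG(info.st_mode):
                file_count += 1
                total_bytes += info.st_size
    return file_count, total_bytes
```

Calling summarize_tree on your own directory under one of the listed file systems (for example, a directory beneath /scratch/ibrix/chpc_gen; the per-user layout is an assumption here) returns the number of regular files and their combined size in bytes.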
