CHPC News Announcements
UPDATE on Major Campus Power Outage TONIGHT
Posted: May 15, 2013
Duration: Wednesday, May 15, 2013 at 11:59 pm to Thursday, May 16, 2013 at 6:30 am
As announced last week, a planned campus power outage will occur overnight, affecting the INSCC and SSB buildings.
JOB SCHEDULERS WILL BE PAUSED: Initially we announced that the clusters would not be impacted, but after reconsidering the potential impact to the equipment affected by the outage, we have decided to pause the schedulers on ALL clusters just before the outage starts. No new jobs will be started during the outage, but jobs already running will continue. The schedulers will be resumed once we receive notification that power has been restored.
REMINDER: CHPC recommends that ALL tenants of INSCC and any other buildings impacted by this outage shut down their desktops prior to leaving for the day on Wednesday, May 15th.
Summer school courses at the CHPC
Posted: May 14, 2013
As in the previous few years, CHPC will be hosting two courses from the Virtual School of Computational Science and Engineering, one in early July and one in late July. We encourage everyone interested in the topics covered to register and attend these courses. They provide a unique opportunity to attend classes taught by nationwide leaders in the field.
For more details, see the Summer School local webpage at http://www.chpc.utah.edu/docs/news/news_items/vscse-2013.php
Feel free to forward this information to anyone who may be interested. The Summer School is open to everyone, not just University of Utah affiliates.
Matlab upgraded to R2013a
Posted: May 14, 2013
We have upgraded Matlab to version R2013a on our Linux clusters and administered desktops. The major change you will notice from the previous default version, R2012a, is the redesigned GUI. Additional new features are listed at http://www.mathworks.com/help/relnotes/new-features.html.
If you encounter any problems, please let us know at email@example.com.
Major Campus Power Interruption, Little Impact on CHPC Services (Wed 5/15 11:59 pm until Thur 5/16 6:30 am)
Posted: May 10, 2013
Wednesday, May 15, 2013 at 11:59 pm to Thursday, May 16, 2013 at 6:30 am
The Campus has planned a power outage of several buildings and has announced that this outage is necessary to prevent further damage to equipment or a safety hazard to building occupants. Most CHPC services will not be impacted; however, power will be out in the INSCC and SSB buildings. This power outage will affect CHPC services as detailed below:
INSCC Building data center (and building):
- There will be no air conditioning in the INSCC data center room, and the UPS will not sustain the load for the entire outage. Any equipment housed in the INSCC data center must be shut down before 11:59 p.m. on Wednesday, May 15th.
- ALL tenants of the INSCC building should shut down their desktops prior to leaving for the day on Wednesday, May 15th.
SSB Building data center:
- The data center part of SSB is expected to ride through this outage on UPS/generator.
Other buildings affected:
- CHPC recommends that ALL tenants of impacted buildings shut down their desktops prior to leaving for the day on Wednesday, May 15th.
CHPC Summer Downtimes and Data Center Move Schedule
Posted: May 9, 2013
Many of you have heard about the modern, off-campus data center that the University has developed in downtown Salt Lake City. Over the past year, CHPC has been planning its move to the new facility, which will bring our community many benefits, including more stable electric power and significantly more expansion capacity for rack space and power. Nevertheless, the move will require some significant disruptions to CHPC services at times over the summer. We ask for your patience and flexibility as we go through this process. By remaining flexible, we believe we can minimize the duration of the downtimes. We will provide frequent updates through email and also our new Twitter feed (@CHPCUpdates).
Here is the anticipated general timeline of the significant steps and milestones in the DDC move process:
- Configure and test new switch in DDC
- Receive new "Kingspeak" cluster (see below for a description) hardware and begin provisioning (with the upgraded Red Hat Enterprise Linux 6 operating system – RH6) in DDC
- Receive and install new CI-WATER storage in DDC
- Receive and install new Sloan Sky Survey storage in DDC
- Prepare for June equipment moves
- May 31: Allocation proposals are due for Ember and Updraft (Updraft will only be allocated through 12/31/2013)
- Continue receiving and provisioning Kingspeak; begin staff testing, software builds on RH6 (including new batch system software), and early user access
- June 4th: Regular CHPC Major downtime: Ember, Updraft, and Sanddunearch down for Komas machine room maintenance as usual
- Move Atmospheric Sciences cluster (atmos, meteo, and wx nodes, except gl nodes) - expect an extended downtime of approximately 2 days for these servers beginning June 4th
- Move kachina.chpc.utah.edu and swasey.chpc.utah.edu - Expect extended downtime of 2 days
- Move phase I of VM Farm - No downtime expected
- Move Apexarch cluster and homerfs - Expect extended downtime of 2 days
- UCS Nodes and attached storage - Expect extended downtime of 2 days
- Batch system up - Kingspeak cluster will run in freecycle mode through October 1
- All users will be given access to the Kingspeak cluster in freecycle mode.
- Move Ember cluster - current downtime estimate is 3 +/- 1 weeks. This window will be more tightly specified based on move experience over the summer and more detailed work scheduling as this window approaches.
- August 31, 2013: Allocation requests are due for Kingspeak and Ember. No further allocations will be awarded on Updraft.
- September: Ember will be brought up under RH6 and the new batch system, and will run in freecycle mode through October 1.
Please note that we will not be moving the Sanddunearch and Updraft clusters to the DDC, but instead will run them in place until December 31, 2013 or thereabouts. These nearly end-of-life clusters will be retired, as the remodeling of the former Komas data center is scheduled to begin at that time. Also slated for retirement are the /scratch/serial, /scratch/uintah, and /scratch/general file systems. These /scratch systems will not be mounted on Kingspeak, or on Ember after it has been moved to the DDC.
Please relay any concerns about this planned work, particularly in regard to deadlines for conferences, grant proposals, and other impacts.
Kingspeak cluster details (general nodes):
- 32 nodes (16 cores each) - 512 cores total
- 2 interactive nodes
- 2.6 GHz clock speed with AVX support: 10.6 TFlops peak (without AVX: 5.3 TFlops)
- Note that not all codes will be able to take advantage of the AVX support as this feature is dependent upon how well the codes vectorize.
- Also note that the general nodes on Ember have a peak performance of 9 TFlops
- Infiniband interconnect
- New /scratch space of approximately 150 TBytes
CHPC now on Twitter!
Posted: May 8, 2013
In anticipation of the impact of the move to the downtown datacenter (more on this will follow in the coming days), CHPC is adding Twitter as a mechanism to disseminate information to our users. We have established two feeds: @CHPCOutages for information on both planned and unexpected outages, and @CHPCUpdates for all News items.
No Twitter account is needed and all information distributed in this manner will be redundant to information available on the CHPC website. You may bookmark the above webpages, follow the feeds if you are a Twitter user, or both. There will also be a link to these feeds on the CHPC main webpage.
The @CHPCOutages feed will be used to announce downtimes, both planned and emergency outages and hardware failures, and to provide updates. We will strive to post progress updates during planned downtimes so that users have the current status. For emergency outages and hardware failures we will strive to send out updates much more frequently, even if only to let users know that there is no change in status. This feed will also be used to update users on the status of the move to the new Downtown Datacenter, a process that will occur over the next several months and will require several disruptions in service. An announcement with our tentative timetable will be sent to users in the next few days.
The @CHPCUpdates feed will be used to distribute News and other Information about CHPC. This will include items such as CHPC presentations, short courses, new resources, and publications resulting from use of CHPC resources.
NOTE: Please do not use Twitter as a mechanism to report problems or ask questions. While we will monitor our feeds closely, we use our jira system to track questions and other issues needing our attention. We ask that you please continue to send such concerns as usual to firstname.lastname@example.org, or post directly on jira.chpc.utah.edu.
/scratch/ibrix back on-line and available for use
Posted: May 7, 2013
The maintenance on the /scratch/ibrix system is now complete. Users are now welcome to once again make use of both the /scratch/ibrix/chpc_gen and the /scratch/ibrix/icse file systems.
If you have issues with a batch job accessing this space, please send us an issue report that includes the job number. If you have issues accessing this file system from the interactive nodes or another location, please send us an issue report giving both the machine name and the time the access problem occurred.
UPDATE on /scratch/ibrix
Posted: May 6, 2013
The work on the /scratch/ibrix file system is progressing.
While users may notice that the /scratch/ibrix/chpc_gen is mounted, the file system is NOT ready for use. Please do not access this space until you receive a notification that it is ready for use - this most likely will not occur today.
Work is also continuing on the /scratch/ibrix/icse file system.
/scratch/ibrix DOWN for SERVICE
Posted: May 6, 2013
The /scratch/ibrix/chpc_gen and the /scratch/ibrix/icse file systems have been taken offline for service, as mentioned in the message sent on Thursday, May 2, 2013. An HP engineer is on site working to resolve issues that have been found on this file system.
At this time we have no estimate for the duration of this outage. During this outage the batch scheduler will remain active. Please do not submit jobs which use these scratch systems as they may hang the assigned nodes, making them unavailable for other jobs.
If you have any questions, please send them to email@example.com. We will send more information when it becomes available.
UPDATE on EMERGENCY OUTAGE of /scratch/ibrix
Posted: May 2, 2013
Here is an update:
/scratch/ibrix/chpc_gen has been brought back online, but it will be taken down again next week, most likely on Monday. /scratch/ibrix/icse will remain down. The batch queues have been restarted, and they will remain up even after /scratch/ibrix is taken back down next week. Please note that if you had a job running when the file system was taken down this morning, it may have hung and might die when the file system is remounted. Also, if you find a machine where the mount of this file system is missing, please send us an issue report.
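If you want to verify the mount on a particular machine before reporting it missing, one simple check (using the file system path quoted above) is:

```shell
# Report the file system backing this path; df prints an error and
# exits non-zero if the path does not exist (e.g. the mount is missing).
df -h /scratch/ibrix/chpc_gen
```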
On Monday an HP service engineer is expected to arrive to continue work on resolving the ongoing issues with the /scratch/ibrix file system. Unfortunately, we cannot estimate how long this will take; however, we expect it might take multiple days. At this point, HP engineering and CHPC staff have no indication that the data on /scratch/ibrix/icse is at risk.
When the engineer arrives, /scratch/ibrix/chpc_gen will be taken down again. It will most likely be down for multiple days. We suggest that users SELECTIVELY move data that they will need for the next few days from /scratch/ibrix/chpc_gen to other locations such as group file systems, /scratch/serial, /scratch/general, or home directories (listed in order of preference). Please keep in mind that these alternate locations are much smaller than the /scratch/ibrix/chpc_gen space, so we cannot have users moving all of their data off of it.
The batch queues will not be paused when work resumes on the /scratch/ibrix file system. Users can run jobs that use /scratch/ibrix/chpc_gen over the weekend, but any use of this file system must be complete by Monday at 8 a.m., as we will not be able to give advance notice of when the file system will be taken down. Any jobs that will still be running after Monday morning should not use this space, as the job will hang or die when the system is taken offline.
EMERGENCY OUTAGE of /scratch/ibrix file system - ADDITION
Posted: May 2, 2013
The batch schedulers on the clusters have been paused to keep new jobs from starting at this time.
Redbutte File Server Outage: All clear as of 12:48 p.m. (3/18/2013)
Posted: March 18, 2013
The reboot is complete and all systems are now in a healthy state.
As always, please let us know of any issues or problems by sending email to firstname.lastname@example.org
Redbutte File Server: brief Outage beginning now (approx 12:30 p.m. on 3/18/13)
Posted: March 18, 2013
After we applied some routine operating system updates, the redbutte file systems were left in a strange state that required a reboot. The reboot is in progress and we don't expect it to take very long. The outage should be completed within the hour (by 1:30 p.m.).
We will alert you when everything is back in a happy state.
Clusters back online after the downtime
Posted: March 12, 2013
The maintenance on the cooling systems in the Komas data center has been completed. The affected clusters (ember, updraft and sanddunearch) are back online, and the batch queues have been restarted.
Current information on Hardware Failure of Feb 25th and the Current Status of the Restoration
Posted: March 8, 2013
On Monday, February 25, CHPC experienced a major file system failure, which impacted the home directories for about 275 of our users. The initial report listed the groups involved. Below we provide additional information about this event and the current status of the file restoration.
CHPC is still working with the hardware vendor (HP) to determine the cause of the failure. So far we know that it was not a single failure, but a combination of failures in the controller and the disk. The analysis of the failure is ongoing.
Please note that the damaged equipment is not in service; the restorations are being performed on replacement hardware. Also, all restored files are coming from the backup tapes, not from the damaged hardware, so the integrity of the restored files has not been in question.
Here is an overview of the restoration process to date:
The majority of the users had their home directories back online by Saturday, March 2nd. These file systems were restored from the last full backup, which was started on Friday, February 22 and was still running when the disk failure occurred. In reviewing the logs of the restoration we noticed POTENTIALLY missing files in some of these home directories; these users were notified of this fact on March 2nd (more below).
There was a subset of 40 users whose home directories did not get restored at all on the initial attempt. Most of these cases have been traced to the directories not being captured in the last full backup (the weekend the failure occurred). All of these users were contacted over the weekend via e-mail. These users had a “new” home directory created with the standard CHPC dot files; the restores were done to a different location and rsynced over when done. These restores included everything up through the last incremental backup, started on February 21. This process was completed March 7th, and each user was notified when their home directory was finished.
As mentioned above, during this restoration process some files were flagged with a “did not restore” message in the logs, and we now have a list of these files. Most of the users impacted by the failure are not on this list. Some who are on this list were notified over the weekend that the file restore may not be complete; others noticed that files were missing and let us know. We are now focusing on restoring these files on a case-by-case basis. Anyone who discovers that they have missing file(s) should let CHPC staff know by opening an issue report.
Finally, for all involved, we are pulling and storing all tapes with the data lost by the hardware failure so that we will have them in case any user needs us to look for a missing file in the future. Backups will be continued starting this weekend on alternate tapes.
Komas Datacenter Downtime: Tuesday March 12, 2013 for critical service to the cooling tower
Posted: March 8, 2013
Event date: March 12, 2013
Duration: Clusters in Komas Datacenter will be down beginning at 7:00 a.m. until about 5:00 p.m.
Systems Affected/Downtime Timelines:
All clusters in the Komas Datacenter, including Ember, Updraft and Sanddunearch, will be down from 7 a.m. until about 5 p.m. Scratch space will remain up unless the temperature gets too high during this maintenance, at which point we will need to take these servers down as well.
The folks who service our cooling system (CMMS) in the Komas datacenter have notified us that the cooling system is in critical need of service, and they are very concerned about the current situation. It is believed that if we see temperatures above 60 to 65 degrees, we may have an outage of the cooling system.
We have set reservations to drain the queues on the clusters, and we expect to make it to Tuesday morning for a graceful shutdown. However, there is a chance that if the cooling fails in the meantime, we will have an emergency downtime before the scheduled time.
CMMS will take the coolers offline at 8 a.m. Tuesday morning, so we need to shut the clusters down at 7 a.m. They expect to be finished by 2:00 p.m., so we can begin bringing the clusters back up as soon as that work is complete. It usually takes 2-3 hours for the clusters to be up and made available to users.
Please let us know if you have any questions by sending email to email@example.com.
**CHPC unexpected outage**: Home directories for a subset of users, 2/25/2013
Posted: February 25, 2013
Duration: Estimate - sometime 2/28/2013, individual file systems may be available more quickly
Dear HPC Users,
Update: As of 4 p.m. 2/27, we are finding the estimates for the restore very difficult to make. We have come up with a way to make file systems available as they complete, and we will notify individual groups when their space is ready. At this point we expect the restore to continue running at least through the night and part way into the day tomorrow, 2/28.
Update: The restore continues to run. The best guess for having file services restored for those affected is sometime tomorrow morning - 2/27/2013. Thank you again for your patience.
The group home directories listed below are currently down, and will remain down for an extended period of time. We have had a disk failure, and the built-in redundancy measures we had in place also failed to work as expected. We are currently restoring from backup, but will not be able to bring anything back online until the restore has completed, which we expect to take a full day or more. We sincerely apologize for the inconvenience and will keep you posted as we progress and as time estimates improve.
CHPC Downtime has ended
Posted: January 15, 2013
The CHPC Downtime has ended. All of the clusters are back in service and running jobs. All of the home directories scheduled to be moved to the new file system have been moved. If you have any difficulty reaching your CHPC home directory or group space from your desktop, we request that you see if a reboot solves the problem before sending in an issue report. As always, please let us know if you have any issues accessing or using CHPC resources.
CHPC Major Downtime: Tuesday January 15th, 2013 beginning at 7:00 AM - Unknown
Posted: January 8, 2013
Event date: January 15, 2013
Duration: From 7 a.m. January 15th: Clusters down most of the day. Other services, see below.
Systems Affected/Downtime Timelines: During this downtime, maintenance will be performed in the datacenters, requiring many systems to be down most of the day. Tentative timeline:
- HPC Clusters: beginning at 7:00 a.m. lasting most of the day
- File Servers: CHPCFS will remain up. While redbutte will stay up, a number of groups will have outages while their home directories are migrated from oquirrh to redbutte. These particular spaces are:
- Network outages: No outage
- Virtual Machines: No outage
- Software License Server: 15 minute outage sometime between 8 and 10 a.m.
Instructions to User:
- Users involved in the home directory outages listed above should log off of any Linux systems mounting their home directories. Groups will be notified as their home directory moves are completed, and will be told the new path for any samba or cifs mounts. At this point it is recommended that you reboot your desktop prior to contacting CHPC with issues. After you have rebooted, please let us know of any issues by sending email: firstname.lastname@example.org.
- Please remember that /scratch is scrubbed of all files older than 60 days on a regular basis.
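To see which of your files would be candidates for the scrubber, a find invocation like the following works (the path is illustrative; point it at your own scratch directory):

```shell
# List regular files not modified in the last 60 days; these are the
# files the scrub policy would remove. Path is illustrative.
find /scratch/general/$USER -type f -mtime +60
```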
Emergency Cluster DOWNTIME: starting 1/8/2013 approximately 12:30 p.m.
Posted: January 8, 2013
The CHPC clusters went down approximately 12:30 p.m. (Tuesday, 1/8/2013) to protect the equipment due to a cooling problem in the Komas machine room. These clusters include ember, updraft and sanddunearch.
By about 2:30 p.m. the cooling issues were mitigated. Ice had built up and was diverting the water out of the cooling tower.
All clusters were back online and running jobs by 5:15 p.m. Please let us know if you see any issues by sending email to email@example.com.