2011 CHPC News Announcements

NEXT CHPC DOWNTIME - January 10, 2012

Posted: December 21, 2011

This is advance notice that the next CHPC quarterly downtime will be on Tuesday, January 10, 2012. This downtime is necessary for routine maintenance on the cooling system in the Komas Datacenter, which requires the clusters (Ember, Updraft, Telluride, Sanddunearch, Apexarch, and the meteo and atmos nodes) and the associated scratch file systems to be powered down. At this time, we do not expect to take the file servers down, meaning that once the networking updates are complete, most likely before noon, CHPC home directories will be accessible on desktops that mount this space. Additional details on the scope of the downtime will be posted as we get closer to the date.


Ember and Updraft service restored

Posted: December 19, 2011

Ember and Updraft are up and scheduling jobs now. Below is a summary of the problems and how your jobs may have been affected.
  • The root cause was a failing InfiniBand (IB) line card that brought the IB fabric down.
  • Moab was disabled on ember and updraft to prevent new jobs from starting, while jobs already running that did not depend on IBRIX storage were left to continue. Those jobs are expected to have run to completion.
  • IBRIX did not cope gracefully with the loss of the IB network (its core network) and ran into issues that prevented us from mounting one of the three file systems. That is why some users were unable to read data from IBRIX.
  • HP Support worked with CHPC to correct the issue preventing the mount of the third file system and to review the health of the rest of the IBRIX setup. Everything is now cleared.
  • No IBRIX file system corruption came up in this set of events.
If you notice any problems, let us know via issues@chpc.utah.edu.


Ember and Updraft update

Posted: December 19, 2011

Ember and Updraft scheduling is still disabled due to problems with the IBRIX storage servers. CHPC staff are working with the manufacturer's engineering support to rectify the problem. As of now, the time for restoration of services is still unknown; we will send an update once we have more information. All infrastructure that is independent of IBRIX, such as the sanddunearch cluster, is working as it should.


Updraft and Ember Service Restored

Posted: December 13, 2011

We have resolved the issues we were experiencing with our InfiniBand fabric. Updraft and Ember nodes are now back and responding.

Please let us know if you notice any issues.


Login problems on updraft and ember

Posted: December 13, 2011

We are currently experiencing some login/mounting problems on both the updraft and ember interactive nodes. CHPC systems staff are working on the problem. The interactive sanddunearch nodes seem to be fine at this time. We will advise when the issues are resolved. --CHPC Staff


Komas clusters restored

Posted: December 11, 2011

CHPC staff have restored the services affected by the unplanned power outage at the Komas data center this morning. All clusters are up and scheduling jobs; several compute nodes that did not recover have been offlined and will be looked at tomorrow. If you see any outstanding problems, please report them to issues@chpc.utah.edu.


Power issues at Komas

Posted: December 11, 2011

The Komas data center is experiencing power problems, which are rendering the clusters inaccessible. So far it appears only Komas is affected, so the home directory file servers are unaffected. We will send an update once we know more about the problems.


All Clusters except APEXARCH back online

Posted: November 20, 2011

Power has been restored to Komas and, after working through a number of issues, the clusters have been brought back online. CHPC staff are currently testing each of the clusters and are releasing the reservations so that jobs in the queue can start running. The one exception is Apexarch, which will not be brought back online until sometime tomorrow.


Details about upcoming Komas Datacenter Power Outage - Saturday November 19, 2011

Posted: November 17, 2011

Duration: estimated 8 hours starting at 6am

As mentioned in the announcement this morning, we have been notified that Rocky Mountain Power will cut power to the Komas Datacenter, where the CHPC compute clusters are housed, on Saturday November 19 at 8am. Rocky Mountain Power estimates their work will take about four hours. In order to prepare for this downtime, CHPC will power off the clusters housed in Komas (ember, updraft, telluride, sanddunearch, and apexarch, as well as the scratch file systems) starting at 6am that morning. Once power is restored these systems will be brought back online and users will be notified. Note that the home directory file servers are not housed in the Komas Datacenter, and as such they will not be affected by this outage. Reservations have been put in place to stop any new jobs from starting on the clusters if they will not finish before 6am on Saturday. Systems will be brought back once power is restored and stable.


Upcoming Power Outage in Komas Datacenter - Saturday November 19, 2011

Posted: November 17, 2011

Duration: 8am to Noon

We have just been notified that Rocky Mountain Power will be taking down the power to the Komas Datacenter, where the clusters are housed, on Saturday November 19 starting at 8am, in order to do repair/maintenance work. The power outage is expected to last about four hours. Users will be notified of the time that CHPC staff will take the clusters down as soon as that decision has been made.


MPI upgrades on CHPC clusters

Posted: November 11, 2011

We have upgraded OpenMPI and MVAPICH2 to their latest versions on all three CHPC clusters. For details on how to use them, please see the updated document: https://wiki.chpc.utah.edu/display/DOCS/MPI

One change from previous installations is that MVAPICH2 is now set up to use the Hydra process launcher, which means there is no need for mpdboot followed by mpiexec. Instead, use mpirun with -machinefile, similar to what OpenMPI does.

We have also extensively tested multi-threading within MVAPICH2 and OpenMPI. We found problems with OpenMPI in MPI_THREAD_MULTIPLE mode, so for threaded MPI applications that communicate from all threads, we recommend using MVAPICH2. There are also a few caveats on setting up a multi-threaded run; see the above-mentioned document for details.

It would be a good precaution to recompile codes that use MVAPICH2; the OpenMPI upgrade was more minor, so recompiling should not be needed. If you have any problems, submit a report via issues@chpc.utah.edu.
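
For reference, a minimal sketch of a batch script under the new MVAPICH2 setup (the module name, node counts, allocation account, and program name are placeholders; consult the MPI document above for the exact modules and versions on each cluster):

  #PBS -A myallocation
  #PBS -l nodes=2:ppn=12,walltime=1:00:00

  cd $PBS_O_WORKDIR

  # Load the upgraded MVAPICH2 build (name is illustrative).
  module load mvapich2

  # With the Hydra launcher there is no mpdboot step; launch directly with
  # mpirun and the node file supplied by the batch system.
  mpirun -machinefile $PBS_NODEFILE -np $(wc -l < $PBS_NODEFILE) ./my_mpi_program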


Allocation reminder letters: going green

Posted: November 10, 2011

Traditionally, CHPC has sent hardcopy letters to faculty when their CHPC allocation renewals are due. Starting with Winter 2012, we will notify PIs and their delegates via email only.

Let us know if you have any questions or concerns.


Matlab upgraded to R2011b

Posted: November 10, 2011

We have upgraded Matlab to version R2011b. For a list of changes, see http://www.mathworks.com/products/new_products/latest_features.html. If you are using the Distributed Computing Server, please make sure to replace R2011a with R2011b in your parallel configuration. If you encounter any problems, let us know via issues@chpc.utah.edu.


Major CHPC downtime: Tuesday October 18th, 2011 beginning at 6:00 a.m.

Posted: October 5, 2011

Event date: October 18, 2011

Duration: 8-10 Hours

Systems Affected/Downtime Timelines: During this downtime, maintenance will be performed in the datacenters, requiring many systems to be down most of the day. CHPC will take advantage of this downtime to do a number of additional tasks, including work on the network and file servers. Tentative timeline:

  • HPC Clusters: 6 a.m. - 5 p.m.
  • CHPC File Services: 6 a.m. - 9:30 a.m.
  • Intermittent network outages: 6:30 a.m. - 8:30 a.m.
  • Restoration of all systems *except* HPC clusters: 10:30 a.m.

Instructions to User:

Expect intermittent outages of the CHPC-supported networks until about 8:30 a.m.

All desktops mounting the CHPCFS file systems will be affected until approximately 10:30 a.m. Those with Windows and Mac desktops should be able to function, but may not have access to the CHPCFS file systems.

CHPC recommends that you reboot your desktops after the downtime.

All HPC Clusters will be down most of the day.


Sanddunearch to be phased off allocation control

Posted: September 21, 2011

Event date: October 1, 2011

Beginning with the next calendar quarter's allocations (October 1, 2011), sanddunearch (SDA) will begin to be phased off allocation control. We did not accept or award any new allocations for this period. All prior awards will be honored.

By the Summer calendar quarter of 2012, SDA will be completely off allocation control. This means that jobs will run with a quality of service of "freecycle". As a result, jobs will be scheduled in a FIFO-like manner (first in, first out) with some backfill (see the note below for more information on backfill).

If you have an existing SDA award, when your allocation comes up for renewal, you will need to make your request on either the updraft or ember clusters.

Note: Backfill means that the scheduler is smart enough to fill idle nodes with small jobs that fit, without delaying the start of the job that is holding a reservation, so scheduling will not be strictly FIFO. For example, if a job ahead of yours needs 100 nodes, it holds a reservation on nodes until 100 are free, so a set of nodes will most likely sit idle until the 100-node job can run. The scheduler looks at when running jobs will finish and can predict that your job (asking for only 8 nodes) will fit on those reserved nodes without delaying the 100-node job, so it will go ahead and run your job.
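
Once SDA is running under freecycle, a small, short request is the easiest kind of job for the scheduler to backfill; a minimal sketch (the node count, cores per node, and walltime are illustrative):

  #PBS -l nodes=8:ppn=4,walltime=2:00:00
  # A request this size is a good backfill candidate: the scheduler can slot it
  # onto nodes being held idle for a larger reservation without delaying that job.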


All Clusters back online and scheduling jobs

Posted: September 10, 2011

The CHPC systems team completed the process of bringing the clusters back online and scheduling jobs about an hour ago. As always, if you have any problems, please send a report to issues@chpc.utah.edu


POWER OUTAGES at Komas Datacenter

Posted: September 10, 2011

This afternoon, starting at about 4pm, there was a series of power bumps affecting the Komas Datacenter. According to Rocky Mountain Power, all is now stable; the air conditioning is back online and CHPC staff are starting to bring the equipment back up. A message will be sent when the clusters are back online.


www.chpc.utah.edu - Maintenance Complete

Posted: August 29, 2011

The maintenance on www.chpc.utah.edu is now complete. Please email us at issues@chpc.utah.edu if you notice any issues.


www.chpc.utah.edu Scheduled Outage - Monday, August 29, 2011 from 5:00 PM to 6:00 PM

Posted: August 29, 2011

Event date: August 29, 2011

We will be taking www.chpc.utah.edu down for about an hour this evening for scheduled maintenance. During the outage, any requests should be redirected to a reminder page. We'll send out an update when the maintenance is complete.

Our jira and wiki systems will be unaffected. Please send email to issues@chpc.utah.edu if you experience any unrelated issues during the maintenance window.


Proven Algorithmic Techniques for Many-core Processors

Posted: August 1, 2011

Event date: August 15, 2011

Duration: August 15 - 19, 2011, whole day, exact times TBD

CHPC is pleased to announce participation in the second course of this year's Summer School at the Virtual School of Computational Science and Engineering (www.vscse.org). This course will focus on heterogeneous programming on many-core processors, mainly GPUs, and will be taught by Prof. Wen-Mei W. Hwu from UIUC and David Kirk from NVIDIA.

For details on this course, please, see:
http://www.vscse.org/summerschool/2011/manycore.html

To register, go to:
https://hub.vscse.org/

and choose the University of Utah as your site. The registration form asks for a $100 fee to be collected later; we will not charge this fee. That means the course is free, and it is a great opportunity to hear leaders in a field that is becoming a mainstay of high performance computing.

Before the course, we recommend learning the basics of CUDA. NCSA is expected to provide an online course in late July; we will post it once it becomes available. Alternatively, NVIDIA has a wealth of resources at http://developer.nvidia.com/cuda-education-training

Because the course is free, we will only provide basic support and there will be no food or drinks during the breaks, so please plan accordingly. Also plan on bringing a laptop to work on the hands-on assignments.

We will announce the times and location closer to the date. Unless there is very large interest, the venue will again be INSCC 284, and the time will probably again be 8am to 4pm.


Globus Online Presentation: A Hosted Service For Secure, Reliable, High-Performance Data Movement

Posted: July 7, 2011

Event date: July 20, 2011

Speaker: Steve Tuecke, Deputy Director, Computation Institute, Argonne National Laboratory and The University of Chicago
Date: July 20, 2011
Location: INSCC Auditorium (RM 110)
Time: 1-2 pm

Abstract:

Research often requires transferring large amounts of data among widely distributed resources including supercomputers, instruments, Web portals, local servers, HPC clusters and laptops/desktops. Traditional methods such as FTP and SCP are ill-suited to data movement on this scale due to poor performance and reliability, and custom solutions are costly to develop and operate. Globus Online is a new Software-as-a-Service (SaaS) solution that provides a robust, reliable, secure and highly monitored environment for file transfers, with powerful yet easy-to-use interfaces. Globus Online simplifies large-scale data movement by automating transfer management, re-trying failed jobs and making it easy to track status and results without requiring construction of custom, end-to-end systems. Globus Online is being used by hundreds of researchers at dozens of facilities like CHPC to easily and securely get their data where it needs to go. According to one user, "Globus Online is the most beneficial grid technology I have ever seen. We moved over 700 GB of simulation output from a supercomputer in Tennessee to one in Texas in just 90 minutes. The same transfer would have taken over 3 days with scp."

In this presentation, we will introduce Globus Online, walk through the process of signing up, and show audience members how to use both the GUI and CLI interfaces to move data between two endpoints. We will also cover the Globus Connect feature, which allows users to transfer files between a GridFTP server (e.g., the one at CHPC) and their local servers or laptops, even if behind a firewall, without the complexity of a full Globus install. At the end of the tutorial, attendees will be able to move large volumes of data in and out of CHPC trivially.

Note: If you cannot attend in person please feel free to join us remotely http://anl.adobeconnect.com/go-chpc/


CHPC DOWNTIME has been completed

Posted: June 28, 2011

The CHPC downtime is now over. The clusters are up and running jobs. If you have any problems, please send a report to issues@chpc.utah.edu


CHPC Major Downtime: Tuesday June 28th, 2011 beginning at 7:00 AM

Posted: June 14, 2011

Event date: June 28, 2011

Duration: 8-10 Hours

Systems Affected/Downtime Timelines: Brief network outages and HPC Clusters

Outage of CHPC-supported networks: 7:00-8:00 a.m. All HPC clusters: 8:00 a.m. until late afternoon, possibly into the evening. Other than intermittent network connectivity between 7 and 8 a.m., we do not expect desktops to be affected by this downtime.


CHPC to Participate in World IPv6 Day June 8, 2011

Posted: May 31, 2011

Event date: June 7, 2011

June 8 is World IPv6 Day.

IPv6 is the newest version of the Internet Addressing Protocol. It is necessary because the previous version (IPv4) doesn't have enough addresses to represent all of the devices in the world.

We will be turning on IPv6 on our main web servers at 6 PM on June 7th and turning it off at 6 PM on June 8th.

You probably won't need to do anything to be able to keep using our web sites.

If you have any trouble, please email us at issues@chpc.utah.edu, call us at 801-585-3791, or come into our offices in INSCC Room 405.

We are participating in this test to help raise awareness of the need for IPv6 and to make sure our sites work correctly over IPv6.

CHPC does offer, and will continue to offer, IPv6 connectivity to its web server through the address www-v6.chpc.utah.edu. This address allows users to connect to our main web site over IPv6 on a regular basis. The CHPC cluster icon blinks to indicate a successful IPv6 connection.

More Details:

IPv6 is the exciting new evolution of the TCP/IP protocol stack that computers use to communicate with one another. IPv6 represents a huge leap forward in the number of addresses supported, as well as building in features such as IPsec security. Traditional IPv4 contains only about 4.3x10^9 (4.3 billion) addresses, while IPv6 supports approximately 3.4x10^38 (340 undecillion) unique addresses! This increase is important because the IPv4 address space was exhausted in February 2011. Every new telephone and mobile device now requires an IP address to communicate, so the move to IPv6 is imperative.

World IPv6 Day is being hosted by the Internet Society. The purpose of the event is to increase interest in the newest version of the Internet Addressing Protocol and to gauge the current status of IPv6-ready devices.

The Center for High Performance Computing (CHPC) will be participating. We currently offer IPv6 services on mirror.chpc.utah.edu (our mirror site), and have done so for a while without any issues. We will be offering IPv6 services on 3 additional web servers during the 24-hour period starting at 6:00 PM on June 7th and ending at 6:00 PM on June 8th (midnight to midnight June 8, UTC). The following 3 web servers will be part of the test: www.chpc.utah.edu (our main web site), jira.chpc.utah.edu (our problem-tracking system), and wiki.chpc.utah.edu (our wiki).

If you would like to test your IPv6 connectivity before June 8, here are a couple of pages to try. To test your IPv6 connectivity to our main web site, you can visit our IPv6 Diagnostics Page. For a more exhaustive test of your ability to use IPv6 in general, you can go to http://test-ipv6.com/.
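
If you prefer the command line, you can also check IPv6 reachability directly from a Linux or Mac desktop; a minimal sketch (assuming the ping6 and curl utilities are installed):

  # Verify that the IPv6-only hostname resolves and responds.
  ping6 -c 4 www-v6.chpc.utah.edu

  # Fetch the page headers over IPv6 only; a failure here suggests your desktop
  # or network does not yet have working IPv6 connectivity.
  curl -6 -I http://www-v6.chpc.utah.edu/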

For more information on the event (including a list of the participating organizations), please visit the World IPv6 Day Site.


Change in Batch policy for uintah and bigrun jobs on updraft - effective today (May 6, 2011)

Posted: May 6, 2011

Effective today, “uintah” and “bigrun” jobs on updraft will have a max walltime of 24 hours. This is a change from the previous max walltime of 12 hours for “uintah” jobs and 72 hours for “bigrun” jobs. Note that this does not change the policy for “general” jobs, which have had, and will continue to have, a max walltime of 24 hours. A full description of the batch policies can be found at http://www.chpc.utah.edu/docs/policies/batch.html
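
If you have existing scripts for these jobs, only the walltime request should need to change; a minimal sketch (the qos line is illustrative; see the batch policy page above for how uintah and bigrun jobs are actually selected on updraft):

  #PBS -l qos=bigrun
  #PBS -l walltime=24:00:00

  # Walltime requests for uintah and bigrun jobs must now be 24 hours or less
  # (previously up to 72 hours for bigrun and 12 hours for uintah).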


EMBER configuration changes: April 1st 2011 (no joke!)

Posted: March 24, 2011

Event date: April 1, 2011

Beginning April 1st, there will be some significant configuration changes made to the ember cluster. Most significantly, the nodes purchased with Phil Smith’s research funds (nicknamed “smithp” nodes) will run under a separate reservation in the system and will be limited to a max wall clock time of 24 hours (instead of the current 48-hour limit). Phil is still very happy to allow users to run preemptable work on these nodes, but you will need to specifically target them and will need to limit your wall time to the 24-hour maximum. The great news is that general users do not need to be out of allocation to use these nodes, and these hours will not count against your ember allocation.

General jobs with allocation will run in the same configuration on the 53 general nodes as they do today, except that we will increase the max wall time limit to 72 hours. Out-of-allocation (freecycle) jobs will also be limited to the 53 general nodes, will also have a max wall time of 72 hours, and will continue to be preemptable.

**AFTER APRIL 1st**

To submit jobs to Phil’s nodes you will need to request a different account. To do this, add the following line to your PBS scripts:

#PBS -A smithp-guest

NOTE: jobs will not be able to span the general and smithp nodes.
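
For reference, a minimal sketch of a complete script targeting the smithp nodes (the shell, node counts, and program name are placeholders):

  #!/bin/bash
  #PBS -A smithp-guest
  #PBS -l nodes=2:ppn=12,walltime=24:00:00

  cd $PBS_O_WORKDIR
  # 24 hours is the maximum wall time on the smithp nodes after April 1st.
  ./my_program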

Please let us know if you have questions about these or any CHPC policies by sending email to: issues@chpc.utah.edu.


CHPC Downtime - All Clusters Up and Scheduling Jobs

Posted: March 22, 2011

We have verified that all clusters are now up and scheduling jobs. If you see any issues, please email us at issues@chpc.utah.edu for assistance.


Maintenance complete on CHPCFS, the CHPC file server

Posted: March 22, 2011

Maintenance on the CHPC file server, CHPCFS, is complete (as of about 1 pm on 3/22/2011). Desktops mounting CHPC space should again have access. If you have trouble with this, please try rebooting your desktop. If the problem persists, please contact us at issues@chpc.utah.edu.

Work is continuing on other CHPC services. We will begin booting the HPC clusters as soon as the facilities maintenance in Komas is completed. We expect the clusters to be available late in the day.


Scratch space on the updraft cluster (/scratch/general and /scratch/uintah) down from about 2:30 p.m. until about 4 p.m. 3/16/2011

Posted: March 16, 2011

Duration: Less than 2 hours

The server managing these file systems kernel-panicked around 2:30 p.m.; it recovered the journal and came back up by about 4:00 p.m.


**CHPC Major Downtime CONFIRMED: Tuesday March 22, 2011 ALL DAY**

Posted: March 14, 2011

Event date: March 22, 2011

Duration: ALL DAY

Systems Affected/Downtime Timelines:

*** Note revised timelines ***

  • Outage of CHPC-supported networks: 6:30-7:00 a.m.
  • All HPC clusters: 7:00 a.m. until late afternoon, possibly into the evening.
  • All other CHPC-supported systems mounting CHPCFS file systems: beginning at 8:00 a.m. until early to mid-afternoon.

Instructions to User:

  • Expect a network outage from 6:30 a.m. until 7:00 a.m. for CHPC networks
  • All HPC Clusters will go down about 7:00 a.m. and will be down most of the day and possibly into the evening.
  • Beginning around 8:00 a.m. until early to mid-afternoon, all CHPC-supported systems mounting the CHPCFS file systems will be affected. Plan for CHPCFS to be unavailable for a good part of the day. Those with Windows and Mac desktops should be able to function, but may not have access to the CHPCFS file systems. CHPC recommends that you reboot your desktops after the downtime.

During this downtime, maintenance will be performed in the datacenters, requiring many systems to be down most of the day. CHPC will take advantage of this downtime to do a number of additional tasks, including work on the network and file servers.

All file systems served from CHPCFS will be unavailable for most of the day. This includes HPC home directory space as well as departmental file systems supported by CHPC. We will work to get things online as soon as possible.


Downtime for Updraft and Telluride completed at 9:30 a.m. on March 2nd, 2011

Posted: March 2, 2011

The updraft and telluride clusters are now available for use after CHPC upgraded the operating system from RedHat version 4 to RedHat version 5. Please recompile your programs with the new MPIs. CHPC recommends using OpenMPI. Please see http://www.chpc.utah.edu/docs/manuals/user_guides/updraft/#para for details.
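
A minimal sketch of rebuilding and launching a code against the new OpenMPI (the module command and name, compile flags, and program name are illustrative; follow the user guide above for the exact setup on updraft and telluride):

  # Load the OpenMPI build for the RedHat 5 environment and recompile.
  module load openmpi
  mpicc -O2 -o my_mpi_program my_mpi_program.c

  # Inside a batch job, launch with mpirun and the node file from the batch system.
  mpirun -np $(wc -l < $PBS_NODEFILE) -machinefile $PBS_NODEFILE ./my_mpi_program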

CHPC staff are still working through all of the supported applications testing and recompiling. If you have problems, please let us know by sending email to issues@chpc.utah.edu.


Cambridge Crystallographic Database System has been updated to 2011 version

Posted: February 18, 2011

The Cambridge Crystallographic software/database system (CSD) has been updated to the 2011 version (Conquest version 1.13). The only impact on users of the CSD is that the first time you access this new version you will be asked to provide the new confirmation code: 7E92A7

Please contact CHPC if you have any difficulties accessing the database.

Anita Orendt


UPDRAFT and TELLURIDE Downtime: beginning February 28th, 2011 - Until approximate end of day March 2nd, 2011

Posted: February 11, 2011

Duration: Three days

Systems Affected/Downtime Timelines: The updraft and telluride clusters will go down beginning at 8 a.m. on Monday, February 28th, 2011. We are planning for this downtime to last up to three days, until March 2nd, 2011.

CHPC will update the UPDRAFT and TELLURIDE clusters from RedHat version 4 to RedHat version 5. This will bring the operating system to the same version that is already running on the ember and sanddunearch clusters. We are planning for this to take approximately three days, but will return the systems to service as soon as the updates are complete.


Change in Ember Policies: starting Tuesday, January 11th, 2011

Posted: January 10, 2011

  • Freecycle jobs will be limited to 12 hours
  • Freecycle jobs will be preemptable (i.e., subject to termination)

Beginning January 11, 2011, there will be a change to the ember policies surrounding freecycle jobs (jobs not associated with an allocation, which are assigned the QOS "freecycle"). We will be fine-tuning all of the policies over the next few weeks as we evaluate the load and resource sharing on the ember cluster. Freecycle jobs may be terminated at any time, beginning tomorrow.


New CHPC Cluster (ember.chpc.utah.edu) Available for Use January 1, 2011

Posted: January 4, 2011

As of January 1st, 2011, CHPC's newest cluster, ember.chpc.utah.edu, is available for general use. As with all of the CHPC production clusters, your jobs will run at higher priority if you have been awarded an allocation on the system. Allocation requests for the next calendar quarter (4/1 - 6/30) are due March 1st, 2011.