2013 CHPC News Announcements

Outage to Migrate File Systems to New Hardware -- December 4, starting at 8 am

Posted: November 25, 2013

Duration: All day, starting at 8am

On December 4, 2013, starting at 8am, we will be doing the final migration of the file systems listed below off of the CHPCFS hardware that is being retired. In addition, we will be moving the ASTROFS file system to new hardware. While this move will not impact all CHPC users, it is CRITICAL that each user determine if they will be impacted and prepare accordingly.

To check which file system your home directory is in, do a “finger UNID” while in your home directory:

  • If your directory starts with “/uufs/astro.utah.edu” you are in ASTRO_HOME and will be affected
  • If your directory has “sdss_home” in its path you will be affected
  • If your directory starts with “/uufs/chpc.utah.edu” then do a “df | grep UNID” and if you see one of the file systems listed below you will be affected
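
For example, a quick way to run the checks above from a CHPC interactive node is sketched below (a minimal sketch; u0123456 is a placeholder for your own uNID):

    finger u0123456 | grep -i directory    # shows the path of your home directory
    df | grep u0123456                     # compare the output against the file systems listed below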

If you will be impacted, please be sure to prepare for this outage before 8am on December 4 by:

  • Logging off of the interactive nodes AND any desktops that mount any of these file systems
  • Making sure you have no jobs running or queued, as any job that starts or stops during this time will be affected

We expect to have all file systems listed below, with the exception of CHPC_HPC, available in their new location by around noon. CHPC_HPC will be unavailable most, if not all, of the day. An “all clear” message will be sent to users when the file systems become available.

Home directories on the following file systems will be migrated:

  • CHPC_HPC
  • CHPC_INSCC
  • BMI_HOME
  • ASTRO_HOME
  • GEO_TOMOFS
  • SDSS_HOME

The following group spaces will be migrated:

  • ASTRO_DATA1
  • ASTRO_DATA2
  • PHYS_Bolton_Data1
  • PHYS_Bolton_Data2
  • PHYS_Bolton_Data3
  • PHY_Springer
  • ASTROFS

If you have any questions, please contact us at issues@chpc.utah.edu.


Reminder - MPI workshop registration ends on Friday

Posted: November 20, 2013

Registration for the MPI workshop that will take place December 4-5 ends this Friday (11/22), due to the time needed to create the attendees' accounts and the upcoming Thanksgiving holiday.

If you are planning to attend this workshop and haven't registered yet, please do so in the next few days. We are willing to accommodate those who did not get time to register on a walk-in basis, but in that case you will not get an account on the machine where the hands-on assignments will be done.

To register, choose the University of Utah as a site on the XSEDE Portal Registration pages:
https://portal.xsede.org/course-calendar

Please visit the workshop page for more information:

https://www.psc.edu/index.php/training/xsede-hpc-workshop-december-2013


XSEDE HPC Workshop at the CHPC - December 4-5, 2013 - MPI

Posted: November 11, 2013

CHPC will be a satellite site for next month's XSEDE HPC two-day workshop covering MPI. The workshop is run by the Pittsburgh Supercomputing Center and the Texas Advanced Computing Center, and CHPC will provide an interactive telecast.

This workshop is intended to give C and Fortran programmers a hands-on introduction to MPI programming. Both days are compact, to accommodate multiple time zones, but packed with useful information and lab exercises. Attendees will leave with a working knowledge of how to write scalable codes using MPI, the standard programming tool of scalable parallel computing.

This workshop is NOT available via a webcast.

Please choose the University of Utah as a site on the XSEDE Portal Registration pages:
https://portal.xsede.org/course-calendar

The tentative agenda given below is subject to change.

Tuesday, December 3
All times given are MST
9:00 Welcome
9:15 Computing Environment
10:00 Intro to Parallel Computing
11:00 Lunch break
12:00 Introduction to MPI
1:30 Introductory Exercises
2:30 Scalable Programming: Laplace code
3:00 Adjourn/Laplace Exercises

Wednesday December 4
All times given are Eastern
9:00 Laplace Exercises
10:00 Laplace Solution
10:30 Lunch break
11:30 Advanced MPI
12:30 Outro to Parallel Computing
1:30 MPI Debugging and Profiling
2:30 Adjourn

Please visit the workshop page for more information:

https://www.psc.edu/index.php/training/xsede-hpc-workshop-december-2013


Ember and Kingspeak core service interruption

Posted: November 5, 2013

Around 3:30 p.m. this afternoon a core service machine that supports both kingspeak and ember experienced an outage. CHPC staff have recovered the service and all should be returned to normal over the next few hours. All nodes are being swept currently to make sure there are no problems. We will send an ALL CLEAR once we have verified all nodes.

ALL CLEAR: 4:26 p.m. Please report any questions or problems to issues@chpc.utah.edu


Retirement of turretarch1 (linux statistics box) - Dec 1, 2013

Posted: November 4, 2013

Effective Dec 1, 2013 CHPC will no longer be maintaining turretarch1.chpc.utah.edu, the linux statistics machine. There are a number of reasons for this decision.

  • This box has had very limited use since the start of the year
  • CHPC has in place a Windows statistics box, Kachina. Information on Kachina, the available software, and how to access it can be found in the Kachina User Guide
  • Linux versions of some of the packages available on Kachina are accessible on the CHPC clusters

If you need to discuss the impact of this decision on your research, please send in a request to issues@chpc.utah.edu.


OpenACC programming workshop

Posted: October 18, 2013

CHPC will be a satellite site for an XSEDE workshop presented by the Pittsburgh Supercomputing Center that focuses on OpenACC. OpenACC is the accepted standard of compiler directives that allows quick development of GPU-capable codes using standard languages and compilers. It has been used with great success to accelerate real applications within very short development periods. This workshop assumes knowledge of either C or Fortran programming. It will have a hands-on component.

The workshop will take place on November 5, 2013 in INSCC Auditorium (INSCC 110) from 9am to 3pm. To register, follow https://portal.xsede.org/course-calendar/-/training-user/class/152/session/269.

There is no registration fee for this workshop.

Please address any questions to the CHPC help desk at issues@chpc.utah.edu.

XSEDE HPC Monthly Workshop – OpenACC

Tuesday November 5
All times given are MDT

9:00 Welcome
9:15 Computing Environment
9:45 Parallel Computing & Accelerators
10:15 Intro to OpenACC
11:00 Lunch break
12:00 Introduction to OpenACC (cont.)
1:00 Using OpenACC with CUDA Libraries
2:30 Advanced OpenACC and OpenMP 4.0
3:00 Adjourn


Ember now available

Posted: October 17, 2013

Access to Ember is now available. Please take note of the information listed below before you start using the cluster. We consider these first few days of access a testing period, as it is impossible for CHPC to thoroughly test all possible running conditions. We will be monitoring the nodes and will offline any nodes where we see an issue. If you run into any problems, please send a report to issues@chpc.utah.edu.

Below is some very important information about the changes:

  • The cluster is now running RHEL6 and has new versions of the batch scheduler and usage accounting software; it is now the same as what is running on Kingspeak.
  • The batch policies should be identical to what they were before ember was relocated.
  • The node numbering has changed -- very important for those who ssh to the nodes. A new listing of the node numbers is available on this CHPC wiki page
  • SSH keys have changed. You will get messages and not be able to ssh to the nodes until you have either deleted your .ssh/known_hosts or edited this file to remove all entries for the ember interactive and compute nodes by searching for and removing keys for ember, ember*, em*, and 172.17.4*.* (see the example after this list)
  • You need to check if there is a new RH6 build of any other package/library (e.g., OpenFoam, python, netcdf) before running – these are indicated by a _rhel6 or _rh6 after the version number. While CHPC has tested many of the builds, we may not have caught all the ones that need to be rebuilt for the new OS --- please let us know of any we may have missed.
  • You need to check your codes to see if they need to be rebuilt. We strongly recommend that you no longer use any packages/libraries in the /uufs/arches/sys location; we will not be maintaining this file system moving forward. It is important that you use the RH6 version of the compilers (listed below):
    • PGI: /uufs/chpc.utah.edu/sys/pkg/pgi/std_rh6
    • GNU: /usr/bin/gcc (4.4.7) OR /uufs/chpc.utah.edu/sys/pkg/gcc/4.7.2_rh6 (also gfortran)
    • INTEL: there is no change for the path to the intel compilers
  • In the /uufs/ember.arches/sys/pkg application tree, the std, std_intel, and std_pgi links for MVAPICH2 and OPENMPI point to the RHEL6 versions
  • New ember-specific builds of FFTW, BOOST, AMBER12, and GROMACS can be found in /uufs/ember.arches/sys/pkg; the old ones have permissions set so they are not accessible/executable. New ember builds of QE and LAMMPS are being worked on.
  • For GPU node use – we have updated CUDA to 5.5 (/usr/local/cuda)
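
As an example of clearing the old ember host keys mentioned above, the following sketch uses ssh-keygen (hostnames are illustrative; repeat for any ember nodes recorded in your known_hosts):

    # remove stale entries for the ember interactive nodes from ~/.ssh/known_hosts
    ssh-keygen -R ember1.chpc.utah.edu
    ssh-keygen -R ember2.chpc.utah.edu
    # or set the whole file aside and let ssh repopulate it on your next connections
    mv ~/.ssh/known_hosts ~/.ssh/known_hosts.old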

New Allocation Usage Tracking pages Available

Posted: October 9, 2013

New General Allocation Pool Usage pages are now online

With the change to a general allocation pool instead of individual allocations on ember and kingspeak, new allocation usage tracking pages have been created.

Here are the links:

As a reminder, here are the corresponding links for the usage on updraft and sanddunearch:

Finally, for usage of prior quarters see Allocation Usage.

If you have any questions about these pages, please email issues@chpc.utah.edu


DOWNTIME -- Oct 14 starting at 6:30am

Posted: October 3, 2013

Duration: All day, starting at 6:30am

Overview: During this downtime the quarterly cooling maintenance on the Komas Datacenter will be performed, the Redbutte File Server will be relocated from the SSB Datacenter to the Downtown Datacenter, the /uufs/chpc.utah.edu/sys application file system will be moved to new hardware, and system updates will be applied to the HOMERFS file system.

Detailed Impact to Users:

  • Updraft and Sanddunearch clusters will have their batch queues drained by 6:30am in preparation for the clusters to be shut down for the cooling maintenance in the Komas Datacenter. They will be brought back online once the maintenance is complete.
  • Telluride, Apexarch and Turretarch clusters will have their batch queues drained by 6:30am in order to move the application file system to new hardware. These clusters will be brought back once this move has been completed.
  • The protected environment file system, HOMERFS, will have firmware updates applied. This means that Apexarch and SWASEY will be unavailable until this is completed and returned to service.
  • The Redbutte file server will be moved from SSB to the Downtown Datacenter. Our goal is to have this up before the end of the day (but it will most likely be late). While we do not expect any problems with this move, CHPC recommends that any critical data in the GROUP spaces, which are either not backed up or backed up only with quarterly archives, be moved or copied elsewhere as a precaution. Many groups have space on either the saltflat or drycreek file servers; another alternative for temporary storage would be one of the scratch file systems. The following groups are impacted:
    • HOME directory: Baron, Cheatham, Cliu, Garrett, Gregg, Horel, Jenkins, Jiang, Krueger, Lin, Mace, Paegle, Perry, Reichler, Smithp, Steele, Steenburgh, Strong, Whiteman, Yandell, Zhdanov, Zipser, and Zpu.
    • GROUP spaces: cliu-group1, garrett-group1, horel-group, krueger-group1, mace-group1, steenburgh-group1, lin-group1, reichler-group1, strong-group1, whiteman-group1, zpu-group1, cheatham-group1, cheatham-group2, avey-group1, baron-group1, gregg-group1, sandick-group1, steele-group1, stoll-group1, voelk-group1, and yandell-group1
  • Note that kingspeak (and ember if it is back in service by this time) will not have their queues drained or be shut down. HOWEVER, if you are in one of the groups with a HOME directory listed in the previous impact item, any running jobs and any interactive sessions will hang when the file server is turned off for the move. We recommend that users in these groups log out of any interactive sessions, and plan on having their running jobs exit before 6:30am on Oct 14. Any idle jobs in the batch queue should have batchholds placed on them.
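
As a sketch of how to check on and hold your jobs before the downtime (Moab/Torque commands as used on the CHPC clusters; the job ID is illustrative):

    showq -u $USER     # list your running and idle jobs
    qhold 1234567      # place a hold on an idle job; replace with your own job ID
    # after the downtime, release the hold with: qrls 1234567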

Reminder -- Allocation Changes starting Oct 1

Posted: September 30, 2013

There are two major changes regarding allocation usage with the start of the new quarter tomorrow:

First, Kingspeak will no longer be run in freecycle mode. Freecycle jobs can still run, but they will be preemptable.

Second, there has been a change in allocation policy. Instead of having individual allocations for Kingspeak and for Ember, there will be one combined pool of allocation that will be used on these two clusters. Ember jobs will be charged the number of core hours (1 SU = 1 CPU-core-wall-clock-hour), whereas Kingspeak jobs will be charged 1.5 times the number of core hours (1.5 SU = 1 CPU-core-wall-clock-hour), to account for the faster speed of the nodes of this cluster. Groups with existing Ember allocations will have the amount adjusted to conform with the new metric of 1 SU = 1 CPU-core-wall-clock-hour and these awards can be used on either Ember or Kingspeak.
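
As an illustration of the combined pool metric: a 16-core job that runs for 10 hours uses 160 core hours, which would be charged as 160 SUs if run on Ember or 240 SUs (1.5 x 160) if run on Kingspeak.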


/scratch/ibrix/chpc_gen and /scratch/ibrix/icse now back in service

Posted: September 30, 2013

The /scratch/ibrix/chpc_gen and /scratch/ibrix/icse file systems are now back in service after being moved to the Downtown Data Center. These file systems are now mounted on sanddunearch and updraft via ethernet, and on kingspeak via Infiniband.

If you encounter any problems using these file systems, please report to issues@chpc.utah.edu.


Informal autoconf discussion/tutorial

Posted: September 27, 2013

We will have an informal discussion about usage and caveats encountered when configuring and building packages that use the autoconf tool on Linux or a Mac. This discussion should be helpful for those who are trying to build software in their own user space. Much of the commonly used software, such as NetCDF, HDF5, and the MPI libraries, uses autoconf (configure). The plan is to go over the configure basics and then configure and build a few packages. If you have a particular package you'd like to build, let us know.
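
For those who have not used autoconf before, a typical build into your own space looks roughly like the sketch below (the package name and install path are illustrative):

    tar xzf mypackage-1.0.tar.gz && cd mypackage-1.0
    ./configure --prefix=$HOME/software/mypackage   # run ./configure --help to see package-specific options
    make
    make install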

We will meet on Wednesday October 2nd at 2pm in the CHPC Training Lab, INSCC 407. If you have any questions, please contact issues@chpc.utah.edu.


XSEDE Webinar: Writing a Successful XSEDE Allocation Proposal

Posted: September 23, 2013

For those of you considering applying for an XSEDE allocation, I recommend the webinar referenced below. XRAC proposals are being accepted until October 15th for the next selection period. For more information on XSEDE, please see: XSEDE Support Available through CHPC

Writing a Successful XSEDE Allocation Proposal 10/10

October 10, 2013 (Thursday)
1 p.m. to 3 p.m. (ET) OR

This short webinar will introduce users to the process of writing an XSEDE allocation proposal, and cover the elements that make a proposal successful. This webinar is recommended for users making the jump from a startup allocation to a research allocation, and is highly recommended for new campus champions.

Registration: https://www.xsede.org/web/xup/course-calendar


REMINDER - Ember cluster and /scratch/ibrix outage starting Sept. 24 at 8am

Posted: September 18, 2013

Ember, /scratch/ibrix/chpc_gen and /scratch/ibrix/icse will be taken down on Sept 24 starting at 8:00am to move this hardware from the Komas Data Center to the Downtown Data Center. Along with the move, the OS will be updated to RHEL 6 and new versions of the batch scheduler and resource manager, Moab and Torque, will be installed. There is a reservation in place to empty the Ember queue by this time.

The data on the /scratch/ibrix file servers will not be erased; however, please keep in mind that the servers will be physically moved, which can cause drive failures. Therefore it would be prudent to make sure any important data is copied off these two file servers before the outage.

For owner Ember nodes, both compute and interactive: note that any data on the local hard drives (/scratch/local) will be lost in this process. If you have any data in this location, you need to move it elsewhere before 8:00am Sept 24 (see the sketch below). If you need any assistance, please submit a request to issues@chpc.utah.edu.
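
A sketch of copying data off /scratch/local on an owner node (the destination is illustrative; group space may be more appropriate than your home directory):

    # run this on each owner node where you have data under /scratch/local
    rsync -av /scratch/local/$USER/ ~/ember_scratch_local_backup/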


Short kingspeak outage on Monday, 9/23

Posted: September 17, 2013

We will be migrating the applications file server on the Kingspeak cluster to new hardware on Monday, 9/23 starting at 11am. The migration is expected to take several hours. A reservation has been set on Kingspeak so that no jobs will be running in this time frame. Please ensure that your current jobs' walltimes are set so that they finish before 9/23 at 11am. Once the migration is over, any queued-up jobs will be allowed to run again.
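
To check whether your running jobs will finish in time, something like the following can be used (a sketch using the Moab showq command):

    showq -r -u $USER    # the REMAINING column shows how much walltime each of your running jobs has left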


Ember move postponed by a week

Posted: September 10, 2013

Event date: September 24, 2013

The move of the Ember cluster and its Ibrix scratch file systems to the Downtown Data Center, along with the OS upgrade, has been postponed at the request of a major user group; it will now start on September 24th at 8:00am.

The reservation to empty the Ember queue will be updated to reflect this date change.

Users that use or mount these /scratch/ibrix file systems from other locations (other clusters or desktops): please make sure you have moved any data you need off of these two file systems before this time.

For owner Ember nodes, both compute and interactive: Note that any data on the local hard drives (/scratch/local) will be lost in this process. If you have any data in this location, you need to move it elsewhere before 8:00am Sept 24. If you need any assistance, please submit a request to issues@chpc.utah.edu.


REMINDER - Ember and /scratch/ibrix Downtime Starts Sept 18 at 8:00am

Posted: September 9, 2013

Ember, /scratch/ibrix/chpc_gen and /scratch/ibrix/icse will be taken down on Sept 18 starting at 8:00am to move this hardware from Komas to the Downtown Data Center. Along with the move, the OS will be updated to RHEL 6 and the new versions of the batch scheduler and resource manager, Moab and Torque, will be installed.

There is a reservation in place to empty the Ember queue by this time.

Users that use/mount these /scratch/ibrix file spaces from other locations (other clusters or desktops) – please make sure you have any data you need moved off of these two file systems before this time.

For owner Ember nodes, both compute and interactive: Note that any data on the local hard drives (/scratch/local) will be lost in this process. If you have any data in this location, you need to move it elsewhere before 8:00am Sept 18. If you need any assistance, please submit a request to issues@chpc.utah.edu.


CHPC Leadership Announcement

Posted: September 4, 2013

After over twenty years of leading the Center for High Performance Computing (CHPC) at the University of Utah, Julio Facelli will step down as director of the Center, effective October 1, 2013, in order to focus on his administrative and research responsibilities in the Health Sciences Center. He will remain professor and vice chair in the department of Biomedical Informatics. In addition, he will continue to serve as director of the Biomedical Informatics core within the Center for Clinical and Translational Science (CCTS).

Under Prof. Facelli's leadership, CHPC has played a critical role in providing advanced computational services to a number of researchers across campus. Major research projects supported include the CCTS, the Institute for Clean and Secure Energy, the Flux network and systems research group in the School of Computing, the Utah Genome Project, and the Utah/Wyoming EPSCoR CI-WATER project. Notably, CHPC's reputation for service and its collaborative approach have assisted in attracting many leading computational science and engineering faculty members to the University. From 2003 to 2004, Prof. Facelli chaired the Coalition for Academic Scientific Computation (CASC), the national affinity group of over seventy campus high performance computing centers. He will continue his long affiliation with CHPC as a research user of its services and in a new role as senior scientific advisor.

Dr. Steve Corbato, the University's deputy chief information officer, will serve as the interim director of CHPC until the permanent director is named. He has an extensive background in experimental astrophysics, advanced networking, and research cyberinfrastructure. Dr. Corbato holds an adjunct faculty appointment in the School of Computing and also serves as an active NSF principal investigator, reviewer, and advisor. Concurrently, Corbato and Prof. Thomas Cheatham of Medicinal Chemistry are co-chairing a CHPC futures subcommittee of the Research Portfolio within the new campus information technology governance structure.


UPDRAFT cluster back in service!

Posted: September 3, 2013

The task of separating Ember from Updraft went smoothly (and very quickly!) and Updraft has been returned to service and is running jobs.

Due to the new IP addresses of this cluster, when you login for the first time you will get a ssh trust question. In addition, you will no longer be able to submit jobs to other clusters or query the status of jobs on other clusters from Updraft.

Please remember that the /scratch/ibrix/chpc_gen and the /scratch/ibrix/icse file systems are currently not mounted on Updraft. They will once again be mounted after the move of Ember and these two file systems to the Downtown Data Center has been completed. This move is scheduled to be started on September 18th.

As always, if you have any questions or run into any problems please send them to issues@chpc.utah.edu


REMINDER - Updraft downtime starts Tuesday Sept 3rd at 8:00AM

Posted: August 27, 2013

The first step in the process to prepare for the move of Ember and the two /scratch/ibrix file systems to the new datacenter will start on Tuesday, Sept 3rd at 8AM, when UPDRAFT is taken down in order to separate Ember from Updraft (they share a common infiniband infrastructure) and re-IP address Updraft. Our goal is to have Updraft back up by sometime on Friday, Sept 6th.

When Updraft is back, it will no longer have the two /scratch/ibrix file systems (/scratch/ibrix/chpc_gen and /scratch/ibrix/icse) mounted. Access to these two scratch systems will be restored on Updraft once Ember and the two /scratch/ibrix file systems have been moved to the downtown datacenter (a process scheduled to start Sept 18th and one that will take around 3 weeks). Updraft will still have access to /scratch/general, /scratch/uintah and /scratch/serial when it is back in service.

We do not anticipate any need for Ember to go down or for there to be any impact on running jobs, though we may pause the scheduler for brief periods of time.


Kingspeak Cluster Open to All Users in Freecycle Mode

Posted: August 26, 2013

CHPC has now opened the new Kingspeak cluster to all users. The cluster will operate in an un-allocated, freecycle manner until Oct 1, when the Fall quarter allocations go into effect.

A user guide can be found at: https://wiki.chpc.utah.edu/display/DOCS/Kingspeak+User+Guide

It is CRITICAL that all users get a new .tcshrc and .bashrc before they begin using this new cluster. The new versions can be obtained from the links on the main CHPC webpage, www.chpc.utah.edu:

http://www.chpc.utah.edu/docs/manuals/getting_started/code/chpc.tcshrc
http://www.chpc.utah.edu/docs/manuals/getting_started/code/chpc.bashrc
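
One way to fetch the new files from a login session is sketched below (back up your existing dot files first; the .orig names are just examples):

    cd ~
    cp .tcshrc .tcshrc.orig
    cp .bashrc .bashrc.orig
    wget http://www.chpc.utah.edu/docs/manuals/getting_started/code/chpc.tcshrc -O .tcshrc
    wget http://www.chpc.utah.edu/docs/manuals/getting_started/code/chpc.bashrc -O .bashrc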

Please send any questions or problems to issues@chpc.utah.edu


XSEDE Support Available through CHPC

Posted: August 23, 2013

XSEDE (xsede.org) is a national computing resource established by the NSF and available to researchers performing significant computational research across the United States. XSEDE comprises roughly a dozen dedicated resources at various affiliate institutions. The majority of these resources are built with cutting-edge technology, or are clusters with significantly greater numbers of cores than CHPC currently offers. Many XSEDE resources also offer storage services with fast I/O for researchers who have large datasets or may want to host databases with significant query loads.

The University of Utah Campus Champion is Julia Harrison. She and other members of the CHPC staff can assist you with access, allocations and usage of XSEDE resources. A little about the XSEDE service at CHPC is available on the CHPC Wiki.

Two events you may be interested in if you are new to XSEDE or interested in using XSEDE resources:

  1. XSEDE New User Training
    August 28, 2013 (Wednesday)
    1 p.m. to 2:30 p.m. (Eastern)

    XSEDE new user training is a 90 minute webinar providing general overview and reference information for first-time users of XSEDE resources at any of XSEDE’s service providers. This session is particularly targeted at users who have just received their first allocation on XSEDE. It is not intended to teach programming, numerical methods, or computational science, but rather to provide a quick tour of what XSEDE has to offer.
    Topics covered will include:
    • Overview of XSEDE resources and services
    • How to sign on to / access XSEDE systems and the Common User Environment
    • Moving data in and out of XSEDE
    • Basics of running jobs
    • The XSEDE User Portal
    • Training and documentation resources
    • How to get help
    • Extended and Collaborative support
    • Software availability
    • Allocations
    Significant time will be allotted for Q&A. This webcast is free, and open to all users or prospective users of XSEDE resources.
    Participants will receive instructions via email on how to access the webcast.
    https://portal.xsede.org/group/xup/course-calendar/-/training-user/class/149
  2. XSEDE Resource Support at CHPC
    September 10, 2013 (Tuesday)
    1 p.m. to 2:00 p.m.
    Location: INSCC Auditorium, room 110

    This is a one-hour presentation (part of the CHPC presentation series) which covers topics similar to those in the XSEDE New User webinar.


Change in HPC allocation process at CHPC beginning October 1st, 2013

Posted: August 21, 2013

In an effort to streamline and operate more efficiently, beginning with Fall 2013 (Oct-Dec), CHPC will be changing two aspects of our allocation system. First, we will be creating one allocation pool for both the EMBER cluster and the new KINGSPEAK cluster. Second, we will change the metric to 1 Service Unit (SU) = 1 CPU-core-wall-clock-hour on EMBER, and 1.5 SUs = 1 CPU-core-wall-clock-hour on KINGSPEAK.

Those renewing allocations (starting now) for Fall 2013 calendar quarter (and after) should submit their requests using the new metric. The allocation application form has been modified to reflect this change. These allocations are due September 9th, 2013.

We will be converting all previously awarded EMBER allocations to the new metric.

Note: Some codes may not achieve a 50% increase in performance on KINGSPEAK. You should benchmark your codes on both systems to see where they perform most efficiently. A smaller performance increase on KINGSPEAK may indicate sub-optimal vectorization of the code, in which case we recommend contacting CHPC about ways to improve code vectorization.

Previously awarded UPDRAFT allocations for Fall 2013 (Oct-Dec) will be honored (with no adjustment to the SU metric) as UPDRAFT awards. No further awards, however, will be made on UPDRAFT. Existing Spring 2014 (Jan-Mar) UPDRAFT awards will be converted to an award that can be used on either EMBER or KINGSPEAK using the new metric. After December 31, 2013 the general side of UPDRAFT will be run in freecycle until there is no longer space available in the Komas Data Center, or UPDRAFT is no longer cost effective to operate, whichever comes first.

We will continue to provide you with Allocation and Usage data (see: Usage) for both UPDRAFT and the new pooled allocation beginning October 1st 2013. Please let us know if you have any questions by sending email to issues@chpc.utah.edu.


CHPC Fall 2013 Presentation Schedule

Posted: August 21, 2013

All presentations are at 1:00 p.m. in the INSCC Auditorium unless otherwise specified.

Everyone is welcome to attend!

  • September 5th - Overview of CHPC: Wim Cardoen
  • September 10th - **NEW** XSEDE Resource Support at CHPC: Julia Harrison and Albert Lund
  • September 12th - Introduction to Parallel Computing: Martin Cuma
  • September 17th - Introductory Linux for HPC Part 1: Martin Cuma (1-3 p.m.)
  • September 19th - Introductory Linux for HPC Part 2: Martin Cuma (1-3 p.m.)
  • September 24th - Mathematical Libraries at CHPC: Martin Cuma
  • September 26th - Debugging with Totalview: Martin Cuma
  • September 26th - Protected Environment, AI and NLP Services: Sean Igo: HSEB 2908, 1:30-2:30 p.m.
  • October 3rd - Using Python for Scientific Computing: Wim Cardoen
  • October 8th - Chemistry Packages at CHPC: Anita Orendt
  • October 10th - Using Gaussian09 and Gaussview: Anita Orendt
  • October 22nd - Introduction to GPU Programming: Wim Cardoen
  • October 24th - Introduction to I/O in the HPC Environment: Brian Haymore and Sam Liston
  • November 7th - Introduction to programming with MPI: Martin Cuma
  • November 12th - Introduction to programming with Open MP: Martin Cuma
  • November 14th - Hybrid MPI-OpenMP Programming - Martin Cuma
  • November 19th - Fast Parallel I/O at CHPC - Martin Cuma
  • November 26th - Introduction to BRISC Service – Bernie LaSalle

Please visit http://www.chpc.utah.edu/docs/presentations/ for descriptions of these presentations.


Opportunity to purchase nodes on new CHPC cluster

Posted: August 16, 2013

Dell has pre-authorized CHPC through January 2014 to get pricing on additional nodes for the new kingspeak cluster at a great price of $5,800* per node. If you are interested in purchasing nodes for your research please contact issues@chpc.utah.edu indicating your interest and the quantity of nodes, in multiples of 4 for this pricing.

Each node has:

  • CPU: Intel Xeon (Sandy Bridge) E5-2670, 2.6 GHz (8 cores/socket, 2 sockets/node = 16 cores)
  • Memory: 64 GB 1600 MHz RDIMMs (4 GB/core)
  • Local disk: 500 GB for swap and /scratch/local
  • Interconnect: Mellanox FDR Infiniband (56 Gb/s)
  • 5-year warranty

If you need a different configuration (e.g. not in multiples of 4, more memory, more local disk, etc.) please contact us for specific quotes. We expect the pricing to still be good for these, but not the great deal above. Note that if we get enough requests and can bundle up a big order, we may be able to aggregate smaller node counts into multiples of 4 and get the pricing above.

*Node cost $5,250 plus $550 for cables and switch costs.


Upcoming Data Center Move Dates

Posted: August 9, 2013

As we are nearing completion of the new kingspeak cluster, we are finalizing dates for the move of ember and the /scratch/ibrix file systems from their current home in the Komas datacenter to the Downtown datacenter. As updraft and ember currently share an infiniband infrastructure, we first will need an outage to separate the two clusters and re-IP address updraft.

Here is the planned timeline:

Tuesday, Sept 3rd - updraft will be taken down. Our goal is to have it back up by sometime on Friday, Sept 5th. When updraft is back, it will no longer have the two /scratch/ibrix file systems mounted (access will be restored once the ember move - see below - has been completed). It will still have access to /scratch/general, /scratch/uintah and /scratch/serial. We do not anticipate any need for ember to go down.

Thursday, Sept 19th - Ember and the two ibrix file systems will be taken down and moved to the Downtown datacenter. The ember cluster will be moved to the RHEL 6 OS and also have the updated batch scheduler put in place. The plan is to bring up the file systems first as work is being completed on the cluster. We anticipate that ember will be down for 2-3 weeks.


CHPC Outage: beginning about 6:00 a.m. until just before Noon

Posted: August 5, 2013

At about 6 a.m. today, August 5th, CHPCFS reported a drive failure in tray 4. At approximately 8:20 a.m., a second drive was reported down in the same tray. When a tray has two drives down, it is considered a critical situation and it will not reboot.

CHPC staff were able to replace failed gear and bring all but one file system back online just before Noon today. CHPC will contact that remaining group in a separate email.


IMPORTANT NOTICE for PIs with EMBER ALLOCATION

Posted: July 1, 2013

In anticipation of the upcoming move of Ember to the new Downtown Datacenter, along with a batch system and OS update, currently scheduled for early to mid August, and the fact that the cluster will be unavailable for several weeks, it was decided that only half of the allocated amount would be deposited into the Ember allocation accounts for the current Summer 2013 quarter (which started today). This is a pro-rated allocation amount based on the time between the start of the quarter and the anticipated move date. If for any reason the move is delayed beyond the mid point of the quarter (Aug 15), we will deposit an additional portion of the allocation into the accounts. Once Ember is back in service, a pro-rated amount of allocation will be deposited into all accounts with an allocation, based on the remaining number of days in the quarter.

Also, remember that Ember will not be taken from service until after the new Kingspeak cluster is up and accessible to all users in freecycle mode.

As always, please let us know of any questions or concerns.


FY14 Budget Cuts at CHPC

Posted: June 27, 2013

Dear CHPC Users,

With the expected decline in federal research funding and the ensuing reduction in campus research overhead funds, CHPC’s FY14 budget has been reduced. Accordingly, CHPC has realigned its priorities along the University’s advanced research mission. Unfortunately, we must reduce our past levels of support for researcher computational environments (desktops) and advanced multimedia. We also have had to eliminate five positions from our staff.

The good news for the coming year is that CHPC will continue its focus on advanced research computing services - HPC, storage/Big Data, and virtualization/cloud services. In spite of the cuts, we will continue to invest in new computational capabilities this year to keep the clusters supporting the general allocation pool relevant. In addition, we are deploying a new cluster - kingspeak - right now.

Kingspeak cluster description:

  • 32 nodes (16 cores each) - 512 cores total
  • 2 interactive nodes
  • 2.6 GHz speed with AVX support: 10.6 TFlops maximum (without AVX: 5.3 TFlops)
  • Not all codes will be able to take advantage of the AVX support, as this feature is dependent upon how well the codes vectorize; for comparison, the general nodes on Ember run at a maximum of 9 TFlops
  • Infiniband interconnect
  • New /scratch space of approximately 150 TBytes
  • Expected to be online sometime this summer and under allocation control beginning October 1
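
(For reference, the peak figures correspond to 512 cores x 2.6 GHz x 8 double-precision floating-point operations per core per cycle with AVX, or 4 per cycle without.)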

The unfortunate news is that CHPC can no longer provide desktop and multimedia support in the same manner as we did in the past. We will work to recraft agreements with groups for whom we have provided such support. Please be patient with our staff as we regroup. We anticipate that desktop support requests will take longer to address.

If you have questions or concerns, please contact Steve Corbato, Julio Facelli, Guy Adams or myself. We appreciate your continued support of CHPC and look forward to continued collaboration to assist your research.


/scratch/ibrix/chpc_gen is 94% full

Posted: June 21, 2013

/scratch/ibrix/chpc_gen is currently 94% full. We ask that all users please check their usage of this file system and clean up as much space as possible in order to avoid issues with running jobs using this space.
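
A quick way to check and reduce your usage is sketched below (this assumes your files live under a directory named after your uNID; the path to remove is illustrative):

    du -sh /scratch/ibrix/chpc_gen/$USER                 # total space you are using
    du -sh /scratch/ibrix/chpc_gen/$USER/* | sort -h     # largest subdirectories last
    rm -rf /scratch/ibrix/chpc_gen/$USER/old_run_output  # remove data you no longer need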


Virtual School of Computational Science and Engineering summer classes at the CHPC

Posted: June 18, 2013

We would like to remind our users that CHPC will be hosting two courses from the Virtual School of Computational Science and Engineering, in early and late July. We encourage everyone interested in the topics covered to register and attend these courses. They provide a unique opportunity to attend classes taught by nationwide leaders in the field. There are still plenty of spaces available for both classes, which are free and open to everyone, not just University of Utah affiliates. Feel free to forward this information to anyone who may be interested. For more details, see the Summer School local webpage at http://www.chpc.utah.edu/docs/news/news_items/vscse-2013.php


CHPC wiki (wiki.chpc.utah.edu) will be down starting at Noon June 12th for updates

Posted: June 11, 2013

All Clear: Update was completed as of 2:07 p.m. June 12th, 2013.

We will be taking wiki.chpc.utah.edu down for a brief outage to update the Confluence software to a newer version. The process will begin at Noon tomorrow, June 12th, 2013.


All Clear: CHPC File Server Outage - CHPC_FS

Posted: June 7, 2013

The CHPC_FS file server had problems this morning which affected some home directories and web servers. We now have the all clear that it is up and happy again. If you mount a file system from CHPC_FS on your desktop, you should reboot your desktop before reporting issues to us.

The hardware which failed has been replaced and the file server is now up and functioning properly.

As usual, please report any issues to issues@chpc.utah.edu


CHPC DOWNTIME: All clear for tasks scheduled for June 4th, 2013

Posted: June 4, 2013

All clusters are online and scheduling jobs except apexarch (HIPAA) and turretarch (UCS).

A few other systems will also remain offline until tomorrow as scheduled.


UPDATE on DOWNTIME

Posted: June 4, 2013

The updates on the redbutte and drycreek file systems have been completed and they have been brought back online. All home and group directories affected should now be available. If you are not seeing the expected file systems from your desktop system, please reboot and see if that resolves the issue.

Work is continuing on the cooling maintenance at the Komas Datacenter. We will send another message when this has been completed and the clusters are back online.


DOWNTIME: June 4th and June 5th, 2013 - Updates and status

Posted: June 4, 2013

June 4th, 2013

7:00 a.m. Downtime Started:

  • HPC clusters downed (UP, EM, SDA, APEX, IBRIX, UCS, netapp)
  • meteo/atmos/wx nodes downed
  • allocation manager downed

7:45 a.m.

  • time2, time3, and first part of Phase I VM move powered down.
  • kachina, swasey, homerfs, pxe, bamboo powered down.

8:30 a.m.

  • Movers loading truck with equipment moving from SSB data center to DDC.

9:20 a.m.

  • CMSS arrived to begin maintenance work on HVAC system in Komas data center.

9:30 a.m.

  • Movers loading the last of the gear from SSB. Then heading up to Komas (for Apex Arch, UCS, and NetApp) before delivering to the DDC.

9:45 a.m.

  • Uplink to INSCC has been moved to new Arista switch
  • All ToR switches at DDC are now connected to the new Arista core
  • Routing interfaces for the following have been moved from Komas core to Router Core (UCS, Apex)
  • Routing interfaces for hidden arch have been moved from SSB to Router Core

11:00 a.m.

  • Movers leaving Komas for DDC

12:20 p.m.

  • Updates on Fileservers completed; directories are back online

1:00 p.m.

  • Physical move complete to DDC
  • Work on wiring continues

2:00 p.m.

  • Some service machines in DDC up and running (alloc)

3:00 p.m.

  • More service machines in DDC up and running (time1, time2, pxe)
  • trmm, wx, atmos up at DDC
  • CMSS work complete in Komas, starting to bring up clusters

4:00 p.m.

  • Telluride up and scheduling jobs
  • Kachina and Swasey up

6:30 p.m.

  • Updraft up and scheduling jobs

7 p.m.

  • Ember up and scheduling jobs

June 5th, 2013

12:45 p.m.

  • Apexarch and UCS servers up and scheduling jobs

Upcoming Downtime - Tuesday June 4, 2013

Posted: May 28, 2013

Duration: All day, starting at 6:30am

CHPC will have a downtime on June 4, 2013 starting at 6:30am to do maintenance on the cooling system in the Komas Datacenter. During this downtime we will also start the move of equipment from Komas and the SSB Datacenter to the new Downtown Datacenter (DDC), as was previously announced. Firmware updates on SOME file systems will also be completed (see below).

CLUSTERS --- During this downtime Ember, Updraft, Sanddunearch and Telluride will be down for most of the day. Reservations are in place to drain the batch queues by June 4th at 6:30am, so any job that will not finish before the start of the downtime will not be allowed to start until the downtime is finished. A notice will be sent out when the clusters have been brought back online and the batch queues restarted.

FILE SYSTEMS AFFECTED --- The redbutte and drycreek file systems will be taken offline at about 8:30am for firmware updates; a notice will be sent out once this work has been completed. This space includes:

  • HOME directory space for the following groups: All ATMOS faculty, Smithp, Cheatham, Baron, Steele, Yandell, Zhdanov and Gregg
  • The following group spaces: All ATMOS faculty group spaces (named piname-groupx, where x=1-3), baron-group1, steele-group1, yandell-group1, gregg-group1, cheatham-group1-3, chpc-vis1, avey-group1, sandick-group1, stoll-group1, voelk-group1, molinero-group1, arup-storage2, chpc-group1, bowen-group1, and smiskovic-group1

MOVE TO DDC --- The following equipment will be moved to the DDC during this downtime:

  • Atmospheric Sciences cluster (atmos, meteo, and wx nodes, except gl nodes) - Expect an extended downtime for these servers of approximately 2 days beginning June 4th at 7am
  • kachina.chpc.utah.edu and swasey.chpc.utah.edu - Expect extended downtime of 2 days
  • phase I of VM Farm - No downtime expected
  • Apexarch cluster and homerfs - Expect extended downtime of 2 days
  • UCS Nodes and attached storage - Expect extended downtime of 2 days


MPI libraries updates

Posted: May 27, 2013

We have updated the MPICH2 and MVAPICH2 libraries on CHPC clusters and administered desktops to versions 3.0.4 and 1.9 respectively. The main advantage of the new versions is support of most of the MPI 3.0 standard, including non-blocking collective communication.

There should not be any need for changes on the user's end (compilation or running), but if you encounter any problems please open a ticket at issues@chpc.utah.edu.


Upcoming Downtime - Tuesday June 4, 2013

Posted: May 24, 2013

Duration: All day

CHPC will have a downtime on June 4, 2013 starting at 6:30am to do maintenance on the cooling system in the Komas Datacenter. During this downtime we will also start the move of equipment from Komas and the SSB Datacenter to the new Downtown Datacenter (DDC), as was previously announced.

During this downtime Ember, Updraft, Sanddunearch and Telluride will be down for most of the day. Reservations are in place to drain the batch queues by June 4th at 6:30am. Work will also be done on a number of the file systems. Details on the specifics of these file system outages will be given next week.

The equipment that will be moved to the DDC during this downtime:

  • Atmospheric Sciences cluster (atmos, meteo, and wx nodes, except gl nodes) - Expect an extended downtime for these servers of approximately 2 days beginning June 4th at 7am
  • kachina.chpc.utah.edu and swasey.chpc.utah.edu - Expect extended downtime of 2 days
  • phase I of VM Farm - No downtime expected
  • Apexarch cluster and homerfs - Expect extended downtime of 2 days
  • UCS Nodes and attached storage - Expect extended downtime of 2 days

MPI workshop at CHPC

Posted: May 21, 2013

Duration: June 17 9am-3pm, June 18 9am-3pm

CHPC will be a satellite site for the Pittsburgh Supercomputing Center's two-day workshop focusing on MPI programming. This is an excellent opportunity to expand your MPI programming skills beyond the short presentations that we teach.

For more details and the schedule of the workshop, visit http://www.psc.edu/index.php/training/mpi-programming. The location of the MPI workshop at the University of Utah will be the INSCC Auditorium, INSCC 110. Local staff will be on site to address local questions, and we will also be able to ask the speakers questions via the webcast.

We encourage everyone, not just University of Utah affiliates, to attend this workshop. If you have any local questions, please, send them to issues@chpc.utah.edu.

Registration is available through the XSEDE site, portal.xsede.org/course-calendar/-/training-user/class/124. If you don't have an XSEDE portal account, you can create one for free. The MPI workshop is also free and open to the public.


UPDATE on Major Campus Power Outage TONIGHT

Posted: May 15, 2013

Duration: Wednesday, May 15, 2013 at 11:59 pm to Thursday, May 16, 2013 at 6:30 am

As announced last week, there is a Campus planned power outage that will occur overnight that will affect the INSCC and SSB buildings.

JOB SCHEDULERS WILL BE PAUSED: Initially we announced that the clusters would not be impacted, but after reconsidering the potential impact on the equipment affected by the outage, the decision was made to pause the schedulers on ALL clusters right before the outage starts. This means that no new jobs will be started during the outage, but jobs already running will continue. The schedulers will be resumed once we receive notification that the power has been restored.

REMINDER: CHPC recommends that ALL tenants of INSCC and any other buildings impacted by this outage shut down their desktops prior to leaving for the day on Wednesday May 15th.


Summer school courses at the CHPC

Posted: May 14, 2013

As in the previous few years, CHPC will be hosting two courses from the Virtual School of Computational Science and Engineering, in early and late July. We encourage everyone interested in the topics covered to register and attend these courses. They provide a unique opportunity to attend classes taught by nationwide leaders in the field.

For more details, see the Summer School local webpage at http://www.chpc.utah.edu/docs/news/news_items/vscse-2013.php

Feel free to forward this information to anyone who may be interested. The Summer School is open to everyone, not just University of Utah affiliates.


Matlab upgrade

Posted: May 14, 2013

We have upgraded Matlab on our Linux clusters and administered desktops to version R2013a. The major change you will notice from the previously default version R2012a is a change in the GUI interface. There are additional new features that are listed at http://www.mathworks.com/help/relnotes/new-features.html.

If you encounter any problems, please, let us know at issues@chpc.utah.edu.


Major Campus Power Interruption, Little Impact on CHPC Services (Wed 5/15 11:59 pm until Thur 5/16 6:30 am)

Posted: May 10, 2013

Wednesday, May 15, 2013 at 11:59 pm to Thursday, May 16, 2013 at 6:30 am

The Campus has planned a power outage of several buildings and has announced that this outage is necessary to prevent further damage to equipment or a safety hazard to building occupants. Most CHPC services will not be impacted; however, power will be out in the INSCC and SSB buildings. This power outage will affect CHPC services as detailed below:

INSCC Building data center (and building):

  • No air conditioning in the INSCC data center room and the UPS will not sustain load for the entire outage. Any equipment housed in INSCC data center must be shut down before 11:59 p.m. on Wednesday May 15th.
  • ALL tenants of the INSCC building should shut down their desktop prior to leaving for the day on Wednesday May 15th.

SSB Building data center:

  • The data center part of SSB is expected to ride through this outage on UPS/generator.

Other buildings affected:

  • CHPC recommends that ALL tenants of impacted buildings shut down their desktops prior to leaving for the day on Wednesday May 15th.

CHPC Summer Downtimes and Data Center Move Schedule

Posted: May 9, 2013

Many of you have heard about the modern, off-campus data center that the University has developed in downtown Salt Lake City. Over the past year, CHPC has been planning its move to the new facility, which will bring our community many benefits, including more stable electric power and significantly more expansion capacity for rack space and power. Nevertheless, the move will require some significant disruptions to CHPC services at times over the summer. We ask for your patience and flexibility as we go through this process. By remaining flexible, we believe we can minimize the duration of the downtimes. We will provide frequent updates through email and also our new Twitter feed (@CHPCUpdates).

Here is the anticipated general timeline of the significant steps and milestones in the DDC move process:

May 2013:

  • Configure and test new switch in DDC
  • Receive new "Kingspeak" cluster (see below for a description) hardware and begin provisioning (with the upgraded Red Hat Enterprise Linux 6 operating system – RH6) in DDC
  • Receive and install new CI-WATER storage in DDC
  • Receive and install new Sloan Sky Survey storage in DDC
  • Prepare for June equipment moves
  • May 31: Allocation proposals are due for Ember and Updraft (Updraft only will be allocated through 12/31/2013)

June 2013:

  • Continue receiving and provisioning Kingspeak; begin staff testing, software builds on RH6 (including new batch system software), and early user access
  • June 4th: Regular CHPC Major downtime: Ember, Updraft, and Sanddunearch down for Komas machine room maintenance as usual
    • Move Atmospheric Sciences cluster (atmos, meteo, and wx nodes, except gl nodes) - expect an extended downtime for these servers of approximately 2 days beginning June 4th
    • Move kachina.chpc.utah.edu and swasey.chpc.utah.edu - Expect extended downtime of 2 days
    • Move phase I of VM Farm - No downtime expected
    • Move of Apexarch cluster and homerfs – Expect extended downtime of 2 days
    • UCS Nodes and attached storage – Expect extended downtimes of 2 days

July 2013:

  • Batch system up - Kingspeak cluster will run in freecycle mode through October 1

August 2013:

  • All users will be given access to the Kingspeak cluster in freecycle mode.
  • Move Ember cluster - current downtime estimate is 3 +/- 1 weeks. This window will be more tightly specified based on move experience over the summer and more detailed work scheduling as this window approaches.
  • August 31, 2013: Allocation requests are due for Kingspeak and Ember. No further allocations will be awarded on Updraft.

September 2013:

  • Ember will be brought up under RH6 and under the new batch system and will run in freecycle through October 1.

Please note that we will not be moving the Sanddunearch and Updraft clusters to the DDC, but instead will run them in place until December 31, 2013 or thereabouts. These nearly end-of-life clusters will be retired as the remodeling of the former Komas data center is scheduled to begin at that time. Also slated for retirement are /scratch/serial, /scratch/uintah, and /scratch/general file systems. These /scratch systems will not be mounted on Kingspeak or on Ember after it has been moved to the DDC.

Please relay any concerns about this planned work, particularly in regard to deadlines for conferences and grant proposals and other impacts.

Kingspeak cluster details (general nodes):

  • 32 nodes (16 cores each) - 512 cores total
  • 2 interactive nodes
  • 2.6 GHz speed with AVX support: 10.6 TFlops max (without AVX: 5.3 TFlops)
    • Note that not all codes will be able to take advantage of the AVX support as this feature is dependent upon how well the codes vectorize.
    • Also note that the general nodes on Ember run at a max speed of 9 TFlops
  • Infiniband interconnect
  • New /scratch space of approximately 150 TBytes

CHPC now on Twitter!

Posted: May 8, 2013

In anticipation of the impact of the move to the downtown datacenter (more on this will follow in the coming days) CHPC is adding Twitter as a mechanism to disseminate information to our users. We have established two feeds: @CHPCOutages for information on both planned and unexpected outages and @CHPCUpdates for all News items.

No Twitter account is needed and all information distributed in this manner will be redundant to information available on the CHPC website. You may bookmark the above webpages, follow the feeds if you are a Twitter user, or both. There will also be a link to these feeds on the CHPC main webpage.

The @CHPCOutages feed will be used to announce downtimes (both planned and emergency outages and hardware failures) and to provide updates. We will strive to post updates on the progress of planned downtimes to better provide users with the current status. For emergency outages and hardware failures we will strive to send out updates much more frequently – even if it is to let users know that there is no change in status. This feed will also be used to update users on the status of the move to the new Downtown Datacenter – a process that will occur over the next several months and that will require several disruptions in service. An announcement with our tentative timetable will be sent to users in the next few days.

The @CHPCUpdates feed will be used to distribute News and other Information about CHPC. This will include items such as CHPC presentations, short courses, new resources, and publications resulting from use of CHPC resources.

NOTE: Please do not use Twitter as a mechanism to report problems or ask questions. While we will monitor our feeds closely, we will use our jira system to track questions and other issues needing our attention. We ask that you please continue to send such concerns as usual to issues@chpc.utah.edu, or post directly on jira.chpc.utah.edu.


/scratch/ibrix back on-line and available for use

Posted: May 7, 2013

The maintenance on the /scratch/ibrix system is now complete. Users are now welcome to once again make use of both the /scratch/ibrix/chpc_gen and the /scratch/ibrix/icse file systems.

If you have issues with a batch job accessing this space, please send us an issue report which includes the job number. If you have issues accessing this file system from the interactive nodes or another location, please send us an issue report giving both the machine name and the time the access problem occurred.


UPDATE on /scratch/ibrix

Posted: May 6, 2013

The work on the /scratch/ibrix file system is progressing.

While users may notice that the /scratch/ibrix/chpc_gen is mounted, the file system is NOT ready for use. Please do not access this space until you receive a notification that it is ready for use - this most likely will not occur today.

Work is also continuing on the /scratch/ibrix/icse file system.


/scratch/ibrix DOWN for SERVICE

Posted: May 6, 2013

The /scratch/ibrix/chpc_gen and the /scratch/ibrix/icse file systems have been taken offline for service, as was mentioned in the message sent on Thursday May 2, 2013. An HP engineer is on site to work to resolve issues that have been found on this file system.

At this time we have no estimate for the duration of this outage. During this outage the batch scheduler will remain active. Please do not submit jobs which use these scratch systems as they may hang the assigned nodes, making them unavailable for other jobs.

If you have any questions, please send them to issues@chpc.utah.edu. We will send more information when it becomes available.


UPDATE on EMERGENCY OUTAGE of /scratch/ibrix

Posted: May 2, 2013

Here is an update:

/scratch/ibrix/chpc_gen has been brought back online - but will be taken down again next week, most likely on Monday. /scratch/ibrix/icse will remain down. The batch queues have been restarted and they will remain up even after /scratch/ibrix is taken back down next week. Please note that if you had a job running when the file system was taken down this morning, it may have hung and might die when the file system is mounted. Also, if you find that there is a machine where the mount of this file system is missing, please send us an issue report.

On Monday an HP service engineer is expected to arrive to continue work to resolve the ongoing issues with the /scratch/ibrix file system. Unfortunately, we cannot estimate how long this will take; however, we expect it might take multiple days. At this point, HP engineering and CHPC staff have no indication that the data on /scratch/ibrix/icse is at risk.

When the engineer arrives, /scratch/ibrix/chpc_gen will be taken down again, most likely for multiple days. We suggest that users SELECTIVELY move the data they need for the next few days from /scratch/ibrix/chpc_gen to other locations such as group file systems, /scratch/serial, /scratch/general, or home directories (listed in order of preference). Please keep in mind that these alternate locations are much smaller than the /scratch/ibrix/chpc_gen space, so we cannot have users moving all of their data off of it.
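
As an illustration only, a selective copy of a single project directory to another location might look something like the following; the source and destination paths here are hypothetical placeholders, so substitute the directories you actually use:

  # copy only the files needed for the next few days, not the whole scratch area
  rsync -av /scratch/ibrix/chpc_gen/$USER/myproject/ \
        /path/to/your/group/space/myproject/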

The batch queues will not be paused when work resumes on the /scratch/ibrix file system. Users can run jobs that use /scratch/ibrix/chpc_gen over the weekend, but any use of this file system needs to be complete by 8 am Monday, as we will not be able to give advance notice of when the file system will be taken down. Any job that will still be running after Monday morning should not use this space, as that will cause the job to die or hang when the system is taken offline.


EMERGENCY OUTAGE of /scratch/ibrix file system - ADDITION

Posted: May 2, 2013

Duration: Unknown

The batch schedulers on the clusters have been paused to keep new jobs from starting at this time.


Redbutte File Server Outage: All clear as of 12:48 p.m. (3/18/2013)

Posted: March 18, 2013

The reboot is complete and the redbutte file systems are now in a healthy state.

As always, please let us know of any issues or problems by sending email to issues@chpc.utah.edu


Redbutte File Server: brief Outage beginning now (approx 12:30 p.m. on 3/18/13)

Posted: March 18, 2013

After applying some routine operating system updates, the redbutte file systems were left in a strange state that required us to reboot. The reboot is in progress and we don't expect it to take very long; the outage should be completed within the hour (by 1:30 p.m.)

We will alert you when everything is back in a happy state.


Clusters back online after the downtime

Posted: March 12, 2013

The maintenance on the cooling systems in the Komas data center has been completed. The affected clusters (ember, updraft and sanddunearch) are back online and the batch queues restarted.


Current information on Hardware Failure of Feb 25th and the Current Status of the Restoration

Posted: March 8, 2013

On Monday, February 25, CHPC experienced a major file system failure, which impacted the home directories for about 275 of our users. The initial report listed the groups involved. Below we provide additional information about this event and the current status of the file restoration.

CHPC is still working with the hardware vendor (HP) to determine the cause of the failure. So far we know that it was not a single failure, but a combination of failures in the controller and the disk. The analysis of the failure is ongoing.

Please note that the damaged equipment is not in service; the restorations are being performed on replacement hardware. Also, all restored files are coming from the backup tapes, not from the damaged hardware, so the integrity of the restored files has not been in question.

Here is an overview of the restoration process to date:

The majority of users had their home directories back online by Saturday, March 2nd. These file systems were restored from the last full backup tape, which was started on Friday, February 22 and was still running when the disk failure occurred. In reviewing the logs of the restoration, we noticed POTENTIALLY missing files in some of these home directories; the affected users were notified of this on March 2nd (more below).

There was a subset of 40 users whose home directories did not get restored at all on the initial attempt. Most of these cases have been traced to the directories not being included in the last full backup (the weekend the failure occurred). All of these users were contacted over the weekend via e-mail. These users had a "new" home directory created with the standard CHPC dot files; the restores were done to a different location and rsynced over when finished. These restores covered everything up through the last incremental backup, started on February 21. This process was completed on March 7th, and each user was notified when their home directory was ready.

As mentioned above, during this restoration process some files were flagged with a "did not restore" message in the logs, and we now have a list of these files. Most of the users impacted by the failure are not on this list. Some who are on the list were notified over the weekend that their file restore may not be complete; others noticed that files were missing and let us know. We are now focusing on restoring these files on a case-by-case basis. Anyone who discovers missing file(s) should let CHPC staff know by opening an issue report.

Finally, for all involved, we are pulling and storing all tapes containing the data lost in the hardware failure so that we will have them in case any user needs us to look for a missing file in the future. Backups will resume this weekend on alternate tapes.


Komas Datacenter Downtime: Tuesday March 12, 2013 for critical service to the cooling tower

Posted: March 8, 2013

Event date: March 12, 2013

Duration: Clusters in Komas Datacenter will be down beginning at 7:00 a.m. until about 5:00 p.m.

Systems Affected/Downtime Timelines:

All clusters in the Komas Datacenter, including Ember, Updraft and Sanddunearch, will be down from 7 a.m. until about 5 p.m. Scratch space will remain up unless the temperature gets too high during this maintenance, at which point we will need to shut down these servers as well.

The folks who service our cooling system (CMMS) in the Komas datacenter have notified us that the cooling system is in critical need of service, and they are very concerned about the current situation. It is believed that if we see temperatures above 60 to 65 degrees, we may have an outage of the cooling system.

We have set reservations to drain the queues on the clusters, and we expect to make it to Tuesday morning for a graceful shutdown. However, there is a chance that if the cooling fails in the meantime, we will have an emergency downtime before the scheduled time.

CMMS will take the coolers offline at 8 a.m. Tuesday morning, so we need to shut the clusters down at 7 a.m. They expect to be finished by 2:00 p.m., so we can begin bringing the clusters back up as soon as that work is complete. It usually takes 2-3 hours for the clusters to come up and be made available to users.

Please let us know if you have any questions by sending email to issues@chpc.utah.edu.


**CHPC unexpected outage**: Home directories for a subset of users, 2/25/2013

Posted: February 25, 2013

Duration: Estimate - sometime 2/28/2013, individual file systems may be available more quickly

Dear HPC Users,

Update: As of 4 p.m. 2/27, we are finding it very difficult to estimate when the restore will finish. We have come up with a way to make file systems available as they complete, and we will notify individual groups when their space is ready. At this point we expect the restore to continue running at least through the night and partway into the day tomorrow, 2/28.

Update: The restore continues to run. The best guess for having file services restored for those affected is sometime tomorrow morning - 2/27/2013. Thank you again for your patience.

The group home directories listed below are currently down and will remain down for an extended period of time. We have had a disk failure, and the built-in redundancy measures we had in place also failed to work as expected. We are currently restoring from backup, but will not be able to bring anything back online until the restore has completed, which we expect to take a full day or more. We sincerely apologize for the inconvenience and will keep you posted as we progress and as time estimates improve.

[baron-home]
[cheatham-home]
[cliu-home]
[garrett-home]
[gregg-home]
[horel-home]
[jenkins-home]
[jiang-home]
[krueger-home]
[lin-home]
[mace-home]
[paegle-home]
[perry-home]
[reichler-home]
[smithp-home]
[steele-home]
[steenburgh-home]
[strong-home]
[whiteman-home]
[yandell-home]
[zipser-home]
[zhdanov-home]
[zpu-home]


CHPC Downtime has ended

Posted: January 15, 2013

The CHPC downtime has ended. All of the clusters are back in service and running jobs. All of the home directories scheduled to be moved to the new file system have been moved. If you have any difficulty reaching your CHPC home directory or group space from your desktop, please see if a reboot solves the problem before sending in an issue report. As always, please let us know if you have any issues accessing or using CHPC resources.


CHPC Major Downtime: Tuesday January 15th, 2013 beginning at 7:00 AM - Unknown

Posted: January 8, 2013

Event date: January 15, 2013

Duration: From 7 a.m. January 15th: Clusters down most of the day. Other services, see below.

Systems Affected/Downtime Timelines: During this downtime, maintenance will be performed in the datacenters, requiring many systems to be down most of the day. Tentative timeline:

  • HPC Clusters: beginning at 7:00 a.m. and lasting most of the day
  • File Servers: CHPCFS will remain up. While redbutte will also stay up, a number of groups will have outages while their home directories are migrated from oquirrh to redbutte. These particular spaces are:
    • baron-home
    • cheatham-home
    • cliu-home
    • garrett-home
    • gregg-home
    • horel-home
    • jenkins-home
    • jiang-home
    • krueger-home
    • lin-home
    • mace-home
    • paegle-home
    • perry-home
    • reichler-home
    • smithp-home
    • steele-home
    • steenburgh-home
    • strong-home
    • whiteman-home
    • yandell-home
    • zhdanov-home
    • zipser-home
    • zpu-home
  • Network outages: No outage
  • Virtual Machines: No outage
  • Software License Server: 15 minute outage sometime between 8 and 10 a.m.

Instructions to User:

  • Users involved in the home directory outages listed above should log off of any Linux systems mounting their home directories. Groups will be notified as their home directory moves are completed and will be told the new path for any samba or cifs mounts. At this point it is recommended that you reboot your desktop prior to contacting CHPC with issues. After you have rebooted, please let us know of any issues by sending email to issues@chpc.utah.edu.
  • Please remember that /scratch is scrubbed of all files older than 60 days on a regular basis.
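
For reference, a rough way to see which of your scratch files are old enough to be scrubbed is a find command along these lines (the path is just an example; point it at the scratch directory you actually use):

  find /scratch/general/$USER -type f -mtime +60   # files not modified in the last 60 days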


Emergency Cluster DOWNTIME: starting 1/8/2013 approximately 12:30 p.m.

Posted: January 8, 2013

Duration: Unknown

The CHPC clusters went down at approximately 12:30 p.m. (Tuesday, 1/8/2013) to protect the equipment, due to a cooling problem in the Komas machine room. These clusters include ember, updraft and sanddunearch.

By about 2:30 p.m. the cooling issues were mitigated. Ice had built up and was diverting the water out of the cooling tower.

All clusters were back online and running jobs by 5:15 p.m. Please let us know if you see any issues by sending email to issues@chpc.utah.edu.