CHPC upgrade of cluster OS to CentOS7 - Center for High Performance Computing

Update 3/6/2017

The ash cluster will be out of service starting on Tuesday March 14, 2017 for some benchmarking runs followed by the upgrade of the OS to CentOS7.

As with the upgrade of the other clusters, there is now a reservation in place, so jobs that will not finish before this time will not start, and any idle jobs that remain in the queue will be lost during the upgrade. Any data local to the ash interactive and compute nodes (the /scratch/local file system on the individual nodes) will also be lost.

We expect that the nodes will be back in service possibly late in the day on Friday March 17 or early in the following week.

Update 3/3/2017

The kingspeak cluster is back in service with the exception of three interactive nodes: kingspeak4, kingspeak21, and kingspeak22.

Please remember that the ssh keys have changed, so you will need to either delete or edit the .ssh/known_hosts files to remove references to any kingspeak nodes.
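One way to remove the stale entries without editing the file by hand is ssh-keygen -R, which deletes all keys for a given host name from a known_hosts file and keeps a backup copy with an .old suffix. The sketch below wraps it in a loop. The function name forget_hosts and the host names in the usage comment are illustrative assumptions; substitute the names you actually use when connecting.

```shell
# Remove stale host keys for each named host from a known_hosts file.
# Usage: forget_hosts FILE HOST [HOST ...]
forget_hosts() {
    known_hosts_file=$1
    shift
    for host in "$@"; do
        # -R removes every key belonging to $host; -f selects the file to edit.
        ssh-keygen -R "$host" -f "$known_hosts_file"
    done
}

# Example (host names are placeholders; use the names you connect with):
# forget_hosts "$HOME/.ssh/known_hosts" kingspeak.chpc.utah.edu kingspeak1.chpc.utah.edu
```

ssh-keygen -R matches exact host names only, so each name or alias that appears in the file has to be listed separately.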

Refer to the earlier messages below for additional details about changes with the move to CentOS7.

Update 2/13/2017

In continuing the HPC cluster OS upgrade to CentOS7, we will take kingspeak, both the interactive and compute nodes, out of service on Wednesday March 1 at 9am.

There is a reservation in place to drain the batch queue of running jobs by this time. As with the other clusters that have already been upgraded, any idle jobs in the batch queue will be lost. Also, please save any work on the interactive nodes before 9am on March 1.
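Since a job that cannot finish before the reservation begins will not start, it can help to work out how many hours remain before submitting. The following is a minimal sketch of that arithmetic; the function name hours_until is hypothetical, and the date -d option in the usage comment assumes GNU date.

```shell
# Print the largest whole number of hours that fits between two times
# given as seconds since the epoch (0 if the start time has passed).
hours_until() {
    now=$1
    start=$2
    if [ "$start" -le "$now" ]; then
        echo 0
    else
        echo $(( (start - now) / 3600 ))
    fi
}

# Example with GNU date (the -d option is a GNU extension):
# hours_until "$(date +%s)" "$(date -d '2017-03-01 09:00' +%s)"
```

A job whose requested wall time is no larger than the printed value should still fit in front of the reservation.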

In addition, any data on the local hard drives of these nodes will be lost. If any group that owns nodes has data on those nodes that needs to be saved, please be certain to move it elsewhere before March 1.
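As a rough sketch of how node-local data might be saved, the function below copies a directory tree to another location and checks that the copy matches. The function name save_local_data and the paths in the usage comment are assumptions for illustration; choose a destination on a file system that is not local to the node.

```shell
# Copy a directory tree to a new location and confirm the copy matches.
# Usage: save_local_data SOURCE_DIR DEST_DIR
save_local_data() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -R copies recursively; -p preserves modes and timestamps.
    # The trailing "/." copies the contents, including dot files.
    cp -Rp "$src/." "$dest/" || return 1
    # diff -r exits non-zero if any file differs or is missing.
    diff -r "$src" "$dest" >/dev/null
}

# Example (both paths are placeholders):
# save_local_data "/scratch/local/$USER/results" "$HOME/saved_results"
```

The function returns a non-zero status if the copy or the comparison fails, so the source should only be considered safe to lose once it returns successfully.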

We anticipate that kingspeak will be returned to service early the following week (week of March 6th).

Update 1/20/2017

Both the ember cluster and the frisco nodes frisco1-7 are back in service.

Please remember that the ssh keys have changed, so you will need to either delete or edit the .ssh/known_hosts files to remove references to ember, ember1, ember2, emXXX, and frisco1-7.
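When the node names follow a pattern, as with emXXX, filtering the file by pattern can be easier than naming every host. The following is a minimal sketch under stated assumptions: XXX is taken to stand for digits, the function name drop_hosts_matching is hypothetical, and only plain (unhashed) known_hosts entries can be matched.

```shell
# Print known_hosts lines from stdin, omitting any line whose host field
# contains a name matching the extended regular expression in $1.
# Hashed entries (starting with "|1|") cannot be matched and are kept.
drop_hosts_matching() {
    awk -v re="$1" '
        {
            # The first field is a comma-separated list of host names.
            n = split($1, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] ~ re) next
            print
        }'
}

# Example: keep a backup, then rewrite the file without the old entries.
# cp "$HOME/.ssh/known_hosts" "$HOME/.ssh/known_hosts.bak"
# drop_hosts_matching '^(ember[0-9]*|em[0-9]+|frisco[1-7])([.]|$)' \
#     < "$HOME/.ssh/known_hosts.bak" > "$HOME/.ssh/known_hosts"
```

The pattern is anchored so that it matches a short name or a fully qualified name but leaves unrelated hosts, and frisco8, untouched.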

Refer to the earlier message below after the OS upgrade on lonepeak for additional details about changes with the move to CentOS7.

As always, please report any issues that you have with running on ember or the frisco nodes to issues@chpc.utah.edu

Additional notes on the frisco nodes:

1) All users should update to the latest FastX2 client, using the instructions on the FastX2 web page (https://www.chpc.utah.edu/documentation/software/fastx2.php). Please note that the older FastX version 1 clients will not work with the version of the FastX2 server that we are using.

2) All the frisco nodes now have graphics cards which enable the use of vglrun (before it was only frisco5-7). The FastX page has been updated to reflect this change.

Next Steps in the upgrade process – kingspeak and the atmos & meteo nodes – starting time still to be determined.

Next week we will meet internally to evaluate the dependencies in order to determine the timeline for the upgrade process on kingspeak. Once we have done this we will announce this date.

With the atmos and meteo nodes, we will be reaching out to the individual groups that own these nodes to make arrangements for when these nodes will have their OS upgraded.

Update 1/10/2017

The tangent cluster is back in service.

As with lonepeak, please note the ssh keys have changed, so you will need to either delete or edit the .ssh/known_hosts files to remove references to tangent or tpXXX. A change has also been made to address sporadic issues with the mounting of the /scratch/general/lustre file system on the tangent compute nodes, so if you run into problems with this, please let us know.

Refer to the earlier message below, posted after the OS upgrade on lonepeak, for additional details about changes with the move to CentOS7.

Next Steps in the upgrade process – ember and the frisco nodes:

On Tuesday Jan 17, starting at 9am, we will start the CentOS7 upgrade on the frisco nodes, frisco1-frisco7. Note that frisco8, already running CentOS7, was added a couple of weeks ago. Any user that has any jobs or open sessions on frisco1-7 should save and close them before this time.

At this same time, the upgrade will start on the ember cluster, both compute and interactive nodes. A reservation has been put in place to drain jobs by this time. Note that there may be some longer wall time jobs that will be interrupted by this reservation. On the interactive nodes, please save any work that may be running before this time.

Update 12/9/2016

The lonepeak cluster is back in service.

Please note that:

The nodes are now running CentOS7
The ssh keys have changed, so you will each need to either delete or edit the .ssh/known_hosts files to remove references to the old keys for these nodes (any reference to lonepeak or lpXXX)

We ask users to take the time to test their applications on lonepeak and to please let us know of any issues that may arise from the newer OS. Depending on the experience users have on lonepeak we will set a timetable to migrate the remaining clusters and resources to CentOS7. We expect that the upgrade of the remaining clusters will move more quickly than that of lonepeak.

Items to keep in mind as you start to use resources running CentOS7

Some older modules are "hidden"

To see them, one can use "module avail --show-hidden". Explicitly stating the module name and version will load the module (e.g. "module load intel/2015.1.133").
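Because a hidden module is loaded by giving its name and version explicitly, a small wrapper that insists on the name/version form can catch a missing version early. This is only a sketch; the function name load_pinned is hypothetical, while the module commands are the ones quoted above.

```shell
# Load a module only when an explicit version is given (name/version),
# the form described above for loading a hidden module.
load_pinned() {
    case $1 in
        ?*/?*) module load "$1" ;;
        *)     echo "load_pinned: expected name/version, got '$1'" >&2
               return 2 ;;
    esac
}

# module avail --show-hidden      # list hidden modules as well
# load_pinned intel/2015.1.133    # version string taken from the example above
```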

MPICH module name has changed from "mpich2" to "mpich" to reflect the official name change.
NetCDF environment variables that define library headers and paths have changed to NETCDFC_*, NETCDFF_* and NETCDFCXX_*. This is important for those programs that need NetCDF to build, e.g. WRF. This name change has been made for better naming consistency. Use module show netcdf to see the variables which are described in the header of the module.
Intel compiler stack uses 2017 version of compilers, PGI uses version 16.9, GCC system default uses 4.8.5.
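To check which of the renamed NetCDF variables are set in a session, the environment can be filtered by the three prefixes. The sketch below assumes nothing about the part of each name after the prefix, since that is not listed here; the function name show_netcdf_vars is hypothetical.

```shell
# Print every environment variable whose name starts with one of the
# NetCDF prefixes (NETCDFC_, NETCDFF_, NETCDFCXX_), sorted by name.
show_netcdf_vars() {
    env | grep -E '^NETCDF(C|F|CXX)_' | sort
}

# Typical use after loading the modules:
# module load netcdf-c netcdf-f
# show_netcdf_vars
```

Build scripts that still refer to variables under the old names can then be updated to the names this prints.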

Older Intel compilers installed on CentOS6 will still work for compiling (but we recommend not using them moving forward).
Older PGI compilers installed on CentOS6 will no longer work for compiling, although binaries built with PGI on CentOS6 should work by loading the CentOS7 PGI module.

Note that most binaries built on CentOS6 should run on CentOS7

We therefore recommend that you first try to run by loading the default compiler/mpi/library modules (e.g. for WRF, "module load pgi mpich netcdf-c netcdf-f"), as mentioned above.
If this does not work, load the modules with which the binary was compiled on CentOS6.
Then, if this still does not work, the program will need to be recompiled. If it is one of the applications CHPC staff installed, please notify us with a report to issues@chpc.utah.edu.
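The three steps above can be sketched as a small helper that tries a command under the default modules, then under the original CentOS6 modules, and reports when neither works. This is an illustration under stated assumptions: the function name run_with_fallback and the test command in the usage comment are hypothetical, and it assumes that clearing the environment with module purge before each attempt is acceptable.

```shell
# Try a command under each module set in turn; stop at the first success.
# Usage: run_with_fallback "DEFAULT MODULES" "ORIGINAL MODULES" COMMAND [ARGS ...]
run_with_fallback() {
    default_mods=$1
    original_mods=$2
    shift 2
    for mods in "$default_mods" "$original_mods"; do
        # Run in a subshell so a failed attempt leaves the calling
        # environment untouched. $mods is left unquoted on purpose so
        # that it splits into separate module names.
        if ( module purge && module load $mods && "$@" ); then
            echo "succeeded with modules: $mods"
            return 0
        fi
    done
    echo "failed with both module sets; the program likely needs recompiling" >&2
    return 1
}

# Example (the second module list and the test script are placeholders):
# run_with_fallback "pgi mpich netcdf-c netcdf-f" "MODULES_USED_ON_CENTOS6" ./short_test_run.sh
```

A short, representative test run is a better choice for the command than a full production job, since it may be executed twice.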

Next steps in the upgrade process

The next cluster to move will be tangent; we are looking at starting the migration of this cluster on Jan 4. This will be followed by ember, tentatively scheduled for Jan 11. Ember may take a bit longer, as it is the first cluster with InfiniBand and, along with the OS upgrade, the cluster will be assigned new IP addresses. Following ember will be kingspeak, at a date to be determined. We will work with the group that owns ash to determine its upgrade schedule.

Independent of the upgrade schedule on the clusters, we will start to schedule upgrades on the meteo, atmos, and frisco nodes in early January. We will do these in steps, each including a subset of each of these groups, so as to allow testing of applications without a total interruption of access. We do, however, encourage groups that make use of these resources to start testing on the lonepeak interactive nodes now and to work with CHPC staff to get their applications working before this transition is started.

Posted November 15th, 2016

CHPC is ready to start the migration of the clusters and interactive (login) nodes from CentOS6 to CentOS7. We will be doing the upgrade in stages, starting with lonepeak. At this time the upgrade will be completed on the lonepeak compute nodes as well as on the two general lonepeak interactive nodes.

The OS upgrade on lonepeak will start on Tuesday November 29 at noon. A reservation is in place to drain the batch queue by that time. Any idle jobs in the batch queue will be lost in the conversion.

We anticipate that the lonepeak nodes will be down for about two weeks. This amount of time will allow for the proper testing of applications on the new OS. However, note that this is only an estimate; the length of the down time is dependent on how the testing proceeds.

Once the lonepeak cluster is back in service, we request that users test their applications on CentOS7, and let us know of any issues.

Once the testing is underway on lonepeak we will announce the rest of the schedule; our tentative plans are to next upgrade the frisco, atmos, and meteo nodes (several at a time so that not all of one group is down at one time), followed by the other clusters, again one at a time.
