2006 CHPC Downtimes and History

CHPC Downtime: Thursday November 30th, 2006 5 p.m. until Midnight

Posted: November 21, 2006

Systems affected: All services will be down or intermittent. Reservations will be in place to prevent jobs from starting. Running jobs will continue to run. Network connectivity will be sporadic out of the INSCC building. Fileserv2 will be down.

Duration: Thursday November 30th, 2006 from 5:00 p.m. until Midnight.


CHPC Network Downtime: Intermittent Network Connectivity, October 26th, 2006 from 5:30 until 7:30 p.m.

Posted: October 3, 2006

updated: October 18, 2006

Systems affected: Network Connectivity. The arches queues will be suspended. Running jobs will continue to run, queued jobs will stay queued, but no jobs will be allowed to start until the network maintenance is finished.

Duration: October 26th, 2006 from 5:30 until 7:30 p.m.

Scope: There will be intermittent network connectivity. The CHPC link between SSB and INSCC will be upgraded from 2Gb to 20Gb.


CHPC Downtime: Komas Machine Room Downtime: Beginning September 7th, 2006 at 5:00 p.m. until about Midnight

Posted: August 29, 2006

Systems affected: All Arches Clusters, slickrock cluster, CHPC webserver and some fileservers.

Duration: Begins at 5:00 p.m. Thursday September 7th, 2006. We expect all systems back and functioning by Midnight.

Scope: The coolers in the Komas machine room will be serviced. Arches will be taken down to prevent overheating. CHPC will take advantage of this outage to do system maintenance.


Arches Clusters PVFS (/scratch/parallel) outage: Tuesday August 15th, 9 - 9:15 a.m.

Posted: August 11, 2006

Systems affected: The /scratch/parallel filesystem on all arches clusters.

Duration: 5 - 15 minutes

Scope: On Tuesday, August 15th at about 9:00 a.m., CHPC will be performing maintenance on the Arches clusters' PVFS filesystem (/scratch/parallel). Data from running jobs that access this space may be lost or corrupted by this maintenance. Please clear out any of your jobs that are using this space, and refrain from submitting new jobs that use this space until after the outage. We will let you know when the maintenance is complete.

We’re hoping that this fix will remedy a bug that several of you have run into recently. We expect the outage to be very brief (5-15 minutes).


Arches Clusters Downtime: Tuesday August 1st 10:15 a.m. Power Outage - Komas machine room.

Posted: August 1, 2006

Systems affected: All Arches Clusters, all systems and networking in Komas machine room.

Duration: Unknown.

Scope: At about 10:15 a.m. on Tuesday, August 1st, 2006, the power went out at the Komas machine room, taking down the Arches clusters and other equipment. The power company has been notified, but there is no estimate yet of how long the outage will last.


Arches Clusters Downtime: Wednesday July 19th at 1:00 p.m. through sometime July 20th
INSCC Desktops Downtime: Wednesday July 19th at 5:00 p.m. until about 9:00 p.m.
INSCC Network Downtime: Wednesday July 19th at 5:00 p.m. until about 7:00 p.m.

Posted: July 13, 2006

Systems affected: All Arches Clusters, INSCC Networking and Desktops

Duration: Arches will be taken down at 1:00 p.m. Wednesday, July 19th and will be brought back sometime the next day. INSCC Desktops access will be affected from 5 p.m. until approximately 9:00 p.m. INSCC networking will be affected from 5 p.m. until approximately 7:00 p.m.

Scope: Arches will be taken down to complete repairs on some of the nests and to perform software upgrades and maintenance. Maintenance on servers will affect desktops until approximately 9:00 p.m. Maintenance on networking equipment is expected to be completed by approximately 7:00 p.m.

Recommendation: CHPC recommends that INSCC tenants log off their desktops prior to leaving work on the 19th.


SSB power outage: Thursday July 6th, 10:15 a.m.

Posted: July 6, 2006

SSB lost power for a moment. Although the CHPC machine room has backup power, the backup failed for some reason, and all systems housed in SSB rebooted. This includes most file servers and some utility servers (DNS, ...).
The affected machines appeared to be back up by 10:30 a.m., but performance may be slower while the file servers perform disk checks.
We are not expecting many job failures, since jobs that attempted I/O during this time frame should have waited until the file servers came back up, but it is possible that some jobs failed.


INSCC Machine Room, Icebox Downtime: Monday March 27th, 2006 5pm-10pm.

Posted: March 23, 2006

Systems affected: Icebox, all networking connectivity to and from INSCC building.

Date: March 27th, 2006

Duration: 5pm-10pm

Scope: A new UPS system will be installed in the INSCC machine room, which will require a power outage. This outage will last from 5pm to 10pm. During this time, the main router for the INSCC, Komas and SSB machine rooms will be off-line. No connectivity will exist to the outside world or between internal networks. Since Icebox does its routing through the INSCC router, jobs that need routing during that time may fail. As a precaution, we have set a reservation on Icebox for the given time period.


Arches Clusters Downtime: RedHat Enterprise Linux Upgrade Monday March 27th, 2006 till later that week.

Posted: March 23, 2006

Systems affected: All Arches Clusters

Date: March 27th, 2006

Duration: To be determined, up to a whole work week.

Scope: We will be upgrading the rest of the Arches clusters to Red Hat Enterprise Linux. All of the Arches clusters will go down at the time specified above. This includes all scratch file servers. We will be reformatting all the scratch servers with the exception of /scratch/serial-pio, which means that all data on these servers (again excepting /scratch/serial-pio) will be deleted. Please move all your important files off the scratch file servers by Sunday, March 26th. We are planning to bring Landscapearch back online either later in the day on Monday, or on Tuesday morning after the INSCC machine room power outage (see the INSCC Machine Room, Icebox Downtime notice). The rest of the Arches clusters will be functional after the upgrade, which is anticipated to be later in the week.


Arches Clusters Unscheduled Downtime: UP&L power outage Tuesday March 7th, 2006. Arches was back online as of about 3:00 p.m. March 10th, 2006.

Posted: March 8, 2006

Updated March 9th, 2006
Updated March 10, 2006

Systems affected: All Arches Clusters

Date: Late in the evening Tuesday March 7th, 2006

Duration: Until about 3:00 p.m. Friday, March 10th, 2006

Scope: Late Tuesday night, March 7th, the Komas data center, which houses the Arches clusters, experienced a power failure. UP&L has moved the schedule to complete repairs several times, and as of Friday, the best information we have is that they will have another power outage sometime next week (3-13 to 3-17). CHPC brought things back online the afternoon of Friday, March 10th. The Arches clusters were available to users by about 3:00 p.m. We will let you know of any change to this plan. Thanks for your continued patience.


Arches Clusters Down: Monday March 6th 5:00 p.m.

Posted: March 3, 2006

Systems affected: All Arches Clusters

Date: Monday March 6th, 2006

Duration: 5:00 p.m. until Tuesday morning

Scope: Komas Machine room maintenance required to clean heat exchangers.


CHPC Downtime: Landscapearch and /scratch/serial-pio, Monday March 6th, 2006 from 8:00 a.m. for about a week

Posted: February 15, 2006

Systems affected: Landscapearch and /scratch/serial-pio

Date: Monday March 6th, 2006

Duration: About a week.

Details:

Over the next several months CHPC will be migrating the Arches Opteron Clusters to RedHat Enterprise Linux v4. We are planning to do this migration in phases, but there will be some significant downtimes and other information you will need for planning purposes. This migration will improve our ability to maintain the clusters and is also a step in moving toward a “grid” model.

One of the biggest impacts will be that all /scratch space currently mounted on the Arches clusters will be scrubbed (all files removed) during the different stages. We ask that you start migrating any data you want to save off of this space now. CHPC will be pruning this space over the next several weeks.

Cleanup of /scratch filesystems

2/16/2006: All files older than 2 months in /scratch will be purged
2/23/2006: All files older than 1 month in /scratch will be purged
3/2/2006: All files older than two weeks in /scratch will be purged
** Every Thursday thereafter, the cleanup script will remove files older than two weeks
3/6/2006: *ALL* files in /scratch/serial-pio will be purged. This will coincide with Phase I of the migration (see below). All files you wish to keep from that space will need to be moved by Friday, 3/3/2006.
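To see which of your files the age-based purges above would reach, something like the following Python sketch may help. It assumes the cleanup script judges age by modification time, which this notice does not state. The function name files_older_than is illustrative (not a CHPC tool), and the directory in the usage lines is a hypothetical placeholder.

```python
import os
import time


def files_older_than(root, days, now=None):
    """Return sorted paths under root last modified more than `days` days ago.

    Assumption: age is measured by modification time (st_mtime); the real
    cleanup script may use a different criterion.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                old.append(path)
    return sorted(old)


if __name__ == "__main__":
    # Hypothetical per-user directory; substitute your own path.
    for path in files_older_than("/scratch/serial/your_username", 14):
        print(path)
```

Running it with 14 days corresponds to the weekly two-week purge; 30 or 60 days approximates the earlier one-month and two-month passes.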

Phase I: March 6th, 2006

This will be a downtime of approximately 1 week for the landscapearch cluster only, beginning the morning of March 6th, 2006. During this downtime the interactive nodes, compute nodes and necessary administrative systems will be brought up under the new RH kernel. Users running on landscapearch will need to re-compile their programs. CHPC staff are already working on the application builds and we expect everything to be in place before this deadline.

At this same time, we will scrub /scratch/serial-pio of all files and bring it up under a new kernel as well. After the downtime, you will need to edit (or remove) your ~/.ssh/known_hosts file so it can create new entries for the landscapearch interactive nodes.
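Stale entries can be cleared with ssh-keygen -R followed by the hostname, on OpenSSH versions that support that option, or by hand. As a rough illustration of the by-hand edit, the Python sketch below drops plain (unhashed) entries for the named hosts from known_hosts text. The function name and the hostname in the usage note are hypothetical; this notice does not list the interactive node names.

```python
def drop_host_entries(known_hosts_text, hostnames):
    """Return known_hosts text without plain entries for the given hosts.

    Limitations of this sketch: hashed entries (starting with "|1|") and
    "[host]:port" forms are not matched, and a line listing several hosts
    is dropped whole if any one of them matches.
    """
    targets = set(hostnames)
    kept = []
    for line in known_hosts_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)  # keep blank lines and comments as they are
            continue
        fields = stripped.split()
        # An optional marker such as @cert-authority precedes the host field.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        if set(host_field.split(",")) & targets:
            continue  # stale entry; ssh records the new key on next login
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

For example, calling drop_host_entries(text, ["landscapearch.example.edu"]) on the contents of ~/.ssh/known_hosts would return the text without that (made-up) host's line. Back up the file before writing the result over it.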

As this time approaches we will send you more information and details of how to plan your work.

Phases II and III

The dates for these phases have not yet been confirmed, but we are planning for late March or early April for Phase II. Phase III will follow a few weeks after that. During these two phases the rest of the Arches clusters (marchingmen, delicatearch and tunnelarch) and the supporting administrative systems will be migrated to the new operating system. The duration for each of these phases is expected to be approximately 1 week. The extent of these downtimes will depend upon our experience with landscapearch, so we will not detail it now. However, please remember that during these two phases the rest of the /scratch filesystems (serial, da, mm, parallel) will be scrubbed completely, so we want to emphasize that you need to get all important data off of /scratch over the next month or two.

Please let us know if you have any questions or concerns.
