August 4th update:
Tangent update #2:
The issues with nodes not completing the configuration process continue. No jobs have started successfully since yesterday, so we have placed a reservation on the nodes to stop jobs from trying to start. We are working with the FLUX group to resolve the current issue and to test before returning the nodes to service.
We suggest that tangent users run their jobs on the other available resources until we send out a message that tangent is ready for use again.
July 28th update:
The FLUX group, which administers the servers used for the tangent compute nodes, is currently making changes. Details about the project that provides this resource are in the tangent user’s guide at https://www.chpc.utah.edu/documentation/guides/tangent.php. These changes are part of the dynamic nature of this project, and we anticipate that there may be additional inconsistencies in the behavior of the tangent batch system in the near future.
We are in communication with this group about the changes and will continue to work with them as they make further changes, both to evaluate the impact and provide feedback and to adapt to the changes.
We have attempted to address the configuration issue mentioned in the previous message to mitigate the impact as much as possible, and we will continue to make adjustments when possible.
If you notice additional problems, please report them to firstname.lastname@example.org.
Original Post September 26th:
Jobs on tangent are not starting. They try to start, but the nodes do not complete the configuration step, and after about 25 minutes the jobs exit from the queue. We are working to resolve the issue and will post an update when we know more.