ITS Service Status Report

Service Degradation — Resolved

Great Lakes IME Burst Buffer Maintenance

Services Affected: Research HPC

Start Time: 05/05/2021 8:30 am

Service Restored: 05/05/2021 11:17 am

Issue Symptoms: Degradation

The IME Burst Buffer on Great Lakes is unavailable. Access to Great Lakes and jobs that do not use IME are not impacted.

Users who had data on IME should send a request to arcts-support@umich.edu to have their IME data synced to Scratch.

IME will remain unavailable until an emergency software update is scheduled with the vendor.

We are planning an update of the system to a new version of the software on 3/25. 

Update 5/4: We are planning to execute this update at 8:30 am Wednesday 5/5.

Who is Impacted? Users of Great Lakes

Next Update: At completion of emergency change 3/25

Technical Details

Service Type: Production

Server Name: Great Lakes

Comments:

A new version of the software was tested the week of April 26th and found to be stable. We will update here when a rollout to full production has been scheduled and the software has been tested at scale.

The update scheduled for 3/25 did not resolve all issues. The system works at small scale, but with larger jobs (16+ nodes) it still locks up and evicts server nodes, making it unusable. IME has been removed from Great Lakes production nodes again, and new data has been uploaded to the vendor; the issue remains open.

An emergency change request is scheduled for 3/25.

IME use is causing Great Lakes compute nodes to lock up until IME is disabled. This appears to be a bug in the version of IME we updated to in January, and it has now caused a few small outages.

We have an open issue with the vendor, and they have a fix. Applying that fix with their support has not been scheduled at this time.

IME has a small, specialized user base, and we will keep IME up on a test system to service requests for data in IME. An email will be sent to users who currently have data in IME, offering to sync it to Scratch.

No other HPC clusters have IME, so they are not impacted by this issue.

Update 5/4: We have received the proper updates and applied them. We will restart the IME daemons at 8:30 am on 5/5. There should be no adverse effects on user jobs that are running while the update proceeds.

Update 5/5: IME is working and stable on Great Lakes and is available to users.

Report Additional Impacts

Contact the ITS Service Center for more information or to report additional impacts.