ITS Service Status Report

Emergency Maintenance — Service Restored

Restarting the non-restricted Turbo, Locker, and Data Den Globus endpoints.

Services Affected: Research Storage

Start Time: 01/11/2023 11:00 am

Service Restored: 01/26/2023 9:46 pm

Issue Symptoms: Outage

We need to restart the server behind the Globus endpoints for Turbo, Locker, and Data Den to potentially fix contention issues between the group sync scripts and Globus transfers on the server, which are causing large file transfers to restart. This will be done on 1/11 at 11 am. We will also be restarting one of the Locker/Data Den protocol nodes as part of this maintenance.

Edit 12:15 pm: The restarts did not fix the issues. We are continuing to investigate the problem and identify potential causes.

Edit 3:30 pm: The endpoints have been up since 12:30. We are still investigating why volumes periodically lose their connection to the server behind the endpoints.

Edit 1/12 5 pm: The endpoints were working smoothly except for a specific job involving a 1.7 TB file. We have paused that customer's other jobs until the large file transfer completes. We will let things run overnight until the large job finishes, then restart the customer's other jobs. We are continuing to add tuning parameters to attempt to improve performance on both the client and the NFS server.

Edit 1/17 noon: We are still having issues with large files finishing. We have narrowed it down to checksum issues on the xfer1 node. We are continuing to look into issues with the Locker NFS server and are engaging with IBM on improvements. All jobs are currently finishing except those for files larger than 2 TB.
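For context on why checksum failures surface mainly on very large files: Globus verifies a transfer by checksumming the file after it lands, so the whole file must be re-read from storage. A minimal sketch of that kind of chunked checksumming is below (function name, algorithm, and chunk size are illustrative, not ARC's actual configuration); reading in fixed-size chunks keeps memory use flat even for multi-TB files, but means any NFS hiccup during the long re-read can fail the verification.

```python
import hashlib

def file_checksum(path, algorithm="md5", chunk_size=8 * 1024 * 1024):
    """Checksum a file by streaming it in fixed-size chunks, so even a
    multi-TB file never needs to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read until EOF; each chunk is folded into the running digest.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```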

Edit 1/18 3 pm: We have deployed a new transfer node and moved all of the jobs to it. Testing shows that large jobs are completing on the new node. We have removed the old node from service and are performing upgrades on it; we will return it to service in the next day or two. We are working with the Globus team to explore the original error and understand what the issue was with the initial endpoint.

Edit 1/23 3 pm: We are still watching things, but we have not had a recurrence of the initial errors since 1/19. There may still be some performance issues. If no errors arise between today and tomorrow, we will close this issue.

Edit 1/26: We are closing this out, as we have not had a recurrence after a week of operation. We are also moving more bandwidth through these two servers than we did through three older servers, so things are in better working order.

Who is Impacted? Users of the ARC Globus service.

Technical Details

Service Type: Production

Comments:

1/12 update: We added some extra tuning parameters that were on the flux-xfer server, which also improved performance, but not enough to allow the large jobs to finish. We are reaching out to Globus and the GPFS vendor for additional parameters to make transfers smoother.
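The specific tuning parameters were not published in this report. As a hedged illustration only, client-side NFS throughput tuning of this sort typically involves mount options such as the following; the hostname, export path, and values are examples, not the settings ARC applied:

```shell
# Illustrative NFS mount with larger read/write transfer sizes and
# multiple TCP connections to the server. Values are examples only.
mount -t nfs \
    -o vers=4.2,rsize=1048576,wsize=1048576,nconnect=8,hard \
    locker.example.edu:/export/locker /mnt/locker
```

Larger rsize/wsize values reduce per-request overhead on big sequential transfers, and nconnect spreads traffic across several TCP connections, both of which matter most for the very large files that were failing here.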

1/17 update: We are adding a second server to the Turbo, Locker, and Data Den pools, and upgrading the OS from RHEL 7 to RHEL 8 to improve performance.


Report Additional Impacts

Contact the ITS Service Center for more information or to report additional impacts.