ITS Service Status Report

Service Degradation

MiStorage disaster recovery sync delay

Services Affected: MiStorage CIFS, MiStorage NFS

Start Time: 09/01/2020 9:50 am

Anticipated End Time: Unknown

Issue Symptoms: Degradation

September 21, 2020: Troubleshooting with vendor continues. Researching hardware solution.

September 17, 2020: Troubleshooting with vendor continues. Researching hardware solution.

September 16, 2020: Troubleshooting with vendor continues.

September 15, 2020: Sent more data to the vendor for analysis. 99% of shares are now syncing within SLE. An alternative disaster recovery approach may be needed for the remaining shares.

September 14, 2020: The setting change did not resolve the issue. Repeated service restarts over the weekend improved the percentage of syncs completing within SLE compared to Friday, but ITS is continuing to troubleshoot with the vendor.

September 12, 2020: Implemented a setting change and restarted the service as recommended by the vendor, and have begun to see improvements. On track to be back within SLE by Monday.

September 11, 2020: No production issues with MiStorage shares, but the regular daily sync of production data to disaster recovery has been lagging behind by more than one day. ITS is troubleshooting with the vendor to determine the source of the lag and remedy it.

  • What we are doing: ITS is working with the vendor.
  • Why we are doing this: The disaster recovery sync is running behind service level expectations (SLE).
  • Who will be impacted: This is a notification only. The only impact would occur if a disaster strikes the MACC data center and a restore is required from ASB. If that happens before the problem is resolved, some MiStorage shares may be restored without some of their most recent data.
  • Who will not be impacted: There is no impact to other systems such as Turbo or to other disaster recovery services such as MiBackup.

Who is Impacted? IT staff responsible for disaster recovery planning

Next Update: 12:00 PM Monday, September 28, 2020

Technical Details

Service Type: Production

Server Name: asb-its-cli.ifs.umich.edu

Comments:

September 12, 2020: Implemented a setting change and restarted the service as recommended by the vendor, and have begun to see improvements. On track to be back within SLE by Monday. Added a worker limit to the SyncIQ rule on the disaster recovery cluster. Started at a 70% limit per the vendor's recommendation but still saw issues; updated the limit to 99% and saw improvement. The theory is that without a limit, jobs were ending due to timeouts and not being cleaned up; with the rule in place, the system cleans up workers.
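The worker-limit change described above can be sketched with the OneFS SyncIQ performance-rule CLI. This is a hypothetical illustration only: the exact `isi sync rules` syntax and arguments vary by OneFS version, and the schedule shown is an assumption, not the actual rule used. Consult the cluster's own CLI help before running anything.

```shell
# Hypothetical sketch of a SyncIQ worker-limit performance rule on an
# Isilon/PowerScale cluster. Flags and argument order vary by OneFS
# version -- verify with `isi sync rules create --help` first.

# Create a rule capping SyncIQ workers at 99% of the allowed maximum,
# in effect at all times (interval and limit shown are illustrative):
isi sync rules create worker 00:00-23:59 99

# Confirm the rule is in place:
isi sync rules list
```

Because the rule is enforced by the scheduler, workers that exceed the cap are throttled and cleaned up by the system rather than timing out, which matches the behavior described in the September 12 update.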

Report Additional Impacts

Contact the ITS Service Center for more information or to report additional impacts.