Update - Services are now available after underlying storage problems on the evening of 24 July


[25 July 2014]

Update, 2.00pm, 25 July: Further details on the incident and some FAQs are available below.


Update, 8.30am, 25 July: All services identified as being affected by yesterday evening's storage problems are back. If you are aware of any ongoing problems, please contact the IT Service Desk.


Update, 8.00am, 25 July: The Plone CMS service was restarted last night and affected websites are available. We will continue to monitor services closely.


Update, 9.40pm, 24 July: Most services are back, but the Plone website Content Management System (CMS) is unavailable. This means that some Faculty, School and Department websites are unavailable. We also have reports that some Library services are unavailable. We are working to recover these services.


Update, 8.25pm, 24 July: Storage was back at 6.30pm and we have been bringing services back since. Most services were back by 7.30pm. We are continuing to work on restoring the remaining services.


Update, 7.00pm, 24 July: Work has taken place to bring back the storage system. Services should be returning but this may take some time. The storage system is considered at risk and additional work is needed.


5.45pm: A large number of University services are currently unavailable. This is due to an issue with a storage system underlying the services. IT Services is working to resolve this as soon as possible in conjunction with our supplier.

Services affected include:

  • Finance Systems
  • Student Systems
  • Research Systems (including Pure)
  • HR Systems
  • Sympa (mailing list service)
  • MyBristol
  • Contact Directory
  • Business Objects
  • Proactis
  • Open Days Booking
  • Datahub
  • Student Printing
  • RxWorks
  • Aleph
  • Wiki Service
  • Site Manager CMS (editing of pages)
  • Plone CMS (editing of pages)
  • Planon
  • Lenel
  • Cougar

 

Further details and some FAQs

What caused the interruption?

A controller within our main data storage array failed. A large number of our services run on this underlying storage array. We fixed the problem by diagnosing the cause and then swapping out the faulty part, under guidance from our supplier.

What resilience do you have in the storage?

There are multiple controllers and multiple copies of the data within the storage array. The system is designed so that hardware items within the array can fail in use without the services or data being affected.

However, this particular failure occurred in the middle of a routine software (firmware) upgrade. A failure during this brief window is more serious, and in this case the system didn't come back cleanly.

Could the fault with the component have been detected in advance?

Our supplier performs health checks on the hardware every 24 hours. The last check did not detect any problems. This issue will be the main focus of the wash-up meeting we will be holding with the manufacturer over the next few days.

Was any data lost?

No data on the storage was lost in the incident.

It is possible that some unsaved data was lost (e.g. if someone was in the middle of editing a webpage at the time, they may have been unable to save the page).

Why does it take time to restore the service?

Modern IT architecture is constructed in layers, e.g. network, storage, servers, and finally the applications people use. This approach is necessary to provide well-performing and cost-effective IT.

On this occasion the storage failed. We first have to diagnose the issue and then restore the storage service. Only once this has been done can the various servers and services which run on top of it be restarted. Some services are in turn dependent on other services, so there is a distinct order in which some of them must be brought back up (see the sketch below). Most services were back by 7.30pm, but some were restored later.
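To illustrate why restart order matters, here is a minimal sketch in Python. The service names and dependencies are hypothetical and do not reflect the University's actual topology; it simply shows how dependencies force a particular restart sequence.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that must be
# running before it can be restarted. Illustrative names only.
dependencies = {
    "storage":   set(),
    "database":  {"storage"},
    "auth":      {"database"},
    "mybristol": {"auth", "database"},
    "plone_cms": {"storage", "database"},
}

# TopologicalSorter yields services in an order that respects dependencies,
# so nothing is started before the things it relies on.
restart_order = list(TopologicalSorter(dependencies).static_order())
print(restart_order)
# e.g. ['storage', 'database', 'auth', 'plone_cms', 'mybristol']
```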

Isn’t it dangerous having so many systems dependent on the storage array?

No, quite the opposite. Although obviously an interruption happened on this occasion, problems with the storage system are quite rare, as the array has so much internal resilience. Failures would occur more frequently, and be more difficult to recover from, if each server had its own internal storage.

Is this related to the MyFiles storage problems?

No. That was an entirely separate storage system designed to hold staff and student individual files. This is a storage system which underpins servers hosting business applications.

We provide many storage services, of which MyFiles is only one, and it is separate from the other storage services. There are no dependencies between them.

What would you have done if you couldn't have restored the array?

We also have an entirely separate version ('B end') of the storage located on another site. This is intended for use in a scenario where the primary version ('A end') cannot be recovered.

However, there is a lag in data being copied to the B end. Failing over to it is not a quick process, and will result in some data loss as the B end is always slightly less up to date than the A end. Therefore, on this occasion, the best course of action to minimise interruption and avoid loss of data was to restore the A end.
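As a rough illustration of the trade-off, the sketch below (illustrative Python only, not the actual replication mechanism in use) shows why failing over to a lagged copy sacrifices the most recent writes.

```python
# Minimal sketch of why failing over to a lagged replica loses recent data.
# Entirely illustrative; names and numbers are assumptions, not our real setup.

a_end = []           # writes applied to the primary (A end)
b_end = []           # asynchronously copied to the secondary (B end)
replication_lag = 2  # number of most recent writes not yet copied across

for i in range(10):
    a_end.append(f"write-{i}")
    # the B end only ever holds the writes that have already been shipped
    b_end = a_end[:-replication_lag]

# If the A end were abandoned and services failed over to the B end,
# the last `replication_lag` writes would be missing:
print(set(a_end) - set(b_end))   # {'write-8', 'write-9'}
```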