Posted by: Judith Morganson
Last month, between the 11th January and the 24th January, we experienced three infrastructure outages.
In line with our desire to be transparent and to provide as much information to our customers as possible, below we have detailed what happened and what we did.
While infrastructure failures are not common, they do occur from time to time, and if the outages had any impact on our clients' businesses we apologise.
Osmotion uses a third party datacenter provider called SoftLayer (www.softlayer.com). When providing hosting services, Osmotion commissions and maintains our clients' installations on Virtual Machines hosted on SoftLayer infrastructure.
On the 12th January 2013 a piece of hardware called a router began to display erratic behavior, the main symptom being higher than usual CPU usage. The behavior first became apparent at 11:18am (AEST). Although SoftLayer had to intervene, the hardware did not fail outright. It did, however, disrupt web traffic, as some requests timed out.
Within 2 hours SoftLayer had moved to the failover hardware and brought the router back online. At 6:30pm (AEST) SoftLayer advised that they would be upgrading the hardware in 30 minutes and that a short disruption would occur during the process.
The upgrade was completed and an upgraded redundant server was put in place to handle further issues. Final notification was received at 7:50pm (AEST) advising that the issue was resolved and would be monitored.
On the 20th January 2013 the same router began to display similar symptoms. SoftLayer advised us at 12:54pm (AEST) that some disruption was occurring and that the issue was being investigated. At 1:50pm (AEST) SoftLayer advised that the issue was causing significant disruption and that an outage would occur while their engineers investigated and rectified it. SoftLayer advised at 5:31pm (AEST) that the issue was closed and all services had been restored.
On the 24th January 2013 the router again began to display the same symptoms. Osmotion received a notification at 01:46am (AEST) advising that the issue was being investigated.
At 06:30am (AEST) we received notification that the offending hardware would be completely replaced with a new device. We were advised that there would be an outage of 10 minutes, between 07:12am (AEST) and 07:22am (AEST).
At 07:32am (AEST) we were advised that all issues were resolved and services restored.
During this period the Osmotion Technical Team and the Support Team worked closely together to notify our customers and keep them up to date.
However, this was our first major series of issues and we were keen to review what we did and identify how we can improve our processes. Our major focus was to determine if our hosted clients received the information they needed in a timely manner and if the channels we used (Email, Phone and Helpdesk) were appropriate.
The key review items for our Support Team were:
Last year we introduced two tools to manage application and infrastructure monitoring: Nagios and New Relic. The intention was that we would be notified as soon as an issue became apparent, allowing us to notify our customers rather than the other way around.
However, during each of these incidents our customers knew just before, or at the same time as, we did. Our review identified that the notifications run on a 15 minute schedule, which means we could find out up to 15 minutes after an outage occurs (this is in fact what happened on one occasion).
Additionally, our processes focused on our own notifications, and the SoftLayer notifications were not monitored in real time but rather used as a diagnostic tool.
To rectify this we have started to work with SoftLayer to make their notifications real-time and to shorten the schedule for our own tools. The final solution for consuming notifications will require some work and we will update our Hosted Customers once we have finalised and formalised the process.
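To illustrate why the length of the schedule matters, here is a minimal sketch of the delay between an outage starting and the next scheduled check picking it up. It is not our actual Nagios or New Relic configuration: the function name is hypothetical, checks are assumed to fire at fixed multiples of the interval, and the one minute schedule is only an example figure for comparison, not a committed target.

```python
def detection_delay_seconds(outage_start: int, interval: int) -> int:
    """Seconds from the start of an outage until the next scheduled check.

    Assumes checks fire at 0, interval, 2 * interval, ... seconds, and that
    a check firing at the exact moment the outage starts detects it.
    """
    return (-outage_start) % interval


FIFTEEN_MINUTES = 15 * 60  # the schedule described in the review above
ONE_MINUTE = 60            # hypothetical shorter schedule, for comparison only

# Worst case over every possible outage start time within one cycle.
worst_15 = max(
    detection_delay_seconds(t, FIFTEEN_MINUTES) for t in range(FIFTEEN_MINUTES)
)
worst_1 = max(detection_delay_seconds(t, ONE_MINUTE) for t in range(ONE_MINUTE))

print(f"15 minute schedule: up to {worst_15} seconds before we find out")
print(f"1 minute schedule:  up to {worst_1} seconds before we find out")
```

An outage that begins just after a check has run goes unnoticed for almost the whole interval, which is why a customer can easily see the problem before we do. Shortening the interval reduces that worst case in direct proportion.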
Once an issue was identified, our best endeavors to notify you of the disruption relied on out-of-date contact lists and email addresses, which slowed down the notification process.
To rectify this we have asked for your help to create the most appropriate recipient list for notifications. Each of our Hosted Clients has been asked to complete a questionnaire that will allow us to get the information to the people that need it. Our internal systems are in place to call up the list as required.
We also spent some time analyzing and critiquing the most effective means of communicating with clients. Again, we have engaged directly with our Hosted Clients to ascertain which channels are appropriate for which users. If you would like to suggest additional channels, please feel free to do so below or on our Helpdesk.
I believe that our capacity and capability to deal with clients directly is a key part of our Support value proposition. We will maintain phone calls as our first step to ensure clarity in communication.
However, we understand that people are not always available, so our commitment is that all initial notifications and updates will also be forwarded by email and text message where requested.
Updates will continue to be posted to the Osmo Helpdesk Home Page to ensure users not on the key contacts list have access to the information.
Thank you for your cooperation and feedback. While we have some work to finalise with internal notifications, we feel this experience has helped with another Osmotion business improvement.