
Recent Server Outage - Post Mortem


This is a follow-up review of the server outage that occurred on Monday, July 12th, 2010.

 

We received the first report of the outage at around 5:30 pm Monday evening. Shortly afterward, technicians at the data center diagnosed the problem as two failed hard drives in one of our servers. The data center is physically located in the Equinix facility in Dulles, VA. The company we use to manage our servers is Serverbeach. We have been using their services for six years without any hardware failures. The technicians quickly provisioned a new server with brand new drives. Early the next morning we began to reload all files from backups. The full restore took until about 5:30 pm Tuesday evening, roughly 24 hours after the first report.

 

One thing we have focused on with our hosting infrastructure at WinWorld is redundancy and disaster recovery. We were using a redundant array of hard drives to protect against a hard drive failure. For the unlikely event that the array itself failed, we also had full nightly backups in place so that we could restore quickly with minimal data loss. We were pleased that our backup system functioned as promised.

 

Over this past week we have reviewed the incident and have made, or are planning, the following improvements:

  • We have a system in place on the server that will notify us if a hard drive is starting to fail. We have increased the sensitivity level of that monitoring system to provide more advance warning in the future.
  • Our disaster recovery plan dictated how we would recover from the server outage. We have added more detail to the restore procedure so that it can be carried out faster in the future.
  • We realized that transmitting a large volume of data from our off-site backup location was time-consuming. We will investigate storing large sets of files from our backup rotation on a server within the data center so that the bulk of the files can be restored more quickly in the future.
  • We expanded the backup system to include more configuration files so that less reprogramming is required after a restore.
  • We will be investigating more frequent backups of the database throughout the day so that less data would be lost in a failure.
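
To illustrate the last point, here is a rough sketch of how backup frequency limits the amount of data that could be lost. It assumes backups are evenly spaced through the day and that a failure strikes just before the next backup runs. The function name and the example frequencies are illustrative only and do not describe our actual backup schedule.

```python
def worst_case_loss_hours(backups_per_day: int) -> float:
    """Worst-case hours of changes lost, assuming evenly spaced backups
    and a failure that happens just before the next backup runs."""
    if backups_per_day < 1:
        raise ValueError("backups_per_day must be at least 1")
    return 24 / backups_per_day


# Compare a nightly backup with hypothetical more frequent schedules.
for n in (1, 4, 24):
    hours = worst_case_loss_hours(n)
    print(f"{n} backup(s) per day: up to {hours:g} hours of changes lost")
```

With a single nightly backup the worst case is a full day of changes; every doubling of the backup frequency halves that window.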

Since the server outage, some clients have asked what they can do to reduce downtime if a hardware failure were to occur again.

 

The accounts affected by the outage were shared hosting accounts on one of our servers (see the hosting types description on our website). One way to mitigate downtime would be to move to a more isolated hosting environment, such as virtual or dedicated hosting. This has two benefits. First, the only load on the hardware over time is the load your own site places on the server. Second, if a hardware failure occurred, your files would be the only ones that needed to be restored. If uptime is even more critical for your web application, we could set up a live backup of your website so that if the primary site failed, traffic could be diverted to the backup site immediately. If you are interested in any of these options, please contact our sales representative.

 

Our own site was affected by the outage. We will likely implement a live backup site for ourselves so that, in the event of an outage, we could divert traffic to it and communicate more effectively.

 

We were able to utilize our Constant Contact email system, Facebook, Twitter, and Google Sites to communicate during the outage.

 

Please ensure your technical point of contact is subscribed to our newsletter and/or follows us on Facebook or Twitter.

 

Due to the extra measures we have put in place, we do not anticipate this type of situation occurring again any time soon.

 

We want to express our appreciation for everyone's patience during and following the downtime.

 

If you have any questions about the outage, please feel free to post them as comments on this article, which is archived on our website.

 

Yours in service,

 

Jase Clamp

Director of Operations

WinWorld




Comments (1)
Hot spare disk
written by Jeremy Gault, July 26, 2010
If you are using RAID 5, and have not already done so, you may want to look into having a "hot spare" drive.

For example: If your RAID 5 array has three drives, you would have a fourth drive in the system that was not used (but was powered up). This fourth drive would be marked as your "hot spare" drive. If one of the three drives in the RAID 5 array were to fail, the controller would immediately begin rebuilding the contents of that drive onto the "hot spare" drive (using the contents of the remaining two drives). As long as you didn't have a second drive failure before the rebuild completed, you would be safe. After the rebuild completes, you could have a second drive failure and still retain your data. (A three-drive RAID 5 array keeps working as long as at least two of its drives are working.)

Of course, if two drives fail simultaneously (or the second fails before the rebuild is complete), you're out of luck. However, having that "hot spare" can decrease the risk of any data loss.
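
To show why the remaining two drives are enough for the rebuild, here is a minimal sketch of the XOR parity idea behind RAID 5. It assumes a simplified layout with a single stripe holding two data blocks and one parity block; a real controller rotates parity across the drives and works on many stripes. The names `xor_blocks`, `drive_a`, `drive_b`, and `parity` are made up for illustration.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# One simplified stripe: two data blocks plus their parity block.
drive_a = b"customer"
drive_b = b"database"
parity = xor_blocks(drive_a, drive_b)

# Suppose drive_b fails. The survivors (drive_a and parity) are enough
# to rebuild its contents onto the hot spare.
hot_spare = xor_blocks(drive_a, parity)
print(hot_spare == drive_b)  # True

# If drive_a failed too before the rebuild finished, only the parity block
# would remain, and neither data block could be recovered from it alone.
```

The same XOR works whichever single block is missing, which is why one failure is survivable and two simultaneous failures are not.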
