Dear Valued Customer,
Recently we had an extended outage for the customers on one of our servers. Now that the crisis has passed and we have had time to analyze the root cause of this incident, I would like to provide this report.
On February 17th our monitoring system alerted us that a server running Standard class VPS plans was not responding. We immediately investigated why the server was down. The severity of the crash required that we dispatch our on-site data center technicians to gain physical access to the server. We determined that the RAID array had suffered a catastrophic failure and that the data on the array was unrecoverable. Losing data on a RAID array is not a frequent occurrence, but those in our industry know that it does happen. When it does, our recovery process is to restore the VPS containers from the nightly backup. We began this process immediately and notified the customers on the server.
This server was using a new backup system that we purchased from a leading vendor in 2009. This system was activated for Standard class servers in late 2009, and we had been running on it for several months. The backup software does an excellent job of backing up the servers and provides a high degree of control for managing the backups and selectively restoring files. It also processes backups much more efficiently, resulting in lower load on our production servers. Our plan was to roll this new backup system out to our other types of plans in the future.
When we began restoring the VPS containers from backup, we encountered excruciatingly slow speeds. Prior to deploying the new backup system we performed extensive testing of the restoration process. We found that the restoration speed was about 50% slower than our old method, but we were willing to accept that for two reasons. First, the backup processing side of it was significantly faster than the old method, resulting in less load on the servers. Second, the vendor promised us that a new version with much faster restore speeds was going to be released soon. We accepted this trade-off.
Unfortunately, in the real world the restoration process was not 50% slower but up to five times slower, depending on the number and size of files in the container and other factors. Our post-incident analysis determined that the amount of time elapsed and the number of incremental backups taken since the initial seed backup greatly affected the restore times. The restoration process took over 24 hours. In a worst-case scenario it would have taken 4 to 5 hours to restore a server under our previous method. This of course was totally unacceptable. Monitoring during the restore process showed that while the backup server hardware (external FC array, dual quad-core CPUs, 4 GB of RAM) was showing a load of less than 1, the restores were still taking an exorbitant amount of time. The bottleneck was in the software.
The nightmare didn’t end when we completed the restores. We found that more than half of the restored VPS containers inexplicably had files with altered permissions. We had never encountered this problem during our pre-implementation testing. When we discovered this we assigned a team composed of our Chief Technical Officer and other top technical staff to analyze the problem. They devised a solution that used a combination of programming tools and manual effort to fix the problem. This solution was successful due to the hard work and perseverance of our staff, but unfortunately took two days to complete.
We have now reverted to our old, tried and true backup method and have no plans at this time to implement this new commercial backup system for our other servers. We are engaging this vendor to learn more about the problems we have had with their software.
At this time we would like to again express our deepest apologies to the customers who were affected by this unfortunate event. A downed server is serious business to us, and we appreciate the extreme patience of those customers whose sites were down. We understand that a site outage makes our customers frustrated and upset. A few customers let not only us, but the world, know it. But we believe that our customers have a right to be demanding, and when we do not live up to expectations our company must accept the consequences.
In closing I would like to mention two things. First, our technical staff is seriously engaged in evaluating and testing a new generation of hosting technologies that we expect to take our service to the next level in terms of performance and reliability. We are setting the bar higher for ourselves, and at the proper time we will announce the results of this effort. Second, we are blessed to have extremely loyal customers. We are truly grateful for the trust you place in us. We do not take that trust for granted. The people in our company will continue to work as hard as possible to earn the right to be your preferred hosting provider.
Sincerely,
Rick Lingsch
President
eApps Hosting
