Host 3 Server

  • Sunday, 20th May, 2018
  • 2:09 p.m.
UPDATE May 23 @12:26 p.m. EST -- All websites have been restored. If you are missing a website or email, or have an outdated database, I can try to restore them from the remote backup taken Sunday. Send an email to support@vervehosting.com with as much detail as possible.

UPDATE May 22 @9:11 p.m. EST - About 50% of the sites have been restored. The restore process is ongoing.

UPDATE May 22 @7:30 a.m. EST -
From LW: As previously stated the deploy is still running so I have taken no action.  Upon thorough investigation it appears that the deploy was hung on installing CloudLinux and this has caused the delay in getting your server online.  I have canceled the CloudLinux installation and I will have to complete this manually after the rest of the deploy is finished.

UPDATE May 22 @4:12 a.m. EST -
From LW: At this time the server is still deploying and installing critical server software.


For those with questions, here is a timeline of the events that caused the current problems:


May 20 @1:55 p.m. EST  - The server goes down briefly, reboots and is online for about 10 minutes before it's offline again.

May 20 @2:00 p.m. EST -  Liquid Web (the company we lease the server from) is notified and says they are looking into it.

May 20 @2:10 p.m. EST -  LW notifies me that power to the rack where the host3 server sits was briefly interrupted and the servers are rebooting.

May 20 @2:23 p.m. EST  - LW notifies me that the host3 server was damaged due to the sudden power outage and would not reboot. Their systems recovery team is investigating.

May 20 @3:16 p.m. EST -  LW responds that there are inconsistencies in the drives on the host3 server and they are starting a manual disk check.

May 20 @10:33 p.m. EST  - LW notifies me that there is extensive corruption in the file system and the server will have to be restored from a backup.

May 21 @3:38 a.m. EST  - LW notifies me that they are setting up the new server.

May 21 @9:00 a.m. EST  - LW notifies me that they are beginning the restore process. They restore the sites to the wrong IP address, and we correct the IP addresses. For some of the sites, the MySQL users are not restored along with the database. This has to be corrected manually by logging into cPanel for each affected site, creating the MySQL user and, if necessary, updating the site's configuration file.

May 21 @5:00 p.m. EST  - We notice that the server is running out of space and realize that LW has partitioned the server incorrectly. We try to move directories around to free up space but cannot free enough for the /var directory, where the MySQL databases reside. LW insists on taking the server offline to resize the partition. With no other options, we reluctantly agree to let them take the server offline at 11:00 p.m.

May 22 @12:46 a.m. EST  - LW notifies us that there was an issue with the LVM resize that left the entire /home partition unusable. As a result, the server won't boot and the drive is also failing to FSCK. LW has to start the restore process over from the beginning. This will take a few hours, as a new OS has to be booted and the sites then restored from the backup drive.


UPDATE on 5/22 @ 12:46 a.m. : Liquid Web (the datacenter where we lease the server) screwed up the resize of the /var partition:

Hello,

I have picked up this ticket as David is no longer in the office.  I have had an initial look at the server and it appears there may have been an issue with the LVM resize that caused the entire /home partition to be unusable and as a result the server won't boot and the drive is also failing to FSCK.

Looking at the server I am unsure why the server was kicked as LVM when our normal kicks only use a normal single / partition scheme so that there are only 2 partitions on the system.

As the system is failing to boot and the /home partition is unusable I will have to start the restore process over from the beginning.  This will take a few hours as this will require a new OS to be booted and then the sites restored from your backup drive.

I will update you again when your server is back online.

Regards,
Nathaniel C. Bailey
Systems Monitoring & Recovery Team



UPDATE: The techs at the datacenter have to take the server offline to increase the /var partition size. They expect this will take about 1 hour.

UPDATE: I know this has been a nightmare and I sincerely apologize. Please be patient. We are working as fast as we can. We are currently working to resolve the error message regarding space on the / partition. Once that's done I will go to each site and resolve the database issues.

UPDATE: Liquid Web's recovery team will start restoring the sites from the backup drive. Once that's done they will rsync the /home partition from the old server to the new server. There is no ETA on how long this will take. The cPanel restore function is not fast, but may go faster if the restore is done while there's no traffic to the server. I apologize for the issues with the server and will be issuing downtime credits for May.

UPDATE: The manual FSCK completed, but due to the extensive corruption in the file system for host3.vervehosting.com, the SQL data residing in /var was moved to lost+found and is now unrecoverable. The techs have started building a new server and will restore from the local backup drive. This is going to take several hours. I apologize for the inconvenience. If you have backups of your sites I can set you up on a different server right now and you can restore your backups there.

UPDATE: The tech is running a disk check on the drive instead of replacing it. He expects it to take a couple of hours due to the amount of data on the drive.

UPDATE: The server failed to boot once power was restored. They are investigating.

The power supply in the rack that holds the host3 server died and has to be replaced. The technicians at the datacenter are replacing it now and expect that all affected servers will be back online in 20 to 30 minutes.