Sunday, November 03, 2013

Dell PERC Punctured RAID Array Swing Fix on Windows Server 2008 R2

It feels like any Dell RAID 5 array will eventually end up in what they call a "punctured" state.  I believe this means that bad data is being replicated across the drives, so if the array were to lose a drive, it could not be rebuilt.  Everything still works, but if a drive goes, you are going to be sad.

This keeps happening across multiple servers.  I'm not sure if it is something we are doing, a design flaw, or simply a risk of using RAID.  Regardless, I just had the pleasure of fixing one.

Dell support will tell you to back up, delete the RAID config, recreate it, and restore.

Well, when you are talking about over 5TB of data in a production environment where, at best, you can get a weekend to do this, that is going to be a problem.  Copying that amount of data from any backup is simply going to take longer than my maintenance window.  Normally, you would just say, sorry users, downtime on the horizon.  But I guess I got lucky this time.

So in this case the physical server itself had more free slots than it had drives.  And for whatever reason, Dell shipped me 4 replacement drives  (2 of the 4 drives in use were showing SMART potential failure).

So, what I ended up doing was adding the new drives to the enclosure, and setting up a new Disk Group, with Virtual Disks that were identical to the ones already in place.

Next I booted into Clonezilla Live and copied the existing partitions into the new Virtual Disks.  (Don't forget to set the boot flag on the copied System Reserved partition.)  This part still took 12 hours.  But copying across the disk controller was significantly faster than copying from any external source.
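To put rough numbers on that, here is a minimal sketch of the arithmetic.  The 5TB and 12 hour figures come from above; the 40 MB/s restore rate is purely a hypothetical stand-in for "some slower external source", not something I measured, so plug in your own number.

```python
# Rough copy-time arithmetic. Sizes use decimal units (1 TB = 10**12 bytes).

def throughput_mb_per_s(total_bytes: float, hours: float) -> float:
    """Average throughput in MB/s for a copy of total_bytes that took `hours`."""
    return total_bytes / (hours * 3600) / 1e6


def hours_needed(total_bytes: float, mb_per_s: float) -> float:
    """Hours needed to move total_bytes at a sustained rate of mb_per_s."""
    return total_bytes / (mb_per_s * 1e6) / 3600


DATA_BYTES = 5e12          # "over 5TB", so treat this as a lower bound
CLONE_HOURS = 12           # how long the Clonezilla copy took
ASSUMED_RESTORE_MB_S = 40  # hypothetical restore rate, NOT a measured value

clone_rate = throughput_mb_per_s(DATA_BYTES, CLONE_HOURS)
restore_hours = hours_needed(DATA_BYTES, ASSUMED_RESTORE_MB_S)

print(f"copy across the controller: ~{clone_rate:.0f} MB/s")
print(f"restore at {ASSUMED_RESTORE_MB_S} MB/s: ~{restore_hours:.0f} hours")
```

With those inputs the clone works out to roughly 116 MB/s, while the assumed 40 MB/s restore would need about 35 hours for the same data, before you even count the time to delete and recreate the array.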

Once complete, I was ready for the "nerve racking" part.

I powered off the system and pulled the original disks.  There is no way on this PERC to take disks offline, so they had to be removed.  When booting, the server will alert you that the missing virtual disks have to be removed from the config if you want to continue.  Now, normally you would expect to be able to re-import the Virtual Disks if something went wrong, but I don't know if that is possible given the "punctured" status of the original Virtual Disk.  Thus the "nerve racking".


Anyway, Windows then tried to boot, but failed.  That was expected, as the disks had changed.  Specifically, the error is this:

"Windows failed to start. A recent hardware or software change might be the cause"
    some more info then
"Info:  The boot selection failed because a required device is inaccessible"

So, I then booted the server with my Windows Server 2008 R2 install disk, and chose "Repair your computer" once given the choice.

Choose Command Prompt, and once it comes up enter:
Bootrec /RebuildBcd

It will scan your system for installations of Windows, and should find one.  Mine was reported as E:\windows - don't let the different drive letter fool you (the recovery environment assigns its own drive letters).  Go ahead and add it.  Once done you should be fine to reboot.

I only post this because Dell support was not very helpful with it.  The concept of this plan sort of threw the people I was working with.  I call it a swing fix, but really my move was only half of a swing.  You could do a full swing if you didn't have the correct replacement disks: move your partitions in a similar way to a temporary storage location (perhaps a RAID 0 or a standalone disk), recreate your original Virtual Disks, then move the partitions back to complete the "swing".



