mdadm woes / array suddenly kicking OK disk

Discussion in 'Computer Games and General Discussion' started by Scorpei, Jan 14, 2012.

Jan 14, 2012
  1. Scorpei
    OP

    Member Scorpei GBAtemp Maniac

    Joined:
    Aug 21, 2006
    Messages:
    1,295
    Country:
    Netherlands
    I recently saw some strange behaviour on my RAID 5 array. The array is made up of 6 disks, one of which is a hot spare. All disks are 1 terabyte, from 2 brands (WD and Samsung), of different types (EADS / EACS / Spinpoint F1 and F3) and ages. The disks all report SMART OK and I believe it :). Anyway, I recently added the hot spare (Spinpoint F1) because one of the disks periodically reported a SMART failure/warning, which I have put down to heat: the silencer it was in combined with low airflow. To be safe, I added the hot spare. None of the disks is partitioned or formatted; they are all used raw.

    Now, however, I moved the array to a new system (the same hardware, with a new install on a different disk). Keep in mind that the old system still worked flawlessly with the array at this point, and perhaps the new system would have too if I had moved my mdadm.conf over. However, the new system reported 1 disk as not being part of the array (so a 5-disk array, including the hot spare and missing one of the regular disks), which made me run "mdadm --assemble --scan /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg". Afterwards the array was deemed degraded and started rebuilding.....

    It would appear that a regular disk (WD EACS) that had previously NOT been giving any problems (not the heat issue or any other) suddenly reports fewer sectors and is not detected as part of the array... In the old setup there were absolutely 0 problems with the disk. The disk also reports no reallocated sectors or anything else. No partitioning has been done in between.... I am at a loss as to what could be causing this. Even after a full wipe, the disk still reports fewer sectors than all the other disks. Meanwhile the array is back up, now containing the hot spare and running in clean RAID 5 mode.

    Before the rebuild, the new install already reported the disk as not part of a RAID array; when I booted back into the old OS, the array (and thus that disk) came up like a charm.. I did not check whether the disk also reported a smaller size in the old system, as I did not expect this problem. I can still boot into the old system and check, which I will possibly do.
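
    For anyone wanting to compare the sector counts the kernel sees for each member disk, here is a minimal sketch. It assumes a Linux system where /sys/block/<dev>/size gives the size in 512-byte units; the function name compare_disk_sizes and the SYSFS_ROOT override are hypothetical, introduced only for illustration (the override lets you try it against a fake directory tree).

```shell
#!/bin/sh
# Print "<device> <sectors>" for each named disk and flag any disk that is
# smaller than the largest one in the list.
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a fake directory; by default it points at the real sysfs tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

compare_disk_sizes() {
    max=0
    # First pass: find the largest sector count among the given disks.
    for dev in "$@"; do
        size=$(cat "$SYSFS_ROOT/$dev/size") || return 1
        if [ "$size" -gt "$max" ]; then
            max=$size
        fi
    done
    # Second pass: report each disk and how many sectors it falls short.
    for dev in "$@"; do
        size=$(cat "$SYSFS_ROOT/$dev/size") || return 1
        if [ "$size" -lt "$max" ]; then
            echo "$dev $size SHORT by $((max - size)) sectors"
        else
            echo "$dev $size"
        fi
    done
}

# Example call, using the device names from the assemble command above:
# compare_disk_sizes sdb sdc sdd sde sdf sdg
```

    A disk flagged as SHORT while its label promises the same capacity as the others is a hint that something below the OS is hiding sectors, though this check alone cannot say what.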

    Does anyone have any clue what just happened to the array, and perhaps a way to get the drive back in there (as hot spare)? The easiest way would of course be to back up all data, simply toss the entire array and rebuild it anew while tinkering with partitioning. However, as the array is 4 terabytes big I have nowhere to store the data in the meantime, and I do want to keep it. Any help appreciated :).

    Oh yeah, I forgot to mention: it's the only disk for which fdisk doesn't report partition table problems (all the other disks are reported as not having a partition table and as having invalid flag 0x0000 on partition table 4).
     
  2. Scorpei
    OP

    As I feel the urge to keep this updated...

    After some research I have tried a few things:
    -Rezero the drive via dd if=/dev/zero of=/dev/sdf
    -Rezero and diagnose via WD tools
    -SMART diagnose with regular linux tools
    -Botch the partition table altogether with lots of nice writes
    -Tried these things on a different machine
    None of this has had any effect. The label on the disk clearly shows the same number of sectors as the other disks, which is more than any tool reports. SMART reports all clear, and so does the WD tool, without any remapped sectors. This brings me to several new options, of which the currently most likely is an HPA, or host protected area. It could be that, for whatever reason, it was always there and I simply didn't see it: I am not the first owner of the drive, and perhaps because of the way I built my array, mdadm simply made the array smaller than the other disks allowed (which I could have kept that way by moving my mdadm.conf instead of using --assemble --scan). Or it magically appeared. Not only does my server see this disk with fewer sectors; a second machine running a different OS sees fewer sectors as well.

    To test this theory, however, very few tools are available, as people tend not to play around with HPA. I highly doubt this error manifested itself only when I installed the new OS and let it rebuild the array; theoretically that shouldn't have made any writes to the drive. I am going to run diskstat under Linux (currently the disk isn't connected to a Linux machine) to see if an HPA is present. If it is, it will take a while to get the HPA off there, as I need a machine that supports the DOS boot disks needed for the required tools.

    Options to remove HPA
    -A linux command line .c
     
  3. Scorpei
    OP

    An update. After giving it a few more hours today and putting the drive back in the Linux machine, I have fixed the problem. I now truly believe my array used to be 4 megabytes smaller than it could have been. It turns out it was the HPA causing the problem. I have not taken the time to see what was in the HPA; I know it wasn't my data, so I really didn't care.

    When it came to diagnosing the HPA, it turned out I could not use any of the tools I had previously looked up: the CDs didn't work with the hardware I had available, and the command line tool didn't compile properly and was written for ATA as opposed to SATA. I did however find out via dmesg that an HPA was in use on the drive.


    Now, in order to verify this, I looked long and hard and finally found that one of my favorite programs is capable of checking it: hdparm. It turns out hdparm has a -N flag that checks for an HPA. It can also change the HPA setting via -N by passing the number of sectors wanted; in my case the command ended up being:
    hdparm -N1953525168 /dev/sdh


    Sadly, that value does not persist across a power down, and the kernel didn't pick up on the new size (without passing a command). hdparm also has a -k flag, which I understood to tell the drive to keep any changes made to its settings even after a power cycle (as far as I can tell from the man page it only preserves certain settings over a soft reset), a dangerous flag to pass if you are doing silly things. Regardless, I tried to pass it, but it gave me an ioctl error and didn't seem to have worked. After googling a bit more I found that hdparm also has the ability to write its -N value permanently in a different way:
    hdparm -N p1953525168 /dev/sdh
    That did the trick, and the drive is now back to full size and once again a functional hot/cold (spun down) spare for my array. So in the end the problem could have been fixed in various ways, but this is the nicest. The array has been restored to a proper, fully functional version using all the disk space available (whoohoo, 4 megs more!!!!), and the strange 'secret' parts of the drive / original array work fine again.
    By the way, funny thing: after restoring the disk to its original size, diskutil suddenly recognised it as part of an array. I think I may have found the cause: the drive was previously used in a hardware array (afaik mdadm doesn't use HPA) and held its configuration there. That would explain the small size (1 megabyte) and the use of HPA in general.
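
    As a rough sanity check on those sizes, the arithmetic can be sketched from the dmesg line quoted further down this thread. Two things here are assumptions and not something reported by the drive: 512-byte logical sectors, and 4 data disks' worth of capacity (5 active RAID 5 members with one disk's worth of parity). The variable names are made up for illustration.

```shell
#!/bin/sh
# dmesg line as quoted in this thread.
line='[49759.766072] ata17.00: HPA detected: current 1953523055, native 1953525168'

# Pull the two sector counts out of the line.
current=$(printf '%s\n' "$line" | sed -n 's/.*current \([0-9][0-9]*\), native [0-9][0-9]*.*/\1/p')
native=$(printf '%s\n' "$line" | sed -n 's/.*native \([0-9][0-9]*\).*/\1/p')

sector_bytes=512   # assumption: 512-byte logical sectors
data_disks=4       # assumption: 5 active RAID 5 members, one disk's worth of parity

# Sectors hidden by the HPA on this one disk, and the same in bytes.
hidden_sectors=$((native - current))
hidden_bytes=$((hidden_sectors * sector_bytes))

# RAID 5 sizes every member to the smallest disk, so the shortfall
# is multiplied by the number of data disks.
array_lost_bytes=$((hidden_bytes * data_disks))

echo "hidden per disk: $hidden_sectors sectors = $hidden_bytes bytes"
echo "array capacity lost: $array_lost_bytes bytes"
```

    Under those assumptions this works out to 2113 hidden sectors, a little over 1 MiB on the disk and a little over 4 MiB across the array, which is roughly in line with the figures above. mdadm rounds member sizes, so the real numbers will not match exactly.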
    All in all glad I fixed it. Hope someone else will have use for these posts at some time in the future and if so, glad I could help :).
     
  4. exangel

    Member exangel executioner angel

    Joined:
    Apr 20, 2010
    Messages:
    1,574
    Location:
    Tucson, AZ
    Country:
    United States
    Thank you for following up. On a gaming forum, people rarely troubleshoot problems this complex. Only a really small number (if any) of other 'tempers personally own or administer an array of that size, or even use a RAID configuration other than RAID 0, for that matter.
    Anyway, it's kind of you to let us know you resolved this and to provide the solution you found, for others who may come across this thread by search.
     
  5. Coto

    Member Coto GBAtemp Addict

    Joined:
    Jun 4, 2010
    Messages:
    2,277
    Country:
    Chile
    Thanks, it'll be helpful with newer drives (and newer technology) on older architectures!

    Since HPA works together with the BIOS, and your HDD controller was working fine on the older computer, a wrong report from the HDD's HPA to the BIOS could cause lockups, corruption or anything else,
    for example by incorrectly accessing a different area, mainly if the MBR's 1st OS partition began at 1953525168 (unsure if this was dumped from the MBR or from the HDD firmware layer itself).

    Code:
    [49759.766072] ata17.00: HPA detected: current 1953523055, native 1953525168
    
    Thanks for this topic. This site has a lot of info about HDDs, HPA, platter speeds, and so on:

    http://www.nslu2-lin...ownUSBHarddisks

    edit: fixed wrong quote
     
