Gaming mdadm woes / array suddenly kicking OK disk

Scorpei

Well-Known Member
OP
Member
Joined
Aug 21, 2006
Messages
1,295
Trophies
0
Website
scorpei.com
XP
263
Country
Netherlands
I recently ran into some strange behaviour on my RAID 5 array. The array is made up of 6 disks, one of which is a hot spare. All disks are 1 terabyte, from 2 brands (WD and Samsung), of different types (EADS / EACS / Spinpoint F1 and F3) and ages. The disks all report SMART OK and I believe it :). I recently added the hot spare (a Spinpoint F1) because one of the disks was periodically reporting a SMART failure/warning, which I have put down to heat from the silencer it was in plus low airflow. To be safe, I added the hot spare. None of the disks are partitioned or formatted; they are used raw.

Now, however, I have moved the array to a new system (the same machine, with a new install on a different disk). Keep in mind that the old system still worked flawlessly with the array at this point, and perhaps the new system would have too if I had moved my mdadm.conf over. However, the new system reported one disk as not being part of the array (so a 5-disk array, including the hot spare but missing one of the regular disks), which made me run "mdadm --assemble --scan /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg". After that, the array was deemed degraded and started rebuilding.....

It would appear that a regular disk (WD EACS) that had previously NOT been giving any problems (neither the heat issue nor anything else) suddenly reports fewer sectors and is not detected as part of the array... In the old setup there were absolutely 0 problems with the disk. The disk also reports no reallocated sectors or anything else. No partitioning has been done in between.... I am at a loss as to what could be causing this. Even after a full wipe, the disk still shows fewer sectors than all the other disks. Meanwhile the array is back up, now containing the hot spare, running in clean RAID 5 mode.
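A minimal sketch of how the "fewer sectors than the other disks" check could be scripted. On Linux the size of a block device in 512-byte sectors can be read from /sys/block/<device>/size. The device names below are hypothetical, and the sector counts are the two values that the kernel log later in this thread reports for the 1 TB drive; the helper names are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def read_sector_count(device: str) -> int:
    """Return the size of a block device in 512-byte sectors, read from sysfs."""
    return int(Path(f"/sys/block/{device}/size").read_text().strip())


def undersized_disks(sector_counts: dict[str, int]) -> dict[str, int]:
    """Map each disk smaller than the most common size to its shortfall in sectors."""
    most_common_size, _ = Counter(sector_counts.values()).most_common(1)[0]
    return {
        name: most_common_size - sectors
        for name, sectors in sector_counts.items()
        if sectors < most_common_size
    }


# Illustrative input only: device names are assumptions, not taken from the array.
example = {
    "sdb": 1953525168,
    "sdc": 1953525168,
    "sdd": 1953525168,
    "sde": 1953525168,
    "sdf": 1953523055,
    "sdg": 1953525168,
}
print(undersized_disks(example))  # {'sdf': 2113}
```

On a live system the dictionary could be filled with read_sector_count() for each member disk instead of the hard-coded values.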

Before the rebuild, the new install already reported the disk as not being part of a RAID array; when I booted back into the old OS, the array (and thus that disk) came up like a charm. I did not check whether the disk also reported a lower size in the old system, as I did not expect this problem. I can still boot into the old system and check, which I will possibly do.

Does anyone have a clue what just happened with the array, and perhaps a way to get the drive back in there (as hot spare)? The easiest way would of course be to back up all the data, simply toss the entire array, and rebuild it anew while tinkering with partitioning. However, as the array is 4 terabytes big I have nowhere to store the data in between, and I do want to keep it. Any help appreciated :).

Oh yeah, I forgot to mention: it's the only disk for which fdisk doesn't report partition table problems (all the other disks are reported as not having a partition table and as having invalid flag 0x0000 on partition table 4).
 

Scorpei

As I feel the urge to keep this updated...

After some research I have tried a few things:
-Rezero the drive via dd if=/dev/zero of=/dev/sdf
-Rezero and diagnose via WD tools
-SMART diagnose with regular linux tools
-Botch the partition table altogether with lots of nice writes
-Tried these things on a different machine
None of this has had any effect. The label on the disk clearly shows the same number of sectors as the other disks, and more than every tool reports. SMART reports all clear, and so does the WD tool, with no sectors remapped. This brings me to several new options, of which the currently most likely is an HPA, or host protected area. It could be that for whatever reason it was always there and I simply didn't see it: I am not the first owner of the drive, and perhaps, due to the way I built my array, mdadm previously just built the array smaller than the other disks allowed (which I could have kept that way by moving my mdadm.conf instead of using --assemble --scan). Or it magically appeared. Not only does my server machine see this disk with fewer sectors, but a second machine running a different OS sees fewer sectors as well.

Very few tools are available to test this theory, however, as people tend not to play around with HPA. I highly doubt this error manifested itself only when I installed the new OS and let it rebuild the array; theoretically that shouldn't have made any writes to the drive. I am going to run diskstat under Linux (currently the disk isn't connected to a Linux machine) to see if an HPA is present. If it is, it will take a while to get the HPA off there, as I need a machine that supports DOS boot disks for the required tools.

Options to remove HPA
-A Linux command-line tool written in C (setmax.c)
Code:
/* setmax.c - aeb, 000326 - use on 2.4.0test9 or newer */
/* IBM part thanks to Matan Ziv-Av  */
/*
* Results on Maxtor disks:
* The jumper that clips capacity does not influence the value returned
* by READ_NATIVE_MAX_ADDRESS, so it is possible to set the jumper
* and let the kernel, or a utility (like this one) run at boot time
* restore full capacity.
* For example, run "setmax -d 0 /dev/hdX" for suitable X.
* Kernel patches exist that do the same.
*
* Results on IBM disks:
* The jumper that clips capacity is ruthless. You clipped capacity.
* However, if your BIOS hangs on a large disk, do not use the jumper
* but find another machine and use a utility (like this one) to
* clip the non-volatile max address.
* For example, run "setmax -m 66055248 /dev/hdX" for suitable X.
* Now go back to your first machine and proceed as with Maxtor drives above.
*/
/* Header names were stripped from the post; these are the ones the visible code needs. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

#ifndef HDIO_DRIVE_CMD_AEB
#define HDIO_DRIVE_CMD_AEB 0x031e
#endif

#define INITIALIZE_DRIVE_PARAMETERS 0x91
#define READ_NATIVE_MAX_ADDRESS 0xf8
#define CHECK_POWER_MODE 0xe5
#define SET_MAX 0xf9

#define LBA 0x40
#define VV 1  /* if set in sectorct then NOT volatile */

struct idecmdin {
    unsigned char cmd;
    unsigned char feature;
    unsigned char nsect;
    unsigned char sect, lcyl, hcyl;
    unsigned char select;
};

struct idecmdout {
    unsigned char status;
    unsigned char error;
    unsigned char nsect;
    unsigned char sect, lcyl, hcyl;
    unsigned char select;
};

/*
 * The post lost the text between the start of tolba() and the end of
 * fromlba(); both are reconstructed here from the standard 28-bit LBA
 * register layout (sect, lcyl, hcyl, low nibble of select).
 */
unsigned int
tolba(unsigned char *args) {
    return ((args[6] & 0xf) << 24) + (args[5] << 16) + (args[4] << 8) + args[3];
}

void
fromlba(unsigned char *args, unsigned int lba) {
    args[3] = (lba & 0xff);
    lba >>= 8;
    args[4] = (lba & 0xff);
    lba >>= 8;
    args[5] = (lba & 0xff);
    lba >>= 8;
    args[6] = (args[6] & 0xf0) | (lba & 0xf);
}

/* Ask the drive to identify itself and print the LBA capacity it reports. */
int
get_identity(int fd) {
    unsigned char args[4+512] = {WIN_IDENTIFY,0,0,1,};
    struct hd_driveid *id = (struct hd_driveid *)&args[4];

    if (ioctl(fd, HDIO_DRIVE_CMD, &args)) {
        perror("HDIO_DRIVE_CMD");
        fprintf(stderr,
            "WIN_IDENTIFY failed - trying WIN_PIDENTIFY\n");
        args[0] = WIN_PIDENTIFY;
        if (ioctl(fd, HDIO_DRIVE_CMD, &args)) {
            perror("HDIO_DRIVE_CMD");
            fprintf(stderr,
                "WIN_PIDENTIFY also failed - giving up\n");
            exit(1);
        }
    }
    printf("lba capacity: %u sectors (%lld bytes)\n",
        id->lba_capacity,
        (long long) id->lba_capacity * 512);
    return 0;
}

/*
 * result: in LBA mode precisely what is expected
 *         in CHS mode the correct H and S, and C mod 65536.
 */
unsigned int
get_native_max(int fd, int slave) {
    unsigned char args[7];
    int i, max;

    /* The post cuts off here, partway through "for (i=0; i"; the rest of setmax.c is missing. */
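A rough sketch of the register packing that the listing's command structs imply, written in Python so it can be tried without touching a drive. It assumes the usual 28-bit LBA layout (bits 0-7 in sect, 8-15 in lcyl, 16-23 in hcyl, 24-27 in the low nibble of select) and the same args[3..6] indexing as struct idecmdin; the function names are made up for illustration. It also shows that the native sector count of this 1 TB drive does not fit in 28 bits, which may be one reason a tool from the LBA28 era is a poor fit here.

```python
LBA28_MAX = (1 << 28) - 1  # highest sector address a 28-bit command can carry


def from_lba(args: bytearray, lba: int) -> None:
    """Store a 28-bit LBA in args[3..6], keeping the high nibble of args[6]."""
    if not 0 <= lba <= LBA28_MAX:
        raise ValueError(f"LBA {lba} does not fit in 28 bits")
    args[3] = lba & 0xFF                                # sect: bits 0-7
    args[4] = (lba >> 8) & 0xFF                         # lcyl: bits 8-15
    args[5] = (lba >> 16) & 0xFF                        # hcyl: bits 16-23
    args[6] = (args[6] & 0xF0) | ((lba >> 24) & 0x0F)   # select: bits 24-27


def to_lba(args: bytes) -> int:
    """Read the 28-bit LBA back out of args[3..6]."""
    return ((args[6] & 0x0F) << 24) | (args[5] << 16) | (args[4] << 8) | args[3]


args = bytearray(7)
args[6] = 0x40                 # LBA mode bit, matching "#define LBA 0x40"
from_lba(args, 66055248)       # example value from the setmax.c header comment
print(to_lba(args))            # 66055248

native_sectors = 1953525168    # native size from the kernel log in this thread
print(native_sectors > LBA28_MAX)  # True
```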
 

Scorpei

An update. After giving it a few more hours today and putting the drive back in the Linux machine, I have fixed the problem. I now truly believe my array used to be 4 megabytes smaller than it could have been. It turns out it was the HPA causing the problem. I have not taken the time to see what was in the HPA; I know it wasn't my data, so I really didn't care.

To track down the problem with the HPA, I ended up without any of the tools I had previously looked up, or without the ability to run them. The CDs didn't work with the hardware I had available, and the command-line tool didn't compile properly and was written for ATA as opposed to SATA. I did, however, find out via dmesg that an HPA was in use on the drive.

Code:
[49758.856452] ata17: irq_stat 0x00400040, connection status changed
[49758.856455] ata17: SError: { PHYRdyChg CommWake DevExch }
[49758.856461] ata17: hard resetting link
[49759.752028] ata17: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[49759.766072] ata17.00: HPA detected: current 1953523055, native 1953525168
[49759.766139] ata17.00: ATA-8: WDC WD10EACS-, 01.01A01, max UDMA/133
[49759.766142] ata17.00: 1953523055 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[49759.766981] ata17.00: configured for UDMA/133
[49759.780025] ata17: EH complete
[49759.780124] scsi 16:0:0:0: Direct-Access	 ATA	  WDC WD10EACS- PQ: 0 ANSI: 5
[49759.780306] sd 16:0:0:0: Attached scsi generic sg8 type 0
[49759.780494] sd 16:0:0:0: [sdh] 1953523055 512-byte logical blocks: (1.00 TB/931 GiB)
[49759.780545] sd 16:0:0:0: [sdh] Write Protect is off
[49759.780548] sd 16:0:0:0: [sdh] Mode Sense: 00 3a 00 00
[49759.780568] sd 16:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[49759.796228]  sdh: unknown partition table
[49759.796398] sd 16:0:0:0: [sdh] Attached SCSI disk
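The key line in that log is the "HPA detected" one. A small sketch of pulling the two sector counts out of such a line and working out how much space the HPA hides; the regular expression is written against the exact line above and is an assumption about the format, which may differ between kernel versions.

```python
import re

HPA_LINE = re.compile(
    r"(?P<port>ata\d+\.\d+): HPA detected: "
    r"current (?P<current>\d+), native (?P<native>\d+)"
)


def parse_hpa_line(line: str):
    """Return (port, current_sectors, native_sectors), or None for other lines."""
    match = HPA_LINE.search(line)
    if match is None:
        return None
    return match["port"], int(match["current"]), int(match["native"])


line = "[49759.766072] ata17.00: HPA detected: current 1953523055, native 1953525168"
port, current, native = parse_hpa_line(line)
hidden_sectors = native - current
print(port, hidden_sectors, hidden_sectors * 512)  # ata17.00 2113 1081856
```

With 512-byte sectors, 2113 hidden sectors come to 1,081,856 bytes, a little over 1 MiB.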

To verify this I looked long and hard and finally found that one of my favorite programs is capable of checking it: hdparm. It turns out hdparm has a -N flag that reports the HPA setting. The same flag can also change the HPA by passing the number of sectors wanted; in my case the command ended up being:
hdparm -N1953525168 /dev/sdh


Sadly, that value does not persist after a power down, and the kernel didn't pick up on the new size (without passing a command). hdparm also has a -k flag, which should tell the drive to keep any changes made to its settings even after a power cycle, a dangerous flag to pass if you are doing silly things. Regardless, I tried to pass it, but it gave me an ioctl error and didn't seem to have worked. After googling a bit more I found that hdparm also has the ability to write its -N value permanently in a different way:
hdparm -N p1953525168 /dev/sdh
That did the trick: the drive is now back to full size and is a new, functional hot/cold (spun-down) spare for my array. So in the end the problem could have been fixed in various ways, but this is the nicest. The array has been restored to a proper, fully functional version using all the disk space available (whoohoo, 4 megs more!!!!) and the strange 'secret' parts of the drive / original array work fine again.
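A small sketch that builds (but deliberately does not run) the two hdparm -N command lines used above, so the temporary and permanent forms can be compared side by side. The function name is hypothetical, and the argument layout simply mirrors the commands as typed in this post; some hdparm versions may ask for extra confirmation before changing the max sector count, so check the man page for your version before running anything like this against a real drive.

```python
import shlex


def hdparm_set_max_command(device: str, sectors: int, permanent: bool = False) -> list[str]:
    """Build the hdparm -N argument list; the caller decides whether to run it."""
    if sectors <= 0:
        raise ValueError("sector count must be positive")
    if permanent:
        # "p" prefix: the form that survived a power cycle in this thread
        return ["hdparm", "-N", f"p{sectors}", device]
    # no prefix: the form whose setting was lost after a power down
    return ["hdparm", f"-N{sectors}", device]


print(shlex.join(hdparm_set_max_command("/dev/sdh", 1953525168)))
print(shlex.join(hdparm_set_max_command("/dev/sdh", 1953525168, permanent=True)))
```

Printing the command first and running it by hand keeps a typo in the sector count from reaching the drive unnoticed.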
By the way, funny thing: after restoring the disk to its original size, diskutil suddenly recognised it as part of an array. I think I may have found the cause: the drive was once used in a hardware array (afaik mdadm doesn't use HPA) and it held that array's configuration. That would explain the small size (1 megabyte) and the use of an HPA in general.
All in all, glad I fixed it. I hope someone else will have use for these posts at some time in the future, and if so, glad I could help :).
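A back-of-the-envelope check of the "4 megs more" figure, as a sketch. It assumes five active members (six disks minus the hot spare), treats RAID 5 capacity as (members - 1) times the smallest member, and ignores md metadata and chunk rounding, so the real numbers will differ slightly.

```python
SECTOR_BYTES = 512


def raid5_usable_sectors(member_sectors: list[int]) -> int:
    """Usable RAID 5 size: every member is cut to the smallest, one is parity."""
    if len(member_sectors) < 3:
        raise ValueError("RAID 5 needs at least three members")
    return (len(member_sectors) - 1) * min(member_sectors)


native = 1953525168    # full size of a 1 TB member
clipped = 1953523055   # size of the drive while the HPA was active

with_hpa = raid5_usable_sectors([native] * 4 + [clipped])
without_hpa = raid5_usable_sectors([native] * 5)

gained_bytes = (without_hpa - with_hpa) * SECTOR_BYTES
print(without_hpa - with_hpa, gained_bytes, round(gained_bytes / 2**20, 2))
# 8452 4327424 4.13
```

So one clipped member costs about 1 MiB on that disk but about 4 MiB of array capacity, because every data disk is limited to the smallest member.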
 

exangel

executioner angel
Member
Joined
Apr 20, 2010
Messages
1,571
Trophies
0
Age
40
Location
Tucson, AZ
XP
602
Country
United States
Thank you for following up. Since this is a gaming forum, people rarely troubleshoot problems this complex here. Only a really small number (if any) of other 'tempers personally own or administer an array of that size, or even use a RAID configuration other than RAID 0, for that matter.
Anyway, kind of you for letting us know you resolved this, and providing the solution you found, for others who may encounter this thread by search.
 

Coto

-
Member
Joined
Jun 4, 2010
Messages
2,979
Trophies
2
XP
2,565
Country
Chile
By the way, funny thing: after restoring the disk to its original size, diskutil suddenly recognised it as part of an array. I think I may have found the cause: the drive was once used in a hardware array (afaik mdadm doesn't use HPA) and it held that array's configuration. That would explain the small size (1 megabyte) and the use of an HPA in general.
All in all, glad I fixed it. I hope someone else will have use for these posts at some time in the future, and if so, glad I could help :).

Thanks, it'll be helpful with newer drives (and newer technology) on older architectures!

Since HPA works together with the BIOS, and your HDD controller was working fine on the older computer, a wrong report from the HDD's HPA to the BIOS could cause lockups, corruption, or anything else,
for example by incorrectly accessing a different area, mainly if the MBR's first OS partition began at 1953525168 (I'm unsure whether this value was dumped from the MBR or from the HDD firmware layer itself).

Code:
[49759.766072] ata17.00: HPA detected: current 1953523055, native 1953525168

Thanks to this topic I came across this site, which has a lot of info about HDDs, HPA, platter speeds, and so on:

http://www.nslu2-lin...ownUSBHarddisks

edit: fixed wrong quote
 
