Adding a new disk resulted in balance failure

bwc · February 7, 2020, 2:52pm


Brief description of the problem

New user here:
after adding a new disk (‘shucked’ 4TB Seagate Desktop Expansion drive) the system-initiated balance failed with an error. Running stable 3.9.2-52 (active subscription) Linux 4.12.4-1.el7.elrepo.x86_64

Detailed step by step instructions to reproduce the problem

Working system is RAID 1 (2 x 3TB WD Red & 1 x 4TB WD ‘shucked’ drive) with one Pool and two Shares (AFP & SMB)
Resize Pool appeared to be successful with no errors on any of the four disks
Following the failed Balance, self-tests performed on the newly added disk completed without errors
A new Balance was then started, which failed with the same error message
Unable to find any clues in the logs (fairly new to Linux as well as Rockstor)!
Any help would be much appreciated
thanks
Brian


bwc · February 7, 2020, 3:05pm

Forgot to add that the Power Status is showing as Unknown (also, when booting, the system throws up a few ACPI error messages)

phillxnet · February 7, 2020, 4:01pm

@bwc Welcome to the Rockstor community.

Could you include a screen grab of your entire Pool details page.

Also, the output of the following commands may help those here chip in, just in case Rockstor’s Web-UI has lost its way at some point.

btrfs fi show

and

btrfs fi usage /mnt2/PROJECTS

and

btrfs balance status /mnt2/PROJECTS
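If it is easier to gather all three outputs in one go, here is a minimal Python sketch. It assumes the pool is mounted at /mnt2/PROJECTS as in the commands above; the function names are made up for illustration and are not part of Rockstor.

```python
import shlex
import subprocess
from typing import Dict, List


def pool_diagnostic_commands(mount_point: str) -> List[List[str]]:
    """Return the three read-only btrfs commands as argument lists."""
    return [
        ["btrfs", "fi", "show"],
        ["btrfs", "fi", "usage", mount_point],
        ["btrfs", "balance", "status", mount_point],
    ]


def collect_outputs(mount_point: str) -> Dict[str, str]:
    """Run each command (needs root and btrfs-progs) and keep stdout plus stderr."""
    results = {}
    for argv in pool_diagnostic_commands(mount_point):
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
        results[shlex.join(argv)] = proc.stdout + proc.stderr
    return results


if __name__ == "__main__":
    # Only print the commands here; call collect_outputs() on the NAS itself.
    for argv in pool_diagnostic_commands("/mnt2/PROJECTS"):
        print(shlex.join(argv))
```

None of these commands modify the pool, so they should be safe to run while investigating.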

And regarding the self-tests you mention:

I’m assuming you are referring to a S.M.A.R.T self test rather than a pool scrub here?

Given we see Input/output error that does rather point to a disk error.

You could look to your system logs at around the problem times

journalctl

Use the up/down cursor or PageUp/PageDown keys to navigate and Ctrl + C to exit that command’s output.
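Paging through the whole journal by hand can be slow, so here is a rough Python sketch for narrowing it down. It assumes that the interesting lines mention "btrfs" or an I/O error somewhere in their text; the keyword list and function names are my own guesses, so adjust them to what your log actually contains.

```python
import subprocess
from typing import Iterable, List, Sequence


def journal_command(since: str) -> List[str]:
    """Build a journalctl call: kernel messages only, no pager, from `since` on."""
    return ["journalctl", "-k", "--no-pager", "--since", since]


def io_related(
    lines: Iterable[str],
    keywords: Sequence[str] = ("btrfs", "i/o error", "input/output error"),
) -> List[str]:
    """Keep lines containing any keyword (case-insensitive substring match)."""
    lowered = tuple(k.lower() for k in keywords)
    return [ln for ln in lines if any(k in ln.lower() for k in lowered)]


def scan_journal(since: str) -> List[str]:
    """Run journalctl and return only the matching lines."""
    proc = subprocess.run(
        journal_command(since), capture_output=True, text=True, check=False
    )
    return io_related(proc.stdout.splitlines())
```

For example, `scan_journal("2020-02-07")` would look at kernel messages from that date onwards.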

Could it be you have input/output errors on one of the other disks? Are your SATA cables all secure also?

As for the Power Status showing as Unknown, I think this is a red herring. If your drive is accessible via S.M.A.R.T and is showing in the disks screen then you are probably past a drive power issue.

Let’s hope the output of these commands can shed some light on things.

bwc · February 8, 2020, 5:44pm

Thanks for your super fast response
and it does appear that the other disks are experiencing I/O errors
screenshot of the Pools page (I had already initiated a Pool Scrub - previous tests were SMART):

![PROJECTS_detail|690x421](upload://cdKgyTrgFlEYDwkG8uaD26E24SQ.png)

and within a minute or so of starting a new Balance, syslog showed:

and threw out quite a few more of those errors
Following completion of the scrub further examination of the syslog showed more errors:

lots of these errors until it started to display the affected folder and files

and even more scrutiny revealed that the errors are on all 3 of the original disks, affecting many files (although most are on /dev/sdd, which is btrfs id 3, the 4TB shucked WD Elements drive)
An attempt to open a few of these files did indeed show them to be corrupted
I haven’t tested the SATA cables but the errors seem to suggest rather more serious data corruption - seems unlikely that 3 cables could be dodgy?

My conclusion is that the corruption happened during the file copy operations, so, as it’s only a few hundred GB, I will start from scratch, but this time just use the 2 WD Reds, as they’re almost brand new, and only copy a few files.

I wonder if anybody has any theories why there would be such data corruption?
Both source and target are on the same LAN and right next to each other - SATA controller possibly causing the errors?

phillxnet · February 8, 2020, 6:17pm

@bwc Well done on the investigation.

Re:

Agreed, unless they are all under-specced, i.e. SATA1 cables being used on SATA3 for instance. But assuming they aren’t near max length it’s still unlikely to be this bad. Which leads us to the following:

I suspect you have bad RAM. Stop all file system operations and investigate your RAM. The new drive may not be showing any effects as it hasn’t had the opportunity to amass any data due to the failed (I/O error early on) initial balance. This now looks to be system wide. This lack of data is evident in the lowest table in the pool details page: the new drive has only 1 GB of allocation. That’s a single allocation event as btrfs chunks are usually 1GB, although there may be 2 one fo

We have a section on testing memory in our Pre-Install Best Practice (PBP) named Memory Test (memtest86+) where the former is linked to in our Quick start doc section.

Otherwise it could be your PSU (Power Supply Unit) which may be flaking out when under load, but I’d go for the memory test first.

Again, don’t do anything until you’ve tested your memory. You haven’t indicated any suspicion of the drives themselves, and all the errors are corruption errors, i.e. ‘in transit’ stuff: what was written is not what is being read back. That can very well be caused by memory, as the files have to be in memory before they are written to disk, and their checksum, used to test their integrity later on, is calculated from the in-memory version. Bad memory, bad everything else.
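To make the checksum point concrete, here is a small Python sketch of the idea. It is only a toy model: `zlib.crc32` stands in for the filesystem's checksum (btrfs defaults to crc32c, a different polynomial), and `flip_bit` simulates a single bad memory cell. All names are made up for illustration.

```python
import zlib
from typing import Tuple


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated corruption)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def write_with_checksum(in_memory: bytes) -> Tuple[bytes, int]:
    """Model a write: the checksum is computed from whatever is in memory."""
    return in_memory, zlib.crc32(in_memory)


def verify(stored: bytes, checksum: int) -> bool:
    """Model a read: recompute the checksum and compare."""
    return zlib.crc32(stored) == checksum


original = b"contents of a file being copied to the NAS"

# Case 1: RAM corrupts the data BEFORE the checksum is calculated.
stored_a, csum_a = write_with_checksum(flip_bit(original, 3))
ram_case_verifies = verify(stored_a, csum_a)   # True, yet the data is wrong
ram_case_intact = stored_a == original         # False

# Case 2: the data is damaged AFTER a good checksum was stored.
stored_b, csum_b = write_with_checksum(original)
disk_case_verifies = verify(flip_bit(stored_b, 3), csum_b)  # False, detected
```

In the first case the checksum faithfully protects data that was already wrong, which is why corruption that happens in RAM can end up as silently bad files.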

Possibly, but I’m voting on memory. And if that checks out then go for PSU, but that’s quite a bit harder to check as it may only go flaky under load and ideally you would need some fairly fancy gear to assess that, e.g. oscilloscopes etc to test for power supply noise / variation of voltage etc under the fault conditions, which may be when all drives are pulling max power and the CPU is doing the same working out all the checksums.

Incidentally your Pool may well be toast at this point. It rarely gets this bad unless there is a memory issue, and the fact that some of your files are corrupt already further suggests memory, as btrfs tends to return correct data or nothing. So given what you have seen I’m guessing the memory issue will be quite pronounced. Hopefully you have more than one stick and so can drop one stick and test again if you do get errors.

Quite an interesting / pronounced problem you have there so do keep us informed. Incidentally it could also be (but far less likely) CPU cache memory. I have a story about that particular type of hardware failure from my early days of discovering Linux that I can share another time.

Keep us posted and see how a memory check goes. Sometimes it can take over 24 hours of continuous testing to ‘provoke’ a memory error. And given it exercises the CPU, make sure your cooling is in order, i.e. your CPU fan is not clogged up with dust for example.

Hope you find the cause as this is a bad one.

bwc · February 10, 2020, 3:53pm

You were spot on with the suspect bad memory!
Found one 4GB stick (Hynix) had thousands of errors, so have replaced both sticks with some Corsair XMS I had lying around and rammed those in - both passed memory testing.
Would you advise removing the disks, deleting the existing Pool, nuking the disks with DBAN and performing the file copy operations again?
Thanks again for your help

phillxnet · February 10, 2020, 4:51pm

@bwc That’s great news, sort of, and glad you’re now sorted. At least for finding the likely cause.

Given your report thus far I would wipe everything, including the system partition and do a complete fresh install. Btrfs apparently almost never returns corrupt files but in the case of bad memory all bets are off. And you reported actual corrupt files being returned. So definitely wipe all disks, including system disk, and start over. You are then far more likely to have a better time of things going forward, as that pool was lost to the bad memory and likely also the system pool.

As for DBAN on the disks, this is very extreme. Good advice, but it takes hours per disk. If it’s just home use I’d just wipe them. You should be able to delete the pool and in turn the disks via the Rockstor Web-UI. If not (there are corner cases here), the command you could use, and that Rockstor uses internally when wiping disks, is as follows:

wipefs -a /dev/disk/by-id/device-to-wipe

But that’s a pretty hefty command so do be very sure you have the correct device.
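One way to lower the risk of wiping the wrong device is to wrap the command in a small check first. This is only a sketch of that idea in Python, not how Rockstor does it internally: the function name and the "type the name again" confirmation are my own assumptions, and the device name in the usage note is a placeholder.

```python
from typing import List

BY_ID_PREFIX = "/dev/disk/by-id/"


def wipefs_command(device: str, confirmed_name: str) -> List[str]:
    """Build the wipefs call only for a by-id path whose name was retyped.

    Raises ValueError instead of returning a command when anything looks off.
    """
    if not device.startswith(BY_ID_PREFIX):
        raise ValueError("refusing: expected a /dev/disk/by-id/ path")
    name = device[len(BY_ID_PREFIX):]
    if not name or "/" in name:
        raise ValueError("refusing: malformed by-id device name")
    if confirmed_name != name:
        raise ValueError("refusing: confirmation does not match device name")
    return ["wipefs", "-a", device]
```

For example, `wipefs_command("/dev/disk/by-id/example-disk", "example-disk")` returns the argument list, which could then be passed to `subprocess.run` as root once you are sure it names the right disk.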

If you do a re-install and make sure to wipe the original disk, you could then use that Rockstor install to not import but wipe the disks it then finds. Then you are all set to start over. We have the Reinstalling Rockstor doc section that may be helpful.

And in case you are curious where “wipefs -a” is used in Rockstor it’s here:

https://github.com/rockstor/rockstor-core/blob/master/src/rockstor/system/osi.py#L856-L864


      

Thanks for the update and well done on persevering. Bad memory is super destructive, which is why the more expensive server setups use ECC memory to help guard against corruption of data and programs in memory.

bwc · February 14, 2020, 3:38pm

Eventually got round to re-installing everything - one thing did catch me out during the re-install: I got distracted and wasn’t watching the screen, and it would appear that the installation of the Rockstor system defaults to the first disk it finds, flags it, and proceeds with the install without user input, with the result that one of the WD Red 3TB disks had the Rockstor system installed on it. Couldn’t find an easy way to wipe the disk from within the Rockstor web GUI - so resorted to a live Linux and used gparted… Anyway, successfully installed Rockstor 3.9.1 Linux kernel 4.10.6-1 and attempted to upgrade, but that didn’t appear to work, as the web GUI still showed a notification that the update to 3.9.2-53 was available.
After some googling, tried yum update from the terminal and that appeared to work; reboot presented the option of both kernels but 4.12 gave a kernel panic:

System boots up OK with Linux kernel 4.10.6-1, although the Rockstor GUI displays a notification that we’re running an unsupported kernel. Apart from that all looks good - I assume it’s OK to restore my config backup and proceed with creating a Pool and Shares etc?

phillxnet · February 14, 2020, 3:48pm

@bwc Thanks for the update, yes that installer is quite something. We’re not using it in our next version.

We have had a few reports of systems that don’t like the 4.12 mainline but work ok with the 4.10. You should be Ok with this but I would keep a keen eye on how our openSUSE testing goes as that is your next re-install I suspect as you don’t want to be on that 4.10 kernel for too long. But yes many folks have been managing just fine for ages on the 4.10 kernel. But not ideal of course.

Shame that but at least you are up and running again.

and yes

yum info rockstor

to make double sure of the Rockstor package version you are using.
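If you want to check this from a script rather than by eye, here is a rough Python sketch. It assumes the usual "Key : value" layout of yum info output with Version and Release fields; the sample text is a made-up illustration, not output captured from a real system, and the function names are hypothetical.

```python
from typing import Dict


def parse_yum_info(text: str) -> Dict[str, str]:
    """Parse 'Key : value' lines; the first occurrence of each key wins."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields.setdefault(key.strip(), value.strip())
    return fields


def package_version(text: str) -> str:
    """Combine Version and Release, e.g. into a string like '3.9.2-53'."""
    fields = parse_yum_info(text)
    return "{}-{}".format(fields["Version"], fields["Release"])


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """\
Name        : rockstor
Version     : 3.9.2
Release     : 53
"""

sample_version = package_version(SAMPLE)
```

On a real system the text would come from running `yum info rockstor` and capturing its output.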

For general ‘state of play’ on the openSUSE variant you want to watch this thread:

Once that effort has feature parity we will hopefully be releasing a new, much simpler, ISO installer. But as always these things take time and we are just not there yet.