Two questions - possibly related

I’ve got a relatively new installation, 3.9.2-57, with an internal RAID 10 ‘primary’ pool and an eSATA external enclosure RAID 10 ‘secondary’ pool.

I’ve been doing some fairly large Duplicati cloud backups from the primary pool and on three occasions the entire box has crashed completely - but it seems to be happening more quickly each time.

I’ve done the obvious things like checking and reseating the RAM, disk connections, etc., but I was wondering if there is some disk consistency check I could run as well? I haven’t seen anything obvious.

And my second question involves an SFTP share on the secondary pool that shows as ‘unmounted’ after a crash, and there seems to be no way to mount it again without deleting and re-creating the share. It’s currently the only share on this pool so I don’t know if this is specific to SFTP or not.

Any ideas, please?

Hi @jmangan,

Although I’m not exactly sure of what’s happening, I might have a couple of hopefully helpful pointers for you.

We’ve recently had a user report an issue with Duplicati backups and its RAM usage, for which I’ve found several other reports online from general Duplicati users. My first thought would thus be to have a look at your memory usage and check whether or not your swap space fills up when Duplicati runs its backup(s). That might explain why your machine crashes. See the post linked below for details:
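In the meantime, a quick way to keep an eye on this from a shell while a backup is running is something along these lines (these are generic commands, nothing Rockstor-specific is assumed):

# One-off snapshot of RAM and swap usage
free -h

# Refresh the same view every 5 seconds while Duplicati runs its backup
watch -n 5 free -h

# Show the top memory consumers, to confirm it is the Duplicati process
ps aux --sort=-%mem | head -n 10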

That one is curious, especially given we just had another report of something very similar by @gburian; see their post below:

In your case, I’m not sure why your share is shown as unmounted, unfortunately, but I would recommend running the various btrfs commands I listed in this thread as they may bring more information on your system’s pool/share situation. Would you happen to have run any update(s) or disk operation (adding/removing a disk, for instance) recently?
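For reference, the commands in question are roughly along these lines (the pool path below is only an example; adjust it to your secondary pool’s actual name):

# List all btrfs filesystems and their member devices
btrfs fi show

# Space usage breakdown for the pool in question
btrfs fi usage /mnt2/Secondary_Pool

# List its subvolumes (Rockstor shares live here)
btrfs subvolume list /mnt2/Secondary_Pool

# Confirm what is, or isn't, currently mounted
mount | grep btrfs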

Flox, thanks for the reply. I’m working at the moment but I will try to go through your suggestions over the next couple of days - although it may need to wait for the weekend.

It doesn’t sound like there is - or I need to run - a disk checking process.

I did apply the latest set of patches through the console about 8-10 days ago so maybe that is related.

I’ll run through the information you’ve provided and report back.

Thanks again,

John

@jmangan Hello there.
Re:

The btrfs scrub feature is just that. It reads all data and, if it finds a copy that is inconsistent with its checksum, it will look to a duplicate copy (if btrfs raid > 0) and re-write / correct the copy with the bad checksum.

If this is happening progressively more often then it could be a hardware issue that is just getting worse, e.g. a failing PSU or RAM. You could try a memtest86+ check and also try replacing your PSU. PSUs come in all manner of qualities, all will fail eventually, and heavy load is where they are most likely to show signs of this failure.

As to your SFTP issue, you could try a scrub of that pool to see if the subvolume has issues. If scrub can heal the pool you may get a successful mount thereafter.
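Very roughly, and assuming your pool is mounted under /mnt2/ by its pool name (substitute your actual pool name), the commands look like this:

# Kick off a scrub of the whole pool (it runs in the background)
btrfs scrub start /mnt2/Secondary_Pool

# Check progress and any error counts while it runs / once it finishes
btrfs scrub status /mnt2/Secondary_Pool

# The kernel log names any files scrub could not repair
dmesg | grep -i btrfs | tail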

Hope that helps.


@Flox, @phillxnet

Thanks both. After reading Flox’s suggestions I restarted Rockstor but didn’t run any Duplicati jobs until I could go through the threads and do some investigations.

But it crashed again anyway, even though I had reseated the RAM and checked cable connections. At this point I ran Memtest (prescient, phillxnet) and, sure enough, found a dodgy DIMM. I’ve got a replacement on order.

In the meantime I followed phillxnet’s suggestion to run a scrub on the Secondary_Pool. Scrub is obviously a lot more benign than its name suggests. I need to do more to familiarise myself with BTRFS terminology. Once the scrub had completed, the SFTP share auto-mounted and all is now well.

I’m going to hold off doing any more Duplicati jobs until I have a full complement of RAM and then monitor for the memory leak you mentioned.

Thanks to you both.

John


I couldn’t leave well alone!

Since the scrub repaired my second pool I thought it might be a good idea to run it on the main pool as well.

It failed with dev errors and suggested a command to run:
btrfs dev stats -z /mnt2/Main_Pool
[/dev/sdc].write_io_errs 0
[/dev/sdc].read_io_errs 0
[/dev/sdc].flush_io_errs 0
[/dev/sdc].corruption_errs 4
[/dev/sdc].generation_errs 0
[/dev/sdb].write_io_errs 0
[/dev/sdb].read_io_errs 0
[/dev/sdb].flush_io_errs 0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sde].write_io_errs 0
[/dev/sde].read_io_errs 0
[/dev/sde].flush_io_errs 0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 4
[/dev/sdd].generation_errs 0

So, I’ve got four errors on two drives which apparently a scrub can’t fix (it stopped - twice).

What are my options apart from deleting all the data, shares and configuration, starting from scratch and restoring from backups? Can I at least identify one (or more) shares that are actually affected and just re-create them?

Thanks again for helping me on my learning journey.

John

@jmangan Hello again.
Re:

A reboot, which effects a full unmount and remount, may help here; there are certain situations where this can help. Also, can you confirm which version of Rockstor this is, i.e. you stated previously:

This is fairly up to date from our core code point of view but very old from our base OS and kernel btrfs stack perspective, which is what you are mainly exercising when doing btrfs operations. You may well benefit from the years of development in those stacks that separate our CentOS variant (pending legacy status) from our newer Rockstor 4 variant, still at release candidate 5, which is ‘Built on openSUSE’ Leap 15.2 and so has far newer btrfs kernel stack components.

Unfortunately there is no installer download as of yet, but you can build your own with all updates pre-installed via the following repo’s instructions:

Note however that this version has broken AD/LDAP but we do have a pending fix which we hope to review and integrate in the testing channel for that release in the near future.

Hope that helps. Also note that scrub and balance operations can take a few goes to do their thing, so it’s always worth trying again, especially after a clean mount (read reboot). These operations, balance mainly though, can also suffer from a lack of memory. You mention running on depleted memory due to the bad stick, so again you could try once you have installed the new stick/sticks.
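By way of a sketch only, and assuming the same /mnt2/Main_Pool path from your dev stats output, a re-try looks something like:

# Re-run the scrub after the clean mount and with the new RAM fitted
btrfs scrub start /mnt2/Main_Pool

# Per-device progress and error counts
btrfs scrub status -d /mnt2/Main_Pool

# If a prior scrub was interrupted rather than finished, it can be resumed
btrfs scrub resume /mnt2/Main_Pool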

Let us know how it goes, and glad to hear that you have sorted at least some of the resulting side effects of running with bad RAM. It’s more common than folks appreciate and very difficult for any file system to protect itself against; hence the common recommendation for ECC memory on mission critical deployments.

Hope that helps.


Thanks. Well, I’ve got the new RAM, rebooted the NAS and re-tried the scrub. No luck.

Yes, I’m still on the old CentOS base. I’m looking forward to trying the new SUSE-based version but I think I will sort my data out before I try anything more adventurous.

As you say, no software can really protect itself if the data is being corrupted right from under it. Just bad luck.

Oh well I’ll blat the share and start from scratch. Thanks for the suggestions.

John


@jmangan - if you’re already thinking about throwing the file system under the bus, you could attempt to get to a recent kernel and btrfs-tools version … that way you would not have to deal with the installer right away, but could see whether the newer btrfs tools could possibly fix your issue …

if you want to try, check out this thread:

but I would skip the yum remove btrfs-progs step, as the subsequent poster noted that it could possibly remove Rockstor right with it (which you obviously don’t want).
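For what it’s worth, the usual route on a CentOS 7 base is the ELRepo mainline kernel; very roughly it looks like the below, though this is only a sketch of the approach and I’d still read the linked thread first, as exact URLs and package versions may differ:

# Import the ELRepo signing key and add the repository (CentOS 7)
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm

# Install the mainline kernel, which brings a much newer in-kernel btrfs
yum --enablerepo=elrepo-kernel install kernel-ml

# Reboot into the new kernel afterwards. A newer btrfs-progs generally has
# to be built from source on CentOS 7 - and, as noted above, don't remove
# the existing btrfs-progs package first.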


Thanks all. The new RAM seems to have resolved the issues. No swap usage by Duplicati, and all of my jobs are working. No crashes since replacing the faulty DIMM.

I chickened out of running a load of new packages just to see if they could fix the corruption I had. Given the crashes were caused by faulty RAM, I couldn’t feel confident that there weren’t other issues in the data itself, so I opted for flattening the pool and starting again.

I’m hoping to find a bit of time for running up an openSUSE VM in the next week or so and trying to build the installer for (another) test VM.
