Rockstor 4 freezes after ~2.5 hours of use

parth · May 12, 2022, 3:32pm

Hi,

I’m facing a strange and difficult to diagnose issue currently. Let me explain my setup first.

We have been using Rockstor v3.9 in our small office for nearly a year now. We have a ~14TB RAID 10 array that is used by four server machines to download documents from the web (mostly PDFs and XMLs) and store them in one place. The files are made available via a Samba share, which is mounted via CIFS on each of the Ubuntu server machines. We had no problems with this setup for the past year. However, with our usage reaching nearly 90%, we chose to expand the capacity of the array by adding 2 additional disks and, at the same time, to upgrade to the newly released Rockstor v4.1.0-0.

There were a couple of minor issues during the import related to file/share permissions etc that I was able to resolve. During the rebalance when the new disks were added, btrfs started complaining about read and write errors on 1 of the disks so I used btrfs replace to replace that disk using the instructions here. I ran a balance after this for good measure to ensure that my array was fully redundant again.

However, once we started our download servers, Rockstor seemed to work fine for a while until all of a sudden it froze with the WebUI becoming unresponsive and my SSH connection also getting disconnected. I restarted it manually and it kept happening again and again. I have looked at the rockstor, smbd and nmbd logs to try and see if I could figure out what the problem is but tbh, I’m not familiar enough with either btrfs, Rockstor or Samba to work out what the problem is. On top of that, the fact that the server runs just fine and then hangs all of a sudden makes diagnosis all the more difficult since I cannot get into the machine anymore and have to manually power cycle it.

I’m currently running a scrub to see if any file system corruption surfaces but I’d be happy to share the logs or any other details if they help. I’m at my wit’s end with this and any help would be greatly appreciated.

UPDATE: I see the following messages showing up in dmesg

[6135.085241] perf: interrupt took too long (2522 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[ 6872.237948] perf: interrupt took too long (3169 > 3152), lowering kernel.perf_event_max_sample_rate to 63000
[ 8313.936003] perf: interrupt took too long (3975 > 3961), lowering kernel.perf_event_max_sample_rate to 50250
[10969.505087] perf: interrupt took too long (4971 > 4968), lowering kernel.perf_event_max_sample_rate to 40000

I suspect this is because there is a scrub running but just wanted to update here nonetheless.
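
To see how these messages are spread over the uptime, a minimal sketch follows. It only parses the four lines quoted above; the regular expression and the `parse_perf_events` name are illustrative assumptions and not part of any Rockstor or kernel tooling.

```python
import re

# The four dmesg lines quoted above.
DMESG = """\
[6135.085241] perf: interrupt took too long (2522 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[ 6872.237948] perf: interrupt took too long (3169 > 3152), lowering kernel.perf_event_max_sample_rate to 63000
[ 8313.936003] perf: interrupt took too long (3975 > 3961), lowering kernel.perf_event_max_sample_rate to 50250
[10969.505087] perf: interrupt took too long (4971 > 4968), lowering kernel.perf_event_max_sample_rate to 40000
"""

# Hypothetical pattern written for this message format only.
PERF_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+perf: interrupt took too long "
    r"\((?P<took>\d+) > (?P<limit>\d+)\), lowering "
    r"kernel\.perf_event_max_sample_rate to (?P<rate>\d+)"
)


def parse_perf_events(text):
    """Return one (uptime_seconds, took, limit, new_rate) tuple per matching line."""
    events = []
    for line in text.splitlines():
        m = PERF_RE.search(line)
        if m:
            events.append(
                (float(m["uptime"]), int(m["took"]), int(m["limit"]), int(m["rate"]))
            )
    return events


events = parse_perf_events(DMESG)
for uptime, took, limit, rate in events:
    # The bracketed dmesg value is seconds since boot.
    print(f"{uptime / 3600:5.2f} h uptime: {took} > {limit}, sample rate -> {rate}")
```

On this run the four adjustments fall between roughly 1.7 and 3 hours of uptime, each one lowering the sample rate a little further.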

phillxnet · May 12, 2022, 5:33pm

@parth Regarding:

Yes, a scrub is a pretty intensive operation as it reads and checks every part of the entire filesystem. The system can become less responsive during these times, and the log entries you list can be an indication of the kernel adapting to this additional load.

Otherwise, the scrub is the correct move for now. Although at the first indication of instability one should check the hardware, such as memory integrity, before doing anything more extensive such as the stated drive change.

Out of interest, did you let the balance after the drive additions finish? They can take tens of hours to complete, and sometimes even longer depending on the hardware and data concerned.

We have no other instability reports to date but if you have recently had major pool changes and during this you encountered a flaky drive then the scrub should put you right. Assuming the hardware is stable also of course.

Let us know how the scrub goes. Either via the Web-UI Pool details page Scrubs tab or via the command line.

To clarify, did you end up having to do a disk replace before the addition, and more likely the consequent full balance that Rockstor initiates after adding disks to a pool, had finished? It should all still be doable, and you are running a robust btrfs raid level. But instability is very often down to PSU stress, when the PSU is insufficient or failing, or to bad memory. See our
“Pre-Install Best Practice (PBP)” guide here:
https://rockstor.com/docs/installation/pre-install-howto.html#pre-install

Attempting any filesystem alterations, maintenance or otherwise, is likely to make things worse with bad memory.

It could be that the additional power stress of driving 2 additional drives, along with the high activity associated with the balance initiated by Rockstor following that filesystem change, pushed your existing power supply past its comfort zone.

Also note that it’s almost always best to use/try the Web-UI capabilities first. Although as you likely found out we don’t yet have a disk replace. That would instead be a disk add then remove. But in your case this would be tricky given your failing drive. Incidentally if you take a look at the following issue we have open for this ‘missing’ feature there are notes on using a switch to avoid any further writes to the drive being replaced:

https://github.com/rockstor/rockstor-core/issues/1611

I.e. the note about the “-r” switch to consider the to-be-replaced drive as read only. That’s a handy thing if you are replacing a poorly drive.

So my bet currently is on additional power requirements pushing the hardware, specifically the power supply, beyond its current stable capabilities: i.e. coincidence in time with the two drive additions. Or a longer standing corruption issue that has been brought to light given the extensive pool modifications that have been enacted: i.e. you have had a known bad drive show up during the 2 new drive addition, or shortly thereafter (I’m a little unclear on the issue history here).

Though in all cases of instability the hardware must be checked first, as it undermines all software. But you have not changed memory and have been stable previously, hence the first bet on additional power drain pushing things more than before.

More info here may help others with further suggestions. I.e. are there any balances still running? Is the scrub able to finish? And if so, with what result?

But in my experience hardware stability issues, if this is that, are often capability and cooling related. Stuff gets hotter the harder it works and is often then less capable along with it. PSUs are notorious for instability issues, as is bad RAM. Along with these, and related, is insufficient cooling of all components.

Hope that helps. And you can always cancel a scrub if you want first to switch out say a PSU. Or improve cooling. Or do a badram test (these can take a very long time).

parth · May 12, 2022, 10:06pm

Thanks @phillxnet for the detailed reply. You make a lot of interesting points and the -r switch should definitely come in handy the next time I need to replace a disk.

Out of interest, did you let the balance after the drive additions finish? They can take tens of hours to complete, and sometimes even longer depending on the hardware and data concerned.

I thought I did since the UI indicated that the balance was 100% finished (but no end time is displayed) and when I ran btrfs balance status it indicated no balance was running. However, my new disks did not show the same amount of usage as the old ones. As I mentioned in the original post, I ran another balance after the disk replacement as well to ensure that writes if any to the degraded pool would be made redundant as well as per this article. (On a side note, I was quite surprised to learn that at the end of a disk replacement process the pool still may not be fully redundant. You guys should update the docs about the disk replacement instructions here to add a step to run btrfs balance after the btrfs replace).

As indicated in the screenshot below, the disk usage is still uneven despite that second balance showing up as 100% complete (again with no end time).

[The 9 corruption errors have been discovered in the currently ongoing scrub]

My current plan is to let the scrub complete fully before running more tests on the memory and if possible the PSU as you suggested.

I will try and provide more updates about the scrub status and the hardware in use (the completely built NAS box was actually sourced from a vendor so I don’t have these details off the top of my head).

Thanks again for taking the time to respond and I’ll keep you posted how this goes.

phillxnet · May 13, 2022, 9:10am

@parth Glad you’re making progress of sorts here.
Re:

Yes, there have been many ‘takes’ on btrfs over the years, and Tim’s are certainly of a slant, let’s say. Btrfs is, in short, basically forging new ground in filesystems and does still require some hand holding. As it goes, the block pointer re-write capability btrfs has was an original design goal of ZFS’s ‘inventors’. Early ZFS had this. But it was lost later on to scope creep. We do some hand-holding ourselves with, for example, our forced full balance after adding disks. This hand holding requirement and caveat requirement/knowledge is steadily reducing as time goes on however. And our move to openSUSE has in large part aided greatly in projecting us onto an upstream supported base to address such things. OpenSUSE/SuSE are very active in btrfs development so we get the benefits of the back-ports their btrfs staff know best to apply.

Agreed:
https://github.com/rockstor/rockstor-doc/issues/377

Nice, getting somewhere now then. They should also have been corrected, I hope?

This seems sensible. You gave no indication of earlier instability, so hopefully the RAM is fine: it is under an almost identical load to before (almost, as larger pools require more RAM). The PSU, however, is under more load, as drives are about the least friendly components towards PSUs, bar some graphics cards maybe.

Yes this is difficult. Plus any PSU can fail or falter at any time. Also some are sold with specifications that they simply don’t meet. So always use reputable manufacturers and over spec on the capability. Drives have many transient/peak requirements that can be difficult to supply. So when you have many working in parallel it can get complicated.

Regarding your Pool drive stats displayed from the btrfs dev stats command: they are cumulative, so will only go up until you reset them via the command indicated. You might want to do that, after first taking a note of the associated drive serial numbers, once your scrub has hopefully completed OK and all errors have been corrected.

Also note here that you are now running what looks like 6 drives in Raid 10 yes? If so you are approaching the extreme of a sensible single drive failure raid level capability. That is btrfs raid10 only has a single drive failure capability. And btrfs’s only common offering with 2 drive failure capability is the shorter vintage parity raid level of 6. Still not generally considered production grade. As an aside and with an eye to the future. Take a look at our recently added howto:
“Installing the Stable Kernel Backport” (Rockstor documentation)
and specifically its “Btrfs raid1c3 raid1c4” subsection.
I’m not suggesting you do any of this yet. But just to help ensure your awareness of options appearing on the horizon that have longer vintage, as in they are extensions of raid1/10, than the parity raids.

The balance function will, at least currently, not necessarily redistribute all data exactly evenly. It approximates it and you will find that over time this re-distribution should even out again with normal use. The main thing is to ensure all your redundancy profiles are honoured. Much less of a requirement than it used to be and likely not a concern unless you have dabbled in one of the parity raids or are playing with empty pools or the like. The ‘btrfs fi usage’ command can help with reassurance on that front. We use it to surface a bunch of stuff but don’t yet catch/flag mixed raid levels for example. Which would likely be another nice feature. But all our Web-UI initiated balances are full so we should be OK on that front in most practical circumstances. Although this in itself could be called out as overkill, we try to err on the side of caution rather than speed.

You’re welcome. Also a spare, reputable, over-specced PSU is a great thing to have on stand-by. Many high end servers have PSU fail-over for a reason. They are a single point of failure that is often highly stressed. Plus they are often trivial to swap out. A modern PSU will output current akin to a welder, you know!

Hope that helps.

parth · May 13, 2022, 12:26pm

The scrub is now complete. 104 checksum errors were found and all have been corrected.

Regarding your Pool drive stats displayed from the btrfs dev stats command: they are cumulative, so will only go up until you reset them via the command indicated. You might want to do that, after first taking a note of the associated drive serial numbers, once your scrub has hopefully completed OK and all errors have been corrected.

I have chosen for now not to reset those stats, since they should be a cumulative indicator of disks that have shown errors in the past (and the numbers are small enough to not give me anxiety every time I look at them).

Thanks for the heads up regarding raid1c3 and raid1c4. I’ll certainly keep it in mind next time I’m looking to expand. For now, the fact that it would make use of the UI difficult and the loss of storage efficiency with keeping 2 redundant copies of the data when the whole point of the exercise was to expand the size of the pool means I’ll stick with Raid10 with more frequent scrubs and pre-emptively remove disks with errors.
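
The storage-efficiency trade-off mentioned here can be put into rough numbers. The sketch below is only an upper-bound estimate: it takes the 47.30TiB "Device size" from the `btrfs fi usage` output posted further down in this thread and divides it by the number of copies each profile keeps. It ignores metadata overhead and the fact that the drives are not all the same size, and the `usable_tib` helper is a made-up name for illustration.

```python
# "Device size" from the `btrfs fi usage -h` output posted later in this thread.
RAW_TIB = 47.30

# Copies kept of every block, and how many device failures each profile survives.
PROFILES = {
    "raid10": {"copies": 2, "failures_tolerated": 1},
    "raid1c3": {"copies": 3, "failures_tolerated": 2},
    "raid1c4": {"copies": 4, "failures_tolerated": 3},
}


def usable_tib(raw_tib, copies):
    """Upper-bound usable space: raw space divided by the number of copies."""
    return raw_tib / copies


for name, info in PROFILES.items():
    print(
        f"{name:8s} {usable_tib(RAW_TIB, info['copies']):6.2f} TiB usable, "
        f"survives {info['failures_tolerated']} device failure(s)"
    )
```

Under these assumptions, moving from raid10 to raid1c3 on the same drives would give up roughly a third of the usable space in exchange for surviving a second drive failure.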

The main thing is to ensure all your redundancy profiles are honoured. … The ‘btrfs fi usage’ command can help with reassurance on that front.

Here’s the output of btrfs fi usage -h on my pool:

Overall:
    Device size:                  47.30TiB
    Device allocated:             27.92TiB
    Device unallocated:           19.38TiB
    Device missing:                  0.00B
    Used:                         27.91TiB
    Free (estimated):              9.69TiB      (min: 9.69TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID10: Size:13.92TiB, Used:13.92TiB
   /dev/sda        3.56TiB
   /dev/sdb        3.56TiB
   /dev/sdc        5.18TiB
   /dev/sdd        5.18TiB
   /dev/sde        5.18TiB
   /dev/sdf        5.18TiB

Metadata,RAID10: Size:40.97GiB, Used:39.19GiB
   /dev/sda       10.66GiB
   /dev/sdb       10.66GiB
   /dev/sdc       15.16GiB
   /dev/sdd       15.16GiB
   /dev/sde       15.16GiB
   /dev/sdf       15.16GiB

System,RAID10: Size:96.00MiB, Used:1.33MiB
   /dev/sda       32.00MiB
   /dev/sdb       32.00MiB
   /dev/sdc       32.00MiB
   /dev/sdd       32.00MiB
   /dev/sde       32.00MiB
   /dev/sdf       32.00MiB

Unallocated:
   /dev/sda        5.53TiB
   /dev/sdb        5.53TiB
   /dev/sdc        2.08TiB
   /dev/sdd        2.08TiB
   /dev/sde        2.08TiB
   /dev/sdf        2.08TiB

Everything looks okay here.
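
A rough cross-check of the figures above, as a sketch rather than a definitive tool: it confirms that every block group type reports the same profile and that the per-device numbers add up to the totals. The values are copied by hand from the output, the helper names are made up for illustration, and the "Free (estimated)" formula used here (unallocated space divided by the data ratio, plus unused space in already allocated data chunks) is an assumed approximation of how btrfs arrives at that number.

```python
import re

# Figures copied by hand from the `btrfs fi usage -h` output above.
HEADERS = [
    "Data,RAID10: Size:13.92TiB, Used:13.92TiB",
    "Metadata,RAID10: Size:40.97GiB, Used:39.19GiB",
    "System,RAID10: Size:96.00MiB, Used:1.33MiB",
]
DATA_TIB = {"sda": 3.56, "sdb": 3.56, "sdc": 5.18, "sdd": 5.18, "sde": 5.18, "sdf": 5.18}
UNALLOCATED_TIB = {"sda": 5.53, "sdb": 5.53, "sdc": 2.08, "sdd": 2.08, "sde": 2.08, "sdf": 2.08}
DATA_RATIO = 2.00
DEVICE_UNALLOCATED_TIB = 19.38
FREE_ESTIMATED_TIB = 9.69

# Hypothetical pattern for the "<type>,<profile>: Size:..., Used:..." header lines.
HEADER_RE = re.compile(
    r"^(?P<kind>\w+),(?P<profile>\w+): "
    r"Size:(?P<size>[\d.]+)(?P<unit>[KMGT]iB), Used:(?P<used>[\d.]+)"
)


def parse_headers(headers):
    """Map block group type -> (profile, size, used, size_unit)."""
    parsed = {}
    for line in headers:
        m = HEADER_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognised header: {line!r}")
        parsed[m["kind"]] = (m["profile"], float(m["size"]), float(m["used"]), m["unit"])
    return parsed


def close(a, b, tol=0.01):
    """Compare values that were rounded to two decimals in the output."""
    return abs(a - b) <= tol


groups = parse_headers(HEADERS)
profiles = {profile for profile, _, _, _ in groups.values()}
_, data_size, data_used, _ = groups["Data"]

raw_data = sum(DATA_TIB.values())            # raw space holding data chunks
unallocated = sum(UNALLOCATED_TIB.values())  # raw space not yet in any chunk
free_estimate = unallocated / DATA_RATIO + (data_size - data_used)

print("single profile everywhere:", len(profiles) == 1, profiles)
print("raw data matches size x ratio:", close(raw_data, data_size * DATA_RATIO))
print("unallocated matches total:", close(unallocated, DEVICE_UNALLOCATED_TIB))
print("free estimate matches:", close(free_estimate, FREE_ESTIMATED_TIB))
```

The uneven per-device allocation (3.56TiB on two drives, 5.18TiB on the other four) does not affect these totals; the check only shows that the reported numbers are internally consistent and that no second raid profile is present.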

I am now updating the backup of the data on the array, after which I’ll start on the hardware tests. I must say though that the fact that the scrub ran to completion with no issues gives me much more confidence that the issue is not in the hardware. I think that the scrub would stress the PSU, memory, CPU etc. much more than copying files to the array over SMB, and at fairly pedestrian rates at that (the average network bandwidth consumption I’ve seen over our 1Gig network when the array was live earlier was ~20MBps).

I wanted to add some more context about the setup that could maybe help you direct me to what the root cause is. I don’t know if any of these points are relevant but thought I’d mention them just in case:

I have 2 other NAS servers (1 using Rockstor and 1 using FreeNAS) that are currently on the same network all using Samba to share the files (the other 2 continue to work fine).
I have installed a few extra packages from the command line mostly for help with monitoring and trying to diagnose what the problem could be including sysstat and snapd.

phillxnet · May 13, 2022, 2:30pm

@parth Thanks for the running update. And glad you’re likely out the other side now, at least data integrity wise.

That’s good to hear.

Yes, that makes sense. Some folks have in the past expressed surprise at these stats not zeroing after a scrub, so I thought I’d mention their cumulative nature.

Re the ‘btrfs fi usage’ output.

Agreed. There aren’t any rogue raid levels that have crept in; though these are encountered after incomplete pool raid changes.

Also agreed. It may well just have been a corruption that locked things up when required to do a live repair, but that was approachable via the scrub. Difficult to tell really. I’d say a balance was also up there on the system strain point of view. A little higher than a scrub.

They seem harmless enough. You might also consider the netdata Rock-on:
https://rockstor.com/docs/interface/docker-based-rock-ons/netdata_official.html
Linked to their free cloud service (https://www.netdata.cloud/) for a limited number of systems, would give really good instant and over-time monitoring capabilities. However I’ve just checked again here:
https://www.netdata.cloud/pricing
and it looks like they are moving the free cloud tier to be limited to 48 hours of data retention. But still, that could help. The email notification reporting can be pretty impressive also. Assuming the Rock-on subsystem is not a step too far along the complexity line of things. I didn’t catch on as-to if you were already using the Rock-on subsystem. It’s just a docker wrapper essentially, with easy install and setup pre-configured for some popular services. Netdata and its cloud counterpart can also be useful across all your systems, as you should then be able to see the relative load across them all, side-by-side over time as it were.

Netdata is not without incumbent system load, but it’s remarkably light-weight given its massive monitoring capability. Plus it has a section to indicate its own imposed load on the host system, which is quite nice.

As always keep us posted. And keep in mind your 1 drive failure limit on the btrfs raid profile. We intend to extend our support for the newer raid1c3 and c4 profiles and to improve our ability to handle and surface within the Web-UI mixed raid profiles also, i.e. btrfs-raid6 data with btrfs-raid1c4 metadata. Again a consideration more for the future.

Hope that helps

parth · May 13, 2022, 6:25pm

I didn’t catch on as-to if you were already using the Rock-on subsystem.

I haven’t had a chance to use Rock-ons yet. This was a machine purpose-built to do one thing and one thing only and I haven’t really experimented much with all the features Rockstor provides unfortunately. Ironically, doing that would probably have helped greatly at this time.

You might also consider the netdata Rock-on

I didn’t know there was a netdata Rock-on. I’m a developer myself and am quite familiar with netdata having used it often before. I’ll definitely look into setting that up to gather data the next time I test our application. Thanks a ton!

UPDATE: The same issue occurred when I ran rsync over the files to copy them over to backup, so I think we can at least eliminate Samba misconfiguration as a possible problem, which was one of my fears. I also noticed that while the system was frozen, 3 of the hard disk LEDs were blinking continuously.

I’ve also since run a scrub over the boot pool and no errors were found. For now, I have set up a temporary storage arrangement that should tide us over for the duration of this troubleshooting exercise, and am breaking for the weekend.

phillxnet · May 15, 2022, 2:18pm

@parth Re:

So are you saying the previously observed, apparent freeze looks to be down to an rsync event? If so, check you have sufficient memory. Larger pools and more pool members require more memory. Your system may just have ground to a near halt in its attempt to service rsync and btrfs’s requirements. Note also that both rsync and btrfs can be CPU intensive. If cooling is insufficient that is another trip point to look out for. The scrub already demonstrated btrfs’s ability to read all data correctly, so now it may be an interplay between the two systems on the current machine. Again the netdata overview may help here, especially if you can get thermal info out.

Yes, this was a good move. Just in case there was a rogue corruption there somewhere.

Hope that helps.

phillxnet · May 16, 2022, 12:32pm

@parth Further to your engagement with our now dinosaur:
“Data loss Prevention and Recovery in Rockstor” doc section here:
https://rockstor.com/docs/data_loss.html
I’ve ended up working on the long standing:
https://github.com/rockstor/rockstor-doc/issues/167

Are you OK with me re-using your dev errors screen grab pic in my re-write?

As it perfectly demonstrates a real-life Web-UI component (several actually) that didn’t exist back when that guide was written.

Thanks. The only issue I can see is with the pool name but it’s hardly divulging any personally identifiable information.

Thanks again. I’ve just lost access currently to my older proof cases for such things. And this one was nice and current with 6 modern drives.

parth · May 16, 2022, 1:08pm

Hi @phillxnet,

Thanks for the heads up. I would rather not have the pool name in the pic. Here’s the same screenshot with the pool name redacted. I’ll also edit my original post with the same.

phillxnet · May 16, 2022, 2:13pm

@parth Cheers.
And thanks for preparing the replacement. I’ll use the new one in the pr then.
Much appreciated.

parth · May 17, 2022, 1:28pm

Hi All,

We seem to be out the other side of the problem now. The solution was simple. I asked the vendor to replace our NAS box and the replacement has worked just fine in all my testing. So it seems it was a hardware issue after all. I still haven’t pinpointed which component exactly was the source of the problem, but my interest is more in getting our applications up and running once again.

Based on the troubleshooting exercise, my hypothesis is that:

scrub operations ran just fine because while they may be intensive in terms of reads, the issue was probably related to write-intensive workloads
balance operations probably were unable to run to completion (refer to the quoted text) but the disk replacement was.

Out of interest, did you let the balance after the drive additions finish? They can take tens of hours to complete, and sometimes even longer depending on the hardware and data concerned.

I thought I did since the UI indicated that the balance was 100% finished (but no end time is displayed) and when I ran btrfs balance status it indicated no balance was running

If this was indeed the case, the UI might have been a little misleading indicating the balance was 100% complete and one should look for the end time to be populated in case of a successful balance operation.
Other sustained write workloads such as rsyncing to this server and running our download applications caused the server to freeze after ~30 mins of running

So my learnings from this can be summarized basically as:

Don’t shoot yourself in the foot and ALWAYS stress test your hardware before deploying it in a production setting
A scrub is not sufficient for this purpose. A balance operation is better. Ideally, test it in its actual production usage scenario and failing that, run rsync a couple of times to copy files to and from the server to simulate lots of reads and writes.
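
As one possible way to script the kind of write-heavy test described above, here is a minimal sketch in Python. It is an assumption-laden illustration and not a replacement for rsync or a real burn-in tool: the `write_and_verify` name, the file names, and the sizes are all made up, and the demo at the bottom writes to a throwaway temporary directory rather than to a real pool.

```python
import hashlib
import os
import tempfile


def write_and_verify(target_dir, file_count=8, file_size=4 * 1024 * 1024):
    """Write random files into target_dir, read them back, compare SHA-256 digests.

    Returns the list of paths whose read-back digest does not match.
    """
    expected = {}
    for i in range(file_count):
        payload = os.urandom(file_size)
        path = os.path.join(target_dir, f"stress_{i:04d}.bin")
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to flush this file to the device
        expected[path] = hashlib.sha256(payload).hexdigest()

    mismatches = []
    for path, digest in expected.items():
        with open(path, "rb") as fh:
            actual = hashlib.sha256(fh.read()).hexdigest()
        if actual != digest:
            mismatches.append(path)
    return mismatches


# Small demo against a scratch directory; a real test would point at a
# directory on the pool under test and use far more data.
with tempfile.TemporaryDirectory() as scratch:
    bad = write_and_verify(scratch, file_count=4, file_size=1024 * 1024)
    print(f"{len(bad)} mismatching file(s)")
```

Note that the read-back may be served from the page cache, so this mainly exercises the write path. For a meaningful test the total data written should comfortably exceed the machine's RAM and the run should last at least as long as the ~30 minutes it took the faulty box to freeze.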

I could spend some time testing my hypothesis (that writes were triggering the problem) on the old box but I’m quite exhausted at the moment. If I get around to it, I’ll update this post further. Do let me know if there are any specific tests that you would want me to run.

Immense thanks to @phillxnet for the assistance throughout this exercise. Your inputs and insights were invaluable to me as a Rockstor noob and I greatly appreciate you taking the time out to help me.

(PS I have netdata set up on both the Rockstor machines I’m running now. Should be pretty handy the next time something goes wrong. There does not seem to be a straightforward way in the UI to connect my local netdata instance to Netdata Cloud though. May have to tinker about with the Netdata Rock-on JSON config for that.)

phillxnet · May 20, 2022, 6:08pm

@parth Hello again.
Re:

and

All now sorted by way of committed and published pull request:
https://github.com/rockstor/rockstor-doc/pull/381

Thanks again for the engagement, and the screen grab, and hopefully the new guide will be of more use now and in the future. A noteworthy addition in the re-write was the following new section:
Resizing when replacing; Data Loss-prevention and Recovery in Rockstor — Rockstor documentation
among many other new sections there.

Hope that helps.

parth · May 25, 2022, 2:32pm

Had a look at the new docs. Looks great. Can’t wait for you guys to release disk replacement via the WebUI. Till then, these docs should help greatly.