Unresponsive system

Hello and happy new year! :slight_smile:
I’ve been observing for the last week that the system becomes unresponsive for a few seconds, but only when I’m copying something over NFS. There are no errors in the btrfs dashboard, but when I checked /var/log/messages I saw that there are some pending sectors on one of the HDDs.

smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       104
  3 Spin_Up_Time            0x0007   170   170   024    Pre-fail  Always       -       371 (Average 403)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       31
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30983
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       2641
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       2641
194 Temperature_Celsius     0x0002   222   222   000    Old_age   Always       -       27 (Min/Max 2/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       104
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
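
The attribute table above is smartctl output; re-checking the drive and kicking off a long self-test would be something along these lines:

smartctl -a /dev/sdk           # full SMART report, including the attribute table above
smartctl -t long /dev/sdk      # start an extended offline self-test
smartctl -l selftest /dev/sdk  # review the self-test log once it has finished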

# btrfs device stats /dev/sdk
[/dev/sdk].write_io_errs    0
[/dev/sdk].read_io_errs     0
[/dev/sdk].flush_io_errs    0
[/dev/sdk].corruption_errs  0
[/dev/sdk].generation_errs  0

@shocker Hello again. Starting off the new year with a pending drive failure. Nice.

Yes, 104 pending sectors can’t be good.

The fact that these are pending, and apparently none have yet been reallocated (so there are presumably still reserve sectors left), would suggest they have been flagged but not yet fixed (reallocated), or cannot be recovered in order to be reallocated. We will need someone more up on this drive’s behaviour, and how that is reflected in S.M.A.R.T. data, to say more here.

As to the absence of btrfs errors, it may be that the affected data was pulled from another drive. Failing drives can cause multi-second delays though. You could do a scrub of the associated pool; that might help to update the btrfs stats, but if this is the only drive in that pool then you might be better advised to refresh your backups.
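
If you do go the scrub route, it is just the following against the pool’s mount point (path is a placeholder here):

btrfs scrub start /path/to/fs    # kick off a scrub of the whole pool
btrfs scrub status /path/to/fs   # check progress and error counts while it runs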

So, in short, the indicated SMART log report is likely the result of a SMART self-test and may be referencing disk sections as yet unused by btrfs. Also, I think in some situations, for a disk sector to be reallocated, one has to write to it and hit the failure for the drive to then implement the low-level reallocation (logical substitution with another sector from the remaining spares). Again, probably down to the particular drive’s firmware. But one can often provoke drive self-repair (sector reallocation) by writing to the entire drive’s surface. A little impractical these days with such large drives. Hence the Pre-Install Best Practice (PBP) section on Wiping Disks (DBAN):

“… The purpose here is to first remove all data on the existing drives and second to test that the drive is able to write to all its available sectors. It is often the case that a drive is unaware of an issue with itself, via the built-in SMART system, until it attempts to write to a faulty sector. In fact this can trigger a drive’s built-in ability to allocate spare sectors reserved for this very purpose.”
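
For completeness, a full-surface write along the lines of the above would be something like the following, but note it destroys everything on the target drive, so only ever on a disk that is already out of the pool (the device name is just an example):

badblocks -wsv /dev/sdX                              # destructive write/read test of every sector
smartctl -A /dev/sdX | grep -Ei 'pending|realloc'    # re-check pending/reallocated counts afterwards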

Hope that helps.


@phillxnet thank you!
Yes, I have already started a scrub after posting this, but this will take a while :slight_smile:
scrub status for 30ab3069-93bc-4952-a77c-61acdc364563
scrub started at Wed Jan 1 14:21:31 2020, running for 02:14:17
total bytes scrubbed: 788.61GiB with 0 errors

Happily, I’m running raid6, so I will just unplug the drive after the scrub and order a new one to be on the safe side. Since it’s the new year it will take a week to receive the new HDD; hopefully I’ll get it next week :slight_smile:

Also, the pending sectors are increasing here:
Jan 1 00:26:18 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 00:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 01:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 01:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 02:26:16 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 02:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 03:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 03:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 04:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 04:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 05:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 05:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 06:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 06:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 07:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 07:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 08:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 08:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 09:26:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 09:56:13 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 10:26:14 smartd[6711]: Device: /dev/sdk [SAT], 96 Currently unreadable (pending) sectors
Jan 1 10:56:13 smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors (changed +8)
Jan 1 11:26:13 smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors
Jan 1 11:56:15 smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors
Jan 1 12:26:13 smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors
Jan 1 12:56:13 smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors
Jan 1 13:26:13 smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors
Jan 1 13:56:16 smartd[6711]: Device: /dev/sdk [SAT], 104 Currently unreadable (pending) sectors
Jan 1 14:26:16 smartd[6711]: Device: /dev/sdk [SAT], 112 Currently unreadable (pending) sectors (changed +8)
Jan 1 14:56:13 smartd[6711]: Device: /dev/sdk [SAT], 112 Currently unreadable (pending) sectors
Jan 1 15:26:16 smartd[6711]: Device: /dev/sdk [SAT], 112 Currently unreadable (pending) sectors
Jan 1 15:56:15 smartd[6711]: Device: /dev/sdk [SAT], 112 Currently unreadable (pending) sectors
Jan 1 16:26:15 smartd[6711]: Device: /dev/sdk [SAT], 112 Currently unreadable (pending) sectors

Is there a way in btrfs to remove this drive from the pool before physically unplugging it? Or what’s the safest procedure to remove this drive and replace it with a new one later on?
As my pool has mixed drives (8 TB and 14 TB), and the failing drive is an 8 TB, can I replace it with a 14 TB?

This is my pool status
Label: ‘server’ uuid: 30ab3069-93bc-4952-a77c-61acdc364563
Total devices 20 FS bytes used 110.37TiB
devid 1 size 7.28TiB used 7.05TiB path /dev/sde
devid 2 size 7.28TiB used 7.05TiB path /dev/sdf
devid 3 size 7.28TiB used 7.05TiB path /dev/sdg
devid 4 size 7.28TiB used 7.05TiB path /dev/sdk
devid 5 size 7.28TiB used 7.05TiB path /dev/sdl
devid 6 size 7.28TiB used 7.05TiB path /dev/sdm
devid 7 size 7.28TiB used 7.05TiB path /dev/sdn
devid 8 size 7.28TiB used 7.05TiB path /dev/sdo
devid 9 size 7.28TiB used 7.05TiB path /dev/sdp
devid 10 size 7.28TiB used 7.05TiB path /dev/sdq
devid 11 size 7.28TiB used 7.05TiB path /dev/sdr
devid 12 size 7.28TiB used 7.05TiB path /dev/sds
devid 13 size 7.28TiB used 7.05TiB path /dev/sdt
devid 14 size 7.28TiB used 7.05TiB path /dev/sdu
devid 15 size 7.28TiB used 7.05TiB path /dev/sdv
devid 16 size 12.73TiB used 3.81TiB path /dev/sdb
devid 17 size 12.73TiB used 3.81TiB path /dev/sdi
devid 18 size 12.73TiB used 3.81TiB path /dev/sdh
devid 19 size 12.73TiB used 3.81TiB path /dev/sda
devid 20 size 12.73TiB used 3.81TiB path /dev/sdj

Data, RAID6: total=110.63TiB, used=110.25TiB
System, RAID6: total=301.81MiB, used=8.36MiB
Metadata, RAID6: total=123.50GiB, used=122.35GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

@shocker Re:

Yes, but from Rockstor’s perspective it is best done via the Web-UI:

Pool details page -> Resize/ReRaid Pool

I would say go with the above method, i.e. remove first and await the subsequent automated balance finishing. Then just add the new disk via the same Resize/ReRaid Wizard.

Also make sure you are running at least Rockstor version 3.9.2-50 before these operations, as there were important fixes for disk removal prior to this. Run “yum info rockstor” to be absolutely sure.

Btrfs is as yet still not quite up to scratch on dealing with ‘missing’ disks and as such often simply goes read-only. This in turn introduces further downtime and additional required procedures (i.e. having to remount degraded, etc.), so currently it’s best to go the pool-resize route prior to actually disconnecting the drive physically.

And re:

Once the failing disk is removed via the “Resize/ReRaid Pool” option and the consequent automated ‘internal’ balance has finished, the ‘removed’ drive will no longer be Rockstor-managed or associated with that pool, and so can be ‘replaced’, or not, with any size of disk thereafter.

Rockstor doesn’t currently support or understand a btrfs replace operation. Although we do have a pending issue for this, it is likely to be a while before we do support it:
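
For reference, at the command line the underlying operation would look roughly like this (untested from here; devid 4 is /dev/sdk from your listing, and the target device name is just a placeholder):

btrfs replace start -r 4 /dev/sdX /path/to/fs   # -r avoids reading from the failing source where possible
btrfs replace status /path/to/fs                # monitor replace progress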

So 20 disks. Nice. Hope you’ve got a UPS on that thing :slight_smile:. Especially given the use of a btrfs parity raid (5/6).

Hope that helps.

Awesome, thank you for the feedback!
Yes, I have a UPS for that :slight_smile:
Got the point: basically I will just remove a drive from the pool and then I can play however I want. Should I wait for the scrub to finish, or can I just remove the faulty one from the pool?

Regarding Rockstor, I’m on 3.9.2-51:

yum info rockstor

Installed Packages
Name : rockstor
Arch : x86_64
Version : 3.9.2
Release : 51
Size : 85 M
Repo : installed
From repo : Rockstor-Stable
Summary : Btrfs Network Attached Storage (NAS) Appliance.
URL : http://rockstor.com/
License : GPL
Description : Software raid, snapshot capable NAS solution with built-in file integrity protection.
: Allows for file sharing between network attached devices.

rpm -qa|grep rockstor

rockstor-3.9.2-51.x86_64
btrfs-progs-4.12-0.rockstor.x86_64
rockstor-release-3-9.23.el7.x86_64

@shocker Re:

I’d say now you have the scrub running you might as well let it finish. But if this is impractically long, which it may be even with quotas disabled given the drive is likely causing pauses, then you could just cancel this scrub and then embark on the disk removal via the Resize/ReRaid Pool option. I wouldn’t suggest you do any pool alterations while a scrub is in play, especially with the btrfs parity raids of 5/6.
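
Cancelling, if you go that way, is just the following against the pool’s mount point (placeholder path again):

btrfs scrub cancel /path/to/fs   # stop the running scrub; it can be restarted later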

Great. Out of unrelated curiosity, do you have this set up via the Rockstor Web-UI? Apologies if we’ve covered this before.

Hope that helps.

Great, I’ll cancel the scrub then and remove the HDD from the pool.

What do you mean by that? To set up the UPS via Rockstor? If so, no, and I have no idea about this functionality :slight_smile:

Canceled the scrub; deleting the device via the GUI gets me this, with no additional output:
##### Houston, we’ve had a problem.
Unknown internal error doing a PUT to /api/pools/5/remove

Checking the server, it seems that the process is still running:
# ps x|grep delete
18032 ? D 0:15 /usr/sbin/btrfs device delete /dev/disk/by-id/ata-HGST_HUH728080ALE604_2EKLYEUX /path/to/mount

# btrfs fi show|grep -i sdk
        devid    4 size 0.00B used 7.05TiB path /dev/sdk

The filesystem has become very unresponsive and the delete is still in progress.

The Rockstor GUI dashboard gives me this:
##### Houston, we’ve had a problem.
Unknown internal error doing a GET to /api/disks?page=1&format=json&page_size=9000&count=

time mkdir x

real 1m32.713s
user 0m0.000s
sys 0m0.002s

Any suggestions? :slight_smile:

@shocker Re:

This looks like a timeout, which is strange, as from 3.9.2-50 onwards the remove is run in its own task precisely to address this error, which always happened in Rockstor package versions prior to that. So this would suggest you are experiencing some kind of delay that is provoking the same issue, but for different reasons. I’d guess this is related to your otherwise-noted delays: they are breaking the Web-UI’s ability to respond in a timely manner, and it is simply timing out as a result.

And this would suggest the same, i.e. the command is running but general system delays are simply defeating the Web-UI’s ability to report its progress:

Again, this looks like a general command-timeout failure.

Your dodgy drive may just be hanging the btrfs commands Rockstor uses to report on everything.

Try a btrfs command at the command line to get a report on the disk removal’s progress.
Internally Rockstor uses the following command a lot:

btrfs fi show

and the next is used to report on what is known within the code as an internal balance. That is one that is not reported by the normal balance status command, but has to be reverse-engineered from the changing allocation between drives that is evident during such a balance, i.e. one that is run automatically by btrfs itself on a drive removal:

Introduced to monitor disk removal live in 3.9.2-49 via pr:

and

which both run as stated:

btrfs dev usage -b mnt_pt

As these reporting tools are all based on btrfs command-line programs reporting in a timely manner, I suspect your reported delays are upsetting Rockstor’s ability to parse and report on these command outputs.

So if you could report on any delays in the output of these commands, then we have our culprit for Rockstor’s Web-UI face-palming. Without feedback from those commands in a timely manner we are at a loss to report, or even work, correctly. This is a situation that is improving all the time and is the main drive behind our move to openSUSE, as we then get their enterprise-vetted backports of btrfs as and when they are deemed ready. But for the time being, on our CentOS offerings, we just have to wait these instances out.

Let us know how those commands respond at the command line; from your initial report the drive removal is actually under way. But I suspect btrfs command delays, exacerbated by your use of a parity raid (5/6), the number of drives you have, and the size of your array, mean that the current Web-UI logic, inelegantly timing out rather than reporting actual command timeouts, is biting you here.

I’d leave the Web-UI alone for now; as you’ve indicated, the remove command looks to be in progress, and the latter of the above two commands (the usage one) should show a progressively reduced allocation on the drive that is being removed. Hence the 0-size reading on that drive initially.
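
Something along these lines, with the pool mount point as a placeholder, would let you confirm that the allocation is shrinking without babysitting it:

watch -n 600 'btrfs fi show /path/to/fs | grep sdk'   # the 'used' figure for devid 4 should slowly decrease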

OK, I’ve just noticed your time mkdir command. So yes, definitely unreasonable command response times; 1.5 minutes to make a directory can’t be good.

Let’s see some timings on the above btrfs commands. And if I remember correctly you already have quotas disabled. This info might be useful for us to build in some kind of better behaviour re massive slowdowns such as the one you are seeing.

Hope that helps, and in short I’d leave it to its own devices once you have assured yourself that the btrfs allocation on the drive being removed is actually going down. If so, all is well, bar our Web-UI failing on long btrfs command delays. Again, some of the technical debt we have here is our old Django and Python; updates of these are planned and will bring better performance and async capabilities, so that going forward we can deal more elegantly with a system reduced to a snail’s-pace response such as this one currently is. This slowdown is particularly pronounced in the parity raid levels incidentally, as is their lesser robustness re repair. But I think patience may pay off here.

So let’s get assurance that the removal is in progress via the usage command executed over time, then leave it be to do its thing. And again, some timing of the command response times will help us behave better in the future.
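
For the timings, prefixing the commands with time is enough, e.g.:

time btrfs fi show /path/to/fs        # note the 'real' figure in each case
time btrfs dev usage -b /path/to/fs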

Hope that helps.

Thanks for the feedback!

The btrfs commands are running fine and the results are almost instant.
The delete is still in progress but is crazy slow:
devid 3 size 7.28TiB used 7.07TiB path /dev/sdg
devid 4 size 0.00B used 7.04TiB path /dev/sdk

The only problem is that the filesystem cannot be used via NFS and everything is impacted by the slowness. Also, I have no idea why it is that slow, as the I/O is not that intense.

Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

/dev/sdp, ID: 9
Device size: 8001563222016
Device slack: 0
Data,RAID6: 3552452411392
Data,RAID6: 4182492839936
Data,RAID6: 27016101888
Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

/dev/sdq, ID: 10
Device size: 8001563222016
Device slack: 0
Data,RAID6: 3552452411392
Data,RAID6: 4182492839936
Data,RAID6: 27016101888
Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

/dev/sdr, ID: 11
Device size: 8001563222016
Device slack: 0
Data,RAID6: 3552452411392
Data,RAID6: 4182492839936
Data,RAID6: 27016101888
Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

/dev/sds, ID: 12
Device size: 8001563222016
Device slack: 0
Data,RAID6: 3552452411392
Data,RAID6: 4182492839936
Data,RAID6: 27016101888
Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

/dev/sdt, ID: 13
Device size: 8001563222016
Device slack: 0
Data,RAID6: 3552452411392
Data,RAID6: 4182492839936
Data,RAID6: 27016101888
Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

/dev/sdu, ID: 14
Device size: 8001563222016
Device slack: 0
Data,RAID6: 3552452411392
Data,RAID6: 4182492839936
Data,RAID6: 27016101888
Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

/dev/sdv, ID: 15
Device size: 8001563222016
Device slack: 0
Data,RAID6: 3552452411392
Data,RAID6: 4182492839936
Data,RAID6: 27016101888
Metadata,RAID6: 9178251264
Metadata,RAID6: 738197504
Metadata,RAID6: 67108864
System,RAID6: 1114112
Unallocated: 229617197056

~]# btrfs fi show /path/to/fs/
Label: ‘server’ uuid: 30ab3069-93bc-4952-a77c-61acdc364563
Total devices 20 FS bytes used 110.63TiB
devid 1 size 7.28TiB used 7.07TiB path /dev/sde
devid 2 size 7.28TiB used 7.07TiB path /dev/sdf
devid 3 size 7.28TiB used 7.07TiB path /dev/sdg
devid 4 size 0.00B used 7.04TiB path /dev/sdk
devid 5 size 7.28TiB used 7.07TiB path /dev/sdl
devid 6 size 7.28TiB used 7.07TiB path /dev/sdm
devid 7 size 7.28TiB used 7.07TiB path /dev/sdn
devid 8 size 7.28TiB used 7.07TiB path /dev/sdo
devid 9 size 7.28TiB used 7.07TiB path /dev/sdp
devid 10 size 7.28TiB used 7.07TiB path /dev/sdq
devid 11 size 7.28TiB used 7.07TiB path /dev/sdr
devid 12 size 7.28TiB used 7.07TiB path /dev/sds
devid 13 size 7.28TiB used 7.07TiB path /dev/sdt
devid 14 size 7.28TiB used 7.07TiB path /dev/sdu
devid 15 size 7.28TiB used 7.07TiB path /dev/sdv
devid 16 size 12.73TiB used 3.83TiB path /dev/sdb
devid 17 size 12.73TiB used 3.83TiB path /dev/sdi
devid 18 size 12.73TiB used 3.83TiB path /dev/sdh
devid 19 size 12.73TiB used 3.83TiB path /dev/sda
devid 20 size 12.73TiB used 3.83TiB path /dev/sdj

And yes, quotas are disabled:
~]# btrfs qgroup show /path/to/fs/
ERROR: can’t list qgroups: quotas not enabled

When the filesystem becomes unresponsive, the CPU usage of the delete process is at 100%. After a while it goes down to 0% and the FS performance is restored.
Also, note that the overall system CPU is not affected by this process. I tried to nice -19 the process without any improvement.

Is there a kernel/btrfs bug?

@shocker Re:

But the CPU is, and note that you had this NFS issue previously; it will likely only be worse during this ‘internal’ balance. I’m leaning towards this drive causing major slowdowns, but this process is known to take ages, and yes, there are always kernel bugs, hence our ongoing move to get newer btrfs backports via the openSUSE rebase.

What’s going on in the system logs just before/during/directly after these 100% CPU times?

But this is really all in the realms of the kernel and btrfs-progs now, and possibly of flaky-hardware (the dying drive) related slowdowns. Again, raid5/6 is known to perform very poorly, especially during repair/reshape; it’s just far from optimised and is significantly younger than its btrfs raid1/10 counterparts. Folks generally don’t see this until they are in a situation such as you are now in. Interesting that the btrfs commands are now responding quickly. Is that all the time, or do they also slow down during these high-CPU periods? I put a warning in the Web-UI at the end of the Resize/ReRaid Wizard, as it is common for the system to be far less responsive during both regular balances and these device-removal ‘internal’ balances.

Keep us posted, and how are those “… unreadable (pending) sector” counts coming along?
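
Something like the following should show the most recent counts (log path as on your CentOS-based install):

grep 'unreadable (pending) sectors' /var/log/messages | tail -n 5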

:+1:

This is what I have in the system log file:
Jan 2 16:56:13 smartd[6711]: Device: /dev/sdk [SAT], 112 Currently unreadable (pending) sectors
Jan 2 16:56:15 kernel: BTRFS info (device sda): found 87 extents
Jan 2 16:56:17 systemd-logind: New session 9196 of user timecapsule.
Jan 2 16:56:17 afpd[37320]: Login by timecapsule (AFP3.4)
Jan 2 16:57:56 afpd[37320]: dsi_stream_read: len:0, unexpected EOF
Jan 2 16:57:56 afpd[37320]: dsi_disconnect: entering disconnected state
Jan 2 16:58:50 systemd-logind: New session 9197 of user timecapsule.
Jan 2 16:58:50 afpd[37467]: Login by timecapsule (AFP3.4)
Jan 2 16:59:57 afpd[37467]: dsi_stream_read: len:0, unexpected EOF
Jan 2 16:59:57 afpd[37467]: dsi_disconnect: entering disconnected state
Jan 2 17:01:59 kernel: BTRFS info (device sda): relocating block group 212272628367360 flags data|raid6
Jan 2 17:04:06 systemd-logind: New session 9200 of user timecapsule.
Jan 2 17:04:06 afpd[37814]: Login by timecapsule (AFP3.4)
Jan 2 17:05:07 systemd-logind: New session 9201 of user timecapsule.
Jan 2 17:05:07 afpd[37869]: Login by timecapsule (AFP3.4)
Jan 2 17:05:07 afpd[37869]: afp_disconnect: trying primary reconnect
Jan 2 17:05:07 afpd[12538]: Reconnect: transferring session to child[37814]
Jan 2 17:05:07 afpd[12538]: Reconnect: killing new session child[37869] after transfer
Jan 2 17:05:09 afpd[37869]: afp_disconnect: primary reconnect succeeded
Jan 2 17:05:09 systemd-logind: Removed session 9201.
Jan 2 17:05:57 afpd[37814]: afp_dsi_transfer_session: succesfull primary reconnect
Jan 2 17:06:58 systemd-logind: New session 9202 of user timecapsule.
Jan 2 17:06:58 afpd[37981]: Login by timecapsule (AFP3.4)
Jan 2 17:06:58 afpd[37981]: afp_disconnect: trying primary reconnect
Jan 2 17:06:58 afpd[12538]: Reconnect: transferring session to child[37814]
Jan 2 17:06:58 afpd[12538]: Reconnect: killing new session child[37981] after transfer
Jan 2 17:07:00 afpd[37981]: afp_disconnect: primary reconnect succeeded
Jan 2 17:07:00 systemd-logind: Removed session 9202.
Jan 2 17:07:54 afpd[37814]: afp_dsi_transfer_session: succesfull primary reconnect
Jan 2 17:08:55 systemd-logind: New session 9203 of user timecapsule.
Jan 2 17:08:55 afpd[38103]: Login by timecapsule (AFP3.4)
Jan 2 17:08:55 afpd[38103]: afp_disconnect: trying primary reconnect
Jan 2 17:08:55 afpd[12538]: Reconnect: transferring session to child[37814]
Jan 2 17:08:55 afpd[12538]: Reconnect: killing new session child[38103] after transfer
Jan 2 17:08:57 afpd[38103]: afp_disconnect: primary reconnect succeeded
Jan 2 17:08:57 systemd-logind: Removed session 9203.
Jan 2 17:09:51 afpd[37814]: afp_dsi_transfer_session: succesfull primary reconnect
Jan 2 17:10:51 systemd-logind: New session 9205 of user timecapsule.
Jan 2 17:10:51 afpd[38236]: Login by timecapsule (AFP3.4)
Jan 2 17:10:51 afpd[38236]: afp_disconnect: trying primary reconnect
Jan 2 17:10:51 afpd[12538]: Reconnect: transferring session to child[37814]
Jan 2 17:10:51 afpd[12538]: Reconnect: killing new session child[38236] after transfer
Jan 2 17:10:53 afpd[38236]: afp_disconnect: primary reconnect succeeded
Jan 2 17:10:53 systemd-logind: Removed session 9205.
Jan 2 17:11:46 kernel: NFSD: client 10.10.2.8 testing state ID with incorrect client ID
Jan 2 17:11:46 kernel: NFSD: client 10.10.2.8 testing state ID with incorrect client ID
Jan 2 17:11:46 kernel: NFSD: client 10.10.2.8 testing state ID with incorrect client ID
Jan 2 17:11:46 kernel: NFSD: client 10.10.2.8 testing state ID with incorrect client ID
Jan 2 17:11:49 afpd[37814]: afp_dsi_transfer_session: succesfull primary reconnect
Jan 2 17:12:49 systemd-logind: New session 9206 of user timecapsule.
Jan 2 17:12:49 afpd[38344]: Login by timecapsule (AFP3.4)
Jan 2 17:12:49 afpd[38344]: afp_disconnect: trying primary reconnect
Jan 2 17:12:49 afpd[12538]: Reconnect: transferring session to child[37814]
Jan 2 17:12:49 afpd[12538]: Reconnect: killing new session child[38344] after transfer
Jan 2 17:12:51 afpd[38344]: afp_disconnect: primary reconnect succeeded
Jan 2 17:12:51 systemd-logind: Removed session 9206.
Jan 2 17:13:58 afpd[37814]: afp_dsi_transfer_session: succesfull primary reconnect
Jan 2 17:13:58 afpd[37814]: dsi_stream_read: len:0, unexpected EOF
Jan 2 17:13:58 afpd[37814]: dsi_stream_read: len:0, unexpected EOF
Jan 2 17:13:58 afpd[37814]: dsi_disconnect: entering disconnected state
Jan 2 17:14:04 kernel: RPC request reserved 108 but used 268
Jan 2 17:14:23 systemd-logind: New session 9207 of user admin.
Jan 2 17:14:23 afpd[38430]: Login by admin (AFP3.4)
Jan 2 17:16:10 afpd[38430]: dsi_stream_read: len:0, unexpected EOF
Jan 2 17:16:10 afpd[38430]: dsi_disconnect: entering disconnected state

Yes, during the high CPU load I’m getting slowness for btrfs fi show.

If I physically unplug the HDD from the system, will btrfs know how to recover by itself? I mean, that should be the basic functionality of raid6, just as with an HDD crash :wink:
Does the delete try to move everything from the drive to the others, or what is the process behind it?

Force-restarted the system, removed the faulty HDD, mounted as degraded, and everything is working fine now :slight_smile:
Now I’ve issued btrfs device delete missing /path/to/fs and for the moment the FS is running fine; let’s see :slight_smile:
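
In case it is useful to anyone else, the sequence was roughly this (any surviving pool member can be named in the mount command; /dev/sde here is just one from the listing above):

mount -o degraded /dev/sde /path/to/fs     # mount the pool with the failed member absent
btrfs device delete missing /path/to/fs    # rebuild the missing device's data onto the remaining drives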

The slowness is back, but it’s not that bad, as it’s not hanging that often now.

After a few hours I had the same issues: the process went up to 100% CPU and the system became unresponsive. As the first time, I was unable to stop the process. I force-restarted the system and now it is running OK, in degraded mode. The delete was running as slowly as the initial one; it had only progressed by a few MB in 12 hours.

Any ideas? Or should I try to run another OS with a better btrfs kernel and try to recover from degraded there?

The issue seems similar to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1294531
I’ve just seen the Red Hat reply from Feb ’19. Interesting :slight_smile:

@phillxnet any beta for the suse edition? :slight_smile:

@shocker Re:

That issue references CentOS’s default kernel version, which was 3.19, as per our CentOS-based installer. But our installer then immediately updates this to 4.12 from elrepo via their kernel-ml offering. We also bundle a similarly versioned btrfs-progs, 4.12.

The slow “btrfs dev delete” issue is not unknown and has come up again recently on the linux-btrfs mailing list:
https://lore.kernel.org/linux-btrfs/7c82851a-21cf-2a66-8d1c-42d57ca0538f@dubiel.pl/T/#mc93039e2c22c9eb279ffa8ee433ee27db5291c32

And again, your experience is likely compounded by the known lower performance of the parity raid variants, and by the duff disk likely slowing things even more initially. We will have to move to supporting the quicker disk-replace functionality sooner rather than later, but it is a non-trivial task and we currently have a lot on with our openSUSE move.

Speaking of which:

Yes, please see the following recent ‘Announcements’ tagged forum post:

So more alpha than beta, but if you’re game! Also, if you do take the plunge, then go for Leap 15.1 rather than Tumbleweed. Leap 15.1 is our next main target and has hundreds of btrfs kernel backports in comparison to what our CentOS offering has, so don’t be put off by the major kernel version. SUSE employ quite a few btrfs developers, so they have the required expertise to do this, and SLES is now binary compatible with the openSUSE Leap line, as it is now based directly off of openSUSE Leap. Plus SUSE are a significant backer of openSUSE once again.

We also have a working pre-release installer based on openSUSE Leap 15.1, but I have yet to clear its use/config with the openSUSE folks prior to release, and I don’t think I’ve quite yet worked through all the debranding issues required to do this. But when this is sorted I will of course announce it here on the forum and make a download link available.

All in the works, but there is still a lot to do, such as making the move over to our ‘yet to be released’ openSUSE variant as smooth as possible. To that end there has been a fair bit of work done on the config save/restore (@Flox mainly), plus the ongoing work to achieve feature/function parity. Still, if you fancy getting involved then the above link should get you sorted. But please do read the referenced post and its subsequent references fully first, as this is definitely not yet ready for general release. We do have some forum members who have already built from source on openSUSE variants, but reports on the now-available openSUSE rpms are a little more relevant with regard to finally getting an openSUSE Stable channel rpm offering out.

Hope that helps, at least as a status update. I’m due to post more on the forum soon on this, but as always there is lots to do and we have our current CentOS users to support during all of this. And without the Stable subscribers currently on CentOS we would have no project to have openSUSE aspirations with in the first place.

Awesome! Thanks! I’ll try to switch to the openSUSE one this weekend. Should I try Tumbleweed first and try to recover? Or does it have the same btrfs backports in the kernel?
I’ve seen that Tumbleweed comes with kernel 5.3 by default.