Extremely high I/O, without reason?

Hi guys,

This is probably more BTRFS related, but perhaps somebody with more experience than I can shed some light…

I had a functioning BTRFS RAID10 config, 4x6TB disks.
I had just decided the system was stable enough to reclaim some disks and add them to the existing RAID, 4x4TB.
I physically added the disks, killed the existing mdraid on them and zeroed the superblocks.
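(Roughly this, with /dev/md0 and /dev/sdX standing in for the actual array and member disks:)

mdadm --stop /dev/md0                # stop the old md array
mdadm --zero-superblock /dev/sdX     # wipe the md metadata, repeated for each member disk
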
I then added all 4 via the web GUI. The balance process started automatically. By 10PM, the balance was at 13%.
The balance stopped shortly after, remaining at 13% overnight.

The system was extremely unresponsive at this point, and many Docker containers were not responding.
I attempted a graceful reboot and waited 3 hours for the system to restart. By this point the console had stopped responding, as had SSH, so I was forced to power-cycle the system.

Since then, the system starts but doesn't respond to much of anything. Many processes are stuck in uninterruptible sleep awaiting I/O (state D).
Load is continually climbing (currently >35), CPU usage and I/O wait are at 0.0, more than 50% of memory is free, and the disks are all idle.
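
A quick way to pull out just the D-state tasks and the kernel wait channel each one is sleeping in (standard procps options; the :32 only widens the wchan column) would be something like:

ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'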

I’ve attempted restarting the balance, but the command is unresponsive, as is checking the balance status.

Many property retrieval ops are stuck in wait (please forgive the formatting, I’m on a tablet):

root 5 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:0]
root 142 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:1]
root 838 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:4]
root 844 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:6]
root 1287 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:11]
root 1288 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:12]
root 1289 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:13]
root 1291 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:14]
root 1295 0.0 0.0 0 0 ? D 16:18 0:00 [kworker/u32:16]
root 3655 0.0 0.0 0 0 ? D 16:19 0:00 [btrfs-transacti]
root 5750 0.0 0.0 200 4 ? D 16:20 0:00 elglob -0 -- envdirs /var/run/s6/env-* forx -p -- envdir ${envdirs} importas -u envdir envdir s6-rmrf ${envdir}
root 6389 0.0 0.0 15920 1060 ? D 16:20 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 6537 0.0 0.0 15944 1104 tty1 D+ 16:22 0:00 btrfs balance start -mconvert raid10 -dconvert raid10 /mnt2/tempraid/
root 6573 0.0 0.0 15944 1112 tty2 D+ 16:23 0:00 btrfs balance status /mnt2/tempraid/
root 6729 0.0 0.0 15920 1020 ? D 16:23 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 7004 0.0 0.0 15920 1096 ? D 16:26 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 7228 0.0 0.0 15920 1012 ? D 16:28 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 7454 0.0 0.0 15920 1092 ? D 16:30 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 7669 0.0 0.0 15920 1020 ? D 16:32 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 7889 0.0 0.0 15920 1056 ? D 16:34 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 8441 0.0 0.0 15920 1012 ? D 16:37 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 8876 0.0 0.0 15920 1052 ? D 16:40 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 9263 0.0 0.0 15920 1020 ? D 16:42 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 9644 0.0 0.0 15920 1108 ? D 16:44 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 10441 0.0 0.0 15920 1096 ? D 16:47 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 10865 0.0 0.0 15920 1024 ? D 16:49 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 11260 0.0 0.0 15920 1016 ? D 16:51 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 11595 0.0 0.0 15920 1056 ? D 16:52 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 11843 0.0 0.0 15920 1004 ? D 16:55 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 12101 0.0 0.0 15920 1104 ? D 16:58 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 12412 0.0 0.0 15920 1004 ? D 17:00 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 12659 0.0 0.0 15920 1088 ? D 17:03 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 12890 0.0 0.0 15920 1124 ? D 17:06 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 13191 0.0 0.0 15920 1124 ? D 17:13 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
root 13440 0.0 0.0 15920 1120 ? D 17:16 0:00 /sbin/btrfs property get /mnt2/tempraid/rockon/btrfs/subvolumes/d0bc50ff71e2d0582618d1edfcbdcbdfc7eccdbb0804e1e928dc3da4bd03a2c3
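
If it helps with diagnosis, the kernel stacks of these blocked tasks can be dumped, assuming sysrq is enabled and a console or journal is still reachable (the PID below is the btrfs-transacti task from the listing above):

echo w > /proc/sysrq-trigger    # log the stack of every blocked (D-state) task to dmesg
cat /proc/3655/stack            # or a single task, here btrfs-transacti (if /proc/<pid>/stack is available)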

Any ideas how I can get this back up? Happy to provide more info if needed.

Good news everyone (I think).
A second hard reboot brought the system up again, and I was able to restart the balance.
Unfortunately, the problem occurred again, this time at 46% of the balance process.
I appeared to be hitting a deadlock on btrfs-transaction.

Below is the dmesg output showing the last two successful block group relocations and the onset of the issue:

[86704.611607] BTRFS info (device sdc): relocating block group 8545009401856 flags data|raid10                                  
[86731.524234] BTRFS info (device sdc): found 202 extents                                                                       
[86751.004478] BTRFS info (device sdc): found 202 extents                                                                       
[86759.861490] BTRFS info (device sdc): relocating block group 8542861918208 flags data|raid10                                  
[86784.316422] BTRFS info (device sdc): found 73 extents                                                                        
[86799.294337] BTRFS info (device sdc): found 73 extents                                                                        
[86998.162668] INFO: task btrfs-transacti:3629 blocked for more than 120 seconds.                                               
[86998.163005]       Not tainted 4.12.4-1.el7.elrepo.x86_64 #1                                                                  
[86998.163172] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                        
[86998.163343] btrfs-transacti D    0  3629      2 0x00000080                                                                   
[86998.163347] Call Trace:                                                                                                      
[86998.163357]  __schedule+0x28a/0x880                                                                                          
[86998.163360]  schedule+0x36/0x80                                                                                              
[86998.163392]  btrfs_tree_lock+0xef/0x210 [btrfs]                                                                              
[86998.163396]  ? remove_wait_queue+0x60/0x60                                                                                   
[86998.163414]  btrfs_lock_root_node+0x34/0x50 [btrfs]                                                                          
[86998.163431]  btrfs_search_slot+0x8b8/0x9b0 [btrfs]                                                                           
[86998.163453]  ? __set_page_dirty_nobuffers+0x10f/0x160                                                                        
[86998.163455]  ? set_page_dirty+0x52/0xb0                                                                                      
[86998.163482]  btrfs_delete_delayed_items+0xd6/0x390 [btrfs]                                                                   
[86998.163507]  ? btrfs_insert_delayed_items+0xc0/0x480 [btrfs]                                                                 
[86998.163532]  __btrfs_run_delayed_items+0x12d/0x220 [btrfs]                                                                   
[86998.163557]  btrfs_run_delayed_items+0x13/0x20 [btrfs]                                                                       
[86998.163579]  btrfs_commit_transaction+0x277/0x900 [btrfs]                                                                    
[86998.163600]  ? start_transaction+0xa7/0x440 [btrfs]                                                                          
[86998.163620]  transaction_kthread+0x18a/0x1c0 [btrfs]                                                                         
[86998.163624]  kthread+0x109/0x140                                                                                             
[86998.163644]  ? btrfs_cleanup_transaction+0x520/0x520 [btrfs]                                                                 
[86998.163646]  ? kthread_park+0x60/0x60                                                                                        
[86998.163650]  ? do_syscall_64+0x67/0x150                                                                                      
[86998.163653]  ret_from_fork+0x25/0x30

Having seen that the system was able to recover from the dangers of a cold reboot, I did this again. Once the system came up, it appeared to resume the BTRFS balance operation, which it completed in a disturbingly short amount of time, and the system has been stable since.
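
If the resumed balance ever wedges like that again, my fallback (untested so far, and assuming the pool can be mounted via its first device as shown below) would be to mount with skip_balance so the interrupted balance doesn't auto-resume, then cancel it by hand:

mount -o skip_balance /dev/sda /mnt2/tempraid    # skip_balance stops a paused balance resuming at mount time
btrfs balance cancel /mnt2/tempraid              # then cancel it, or pick it up later with 'btrfs balance resume'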

Does anybody know how properly balanced data should be distributed across 4x6TB and 4x4TB disks in RAID10?

Currently I’m seeing:

[ root@rockout (pass 0s) ~ ]# btrfs fi show                                                                        
Label: 'rockstor_rockstor'  uuid: d06b622f-d241-4e27-8638-063167eedb2e                                             
        Total devices 1 FS bytes used 2.51GiB                                                                      
        devid    1 size 103.42GiB used 22.02GiB path /dev/sdi3                                                     
                                                                                                                   
Label: 'tempraid'  uuid: 7f6fabf6-8fd3-4208-8522-5f258ab5c0ce                                                      
        Total devices 8 FS bytes used 9.48TiB                                                                      
        devid    1 size 5.46TiB used 3.57TiB path /dev/sda                                                         
        devid    2 size 5.46TiB used 3.57TiB path /dev/sdh                                                         
        devid    3 size 5.46TiB used 3.57TiB path /dev/sdg                                                         
        devid    4 size 5.46TiB used 3.57TiB path /dev/sdf                                                         
        devid    5 size 3.64TiB used 1.17TiB path /dev/sdb                                                         
        devid    6 size 3.64TiB used 1.17TiB path /dev/sde                                                         
        devid    7 size 3.64TiB used 1.18TiB path /dev/sdd                                                         
        devid    8 size 3.64TiB used 1.18TiB path /dev/sdc                                                         
                                                                                                                   
[ root@rockout (pass 0s) ~ ]#

I’m a little concerned, as prior to the reboot the system had balanced ~2000/~4500 chunks, and after the reboot the ‘resumed’ balance only processed 7 chunks.
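
My own rough sanity check of those numbers, assuming RAID10 keeps two copies of everything and that btrfs allocates new chunks from whichever devices currently have the most free space (which would explain the 6TB drives carrying more):

data referenced (single copy):     9.48 TiB
raw allocation needed, 2 copies:   9.48 x 2 ≈ 18.96 TiB
raw allocation per fi show above:  4 x 3.57 + 2 x 1.17 + 2 x 1.18 ≈ 18.98 TiB

That matches to within rounding, so if I have the profile maths right, the uneven per-device usage would account for all the data.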

A few very frightening experiences you’ve had there…

I have run many balances, and never had them hang on me (I have had one crash during a balance though…). I don’t think what you are experiencing is the norm.

Anyway, the distribution of data looks fine to me.

I would probably run a scrub to check up on the system.
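
Something along these lines, using the mount point from your fi show output; the scrub runs in the background, so start it and check back on progress later:

btrfs scrub start /mnt2/tempraid
btrfs scrub status /mnt2/tempraid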

Thanks for the info, Karsten!

Yeah, I figure the balance issues are either related to adding four large disks to a very large RAID all at once, or possibly a race condition on the AMD architecture. The disks are all healthy and no read/write errors are present, though, and df compared to the btrfs utils is comparable, so I’m not as worried as I was!
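
For the record, the comparison I was making was roughly this; df sees the mounted filesystem once, while the btrfs tools break allocation down by profile and by device:

df -h /mnt2/tempraid
btrfs filesystem df /mnt2/tempraid
btrfs filesystem usage /mnt2/tempraid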

I’ve got a scheduled scrub starting in a few hours; however, the system has actively served media without issue for the last day, so I’m reasonably confident that all is well.
