Raid 1 parent transid verify failed, but still mounted?

I am running rockstor 3.8.16-8. This morning I restarted the system and I noticed that NFS did not work afterwards. The service is apparently running (nfsd is working) but none of the clients can mount the shares.

Attempting to diagnose, I looked at dmesg and saw a bunch of:

BTRFS error (device sda5): parent transid verify failed on 1721409388544 wanted 19188 found 83121

So I immediately checked the drives:

btrfs device stats /dev/sda5  
[/dev/sda5].write_io_errs   0 
[/dev/sda5].read_io_errs    0   
[/dev/sda5].flush_io_errs   0   
[/dev/sda5].corruption_errs 0  
[/dev/sda5].generation_errs 0   

They all came up the same, no errors in SMART.
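For anyone following along, the per-device checks above can be scripted. This is a sketch, assuming the pool’s physical members are /dev/sda and /dev/sdb on this particular box, and that `smartctl` (from the smartmontools package) is installed:

```shell
# Show the overall SMART health verdict for each physical disk,
# then the btrfs per-member error counters (tracked per pool member,
# which here is the partition sda5, not the whole disk).
for dev in /dev/sda /dev/sdb; do
    echo "=== ${dev} ==="
    smartctl -H "${dev}"
done

btrfs device stats /dev/sda5
```

All-zero btrfs counters alongside a healthy SMART verdict, as seen here, point away from a simple media failure.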

After searching for the transid error online, I’m finding horror stories, but they all seem to involve a filesystem that would not mount. Here it appears to be mounting: Samba is working just fine and I can apparently get to all my files through it.

My setup is a little unusual: I have two 3TB disks arranged into two pools:

btrfs fi show
  Label: 'Primary'  uuid: 21e09dd8-a54d-49ec-95cb-93fdd94f0c17 
      Total devices 2 FS bytes used 943.67GiB
      devid    1 size 2.73TiB used 946.06GiB path /dev/sdb
      devid    2 size 2.70TiB used 946.06GiB path /dev/sda5

The first pool is the 30GB Rockstor system partition. I then have a RAID1 setup using the second partition on sda and the entirety of sdb; sda5 is the 2.7TB “disk” that shows up in the output above. I had to create this setup manually: on the command line I added sda5 to a pool created on sdb in the Rockstor GUI, then changed the type to raid1 and rebalanced.
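For reference, the manual conversion described above boils down to something like the following. This is a sketch of the standard btrfs commands rather than a transcript of exactly what I ran; the device names and the /mnt2/Primary mount point match this box:

```shell
# Pool originally created on /dev/sdb via the Rockstor GUI.
# Add the partition as a second pool member:
btrfs device add /dev/sda5 /mnt2/Primary

# Convert both data and metadata profiles to raid1 and rebalance
# existing chunks across the two members:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt2/Primary
```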

Also I’m getting warnings on an iMac that the identity of the time machine disk has changed. That share is on a different pool with completely separate drives, however, so I’m not sure if it’s related.

Any thoughts? I don’t have enough experience with BTRFS to feel comfortable resolving this myself, but I was thinking of simply deleting sda5 from the pool, reformatting, and re-adding it. Has anyone seen this before? Is sda5 actually mounting or is all my data ending up on sdb?

I downloaded all the logs I could find and began looking through them. I found this in the rockstor.log:

[07/Feb/2017 08:10:04] ERROR [system.osi:107] non-zero code(1) returned by command: ['/sbin/btrfs', 'subvolume', 'list', '-o', '/mnt2/Primary/Movies']. output: [''] error: ["ERROR: cannot access '/mnt2/Primary/Movies': Input/output error", "ERROR: can't access '/mnt2/Primary/Movies'", '']
[07/Feb/2017 08:10:04] ERROR [storageadmin.middleware:32] Exception occured while processing a request. Path: /api/commands/refresh-share-state method: POST
[07/Feb/2017 08:10:04] ERROR [storageadmin.middleware:33] Error running a command. cmd = ['/sbin/btrfs', 'subvolume', 'list', '-o', '/mnt2/Primary/Movies']. rc = 1. stdout = ['']. stderr = ["ERROR: cannot access '/mnt2/Primary/Movies': Input/output error", "ERROR: can't access '/mnt2/Primary/Movies'", '']
Traceback (most recent call last):
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/generic/base.py", line 71, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 452, in dispatch
    response = self.handle_exception(exc)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 449, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 262, in post
    import_shares(p, request)
  File "/opt/rockstor/src/rockstor/storageadmin/views/share_helpers.py", line 82, in import_shares
    rusage, eusage = share_usage(pool, share.qgroup)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 753, in share_usage
    out, err, rc = run_command(cmd, log=True)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 109, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = ['/sbin/btrfs', 'subvolume', 'list', '-o', '/mnt2/Primary/Movies']. rc = 1. stdout = ['']. stderr = ["ERROR: cannot access '/mnt2/Primary/Movies': Input/output error", "ERROR: can't access '/mnt2/Primary/Movies'", '']

Sure enough, the “Movies” subvolume within the Primary pool gives an Input/Output error when I try to access it. This exception could explain why NFS and AFP are acting strange.

Looking through dmesg, it appears that the transid error swaps back and forth between the two physical devices in the Primary pool:

kernel: BTRFS: device label Primary devid 2 transid 83463 /dev/sda5
kernel: BTRFS: device label Primary devid 1 transid 83463 /dev/sdb
kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121

So it would appear that the problem exists on both sda5 and sdb. So far there are no clues as to what caused this; the first instance of the error appears to be yesterday morning, after I restarted.
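Pinning down the first occurrence came from grepping the kernel logs, roughly like this (a sketch; `journalctl` availability on this install is an assumption, and `dmesg` only covers the current boot):

```shell
# Kernel messages from the current boot mentioning transid failures:
dmesg | grep 'parent transid verify failed'

# Search persisted kernel logs across boots for the earliest hit:
journalctl -k --no-pager | grep 'parent transid verify failed' | head -n 5
```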

Hi @kbogert and welcome to Rockstor!

The first pool is the 30GB Rockstor system partition. I then have a RAID1 setup using the second partition on sda and the entirety of sdb; sda5 is the 2.7TB “disk” that shows up in the output above. I had to create this setup manually: on the command line I added sda5 to a pool created on sdb in the Rockstor GUI, then changed the type to raid1 and rebalanced.

“Bad boy!” The WebUI console error explains why this config is failing: “…serial number is not legitimate or unique”. In this case it is not unique: sda5, being part of sda, has the same serial as sda, and Rockstor relies on unique serials.

Finally, to have a raid1 config you actually need one drive for Rockstor plus two whole disks for the raid1 pool.

I’ll leave the last word to @phillxnet, who can add suggestions from his incoming disks PR.

Once again, welcome :slight_smile:

Mirko

Hi Mirko,

This config had been fine for a month before this incident, and the error message is new (normally it complains about sda5 being part of the system disk and thus unchangeable).

Or are you saying that something in rockstor has damaged the btrfs filesystem due to misinterpreting the pool? I find that prospect scary.

Since my last post I have found a snapshot of the Movies subvolume with all the files apparently intact. In fact, the only problem I can find with the Primary pool is that the Movies subvolume itself is damaged, giving the transid error even when using ls on its parent directory. I can’t seem to find any information on what could cause this, and IRC isn’t being responsive to my questions.
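For anyone in the same boat, locating an intact snapshot can be done from the command line; the /mnt2/Primary mount point here matches my setup:

```shell
# List only snapshots (-s) in the Primary pool, with their paths:
btrfs subvolume list -s /mnt2/Primary

# Spot-check the damaged subvolume; an Input/output error here
# confirms the corruption is in Movies itself, not the whole pool:
ls /mnt2/Primary/Movies
```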

Mounting Primary with -o ro,recovery doesn’t help, the partition mounts like normal but the Movies dir is still damaged. From what I can figure out my next move is a btrfs restore to a backup location (which I’m preparing now). Then try to zero the logs. I’m unsure if I should report this as a bug or not. Thoughts?
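The recovery plan sketched above looks roughly like this. The backup target /mnt/backup is a placeholder for wherever the restore lands, and both `btrfs restore` and the log-zeroing step are meant to be run against the unmounted device:

```shell
# Dry run first (-D) to see what btrfs restore would pull off the
# device, verbosely (-v), without writing anything:
btrfs restore -D -v /dev/sdb /mnt/backup

# Real run once the dry run looks sane:
btrfs restore -v /dev/sdb /mnt/backup

# "Zero the logs": clear the filesystem's unreplayed log tree.
# Only the log tree is discarded; filesystem must be unmounted.
btrfs rescue zero-log /dev/sdb
```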

@kbogert Hello, just chipping in here that I think the message @Flyer was referencing was the big scary red one on the Disks page that you posted a screenshot of: “Warning! Disk unusable as pool member - serial number is not legitimate or unique …”

Rockstor doesn’t currently support partitions for data drives. This may change in the future with the referenced pending review changes, but your partitioned pool member is also on the system drive. That very much complicates things and is unlikely ever to be a supported arrangement, as it breaks the separation that Rockstor attempts to enforce between system and data storage, which is of course a good idea all around.

If you wish to have such configurations then you are better off with a less appliance-orientated Linux distribution. Much of the ease of use that Rockstor brings comes from compromises on configuration options, but that is always the case with an appliance approach; otherwise you are command line only, as that ultimately has the greatest flexibility.

As to your diagnosis, I think you are correct that you have pool/subvolume corruption of sorts and this needs to be sorted. The differing drive names in the reports are probably due to raid1’s modulo-PID read balancing, i.e. one drive is read if the PID is even, the other if it is odd. As to this being Rockstor ‘territory’, it is not: you are well outside the understood domain with your use of a partition as a pool member (rather than a whole disk), and that partition also being on the system drive. Internally, Rockstor treats the system drive as a rather special case, and consequently so is one, but not all, of your pool members. That is definitely a recipe for confusion for all parties.

Agreed. In fact I would say you would be better off, once you have your data back in order and off the current hardware, to reinstall using something like a SanDisk Extreme USB 3.0 32GB (or larger) drive as your system disk; it’s essentially an ssd on a stick. Internally it’s a usb-sata bridge attached to an ssd, all in one. And given it will form a single-device btrfs pool, the use of the USB bus, with its associated instabilities, is minimised. Thereafter you are in Rockstor land, assuming whole-disk pool members for the time being. A number of forum members, including myself, use this device as the system drive. It’s actually really quick as well, assuming at least a USB 2.0 port, and certainly faster than your current hdd. Plus it’s very low power.

Hope that helps and good luck with your btrfs adventures re restore. Nice that you had, and found, a snapshot by the way.


Ok, so if I understand you correctly the corrupted subvolume is not the fault of rockstor, despite my weird configuration? In other words, we are looking at a bug in btrfs itself or the hardware.

I’m not sure which is worse.

Thank you for this, but if I must reinstall this system after one fscking month of light use due to a bug in btrfs, then the new system will not depend on btrfs in any way. I’ll update this thread with my status, and will open a bug report in btrfs’s bugzilla with a filesystem image available.

I have to say though, judging by the mountain of unanswered bug reports over there I don’t have high hopes that this will be solved.