[Rockstor-bootstrap] allow for share mount to fail?

I couldn’t find a title that would describe what I wanted to ask, but in short, I wonder whether it would be a good idea to allow some shares to fail mounting during Rockstor boot (rockstor-bootstrap.service).

I started wondering about this because I recently experienced a failure to mount all of my shares after rebooting the machine: one share failed to mount, which caused the rockstor-bootstrap service itself to fail:

[root@rockstor ~]# systemctl status rockstor-bootstrap
● rockstor-bootstrap.service - Rockstor bootstrapping tasks
   Loaded: loaded (/etc/systemd/system/rockstor-bootstrap.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-08-04 18:40:23 EDT; 4 days ago
 Main PID: 15745 (code=exited, status=1/FAILURE)

Aug 04 18:40:23 rockstor bootstrap[15745]: Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: ['In...d be decoded']
Aug 04 18:40:23 rockstor bootstrap[15745]: Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: ['In...d be decoded']
Aug 04 18:40:23 rockstor bootstrap[15745]: Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: ['In...d be decoded']
Aug 04 18:40:23 rockstor bootstrap[15745]: Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: ['In...d be decoded']
Aug 04 18:40:23 rockstor bootstrap[15745]: Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: ['In...d be decoded']
Aug 04 18:40:23 rockstor bootstrap[15745]: Max attempts(15) reached. Connection errors persist. Failed to bootstrap. Error: ['Internal Server Error: No JSON object could be decoded']
Aug 04 18:40:23 rockstor systemd[1]: rockstor-bootstrap.service: main process exited, code=exited, status=1/FAILURE
Aug 04 18:40:23 rockstor systemd[1]: Failed to start Rockstor bootstrapping tasks.
Aug 04 18:40:23 rockstor systemd[1]: Unit rockstor-bootstrap.service entered failed state.
Aug 04 18:40:23 rockstor systemd[1]: rockstor-bootstrap.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

The service failed due to the following error while mounting a particular share:

[04/Aug/2020 18:39:54] ERROR [system.osi:119] non-zero code(1) returned by command: ['/usr/sbin/btrfs', 'qgroup', 'show', '/mnt2/main_pool/Photos']. output: [''] error: ["ERROR: cannot access '/mnt2/main_pool/Photos': Input/output error", '']
[04/Aug/2020 18:39:54] ERROR [storageadmin.middleware:32] Exception occurred while processing a request. Path: /api/commands/bootstrap method: POST
[04/Aug/2020 18:39:54] ERROR [storageadmin.middleware:33] Error running a command. cmd = /usr/bin/mount -t btrfs -o subvolid=720 /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5LLP56J /mnt2/Photos. rc = 32. stdout = ['']. stderr = ["mount: /dev/sdb: can't read superblock", '']
Traceback (most recent call last):
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/generic/base.py", line 71, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 452, in dispatch
    response = self.handle_exception(exc)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 449, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 121, in post
    import_shares(p, request)
  File "/opt/rockstor/src/rockstor/storageadmin/views/share_helpers.py", line 204, in import_shares
    mount_share(nso, '%s%s' % (settings.MNT_PT, s_in_pool))
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 607, in mount_share
    return run_command(mnt_cmd)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 121, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /usr/bin/mount -t btrfs -o subvolid=720 /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5LLP56J /mnt2/Photos. rc = 32. stdout = ['']. stderr = ["mount: /dev/sdb: can't read superblock", '']

Yes, I know this doesn’t look good for this share (it actually doesn’t matter much and is a separate problem for a separate thread, I believe), but what I would like to point out is that all the other shares can be mounted successfully individually:

/usr/bin/mount -t btrfs -o subvolid=<subvolid> /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N5LLP56J /mnt2/<share-name>

I was thus wondering whether it would be a good idea to allow for such failures to happen without failing the whole bootstrap procedure. In that scenario, for instance, all the other shares (and the services relying on them) would still function properly, while only the “bad” share would be displayed as problematic (unmounted).
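To make it concrete, something along these lines is what I have in mind (just a rough, self-contained sketch of the behaviour, not Rockstor’s actual code; the function name, device path and share list are placeholders):

# Rough sketch of the proposed "keep going" behaviour, not actual Rockstor
# code: attempt to mount every share, collect the ones that fail, and let the
# caller surface them instead of aborting on the first error.
import subprocess

def mount_shares(dev, shares, mnt_root='/mnt2'):
    """Try to mount each (share_name, subvolid) pair; return the failures."""
    failed = []
    for name, subvolid in shares:
        cmd = ['/usr/bin/mount', '-t', 'btrfs',
               '-o', 'subvolid=%s' % subvolid,
               dev, '%s/%s' % (mnt_root, name)]
        try:
            subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError as e:
            # One bad subvol no longer blocks the rest; remember why it failed.
            failed.append((name, e.output))
    return failed

# e.g. (device and shares are illustrative only):
# bad = mount_shares('/dev/disk/by-id/ata-WDC_WD30EFRX-...',
#                    [('Photos', 720), ('Music', 721)])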

In my particular case, this would be helpful (as long as there’s an element in the UI letting me know that a problem occurred during bootstrap, pointing me to the problematic share(s)), but I’m not sure whether there are cases where such a mechanism would be more problematic than helpful.

Any feedback/insight?


@Flox Re:

I think that’s an excellent idea. We already do something like this when mounting a pool, i.e. we try via each attached member, as sometimes that can help, after attempting a mount by label first of course. And as you know we also fail through when doing the restore procedure, i.e. if one bit fails it just drops through and tries the next bit. So this is very much in keeping with the ‘do all we can’ type approach and can only help with robustness. And yes, given each subvol is, almost, a file system in its own right, it can sometimes happen that a fault affects only a subvol mount, and it would be good if we simply moved on with some error logging to note this. And given that, currently, we default to mounting all pools and all Rockstor-created shares, I very much think we should fail through to the remaining subvols if one fails to mount on us. And I think, given we highlight unmounted subvols in red, we already have a warning mechanism in place. We could email as well, but we need the base fail-through mechanism in place first, I reckon.
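For reference, the pool-side ‘try every route’ pattern I mean looks roughly like this (a generic sketch only, not our actual pool mount code; names and signatures are made up for illustration):

# Generic sketch of "mount by label first, then fall back to each attached
# member": keep trying routes until one works. Not the actual Rockstor code.
import subprocess

def mount_pool(label, member_devs, mnt_pt):
    attempts = [['/usr/bin/mount', '-t', 'btrfs', 'LABEL=%s' % label, mnt_pt]]
    attempts += [['/usr/bin/mount', '-t', 'btrfs', dev, mnt_pt]
                 for dev in member_devs]
    last_err = None
    for cmd in attempts:
        try:
            subprocess.check_output(cmd, stderr=subprocess.STDOUT)
            return cmd  # report which route finally worked
        except subprocess.CalledProcessError as e:
            last_err = e.output  # note the failure and try the next route
    raise RuntimeError('all mount attempts failed: %s' % last_err)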

Nice find, and thanks for reporting / detailing this. And it would be good to use your failure instance to prove the basic approach (likely a try/except, with maybe a finally or the like) while you have it there, as I’m unsure how we could build a legitimate reproducer for this one; it’s always good to have a ‘real’ reproducer at hand with these things.

Cheers.


Thanks for the feedback, @phillxnet!

Oh, very good point… That’s how I noticed all my shares were unmounted to begin with, so we should already have all we need to surface which share is problematic.

Yeah, it’s certainly easy to reproduce on this system, as I simply have to reboot to trigger it. The problem is that it’s my “production” machine, so I won’t be able to mess with it constantly. It’ll be useful to gather as much info and data as needed to “simulate” it, though. I’m actually curious to know whether there would be a way to create this error artificially so that we can have reproducers for such instances…
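One idea for a “synthetic” reproducer (purely a sketch; the module paths below are guesses based on the traceback and would need checking, and on Python 2.7 mock is a separate package rather than unittest.mock): patch the low-level command runner in a test so that only the mount of one share raises, then check that the import logic still handles the others.

# Sketch of a test-style reproducer; module paths are assumptions taken from
# the traceback earlier in this thread.
import mock  # separate `mock` package on Python 2.7

from system.osi import run_command              # import path assumed
from system.exceptions import CommandException  # import path assumed

def flaky_run_command(cmd, *args, **kwargs):
    # Mimic the "can't read superblock" failure, but only for the Photos subvol.
    if '/usr/bin/mount' in cmd and '/mnt2/Photos' in cmd:
        raise CommandException(cmd, [''], ["mount: can't read superblock"], 32)
    return run_command(cmd, *args, **kwargs)

# Patch run_command where mount_share() uses it (fs/btrfs.py per the traceback),
# then drive the share import and assert the other shares still get mounted.
with mock.patch('fs.btrfs.run_command', side_effect=flaky_run_command):
    pass  # e.g. import_shares(pool, request) with suitable test fixtures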

Anyway, I’ll go ahead and create an issue on GitHub so that we can keep track of it.

Thanks!


Here’s the corresponding GitHub issue:
