Disk Pool mounted, shares missing. Many errors

GambitZA · January 24, 2023, 1:28pm

Oh wonderful internet peoples. I seek wisdom from the mindhive.

Rockstor 4.1.0-0 installed on an ESXi VM. Tried to get vmware-tools installed, followed a guide blindly. The VM rebooted and all hell broke loose.

“Parent transid verify failed… wanted 32616 found 32441”
Pool remounts automatically as read-only.

I’ve tried numerous options without any success.

I’m willing to put on the dunce hat and be tarred and feathered and publicly mocked, so long as some kind souls help me to recover the data (not mission critical, but lots of sentimental photos and video).

Please advise what screenshots, printouts etc. you need to understand better where to start.

phillxnet · January 24, 2023, 5:57pm

@GambitZA Welcome to the Rockstor community forum.

Oh dear, this is usually an indication that some writes were lost in transit or the like. I.e. it wants (expects) a newer copy of something but was pointed at what looks like an older copy.
So in short the pool is poorly.

If you have a read-only mount you can refresh your backups (or create them), as it may be that little to nothing of consequence is missing. The automatic read-only ‘move’ on btrfs’s part is there to protect the filesystem (Pool) from further damage. So copy off what you can before you do anything else, as any attempt to repair this pool may leave it worse off than it is now. For example, if faulty RAM in the system caused this corruption, things will only get worse with further writes, but the pool is safer now that it’s read-only. Other possible causes are a kernel hang that prevented a write, a write that was lost somehow (drive buffers not flushed, say via a drive firmware issue), or a failing drive. Whatever the case, you at least now have read-only access, so do all you can with this before proceeding to do anything else.
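
A minimal sketch of that copy-off step, assuming the pool is mounted read-only at /mnt2/DATA and that a separate, healthy disk is mounted at the hypothetical path /mnt/backup. The function name copy_off is made up for illustration and is not part of Rockstor:

```shell
#!/bin/sh
# copy_off SRC DEST: copy everything under SRC into DEST.
# SRC is only ever read, so this is safe against a read-only pool.
# The function name and both paths are illustrative assumptions.
copy_off() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    if command -v rsync >/dev/null 2>&1; then
        # -a preserves permissions, times and symlinks; --partial keeps
        # partially transferred files so an interrupted run can be resumed.
        rsync -a --partial "$src"/ "$dest"/
    else
        # Fallback when rsync is not installed.
        cp -a "$src"/. "$dest"/
    fi
}
```

For example: `copy_off /mnt2/DATA /mnt/backup/DATA`. A non-zero exit status typically means some files could not be read; the files that were readable should still have been copied, so it is worth checking the destination before drawing conclusions.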

You may find all is well with the important data. Then your options are far more open, i.e. you can wipe the pool, create a fresh one, and re-load the data, knowing that it is now not the only copy (if that was the case initially). So if the same thing occurs, or the same root cause leads to some other failure, you are still good, or at least better off than now.

So the above is mainly advice on what to do to minimise the loss. But we are not btrfs experts here, we are more users / sysadmins. The actual experts are the developers, and for this I can point you to the following doc entry we have on contacting them:

“Asking for help”: Data loss - prevention and recovery in Rockstor — Rockstor documentation

If you are already knowledgeable in btrfs and system administration, see the upstream community Libera Chat - #btrfs channel. Finally, if your needs are extreme, consider seeking help on the btrfs mailing list.

With the following proviso there also:

> The btrfs mailing lists is primarily for btrfs developer use. Time taken-up on trivial interactions there may not be fair to the world of btrfs development. Also take careful note of what you are expected to include: i.e. the “What information to provide when asking a support question” section on the above linked mailing list page.

A further note: once you have backed up all the data that you can, move to a newer kernel. You are likely still using Leap 15.3. However our downloads page:
https://rockstor.com/dls.html
now has Release Candidate 4.5.5-0 Rockstor installers based on Leap 15.4, which in turn has a newer btrfs software stack (kernel and userspace). But note we have an ongoing/outstanding issue regarding import of config saves. But an install of this, to another system disk (disconnect the original 4.1.0-0 Leap 15.3 based one) to preserve your existing system, gives easy access to a newer system under which you can do any repairs if you end up taking that route (after refreshing backups).

We also have the following how-to that enables kernel backports to the base Leap 15.3 OS:

Installing the Stable Kernel Backport: Installing the Stable Kernel Backport — Rockstor documentation
which will also update both the kernel and filesystem subsystems to get any newer code that may help with the repair that you may end up undertaking.

But the first step must be to save what you have. Given you at least have read-only access, you should be able to pull off the vast majority of the data that concerns you via a backup copy-off.

Let us know how it goes, but don’t change anything before establishing a backup. This is super important as there is no guarantee that any repair, irrespective of the OS or kernel version, will not make things worse, especially if the underlying hardware is failing/faulty.

Re info to help others help you, the output of the following (run as the root user) could help; hopefully.

uname -a
btrfs fi show

And note these will change if you end up updating, re-installing with RC2, or doing the stable kernel backports. In some cases, though, an older kernel will mount a pool that a newer kernel will not, due to the greater number of sanity checks added as btrfs is developed. But a newer kernel is favoured in repair scenarios. Every mount attempt, especially by differing kernel versions, runs the risk of modifying the filesystem so that it doesn’t even mount read-only. Hence the advice to first get off what you can while you still have read-only access.
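
Since the output changes with each kernel or OS update, it could help to capture it in one text file per attempt so the results can be compared and pasted into a post. A minimal sketch, to be run as root; run_and_log, collect_report and the report filename are hypothetical names, and the commands beyond uname -a and btrfs fi show are additions that were not requested above:

```shell
#!/bin/sh
# File that collects the output; override by setting REPORT beforehand.
REPORT="${REPORT:-./btrfs-report.txt}"

# run_and_log CMD [ARGS...]: append a header, the command's output
# (stdout and stderr) and any failure status to the report file.
run_and_log() {
    {
        printf '### %s\n' "$*"
        "$@" 2>&1 || printf '(command exited with status %s)\n' "$?"
        printf '\n'
    } >> "$REPORT"
}

# collect_report MOUNTPOINT: gather kernel, btrfs-progs and pool details.
collect_report() {
    pool_mount=$1
    run_and_log uname -a
    run_and_log btrfs --version
    run_and_log btrfs fi show
    # Per-device error counters for the mounted pool.
    run_and_log btrfs device stats "$pool_mount"
    # Most recent btrfs-related kernel messages.
    run_and_log sh -c 'dmesg | grep -i btrfs | tail -n 50'
}
```

For example: `collect_report /mnt2/DATA`, assuming the pool is mounted at that path. None of these commands write to the pool.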

Hope that helps. And let us know how you get on.

GambitZA · January 25, 2023, 6:33am

Hello and thanks for the reply!

I’m only allowed to post one image per post, so please excuse the upcoming spam.

As requested, here are some printouts:

uname -a

GambitZA · January 25, 2023, 6:33am

btrfs fi show

GambitZA · January 25, 2023, 6:34am

cd /mnt2/DATA followed by ls

GambitZA · January 25, 2023, 6:34am

cd /mnt2/DATA/DATASTORE (datastore folder is where all the pictures are)

GambitZA · January 25, 2023, 6:35am

As well as some screenshots of the following screens in Rockstor:

Storage / Disks


GambitZA · January 25, 2023, 6:35am

Storage / Pools


GambitZA · January 25, 2023, 6:36am

Storage / Shares


GambitZA · January 25, 2023, 6:37am

And finally a copy of fstab (which, in my uneducated opinion, is where the problem started. I think the process of trying to create and install vmware-tools changed the fstab).

I don’t think this issue warrants asking devs. This isn’t a bug, this is a user issue (i.e. I broke it).

GambitZA · March 3, 2023, 12:45pm

This issue is still unresolved. I suspect the actual data is still intact as the volume remains in read-only mode. If a kind soul could help me sort out the errors I would be able to get the data back.

phillxnet · March 3, 2023, 1:50pm

@GambitZA Hello again.
Re:

Yes, you may be right. On the down side here, your poorly pool (parent transid verify failed), with its single device, has no real redundancy. So there is only a single copy of the data.

We don’t actually use fstab for our data pool/s mounts, but vmware-tools may have added a kernel module.

Hopefully. Btrfs goes read-only whenever there is a data-threatening anomaly. But what is most important here is to first ensure your system is sound hardware-wise.

Look to my last post on this front and follow the advice there first.

Basically you need to ensure you don’t make things worse. Doing anything with duff memory will do this.

So first ensure the host has sound memory. You mention:

So that means you need to check the memory of the hardware hosting this ESXI vm instance.

Once that is established via say the advice in our Pre-Install Best Practice (PBP):
https://rockstor.com/docs/installation/pre-install-howto.html

You can try to increase your odds of repairing the said pool (volume in btrfs speak).
There is a btrfs tool that is designed specifically for this scenario: getting off what you can from a poorly pool. But first things first: ensure proven hardware, i.e. memory and PSU stability. ‘Other systems on the same host run OK’ is not good enough. Btrfs is famous for catching memory issues in the form of resulting filesystem corruptions. Those same corruptions used to be far less obvious on filesystems that simply returned corrupted data; btrfs tends to catch the resulting anomalies, throw errors, and drop to ro, which is what you are seeing.

Establishing memory integrity on the host is difficult to do absolutely, but two full runs of memtest86+ without error is better than nothing, and some suggest 24-48 hour runs to be more certain. It all depends on the size and speed of the machine. Once that is done you can move to updating your OS, as that way you get more fixes in the filesystem stack: hence my earlier suggestion regarding our Stable Kernel Backports howto.

Add “ro” to your Extra mount options for this pool. It will tell Rockstor to not even attempt to mount rw, given the pool is poorly. That then avoids further changes that may make things worse.
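
After the next boot or remount it may be worth confirming from the command line that the pool really is mounted read-only. A minimal sketch, assuming the pool is mounted at /mnt2/DATA; has_ro_option and check_pool_ro are hypothetical helper names, and findmnt is the util-linux tool:

```shell
#!/bin/sh
# has_ro_option OPTS: succeed if the comma-separated mount option
# string OPTS contains the standalone option "ro".
has_ro_option() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_pool_ro MOUNTPOINT: report whether MOUNTPOINT is mounted read-only.
# Returns 0 for read-only, 1 for read-write, 2 if no mount was found.
check_pool_ro() {
    # If several mounts are stacked on the path, the last one listed is used.
    opts=$(findmnt -n -o OPTIONS "$1" | tail -n 1)
    [ -n "$opts" ] || return 2
    if has_ro_option "$opts"; then
        echo "$1 is mounted read-only"
    else
        echo "$1 is mounted read-write" >&2
        return 1
    fi
}
```

For example: `check_pool_ro /mnt2/DATA`. The option string is matched on whole options only, so something like errors=remount-ro is not mistaken for ro.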

So even more briefly:

- Hardware integrity: you don’t mention if you have hardware raid in the host under the btrfs volume in the VM. That weakens integrity, incidentally, as it can pick the wrong version of redundant data. That is why we advise against this: Quick start - Rockstor documentation
- Upgrade your OS so you are on the latest version of Leap, for example. Again see my prior post. You are likely on an End Of Life (EOL) Leap version. It can be upgraded in-place via the command line. Again see: Distribution update from 15.2 to 15.3 - Rockstor documentation, for 15.2 to 15.3 for example. 15.3 to 15.4 is pretty similar but as yet unwritten as we are not yet on stable for 15.4, but an RC4 is in testing.
- Upgrade your kernel and filesystem stack as indicated in the already referenced Installing the Stable Kernel Backport - Rockstor documentation, so that you are on an even newer kernel and filesystem userspace than is in the default OS.
- Retrieve what you can to refresh/create your backups if you can still mount ro.
- Look up how to use ‘btrfs restore’ (Restore - btrfs Wiki) as your last-ditch attempt at data recovery before attempting to repair the filesystem. Repair can end up doing more harm than good. Plus this is all outside the Rockstor team’s expertise, so some intervention from the cited btrfs Libera Chat channel would be advised before you attempt any repair. There are progressively more options the newer your system is. Hence my suggestion to go ‘ro’ in Rockstor, then jump to the command line to do all OS updates and then distribution updates (one at a time), and strive to retrieve the data, not to preserve the Rockstor instance. That can be re-installed easily enough.
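
For the ‘btrfs restore’ step in the list above, a minimal sketch of how a run could be wrapped, to be used as root. It assumes the pool’s device (the placeholder /dev/sdX below) is not mounted at the time, and that the destination is an existing directory on a different, healthy filesystem (the hypothetical /mnt/recovery). The function name restore_from_device is made up for illustration; -D (dry run) and -i (ignore errors) are btrfs restore options:

```shell
#!/bin/sh
# restore_from_device DEVICE DEST: list what btrfs restore can see on
# DEVICE, then copy it into DEST. Nothing is written to DEVICE.
# Returns 2 if DEST is missing, 1 if the dry run fails.
restore_from_device() {
    dev=$1
    dest=$2
    if [ ! -d "$dest" ]; then
        echo "destination $dest does not exist" >&2
        return 2
    fi
    # Dry run first: only lists the files that would be recovered.
    btrfs restore -D "$dev" "$dest" || return 1
    # Real run: -i carries on past errors instead of stopping at the first.
    btrfs restore -i "$dev" "$dest"
}
```

For example: `restore_from_device /dev/sdX /mnt/recovery`, with /dev/sdX replaced by the device shown in btrfs fi show. If the dry run already fails, that output is worth taking to the btrfs chat channel before trying anything more invasive.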

There may be more sense, overall, in using a separate freshly downloaded Tumbleweed instance to attempt data recovery, as you then have cutting-edge software from the get-go. You can then return to your existing Rockstor install to continue if an effective retrieval/repair was achieved. Just don’t have more than one system disk attached at the same time, as it can confuse things. Also, if you do any re-install, say to get a newer version of Rockstor (downloads are now Leap 15.4 with Rockstor package v4.5.6-0), you must disconnect the data drives. And if they fail to import once you have the newer Rockstor instance in play, look to:

Import unwell Pool Disks — Rockstor documentation

Make sure to read our: Data loss - prevention and recovery in Rockstor — Rockstor documentation also as that can help with how things work, but again this assumes multi-disk btrfs where there is more ‘hope’ of repair etc, given the direct access to multiple drives with multiple copies of the same data.

Hope that helps a bit. And my apologies for our docs not being of more help here already.

phillxnet · March 23, 2023, 4:24pm

@GambitZA As per our side channel chat:

Could you update this thread with the result of trying my suggestions to date? Or state where you are currently at with this data-loss scenario. It is always frustrating to be in this situation, but we have, incidentally, in the interim, released a new Tumbleweed based installer with our latest 4.5.8-0 (RC5) rpm pre-installed. It’s on the downloads page with the following proviso:

Development/Advanced-user/Rescue use only

So as per my last post here:

You can in fact now have both a Tumbleweed base OS and the familiarity of Rockstor if that is going to help, which it may. Note that this is not our production target currently, but you are, currently, in the realm of trying to retrieve your data from a poorly pool. Just a thought.

Anyway, if anyone can chip in here with some hand-holding that would be fantastic, as I am a little (read: a lot) distracted currently with getting past the last hurdles of our next Stable release (candidate or otherwise) and with establishing our new fiscal setup (more on this soon hopefully). Thereafter we may be able to offer some one-to-one commercial support for such instances. But alas, not quite just yet.

Hope that helps, and do chip in here if you can, as @GambitZA needs a little more help than I’m unfortunately able to offer currently.

phillxnet · June 8, 2023, 11:18am

@GambitZA Hello again.

I’m just acknowledging your support email, received just now, on this same data retrieval request re your btrfs volume and its transid failure as you reported here.

Are there any forum members available to assist further on this issue? @GambitZA has also contacted the btrfs mailing list for assistance and I have advised, as previously in this thread, that the chat linked in our doc support section here is a preferred option:

Support — Rockstor documentation

“For upstream community btrfs support see Libera Chat - #btrfs channel.”

Hope that helps. And if anyone does assist here re data retrieval, keep in mind there may be parallel advice/discussions on the btrfs mailing list, or indeed the recommended chat facility.

Again my apologies for not being able to assist you further here.