Frightening experience, how to continue?

@suman

I had a rather unsettling experience this evening.

I have been in the process of moving data to my Rockstor NAS, and am almost finished with replacing/installing disks.

I had started a disk removal and let the system work on it. After some time I wanted to check up on the progress, only to find that the system was unresponsive. It didn't react to any input, and I couldn't get into the command line. It seemed completely frozen. A bad thing to happen, I would guess, during a disk removal.

My only option was to cut power and reboot. The system came up, but Rockstor threw a lot of errors at me and wouldn't let me see the pools and shares.

Messed around on the cmd line a little, but btrfs refused to mount the pool. btrfs scrub start /dev/sda only ran for 14 seconds scanning 2 GiB of data and showed no errors.
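(Reading the btrfs man page afterwards: scrubbing a single device like that only checks that one device, and a scrub needs the filesystem mounted anyway. Once the pool mounts, the whole thing can be scrubbed via the mountpoint, roughly like this, with /mnt2/xpoolx standing in for the pool's mountpoint:)

btrfs scrub start /mnt2/xpoolx     # scrub every device in the mounted pool
btrfs scrub status /mnt2/xpoolx    # check progress and error counters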

The cmd line also showed errors upon reboot about not being able to see the root tree on different drives.

Tried rebooting a few times, to no avail. Then decided to give up for now.

But reading about repairing btrfs, I found something about a restore command.

I decided to try it and booted the system. I entered the commands on the cmd line, but to my surprise the restore command told me the pool was already mounted.
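(In case anyone else finds this thread: as I understand it, the restore command works on an unmounted device and copies whatever it can still read out to another location; /dev/sdX and /mnt/recovery are just placeholders here:)

btrfs restore -v /dev/sdX /mnt/recovery    # copy readable files from the unmounted filesystem to /mnt/recovery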

I went into the GUI, and (lo and behold) everything was accessible, and I can see the files on the shares, and open files etc. All drives show up as part of the pool, even the one in the process of being removed during the crash, just with less data shown on it (btrfs fi show).

I have started a scrub, which is slowly progressing and has shown no errors so far.

Although everything seems fine, I am a bit startled by this.

I guess the best thing I can do is to let the scrub finish (looks like it could take about a day or so), see if any errors are reported, and then continue where I left off?

Has anybody else experienced things like this? Especially a seemingly dead fs coming back to life on its own like this?

Please explain this bit in more detail, for example, do you have a hot-swapping RAID controller?

No.
I was using Btrfs's built-in delete function:
btrfs device delete /dev/sdx /mnt2/xpoolx

It was during this that the system died.

It has worked fine on 4 other, smaller disks.

The disks are SATA disks, so in principle hot-swappable.

As I understand Btrfs, it should roll back to the last consistent (committed) state in the event of a crash.

@KarstenV Sorry, I was maybe not clear enough: did you shut down the system and then replace the disk?

No.

The data on the disk had not finished transferring to the other disks in the RAID, so I was unclear what to do.
Btrfs documentation is a little unclear sometimes :smile:

As things stand now, I have the last two disks replaced, and a final balance is ongoing. I haven't experienced any problems since the mounting issues, except this afternoon when I rebooted the system and the filesystem again wouldn't mount. This has me a little worried, but the disk Btrfs complained about is no longer in the system.

When all this balancing is done, I will scrub the entire 10.5 TiB pool and see if anything shows up.

Then, if everything is all right and the system boots normally, I will start transferring the rest of my data to the system.

I am jumping into this late and I realize this is a big topic. I just want to say a few important things relating to DR (disaster recovery) in Rockstor. But first a quick point: after adding/removing disks, you need to do a balance, not a scrub.
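For reference, a balance is started and checked against the pool's mountpoint, roughly like this (with /mnt2/xpoolx standing in for your pool):

btrfs balance start /mnt2/xpoolx     # redistribute data and metadata across the current set of devices
btrfs balance status /mnt2/xpoolx    # see how far the balance has progressed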

It is pretty cool that you got the pool back from the dead! BTRFS indeed is awesome. However, note that we are still behind on disk failure notifications, rebuilding pools and bringing the system back to good health. Right now, these things have to be done manually and Rockstor doesn't provide any support to make it easy… yet. The last time we did a bunch of DR experiments was a few kernel versions ago, and things were not consistent enough to add support. We'll take a fresh look at DR after the 4.2 kernel update (ETA August), with the hope that we can add support sooner rather than later. Until all of this becomes easy, please do keep regular backups.

I started the scrub to weed out any errors that the abruptly halted disk deletion may have caused; some data could have failed to write.
The scrub did not reveal any errors, which worries me, since the volume failed to mount once again after this, complaining about something along the lines of a missing btrfs rt (root tree?) on /dev/sdd.

This disk has now been deleted from the system and replaced with a newer, much bigger disk, and the balance kicked off automatically. The array is now 4x 1.5 TiB and 3x 2 TiB in RAID6, giving me 7.8 TiB of usable space.

I have no chance of backing all this data up, but the most important data is backed up. And eventually one of my old NASes will be set up to start once a week and make further backups.

When the current balance has finished, I will run another scrub to check the system. Any other suggestions on commands I could run to check the health of the pool?
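(The ones I know of so far, besides scrub itself, are along these lines, again assuming /mnt2/xpoolx as the mountpoint; corrections welcome:)

btrfs device stats /mnt2/xpoolx    # per-device read/write/corruption error counters
btrfs fi usage /mnt2/xpoolx        # allocation and free-space overview for the pool
dmesg | grep -i btrfs              # recent kernel messages from btrfs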

When all this has run its course, I will have to shut down the system to move it to its permanent location, since I am now finished with the build (for now). I very much hope everything works by then :slight_smile:

Well, the system seems to be all right.

I have made a few reboots today, and the Btrfs filesystem has mounted without problems every time.

Have also installed the latest update.

All files have been transferred to the NAS, and I will now start using it on a daily basis. We will see how well it runs :slight_smile:

@KarstenV SATA is hot-swappable, but you have to make sure that all your hardware is able to do hot swapping. If you have any doubt, then I would suggest for future maintenance: let each job run until finished, then power down, wait a few seconds and replace the disks, and before you put a new disk in, zero-wipe it. I have made it a habit, like getting in the car and putting the seatbelt on before starting the engine. It will avoid a lot of trouble in the long run. :wink:
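By zero-wiping I mean clearing at least the old signatures at the start of the disk before reusing it, for example something like this (with /dev/sdX being the disk you are about to reuse, so double-check the letter):

wipefs -a /dev/sdX                              # remove old filesystem signatures
dd if=/dev/zero of=/dev/sdX bs=1M count=100     # or zero the first 100 MiB where old metadata usually lives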

For future reference, read about btrfs device replace. You need to add the replacement disk to the system ahead of time, but the command was created to handle the situation where btrfs device delete stalls due to a failing device. In particular, you can use the -r switch to make it read from every disk except the failing one to fill the new disk with data.

Unfortunately it doesn’t work with RAID 5/6 yet.
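For completeness, the invocation looks roughly like this (sdold being the failing disk, sdnew the already-connected replacement, and /mnt2/xpoolx the pool's mountpoint, all placeholders):

btrfs replace start -r /dev/sdold /dev/sdnew /mnt2/xpoolx    # rebuild onto the new disk, avoiding reads from the failing one where possible
btrfs replace status /mnt2/xpoolx                            # check replace progress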

Thanks for your advice.

I know the procedure about deleting a drive, I have actually done it quite a few times, in the process of moving data to the NAS, and expanding it.

Unfortunately, the system hung during one of the delete processes, which was the start of all of my problems.

Luckily for me, everything seems to have worked itself out.