Integer out of range

dunkelfalke · January 6, 2022, 3:58pm


Brief description of the problem

Getting the error below everywhere I go, mostly with the actual UI below it, except for the pool properties which aren’t loading at all.

Detailed step by step instructions to reproduce the problem

Started after a scrub; reproducible on my system on essentially every page of the Web-UI.

Web-UI screenshot

[Screenshot: rockstor]

Error Traceback provided on the Web-UI


Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
    yield
  File "/opt/rockstor/src/rockstor/storageadmin/views/pool_scrub.py", line 46, in get_queryset
    self._scrub_status(pool)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/pool_scrub.py", line 65, in _scrub_status
    PoolScrub.objects.filter(id=ps.id).update(**cur_status)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/query.py", line 563, in update
    rows = query.get_compiler(self.db).execute_sql(CURSOR)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/sql/compiler.py", line 1062, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/utils.py", line 98, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
DataError: integer out of range

phillxnet · January 6, 2022, 6:30pm

@dunkelfalke Welcome to the Rockstor community.

Thanks for the report. This one is a bit of a puzzle actually.

Was the pool scrub initiated at the command line or via the Rockstor Web-UI?

Also could you give us the output of the following commands:

btrfs fi show

and the status of the scrub on the relevant pool, via:

btrfs scrub status /mnt2/Pool-label-here

General notes:

The core of this issue looks to be in attempting to get the pool scrub current status:

  File "/opt/rockstor/src/rockstor/storageadmin/views/pool_scrub.py", line 46, in get_queryset
    self._scrub_status(pool)

github.com

rockstor/rockstor-core/blob/master/src/rockstor/storageadmin/views/pool_scrub.py#L43-L47


def get_queryset(self, *args, **kwargs):
    with self._handle_exception(self.request):
        pool = self._validate_pool(self.kwargs["pid"], self.request)
        self._scrub_status(pool)
        return PoolScrub.objects.filter(pool=pool).order_by("-id")

Which in turn calls the code a few lines down:

github.com

rockstor/rockstor-core/blob/66fc998bad53b38219b6d0934048e1330e31b2fd/src/rockstor/storageadmin/views/pool_scrub.py#L49-L66


@transaction.atomic
def _scrub_status(self, pool):
    try:
        ps = PoolScrub.objects.filter(pool=pool).order_by("-id")[0]
    except:
        return Response()
    if ps.status == "started" or ps.status == "running":
        cur_status = scrub_status(pool)
        if (
            cur_status["status"] == "finished"
            or cur_status["status"] == "halted"
            or cur_status["status"] == "cancelled"
        ):
            duration = int(cur_status["duration"])
            cur_status["end_time"] = ps.start_time + timedelta(seconds=duration)
            del cur_status["duration"]
        PoolScrub.objects.filter(id=ps.id).update(**cur_status)
    return ps

And somehow receiving or generating an inappropriate number for our database handling.

I think this may be just a reporting issue but given it’s upsetting the database it will likely hold up a lot of the Web-UI. A Browser refresh may help for now and once the associated pool scrub has finished this may clear up.

Were there any special steps that led up to this issue, i.e. how was the scrub started? Once we have a way to reproduce this issue we can move it to GitHub and work on figuring out how we might guard against such locks.

Many of our errors, as reported in the Web-UI, are sticky of sorts, and a browser refresh can help clear them up. It may be that this is a transient error that is sticking within the Web-UI, hence its appearance all over. But if the DB is locked or snagged somehow then many of the pages will be inaccessible.

We need a way to reproduce this occurrence and we can then track down what we are doing/not-doing/not-avoiding or whatever. We’ve not had a report of this nature in this area of the code for a long time and this code hasn’t been changed much for quite some time also so it’s a bit of a puzzle.

What is the system/hardware you are experiencing this on? I.e. CPU core count, memory, etc.
See: https://rockstor.com/docs/installation/quickstart.html#minimum-system-requirements

Thanks again for the report.

Flox · January 6, 2022, 6:42pm

I also noticed that you are running a 5.15.xxx kernel… I thus suppose this is a customized install in some way?

dunkelfalke · January 6, 2022, 7:09pm

The scrub was started through the UI and has been finished quite a while ago. Unfortunately I don’t speak Python yet, so I cannot debug that script, but at a glance there should be nothing wrong with it if casting a hh:mm:ss string to integer returns the amount of seconds in Python.
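A side note on that cast, as a minimal sketch rather than anything taken from Rockstor's code: in Python, `int()` does not turn an `h:mm:ss` string into seconds; `int("4:58:07")` raises `ValueError`. The conversion has to be written out. The helper name `hms_to_seconds` below is hypothetical.

```python
def hms_to_seconds(text):
    """Convert an 'h:mm:ss' duration string to whole seconds.

    Hypothetical helper for illustration only; not part of Rockstor.
    """
    parts = text.strip().split(":")
    if len(parts) != 3:
        raise ValueError("expected h:mm:ss, got {!r}".format(text))
    hours, minutes, seconds = (int(p) for p in parts)
    return hours * 3600 + minutes * 60 + seconds


print(hms_to_seconds("4:58:07"))   # 17887
print(hms_to_seconds("00:00:20"))  # 20
```

Both sample durations are the ones that appear in the scrub outputs quoted in this thread.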

The hardware is a two core AMD A4-4000 with 4 gigabytes of RAM, a 32 gigabytes SSD and 8x 3 terabytes HDDs connected to a LSI SAS2008 controller in IT mode.

Here is the output:

helm:~ # btrfs fi show
Label: 'ROOT'  uuid: 0fcf448d-9f7b-4d39-8a97-569251a6eeac
        Total devices 1 FS bytes used 4.62GiB
        devid    1 size 27.75GiB used 4.83GiB path /dev/sda4

Label: 'Helm'  uuid: 7ad2dd67-7c38-4260-84ae-dce3843b717b
        Total devices 8 FS bytes used 5.06TiB
        devid    1 size 2.73TiB used 1.90TiB path /dev/sde
        devid    2 size 2.73TiB used 1.90TiB path /dev/sdb
        devid    3 size 2.73TiB used 1.90TiB path /dev/sdc
        devid    4 size 2.73TiB used 1.90TiB path /dev/sdd
        devid    5 size 2.73TiB used 1.90TiB path /dev/sdh
        devid    6 size 2.73TiB used 1.90TiB path /dev/sdg
        devid    7 size 2.73TiB used 1.90TiB path /dev/sdi
        devid    8 size 2.73TiB used 1.90TiB path /dev/sdf

helm:~ # btrfs scrub status /mnt2/Helm
UUID:             7ad2dd67-7c38-4260-84ae-dce3843b717b
Scrub started:    Mon Jan  3 18:29:55 2022
Status:           finished
Duration:         4:58:07
Total to scrub:   15.18TiB
Rate:             879.79MiB/s
Error summary:    no errors found

It might have something to do with me converting the pool from RAID1 to RAID1C3 for additional data security some time before (hence a newer kernel was required), but the web UI worked fine until I started the scrub.

phillxnet · January 7, 2022, 6:00pm

@dunkelfalke Hello again and thanks for the feedback and command output.

I think @Flox has it here with the observation that you are running a 5.15.xxx kernel.

Notice the difference between your 5.15.xxx kernel scrub output:

helm:~ # btrfs scrub status /mnt2/Helm
UUID:             7ad2dd67-7c38-4260-84ae-dce3843b717b
Scrub started:    Mon Jan  3 18:29:55 2022
Status:           finished
Duration:         4:58:07
Total to scrub:   15.18TiB
Rate:             879.79MiB/s
Error summary:    no errors found

and for example one from our currently supported openSUSE Leap 15.3 kernel:

rleap15-3:~ # btrfs scrub status /
scrub status for 475bbc39-01c8-4cdd-bb22-de07b40f7e13
        scrub started at Mon Jan  3 11:50:00 2022 and finished after 00:00:20
        total bytes scrubbed: 4.45GiB with 0 errors
rleap15-3:~ # uname -a
Linux rleap15-3 5.3.18-59.37-default #1 SMP Mon Nov 22 12:29:04 UTC 2021 (d10168e) x86_64 x86_64 x86_64 GNU/Linux

So they are actually quite different, and it looks like we have some adaptation to do there. This is unlikely to be done before our v4 Stable release, but it may make for a nice ‘hot fix’ once we are past the initial re-release that our v4 “Built on openSUSE” endeavour represents.
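To make the difference concrete, here is a rough sketch of reading the newer “Key: value” layout into a dictionary. It is only an illustration: the function name `parse_keyed_scrub_status` is hypothetical, it looks at the plain (non `-R`) output quoted above, and it is not how Rockstor's own scrub_status() procedure works.

```python
def parse_keyed_scrub_status(lines):
    """Parse 'Key:   value' style scrub status lines into a dict.

    Hypothetical helper. Returns {} when the first non-empty line does
    not look like the newer layout (i.e. does not start with 'UUID:').
    """
    rows = [ln.strip() for ln in lines if ln.strip()]
    if not rows or not rows[0].startswith("UUID:"):
        return {}
    fields = {}
    for row in rows:
        # Split on the first colon only, so times like 18:29:55 survive.
        key, sep, value = row.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


new_style = [
    "UUID:             7ad2dd67-7c38-4260-84ae-dce3843b717b",
    "Scrub started:    Mon Jan  3 18:29:55 2022",
    "Status:           finished",
    "Duration:         4:58:07",
]
old_style = [
    "scrub status for 475bbc39-01c8-4cdd-bb22-de07b40f7e13",
    "        scrub started at Mon Jan  3 11:50:00 2022 and finished after 00:00:20",
]

print(parse_keyed_scrub_status(new_style)["Duration"])  # 4:58:07
print(parse_keyed_scrub_status(old_style))              # {}
```

The older layout carries its status and duration inside one sentence, so a parser written for one layout silently misreads the other unless it first checks which one it is looking at.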

You may well be able to ‘cancel out’ this issue by editing your version of the following code:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/fs/btrfs.py#L1657-L1668


      
          if e.err[0] == emsg:
              logger.info(
                  "Pool: {} is Read-only, skipping qgroup limit.".format(pool.name)
              )
              return out, err, rc
          # quotas disabled results in o = [''], rc = 1 and e[0] =
          emsg2 = "ERROR: unable to limit requested quota group: Invalid argument"
          # quotas disabled is not a fatal failure but here we key from what
          # is a non specific error: 'Invalid argument'.
          # TODO: improve this clause as currently too broad.
          # TODO: we could for example use if qgroup_max(mnt) == -1
          if e.err[0] == emsg2:

and rebooting.

Given it reports on many states of the scrub, it will take a little while to develop a dual personality for this procedure, but you might want to give it a try. Python is pretty easy as it goes, and that parsing procedure is also relatively straightforward to understand. If you do, remember we are still on Python 2.7; we are working on that and it’s our top priority in the next testing release, so lots of breakage to come on that front :).

Thanks for the report, much appreciated. I’ve created the following issue as a result:

github.com/rockstor/rockstor-core

Support future kernel scrub output

opened 05:53PM - 07 Jan 22 UTC

closed 03:38PM - 16 Feb 23 UTC

phillxnet

Thanks to forum member dunkelfalke for highlighting this issue. If one runs a no…n default kernel e.g. a Stable backports kernel such as is referenced in the following pull request: "Remarked Kernel_Head & Kernel_stable_Backport + filesystems repos ..." See: https://github.com/rockstor/rockstor-installer/pull/88 The btrfs scrub status output is radically changed. This in turn throws our scrub reporting and ends up in a database error akin to: ``` DataError: integer out of range ``` As detailed in the following forum thread: https://forum.rockstor.com/t/integer-out-of-range/8183 [EDIT] From a mostly functional point of view this issue has been addressed: see the comment copied from below: > Linking to potentially related work-done in the interim: > > Scrub UI - Integer Out of Range https://github.com/rockstor/rockstor-core/issues/2397 https://github.com/rockstor/rockstor-core/pull/2398 @Hooverdan96 > > And to a prior adaptation on scrub format changes: > > BTRFS SCRUB status enhancements for rockstor https://github.com/rockstor/rockstor-core/pull/2157 @ubenmackin @FroggyFlox But is now awaiting the initially missed test coverage: See the following linked pull request below for this connection. Refactor scrub status parsing to enhance/enable testing #2342 #2493

And remember that, given you have no scrub reporting at all currently, your ‘modification’ to that scrub_status() procedure could be pretty drastic: as long as the return from it is sane, the rest of the system will see nothing out of order.

Hope that helps and thanks again for the report.

dunkelfalke · January 7, 2022, 6:24pm

Sure, will probably need to learn Python some day at my new job, so might just as well start now. Not being really familiar with git either (used Subversion until now), would it be okay if I post my changes to the btrfs scrub parser here, if I manage to get it working, instead of creating a pull request?

phillxnet · January 7, 2022, 6:57pm

@dunkelfalke Glad you’re game. It would be good to have more input from the future :).

There would be a great deal of facility in a basic understanding of git. Plus if you take a look at our:
“Community Contributions” doc section here:
Community Contributions — Rockstor documentation
specifically the developers section:
Contributing to Rockstor - Overview — Rockstor documentation
it walks you through the exact procedure to get a local git working copy, branch it, make changes, and push it to GitHub in the form of a pull request.
Any tricky parts you find within that doc would be another area you could help us with actually.

And to get your existing install in a state where it knows nothing of any scrub status malarkey you could just get that procedure to not:

 return stats

and instead return the fail-over:

return {"status": "unknown"}

This is what should have happened actually. But our existing code thought it did know what it was looking at, and instead of returning the equivalent of “I don’t know” it likely sent back some crazy values that ended up throwing the database fields concerned. Hence the “integer out of range” error.
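A minimal sketch of that fail-over idea, under stated assumptions: the wrapper name `safe_scrub_stats` and the `parse` argument are hypothetical, and the set of known statuses is only what appears in this thread (the `_scrub_status` code quoted earlier, plus "unknown"). It is not Rockstor's implementation.

```python
# Statuses seen in the code quoted in this thread, plus the fail-over value.
KNOWN_STATUSES = {"started", "running", "finished", "halted", "cancelled", "unknown"}


def safe_scrub_stats(parse, lines):
    """Run a parser and fall back to {'status': 'unknown'} on any doubt.

    `parse` is any callable taking the output lines and returning a dict
    (a stand-in for the real parsing procedure).
    """
    unknown = {"status": "unknown"}
    try:
        stats = parse(lines)
    except (ValueError, IndexError, KeyError):
        # Typical failures when the output layout is not what was expected.
        return unknown
    if not isinstance(stats, dict):
        return unknown
    if stats.get("status") not in KNOWN_STATUSES:
        return unknown
    return stats


def broken_parser(lines):
    # Simulates offsets that no longer match the output layout.
    return {"status": lines[5]}


print(safe_scrub_stats(broken_parser, ["only one line"]))  # {'status': 'unknown'}
```

The point of the design is that an unrecognised layout degrades to "unknown" in one place, rather than letting half-parsed values travel on to the database update.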

Keep in mind that you only need a very basic knowledge of git to contribute. But that knowledge can go a long way and our current docs as referenced above should be enough. And if not then please let us know where they fall down and we can hopefully improve them as and when.

I look forward to any future pull requests from you. Although this task in particular, having taken another look, is not that trivial: but all the more rewarding to tackle potentially. I’d forgotten that it’s already been extended to accommodate a past format change via an earlier issue

and this pull request against that issue:

github.com/rockstor/rockstor-core

BTRFS SCRUB status enhancements for rockstor

rockstor:master ← ubenmackin:scrub_status_enhancement

opened 09:39PM - 29 Apr 20 UTC

ubenmackin

+85 -9

Updated the scrub status code to check the version of btrfs-progs that is instal…led. Starting with 5.1.2, the scrub status output changed. This code will handle both cases, either using the newer output format or the older. Additionaly, if running with the newer btrfs progs, modified the Scrub Status Summary and Detail pages to show the ETA as well as the current scrub speed. Fixes #2162 Fixes #1922

And a shortfall on my side in that pull request was to not insist on our unit test extension, given this code is in the critical path for basic functions. If you take a look at the following you will find our test for its prior function:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/fs/tests/test_btrfs.py#L465-L468


      
          "0/262        16.00KiB     16.00KiB ",
          "0/263        16.00KiB     16.00KiB ",
          "2015/1          0.00B        0.00B ",
          "2015/2          0.00B        0.00B ",

before it was modified. But we now don’t have any test data to prove its existing function as well, so any future changes to accommodate our future requirements (the 5.15 you are reporting on) can’t be tested to see if they break our current and past users. Another appropriately versioned test data output could sort that before the changes to accommodate 5.15 are applied. Ideally those changes would be led by a further example output (test data) such as I ask for below. The changes could then be made to that procedure such that it passes old, current, and future scrub outputs. Hence my comment, on looking more closely at that procedure, that it’s a non-trivial task. It is entirely doable, however, so don’t let me put you off. Testing can be a little daunting if you’ve not done any before, but again, bit by bit is the way to go on that front.
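As a rough idea of what versioned test data can look like, here is a small table-driven test sketch. Everything in it is an assumption for illustration: `parse_status` is a hypothetical stand-in for the procedure under test, the sample labels are made up, and the sample lines are the plain (non `-R`) outputs quoted in this thread rather than the raw output the real procedure reads.

```python
import unittest


def parse_status(lines):
    """Hypothetical stand-in for the procedure under test."""
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("Status:"):
            return stripped.split(":", 1)[1].strip()
        if "finished after" in stripped:
            return "finished"
    return "unknown"


# One entry per output format: (label, captured output lines, expected status).
SAMPLES = [
    (
        "older-layout",
        [
            "scrub status for 475bbc39-01c8-4cdd-bb22-de07b40f7e13",
            "        scrub started at Mon Jan  3 11:50:00 2022 and finished after 00:00:20",
            "        total bytes scrubbed: 4.45GiB with 0 errors",
        ],
        "finished",
    ),
    (
        "newer-layout",
        [
            "UUID:             7ad2dd67-7c38-4260-84ae-dce3843b717b",
            "Scrub started:    Mon Jan  3 18:29:55 2022",
            "Status:           finished",
            "Duration:         4:58:07",
        ],
        "finished",
    ),
    ("unrecognised", ["something entirely different"], "unknown"),
]


class ScrubStatusSamples(unittest.TestCase):
    def test_every_sample(self):
        for label, lines, expected in SAMPLES:
            # msg identifies which captured sample failed.
            self.assertEqual(parse_status(lines), expected, msg=label)
```

Adding a new format then means appending one more entry to `SAMPLES`; the older entries keep guarding existing behaviour. Saved as a test module, this can be run with `python -m unittest`.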

My apologies for not maintaining this test in the interim by the way.

Note that you can make changes to your instance and just restart your rockstor services and the changes will take effect thereafter. For that type of Python code you don’t need a full development environment. But given that piece of code has an associated unit test, a contribution fix would ideally be accompanied by a test to prove its function.

Out of interest could you also give us the output of the actual command that procedure runs?
I.e. from:

    out, err, rc = run_command([BTRFS, "scrub", "status", "-R", mnt_pt])

we have

btrfs scrub status -R /mnt2/Helm

The run_command is a wrapper so we can run stuff within a harness of sorts.

Hope that helps.

dunkelfalke · January 7, 2022, 7:40pm

Understood. Will learn the basic usage of git as well. It is just very weird and seems counterintuitive compared to all the previous source controls I have used, but it seems to be important to know nowadays.

To be honest, I do have two decades of software development experience, but for legacy embedded systems, so I am sort of a dinosaur and the current ways of software development have largely passed me by. Being too lazy to update the knowledge is now coming back to haunt me.

And sorry for asking such a basic question:
Have seen the versioning in the source, the function looks pretty straightforward. Thought about trying out a small change just to get comfortable with the workflow, but unfortunately nothing changes whatsoever. I was assuming that since Python is an interpreted language, changing /opt/rockstor/src/rockstor/fs/btrfs.py would be sufficient, but I get the same error message despite doing this:

    if parse_version(btrfsProgsVers) < parse_version("v5.1.2"):
        statOffset = 1
        durOffset = 1
        fieldOffset = 2
        haltOffset = -3
    elif parse_version(btrfsProgsVers) < parse_version("v5.15"):
        statOffset = 2
        durOffset = 3
        fieldOffset = 4
        haltOffset = -1
    else:
        stats["status"] = "halted"
        stats["duration"] = 4711
        return stats

Am I missing something here?

phillxnet · January 7, 2022, 7:51pm

@dunkelfalke Hello again.
Re: your change to btrfs.py not taking effect, and “Am I missing something here?”:

Not really, just that you will need to stop and then start all the Rockstor processes. That way you drop the running versions from memory.

See the following doc section on the associated systemd services we run under and what they do:
https://rockstor.com/docs/contribute/contribute.html#code-build

So you just need to stop those services and start the ‘last’ one as it then, in-turn, starts all the others again.
i.e.

systemctl stop rockstor-bootstrap rockstor rockstor-pre

then start the ‘top’/‘last’ one via:

systemctl start rockstor-bootstrap

Then you should be running the new version of the pcode resulting from your source changes.

Hope that helps.

Hooverdan · January 7, 2022, 8:31pm

I think, one thing to remember is that, independent of the output comparisons @Flox and @phillxnet made further up, the code looks at the “raw” output of the command (i.e. using option -R as shown by @phillxnet in the previous answer). That makes some things a little easier in parsing the output.
If you’re crazy enough to check the latest version of btrfs-progs, you can look here (if I interpreted the status output of the latest versions correctly):

github.com

kdave/btrfs-progs/blob/b16b0a766f06138de2ee32d4d84b7110e469ff49/cmds/scrub.c#L309


	hours = ss->duration / (60 * 60);
	gmtime_r(&seconds, &tm);
	strftime(t, sizeof(t), "%M:%S", &tm);
	printf("Status:           %s\n",
			(ss->in_progress ? "running" :
			 (ss->canceled ? "aborted" :
			  (ss->finished ? "finished" : "interrupted"))));
	printf("Duration:         %u:%s\n", hours, t);
}


static void print_scrub_dev(struct btrfs_ioctl_dev_info_args *di,
				struct btrfs_scrub_progress *p, int raw,
				const char *append, struct scrub_stats *ss)
{
	printf("\nScrub device %s (id %llu) %s\n", di->path, di->devid,
	       append ? append : "");


	_print_scrub_ss(ss);


	if (p) {
		if (raw)

and here:

github.com

kdave/btrfs-progs/blob/b16b0a766f06138de2ee32d4d84b7110e469ff49/cmds/scrub.c#L275


	_SCRUB_FS_STAT_ZMAX(ss, canceled, fs_stat);
	_SCRUB_FS_STAT_MIN(ss, finished, fs_stat);
}


static void init_fs_stat(struct scrub_fs_stat *fs_stat)
{
	memset(fs_stat, 0, sizeof(*fs_stat));
	fs_stat->s.finished = 1;
}


static void _print_scrub_ss(struct scrub_stats *ss)
{
	char t[4096];
	struct tm tm;
	time_t seconds;
	unsigned hours;


	if (!ss || !ss->t_start) {
		printf("\tno stats available\n");
		return;
	}

and here:

github.com

kdave/btrfs-progs/blob/b16b0a766f06138de2ee32d4d84b7110e469ff49/cmds/scrub.c#L115


	struct scrub_progress *shared_progress;
	pthread_mutex_t *write_mutex;
};


struct scrub_fs_stat {
	struct btrfs_scrub_progress p;
	struct scrub_stats s;
	int i;
};


static void print_scrub_full(struct btrfs_scrub_progress *sp)
{
	printf("\tdata_extents_scrubbed: %lld\n", sp->data_extents_scrubbed);
	printf("\ttree_extents_scrubbed: %lld\n", sp->tree_extents_scrubbed);
	printf("\tdata_bytes_scrubbed: %lld\n", sp->data_bytes_scrubbed);
	printf("\ttree_bytes_scrubbed: %lld\n", sp->tree_bytes_scrubbed);
	printf("\tread_errors: %lld\n", sp->read_errors);
	printf("\tcsum_errors: %lld\n", sp->csum_errors);
	printf("\tverify_errors: %lld\n", sp->verify_errors);
	printf("\tno_csum: %lld\n", sp->no_csum);
	printf("\tcsum_discards: %lld\n", sp->csum_discards);

but that might of course be overkill; I just went down a rabbit hole …

dunkelfalke · January 10, 2022, 2:42pm

Now this is getting really weird.
Finally found the time to sit down and try to pinpoint the cause, saved the output to json, loaded it into visual studio for easier debugging and ran the scrub_status function with the saved output data. No errors.
I thought maybe that is due to playing around with the code, restored the original btrfs.py, used the Rockstor UI to go to the pools page, and suddenly everything works just fine.
Started a new scrub and it still works.
Tested all the original offsets in the code for versions >= 5.1.2 and they seem to be fine as they are.

Well, I guess this can be closed then since I cannot reproduce it anymore. Maybe it was a glitch somewhere.

phillxnet · January 10, 2022, 8:00pm

@dunkelfalke Hello again.

Yes that is weird. But it’s still highlighted that we don’t have test coverage of the most recent additions to that rather critical function. So I’ll leave the associated issue open where we can also add test data from the latest kernel.

Thanks again for the feedback here. We definitely have some robustness work to be done in this area.

KarstenV · April 30, 2022, 5:46am

The problem is still there to some degree.

I updated my Rockstor hardware yesterday, and installed 4.1.0.0 and updated to the newest available kernel (I like to have the newest btrfs code running) and got this:

Other than that the installation of Rockstor was really smooth, and it seems to work really well.

Brief description of the problem

When I click on my Pool, I get an error message. Ran a scrub yesterday and before that it worked.

Detailed step by step instructions to reproduce the problem

I don’t know exactly.
I started a scrub yesterday and went to see its progress, but got this error message.
According to btrfs scrub status via Putty, the scrub is finished and with no errors.

Web-UI screenshot

Error Traceback provided on the Web-UI


Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
    yield
  File "/opt/rockstor/src/rockstor/storageadmin/views/pool_scrub.py", line 46, in get_queryset
    self._scrub_status(pool)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/pool_scrub.py", line 65, in _scrub_status
    PoolScrub.objects.filter(id=ps.id).update(**cur_status)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/query.py", line 563, in update
    rows = query.get_compiler(self.db).execute_sql(CURSOR)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/sql/compiler.py", line 1062, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/utils.py", line 98, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
DataError: integer out of range

KarstenV · April 30, 2022, 6:23am

Reading another thread where initiating a scrub from the command line seemed to fix it, I did the same.

And now the UI works.

I’ll keep an eye on it, and see if that changes again.

KarstenV · April 30, 2022, 2:34pm

Oh, well after the scrub finished, the error returned.

I guess it will be fixed at some point. It’s not critical functionality for me, I can monitor from the command line, and the NAS itself works well.

phillxnet · April 30, 2022, 8:16pm

@KarstenV Hello again.
Re: updating to the newest available kernel:

Did you also add the filesystem repo as described in our howto on running a backported kernel?
“Installing the Stable Kernel Backport”
https://rockstor.com/docs/howtos/stable_kernel_backport.html

That way you also get newer userland btrfs utils, which is important.

OK, that’s good to know. We may well have a corner case going on here with pending scrubs.

This may well be down to the output from the newer kernel being a little different. This has happened before. Are you using a cutting-edge kernel or the backported stable one suggested in that howto?

Thanks for the report by the way. Once we have a reproducer for this we can create an issue and get to the fix. Does it happen only after you install a backported or newer kernel, for instance?

Cheers. And I’m glad you’re now on v4.1.0 and that it’s mostly going well. Far newer software across the board now and we get to finally use a standard upstream distro kernel (by default anyway).

KarstenV · May 1, 2022, 6:17am

I am on the cutting-edge kernel. I think BTRFS reports 5.16.2, so only one step behind the newly released 5.17. I did follow the steps in the guide you linked, to add the repos.

I had to go to 4.1.0, as my new HW wouldn’t boot the old version.
But I had planned to anyhow for a while.
I’m in the middle of moving from an apartment to a house, so I don’t have much time for these projects.

Apart from this bug, the experience is good, and the new SW and HW seems solid.

dunkelfalke · August 31, 2022, 7:27am

I think we may have been looking at the wrong function.
The data scrub_status() returns looks fine, but my array might be large enough to cause an integer overflow somewhere (maybe a value set to int32 where it should be int64 or something)

_scrub_status(self, pool) crashes at this line:
PoolScrub.objects.filter(id=ps.id).update(**cur_status)

so I tried to set some values to 0 just in case and one of these three lines removed the error:
cur_status["csum_discards"] = 0
cur_status["data_extents_scrubbed"] = 0
cur_status["last_physical"] = 0
Unfortunately since the issue is intermittent and goes away for a while if the filter runs once correctly, I cannot pinpoint it any further yet.
Anyway, the whole cur_status object looks like this:

{'status': 'finished', 'csum_discards': 4158515046, 'super_errors': 0, 'data_extents_scrubbed': 260229903, 'last_physical': 2840081727488, 'tree_bytes_scrubbed': 17397153792, 'no_csum': 0, 'read_errors': 0, 'verify_errors': 0, 'uncorrectable_errors': 0, 'rate': '541.20MiB/s', 'malloc_errors': 0, 'unverified_errors': 0, 'tree_extents_scrubbed': 1061838, 'duration': 30046, 'kb_scrubbed': 16634060184, 'csum_errors': 0, 'corrected_errors': 0}
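One quick way to narrow this down, sketched here as an illustration only, is to list which values in that dictionary fall outside the signed 32-bit range. The function name `out_of_int32_range` is hypothetical. The sketch does not inspect the Django models, so it only shows candidates; whether a given field is actually stored in a 32-bit column has to be checked in the model definitions.

```python
import numbers

INT32_MIN = -2 ** 31
INT32_MAX = 2 ** 31 - 1  # 2147483647

# The cur_status object exactly as posted above.
cur_status = {
    'status': 'finished', 'csum_discards': 4158515046, 'super_errors': 0,
    'data_extents_scrubbed': 260229903, 'last_physical': 2840081727488,
    'tree_bytes_scrubbed': 17397153792, 'no_csum': 0, 'read_errors': 0,
    'verify_errors': 0, 'uncorrectable_errors': 0, 'rate': '541.20MiB/s',
    'malloc_errors': 0, 'unverified_errors': 0,
    'tree_extents_scrubbed': 1061838, 'duration': 30046,
    'kb_scrubbed': 16634060184, 'csum_errors': 0, 'corrected_errors': 0,
}


def out_of_int32_range(stats):
    """Return the sorted keys whose integer values do not fit in 32 bits."""
    return sorted(
        key for key, value in stats.items()
        if isinstance(value, numbers.Integral)
        and not isinstance(value, bool)
        and not INT32_MIN <= value <= INT32_MAX
    )


print(out_of_int32_range(cur_status))
# ['csum_discards', 'kb_scrubbed', 'last_physical', 'tree_bytes_scrubbed']
```

Of the three fields zeroed above, `csum_discards` and `last_physical` are on that list while `data_extents_scrubbed` is not.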

dunkelfalke · August 31, 2022, 3:31pm

mwahahahahahaha I found it!

storageadmin/models/scrub.py
csum_discards = models.IntegerField(default=0)

An integer. Values from -2147483648 to 2147483647 are safe in all databases supported by Django.

My value was 4158515046, almost twice as large.
My suggestion would be to replace
csum_discards = models.IntegerField(default=0)
with
csum_discards = models.BigIntegerField(default=0)
but I have zero experience with django, so I’ll leave it to the Rockstor devs

phillxnet · August 31, 2022, 5:19pm

@dunkelfalke Nice find: well done.

This looks very much like a likely candidate indeed. Changing models does involve the creation (auto usually) of a migration file, we cover this here:
https://rockstor.com/docs/contribute/contribute.html#database-migrations
So yes, there is a slight complexity with shipping this change, and an additional burden in testing the resulting migration against at least the last stable version that precedes it and also the most recent git tagged version. But that is the only additional change required bar the model definition change such as you’ve indicated.

Do you fancy submitting such a PR to address this, given your tenacity in sticking to this one and likely finding the rather deep root cause? If so, do take into account our pending contributor doc changes detailed in our doc repo issue here:

github.com/rockstor/rockstor-doc

From 4.1.0-0 stable release, the new testing branch is default for contributions.

opened 02:27PM - 20 Jan 22 UTC

closed 04:47PM - 11 Sep 22 UTC

phillxnet

Our current developer on-boarding docs mention only the master branch. As from o…ur recent 4.1.0-0 stable git tag rpm release we are now to develop our testing channel rpms from a newly established testing branch within git. The master branch is now reserved for our stable releases and so our docs should reflect this new split. The new split to branch based development is to enable the massive code wide changes that are required for our pending Language and tools wide upgrade. E.g.: https://github.com/rockstor/rockstor-core/issues/1877 and it's indicated, in-development, sub-part dependency of: https://github.com/rockstor/rockstor-core/issues/2254 This 'movement' within the project should also be made clear on the developers page so that folks can choose where they think their contributions are to be submitted. But in almost all cases all code submissions will now go to our testing branch first. Cherry picking may then take place but the projects main focus is now to lift our dependencies across the board and move testing towards the next stable but via newer Python and django and associated libraries. The following issue in an example of this efforts beginning by moving what was one of our oldest dependencies ztaskd to be replaced by one one of our newest: https://github.com/rockstor/rockstor-core/pull/2283 However such 'up-lift' is nigh on impossible until we approach first our Django up-lift and then our Python up-lift. Which in turn will allow for a further Django upgrade and we can then approach the remaining dependencies piecemeal; including our now in need of some attention build system. The above details should at least be indicated to new contributors to provide context. But in short all new contributions hit testing first and only long time contributors should be contributing directly to master give it's essentially a stop-gap to maintain a usable build as we undertake much needed maintenance work. 
It should also be noted that testing is again soon to be non-functional as there is soon to be the contribution of ongoing Django updates that are extensive and will have breaking changes that are to be addressed in an ongoing manner in the new testing channel.

Incidentally our Django version is currently way older in stable (master branch has 1.8.16) but has begun its new growth towards a string of updates to come (once we move to Python 3) in our current testing channel (1.11.29).

I’ve been working a little in the docs of late so I may get to those changes soon hopefully.

If not, no worries and thanks a bunch for hunting this one down. Much appreciated.

Feel free to at least create an issue for your observed failure scenario, with the reproducer scrub output and proposed fix. That way we have easier attribution in the event you don’t fancy doing the actual pull request.

@Hooverdan @Flox & @KarstenV what do you think of @dunkelfalke’s proposed fix here? From a quick look over what’s happened here to date it does seem to fit the evidence.

And reproduced by @KarstenV

And it would be a relatively simple, if db-migration-laden, important fix. The db migrations always worry me as it goes.

Hope that helps.

@Flox Regarding the core nature of this failure we may want to consider this for a stable (master branch) cherry pick also. It’s just those pesky db migrations that raise the risk level of such an update.
It would also be grand to have our tests extended to exercise this failure to both prove the fix and avoid future regressions.