Hey,
I am glad you like Rockstor and thanks for your request and help in making it better.
I’ve submitted an issue for this: https://github.com/rockstor/rockstor-core/issues/547
You can track the progress there. Btw, feel free to raise issues that are important to you here or, even better, directly on GitHub.
I am not really sure if/when/how we can support this. We do understand that dedup is very important to a lot of users. So the above issue is filed so we don’t forget this request. Specifics will emerge over time as and when we get to it. Thank you!
It would be good if this could use built-in BTRFS dedup, rather than running on top of another layer. I have been doing some tests using duperemove ( https://github.com/markfasheh/duperemove ) but it doesn’t seem to be production-ready yet.
You are right, sprint. opendedup, as suggested by Robsi, is not even possible on top of BTRFS as far as I can tell, but I plan to spend some time playing with it in the future anyway and decide if there is a fit.
BTRFS dedup is happening, just a matter of time.
Thanks for duperemove link. Please do share specifics of your test results if you think they are useful for the community.
AFAIK opendedup just presents a block device that you should theoretically be able to make a BTRFS filesystem on top of.
I tested duperemove on a filesystem containing ~700GB of data in 500,000 files. I had to increase the block size to 512 KB to get hashing to complete. Hashing took about 8 hours. Deduping ran for about 12 hours before hanging about 3/4 of the way down the list. In the end about 40GB was saved.
It did not play nicely with snapshots. I ended up deleting the snapshots because it was impractical to dedupe them as well, otherwise there would have been no saving at all.
I expect that once duperemove gets an incremental mode things will go a lot smoother.
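For anyone wondering what the hashing pass and the block size setting amount to, here is a rough Python sketch of the general idea: read files in fixed-size blocks, hash each block, and count blocks whose hash has been seen before. This is not duperemove's actual code. The function names are made up for illustration, and SHA-256 is just a convenient choice here. It only estimates potential savings. It does not dedupe anything, and it cannot tell whether two identical blocks already share an extent (as snapshot copies do), so it would overestimate on a snapshotted filesystem.

```python
import hashlib
import os
from collections import defaultdict


def hash_blocks(path, block_size=512 * 1024):
    """Yield (offset, digest) for each block read from one file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield offset, hashlib.sha256(block).hexdigest()
            offset += len(block)


def estimate_savings(root, block_size=512 * 1024):
    """Return the number of bytes held in full blocks that occur more than once."""
    seen = defaultdict(list)  # digest -> [(path, offset), ...]
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            for offset, digest in hash_blocks(path, block_size):
                # Only full blocks are treated as candidates in this sketch;
                # a short tail block at the end of a file is ignored.
                if offset + block_size <= size:
                    seen[digest].append((path, offset))
    # Every copy beyond the first is space that could in principle be shared.
    duplicate_blocks = sum(len(locations) - 1 for locations in seen.values())
    return duplicate_blocks * block_size
```

A larger block size means fewer hashes to compute and keep around, which is presumably why raising it helped the hashing pass finish, at the cost of missing duplicates smaller than one block.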
Inband dedupe would be really nice. Nearly impossible to determine the state of this in BTRFS at the moment though. Wiki says it’s under development. Mailing list shows several iterations of patches.
Does anyone here know where BTRFS stands with Inband dedupe at the moment?
Wiki is pretty outdated. I think the support is there in the kernel for dedupe. IMO inband dedupe is generally disliked. Perhaps the author of bedup or the friendly people on the mailing list can provide an accurate answer.
What are some usecases for which inband dedupe is nice?
Provisioning many virtual guests from the same storage device. Base OS barely changes at all, so dedupe saves a lot of storage. Doing this inband rather than scheduled batch ensures that the disk usage is more uniform (i.e. your storage use doesn’t grow through the week, then drop back when the dedupe job runs)
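To put rough numbers on the "grows through the week, then drops back" point, here is a toy Python model. All figures in the usage example are invented for illustration (10 GB written per day, half of it duplicate), and the function name is hypothetical. It models no real filesystem behaviour.

```python
def simulate_usage(days, new_gb_per_day, dup_fraction, batch_every=None):
    """Return disk usage in GB at the end of each day.

    batch_every=None models inband dedupe: duplicates never reach the disk.
    batch_every=N models a scheduled job that reclaims duplicates every N days.
    """
    usage = []
    unique = 0.0   # data that remains after dedupe
    pending = 0.0  # duplicate data written but not yet reclaimed
    for day in range(1, days + 1):
        unique += new_gb_per_day * (1 - dup_fraction)
        if batch_every is not None:
            pending += new_gb_per_day * dup_fraction
            if day % batch_every == 0:
                pending = 0.0  # the batch job has just run
        usage.append(unique + pending)
    return usage


inband = simulate_usage(7, 10, 0.5)
weekly = simulate_usage(7, 10, 0.5, batch_every=7)
print("inband peak:", max(inband), "GB")
print("weekly batch peak:", max(weekly), "GB")
```

Both variants end the week at the same usage, but the batch variant needs enough headroom for the peak just before the job runs.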
Thanks for articulating some important usecases. It’s very educational. It seems like at least some of these usecases would be ok with the cpu, memory and perhaps some fs performance hit that comes with inband dedupe. To answer your question, I think the main reason inband dedupe is not preferred is its performance cost.
I have some experience using bedup and would like to add offline dedupe support to Rockstor (perhaps using duperemove) soon. This seems like the most straightforward thing we can do to help a lot of users. We already have an issue open for that.
It’s not clear what the status of inband dedupe support for btrfs is right now. Currently we don’t have the bandwidth to research that, but hope to some day.
So I am going to resurrect this one here for a few reasons. I want OOB dedupe.
Are there any implications to just installing at command line level outside of Rockstor?
Anyone here running bedup / duperemove on Rockstor? Is it advised against?
For the philosophical piece of this:
I am going to agree that in band dedupe is out of favor because of its cost. It does have its use cases, but those use cases are usually specific to the needs of products that do not directly compete with Rockstor.
If you’re a large enterprise then you have a budget and SLA targeting SPOC Tier 1 storage providers. You are going with a big name like EMC/NetApp/etc. with a 4-hour part turnaround, 30-minute callback, 24x7 contract on supported, verified hardware. Most likely they are running iSCSI, and COW isn’t good for VM targets.
ZFS has in band deduplication that works quite well if you can afford all the extra resources. See FreeNAS for a product that already has this. RAM lost to dedupe is better used for cache, and money is better spent on spindles than on non-cache RAM.
In band has a VERY niche use case in the software solution space, but OOB is a free lunch for everyone who has periods of inactivity. Rockstor isn’t really the product for 24x7 high SLA mission critical business applications. It’s for SMBs, home power users, and maybe some backup targets. These are people with available disk time and less data churn.
Agreed 100%.
Now that deduplication doesn’t change mtime anymore (since Kernel 4.2) there basically is no downside to OOB deduplication anymore.
I have been using it (duperemove) on a small ARM server (Arch Linux) for quite a while now and have had very good results. Unfortunately duperemove lacks a few features that would certainly help with the long-term requirements of deduplication (incremental indexing, for instance). Still, I think it would be a cool feature to implement. I haven’t gotten around to using it on Rockstor because it’s not in the repos and I haven’t had the time yet to look for alternative repos or install it manually…
Good to hear you’ve been using duperemove. I’ve been using it on and off and would love to add the feature soon; we do have an open issue for it. For now though, duperemove IS available in the Testing channel. Just run yum install duperemove to install it.
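Once it is installed, a scheduled job could wrap the call in a small script. Below is a minimal Python sketch that only builds the command line and prints it. The function name and the paths are made up for illustration. The flags used (-r to recurse, -d to actually dedupe, -b for block size, --hashfile to keep hashes on disk) are taken from the duperemove man page as I remember it, so please check duperemove --help on your installed version, since older builds may not support all of them.

```python
import shlex


def build_duperemove_cmd(path, block_size="512K", hashfile=None, dedupe=True):
    """Build an argv list for duperemove. Nothing is executed here."""
    cmd = ["duperemove", "-r"]       # -r: recurse into directories
    if dedupe:
        cmd.append("-d")             # -d: dedupe instead of only reporting
    cmd += ["-b", block_size]        # -b: block size used for hashing
    if hashfile:
        # Assumption: the installed version supports --hashfile.
        cmd.append(f"--hashfile={hashfile}")
    cmd.append(path)
    return cmd


# Hypothetical paths; substitute your own share and hashfile location.
cmd = build_duperemove_cmd("/path/to/share", hashfile="/path/to/hashes.db")
print(shlex.join(cmd))
```

Printing the command first, and perhaps running once with dedupe=False, is a cheap way to see what would be touched before letting it loose on real data.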
Suman,
I’m curious how duperemove has been working for you?
Are there any implications that could get in the way of the existing Rockstor UI functionality?
Been quite a while since we talked about this.
Is bedup still the flavor of choice for when it hits?