Deduplication

Hey,

Really liking this product. Are there any plans to implement deduplication? I was looking at http://www.opendedup.org/ but I don't like how it's used… not very user friendly.

But their dedupe technology is really good. Any chance of getting something like that implemented in this product?

I am glad you like Rockstor and thanks for your request and help in making it better.

I’ve submitted an issue for this: https://github.com/rockstor/rockstor-core/issues/547

You can track the progress there. Btw, feel free to open issues like this that are important to you here, or even better, on github directly.

I am not really sure if/when/how we can support this. We do understand that dedup is very important to a lot of users. So the above issue is filed so we don’t forget this request. Specifics will emerge over time as and when we get to it. Thank you!

It would be good if this could use the built-in BTRFS dedup, rather than running on top of another layer. I have been doing some tests with duperemove (https://github.com/markfasheh/duperemove), but it doesn't seem production ready yet.

You are right, sprint. opendedup, as suggested by Robsi, doesn't even seem possible on top of BTRFS as far as I can tell, but I plan to spend some time playing with it anyway in the future and decide if there is a fit.

BTRFS dedup is happening, just a matter of time.

Thanks for the duperemove link. Please do share the specifics of your test results if you think they would be useful to the community.

AFAIK opendedup just presents a block device that you should theoretically be able to make a BTRFS filesystem on top of.

I tested duperemove on a filesystem containing ~700GB of data in 500,000 files. I had to increase the block size to 512KB to get hashing to complete. Hashing took about 8 hours. Deduping ran for about 12 hours before hanging about 3/4 of the way down the list. In the end about 40GB was saved.
It did not play nicely with snapshots. I ended up deleting the snapshots because it was impractical to dedupe them as well; otherwise there would have been no saving at all.
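For anyone curious what the scan phase is actually doing: conceptually, duperemove hashes chunks of data and groups identical digests before asking the kernel to share the matching extents. A toy sketch of that scan idea in shell, with made-up paths and whole-file hashing for simplicity (duperemove really works at the block/extent level):

```shell
# Toy "scan" phase: hash every file, then print the duplicate groups.
# duperemove does this per block/extent before submitting dedupe
# requests to the kernel; the demo directory and files are made up.
mkdir -p /tmp/dedupe-demo
printf 'same content' > /tmp/dedupe-demo/a
printf 'same content' > /tmp/dedupe-demo/b
printf 'different'    > /tmp/dedupe-demo/c
# GNU uniq: -w64 compares only the 64-character SHA-256 digest,
# -D prints every member of each duplicate group (here: a and b).
find /tmp/dedupe-demo -type f -exec sha256sum {} + | sort | uniq -w64 -D
```

This also hints at why the block size mattered in the test above: smaller blocks mean far more hashes to compute and store, which is why bumping it to 512KB made hashing finish.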

I expect that once duperemove gets an incremental mode things will go a lot smoother.

Inband dedupe would be really nice, but it's nearly impossible to determine its state in BTRFS at the moment. The wiki says it's under development, and the mailing list shows several iterations of patches.
Does anyone here know where BTRFS stands with inband dedupe at the moment?

The wiki is pretty outdated. I think the kernel support for dedupe is there. IMO inband dedupe is generally disliked. Perhaps the author of bedup or the friendly people on the mailing list can provide an accurate answer.

What are some usecases for which inband dedupe is nice?

Provisioning many virtual guests from the same storage device. The base OS barely changes at all, so dedupe saves a lot of storage. Doing this inband rather than as a scheduled batch ensures that disk usage stays more uniform (i.e. your storage use doesn't grow through the week and then drop back when the dedupe job runs).

Mail servers are also a reasonable use case, though I've never tried this. It's easy to see how a company where e-mails and attachments are forwarded all over the place can end up with a lot of duplicate data on a mail server. Inband dedupe works well here.
Development work can also create huge amounts of duplicate data. I've worked places where each developer created a personal development environment, with their own copy of the development database and their own virtual guests running app servers and web front ends against it. The biggest pinch point was the storage requirement of the database. The sad thing is, each developer made a negligible amount of change to their own personal copy of the database, so dedupe would have worked extremely well.
Backups to disk are another use case. It's not unknown for systems to take a full backup of data to a slow disk storage platform. Incremental backups help lower the space requirement, but there will usually be another full backup every week or so. If there is little change from week to week, there is an argument that only one copy of the data needs to be stored, hence inband dedupe.
I guess it’s easier to say that any circumstance where there are large amounts of duplicate data can be seen as a use-case for inband dedupe.
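Worth noting for the developer-database and VM-provisioning cases above: BTRFS can also avoid creating the duplicates in the first place. A reflink copy shares extents from the moment it's made, so no dedupe pass is needed afterwards. A small sketch with made-up paths (`--reflink=auto` falls back to a normal copy on filesystems without reflink support, so it's safe anywhere):

```shell
# Clone a "base image" with a reflink copy. On BTRFS the clone shares
# extents with the original (instant, near-zero extra space); on other
# filesystems --reflink=auto silently falls back to a full copy.
mkdir -p /tmp/reflink-demo
printf 'pretend this is a big base image' > /tmp/reflink-demo/base.img
cp --reflink=auto /tmp/reflink-demo/base.img /tmp/reflink-demo/dev1.img
cmp /tmp/reflink-demo/base.img /tmp/reflink-demo/dev1.img && echo "clones identical"
```

This only helps when you control how the copies are made, of course; dedupe (inband or OOB) is still needed when the duplicates arrive from outside, as in the mail and backup cases.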

I’ve heard loads of comments about not needing inband dedupe, but have generally put them down to people not seeing a need for it in their own circumstances.  Which is perfectly OK.  You don’t see a need for it, you don’t turn it on.  But if I do have a circumstance that would benefit from inband dedupe, it should also be OK for me to run it.

What are the reasons for not running inband dedupe?

Thanks for articulating some important use cases; it's very educational. It seems like at least some of these use cases could tolerate the CPU, memory, and filesystem performance hit that comes with inband dedupe. To answer your question, I think the main reason for avoiding inband dedupe is its performance impact.

I have some experience using bedup and would like to add offline dedupe support to Rockstor (perhaps using duperemove) soon. This seems like the most straightforward thing we can do to help a lot of users. We already have an issue open for it.

It’s not clear what the status of inband dedupe support for btrfs is right now. Currently we don’t have the bandwidth to research that, but hope to some day.

So I am going to resurrect this one for a few reasons. I want OOB dedupe.
Are there any implications to just installing it at the command-line level, outside of Rockstor?

Anyone here running bedup / duperemove on Rockstor? Is it advised against?

For the philosophical piece of this:
I am going to agree that inband dedupe is out of favor because of its cost. It does have its use cases, but those are usually specific to the needs of products that do not directly compete with Rockstor.
If you're a large enterprise, then you have a budget and SLAs targeting SPOC Tier 1 storage providers. You are going with a big name like EMC/NetApp/etc., with a 4-hour part turnaround, 30-minute callback, 24x7 contract on supported, verified hardware. Most likely they are running iSCSI, and COW isn't good for VM targets.

ZFS has inband deduplication that works quite well if you can afford all the extra resources; see FreeNAS for a product that already has this. But RAM given over to dedupe is usually better used as cache, and the money is better spent on spindles.

Inband has a VERY niche use case in the software solution space, but OOB is a free lunch for everyone who has periods of inactivity. Rockstor isn't really the product for 24x7, high-SLA, mission-critical business applications. It's for SMBs, home power users, and maybe some backup targets. These are people with available disk time and less data churn.

Agreed 100%.
Now that deduplication no longer changes mtime (since kernel 4.2), there is basically no downside to OOB deduplication.
I have been using duperemove on a small ARM server (Arch Linux) for quite a while now, with very good results. Unfortunately duperemove lacks a few features that would help the long-term requirements of deduplication (incremental indexing, for instance). Still, I think it would be a cool feature to implement. I haven't come round to using it on Rockstor because it's not in the repos and I didn't have the time yet to look for alternative repos or to install it manually…


Good to hear you've been using duperemove. I've been using it on and off myself and would love to add the feature soon; we do have an open issue for it. For now though, duperemove IS available in the Testing channel. Just run yum install duperemove to install it.


Suman,
I’m curious how duperemove has been working for you?
Are there any implications that could get in the way of the existing Rockstor UI functionality?
Been quite a while since we talked about this.
Is bedup still the flavor of choice for when it hits?