I need a library which would help me to save and query data in a condensed format (a mini DSL, in essence). Here’s a sample of what I want:
Update 1: Please note that the figures in the samples below are kept small just to make the logic easier to follow; the real figures are limited only by the capacity of the C# long type, e.g.:
1,18,28,29,39,18456789,18456790,18456792,184567896.
Sample Raw Data set: 1,2,3,8,11,12,13,14
Condensed Sample Data set:
1..3,8,11..14
What would be absolutely nice to have is the ability to represent 1,2,4,5,6,7,8,9,10 as 1..10-3.
Querying Sample Data set:
Query 1 (get range):
1..5 -> 1..3
Query 2 (check if the value exists):
?2 -> true
Query 3 (get multiple ranges and scalar values):
1..5,11..12,14 -> 1..3,11..12,14
I don’t want to develop it from scratch and would highly prefer to use something which already exists.
Here are some ideas I’ve had over the days since I read your question. I can’t be sure any of them really apply to your use case but I hope you’ll find something useful here.
Storing your data compressed
Steps you can take to reduce the amount of space your numbers take up on disk:
- If your numbers fit, instead of a long, use a uint. (4 bytes per number.)
- Instead of a fixed-size uint, use a variable-length encoding. Store your numbers 7 bits to a byte, with the remaining bit used to say “there are more bytes in this number”. (Then 1-127 will fit in 1 byte, 128-~16k in 2 bytes, ~16k-~2M in 3 bytes, ~2M-~270M in 4 bytes.)
This should reduce your storage from 8 bytes per number (if you were originally storing them as longs) to, say, on average 3 bytes. Also, if you end up needing bigger numbers, the variable-byte storage will be able to hold them.
Then I can think of a couple of ways to reduce it further, given you know the numbers are always increasing and may contain lots of runs. Which works best for you, only you can know, by trying it on your actual data.
- Store each run as its start and the count of numbers that follow it (e.g. 2,3,4,5,6 => 2,4). You’ll have to store lone numbers as e.g. 8,0, which will increase storage for those, but if your data has lots of runs (especially long ones) this should reduce storage on average. You could further store “single gaps” in runs as e.g. 1,2,3,5,6,7 => 1,6,4 (unambiguous, as 4 is too small to be the start of the next run), but this will make processing more complex and won’t save much space, so I wouldn’t bother.
- Store the differences between consecutive numbers (e.g. 3,4,5,7,8,9 => 3,1,1,2,1,1). This will reduce the number of bytes used for storing larger numbers (e.g. 15000,15005 (4 bytes) => 15000,5 (3 bytes)). Further, if the data contains a lot of runs (and hence lots of 1 bytes), it will then compress (e.g. zip) nicely.
Handling in code
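As a rough illustration of the "store runs as a start and a count" idea from the previous section, here is a minimal sketch. The class and method names (RunCodec, Encode, Decode) are made up for this example, it assumes the input is strictly increasing, and it leaves out the "single gaps" refinement.

```csharp
using System;
using System.Collections.Generic;

public static class RunCodec
{
    // Turns a strictly increasing sequence into (start, count) pairs, where
    // count is the number of consecutive values that follow the start.
    // Example: 2,3,4,5,6 -> 2,4 and a lone 8 -> 8,0.
    public static IEnumerable<ulong> Encode(IEnumerable<ulong> increasing)
    {
        bool inRun = false;
        ulong start = 0, count = 0;
        foreach (ulong value in increasing)
        {
            if (inRun && value == start + count + 1)
            {
                count++;            // value extends the current run
                continue;
            }
            if (inRun)
            {
                yield return start; // flush the finished run
                yield return count;
            }
            start = value;          // begin a new run at this value
            count = 0;
            inRun = true;
        }
        if (inRun)
        {
            yield return start;     // flush the last run
            yield return count;
        }
    }

    // Expands (start, count) pairs back into the original numbers.
    public static IEnumerable<ulong> Decode(IEnumerable<ulong> pairs)
    {
        using (IEnumerator<ulong> e = pairs.GetEnumerator())
        {
            while (e.MoveNext())
            {
                ulong start = e.Current;
                if (!e.MoveNext())
                    throw new FormatException("Run start without a count.");
                ulong count = e.Current;
                for (ulong i = 0; i <= count; i++)
                    yield return start + i;
            }
        }
    }
}
```

With the sample data set from the question, 1,2,3,8,11,12,13,14 would be encoded as 1,2,8,0,11,3.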
I’d simply advise you to write a couple of methods that stream a file from disk into an IEnumerable<uint> (or ulong if you end up with bigger numbers), and do the reverse, while handling whatever you’ve implemented from the above.
If you do this in a lazy fashion (using yield return to return the numbers as you read them from disk and calculate them, and streaming numbers to disk rather than holding them in memory and writing them all at once), you can keep your memory usage down whatever the size of the stored data.
(I think, but I’m not sure, that even the GZipStream and other compression streams will let you stream your data without having it all in memory.)
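To make that concrete, here is a minimal sketch of such a pair of methods, combining the "7 bits to a byte" storage with the "store differences" idea. The names (DeltaVarintCodec, Write, Read) are invented for this example, it assumes the numbers never decrease, and it uses ulong so that larger values also fit.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DeltaVarintCodec
{
    // Writes each number as the difference from the previous one, 7 bits per
    // byte; a set high bit means "there are more bytes in this number".
    public static void Write(Stream output, IEnumerable<ulong> increasing)
    {
        ulong previous = 0;
        foreach (ulong value in increasing)
        {
            if (value < previous)
                throw new ArgumentException("Sequence must not decrease.");
            ulong delta = value - previous;
            while (delta >= 0x80)
            {
                // Low 7 bits plus the continuation flag.
                output.WriteByte((byte)((delta & 0x7F) | 0x80));
                delta >>= 7;
            }
            output.WriteByte((byte)delta); // last byte, flag clear
            previous = value;
        }
    }

    // Lazily reads the numbers back, one at a time, as the caller enumerates.
    public static IEnumerable<ulong> Read(Stream input)
    {
        ulong previous = 0;
        while (true)
        {
            int b = input.ReadByte();
            if (b < 0) yield break;        // clean end of stream
            ulong delta = 0;
            int shift = 0;
            while (true)
            {
                delta |= (ulong)(b & 0x7F) << shift;
                if ((b & 0x80) == 0) break; // no continuation flag: number done
                shift += 7;
                b = input.ReadByte();
                if (b < 0) throw new EndOfStreamException("Truncated number.");
            }
            previous += delta;
            yield return previous;
        }
    }
}
```

Writing 15000,15005 this way takes 3 bytes in total, matching the estimate above. Because both methods only depend on Stream, it should be possible to pass a FileStream, or a GZipStream wrapped around one, without other changes, though I haven’t measured how that behaves on large files.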
Querying
If you’re comparing two of your big data sets, I wouldn’t advise using LINQ’s Intersect method, as it requires reading one of the sources completely into memory. However, as you know both sequences are increasing, you can write a similar method that need only hold an enumerator for each sequence.
If you’re querying one of your data sets against a small, user-input list of numbers, you can happily use LINQ’s Intersect method as it is currently implemented, as it only needs the second sequence to be entirely in memory.
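For the first case, a method that holds only one enumerator per sequence could look roughly like the sketch below. The names (SortedSequences, IntersectSorted) are hypothetical, and the method assumes both inputs are sorted in increasing order; it does not check this.

```csharp
using System;
using System.Collections.Generic;

public static class SortedSequences
{
    // Yields the values present in both inputs by advancing whichever
    // enumerator is behind. Neither input is buffered in memory.
    public static IEnumerable<ulong> IntersectSorted(
        IEnumerable<ulong> first, IEnumerable<ulong> second)
    {
        using (IEnumerator<ulong> a = first.GetEnumerator())
        using (IEnumerator<ulong> b = second.GetEnumerator())
        {
            bool hasA = a.MoveNext();
            bool hasB = b.MoveNext();
            while (hasA && hasB)
            {
                if (a.Current < b.Current)
                {
                    hasA = a.MoveNext();   // first is behind, catch up
                }
                else if (a.Current > b.Current)
                {
                    hasB = b.MoveNext();   // second is behind, catch up
                }
                else
                {
                    yield return a.Current; // present in both
                    hasA = a.MoveNext();
                    hasB = b.MoveNext();
                }
            }
        }
    }
}
```

Taking Query 3 from the question as an example: intersecting the stored set 1,2,3,8,11,12,13,14 with the expanded query 1,2,3,4,5,11,12,14 yields 1,2,3,11,12,14, which condenses to 1..3,11..12,14.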