I am relatively new to cassandra and its data model. I have a large set of data that are described by locations on chromosomes (chromosome:start-end) where we have 24 chromosomes and start and end are integers. The query I would like to support is to find all locations in the genome that overlap with a set of other locations. I can create a simple R-tree-based “indexing” scheme if there are not other ideas, but I thought someone might have run into this problem and come up with a solution.
Share
As you need to query on 2 dimensions, either you could use other db like mongodb that support these kind of geospacial indexing/queries see Bounds Queries
In Cassandra, I think the best you could do is use geocell (doc) or other Space filling curves
you will convert start and end to a geohash, for each of your data, then you will be able to search for the bounding box, with start in [s1,s2] and end in [e1,e2], by searching geocells between geohash(s1, e1) and geohash(s2, e2) that gives contiguous locations in the bouding box