I'm having a problem with a query in MongoDB (pymongo driver).
Here is my problem:
I have to insert (or update) about 100 million (100,000,000) documents into MongoDB per day.
I originally updated existing documents by appending to the same key field, but I gave that up and switched to bulk inserts, because updates were slower than bulk inserts.
Here is a sketch of the schema in my db.
{_id:xxx, F1:1, F2:"test1", TS: 2011/01}
{_id:xxx, F1:1, F2:"test2", TS: 2011/02}
{_id:xxx, F1:2, F2:"test1", TS: 2011/03}
{_id:xxx, F1:3, F2:"test1", TS: 2011/04}
{_id:xxx, F1:2, F2:"test1", TS: 2011/05}
.....
(4 billion documents or more)
When I query, I just want to retrieve the document with the latest TS for each value of F1 (field 1).
I know that the “group” aggregation can do that, but my database is sharded and the group operation is not allowed on a sharded database.
I also tried map-reduce, but its query performance is not good enough.
The only query I am using is the “$in” operation.
db.test.find({"F1":{"$in":[1,2,3,....]}})
It retrieves all docs whose F1 is in the target array, but I only want to get the latest document per key F1:
{_id:xxx, F1:1, F2:"test2", TS: 2011/02}
{_id:xxx, F1:2, F2:"test1", TS: 2011/05}
{_id:xxx, F1:3, F2:"test1", TS: 2011/04}
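To make the desired result concrete, here is a minimal client-side sketch in plain Python. It is only an illustration of the grouping, not a server-side solution: `latest_per_key` is a hypothetical helper, and it assumes TS is stored as a zero-padded string such as "2011/05", so that string comparison matches chronological order. It accepts any iterable of dicts, which is what a pymongo cursor yields.

```python
def latest_per_key(docs, key="F1", ts="TS"):
    """Return {key value: newest document} from an iterable of dicts.

    Hypothetical helper; assumes the TS values compare in chronological
    order (for example zero-padded "YYYY/MM" strings).
    """
    latest = {}
    for doc in docs:
        k = doc[key]
        # Keep the document with the greatest TS seen so far for this key.
        if k not in latest or doc[ts] > latest[k][ts]:
            latest[k] = doc
    return latest


# Sample data mirroring the sketch of the schema (the _id field is omitted).
docs = [
    {"F1": 1, "F2": "test1", "TS": "2011/01"},
    {"F1": 1, "F2": "test2", "TS": "2011/02"},
    {"F1": 2, "F2": "test1", "TS": "2011/03"},
    {"F1": 3, "F2": "test1", "TS": "2011/04"},
    {"F1": 2, "F2": "test1", "TS": "2011/05"},
]

result = latest_per_key(docs)
for f1 in sorted(result):
    print(result[f1])
```

This keeps one document per distinct F1 in memory, but it still has to read every matching document from the server, which is exactly the cost I would like to avoid.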
How can I get that?
P.S.
The target array might contain a million elements that I want to query in bulk.
Is there a good way to do that?
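For the million-element case, one thing that may be worth trying is splitting the key list into smaller `$in` batches rather than sending a single huge array. A query filter is itself a BSON document and is subject to MongoDB's 16 MB document limit, which a very large array can approach. The sketch below is plain Python: `in_batches` is a hypothetical helper, and the batch size of 1000 is an arbitrary assumption, not a measured optimum.

```python
from itertools import islice


def in_batches(keys, field="F1", batch_size=1000):
    """Yield one {field: {"$in": [...]}} filter per batch of keys.

    Hypothetical helper; batch_size is an arbitrary illustrative default.
    """
    it = iter(keys)
    while True:
        # Take the next batch_size keys; an empty batch means we are done.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield {field: {"$in": batch}}


# 2500 keys split into batches of at most 1000.
filters = list(in_batches(range(1, 2501), batch_size=1000))
print(len(filters))
print([len(f["F1"]["$in"]) for f in filters])
```

Each yielded filter could then be passed to `db.test.find(...)` in turn. This only bounds the size of each request; it does not by itself reduce the number of documents returned per key.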
There’s no single-step solution to this problem, since, as you mentioned, you can’t use the group aggregation on a sharded database (and it likely wouldn’t perform well even if you could). You might want to explore a solution like: