The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: June 15, 2026

I am trying to model a database that needs a very high write throughput,

I am trying to model a database that needs a very high write throughput, and reasonable read throughput. I have a distributed set of systems that are adding “event” data into the database.

Currently, the id for the event record is a GUID. I have been reading that GUIDs don’t tend to create great indexes because their random distribution means that recent data will be scattered across the disk, which can lead to paging problems.

So here is the first assumption I would like to validate:
I am assuming that I want to choose an _id that creates a right-balanced tree, such as something like an autonumber. This would be beneficial because the 2 most recent events would essentially be right next to each other on disk. Is this a correct assumption?

Assuming that (1) is correct, I am trying to work out the best way to generate such an id. I know Mongo natively supports ObjectId, which is convenient for applications that are OK with tying their data to Mongo, but my application isn’t one of them. Since there are multiple systems producing data, simulating an “auto-number” field is a little problematic: Mongo doesn’t support auto-number on the server side, so the producer would have to assign the id, which is hard if the producers don’t know what the other systems are doing.

To solve this, I am considering making the _id field a compound key on { localId, producerId }, where localId is an autonumber that the producer can generate on its own, because producerId will make the pair unique. ProducerId is something that I can negotiate among producers so that they each end up with a unique id.
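The producer-side id assignment described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `makeIdGenerator` is a hypothetical helper name, the counter lives only in memory, and it assumes each producer has already been given a unique producerId.

```javascript
// Hypothetical helper (not part of any driver): each producer keeps its own
// counter and stamps every event with { localId, producerId } as the _id.
function makeIdGenerator(producerId, start = 0) {
  let localId = start; // in-memory only; a real producer would need to persist this
  return function nextId() {
    localId += 1;
    // Field order matters: embedded documents are compared field by field,
    // in the order the fields are stored.
    return { localId: localId, producerId: producerId };
  };
}

// Two producers generating ids independently of each other.
const nextA = makeIdGenerator(1);
const nextB = makeIdGenerator(2);
const ids = [nextA(), nextB(), nextA(), nextB()];

console.log(JSON.stringify(ids));
```

The two producers never coordinate, yet every generated pair is distinct because the producerId differs even when the localId collides.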

So here is my next question:
If my goal is to get the most recent data from all producers, then { localId, producerId } should be the preferred key ordering since localId will be right-ist and producerId will be a small cluster, and I would prefer that the 2 most recent events stay local to each other. If I inverted that order, then my reasoning for how the tree would eventually look would be something like the following:

               root
        /        |           \
       p0        p1          p2
       /         |            \
     e0..n      e0..n        e0..n

where p# is the producer Id, and e# is an event. This seems like it would fragment my index into p# clusters of data, and new events wouldn’t necessarily be next to each other. My assumption for the preferred ordering should (please verify) look something like this instead:

               root
      /          |          \
     e0          e1         e2
     /            |           \
  p0..n         p0..n        p0..n

which would seem to keep recent events near each other. ( I know that Mongo uses B-trees for indexes, but I am just trying to simplify the visual here ).
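To make the two diagrams concrete, here is a small sketch in plain JavaScript (no database involved) that sorts the same keys under both orderings and reports where the newest events end up. The sample data (3 producers, events 1 to 3) and the helper `compareKeys` are assumptions made for illustration only.

```javascript
// Compare two keys given as arrays, field by field, like a compound key.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Hypothetical sample: 3 producers, each emitting events 1..3.
const events = [];
for (let e = 1; e <= 3; e++) {
  for (let p = 0; p < 3; p++) events.push({ e, p });
}

// Key layout { localId, producerId } versus { producerId, localId }.
const byEventFirst = events.map(({ e, p }) => [e, p]).sort(compareKeys);
const byProducerFirst = events.map(({ e, p }) => [p, e]).sort(compareKeys);

// 0-based positions in sorted key order where the newest events (e = 3) sit.
const newestEventFirst = byEventFirst
  .map((k, i) => (k[0] === 3 ? i : -1))
  .filter((i) => i >= 0);
const newestProducerFirst = byProducerFirst
  .map((k, i) => (k[1] === 3 ? i : -1))
  .filter((i) => i >= 0);

console.log(newestEventFirst);    // [ 6, 7, 8 ]
console.log(newestProducerFirst); // [ 2, 5, 8 ]
```

With the event number first, the newest keys are contiguous at the right edge; with the producer first, they are spread out, one per producer cluster. Note that this sample assumes all producers advance their counters at the same pace; if they do not, slower producers would insert somewhat to the left of the edge even with the event number first.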

The only caveat to { localId, producerId } that I can see is that a common query by the user would be to list the most recent events by producer, which { producerId, localId } would actually handle much better. In order to get this query to work with { localId, producerId }, I am thinking that I will also need to add the producerId as a field to the document, and index that.

To be explicit about what my question here really is, I want to know if I am thinking about this problem correctly, or if there is an obviously better way to approach this.

Thanks


1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 9:19 pm

    To answer your question: with a compound key like {a, b}, a query that filters only on b and sorts by a ends up as a scatter query, but it can still use the index for the sort.

    If you use a document instead of an ObjectId as the _id, the _id is still indexed, but that index is not used for queries on its fields, because it is not a compound index!

    Example:

    Given these documents in collection ‘a’ and no additional index:

    { "_id" : { "e" : 1, "p" : 1 } }
    { "_id" : { "e" : 1, "p" : 2 } }
    { "_id" : { "e" : 2, "p" : 1 } }
    { "_id" : { "e" : 1, "p" : 3 } }
    { "_id" : { "e" : 2, "p" : 3 } }
    { "_id" : { "e" : 2, "p" : 2 } }
    { "_id" : { "e" : 3, "p" : 1 } }
    { "_id" : { "e" : 3, "p" : 2 } }
    { "_id" : { "e" : 3, "p" : 3 } }
    

    a query like this:

    db.a.find({'_id.p' : 2}).sort({'_id.e' : 1}).explain()
    

    will NOT use an index:

    {
        "cursor" : "BasicCursor",
        "nscanned" : 9,
        "nscannedObjects" : 9,
        "n" : 3,
        "scanAndOrder" : true,
        "millis" : 0,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "isMultiKey" : false,
        "indexOnly" : false,
        "indexBounds" : {   
        }
    }
    

    That is because only the _id documents as a whole are indexed, not their individual fields.

    If you create an index like this:

    db.a.ensureIndex({'_id.e' : 1, '_id.p' : 1})
    

    and then query again:

    db.a.find({'_id.p' : 2}).sort({'_id.e' : 1}).explain()
    
    {
        "cursor" : "BtreeCursor _id.e_1__id.p_1",
        "nscanned" : 9,
        "nscannedObjects" : 3,
        "n" : 3,
        "millis" : 0,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "isMultiKey" : false,
        "indexOnly" : false,
        "indexBounds" : {
            "_id.e" : [
                [
                    {
                        "$minElement" : 1
                    },
                    {
                        "$maxElement" : 1
                    }
                ]
            ],
            "_id.p" : [
                [
                    2,
                    2
                ]
            ]
        }
    }
    

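    With the compound index, the query walks the index in _id.e order because of the sort (nscanned: 9) and then fetches only the matching objects (nscannedObjects: 3). That is better than the unindexed query, where nscanned and nscannedObjects are both 9 and the results have to be sorted afterwards.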
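The difference between the two explain outputs can be reproduced with a small simulation in plain JavaScript. This is only a sketch of the counting, not of how MongoDB actually executes queries; `collectionScan` and `indexScan` are hypothetical names, and the "index" is just a sorted array of keys.

```javascript
// The nine _id values from the example collection, in insertion order.
const docs = [
  [1, 1], [1, 2], [2, 1], [1, 3], [2, 3], [2, 2], [3, 1], [3, 2], [3, 3],
].map(([e, p]) => ({ _id: { e, p } }));

// Without a usable index: look at every document, filter, then sort in memory.
function collectionScan(p) {
  let scannedObjects = 0;
  const hits = docs.filter((d) => {
    scannedObjects++;
    return d._id.p === p;
  });
  hits.sort((a, b) => a._id.e - b._id.e); // corresponds to scanAndOrder: true
  return { scannedObjects, n: hits.length, order: hits.map((d) => d._id.e) };
}

// With a simulated { e, p } index: walk all keys in index order, but fetch
// only the documents whose key already matches p. No extra sort is needed.
function indexScan(p) {
  const keys = docs
    .map((d, pos) => ({ e: d._id.e, p: d._id.p, pos }))
    .sort((a, b) => a.e - b.e || a.p - b.p);
  let scannedKeys = 0;
  const fetched = [];
  for (const k of keys) {
    scannedKeys++;
    if (k.p === p) fetched.push(docs[k.pos]);
  }
  return {
    scannedKeys,
    scannedObjects: fetched.length,
    n: fetched.length,
    order: fetched.map((d) => d._id.e),
  };
}

console.log(collectionScan(2)); // 9 objects examined, 3 returned
console.log(indexScan(2));      // 9 keys examined, 3 objects fetched
```

Both variants return the same three documents in the same order; the indexed variant simply touches six fewer documents and skips the in-memory sort.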

    Documentation .explain()

    So for high write throughput (over 15k writes a sec) you would probably shard. Both indexes would guarantee uniqueness if the unique option is set. But only a compound shard key will give you directly routed queries and avoid scatter-gather.

    Using ({'_id.e' : 1, '_id.p' : 1}) as a shard key will route all “_id.e” queries directly, but not “_id.p” queries (without ‘e’). Those queries will be sent to every host and end in index lookups there, which could be fast as well (it depends on the network etc.). If you want to cluster these queries by “p”, you have to put ‘_id.p’ as the first part of the compound key, like so:

    {'_id.p' : 1, '_id.e' : 1}
    

    Then all “p” queries are direct queries. But yes, this would scatter recent events across the cluster, so a separate index on the time-based key might speed up those scatter queries.
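The routing behaviour described above can be sketched with a toy model in plain JavaScript. Everything here is an assumption for illustration: the shard names, the chunk boundaries, and the `targetShards` helper are hypothetical, and real chunk ranges are chosen by MongoDB, not by hand.

```javascript
// Hypothetical range chunks: each shard owns a range [min, max) of the
// shard key's leading field.
const chunks = [
  { shard: "s0", min: 1, max: 2 },
  { shard: "s1", min: 2, max: 3 },
  { shard: "s2", min: 3, max: 4 },
];

// Return the shards a query must visit. If the query does not constrain the
// leading shard key field, every shard has to be asked (scatter-gather).
function targetShards(leadingValue) {
  if (leadingValue === undefined) return chunks.map((c) => c.shard);
  return chunks
    .filter((c) => leadingValue >= c.min && leadingValue < c.max)
    .map((c) => c.shard);
}

// Query: "all events of producer 2", with no value given for e.
const query = { p: 2 };

// Shard key { p, e }: the leading field is p, so the query is targeted.
const withProducerFirst = targetShards(query.p);
// Shard key { e, p }: the leading field is e, which the query does not supply.
const withEventFirst = targetShards(query.e);

console.log(withProducerFirst); // [ 's1' ]
console.log(withEventFirst);    // [ 's0', 's1', 's2' ]
```

The same model also shows the trade-off: with p leading, the newest event of every producer lives on a different shard, so a "latest events from all producers" query has to visit all of them.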

    I would generate some sample data, play around with it in a setup with two shards on a dev system, and use .explain() to choose the shard key and indexes.

© 2021 The Archive Base. All Rights Reserved