Editorial Team
Asked: June 13, 2026

ABSTRACT

While talking with some colleagues we came across the “extract a random row from a big database table” issue. It’s a classic one, and we know the naive approach (also on SO) is usually something like:

SELECT * FROM mytable ORDER BY RAND() LIMIT 1

THE PROBLEM

We also know that a query like this is badly inefficient and really only usable on tables with very few rows. Some approaches achieve better efficiency, like these ones also on SO, but they don’t work with arbitrary primary keys, and the randomness becomes skewed as soon as there are holes in your numeric primary keys. An answer to the last cited question links to this article, which has a good explanation and some bright solutions involving an additional “equal distribution” table that must be maintained whenever the “master data” table changes. But then again, if you have frequent DELETEs on a big table, the constant updating of the added table will probably hurt you.

Also note that many solutions rely on COUNT(*), which is ridiculously fast on MyISAM but only “just fast” on InnoDB (I don’t know how it performs on other platforms, but I suspect the InnoDB case is representative of other transactional database systems).

In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.

THE IDEA

A separate service could be responsible for generating, buffering and distributing random row ids or even entire random rows:

  • it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in RAM by the service (it shouldn’t take too many bytes per row on top of the actual size of the PK; it’s probably OK up to 100~1000M rows on standard PCs and up to 1~10 billion rows on a beefy server)
  • once the keys are in memory you have an implicit “row number” for each key with no holes in it, so it’s just a matter of choosing a random number and directly fetching the corresponding key
  • a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
  • consumers of the service will connect and request N random rows from the buffer
  • rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
  • if the buffer is empty the request could block or return EOF-like
  • if data is added to the master table the service must be signaled to add the same data to its copy too, flush the buffer of random picks and go on from that
  • if data is deleted from the master table the service must be signaled to remove that data too from both the “all keys” list and “random picks” buffer
  • if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
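
The bullets above can be sketched roughly as follows. This is only a minimal, single-process illustration, not a finished service: the class name `RandomKeyService` and its methods `take`, `add` and `remove` are hypothetical, and networking, locking and fetching full rows are left out. It also swaps in one simplification: instead of an ordered list it keeps an unordered dense list plus a key-to-index map, since uniform picking only needs the list to have no holes, and that lets a delete run in constant time.

```python
import random
from collections import deque


class RandomKeyService:
    """Hypothetical in-memory service that hands out random primary keys."""

    def __init__(self, keys, buffer_size=100, rng=None):
        self._rng = rng or random.Random()
        self._keys = list(keys)  # dense list: the index is the implicit row number
        self._pos = {k: i for i, k in enumerate(self._keys)}  # key -> index
        self._buffer_size = buffer_size
        self._buffer = deque()  # random picks ready to be consumed
        self._refill()

    def _refill(self):
        # Top the buffer up with uniformly chosen keys.
        while self._keys and len(self._buffer) < self._buffer_size:
            self._buffer.append(self._rng.choice(self._keys))

    def take(self, n):
        """Return up to n buffered keys (fewer if the buffer runs dry)."""
        out = [self._buffer.popleft() for _ in range(min(n, len(self._buffer)))]
        self._refill()
        return out

    def add(self, key):
        """Signal that a row was inserted: store the key, flush the buffer."""
        if key in self._pos:
            return
        self._pos[key] = len(self._keys)
        self._keys.append(key)
        self._buffer.clear()
        self._refill()

    def remove(self, key):
        """Signal that a row was deleted: drop it from the list and the buffer."""
        i = self._pos.pop(key, None)
        if i is None:
            return
        last = self._keys.pop()
        if i < len(self._keys):  # the removed key was not last: fill its hole
            self._keys[i] = last
            self._pos[last] = i
        self._buffer = deque(k for k in self._buffer if k != key)
        self._refill()
```

A consumer would call something like `svc.take(10)` to get ten keys and then fetch the rows itself. An update that changes a primary key could be expressed as `remove(old)` followed by `add(new)`.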

WHY WE THINK IT’S COOL

  • does not touch disks other than the initial load of keys at startup or when signaled to do so
  • works with any kind of primary key, numerical or not
  • if you know you’re going to update a large batch of data you can signal the service just once when you’re done (i.e. not at every single insert/update/delete on the original data); it’s basically like having a fine-grained lock that only blocks requests for random rows
  • really fast on updates of any kind in the original data
  • offloads some work from the relational db to another, memory only process: helps scalability
  • responds really fast from its buffers without waiting for any querying, scanning, sorting
  • could easily be extended to similar use cases beyond the SQL one

WHY WE THINK IT COULD BE A STUPID IDEA

  • because we had the idea without help from any third party
  • because nobody (we heard of) has ever bothered to do something similar
  • because it adds complexity to the mix, since the service must be kept up to date whenever the original data changes

AND THE QUESTION IS…

Does anything similar already exist? If not, would it be feasible? If not, why?

1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 11:00 pm

    The biggest risk with your “cache of eligible primary keys” concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.

    How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger fires before the transaction that spawned it commits, so the notification can go out even if that transaction is later rolled back. This is a general problem with notifying external systems from triggers.

    If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can’t change the code.

    In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don’t need an absolutely random selection all the time.

    For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
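
As a rough illustration of the inequality lookup, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in database. The table name `mytable` comes from the question; the integer column `id` and the function name `random_row_inequality` are assumptions made for the example.

```python
import random
import sqlite3


def random_row_inequality(conn, rng=random):
    """Pick a random id in [MIN(id), MAX(id)] and return the first row at or after it."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM mytable").fetchone()
    if lo is None:  # empty table
        return None
    target = rng.randint(lo, hi)
    # The inequality always finds a row, but a row that follows a gap
    # also "owns" every missing id in that gap.
    return conn.execute(
        "SELECT * FROM mytable WHERE id >= ? ORDER BY id LIMIT 1", (target,)
    ).fetchone()
```

The skew is easy to see with ids 1, 2, 3 and 10: any target from 4 to 10 lands on row 10, so that row is returned for 7 of the 10 equally likely targets.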

    Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
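
A minimal sketch of that cache might look like the following. The class name `RandomPickCache` and the `fetch_random_ids` callable are hypothetical; the callable is assumed to run whatever slow random query you already have (for example the `ORDER BY RAND()` one with a larger `LIMIT`) and return a list of ids.

```python
import random
import time


class RandomPickCache:
    """Hypothetical cache of pre-selected random ids, refreshed after a TTL."""

    def __init__(self, fetch_random_ids, size=30, ttl=60.0,
                 clock=time.monotonic, rng=None):
        self._fetch = fetch_random_ids  # assumed: callable(n) -> list of ids
        self._size = size
        self._ttl = ttl
        self._clock = clock
        self._rng = rng or random.Random()
        self._ids = []
        self._loaded_at = 0.0

    def pick(self):
        """Return one cached id, reloading the set if it is empty or stale."""
        now = self._clock()
        if not self._ids or now - self._loaded_at >= self._ttl:
            self._ids = list(self._fetch(self._size))
            self._loaded_at = now
        return self._rng.choice(self._ids) if self._ids else None
```

The expensive query then runs about once per TTL instead of once per request, at the cost of repeating the same few values within each window.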

    Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it’s not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
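
The retry loop could be sketched like this, again with `sqlite3` as a stand-in database. The column `id`, the function name `random_row_equality` and the `max_tries` default are assumptions for illustration only.

```python
import random
import sqlite3


def random_row_equality(conn, max_tries=5, rng=random):
    """Try a few random ids with an equality lookup; return None if all hit gaps."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM mytable").fetchone()
    if lo is None:  # empty table
        return None
    for _ in range(max_tries):
        candidate = rng.randint(lo, hi)
        row = conn.execute(
            "SELECT * FROM mytable WHERE id = ?", (candidate,)
        ).fetchone()
        if row is not None:
            return row  # every existing id is equally likely to be hit
    return None  # give up; the caller can fall back to another method
```

Because a miss is simply retried rather than rounded to a neighbouring row, the rows that are returned are uniformly distributed. The sparser the ids, the more retries you should expect.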

© 2021 The Archive Base. All Rights Reserved