When you search in Google (I'm almost sure AltaVista did the same thing), it says 'Results 1-10 of about xxxx'…
This has always amazed me… What does 'about' mean here?
How can they produce a rough count?
I do understand why they can’t come up with a precise figure in a reasonable time, but how do they even reach this ‘approximate’ one?
I’m sure there’s a lot of theory behind this one that I missed…
Most likely it’s similar to the estimated row counts that most SQL systems use in query planning: the number of rows in the table (known exactly as of the last time statistics were collected, but generally not up to date), multiplied by an estimated selectivity (usually derived from a statistical distribution model built by sampling a small subset of rows).
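As a rough illustration of where those two numbers come from in PostgreSQL (this is only a sketch, using a hypothetical `orders` table with a `status` column that has been ANALYZEd; it isn't a description of what Google actually does):

```sql
-- Total row count as of the last statistics collection (not live data):
SELECT reltuples::bigint AS estimated_total_rows
FROM pg_class
WHERE relname = 'orders';

-- Per-value selectivity estimates, gathered by ANALYZE from a sample of rows:
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- The planner's "about N rows" figure is roughly
--   estimated_total_rows * estimated_selectivity
-- e.g. 1,000,000 rows * 0.25 frequency for 'shipped' ≈ 250,000 rows.
```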
The PostgreSQL manual has a section on statistics used by the planner that is fairly informative, at least if you follow the links out to pg_stats and various other sections. I’m sure that doesn’t really describe what Google does, but it at least shows one model in which you can return the first N rows along with an estimate of how many more there might be.
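You can see that estimate directly with EXPLAIN (again a sketch with the same hypothetical table; the plan output shown in the comments is illustrative, and the exact numbers will differ):

```sql
-- The "rows=" figure in the plan is the planner's estimate, produced from
-- statistics rather than by counting every match, much like a
-- "Results 1-10 of about N" figure:
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
--  Seq Scan on orders  (cost=0.00..18334.00 rows=250000 width=97)
--    Filter: (status = 'shipped'::text)

-- Fetching the first 10 rows is cheap; the estimated total for the rest
-- comes from the statistics, not from scanning the whole table:
SELECT * FROM orders WHERE status = 'shipped' LIMIT 10;
```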