The database schema is
CREATE TABLE sites
(
site_id INTEGER PRIMARY KEY AUTOINCREMENT,
netloc TEXT UNIQUE NOT NULL,
last_visited REAL DEFAULT 0,
crawl_rate REAL DEFAULT 2,
crawl_frequency REAL DEFAULT 604800,
robots_txt TEXT DEFAULT 0,
robots_last_fetch REAL DEFAULT 0,
allow_all NUMERIC DEFAULT 0,
disallow_all NUMERIC DEFAULT 0,
active NUMERIC DEFAULT 0
)
CREATE TABLE urls
(
url_id INTEGER PRIMARY KEY AUTOINCREMENT,
site_id INTEGER REFERENCES sites (id) NOT NULL,
scheme TEXT NOT NULL,
path TEXT NOT NULL,
last_visited REAL DEFAULT 0,
UNIQUE( site_id, scheme, path)
)
As you can probably, guess, this is for a web crawler.
I want to get N of the sites that have crawlable urls associated with them and all of the
aforementioned urls. A url is crawlable if url.last_visited + site.crawl_frequency < current_time where current_time comes from pythons time.time() function. What I’m looking for will probably begin with something like:
SELECT s.*, u.* FROM sites s, urls u ON s.site_id = u.site_id ...
Beyond that all I can think is that GROUP BY might have some role to play.
Here is a graceless query. There’s probably a more clever way to do this.
The subquery is supposed to return up to
ndistinctsite_ids with least one crawlable URL. TheORDER BYattribute needn’t besite_id. ActuallyORDER BYisn’t necessary at all. I just threw it in there because consistency is nice when playing with a new query.The enclosing query returns all
urls associated withndistinctsites, where eachsitehas at least one crawlableurl. Note that not allurls returned are necessarily crawlable; the only guarantee is that at least oneurlpersiteis crawlable. A returnedsitecould have non-crawlableurls, too.If only crawlable
urls should be returned, the timing condition can be copied in the enclosing query. I couldn’t tell which behavior was required from the question.P.S. I’m indulging in pedantry now, but the way
crawl_frequencyis used makes me think it could be calledcrawl_periodorcrawl_delayinstead