The Archive Base

Editorial Team
Asked: May 15, 2026


The database schema is

CREATE TABLE sites
(
    site_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    netloc            TEXT UNIQUE NOT NULL,
    last_visited      REAL DEFAULT 0,
    crawl_rate        REAL DEFAULT 2,
    crawl_frequency   REAL DEFAULT 604800,  -- 7 days, in seconds
    robots_txt        TEXT DEFAULT 0,
    robots_last_fetch REAL DEFAULT 0,
    allow_all         NUMERIC DEFAULT 0,
    disallow_all      NUMERIC DEFAULT 0,
    active            NUMERIC DEFAULT 0
);

CREATE TABLE urls
(
    url_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id      INTEGER REFERENCES sites (site_id) NOT NULL,  -- sites has no column named id
    scheme       TEXT NOT NULL,
    path         TEXT NOT NULL,
    last_visited REAL DEFAULT 0,
    UNIQUE (site_id, scheme, path)
);

As you can probably guess, this is for a web crawler.

I want to get N of the sites that have crawlable urls associated with them, along with all of those urls. A url is crawlable if url.last_visited + site.crawl_frequency < current_time, where current_time comes from Python’s time.time() function. What I’m looking for will probably begin with something like:

SELECT s.*, u.* FROM sites s, urls u ON s.site_id = u.site_id ...

Beyond that all I can think is that GROUP BY might have some role to play.


1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 10:31 pm

    Here is a graceless query. There’s probably a more clever way to do this.

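    -- :current_time (the value of Python's time.time()) and :n are bound parameters.
    -- In SQLite, a bare current_time is the built-in CURRENT_TIME keyword, not a placeholder.
    SELECT s.*, u.*
    FROM sites s INNER JOIN urls u ON s.site_id = u.site_id
    WHERE s.site_id IN
        (SELECT DISTINCT ss.site_id  -- qualified: both joined tables have a site_id column
         FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
         WHERE uu.last_visited + ss.crawl_frequency < :current_time
         ORDER BY ss.site_id
         LIMIT :n);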
    

    The subquery is supposed to return up to n distinct site_ids, each with at least one crawlable URL. The ORDER BY column needn’t be site_id; in fact, ORDER BY isn’t necessary at all. I just threw it in because consistent results are nice when playing with a new query.
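The subquery can be tried on its own. Below is a minimal sketch using Python's sqlite3 module, assuming SQLite, a trimmed-down copy of the schema, and made-up example sites. The named parameters :current_time and :n and the helper crawlable_site_ids are illustrative choices, not part of the question.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Trimmed schema: only the columns the subquery touches (an assumption for brevity).
conn.executescript("""
    CREATE TABLE sites (
        site_id         INTEGER PRIMARY KEY AUTOINCREMENT,
        netloc          TEXT UNIQUE NOT NULL,
        crawl_frequency REAL DEFAULT 604800
    );
    CREATE TABLE urls (
        url_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        site_id      INTEGER NOT NULL REFERENCES sites (site_id),
        scheme       TEXT NOT NULL,
        path         TEXT NOT NULL,
        last_visited REAL DEFAULT 0,
        UNIQUE (site_id, scheme, path)
    );
""")

now = time.time()
conn.executemany("INSERT INTO sites (netloc) VALUES (?)",
                 [("a.example",), ("b.example",), ("c.example",)])
# Sites 1 and 2 have never-visited urls; site 3's only url was visited just now.
conn.executemany(
    "INSERT INTO urls (site_id, scheme, path, last_visited) VALUES (?, ?, ?, ?)",
    [(1, "http", "/", 0), (1, "http", "/x", 0),
     (2, "http", "/", 0), (3, "http", "/", now)],
)

SUBQUERY = """
    SELECT DISTINCT ss.site_id
    FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
    WHERE uu.last_visited + ss.crawl_frequency < :current_time
    ORDER BY ss.site_id
    LIMIT :n
"""

def crawlable_site_ids(n):
    """Return up to n ids of sites that have at least one crawlable url."""
    rows = conn.execute(SUBQUERY, {"current_time": now, "n": n}).fetchall()
    return [row[0] for row in rows]

print(crawlable_site_ids(1))
print(crawlable_site_ids(5))
```

Site 1 has two crawlable urls but shows up only once because of DISTINCT, and asking for more sites than exist simply returns the ones that qualify.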

    The enclosing query returns all urls associated with up to n distinct sites, each of which has at least one crawlable url. Note that not all returned urls are necessarily crawlable; the only guarantee is that at least one url per site is, so a returned site can come with non-crawlable urls, too.
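To see that behavior, here is a rough sketch run against an in-memory SQLite database. The trimmed schema, the example netlocs and paths, and the function all_urls_of_crawlable_sites are assumptions made for illustration. Columns are listed explicitly instead of s.*, u.* because both tables have site_id and last_visited columns, which makes name-based access to the result ambiguous.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to result columns by name
conn.executescript("""
    CREATE TABLE sites (
        site_id         INTEGER PRIMARY KEY AUTOINCREMENT,
        netloc          TEXT UNIQUE NOT NULL,
        crawl_frequency REAL DEFAULT 604800
    );
    CREATE TABLE urls (
        url_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        site_id      INTEGER NOT NULL REFERENCES sites (site_id),
        scheme       TEXT NOT NULL,
        path         TEXT NOT NULL,
        last_visited REAL DEFAULT 0,
        UNIQUE (site_id, scheme, path)
    );
""")

now = time.time()
conn.executemany("INSERT INTO sites (netloc) VALUES (?)",
                 [("a.example",), ("b.example",)])
# a.example has one stale and one fresh url; b.example has only a fresh url.
conn.executemany(
    "INSERT INTO urls (site_id, scheme, path, last_visited) VALUES (?, ?, ?, ?)",
    [(1, "http", "/stale", 0), (1, "http", "/fresh", now), (2, "http", "/", now)],
)

QUERY = """
    SELECT s.netloc, u.path
    FROM sites s INNER JOIN urls u ON s.site_id = u.site_id
    WHERE s.site_id IN
        (SELECT DISTINCT ss.site_id
         FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
         WHERE uu.last_visited + ss.crawl_frequency < :current_time
         LIMIT :n)
    ORDER BY s.site_id, u.url_id
"""

def all_urls_of_crawlable_sites(n):
    """Every url of up to n sites that have at least one crawlable url."""
    rows = conn.execute(QUERY, {"current_time": now, "n": n}).fetchall()
    return [(row["netloc"], row["path"]) for row in rows]

print(all_urls_of_crawlable_sites(10))
```

With this data, a.example is selected because of /stale, and its freshly visited /fresh url comes back alongside it; b.example is not returned at all.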

    If only crawlable urls should be returned, the timing condition can be copied into the enclosing query. I couldn’t tell from the question which behavior was required.
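A sketch of that stricter variant follows, again assuming SQLite via Python's sqlite3 module, a trimmed schema, and invented example data. The only change to the query is the extra AND clause in the enclosing WHERE; the function name crawlable_urls is hypothetical.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (
        site_id         INTEGER PRIMARY KEY AUTOINCREMENT,
        netloc          TEXT UNIQUE NOT NULL,
        crawl_frequency REAL DEFAULT 604800
    );
    CREATE TABLE urls (
        url_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        site_id      INTEGER NOT NULL REFERENCES sites (site_id),
        scheme       TEXT NOT NULL,
        path         TEXT NOT NULL,
        last_visited REAL DEFAULT 0,
        UNIQUE (site_id, scheme, path)
    );
""")

now = time.time()
conn.executemany("INSERT INTO sites (netloc) VALUES (?)",
                 [("a.example",), ("b.example",)])
# a.example has one stale and one fresh url; b.example has only a stale url.
conn.executemany(
    "INSERT INTO urls (site_id, scheme, path, last_visited) VALUES (?, ?, ?, ?)",
    [(1, "http", "/stale", 0), (1, "http", "/fresh", now), (2, "http", "/", 0)],
)

CRAWLABLE_ONLY = """
    SELECT s.netloc, u.path
    FROM sites s INNER JOIN urls u ON s.site_id = u.site_id
    WHERE s.site_id IN
        (SELECT DISTINCT ss.site_id
         FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
         WHERE uu.last_visited + ss.crawl_frequency < :current_time
         ORDER BY ss.site_id
         LIMIT :n)
      -- Same timing condition, repeated so that only crawlable urls survive.
      AND u.last_visited + s.crawl_frequency < :current_time
    ORDER BY s.site_id, u.url_id
"""

def crawlable_urls(n):
    """Only the crawlable urls of up to n sites that have any."""
    rows = conn.execute(CRAWLABLE_ONLY, {"current_time": now, "n": n}).fetchall()
    return [tuple(row) for row in rows]

print(crawlable_urls(10))
print(crawlable_urls(1))
```

Keeping the subquery matters even here: the LIMIT still applies to sites, not to urls, so asking for one site returns all of that site's crawlable urls and nothing from any other site.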

    P.S. I’m indulging in pedantry now, but crawl_frequency holds a time interval rather than a rate, which makes me think it could be called crawl_period or crawl_delay instead.

