I’m learning SQLite and looking for techniques to speed up my query. I see people here trying to squeeze out milliseconds, while I’m easily in the “megaseconds”. I have one SQLite db with four tables, although I’m only querying three of them. Here’s the query (I am using R to invoke it):
SELECT a.date, a.symbol, SUM (a.oi*a.contract_close) AS oi, c.ret, c.prc
FROM (SELECT date, symbol, oi, contract_close FROM ann
UNION
SELECT date, symbol AS sym, oi, contract_close FROM qtr
WHERE oi > 100 AND contract_close > 0 AND date > 20090600) a
INNER JOIN
(SELECT date, symbol || '1C' AS sym, ret, prc FROM crsp
WHERE prc > 5 AND date>20090600) c
ON a.date = c.date AND a.symbol = c.sym
GROUP BY a.date, a.symbol
I have an index on each table by date and symbol and just ran VACUUM, but the query is still very slow, as in an hour or more (and notice that I’m only looking for a six-month subset; I really want to query back to 2003).
Is this just a cache size issue? I have a relatively new laptop (MacBook Pro with 4 GB RAM). Thanks!
Here’s the .schema:
CREATE TABLE ann
( "date" INTEGER,
symbol TEXT,
contract_type_1 TEXT,
contract_type_2 TEXT,
product_type TEXT,
block_volume INTEGER,
oi_change INTEGER,
oi INTEGER,
efp_volume INTEGER,
total_volume INTEGER,
name TEXT,
contract_change INTEGER,
contract_open INTEGER,
contract_high INTEGER,
contract_low INTEGER,
contract_close INTEGER,
contract_settle INTEGER
);
CREATE TABLE crsp
( "date" INTEGER,
symbol TEXT,
permno INTEGER,
prc REAL,
ret REAL,
vwretd REAL,
ewretd REAL,
sprtrn REAL
);
CREATE TABLE dly
( "date" INTEGER,
symbol TEXT,
expiration INTEGER,
product_type TEXT,
shares_per_contract INTEGER,
"open" REAL,
high REAL,
low REAL,
"last" REAL,
settle REAL,
change REAL,
total_volume INTEGER,
efp_volume INTEGER,
block_volume INTEGER,
oi INTEGER
);
CREATE TABLE qtr
( "date" INTEGER,
symbol TEXT,
total_volume INTEGER,
block_volume INTEGER,
efp_volume INTEGER,
contract_high INTEGER,
contract_low INTEGER,
contract_open INTEGER,
contract_close INTEGER,
contract_settle INTEGER,
oi INTEGER,
oi_change INTEGER,
shares_per_contract INTEGER,
expiration INTEGER,
product_type TEXT,
unk TEXT,
name TEXT
);
CREATE INDEX idx_ann_date_sym ON ann (date, symbol);
CREATE INDEX idx_crsp_date_sym ON ann (date, symbol);
CREATE INDEX idx_dly_date_sym ON ann (date, symbol);
CREATE INDEX idx_qtr_date_sym ON ann (date, symbol);
You don’t mention the critical piece of information, which is how many rows are in each table and how many are in your result set. A query shouldn’t take an hour unless you have really enormous data sets.
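One way to gather those row counts is sketched below with Python’s built-in sqlite3 module, used here only for illustration since the question invokes SQLite from R. The helper name table_row_counts and the demo rows are made up; against the real database you would open the actual file and pass the real table names.

```python
import sqlite3

def table_row_counts(conn, tables):
    """Return {table_name: row_count} for the given table names."""
    counts = {}
    for name in tables:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts

# Demo on a throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE ann ("date" INTEGER, symbol TEXT)')
conn.executemany(
    "INSERT INTO ann VALUES (?, ?)",
    [(20090701, "ABC1C"), (20090702, "ABC1C")],
)
demo_counts = table_row_counts(conn, ["ann"])
print(demo_counts)
```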
That said, a few things I notice about your query:
I assume you’re aware that in your UNION the WHERE clause only applies to the second table and you’re getting the entire “ann” table included?
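A small sketch of that scoping rule, using Python’s sqlite3 module and made-up rows (the symbols OLD, NEW and OLDQ are purely illustrative). The first query mirrors the shape of the UNION in the question; the second repeats the filter in each branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ann ("date" INTEGER, symbol TEXT, oi INTEGER, contract_close INTEGER);
CREATE TABLE qtr ("date" INTEGER, symbol TEXT, oi INTEGER, contract_close INTEGER);
INSERT INTO ann VALUES (20080101, 'OLD', 50, 10);   -- fails the date and oi filters
INSERT INTO ann VALUES (20090701, 'NEW', 500, 10);  -- passes every filter
INSERT INTO qtr VALUES (20080101, 'OLDQ', 50, 10);  -- fails the date and oi filters
""")

FILTER = "oi > 100 AND contract_close > 0 AND date > 20090600"
COLS = "date, symbol, oi, contract_close"

# Same shape as the question: the WHERE clause belongs to the second SELECT only.
as_posted = conn.execute(
    f"SELECT {COLS} FROM ann UNION SELECT {COLS} FROM qtr WHERE {FILTER}"
).fetchall()

# Filter repeated in each branch of the UNION.
filtered = conn.execute(
    f"SELECT {COLS} FROM ann WHERE {FILTER} "
    f"UNION SELECT {COLS} FROM qtr WHERE {FILTER}"
).fetchall()

posted_symbols = {row[1] for row in as_posted}
filtered_symbols = {row[1] for row in filtered}
print(posted_symbols, filtered_symbols)
```

In the first result the unfiltered ann row (OLD) survives, which is the behaviour described above; only the second form restricts both tables.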
UNION ALL is generally faster than plain UNION unless you really need the de-duplication provided by plain UNION.
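The difference is easy to see on a toy example (Python’s sqlite3 module, made-up data). Note the side effect on the SUM in the question: if the same row really can appear in both ann and qtr, switching to UNION ALL would count it twice, so the faster form is only safe when the two tables do not overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ann ("date" INTEGER, symbol TEXT, oi INTEGER, contract_close INTEGER);
CREATE TABLE qtr ("date" INTEGER, symbol TEXT, oi INTEGER, contract_close INTEGER);
INSERT INTO ann VALUES (20090701, 'ABC1C', 500, 10);
INSERT INTO qtr VALUES (20090701, 'ABC1C', 500, 10);  -- identical row in both tables
""")

COLS = "date, symbol, oi, contract_close"

# UNION removes duplicate rows, which costs extra work.
union_rows = conn.execute(
    f"SELECT {COLS} FROM ann UNION SELECT {COLS} FROM qtr"
).fetchall()

# UNION ALL simply concatenates the two results.
union_all_rows = conn.execute(
    f"SELECT {COLS} FROM ann UNION ALL SELECT {COLS} FROM qtr"
).fetchall()

def total(rows):
    """Sum oi * contract_close, as the outer query in the question does."""
    return sum(oi * close for _, _, oi, close in rows)

print(len(union_rows), total(union_rows))
print(len(union_all_rows), total(union_all_rows))
```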
You do not need to repeat the date filter on both sides of the JOIN. Because the JOIN requires a.date = c.date, filtering one side is enough, and you may get different speeds depending on which side of the JOIN you put the filter. By using it in both places you could possibly be confusing the query optimizer.
I’m not sure what “AS sym” is doing in the second SELECT in the UNION, because that column will be named “symbol” in the output (from the first SELECT in the UNION) and you’re relying on the name symbol in your main SELECT statement.
In your main SELECT statement, c.ret and c.prc are neither wrapped in aggregate functions nor listed in the GROUP BY, so it’s not clear to me what value you expect to see in the results if c contains multiple rows for a single GROUP BY set.
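To illustrate the ambiguity, here is a sketch with two made-up crsp rows that share the same date and symbol. SQLite accepts the non-aggregated column and returns the value from one of the rows in the group, without any guarantee as to which. Naming an aggregate makes the intent explicit; AVG is an arbitrary choice here, not necessarily what the question wants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crsp ("date" INTEGER, symbol TEXT, prc REAL, ret REAL);
INSERT INTO crsp VALUES (20090701, 'ABC', 10.0, 0.01);
INSERT INTO crsp VALUES (20090701, 'ABC', 20.0, 0.02);
""")

# Non-aggregated prc: SQLite picks the value from some row in the group.
bare = conn.execute(
    "SELECT date, symbol, prc FROM crsp GROUP BY date, symbol"
).fetchall()

# Explicit aggregate: the result is well defined.
explicit = conn.execute(
    "SELECT date, symbol, AVG(prc) FROM crsp GROUP BY date, symbol"
).fetchall()

print(bare)
print(explicit)
```

If (date, symbol) is known to be unique in crsp, the two forms give the same answer and the question is moot.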
The JOIN probably cannot use an index on the symbol column of crsp, because the value you join on (symbol || '1C') is calculated in an inner SELECT rather than stored in the table. I’m not sure if there’s a clever way to rewrite the JOIN conditions to be optimizable without storing a calculated symbol value in crsp.
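A minimal sketch of the “store the calculated value” option, using Python’s sqlite3 module and made-up rows. The column name sym and the index name idx_crsp_date_symkey are hypothetical, and the join is simplified to qtr and crsp only. EXPLAIN QUERY PLAN shows whether the planner chooses the new index; its output depends on the SQLite version and the data, so it is printed rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crsp ("date" INTEGER, symbol TEXT, prc REAL, ret REAL);
CREATE TABLE qtr ("date" INTEGER, symbol TEXT, oi INTEGER, contract_close INTEGER);
INSERT INTO crsp VALUES (20090701, 'ABC', 12.5, 0.01);
INSERT INTO qtr VALUES (20090701, 'ABC1C', 500, 10);

-- Store the computed key once instead of computing it in every query.
ALTER TABLE crsp ADD COLUMN sym TEXT;
UPDATE crsp SET sym = symbol || '1C';
CREATE INDEX idx_crsp_date_symkey ON crsp ("date", sym);
""")

# The join now compares two stored columns.
JOIN_SQL = """
SELECT q.date, q.symbol, q.oi * q.contract_close, c.ret, c.prc
FROM qtr AS q
INNER JOIN crsp AS c ON q.date = c.date AND q.symbol = c.sym
WHERE c.prc > 5 AND q.date > 20090600
"""

rows = conn.execute(JOIN_SQL).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [step[-1] for step in conn.execute("EXPLAIN QUERY PLAN " + JOIN_SQL)]
for detail in plan:
    print(detail)
```

The stored column has to be refreshed whenever crsp is reloaded. SQLite 3.9.0 and later also support indexes on expressions, which may be an alternative to the extra column, though whether the planner uses one for this particular query would need to be checked the same way.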
Depending on the distribution of symbol and date values, you might want to reverse the order of the columns in your indexes (but only if you solve the problem of calculating the symbol value).
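Before experimenting with column order, it may be worth confirming which table each index is actually attached to. In the .schema as posted, all four CREATE INDEX statements name ann as the table; if that reflects the real database rather than a copy-and-paste slip, crsp and qtr have no index at all. The sketch below (Python’s sqlite3 module; index_tables and idx_crsp_sym_date are made-up names) reads the mapping from sqlite_master and then creates an index on crsp with the columns reversed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ann ("date" INTEGER, symbol TEXT);
CREATE TABLE crsp ("date" INTEGER, symbol TEXT);
-- Two of the CREATE INDEX statements exactly as posted in the question.
CREATE INDEX idx_ann_date_sym ON ann (date, symbol);
CREATE INDEX idx_crsp_date_sym ON ann (date, symbol);
""")

def index_tables(conn):
    """Return {index_name: table_name} for every index in the database."""
    return dict(
        conn.execute("SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'")
    )

before = index_tables(conn)
print(before)  # both indexes sit on ann despite their names

# An index on crsp itself, with symbol first and date second.
conn.execute('CREATE INDEX idx_crsp_sym_date ON crsp (symbol, "date")')
after = index_tables(conn)
print(after)
```

Which column order performs better depends on the data, so comparing EXPLAIN QUERY PLAN output and timings for both orders is the only reliable guide.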