I have an API where I need to log which ids from a table were returned in a query and, in another query, return results sorted based on that log of ids.
For example:
The tables products and users each have a PK called id. I can create a log table with one insert/update per returned id. I’m wondering about the performance and the design of this.
Essentially, for each returned ID in the API, I would:
INSERT INTO log (product_id, user_id, counter)
VALUES (@the_product_id, @the_user_id, 1)
ON DUPLICATE KEY UPDATE counter=counter+1;
I’d either have an id column as PK or use a combination of product_id and user_id (alternatively, having those two as a UNIQUE index).
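For concreteness, here is a minimal sketch of the composite-key variant. The column types, the InnoDB engine and the key order are assumptions made for illustration; only the table and column names come from the statement above.

```sql
-- Sketch only: types, engine and key order are assumptions.
CREATE TABLE log (
    product_id INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    counter    INT UNSIGNED NOT NULL DEFAULT 1,
    -- The composite primary key is what ON DUPLICATE KEY UPDATE collides on.
    -- Putting user_id first also serves lookups that filter by user.
    PRIMARY KEY (user_id, product_id)
) ENGINE=InnoDB;
```

With this layout no separate id column is needed, since each (user_id, product_id) pair identifies exactly one counter row.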
So the first issue is the performance of this (20 insert/updates and the effect on my select calls in the API) – is there a better/smarter way to log these IDs? Extracting from the webserver log?
Second is the performance of the select statements to include the logged data, to allow a user to see new products every request (a simplified example, I’d specify the table fields instead of * in real life):
SELECT p.*, IFNULL((
SELECT log.counter
FROM log
WHERE log.product_id = p.id
AND log.user_id = @the_user_id
), 0) AS seen_by_user
FROM products AS p
ORDER BY seen_by_user ASC
In our database, the products table has millions of rows, and the users table is growing rapidly. Am I right in my thinking to do it this way, or are there better ways? How do I optimize the process, and are there tools I can use?
Callie, I just wanted to flag a different perspective to keymone, and it doesn’t fit into a comment hence this answer.
Performance is sensitive to the infrastructure environment: are you running on a shared hosting service (SHS), a dedicated private virtual service (PVS), a dedicated server, or even a multiserver config with separate web and database servers?
What are your transaction rates and volumetrics? How many insert/updates are you doing per minute in your two peak trading hours of the day? What are your integrity requirements vis-à-vis the staleness of the log counters?
Yes, keymone’s points are appropriate if you are doing, say, 3-10 updates per second, and as you move into this domain, some form of collection process that batches up inserts to allow bulk insert becomes essential. But just as important here are questions such as the choice of storage engine, the transactional vs batch split, and the choice of infrastructure architecture itself (in-server DB instance vs separate DB server, master/slave configurations …).
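As a rough sketch of what batching could look like, the ids returned by one API call can go into a single multi-row statement instead of 20 separate ones. The literal ids below are made up for illustration, and the statement assumes a PRIMARY or UNIQUE key on (product_id, user_id) as described in the question.

```sql
-- Sketch only: one round trip for all ids returned to user 42.
-- The product and user ids are hypothetical placeholders.
INSERT INTO log (product_id, user_id, counter)
VALUES
    (101, 42, 1),
    (102, 42, 1),
    (103, 42, 1)
-- New pairs are inserted with counter = 1; existing pairs are incremented.
ON DUPLICATE KEY UPDATE counter = counter + 1;
```

Each row is still checked against the unique key individually, so the counts stay correct; the saving is in round trips and statement overhead, not in the work done per row.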
However, if you are averaging <1/sec then INSERT ON DUPLICATE KEY UPDATE has comparable performance to the equivalent UPDATE statements and it is the better approach if doing single row insert/updates as it ensures ACID integrity of the counts.
Any form of PHP process start-up will typically take ~100 ms on your web server, so spawning one just to do an asynchronous update is plain crazy: the performance hit is significantly larger than the update itself.
Your SQL statement just doesn’t jibe with your comment that you have “millions of rows” in the products table, as it will do a full fetch of the products table, executing a correlated subquery on every row. I would have used a LEFT OUTER JOIN myself, with some sort of strong constraint to filter which product items are appropriate to this result set. However it runs, it will take materially longer to execute than any count update.
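Something along these lines is what I have in mind. Treat it as a sketch: the category_id column, the @the_category_id variable and the LIMIT of 20 are placeholders I have invented to stand in for whatever constraint actually fits your data.

```sql
-- Sketch only: category_id, @the_category_id and LIMIT 20 are hypothetical.
SELECT p.*, COALESCE(l.counter, 0) AS seen_by_user
FROM products AS p
LEFT OUTER JOIN log AS l
       ON l.product_id = p.id
      AND l.user_id = @the_user_id      -- keep this in ON, not WHERE
WHERE p.category_id = @the_category_id  -- narrows the candidate rows first
ORDER BY seen_by_user ASC
LIMIT 20;
```

The user_id condition belongs in the ON clause: moving it into WHERE would discard products the user has never seen, which are exactly the rows you want sorted to the top.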