I have this table
attendance (4M rows at the moment, growing 1.2M per week):
--------------------------------------------------------------
| member_id | attendance_week | attendance_date | event_id   |
--------------------------------------------------------------
| INT(10)   | TINYINT(2)      | TIMESTAMP       | TINYINT(3) |
--------------------------------------------------------------
attendance indexes:
--------------------------------------------------
| PRIMARY (attendance_week, member_id, event_id) |
| member_id (member_id)                          |
| event_id (event_id, attendance_week)           |
| total (attendance_week, event_id)              |
--------------------------------------------------
members (400k rows at the moment growing 750 a week):
---------------------------
| member_id | dept_id     |
---------------------------
| INT(10)   | SMALLINT(5) |
---------------------------
members indexes:
-----------------------
| PRIMARY (member_id) |
|
-----------------------
Events are weekly, meaning you’ll see pairs of member_id and event_id for each week.
Now I have to generate a report for a given department showing, for each event, each member’s current attendance (i.e. whether the member has already checked in) and their attendance over at least 4 weeks (i.e. events attended / total events over that period).
This is for the current_attendance part of the report. I fetch all members for a department and LEFT JOIN it with this week’s event to get NULL for absences:
SELECT
m.member_id AS id,
a.event_id AS attended
FROM
members AS m
LEFT JOIN
attendance AS a
ON
a.member_id = m.member_id AND
a.attendance_week = :week AND
a.event_id = :event
WHERE
m.dept_id = :dept
GROUP BY
m.member_id
This is for the attended part of the report:
SELECT
a.member_id,
COUNT(a.event_id)
FROM
attendance a
JOIN
members m
ON
a.member_id = m.member_id AND
m.dept_id = :dept
WHERE
a.attendance_week BETWEEN :start AND :end
GROUP BY
a.member_id
I could probably merge these two queries by simply LEFT JOIN-ing the attendance table again on the first query.
And finally, for the total part:
SELECT
attendance_week,
COUNT(DISTINCT event_id)
FROM
attendance
WHERE
attendance_week BETWEEN :start AND :end
GROUP BY
attendance_week
These are the main queries that will be run against these tables. At the moment, the queries take 150 – 200ms on average (according to phpMyAdmin), which I think is slow. EXPLAIN tells me that my indexes are being used.
So here are my questions:
- Is there any other way that I can revise my indexes and queries to make this faster?
- I assume that MySQL has a cache of compiled statements. I’m not talking about the results cache; think PHP opcode cache vs HTML cache. I already tried SQL_NO_CACHE and I still get the same response time, and query_cache_size is 0. I could swear that I saw phpMyAdmin report the queries at around 800ms once (which is unacceptable), but I don’t get that now. How do I measure the true speed of my queries every time they are run?
- Will these be faster if I put these queries in a stored procedure?
- Any thoughts on storage methods? The database is currently around 400MB in size. After a year, I don’t know, maybe 3GB? Is this scalable? I’m really new when it comes to DBA work; I’ve read about master-slave replication and partitioning, but I don’t know if they are a good fit for this.
If you ever need more info, please comment below. I’ll try to provide it. I really did try to do this alone, but given the demands of a huge database (my largest so far) and high performance, I really need some advice 😀
Thanks
EDIT
I just realized a terrible flaw in my logic: newly registered members will show up as having low attendance, since the 3rd query doesn’t take registration date into account. I have a registration_date column in my members table; is there any way I can incorporate it into the query? Or merge all three queries into just one, since they all return values that depend on each member?
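One possible way to fold registration_date into the total part is sketched below. It is only an approximation under stated assumptions: there is no events table in the schema shown, so the start of each event is inferred from its earliest check-in (the alias first_seen is hypothetical), and registration_date is assumed to be comparable with attendance_date. An event is counted for a member only if its first check-in is not earlier than the member’s registration.

```sql
-- Sketch: per-member count of events the member could have attended,
-- ignoring events that started before the member registered.
-- Assumption: an event "starts" at its earliest recorded check-in.
SELECT
    m.member_id,
    COUNT(e.event_id) AS possible_events   -- 0 when no event qualifies
FROM
    members AS m
LEFT JOIN (
    -- One row per (week, event) pair in the reporting window.
    SELECT
        attendance_week,
        event_id,
        MIN(attendance_date) AS first_seen
    FROM
        attendance
    WHERE
        attendance_week BETWEEN :start AND :end
    GROUP BY
        attendance_week,
        event_id
) AS e
ON
    e.first_seen >= m.registration_date
WHERE
    m.dept_id = :dept
GROUP BY
    m.member_id
```

The derived table has to read attendance_date, so it will probably not be answered from the total index alone; it would be worth checking with EXPLAIN before relying on it.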
EDIT
I’ve managed to merge the first two queries:
SELECT
m.member_id AS id,
a.event_id AS attended,
COUNT(b.event_id) AS total_attended
FROM
members AS m
LEFT JOIN
attendance AS a
ON
a.member_id = m.member_id AND
a.attendance_week = :week AND
a.event_id = :event
LEFT JOIN
attendance AS b
ON
b.member_id = m.member_id AND
b.attendance_week BETWEEN :start AND :end
WHERE
m.dept_id = :dept
GROUP BY
m.member_id
This query runs for 925ms on the first run and 15ms on subsequent requests.
This is the result of the above query’s EXPLAIN
members table:
id: 1
select_type: SIMPLE
table: m
type: ref
possible_keys: dept_id
key: dept_id
key_len: 3
ref: const
rows: 88
Extra: Using where; Using index
attendance table 1 (for the boolean attended part):
id: 1
select_type: SIMPLE
table: a
type: eq_ref
possible_keys: PRIMARY,member_id,event_id,total
key: PRIMARY
key_len: 6
ref: const,arms_db.m.member_id,const
rows: 1
Extra: Using index
attendance table 2 (for the total attended part):
id: 1
select_type: SIMPLE
table: b
type: ref
possible_keys: PRIMARY,member_id,total
key: member_id
key_len: 4
ref: arms_db.m.member_id
rows: 5
Extra: Using index
And the EXPLAIN for the last query:
id: 1
select_type: SIMPLE
table: attendance
type: range
possible_keys: PRIMARY,total
key: total
key_len: 2
ref: NULL
rows: 9
Extra: Using where; Using index for group-by
Adding covering or clustered indexes on the tables will give you the best performance.
You can also add an extra index on the members table:
members indexes: (member_id, dept_id)
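As a statement, the suggested index might look like the sketch below. The index name member_dept is made up for illustration. Whether it helps depends on the storage engine: on InnoDB, for instance, the primary key is already clustered and every secondary index implicitly carries the primary key columns, so comparing EXPLAIN output before and after is advisable.

```sql
-- Sketch: composite index so that lookups by member_id can also read
-- dept_id straight from the index (a covering index for the join).
-- The index name is hypothetical.
ALTER TABLE members
    ADD INDEX member_dept (member_id, dept_id);

-- Confirm which index the optimizer actually picks afterwards.
SHOW INDEX FROM members;
```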
You can enable the Query Cache to cache query output, but the Query Cache doesn’t work with stored procedures. To measure the exact speed of queries you can use the mysqlslap client utility.
Putting the queries inside a stored procedure won’t make much difference in terms of speed, but it will save some of the overhead of query parsing and of sending output to the client.
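For timing a single statement from inside a client session, as opposed to load testing with mysqlslap, session profiling is another option. The sketch below swaps in that technique; the week numbers 10 and 13 are placeholder values, and SHOW PROFILE is deprecated in newer MySQL releases, so treat this as an assumption about the server version in use.

```sql
-- Sketch: measure server-side execution time of one query.
SET profiling = 1;

SELECT
    attendance_week,
    COUNT(DISTINCT event_id)
FROM
    attendance
WHERE
    attendance_week BETWEEN 10 AND 13   -- placeholder week range
GROUP BY
    attendance_week;

-- List recent statements with their durations in seconds.
SHOW PROFILES;

-- Per-stage breakdown for the first profiled statement.
SHOW PROFILE FOR QUERY 1;

SET profiling = 0;
```

Running the query several times and comparing the first duration with later ones separates cold reads from disk from warm reads served out of the buffer pool, which would be consistent with the 925ms versus 15ms difference reported in the question.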
Distributing data over different servers using sharding or replication will help you in terms of scalability. Partitioning on huge tables will also benefit you.
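A rough sketch of what partitioning the attendance table could look like follows. It assumes attendance_week is a week-of-year number (1 to 53); the partition names and boundaries are invented for illustration. Partitioning on attendance_week is possible here because that column is part of the primary key, which MySQL requires of the partitioning column.

```sql
-- Sketch: range-partition attendance by week so that queries filtering
-- on attendance_week only touch the relevant partitions.
-- Partition names and boundaries are hypothetical.
ALTER TABLE attendance
PARTITION BY RANGE (attendance_week) (
    PARTITION p_w01_13 VALUES LESS THAN (14),
    PARTITION p_w14_26 VALUES LESS THAN (27),
    PARTITION p_w27_39 VALUES LESS THAN (40),
    PARTITION p_w40_up VALUES LESS THAN MAXVALUE
);
```

Prefixing a query with EXPLAIN PARTITIONS (or plain EXPLAIN on newer servers) shows whether partition pruning actually takes place for a given week range.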