I’m planning the structure of a MySQL database and could use some advice from more seasoned professionals. The site the DB belongs to gathers 90 days of weather data for EACH registered user, and has to support millions of users.
I already have a table for the users, with their login and contact information, but assume that I need a second table for all the weather data…
What I intend to do is basically store the average temperature, humidity, wind direction and so forth – per day – for every user. Each day the DB is updated with the new day’s data, while keeping the previous entries (limited to 89 days of old data + the current day’s data) – for all users.
Now, does it make most sense to have one huge “data” table that has 90 rows for EVERY user (with millions of users)? Or is there a more clever way to do this that is better for performance reasons or similar?
The 90 days of data will be accessed (READ and displayed etc.) every time a user logs in and views their own profile or browses someone else’s profile. But it will only be updated once per day (overwriting the oldest entry, maintaining the limit of 90 rows per user).
Edit: saw just now that each user has different weather data. Keeping the “shared data” in the answer, but you’re interested in the second case.
Users share weather data
Based, say, on their nearest weather station ID.
I’d store a (userId, stationId, isActive, isPreferred) table to know what data the user is interested in, and then I’d run a query against stationWeatherData to fetch the 90 rows of weather data for that station.
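As a rough sketch of that layout: the table userStation and its four columns follow the description above, while the columns of stationWeatherData (day, temperature, humidity), the sample values and the helper weather_for_user are assumptions made up for illustration. An in-memory SQLite database stands in for MySQL so the snippet runs on its own; the SQL itself should carry over with minor type changes.

```python
import sqlite3

# In-memory SQLite is only a stand-in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userStation (
    userId      INTEGER NOT NULL,
    stationId   INTEGER NOT NULL,
    isActive    INTEGER NOT NULL DEFAULT 1,
    isPreferred INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (userId, stationId)
);
CREATE TABLE stationWeatherData (
    stationId   INTEGER NOT NULL,
    day         TEXT    NOT NULL,   -- ISO date, one row per station per day (assumed column)
    temperature REAL,               -- assumed column
    humidity    REAL,               -- assumed column
    PRIMARY KEY (stationId, day)
);
""")


def weather_for_user(conn, user_id, days=90):
    """Return up to `days` most recent rows for the user's preferred, active station."""
    return conn.execute(
        """
        SELECT w.day, w.temperature, w.humidity
        FROM userStation AS u
        JOIN stationWeatherData AS w ON w.stationId = u.stationId
        WHERE u.userId = ? AND u.isActive = 1 AND u.isPreferred = 1
        ORDER BY w.day DESC
        LIMIT ?
        """,
        (user_id, days),
    ).fetchall()


# Made-up sample data: user 1 prefers station 42.
conn.execute("INSERT INTO userStation VALUES (1, 42, 1, 1)")
conn.executemany(
    "INSERT INTO stationWeatherData VALUES (42, ?, ?, ?)",
    [("2024-01-01", 3.5, 80.0), ("2024-01-02", 4.0, 75.0)],
)
rows = weather_for_user(conn, 1)
```

The composite primary key (stationId, day) is what keeps the 90-row lookup cheap: the rows for one station sit together in the index and are already ordered by day.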
Each user has his own weather data
There shouldn’t be particular problems in handling 900 million rows. If you really had to, you could “shard” on different tables based on userId, e.g., table weather174 would hold data of all users for which (userId % 1000) gives 174, and you’d find yourself with 1000 tables – possibly on different servers – of one thousandth the size.
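The userId-to-table mapping is a one-liner in application code. A minimal sketch, where the function name shard_table and the constant NUM_SHARDS are hypothetical names chosen for this example:

```python
NUM_SHARDS = 1000  # the "1000 tables" from the text


def shard_table(user_id: int, num_shards: int = NUM_SHARDS) -> str:
    """Return the name of the table holding this user's weather rows."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return f"weather{user_id % num_shards}"


table = shard_table(1234174)
```

Since the table name is derived from the user ID alone, no lookup table is needed, but changing the number of shards later means moving data around, so the modulus is worth fixing early.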
So you start with one big table, and prepare for sharding (or for moving to cloud storage and a NoSQL document store such as MongoDB, or a distributed NewSQL database such as VoltDB). Or partition based on UserID as soon as UserID reaches, say, one million.
Or even, you don’t use a database at all. A DB makes sense if you need to search or correlate/join data — here you are just accessing a user’s “weather station”.
If you know you’re never going to query “How many users have 60% humidity?”, but always only “What data are there for user 1234567?”, then you might save the data in a rolling buffer in binary, JSON or HTML format (on cloud storage, S3, or again MongoDB – now only one document per user). Much would then depend on how the data to be updated is arriving, i.e., in one big batch from a concentrator or with each user uploading their own.
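A minimal sketch of such a rolling buffer, assuming one JSON document per user that holds a list of daily entries. The function name append_day and the entry fields are made up for illustration, and how the document is read from and written back to storage is left out.

```python
import json
from collections import deque
from typing import Optional

MAX_DAYS = 90  # 89 days of old data plus the current day


def append_day(doc_json: Optional[str], entry: dict, max_days: int = MAX_DAYS) -> str:
    """Append one day's entry to a user's JSON document, dropping the oldest beyond max_days."""
    existing = json.loads(doc_json) if doc_json else []
    days = deque(existing, maxlen=max_days)  # a bounded deque discards from the left when full
    days.append(entry)
    return json.dumps(list(days))


# Simulate 92 daily updates for one user; only the last 90 survive.
doc = None
for d in range(1, 93):
    doc = append_day(doc, {"day": d, "temperature": 20.0})
days = json.loads(doc)
```

Each daily update is then a read, an append and a write of one small document, and a profile view is a single fetch by user ID.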