I’m working on an application that routinely gathers information from a large number of websites and saves it to a MySQL database with a table for each site. The idea is to create a sort of customizable news feed.
- stackoverflow_table(id, url, title, date)
- reddit_table(id, title, url, author, date)
- github_commit_table(id, commit_message, author, repository, branch, date)
- twitter_table(id, tweet, author, url, date)
- etc…
I want the ability to request any number of news items and filter out certain sites too. As an example:
Show newest 100 items but exclude items from Twitter and GitHub.
It seems like the best way to handle this is to create a table that just has foreign keys and website names.
master_table(id, website, date, foreign_key)
and I can just query the foreign ids I need from this table.
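For illustration, that lookup could be sketched roughly as follows. SQLite (through Python's `sqlite3` module) stands in for MySQL here so the snippet runs on its own, and the sample rows and the `newest_items` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE master_table (
        id          INTEGER PRIMARY KEY,
        website     TEXT NOT NULL,
        date        TEXT NOT NULL,     -- ISO 8601, so string order matches time order
        foreign_key INTEGER NOT NULL   -- id of the row in that site's own table
    )
    """
)
# Hypothetical sample rows, one per site.
conn.executemany(
    "INSERT INTO master_table (website, date, foreign_key) VALUES (?, ?, ?)",
    [
        ("stackoverflow", "2024-01-01T10:00:00", 1),
        ("twitter", "2024-01-01T11:00:00", 1),
        ("reddit", "2024-01-01T12:00:00", 1),
        ("github", "2024-01-01T13:00:00", 1),
    ],
)


def newest_items(conn, limit=100, exclude=()):
    """Return (website, foreign_key) pairs, newest first, skipping excluded sites."""
    sql = "SELECT website, foreign_key FROM master_table"
    params = list(exclude)
    if exclude:
        # One placeholder per excluded site keeps the query parameterized.
        placeholders = ", ".join("?" for _ in exclude)
        sql += f" WHERE website NOT IN ({placeholders})"
    sql += " ORDER BY date DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


# "Show newest 100 items but exclude items from Twitter and GitHub."
items = newest_items(conn, limit=100, exclude=("twitter", "github"))
```

Each returned pair would then be used to fetch the full row from the matching site table.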
Am I going about this horribly wrong?
I’ve actually been working on a similar site. Not for other sites, but a kind of Facebook-like site for a niche community with newsfeeds from various sources. I’ve been pondering this question very heavily the past couple of weeks.
One issue, probably not gamebreaking, but still an issue for me, is that your foreign_key column isn’t literally a foreign key, since it references multiple tables, so it can’t get the benefits of things such as referential integrity enforcement.
What I’m considering is making a GUID table that serves as the source of ids for all of the other tables, and having a table specifically dedicated to the news feed. It might be defined as something like:
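As a rough sketch of that idea (not a definitive schema): the table name `guid`, the `newsfeed` columns other than `ref_id`, and the `new_guid` helper are all assumptions for illustration, and SQLite via Python's `sqlite3` stands in for MySQL so the snippet runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(
    """
    -- Single source of ids shared by every other table.
    CREATE TABLE guid (
        id INTEGER PRIMARY KEY AUTOINCREMENT
    );
    -- One of the per-site tables; its id must have been issued by guid.
    CREATE TABLE reddit_table (
        id     INTEGER PRIMARY KEY REFERENCES guid(id),
        title  TEXT, url TEXT, author TEXT, date TEXT
    );
    -- Hypothetical newsfeed layout: just what the page needs, plus ref_id.
    CREATE TABLE newsfeed (
        id     INTEGER PRIMARY KEY REFERENCES guid(id),
        source TEXT NOT NULL,
        title  TEXT,
        url    TEXT,
        date   TEXT NOT NULL,
        ref_id INTEGER NOT NULL  -- points into a source table; not a true FK
    );
    """
)


def new_guid(conn):
    """Reserve and return a fresh id that is unique across all tables."""
    return conn.execute("INSERT INTO guid DEFAULT VALUES").lastrowid


post_id = new_guid(conn)
conn.execute(
    "INSERT INTO reddit_table (id, title, url, author, date) VALUES (?, ?, ?, ?, ?)",
    (post_id, "Example post", "https://example.com/r/1", "someone", "2024-01-01T12:00:00"),
)
feed_id = new_guid(conn)
conn.execute(
    "INSERT INTO newsfeed (id, source, title, url, date, ref_id) VALUES (?, ?, ?, ?, ?, ?)",
    (feed_id, "reddit", "Example post", "https://example.com/r/1", "2024-01-01T12:00:00", post_id),
)
```

Because every id comes from the one `guid` table, a given `ref_id` can only ever match a row in one source table.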
You could still store information about site postings in their own tables, but now you’re just referencing the one newsfeed table for what to actually display on the page, with the ref_id being a pointer to individual source tables if someone wants to deep-dive into the information. It’s still not ideal because ref_id still isn’t a true foreign key, but it’s arguably a little better.
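To make the "deep-dive" step concrete, here is a minimal sketch of how it might look, again using SQLite through Python's `sqlite3` in place of MySQL. The `newsfeed` columns, the `SOURCE_TABLES` mapping, and both helper functions are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by column name
conn.executescript(
    """
    CREATE TABLE reddit_table (
        id INTEGER PRIMARY KEY, title TEXT, url TEXT, author TEXT, date TEXT
    );
    CREATE TABLE newsfeed (
        id INTEGER PRIMARY KEY, source TEXT NOT NULL, title TEXT,
        date TEXT NOT NULL, ref_id INTEGER NOT NULL
    );
    INSERT INTO reddit_table
        VALUES (7, 'Example post', 'https://example.com/r/7', 'someone',
                '2024-01-01T12:00:00');
    INSERT INTO newsfeed
        VALUES (1, 'reddit', 'Example post', '2024-01-01T12:00:00', 7);
    """
)

# Whitelist of source names to table names, so a table name is never
# built from untrusted input.
SOURCE_TABLES = {"reddit": "reddit_table"}


def feed_page(conn, limit=100):
    """Everything the page displays comes from the newsfeed table alone."""
    return conn.execute(
        "SELECT id, source, title, date, ref_id FROM newsfeed "
        "ORDER BY date DESC LIMIT ?",
        (limit,),
    ).fetchall()


def source_detail(conn, source, ref_id):
    """Follow ref_id into the source table only when details are requested."""
    table = SOURCE_TABLES[source]  # raises KeyError for unknown sources
    return conn.execute(
        f"SELECT * FROM {table} WHERE id = ?", (ref_id,)
    ).fetchone()


entry = feed_page(conn)[0]
detail = source_detail(conn, entry["source"], entry["ref_id"])
```

The extra query per item only runs on demand, so rendering the feed itself never touches the per-site tables.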
You might even want to do something like this instead of ref_id:
with the contents of that column for any given entry being a source-specific data payload. For example, for GitHub posts, it could contain a JSON string such as: