I’m building an open-source website based on user-contributed content, and I think that if developers had access to nightly production SQL dumps, they’d be more likely to check out the code from GitHub and play with it.
In line with that idea, I’m considering either:
- Not collecting private user information at all, using OpenID for accounts and making heavy use of memcache for things like session authentication.
- Anonymizing sensitive data before publishing.
Sometimes I get carried away with ‘wouldn’t it be cool if…?’ ideas, so I’m hoping for a sanity check here. Any obvious flaws in either approach? Is this a sane idea?
Speaking generally, I think you should do both. Any private data you collect is simply a liability for you, and not just because you intend to publish your databases. The less you can collect, the better.
By the same token, however, you probably realize that it is not just IDs and passwords which are sensitive. Remember the AOL search data leak? Or the Netflix Prize dataset? Even without having IDs, people managed to figure out the real identities of some of the accounts, simply by piecing together trails of user behavior and correlating them with data from other places. Some people are embarrassed by their search histories and their movie rentals. Go figure.
Therefore, I think the general rule should be to collect as little as possible, and anonymize what is left. Even if you don’t store the identity of the person corresponding to a certain account, you may want to scramble the record of what each login did, so that a trail of activity cannot be pieced back together into a real person.
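As a rough illustration of "drop what you can, scramble the rest", here is a minimal Python sketch of a pass you might run over rows before writing a public dump. The column names (`email`, `password_hash`, `ip_address`, `user_id`) and the helper functions are hypothetical, not taken from any real schema, and the keyed hash is just one possible way to replace account IDs.

```python
import hashlib
import hmac
import secrets

# Hypothetical column names; adjust to the real schema.
DROP_COLUMNS = {"email", "password_hash", "ip_address"}
PSEUDONYMIZE_COLUMNS = {"user_id"}


def new_dump_key():
    """Generate a fresh random key for one dump; never publish or reuse it."""
    return secrets.token_bytes(32)


def pseudonym(value, key):
    """Replace a value with a keyed hash so the original cannot be looked up."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_rows(rows, key):
    """Return copies of the rows with sensitive columns dropped or replaced."""
    cleaned = []
    for row in rows:
        out = {}
        for column, value in row.items():
            if column in DROP_COLUMNS:
                continue  # never published at all
            if column in PSEUDONYMIZE_COLUMNS:
                out[column] = pseudonym(value, key)
            else:
                out[column] = value
        cleaned.append(out)
    return cleaned


if __name__ == "__main__":
    rows = [
        {"user_id": 1, "email": "a@example.com", "action": "edit"},
        {"user_id": 2, "email": "b@example.com", "action": "view"},
    ]
    for row in anonymize_rows(rows, new_dump_key()):
        print(row)
```

Using a new key for every dump means the same account gets a different pseudonym each night, so activity cannot be linked across dumps. Within a single dump the trail of one pseudonym is still visible, which is exactly the kind of data that was re-identified in the AOL and Netflix cases, so this only helps if the remaining columns are themselves harmless.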
On the other hand, there are some cases where you simply don’t care about this kind of privacy. In Wikipedia, for example, pretty much everything you can do on the site is public anyway. At least, everything which gets recorded in the database. If the information is already available through the API, there is no point in hiding it in a database download.