I am posing this question looking for practical advice on how to design a system.
Sites like Amazon.com and Pandora maintain huge data sets to run their core business. For example, Amazon (and every other major e-commerce site) has millions of products for sale, along with images of those products, pricing, specifications, and so on.
Ignoring the data coming in from third-party sellers and the user-generated content, all that “stuff” had to come from somewhere and is maintained by someone. It’s also incredibly detailed and accurate. How do they do it? Is there just an army of data-entry clerks, or have they devised systems to handle the grunt work?
My company is in a similar situation. We maintain a huge catalog (tens of millions of records) of automotive parts and the cars they fit. We’ve been at it for a while now and have come up with a number of programs and processes to keep our catalog growing and accurate; however, it seems that to grow the catalog to x items we need to grow the team to y.
I need to figure out some ways to increase the efficiency of the data team, and hopefully I can learn from the work of others. Any suggestions are appreciated; even more welcome would be links to content I could spend some serious time reading.
Use visitors.
Even if you have one person per item, there will be wrong records, and customers will find them. So let them mark an item as “inappropriate” and leave a short comment. But don’t forget that they’re not your employees, so don’t ask too much of them. Look at Facebook’s “like” button: it’s easy to use and demands very little effort from the user, which makes it a good return for the cost. If Facebook had a mandatory field asking “why do you like it?”, no one would use that function.
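To make the idea concrete, here is a minimal sketch of a one-click flag with an optional comment. The names `FlagStore`, `flag`, and `most_flagged`, and the `min_flags` threshold, are all hypothetical and chosen only for illustration; a real system would persist flags in a database rather than in memory.

```python
from collections import defaultdict


class FlagStore:
    """In-memory store of visitor flags (hypothetical, for illustration)."""

    def __init__(self):
        # item_id -> list of comments; an empty string means "no comment"
        self._flags = defaultdict(list)

    def flag(self, item_id, comment=""):
        # The comment is optional on purpose: one click must be enough.
        self._flags[item_id].append(comment.strip())

    def most_flagged(self, min_flags=2):
        # Items flagged at least min_flags times, most-flagged first.
        # min_flags is an assumed threshold to filter out one-off noise.
        hits = [(item, len(c)) for item, c in self._flags.items()
                if len(c) >= min_flags]
        return sorted(hits, key=lambda pair: -pair[1])


store = FlagStore()
store.flag("brake-pad-123")
store.flag("brake-pad-123", "does not fit the 2008 model")
store.flag("oil-filter-9")
print(store.most_flagged())
```

Only items that several visitors have flagged reach the data team, so a single mistaken click does not cost anyone review time.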
Visitors also help you implicitly: they visit item pages and use search (both your internal search engine and external ones, like Google). You can learn from that activity. For example, rank items by number of visits, then concentrate more human effort on the top of the list and less on the “long tail”.
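As a rough sketch of that ranking, assuming you can export page views as a plain list of item IDs (one entry per view): the function names and the `head_fraction` cutoff below are hypothetical, and the right cutoff depends on how much review capacity your team has.

```python
from collections import Counter


def rank_items_by_visits(visit_log):
    """Return item IDs ordered from most to least visited.

    visit_log: iterable of item IDs, one entry per page view (assumed format).
    """
    return [item_id for item_id, _ in Counter(visit_log).most_common()]


def split_head_and_tail(ranked_items, head_fraction=0.2):
    """Split a ranked list into a 'head' for manual review and a long 'tail'.

    head_fraction is an assumed tuning knob, not a recommended value.
    """
    cutoff = max(1, int(len(ranked_items) * head_fraction))
    return ranked_items[:cutoff], ranked_items[cutoff:]


log = (["brake-pad"] * 5 + ["oil-filter"] * 3
       + ["wiper", "gasket", "bulb"])
ranked = rank_items_by_visits(log)
head, tail = split_head_and_tail(ranked, head_fraction=0.4)
print("review by hand:", head)
print("long tail:", tail)
```

The head goes to your data team for careful checking; the tail can rely on automated checks and visitor flags until its traffic grows.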