This is a “big” question that I don’t know how to start on, so I hope some of you can point me in a direction. If this is not a “good” question, I will close the thread with an apology.
I wish to go through the database of Wikipedia (let’s say the English one) and compute some statistics. For example, I am interested in how many active editors (a term that still needs to be defined) Wikipedia had at each point in time (let’s say over the last 2 years).
I don’t know how to build such a database, how to access it, how to know which types of data it has and so on. So my questions are:
- What tools do I need for this (besides basic R)? MySQL on my computer? An RODBC database connection?
- How do you start planning for such a project?
You’ll want to start here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Which will take you to here:
http://download.wikimedia.org/enwiki/20100312/
And the file you probably want is:
http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz
You’ll then import the XML into MySQL. Generating a histogram of users per day, week, year, etc. won’t require R; you’ll be able to do that with a single MySQL query, something like a count of distinct users grouped by the date of their edits.
(I’m not sure what their actual schema is, but it’ll be something like that.)
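To make the idea of that single grouping query concrete, here is a minimal sketch. It is not Wikipedia's real schema: the `edits` table, its `user_name` and `ts` columns, and the sample rows are all made up for illustration, and it uses Python's built-in SQLite instead of MySQL so it runs without a server. An "active editor" is assumed here to mean a user with at least `min_edits` edits in a calendar month, which is only one possible definition.

```python
import sqlite3

# Hypothetical data: one row per edit (editor name, ISO-8601 timestamp).
# The real dump will have different fields; this only illustrates the query shape.
SAMPLE_EDITS = [
    ("Alice", "2010-01-03T10:00:00Z"),
    ("Alice", "2010-01-15T11:30:00Z"),
    ("Bob",   "2010-01-20T09:12:00Z"),
    ("Alice", "2010-02-01T08:00:00Z"),
    ("Carol", "2010-02-02T14:45:00Z"),
    ("Carol", "2010-02-09T16:20:00Z"),
]


def active_editors_per_month(rows, min_edits=1):
    """Count users with at least `min_edits` edits in each calendar month."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
    conn.execute("CREATE TABLE edits (user_name TEXT, ts TEXT)")
    conn.executemany("INSERT INTO edits VALUES (?, ?)", rows)
    query = """
        SELECT month, COUNT(*) AS active_editors
        FROM (
            -- one row per (month, user) that meets the activity threshold
            SELECT substr(ts, 1, 7) AS month, user_name
            FROM edits
            GROUP BY substr(ts, 1, 7), user_name
            HAVING COUNT(*) >= ?
        )
        GROUP BY month
        ORDER BY month
    """
    result = conn.execute(query, (min_edits,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for month, count in active_editors_per_month(SAMPLE_EDITS, min_edits=2):
        print(month, count)
```

The same pattern (an inner `GROUP BY` on period and user with a `HAVING` threshold, then an outer count per period) should carry over to MySQL once the dump is imported, though the table and column names will need to match whatever the import produces.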
You’ll run into issues, no doubt, but you’ll learn a lot too. Good luck!