I am trying to scrape some data from a web site. This is the kind of thing that I would usually do in Perl, but I would really like to wean myself off Perl. (I’m not dissing Perl; it’s been a valuable tool but I am distressed by how much I still struggle with the language after more than a decade.) As my needs are simple and performance is seldom an issue for me, I want to shift my web scraping to R. I know some R but I have never used RCurl or similar libraries.
The task is to scrape a database of publicly available data. What complicates it is that I don't know exactly which arguments the form expects: I am just reading the JS source and trying to work out what to include in the RCurl postForm request. The code below does not throw any obvious errors, but it does not return anything useful either.
Q. What am I doing wrong?
[Edited: to reflect changes suggested, but not yet resolved]
library(RCurl)  # unlike require(), stops with an error if RCurl is not installed
## -----------> Form:
## http://jamaserv.jama.or.jp/newdb/eng/index.html
## -----------> Result:
## http://jamaserv.jama.or.jp/newdb/eng/prod4/prod4TsMkEntry.html
#POST /newdb/eng/prod4/prod4TsMkEntry.html makerCd=5&additionBase=1&additionInterval=1&chkSelCnd3=0&car4Cd=100005&termFrom=201103&termTo=201203&prod4TsMkEntryForm%3AdoAction=Server&prod4TsMkEntryForm%2Feng%2Fprod4%2Fprod4TsMkEntry.html=prod4TsMkEntryForm
#POST /newdb/eng/prod4/prod4TsMkEntry.html?pass chkSelCnd3=0&prod4TsMkEntryForm%2Feng%2Fprod4%2Fprod4TsMkEntry.html=prod4TsMkEntryForm&makerCd=5&additionBase=1&termTo=201203&prod4TsMkEntryForm%3AdoAction=Server&additionInterval=1&termFrom=201103&car4Cd=100005
x <- postForm('http://jamaserv.jama.or.jp/newdb/eng/prod4/prod4TsMkEntry.html?pass',
    chkSelCnd3 = '0',
    'prod4TsMkEntryForm/eng/prod4/prod4TsMkEntry.html' = 'prod4TsMkEntryForm',
    makerCd = '5',
    additionBase = '1',
    termTo = '201203',
    'prod4TsMkEntryForm:doAction' = 'Server',
    additionInterval = '1',
    termFrom = '201103',
    car4Cd = '100005',
    .opts = curlOptions(
        referer = 'http://jamaserv.jama.or.jp/newdb/eng/prod4/prod4TsMkEntry.html',
        verbose = TRUE,
        header = TRUE,
        followLocation = TRUE,
        useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'
    )
)
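One thing that can be checked without touching the server is whether the arguments above encode to the same body the browser sent. The following is only a sketch of that check in base R: `encode_form()` is a helper name I made up for illustration (it is not part of RCurl), and `captured` is copied from the second `#POST` comment above.

```r
# Parameters exactly as passed to postForm() above, in the same order.
params <- c(
  chkSelCnd3 = "0",
  "prod4TsMkEntryForm/eng/prod4/prod4TsMkEntry.html" = "prod4TsMkEntryForm",
  makerCd = "5",
  additionBase = "1",
  termTo = "201203",
  "prod4TsMkEntryForm:doAction" = "Server",
  additionInterval = "1",
  termFrom = "201103",
  car4Cd = "100005"
)

# Request body copied from the second "#POST" comment above.
captured <- paste0(
  "chkSelCnd3=0",
  "&prod4TsMkEntryForm%2Feng%2Fprod4%2Fprod4TsMkEntry.html=prod4TsMkEntryForm",
  "&makerCd=5&additionBase=1&termTo=201203",
  "&prod4TsMkEntryForm%3AdoAction=Server",
  "&additionInterval=1&termFrom=201103&car4Cd=100005"
)

# Hypothetical helper: percent-encode names and values, then join them
# as name=value pairs separated by "&" (application/x-www-form-urlencoded).
encode_form <- function(p) {
  enc <- function(s) {
    vapply(s, utils::URLencode, character(1), reserved = TRUE, USE.NAMES = FALSE)
  }
  paste(enc(names(p)), enc(p), sep = "=", collapse = "&")
}

body <- encode_form(params)

# TRUE means the field names and values match the captured request.
print(identical(body, captured))
```

If the two strings match, the field names and values are probably not the problem. As far as I can tell from the RCurl documentation, postForm defaults to `style = 'HTTPPOST'`, which sends multipart/form-data, whereas the captured request is URL-encoded; passing `style = 'POST'` may be worth trying. That is a guess on my part, not a confirmed fix.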
When using a browser, the form looks like this:
[screenshot of the form]

And the above settings return (on a separate page) this:
[screenshot of the results page]
This turned out to be a far more complex problem than it originally appeared, involving server-side JavaScript and all kinds of other stuff. It doesn’t seem feasible with the simple approach I used in this question. So, answering my own question and moving on…