I’m trying to write a script that visits movie pages at boxofficemojo.com and extracts the gross earnings for each movie. I’m writing it as a Google Apps Script because I want to plug it into a spreadsheet.
My original implementation worked well when it only looked for the Domestic Total displayed prominently on the movie page. (For example, http://boxofficemojo.com/movies/?id=clashofthetitans2.htm yields the “$80,882,168” right below “Domestic total as of [date]”.) I wanted to extend the script to get the Worldwide total gross listed under Total Lifetime Grosses, but I am unable to do so and I’m not sure why.
Here is the code in question:
function gross(aUrl)
{
  var page = UrlFetchApp.fetch(aUrl).getContentText();
  var matched = page.match(/Worldwide:<\/b><\/td>.*(\$.*)<td width="25%">/m);
  var amt = "$0";
  if (matched == null)
  {
    matched = page.match(/<b>(\$.*)<\/b>.*Distributor:/m);
    if (matched != null)
    {
      amt = matched[1];
    }
  } else
  {
    amt = matched[1];
  }
  return amt;
}

function testGross()
{
  var result = gross("http://boxofficemojo.com/movies/?id=clashofthetitans2.htm");
  Logger.log(result);
}
It is worth noting that the second regexp works fine but the first one doesn’t. Running testGross() produces the following in the Logs:
null
$80,882,168
I tested the regexp at http://www.rubular.com against the data I get from viewing the page source of the movie page. I’m certain that the page being matched hasn’t been truncated, because when I replace the page.match line with one that emails me the full content of the page variable, I get a page identical to what I see when I view the page source.
Any help would be greatly appreciated.
By looking at the page source of the example you used, I can see that you forgot the “closing” part in the regex. Here is the relevant part:
In your regex, after the (\$.*) part, you don’t consider the </b></td> and the whitespace characters that follow it.
Also, the m modifier does not work as you expect: it only changes how ^ and $ behave, so it makes no difference here. The . will not match newlines. Here is your “fixed” regex:
Anyway, here’s how I’d do it:
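A minimal sketch of the two points above, not the original snippet from this answer. It assumes the amount sits on a different line from the "Worldwide:" label; sampleHtml is a hypothetical stand-in for the real page source, the dollar figure in it is a made-up placeholder, and extractWorldwide is an illustrative name.

```javascript
// Hypothetical stand-in for the page source; the real markup may differ.
// The dollar amount is a placeholder, not a real box office figure.
var sampleHtml = [
  '<tr>',
  '  <td width="40%"><b>Worldwide:</b></td>',
  '  <td width="35%" align="right">&nbsp;<b>$123,456,789</b></td>',
  '  <td width="25%">&nbsp;</td>',
  '</tr>'
].join('\n');

// The m flag only affects ^ and $, so "." still stops at the line break
// and this pattern cannot reach an amount on the following line.
var dotPattern = /Worldwide:<\/b><\/td>.*(\$[\d,]+)/m;

// [\s\S] matches any character, newlines included. The lazy *? stops at
// the first dollar amount that appears after the label.
var spanningPattern = /Worldwide:<\/b><\/td>[\s\S]*?(\$[\d,]+)/;

// Returns the captured amount as a string, or null when nothing matches.
function extractWorldwide(html) {
  var matched = html.match(spanningPattern);
  return matched ? matched[1] : null;
}

console.log(dotPattern.test(sampleHtml));
console.log(extractWorldwide(sampleHtml));
```

In the Apps Script version, the string passed to extractWorldwide would be the result of UrlFetchApp.fetch(aUrl).getContentText(), and a null result can fall back to the Domestic Total pattern as in the question's code. Capturing \$[\d,]+ instead of \$.* also keeps the closing </b></td> out of the captured group.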