I’m trying to write a script to visit links for movies at boxofficemojo.com and

Asked: June 3, 2026

I’m trying to write a script to visit links for movies at boxofficemojo.com and extract gross earnings for the specific movie. I’m writing these scripts as a Google Apps script because I want to plug it into a spreadsheet.

My original implementation worked well when it was just looking for the Domestic Total listed very prominently on the movie page. (For example, for http://boxofficemojo.com/movies/?id=clashofthetitans2.htm it would extract the “$80,882,168” right below the “Domestic total as of [date]” line.) I wanted to extend this script to also get the Worldwide total gross listed under Total Lifetime Grosses, but I am unable to do so and I’m not sure why.

Here is the code in question:

function gross(aUrl)
{
  var page = UrlFetchApp.fetch(aUrl).getContentText();
  var matched = page.match(/Worldwide:<\/b><\/td>.*(\$.*)<td width="25%">/m);
  var amt = "$0";
  if (matched == null)
  {
    matched = page.match(/<b>(\$.*)<\/b>.*Distributor:/m);
    if (matched != null)
    {
      amt = matched[1];
    }

  } else
  {
    amt = matched[1];
  }
  return amt;
}

function testGross()
{
  var result = gross("http://boxofficemojo.com/movies/?id=clashofthetitans2.htm");
  Logger.log(result);
}

It is worth noting that the second regexp works fine but the first one doesn’t. Running testGross() produces the following in the Logs:

null
$80,882,168

I tested the regexp at http://www.rubular.com against the data I get from viewing the page source of the movie page. I’m certain that the page being returned for matching hasn’t been truncated, because when I replace the page.match line with a line that emails me the full content of the page variable, I get a page identical to what I see when I view the page source.

Any help would be greatly appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-03T03:39:09+00:00

By looking at the page source of the example you used, I can see that you forgot the “closing” part in the regex. Here is the relevant part:

<td width="40%">=&nbsp;<b>Worldwide:</b></td>
<td width="35%" align="right">&nbsp;<b>$289,732,168</b></td>
<td width="25%">&nbsp;</td>

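In your regex, nothing after the (\$.*) part accounts for the closing </b></td> and the whitespace (including the line break) that come before <td width="25%">.
Also, the m modifier does not work as you expect, and it makes no difference here: in JavaScript it only changes where ^ and $ match. The . never matches newline characters, so .* cannot cross from one line of the page source to the next. (Rubular tests Ruby regexes, where the m modifier does make . match newlines, which is why the pattern appeared to work there.) Here is your “fixed” regex: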

/Worldwide:<\/b><\/td>[\s\S]*(\$.*)<\/b>[\s\S]*<td width="25%">/m
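The difference between . and [\s\S], and what the m modifier really changes, can be checked in isolation. This is a minimal sketch on a made-up two-line string (the variable name text is only for illustration), using console.log so it runs in any JavaScript engine; in Apps Script you would use Logger.log instead.

```javascript
// Made-up two-line input, standing in for two lines of page source.
var text = "first line\nsecond line";

// "." stops at the line break, with or without the m modifier.
console.log(/first.*second/.test(text));      // false
console.log(/first.*second/m.test(text));     // false

// [\s\S] matches any character, including "\n".
console.log(/first[\s\S]*second/.test(text)); // true

// The m modifier only affects the anchors: ^ may now match after a line break.
console.log(/^second/.test(text));            // false
console.log(/^second/m.test(text));           // true
```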

Anyway, here’s how I’d do it:

/Worldwide:<\/b><\/td>[\s\S]*?<b>(\$.+)<\/b><\/td>/
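To try this pattern without fetching the page, here is a small sketch that runs it against the three lines of markup quoted above. The helper name extractWorldwide and the variable sampleHtml are hypothetical, and the live page may contain different markup than this excerpt.

```javascript
// Markup excerpt quoted in this answer; the live page may differ.
var sampleHtml = [
  '<td width="40%">=&nbsp;<b>Worldwide:</b></td>',
  '<td width="35%" align="right">&nbsp;<b>$289,732,168</b></td>',
  '<td width="25%">&nbsp;</td>'
].join("\n");

// Hypothetical helper: returns the worldwide gross as a string, or null if not found.
function extractWorldwide(html) {
  // [\s\S]*? crosses line breaks lazily, so it stops at the first <b>$...</b></td>.
  var matched = html.match(/Worldwide:<\/b><\/td>[\s\S]*?<b>(\$.+)<\/b><\/td>/);
  return matched === null ? null : matched[1];
}

// The pattern from the question, for comparison.
var originalPattern = /Worldwide:<\/b><\/td>.*(\$.*)<td width="25%">/m;

console.log(extractWorldwide(sampleHtml));       // $289,732,168
console.log(sampleHtml.match(originalPattern));  // null
```

In the Apps Script function from the question, the same helper could be called on UrlFetchApp.fetch(aUrl).getContentText(), falling back to the Domestic Total pattern when it returns null.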
