I am trying to write a program that runs the lynx command on this page, “http://www.rottentomatoes.com/movie/box_office.php”, and I can’t seem to wrap my head around one problem: getting the title by itself. A title can contain special characters and numbers, and titles vary in length. I want to write a regex that can parse the entire page and find lines like these:
(I added extra spaces between the title and the next number, which is how many weeks the film has been out, to distinguish the title from the weeks released.)
1 -- 30% The Vow 1 $41.2M $41.2M $13.9k 2958
2 -- 53% Safe House 1 $40.2M $40.2M $12.9k 3119
3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470
4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655
5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908
The regex I have started out with is:
/(\d+)\s(\d+|\-\-)\s(\d+\%)\s
If someone can help me figure out how to grab the title successfully, that would be much appreciated! Thanks in advance.
Capture all the things!!
Explained:
To be serious, here is what I’ve done: I cheated a bit and captured everything (as I think you’ll do in the end), so that the rest of the pattern works like a lookahead for the title capture.
In a non-greedy regex, (.*) (or (.*?) if you want to force the “ungreediness”) captures as few characters as possible, while the end of the regex tries to capture everything else. The title group therefore ends up capturing only the title (the only thing left).
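A minimal sketch of the capture-everything idea, written in Python since the thread does not say which language the program uses. The pattern, the name ROW, and the helper parse_row are illustrative assumptions, not the regex originally posted. It also assumes the three money columns always look like $41.2M or $13.9k and that the last column is a plain integer, as in the sample lines from the question.

```python
import re

# Hypothetical pattern: every column gets its own group, and the lazy
# (.*?) title group is left with whatever sits between the percentage
# and the weeks-released column.
ROW = re.compile(
    r"^(\d+)\s+(\d+|--)\s+(\d+%)\s+"  # first three columns, as in the question's regex
    r"(.*?)\s+"                        # title (lazy)
    r"(\d+)\s+"                        # weeks released
    r"(\$[\d.]+[Mk])\s+(\$[\d.]+[Mk])\s+(\$[\d.]+[Mk])\s+"  # three money columns
    r"(\d+)$"                          # final integer column
)


def parse_row(line):
    """Return all captured columns as a tuple, or None if the line does not match."""
    match = ROW.match(line.strip())
    return match.groups() if match else None


sample = "4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655"
print(parse_row(sample)[3])  # the title group
```

Because the tail of the pattern is so specific, digits inside a title (the "2" in "Journey 2: ..." or the "3" in "(in 3D)") are not mistaken for the weeks column: the lazy group keeps growing until everything after it matches.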
Alternatively, you can use an actual lookahead and make assertions about what follows the title.
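The lookahead variant could look something like this, again as a Python sketch under the same assumptions about the column formats. TAIL, TITLE, and title_of are made-up names for illustration. Only the title is captured here; the trailing columns are asserted by the lookahead but not consumed.

```python
import re

# Hypothetical tail: weeks released, three money columns, final integer.
TAIL = r"\s+\d+(?:\s+\$[\d.]+[Mk]){3}\s+\d+$"

# The lazy title group stops at the first position where the lookahead
# (?=...) confirms that only the tail columns remain on the line.
TITLE = re.compile(r"^\d+\s+(?:\d+|--)\s+\d+%\s+(.*?)(?=" + TAIL + r")")


def title_of(line):
    """Return just the title from one box office row, or None if it does not match."""
    match = TITLE.match(line.strip())
    return match.group(1) if match else None


print(title_of("3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470"))
```

If you need the other columns as well, the capture-everything form is simpler; the lookahead form is mostly useful when the title is the only thing you want.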