I need to extract this URL from a messy HTML page:
……… http://www.imdb.com/title/tt0092699/ ……
Obviously, the URL can also be:
http://www.imdb.co.uk/title/tt0092699/
http://www.imdb.es/title/tt0092699/
http://www.imdb.com/title/tt0092699
https://www.imdb.com/title/tt0092699/
https://www.imdb.com/title/tt0092699
(that is, any domain suffix, http or https, with or without the trailing slash)
Use this regex:
/https?:\/\/www.imdb\..*?\/title\/tt\d+\/?/
The URL you want will be in $matches[0].
Here's the regex meaning, broken down piece by piece:
/ => start regex
https? => literal "http" followed by an optional "s"
:\/\/www.imdb\. => literal "://www.imdb." (the dot after "www" is unescaped, so strictly it matches any single character; write www\. to be exact)
.*?\/ => matches the shortest string possible before a slash, then a slash; will match the domain end, whatever it is (com, co.uk, es, etc.) and the first slash following it
title\/ => literal "title/"
tt\d+ => literal "tt" followed by at least one digit (it's a greedy match, so it will match as many consecutive digits as it can); will match ids in the format you provided
\/? => optional final "/"
/ => end regex
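
For anyone who wants to try the pattern quickly, here is a minimal sketch in Python rather than the language of the answer above. The pattern is the same one assembled from the pieces in the breakdown, with two small assumptions of mine: the dot after "www" is escaped, and the / delimiters are dropped because Python does not use them. The function name and the sample string are made up for illustration.

```python
import re

# Same pieces as in the breakdown, written without / delimiters.
# Assumption: the dot after "www" is escaped, which is slightly stricter.
IMDB_TITLE_URL = re.compile(r"https?://www\.imdb\..*?/title/tt\d+/?")

def find_imdb_url(html):
    """Return the first IMDb title URL found in html, or None if there is none."""
    match = IMDB_TITLE_URL.search(html)
    # group(0) is the whole match, the counterpart of $matches[0] above.
    return match.group(0) if match else None

# Hypothetical messy input, invented for this example.
sample = '<td><a href="https://www.imdb.co.uk/title/tt0092699">link</a></td>'
print(find_imdb_url(sample))
```

One thing to keep in mind: .*? will happily run past the domain if another "/title/tt" shows up later on the same line, so on very noisy pages a tighter piece such as [^/]* in its place may be safer.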