I have an incomplete binary file that contains some information I would like to recover using a regex. The contents are:
G $12.Angry.Men.1957.720p.HDTV.x264-HDLH Lhttp://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/ L I Š M ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iONN Phttp://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&
How can I parse it so I can at least get links that are:
http://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/
where 428687 is the id number.
So I would have a full link and an id.
The names that come before the links are the titles of the links:
ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iON
Though I am not sure if these can be parsed. I noticed they all have a character before and after the LINKS and the NAMES. So maybe this can narrow down the problem?
Btw I am willing to give 500 bounty for the correct answer.
Something like the following regular expression?
It will grab links: it starts at http://, then takes everything that is not a space (spaces are guaranteed not to appear in HTTP (URI) links), and assumes the link ends with digits and a trailing slash (this correctly drops the & in your example, or any other trailing text).

EDIT: the whole match is the link, and the ID is in the first capturing group. The code was updated to show how to get the info.
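A minimal sketch of a pattern that fits this description, written in Python (the language is my assumption, the question does not name one). The sample string is copied from the question with its control bytes shown as plain characters, and the names `LINK_RE` and `links` are only for illustration.

```python
import re

# Sample text modeled on the question's data.
data = (
    "G $12.Angry.Men.1957.720p.HDTV.x264-HDLH "
    "Lhttp://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/ "
    "L I Š M ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iONN "
    "Phttp://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&"
)

# http://, then the shortest run of non-space characters that ends in
# dash + digits + slash. The digits are captured as group 1 (the ID).
LINK_RE = re.compile(r"http://\S+?-(\d+)/")

# group(0) is the whole link, group(1) is the ID.
links = [(m.group(0), m.group(1)) for m in LINK_RE.finditer(data)]

for url, link_id in links:
    print(link_id, url)
```

For the real file, one option is to read it in binary mode and decode it as Latin-1 before matching, since that decoding never fails on arbitrary bytes.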
Update: if dash+digits+slash can occur more than once in the URL, then a greedy quantifier must be used, but then consecutive links (with no space-containing text between them) will be matched together as one. If dash+digits+slash occurs only once per URL, then a lazy quantifier is preferred. Laziness is what the code above currently uses.
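The trade-off can be seen in a small Python sketch. Both input strings are made-up examples, not data from the question.

```python
import re

lazy = re.compile(r"http://\S+?-(\d+)/")    # stops at the first -digits/
greedy = re.compile(r"http://\S+-(\d+)/")   # runs to the last -digits/

# Hypothetical URL in which dash+digits+slash occurs twice.
twice = "http://site.com/forum/f-89/some-name-123/"
# Two hypothetical links with nothing between them.
adjacent = "http://site.com/a-1/http://site.com/b-2/"

lazy_twice = lazy.search(twice).group(0)      # cut short at "-89/"
greedy_twice = greedy.search(twice).group(0)  # the full URL

lazy_adjacent = [m.group(0) for m in lazy.finditer(adjacent)]      # two links
greedy_adjacent = [m.group(0) for m in greedy.finditer(adjacent)]  # one merged match

print(lazy_twice, greedy_twice, lazy_adjacent, greedy_adjacent, sep="\n")
```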
Alternative approach
From the updates and the extra information, I understand that a lot about the text is still unclear. Another approach might be easier: split everything on http:// and go through the results. This avoids a complex look-ahead/look-behind regex and makes sure that consecutive links (i.e., without text in between) are treated correctly.

Update: alternative approach updated. The text (name) comes first, then the URL. Note the negative look-behind expression, which splits on a zero-width spot, taking anything before the URL up to the end of the URL.
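A rough Python sketch of the split idea follows. It uses a plain `str.split` on http:// instead of a look-behind split, so it is a simpler variant rather than the exact expression described above. The name extraction is a guess based on the observation in the question that every name has one extra character before and after it; `parse_records` and `LINK_BODY_RE` are hypothetical names.

```python
import re

data = (
    "G $12.Angry.Men.1957.720p.HDTV.x264-HDLH "
    "Lhttp://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/ "
    "L I Š M ,ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iONN "
    "Phttp://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&"
)

# After the split, a chunk starts with the link body (assumed to end in
# -digits/); whatever follows is the text preceding the next link.
LINK_BODY_RE = re.compile(r"(\S+?-(\d+)/)(.*)", re.DOTALL)


def parse_records(text):
    """Return (name, url, id) tuples; the name is a best-effort guess."""
    chunks = text.split("http://")
    before = chunks[0]  # text preceding the first link
    records = []
    for chunk in chunks[1:]:
        match = LINK_BODY_RE.match(chunk)
        if match is None:
            # No -digits/ ending (e.g. a truncated link): skip this chunk.
            before = chunk
            continue
        body, link_id, rest = match.groups()
        tokens = before.split()
        # Assumption: the name is the longest token before the link,
        # wrapped in one extra character on each side.
        name = max(tokens, key=len)[1:-1] if tokens else ""
        records.append((name, "http://" + body, link_id))
        before = rest
    return records


records = parse_records(data)
for name, url, link_id in records:
    print(link_id, name, url)
```

Because every chunk is handled on its own, two links that directly follow each other end up as two separate records, at the cost of an empty name for the second one.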