I’ve been working on this simple script all day trying to figure it out. I’m new to regex so please keep that in mind. On top of that, I’ve tried just about anything and everything I could to get this to work.
I’m trying to download a TSV file from Yahoo Site Explorer (this is to learn, so please don’t point me to the API) via either cURL or file_get_contents (both work, I’m just experimenting with each) and then use regex to extract only the URL column. I realize I might have more luck with other functions, but I can’t find anything dealing with TSV and now it’s become a challenge. I’ve literally spent the entire day trying to get this correct.
So a URL would be:
https://siteexplorer.search.yahoo.com/search?p=www.google.com&bwm=i&bwmo=&bwmf=s
And my regex currently looks like this (I know it’s horrible…it’s probably the millionth attempt):
preg_match_all('((http(s?)://?(([^/]+(\/.+))))^[\t]$)', $dl, $matches);
My issue right now is that there are 4 columns: TITLE, URL, SIZE, FORMAT. I’m able to strip out everything from the first (TITLE) and last (FORMAT) columns, but I cannot seem to strip out the SIZE column, or make the trailing slash optional for linking sites whose URLs don’t end in one.
Another thing: I’ve actually managed to get JUST the URL to appear, but only for URLs that end in a slash, which leaves out links from, say, Twitter.
Any help would be greatly appreciated!
Don’t know much about PHP, but this regex works in Python (should be the same in PHP):
Just match it and get the content of group 1. FWIW, code in Python:
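A minimal sketch of the "match and take group 1" idea, assuming each row holds four tab-separated fields in TITLE, URL, SIZE, FORMAT order. The pattern, the `extract_urls` helper, and the sample rows are illustrative assumptions, not the exact regex from this answer.

```python
import re

# Illustrative pattern (an assumption, not the original answer's regex):
# skip the TITLE field, then capture the URL field up to the next tab,
# leaving an optional trailing slash outside the capture group.
URL_COLUMN = re.compile(r"^[^\t\n]*\t(https?://[^\t\n]*?)/?(?=\t)", re.MULTILINE)


def extract_urls(tsv_text):
    """Return the URL column (group 1) of every row that matches."""
    return URL_COLUMN.findall(tsv_text)


# Made-up sample rows in TITLE, URL, SIZE, FORMAT order.
sample = (
    "TITLE\tURL\tSIZE\tFORMAT\n"
    "Example page\thttp://www.example.com/\t1024\ttext/html\n"
    "Some tweet\thttp://twitter.com/someuser\t2048\ttext/html\n"
)

print(extract_urls(sample))
# ['http://www.example.com', 'http://twitter.com/someuser']
```

The lazy quantifier plus the `(?=\t)` lookahead is what stops the match at the end of the URL column, so SIZE and FORMAT never get captured, and the `/?` sits outside the group so URLs with and without a trailing slash both match. The same pattern should carry over to PHP's `preg_match_all` with the `m` modifier, though that is untested here.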