Possible Duplicate:
R regular expression: http matching
I’m working to capture URLs from a chunk of source code using regex.
The URLs follow a pattern and are in the following form:
- http://www.google.com/…./1-1,1"
- http://www.google.com/…./1-2,2"
- http://www.google.com/…./1-20,20"
So far I can get to the URL using the following code:
pattern = paste("1-", 1:20,",", 1:20, "\"", sep="")
This gives me a vector of:
- 1-1,1
- 1-2,2
- …..
- 1-20,20
Then I can use this vector to find the position of the URLs inside the source code.
Let’s say for example that the whole source code is simply: “http://www.google.com/word/1-1,1>”
`regexpr("1-1,1", test1k, TRUE)`
gives me:
[1] 28
attr(,"match.length")
[1] 5
This means that the pattern 1-1,1 starts at position 28. Given this information, how would I select the whole URL starting at “http://ww…” until the end “1-1,1>”?
I guess what I’m asking is: given the position 28, is there a function to select the nearest “http://” string going backwards (this marks the start of the URL)? Similarly, given the position 28, is there a way to select the nearest “>” character going forward (this marks the end of the URL)?
Rather than creating all possible combinations, just use the `\\d` character class, which matches any digit. For example, `1-\\d+,\\d+` matches every suffix from 1-1,1 to 1-20,20.
To select the whole URL, start the regular expression with “http” and have it continue until the first time this pattern is matched. One simple way is `http.*?1-\\d+,\\d+`.
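As a minimal sketch of both steps, assuming the source code is held in a single string (the name `src` and its contents are made up for illustration, and `perl = TRUE` is used so the lazy quantifier is handled by the PCRE engine):

```r
# Hypothetical source text containing two URLs that follow the pattern.
src <- 'a href="http://www.google.com/word/1-1,1" b href="http://www.google.com/other/1-20,20"'

# \\d+ matches one or more digits, so one pattern covers 1-1,1 through 1-20,20.
suffix <- "1-\\d+,\\d+"
has_suffix <- grepl(suffix, src, perl = TRUE)

# Lazy match from each "http" up to the first suffix that follows it.
url_pattern <- paste0("http.*?", suffix)
urls <- regmatches(src, gregexpr(url_pattern, src, perl = TRUE))[[1]]
print(urls)
```

Note that this pattern is looser than the original vector of twenty strings: it also accepts suffixes such as 1-99,7, which may or may not matter for the page being scraped.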
The `.*?` pattern has three parts. The `.` matches any character, the `*` means “any number of that character”, and the `?` means that it’s not greedy (otherwise, the match would take up the entire string from the first http to the last `1-\\d+,\\d+`).
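The difference between the greedy and non-greedy forms can be seen in a small sketch like the one below. The input string `x` is a hypothetical example with two URLs on one line, not data from the question.

```r
# Hypothetical input with two URLs in the same string.
x <- "http://www.google.com/a/1-1,1> text http://www.google.com/b/1-2,2>"

greedy <- "http.*1-\\d+,\\d+"   # .* takes as much as it can
lazy   <- "http.*?1-\\d+,\\d+"  # .*? stops at the first suffix it reaches

greedy_match <- regmatches(x, regexpr(greedy, x, perl = TRUE))
lazy_match   <- regmatches(x, regexpr(lazy, x, perl = TRUE))

print(greedy_match)  # runs from the first "http" to the last suffix, "1-2,2"
print(lazy_match)    # "http://www.google.com/a/1-1,1"
```

Using `gregexpr` instead of `regexpr` with the lazy pattern would return each URL as its own match.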