I’m having some difficulty figuring out a regular expression for stripping part of the string within a particular XML tag and replacing it. I have a number of URL paths with variable parts, so I need to find everything between a certain string and the last slash in the URL. For example, I might have tags and URLs that look like this:
<bpoc:resourceMetadataLoc>http://app01/media/images/I//1951-1960_Embark_Object_Photos/1957.59.jpg</bpoc:resourceMetadataLoc>
or
<bpoc:resourceMetadataLoc>http://app01/media/images/CONTEMPORARY/1986-2005/1991.2.jpg</bpoc:resourceMetadataLoc>
The output should look like
<bpoc:resourceMetadataLoc>http://app01/media/Previews/1957.59.jpg</bpoc:resourceMetadataLoc>
This is about as far as I got, but it matches through the last slash in the whole string (the one in the closing tag), not the last slash in the URL:
(<bpoc:resourceMetadataLoc>http://app01/media/images)+(.*[/])
That regex will capture the following:
<bpoc:resourceMetadataLoc>http://app01/media/images/I//1951-1960_Embark_Object_Photos/1957.59.jpg</
What would I need to add to the regex to exclude the </bpoc:resourceMetadataLoc> bit from the match and then capture everything prior to the last slash in the URL?
Because this is XML, there can’t be a (non-escaped) < or > in the URL itself. You can use this to your advantage by matching [^<]* instead of .*, with a pattern of the form [^<]*/([^<]*), so that the match can never run into the closing tag.
This should capture the last segment (e.g. “1957.59.jpg”) of the URL. It works by greedily matching everything up to the start of the end-of-tag (the first [^<]*), then backtracking to match the nearest (i.e. last) /, then capturing everything after that slash (the ([^<]*)) into group 1 so that you can use it during the replacement step.
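As a rough sketch of how the replacement step might look, here is the approach in Python. The exact pattern is an assumption pieced together from the parts described above ([^<]*, then /, then ([^<]*)), anchored on the opening tag and the fixed http://app01/media/images/ prefix from the question; the names PATTERN and to_preview are made up for this illustration.

```python
import re

# Assumed pattern: fixed prefix, then [^<]* (greedy, but it cannot cross
# the "<" of the closing tag), then the last "/", then the file name
# captured as group 1.
PATTERN = re.compile(
    r"<bpoc:resourceMetadataLoc>http://app01/media/images/[^<]*/([^<]*)"
)

def to_preview(xml: str) -> str:
    """Replace the variable folders after /media/ with 'Previews' (hypothetical helper)."""
    return PATTERN.sub(
        r"<bpoc:resourceMetadataLoc>http://app01/media/Previews/\g<1>",
        xml,
    )

sample = (
    "<bpoc:resourceMetadataLoc>http://app01/media/images/I//"
    "1951-1960_Embark_Object_Photos/1957.59.jpg</bpoc:resourceMetadataLoc>"
)
print(to_preview(sample))
```

The closing tag is never part of the match, so it is left in place untouched. If your tool uses $1 rather than \1 or \g<1> for back-references, adjust the replacement string accordingly.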