I am trying to extract JavaScript code from HTML content that I receive via a CFHTTP request.
I have this simple regex that catches everything as long as there is no line break in the code between the tags.
var result=REMatch("<script[^>]*>(.*?)</script>",html);
This will catch:
<script>testtesttest</script>
but not
<script>
testtest
</script>
I have tried to use (?m) for multiline, but it doesn’t work like that.
I am using the reference to figure it out but I am just not getting it with regex.
Heads up: normally there would be JavaScript between the script tags, not simple text, so the content also includes characters like {}();:-_ etc.
Can anyone help me out?
Cheers
[[UPDATE]]
Thanks guys, I will try the solutions. I favor regex, but I will look into the HTML parser too.
(?m) multiline mode is for making ^ and $ match on line breaks (not just start/end of string as is default), but what you’re trying to do here is make . include newlines – for that you want (?s) (dot-all mode).
However, I probably wouldn’t do this with regex – a HTML parser is a more robust solution. Here’s how to do it with jSoup:
More details on using jSoup in CF are available here, or alternatively you can use the TagSoup parser, which ships with CF10 (so you don’t need to worry about jars/etc).
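To make the (?m) versus (?s) distinction concrete, here is a minimal sketch in plain Java (ColdFusion runs on the JVM, though the regex engine behind REMatch may differ in details). The class and method names are made up for illustration; the pattern is the one from the question with a flag prefix added.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns the body of the first <script> block matched with the given
    // inline flag prefix, or null if the pattern does not match at all.
    static String firstScriptBody(String flags, String html) {
        Matcher m = Pattern.compile(flags + "<script[^>]*>(.*?)</script>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>\ntesttest\n</script>";
        // No flag: '.' stops at line breaks, so the multi-line block is missed.
        System.out.println(firstScriptBody("", html));
        // (?m) only changes what ^ and $ match; '.' still stops at line breaks.
        System.out.println(firstScriptBody("(?m)", html));
        // (?s) lets '.' match line breaks, so the block body is captured.
        System.out.println(firstScriptBody("(?s)", html));
    }
}
```

The first two calls find nothing and only the (?s) call returns the block body, which is the same behaviour described in the question.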
If you really want regex, then you can use this:
Unlike using (?s).*? this avoids matching empty blocks (but it will still fail in certain edge cases – if accuracy is required use a HTML parser).
To extract just the text from the first script block, you can strip the script tag with this:
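As a rough illustration of both ideas (skipping empty blocks, then stripping the opening tag), here is a sketch in plain Java. It is not the snippet from the original answer: the pattern, the class name ScriptBlocks and the helpers findBlocks and stripOpeningTag are assumptions chosen for the example, and the same caveat about edge cases applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptBlocks {
    // Hypothetical pattern: an opening script tag followed by at least one
    // character of body, stopping just before the closing </script> tag.
    // Requiring one or more body characters is what skips empty blocks.
    static final Pattern NON_EMPTY_BLOCK = Pattern.compile(
        "<script[^>]*>(?:[^<]+|<(?!/script>))+", Pattern.CASE_INSENSITIVE);

    // Collects every non-empty block (opening tag plus body, no closing tag).
    static List<String> findBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = NON_EMPTY_BLOCK.matcher(html);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    // Removes the leading <script ...> tag, leaving only the script text.
    static String stripOpeningTag(String block) {
        return block.replaceFirst("(?i)^<script[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<script src=\"a.js\"></script>\n"
                    + "<script>\nif (a < b) { go(); }\n</script>";
        List<String> blocks = findBlocks(html);
        System.out.println(blocks.size());                  // the empty src-only block is skipped
        System.out.println(stripOpeningTag(blocks.get(0))); // body of the first non-empty block
    }
}
```

Note that a body consisting only of whitespace still counts as non-empty here, and a "</script>" inside a JavaScript string literal would end the match early, which is one of the edge cases where a parser is the safer choice.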