In the following content samples, I wrapped the lines to make them easier to read on Stack Overflow (so you don’t have to scroll to the right to look at the examples).
Content A:
"Lorem Ipsum\r\n
[img]http://example.org/first.jpg[/img]\r\n
[img]http://example.org/second.jpg[/img]\r\n
more lorem ipsum ..."
Content B:
"Lorem Ipsum\r\n
[img caption="Sample caption"]http://example.org/third.jpg[/img]
[img]http://example.org/fourth.jpg[/img]"
Content C:
"Lorem Ipsum [img]http://example.org/fifth.jpg[/img]\r\n
more lorem ipsum\r\n\r\n
[img caption="Some other caption"]http://example.org[/img]"
What I’ve tried:
content.match(/\[img\]([^<>]*)\[\/img\]/imu)
return example: "[img]...[/img]\r\n[img]...[/img]"
content.scan(/\[img\]([^<>]*)\[\/img\]/imu)
return example: "...[/img]\r\n[img]..."
What I would like to accomplish when running the scan/match/regex solution over the three content examples above is to get every occurrence of [img]...[/img] and [img caption="?"]...[/img] and put them in an array for later use.
Array
1 : A : [img]http://example.org/first.jpg[/img]
2 : A : [img]http://example.org/second.jpg[/img]
3 : B : [img caption="Sample caption"]http://example.org/third.jpg[/img]
4 : B : [img]http://example.org/fourth.jpg[/img]
5 : C : [img]http://example.org/fifth.jpg[/img]
6 : C : [img caption="Some other caption"]http://example.org[/img]
It would also be helpful to limit the “stripped content” to cases where there is both an opening and a closing tag, meaning that a [img] / [img caption="?"] with no [/img] after it should be ignored.
I’ve read http://www.ruby-doc.org/core-1.9.3/String.html up and down but can’t find anything that seems to work for this.
Update:
So I figured that this:
\[img([^<>]*)\]([^<>]*)\[\/img\]
will find either:
[img]something[/img]
and:
[img caption="something"]something[/img]
Now I just need to know how to catch every occurrence inside the different contents. So far I only ever get everything from the first [img] to the last [/img] tag, so any other Lorem Ipsum in between gets grabbed too.
You can use
/\[img(?:\s+caption=".+")?\].+?\[\/img\]/
to scan the documents:
Which generates:
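A minimal sketch of that scan and the array it produces, assuming the three samples from the question are stored in a hash named `contents` (the hash and the variable names are illustrative, not part of the question):

```ruby
# Sample inputs from the question, with the display line-wrapping removed.
# The names `contents`, `img_tag` and `tags` are illustrative assumptions.
contents = {
  "A" => "Lorem Ipsum\r\n[img]http://example.org/first.jpg[/img]\r\n" \
         "[img]http://example.org/second.jpg[/img]\r\nmore lorem ipsum ...",
  "B" => "Lorem Ipsum\r\n[img caption=\"Sample caption\"]http://example.org/third.jpg[/img]" \
         "[img]http://example.org/fourth.jpg[/img]",
  "C" => "Lorem Ipsum [img]http://example.org/fifth.jpg[/img]\r\nmore lorem ipsum\r\n\r\n" \
         "[img caption=\"Some other caption\"]http://example.org[/img]"
}

img_tag = /\[img(?:\s+caption=".+")?\].+?\[\/img\]/

# With no capture groups, String#scan returns an array of whole matches.
tags = contents.values.flat_map { |text| text.scan(img_tag) }
puts tags
# [img]http://example.org/first.jpg[/img]
# [img]http://example.org/second.jpg[/img]
# [img caption="Sample caption"]http://example.org/third.jpg[/img]
# [img]http://example.org/fourth.jpg[/img]
# [img]http://example.org/fifth.jpg[/img]
# [img caption="Some other caption"]http://example.org[/img]
```

The lazy `.+?` is what stops each match at the nearest `[/img]` instead of running on to the last one, and an `[img]` with no closing tag after it on the same line is simply not matched.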
If you want to ignore the tags and only grab the content, change the regexp to:
Running again with that change returns:
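One way to make that change is to wrap the part between the tags in a capture group. This is a sketch under that assumption, with the same illustrative sample strings and hypothetical variable names:

```ruby
# Sample inputs from the question; `contents`, `img_body` and `urls` are illustrative names.
contents = [
  "Lorem Ipsum\r\n[img]http://example.org/first.jpg[/img]\r\n" \
  "[img]http://example.org/second.jpg[/img]\r\nmore lorem ipsum ...",
  "Lorem Ipsum\r\n[img caption=\"Sample caption\"]http://example.org/third.jpg[/img]" \
  "[img]http://example.org/fourth.jpg[/img]",
  "Lorem Ipsum [img]http://example.org/fifth.jpg[/img]\r\nmore lorem ipsum\r\n\r\n" \
  "[img caption=\"Some other caption\"]http://example.org[/img]"
]

# Same pattern, but the tag body is now captured.
img_body = /\[img(?:\s+caption=".+")?\](.+?)\[\/img\]/

# With a capture group, scan returns one-element arrays, so flatten them.
urls = contents.flat_map { |text| text.scan(img_body).flatten }
puts urls
# http://example.org/first.jpg
# http://example.org/second.jpg
# http://example.org/third.jpg
# http://example.org/fourth.jpg
# http://example.org/fifth.jpg
# http://example.org
```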
(Rubular proof)
If you need to look for different tags, you can generate an “OR” list easily:
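As a rough sketch of such an “OR” list, the tag names can be joined with `|` and interpolated into the pattern. The `video` tag and the variable names below are hypothetical, chosen only for illustration:

```ruby
# Hypothetical list of tag names to look for.
tag_names = %w[img video]

# Build the alternation "img|video" and interpolate it into the pattern.
names = tag_names.join("|")
tag_regex = /\[(?:#{names})(?:\s+caption=".+")?\].+?\[\/(?:#{names})\]/

text = "a [img]one.jpg[/img] b [video]two.mp4[/video] c"
p text.scan(tag_regex)
# => ["[img]one.jpg[/img]", "[video]two.mp4[/video]"]
```

Note that this simple form does not force the closing tag name to be the same as the opening one.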
If you need to make sure that “magic” characters are escaped beforehand:
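One option, shown here as a sketch and not necessarily the approach originally intended, is `Regexp.union`, which escapes each string before joining them with `|` (`Regexp.escape` does the same for a single string). The `c++` tag name is a made-up example containing regex metacharacters:

```ruby
# Hypothetical tag names; "c++" contains characters that are special in a regex.
tag_names = ["img", "c++"]

# Regexp.union escapes each string and joins them with "|".
names = Regexp.union(tag_names)
p names.source
# => "img|c\\+\\+"

# Interpolating a Regexp embeds it as its own group, so no extra (?:...) is needed.
tag_regex = /\[#{names}(?:\s+caption=".+")?\].+?\[\/#{names}\]/

text = "x [c++]code[/c++] y [img]z.png[/img]"
p text.scan(tag_regex)
# => ["[c++]code[/c++]", "[img]z.png[/img]"]
```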