I have the following text:
<span term="db6ff2ffe2df7b8cfc0d9542bdce27dc" class="yellowback">Lorem</span> <span term="e78f5438b48b39bcbdea61b73679449d" class="yellowback">ipsum</span> dolor sit amet, consectetur adipiscing elit.
Ut ut mattis sapien. Suspendisse at felis nisl. Vestibulum nec risus leo, in consectetur dolor. Duis suscipit arcu quis nibh dapibus gravida. Ut vel rhoncus neque. Sed et dolor quis est sollicitudin vulputate. Nam vehicula, tortor at consectetur laoreet, nulla erat ultrices dui, vehicula varius odio sem sed ligula.
Vivamus porttitor odio sed ligula cursus non placerat dolor posuere.
Pellentesque vitae metus vel dolor lobortis feugiat. Nunc faucibus commodo viverra. Aliquam porta nisl eu turpis vulputate id laoreet odio lobortis. Proin sit amet neque nibh, eget tincidunt est. Etiam accumsan erat at mauris lacinia porta.
Suspendisse auctor, quam sit amet congue consequat, dolor orci placerat diam, sed ultricies diam ipsum nec tortor. Vestibulum egestas ipsum ut leo fermentum imperdiet. Mauris varius iaculis magna, id luctus risus vestibulum vel.
I would like to split it into words, but if you look closely, some words are wrapped in tags. What I want is this: if a word is inside a tag, the whole element (opening tag, word, and closing tag) should be treated as one word. Right now I have the following regex to accomplish this:
(<span.+>|\w+|<\/span>)
This works, but if there are two adjacent tags, it captures them both and treats them as a single word, which is not what I want.
I am not fond of using regex for this, but it seems the most appropriate solution given that it has to be in JavaScript and there is no way I can use a third-party library. I am open, however, to a different approach using some sort of algorithm; if not, regex is just fine.
A satisfactory result would be the following:
["<span term=\"db6ff2ffe2df7b8cfc0d9542bdce27dc\" class=\"yellow\">Lorem</span>", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "is", "simply", "dummy", "text", "of", "the", "printing", "and", "typesetting", "industry", ".
", "Lorem", "Ipsum", "has", "been", "the", "industry", " ' ", "s", "standard", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "text", "ever", "since", "the", "1500s", ",
", "when", "an", "unknown", "printer", "took", "a", "galley", "of", "type", "and", "scrambled", "it", "to", "make", "a", "type", "specimen", "book", ".
", "It", "has", "survived", "not", "only", "five", "centuries", ", ", "but", "also", "the", "leap", "into", "electronic", "typesetting", ",
", "remaining", "essentially", "unchanged", ". ", "It", "was", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "in", "the", "1960s", "with", "the", "release", "of", "Letraset", "sheets", "containing", "Lorem", "Ipsum", "passages", ", ", "and", "more", "recently", "with", "desktop", "publishing", "software", "like", "Aldus", "PageMaker", "including", "versions", "of", "Lorem", "Ipsum", ".
"]
A bad result would be the following:
["<span term=\"db6ff2ffe2df7b8cfc0d9542bdce27dc\" class=\"yellow\">Lorem</span> <span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "is", "simply", "dummy", "text", "of", "the", "printing", "and", "typesetting", "industry", ".
", "Lorem", "Ipsum", "has", "been", "the", "industry", " ' ", "s", "standard", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "text", "ever", "since", "the", "1500s", ",
", "when", "an", "unknown", "printer", "took", "a", "galley", "of", "type", "and", "scrambled", "it", "to", "make", "a", "type", "specimen", "book", ".
", "It", "has", "survived", "not", "only", "five", "centuries", ", ", "but", "also", "the", "leap", "into", "electronic", "typesetting", ",
", "remaining", "essentially", "unchanged", ". ", "It", "was", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "in", "the", "1960s", "with", "the", "release", "of", "Letraset", "sheets", "containing", "Lorem", "Ipsum", "passages", ", ", "and", "more", "recently", "with", "desktop", "publishing", "software", "like", "Aldus", "PageMaker", "including", "versions", "of", "Lorem", "Ipsum", ".
"]
Notice how the two spans form one array element in the second example, while in the first one they are two separate elements.
How about:
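One possible sketch, not a definitive implementation. The original pattern merges adjacent spans because `.+` is greedy and runs to the last `>` on the line. The version below makes the span alternative match exactly one element: `[^>]*` stays inside the opening tag, and the lazy `[\s\S]*?` stops at the first `</span>`. It assumes spans are never nested, that whitespace between tokens can be dropped, and that each punctuation character may become its own token (the desired output in the question keeps some whitespace with the punctuation, which this sketch does not reproduce). The function name `splitIntoWords` is made up for illustration. Note that `\w` in JavaScript only covers ASCII letters, digits, and underscore.

```javascript
// Split text into tokens, keeping each <span>...</span> element whole.
// Assumptions: spans are not nested; whitespace is discarded.
function splitIntoWords(html) {
  // 1st alternative: one complete span element (lazy up to the first </span>)
  // 2nd alternative: a plain word
  // 3rd alternative: a single non-space, non-word character (punctuation)
  const tokenPattern = /<span\b[^>]*>[\s\S]*?<\/span>|\w+|[^\s\w]/g;
  // match() returns null when nothing matches, so fall back to an empty array
  return html.match(tokenPattern) || [];
}

// Usage: two adjacent spans come back as two separate elements.
const sample =
  '<span term="a" class="yellowback">Lorem</span> ' +
  '<span term="b" class="yellowback">ipsum</span> dolor sit amet.';
console.log(splitIntoWords(sample));
```

If nested spans or other tags can appear in the input, a regex like this is no longer reliable, and walking the DOM (for example via a temporary element's child nodes) would be the safer route.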