I have the following text: Lorem ipsum dolor sit amet,

Asked: May 31, 2026

I have the following text:

    <span term="db6ff2ffe2df7b8cfc0d9542bdce27dc" class="yellowback">Lorem</span> <span term="e78f5438b48b39bcbdea61b73679449d" class="yellowback">ipsum</span> dolor sit amet,   consectetur adipiscing elit.
Ut ut mattis sapien.   Suspendisse at felis nisl.   Vestibulum nec risus leo,   in consectetur dolor.   Duis suscipit arcu quis nibh dapibus gravida.   Ut vel rhoncus neque.   Sed et dolor quis est sollicitudin vulputate.   Nam vehicula,   tortor at consectetur laoreet,   nulla erat ultrices dui,   vehicula varius odio sem sed ligula.
Vivamus porttitor odio sed ligula cursus non placerat dolor posuere.
Pellentesque vitae metus vel dolor lobortis feugiat.   Nunc faucibus commodo viverra.   Aliquam porta nisl eu turpis vulputate id laoreet odio lobortis.   Proin sit amet neque nibh,   eget tincidunt est.   Etiam accumsan erat at mauris lacinia porta.
Suspendisse auctor,   quam sit amet congue consequat,   dolor orci placerat diam,   sed ultricies diam ipsum nec tortor.   Vestibulum egestas ipsum ut leo fermentum imperdiet.   Mauris varius iaculis magna,   id luctus risus vestibulum vel.

I would like to split it into words, but if you look closely, some words are wrapped in tags. What I want is this: if a word is inside a tag, the whole tag (opening tag, word, and closing tag) should be treated as a single word. Right now I have the following regex to accomplish this:

(<span.+>|\w+|<\/span>)

This works, but if there are 2 adjacent tags it captures them both and treats them as one word, which is not what I want.
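The behaviour can be reproduced with a minimal sketch like the one below. The input is shortened, and the `a1`/`b2` attribute values and the variable names are placeholders chosen for illustration. The `g` flag is added only so that `match()` returns every match.

```javascript
// Two adjacent spans, shortened from the sample text (placeholder term values).
const sample = '<span term="a1" class="yellowback">Lorem</span> <span term="b2" class="yellowback">ipsum</span> dolor';

// The pattern from above; ".+" is greedy, so it runs to the last ">" on the line.
const greedyPattern = /(<span.+>|\w+|<\/span>)/g;

const greedyMatches = sample.match(greedyPattern);

// Both spans come back as a single element, followed by "dolor".
console.log(greedyMatches);
```

Because `.+` consumes as much as it can before backtracking to a `>`, the first match stretches from the first opening tag to the last closing tag on the line.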

I am not fond of using regex for this, but it seems the most appropriate solution given that it has to be in JavaScript and there is no way I can use a 3rd-party library. I am open, however, to a different approach using some sort of algorithm; if not, regex is just fine.

A satisfactory result would be the following:

["<span term=\"db6ff2ffe2df7b8cfc0d9542bdce27dc\" class=\"yellow\">Lorem</span>", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "is", "simply", "dummy", "text", "of", "the", "printing", "and", "typesetting", "industry", ".
     ", "Lorem", "Ipsum", "has", "been", "the", "industry", " ' ", "s", "standard", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "text", "ever", "since", "the", "1500s", ",
     ", "when", "an", "unknown", "printer", "took", "a", "galley", "of", "type", "and", "scrambled", "it", "to", "make", "a", "type", "specimen", "book", ".
     ", "It", "has", "survived", "not", "only", "five", "centuries", ",  ", "but", "also", "the", "leap", "into", "electronic", "typesetting", ",
     ", "remaining", "essentially", "unchanged", ".  ", "It", "was", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "in", "the", "1960s", "with", "the", "release", "of", "Letraset", "sheets", "containing", "Lorem", "Ipsum", "passages", ",  ", "and", "more", "recently", "with", "desktop", "publishing", "software", "like", "Aldus", "PageMaker", "including", "versions", "of", "Lorem", "Ipsum", ".
"]

An unsatisfactory result would be the following:

["<span term=\"db6ff2ffe2df7b8cfc0d9542bdce27dc\" class=\"yellow\">Lorem</span> <span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "is", "simply", "dummy", "text", "of", "the", "printing", "and", "typesetting", "industry", ".
         ", "Lorem", "Ipsum", "has", "been", "the", "industry", " ' ", "s", "standard", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "text", "ever", "since", "the", "1500s", ",
         ", "when", "an", "unknown", "printer", "took", "a", "galley", "of", "type", "and", "scrambled", "it", "to", "make", "a", "type", "specimen", "book", ".
         ", "It", "has", "survived", "not", "only", "five", "centuries", ",  ", "but", "also", "the", "leap", "into", "electronic", "typesetting", ",
             ", "remaining", "essentially", "unchanged", ".  ", "It", "was", "<span term=\"e78f5438b48b39bcbdea61b73679449d\" class=\"yellow\">Ipsum</span>", "in", "the", "1960s", "with", "the", "release", "of", "Letraset", "sheets", "containing", "Lorem", "Ipsum", "passages", ",  ", "and", "more", "recently", "with", "desktop", "publishing", "software", "like", "Aldus", "PageMaker", "including", "versions", "of", "Lorem", "Ipsum", ".
    "]

Notice how the 2 spans form 1 array element in the 2nd example, while in the first one they are 2 separate elements.

1 Answer

Editorial Team · Answer 1 · 2026-05-31T01:39:11+00:00

How about:

str.split(/(<span[^>]*>[^<]+<\/span>|\w+)/)
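A rough sketch of how this might be used follows. The input is shortened, and the `a1`/`b2` attribute values, the variable names, and the filtering step are assumptions for illustration, not part of the suggestion above. `split()` with a capturing group also returns the text between matches, so the sketch drops empty and whitespace-only pieces afterwards.

```javascript
// Sample input, shortened from the question (placeholder term values).
const str = '<span term="a1" class="yellowback">Lorem</span> <span term="b2" class="yellowback">ipsum</span> dolor sit amet, consectetur';

// Same pattern as above: "[^>]*" and "[^<]+" cannot run past a tag boundary,
// so each span is matched on its own. The capturing group makes split()
// include the matched text in the result.
const tokenPattern = /(<span[^>]*>[^<]+<\/span>|\w+)/;

// The raw result also contains the separators between matches and an empty
// string wherever a match sits at the very start or end of the input.
const parts = str.split(tokenPattern);

// Assumed cleanup step: keep punctuation, drop empty and whitespace-only pieces.
const tokens = parts.filter((part) => part.trim() !== '');

console.log(tokens);
```

With this input the two spans end up as separate elements, followed by `dolor`, `sit`, `amet`, the `, ` separator, and `consectetur`. Note that the pattern assumes a span contains only text: a span with nested tags, or an attribute value containing `>`, would not be matched as a unit.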
