I have a very large HTML document. Parsing it into a DOM tree would take too long, so that option, despite being the “proper” one, is not available. I need to remove all inline style declarations, that is, the style attributes inside tags.
There is a regular expression that seems to work in most cases:
> re
/\sstyle\s*=(\"[^\">]*\"*|\'[^\'>]*\'*|[^\s>]*)/gi
> test
[ '<img src="some.jpg" style="width:auto" width="50" height="60">',
'<img style=\'width:auto\'>',
'<img style=\'width:auto>',
'<img style=width:auto>',
'<div style=\'\'>',
'<div style=\'background-image:url(\'paper.gif\');\'',
'<div style=\'background-image:url(\\\'paper.gif\\\');\'' ]
> test.forEach(function(t){console.log(t.replace(re,''))})
<img src="some.jpg" width="50" height="60">
<img>
<img>
<img>
<div>
<divpaper.gif');'
<divpaper.gif\');'
As you can see, when the value contains further quotes of the same kind, with or without proper escaping, the regular expression doesn’t work. Any ideas how I can improve it?
Why would you want to write one big regular expression to do all of that at once?
Parsing it into a DOM tree might take too much time, but a hand-crafted parser will probably serve you better than a single regular expression.
You can also mix the two: use a regular expression to isolate each and every tag (which is easy), then parse the attributes inside the tag, isolating (and removing) any style attribute you encounter.
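A rough sketch of that mixed approach, to be adapted rather than taken as a finished implementation. The names TAG, stripStyleFromTag and stripStyleAttributes are made up for this example. It assumes every tag is closed with ">" and that quotes are balanced; a tag with an unclosed quote will not be matched cleanly. Also note that HTML has no backslash escaping inside attribute values, so a single-quoted value ends at the next single quote, which means the last two test strings from the question are cut the same way a browser would cut them.

```javascript
// Step 1: a regular expression that isolates one opening tag.
// Quoted values are consumed whole, so a '>' inside quotes does not end the tag.
const TAG = /<[a-zA-Z][^\s>\/]*(?:"[^"]*"|'[^']*'|[^'">])*>/g;

// Step 2: walk the attributes of a single tag by hand and drop "style".
function stripStyleFromTag(tag) {
  const n = tag.length;
  let out = '';
  let i = 0;

  // Copy '<' and the tag name unchanged.
  while (i < n && !/[\s\/>]/.test(tag[i])) out += tag[i++];

  while (i < n) {
    const start = i; // includes the whitespace in front of the attribute
    while (i < n && /\s/.test(tag[i])) i++;

    // Attribute name runs up to whitespace, '=', '/' or '>'.
    const nameStart = i;
    while (i < n && !/[\s=\/>]/.test(tag[i])) i++;
    const name = tag.slice(nameStart, i);

    if (name === '') {
      // We are on '/' or '>' (or at the end): copy it and move on.
      out += tag.slice(start, i + 1);
      i++;
      continue;
    }

    // Optional "= value" part: double-quoted, single-quoted or unquoted.
    let j = i;
    while (j < n && /\s/.test(tag[j])) j++;
    if (tag[j] === '=') {
      j++;
      while (j < n && /\s/.test(tag[j])) j++;
      const q = tag[j];
      if (q === '"' || q === "'") {
        const close = tag.indexOf(q, j + 1);
        // No closing quote: give up on this value but keep the final '>'.
        j = close === -1 ? n - 1 : close + 1;
      } else {
        while (j < n && !/[\s>]/.test(tag[j])) j++;
      }
      i = j;
    }

    // Keep every attribute except style (HTML attribute names are case-insensitive).
    if (name.toLowerCase() !== 'style') out += tag.slice(start, i);
  }
  return out;
}

function stripStyleAttributes(html) {
  return html.replace(TAG, stripStyleFromTag);
}
```

Called as stripStyleAttributes(html), this leaves text, comments and closing tags alone, because TAG only matches "<" followed by a letter. It does not know about script blocks, so something like a<b && c>d inside inline JavaScript could be mistaken for a tag. If that matters for your input, skip those regions first.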