Let’s say you have an HTML file with a couple of duplicate scripts, meaning multiple external script tags for the same resource, like loading jQuery 3 times on the page. Is there an efficient regular expression that can remove the duplicates but keep the first one in place? The duplicates will all have exactly the same src value.
The language is PHP, and here is an example:
Before:
<script src="js/jquery.js" type="text/javascript"></script>
some content
<script src="js/jquery.js" type="text/javascript"></script>
more content
<script src="js/jquery.js" type="text/javascript"></script>
After:
<script src="js/jquery.js" type="text/javascript"></script>
some content
more content
Disclaimer:
Many will rightfully state that using regular expressions to parse non-regular languages such as HTML is fraught with peril, and they are correct. The only way to reliably parse these languages is with a parser specifically designed for the task. A regex solution will typically fail on many special cases of subject text, producing both false positives and missed matches.
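For comparison, a parser-based version can be sketched with PHP's bundled DOM extension. This is a minimal sketch under assumptions, not a drop-in replacement: the function name removeDuplicateScriptsDom is made up for illustration, and DOMDocument::loadHTML normalizes its input (it wraps fragments in html/body elements and adds a doctype), so the result is not the original markup byte for byte.

```php
<?php
// Sketch only: remove <script> elements whose src repeats an earlier one,
// using the DOM extension instead of a regex.
// The function name is hypothetical.
function removeDuplicateScriptsDom(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them;
    // real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $seen = [];
    // Copy the live node list into an array so that removing nodes
    // does not disturb the iteration.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        if (!$script->hasAttribute('src')) {
            continue; // inline scripts are left alone
        }
        $src = $script->getAttribute('src');
        if (isset($seen[$src])) {
            // Same src seen before: drop this element.
            $script->parentNode->removeChild($script);
        } else {
            $seen[$src] = true;
        }
    }

    return (string) $doc->saveHTML();
}
```

Because the parser deals with attribute order, quoting, and odd characters inside attribute values, none of those cases need special handling here. The price is the normalized output mentioned above.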
That said…
If one insists upon using regular expressions to process HTML/XML markup, and is aware of the inherent limitations, there are ways to craft a regex solution that minimizes these pitfalls and does a “pretty good” job (depending on the specific requirements of the question). However, to correctly handle the rare (but valid and possible) edge cases, such as HTML tag attributes containing <> angle brackets, the correct regex can frequently be rather complex and not for the faint of heart.
Understanding the following regex solution requires a fairly deep understanding of regex syntax and the underlying mechanics of the regex engine. There are certainly examples of markup that will cause it to fail, but the following solution should do a pretty good job for many cases of typical markup.
Here is a tested PHP function that removes SCRIPT elements having duplicate SRC attribute values:
The function above uses one regex which is applied recursively until no matches are found. Although at first glance the regex looks like a monster, it's actually quite straightforward (if you are well versed in regex syntax), and most of the text consists of descriptive comments. The complexity of this regex is required to handle the variety of attribute/value formats allowed by HTML. For example, the SCRIPT tags may have any number of attributes before and after the SRC attribute. The SRC attribute value may be single or double quoted. All other attributes may have values that are either quoted or unquoted, or may have no value at all. Quoted attributes may contain <> angle brackets.
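As a rough illustration of the approach described here (and not the tested function itself), the following simplified sketch applies a single commented regex in a loop until it stops matching. The function name removeDuplicateScripts, the attribute sub-pattern, and the exact shape of the regex are all assumptions made for this example.

```php
<?php
// Simplified sketch of the technique, NOT the original function.
// The function name is hypothetical.
function removeDuplicateScripts(string $html): string
{
    // One attribute: a name, optionally followed by a double-quoted,
    // single-quoted, or unquoted value. Quoted values may contain < and >.
    $attr = '\s+ [^\s"\'<>/=]+ (?: \s*=\s* (?: "[^"]*" | \'[^\']*\' | [^\s"\'=<>]+ ) )?';

    $re = '%
        (                                   # $1: everything that is kept
          <(?i:script)                      # opening tag of the first occurrence
          (?:' . $attr . ')*?               # any attributes before SRC
          \s+ (?i:src) \s*=\s*
          (?| "([^"]*)" | \'([^\']*)\' )    # $2: quoted SRC value
          .*?                               # rest of the tag and text up to the duplicate
        )
        <(?i:script)                        # opening tag of the duplicate
        (?:' . $attr . ')*?                 # any attributes before SRC
        \s+ (?i:src) \s*=\s*
        (?: "\2" | \'\2\' )                 # same SRC value, either quote style
        (?:' . $attr . ')*                  # any attributes after SRC
        \s* > .*? </(?i:script) \s* >       # end of tag, contents, closing tag
        %sx';

    // Each pass removes at most one duplicate per first occurrence,
    // so repeat until a pass makes no replacement.
    do {
        $result = preg_replace($re, '$1', $html, -1, $count);
        if ($result === null) {
            // PCRE error (for example, backtrack limit hit): give up
            // and return what we have so far.
            return $html;
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}
```

The sketch only removes the duplicate element itself, so any line break that followed it stays behind as an empty line. On large documents the lazy `.*?` scans can be slow or hit `pcre.backtrack_limit`, in which case the function returns the text as processed up to that point.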