I have a string with some HTML code in it, for example:
This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>
I need to strip the id attribute from every HTML tag, but I have zero experience with regular expressions, so I searched around the internet and wrote this pattern: [\s]+id=\".*\"
Unfortunately it’s not working as I expected. In fact, I was hoping that the regular expression would match id=" followed by any characters, repeated any number of times, and terminated by the nearest double quote. In this example I was expecting it to match id="c1-id-8" and id="c1-id-9".
But instead the pattern returned the substring id="c1-id-8">some</strong> <em id="c1-id-9": it finds the first occurrence of id=" and the last occurrence of a double quote character.
Could you tell me what is wrong in my pattern and how to fix it, please?
Thank you very much
The .* in your regex is greedy, meaning it matches as much as it can, so the match extends to the last double quote it can reach rather than the nearest one. To match only the minimum required, you could use something like /\s+id=\"[^\"]*\"/. The brackets [] indicate a character class, which matches any single character listed inside the brackets. The caret ^ at the beginning of the character class is a negation, meaning the class matches any character except those specified in the brackets. Here, [^\"]* matches a run of characters that are not double quotes, so it stops at the first closing quote.
An alternative would be to make the .* quantifier lazy by changing it to .*?, which will match as little as it can.
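To see the difference between the three patterns, here is a minimal sketch in Python. The choice of Python and its re module is an assumption, since the question does not say which language or regex engine is in use, and the variable names are only illustrative. The backslash before each quote in the original pattern is not needed inside a single-quoted raw string, so it is dropped here.

```python
import re

# Sample string from the question.
html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>'

# Greedy: .* runs as far as it can, then backs up to the LAST double quote.
greedy = re.findall(r'\s+id=".*"', html)

# Negated character class: [^"]* cannot cross a double quote,
# so each match ends at the nearest closing quote.
negated = re.findall(r'\s+id="[^"]*"', html)

# Lazy quantifier: .*? takes as few characters as possible.
lazy = re.findall(r'\s+id=".*?"', html)

# Removing the attribute: replace every match with an empty string.
stripped = re.sub(r'\s+id="[^"]*"', '', html)

print(greedy)    # one long match spanning both tags
print(negated)   # two short matches, one per id attribute
print(lazy)      # same two matches as the negated class
print(stripped)  # This is <strong>some</strong> <em>text</em>
```

For this input the negated class and the lazy quantifier give the same result. The negated class is often the safer habit, because it can never run past a quote, whereas the lazy form only stops early as long as the rest of the pattern can match at that point.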