I know that I should use HtmlAgilityPack, but in my case that is not an option … sad but true.
We have the following regex: <a(.+?)(href=["|'](.+?)["|'])(.+?)>(.+?)</a>
and the following sample input:
<A href="
http://dummy.domain/dummy.html
" target="_blank"><b><font face="Arial" color="#0000FF" size="2">
Dummy text
</font></b></a>
If I remove the line breaks inside the groups, everything works fine. I’m running this in C# on .NET with the IgnoreCase option.
Does . not match \r\n sequences?
I’m guessing you placed the pipe symbol to signify “OR” in the character class. If that is the case, remove the pipes: [] already implies an “OR” of its members, so ["|'] also matches a literal |.
Also, remember that \n can appear anywhere within the HTML, and “.” won’t match it (it will match \r). To match newlines, you’ll need to either use the Singleline option or replace the plain . with an alternative such as (.|\n) or [\s\S]. Singleline mode can also be specified inline by starting the pattern with (?s).
Note also that [^>]* is a little simpler than a non-greedy match for skipping the remaining attributes of a tag, and because it is a negated class it matches newlines as well.
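A minimal sketch of how these suggestions might fit together in C#, run against the sample input from the question. The class name LinkDemo, the method FindLink, the named groups url and text, and the \s* trimming around the URL are assumptions made for illustration; they are not part of the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkDemo
{
    // Hypothetical pattern for illustration:
    //   (?s)    inline Singleline mode, so "." also matches \n
    //   [^>]*   skips the other attributes (and any newlines between them)
    //   ["']    quote class without the pipe
    //   \s*     assumption: trims line breaks around the URL inside the quotes
    private const string Pattern =
        @"(?s)<a\b[^>]*?href=[""']\s*(?<url>.+?)\s*[""'][^>]*>(?<text>.+?)</a>";

    public static Match FindLink(string html)
    {
        return Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // The sample input from the question, with its line breaks kept.
        string html =
            "<A href=\"\nhttp://dummy.domain/dummy.html\n\" target=\"_blank\">" +
            "<b><font face=\"Arial\" color=\"#0000FF\" size=\"2\">\nDummy text\n</font></b></a>";

        Match m = FindLink(html);
        Console.WriteLine(m.Success);             // True
        Console.WriteLine(m.Groups["url"].Value); // http://dummy.domain/dummy.html
    }
}
```

Passing RegexOptions.Singleline alongside RegexOptions.IgnoreCase instead of the inline (?s) should behave the same way; the inline form only keeps the option next to the pattern it affects.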