I need to strip all xml tags from an xml document, but keep the space the tags occupy, so that the textual content stays at the same offsets as in the xml. This needs to be done in Java, and I thought RegExp would be the way to go, but I have found no simple way to get the length of the tags that match my regular expression.
Basically what I want is this:
Pattern p = Pattern.compile("<[^>]+>[^<]*</[^>]+>");
Matcher m = p.matcher(stringWithXMLContent);
String strippedContent = m.replaceAll("THIS IS A STRING OF WHITESPACES IN THE LENGTH OF THE MATCHED TAG");
Hope somebody can help me to do this in a simple way!
In the spirit of You Can’t Parse XML With Regexp, you do know that’s not an adequate pattern for arbitrary XML, right? (It’s perfectly valid to have a > character in an attribute value, for example, not to mention other non-tag constructs.)
Instead of using replaceAll, repeatedly call find on the Matcher. For each match you can then read start() and end() to get the indexes to replace, or use the appendReplacement method with a StringBuffer, followed by appendTail once find returns false.
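A minimal sketch of the appendReplacement approach could look like the following. The class name TagBlanker, the method blankTags and the simplified pattern <[^>]+> are assumptions made for illustration, and a space-filled char array stands in for a StringUtils padding helper so the snippet has no dependencies. It inherits all the caveats about matching XML with a regex mentioned above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBlanker {
    // Simplified tag pattern, assumed for illustration only. It does not cope
    // with '>' inside attribute values, comments, CDATA sections and so on.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns xml with every TAG match replaced by the same number of spaces. */
    static String blankTags(String xml) {
        Matcher m = TAG.matcher(xml);
        StringBuffer sb = new StringBuffer(xml.length());
        while (m.find()) {
            // Build a run of spaces exactly as long as the matched tag.
            char[] pad = new char[m.end() - m.start()];
            Arrays.fill(pad, ' ');
            // Spaces contain no '$' or '\', so the replacement needs no quoting.
            m.appendReplacement(sb, new String(pad));
        }
        // Copy whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }
}
```

For instance, blankTags("<b>Hi</b>") should give three spaces, then Hi, then four spaces, so Hi stays at index 3 and the overall length is unchanged.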
(StringUtils comes from Apache Commons. For more background and library-free alternatives see this question.)