I have the following Perl script, which is meant to indent an XML file correctly:
@files = glob "*.xml";
undef $/;
for $file (@files) {
    $indent = 0;
    open FILE, $file or die "Couldn't open $file for reading: $!";
    $_ = readline *FILE;
    close FILE or die "Couldn't close $file: $!";
    # Remove whitespace between > and < if that is the only thing separating them
    s/(?<=>)\s+(?=<)//g;
    # Indent
    s{  # Capture a tag <$1$2$3>,
        # a potential closing slash $1
        # the contents $2
        # a potential closing slash $3
        <(/?)([^/>]+)(/?)>
        # Optional white space
        \s*
        # Optional tag.
        # $4 contains either undef, "<" or "</"
        (?=(</?))?
    }
    {
        # Adjust the indentation level.
        # $3: A <foo/> tag. No alteration to indentation.
        # $1: A closing </foo> tag. Drop one indentation level
        # else: An opening <foo> tag. Increase one indentation level
        $indent +=
            $3 ? 0 :
            $1 ? -1 :
            1;
        # Put the captured tag back into place
        "<$1$2$3>" .
        # Two closing tags in a row. Add a newline and indent the next line
        ($1 and defined($4) and ($4 eq "</") ? "\n" . (" " x $indent) :
         $4 ? "\n" . (" " x $indent) :
         ""
        )
        # /g repeat as necessary
        # /e Execute the block of perl code to create replacement text
        # /x Allow whitespace and comments in the regex
    }gex;
    open FILE, ">", $file or die "Couldn't open $file for writing: $!";
    print FILE or die "Couldn't write to $file: $!";
    close FILE or die "Couldn't close $file: $!";
}
First, it’s indenting with tabs, and I wanted two spaces. Also, it puts tags that belong at the same indentation level on the same line, instead of on the next line with the same indent:
<?xml version="1.0" encoding="iso-8859-1"?><!DOCTYPE kit SYSTEM "tc.dtd"><kit><contact/><description>
where it is supposed to be:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE kit SYSTEM "tc.dtd">
<kit>
  <contact/>
  <description>
…
I know there are Perl tools to indent XML, such as XML-Tidy, but because of the tc.dtd reference I always get an error complaining about unresolvable dependencies on the tc.dtd file. I only care about the indentation (formatting) of the file, not the dependencies themselves.
What’s wrong with my Perl regex?
You can use the xmllint tool, which doesn’t necessarily validate the document. Example:
Input (badly formatted):
Run
xmllint --format file.xml
and you get:
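To cover every *.xml file in a directory, as the Perl script in the question does, the same command could be wrapped in a small shell loop. The following is only a sketch, assuming xmllint (from libxml2) is installed; the function name indent_xml_files is a hypothetical helper, not part of xmllint. It sets the XMLLINT_INDENT environment variable, which xmllint reads to choose the indentation string, to two spaces, and writes through a temporary file so that a file is only overwritten if formatting succeeded.

```shell
#!/bin/sh
# Sketch: re-indent every *.xml file in the current directory in place.
# Assumption: xmllint (libxml2) is on PATH.
# indent_xml_files is a hypothetical helper name chosen for this example.
indent_xml_files() {
    for file in *.xml; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$file" ] || continue

        tmp=$(mktemp) || return 1

        # XMLLINT_INDENT holds the string used for each indentation level
        # (two spaces here). --format prints the result to standard output.
        if XMLLINT_INDENT='  ' xmllint --format "$file" > "$tmp"; then
            # Copy the contents back so the original file keeps its permissions.
            cat "$tmp" > "$file"
        else
            echo "Skipped $file: xmllint reported an error" >&2
        fi
        rm -f "$tmp"
    done
}
```

Calling indent_xml_files from the directory that holds the files rewrites each one that xmllint can parse and leaves the others untouched. Going through a temporary file is a deliberate choice: redirecting the output straight onto the input file would truncate it before xmllint has read it.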