I have a set of HTML files with invalid syntax in tag attributes: unescaped double quotes inside double-quoted attribute values. For example,
<a name="Conductor, "neutral""></a>
or
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
or
<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a>
I’m trying to process the files with Perl’s XML::Twig module using parsefile_html($file_name). When it reads a file that has this syntax, it gives this error:
x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893
What I need is either a way to make the module accept the bad syntax and deal with it, or a regular expression to find and replace double quotes in attributes with single quotes.
Given your HTML sample, the code below works:
Output:
My one concern is that variable-length look-behind is not implemented in Perl, so the pattern will fail to match if there is whitespace before or after the equals sign. The pages were most likely generated consistently, though, so this is unlikely to be a problem in practice.
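If whitespace around the equals sign does turn out to be a problem, one way to sidestep look-behind is to capture the attribute name and the `=` as part of the match instead. The following is a minimal sketch, not the code from this answer. The names `fix_tag` and `fix_attr_quotes` are hypothetical, and it assumes that no attribute value contains a literal `<` or `>` and that a value ends at a double quote followed by either the end of the tag or the next quoted `name=` pair.

```perl
use strict;
use warnings;

# One double-quoted attribute. The value is assumed to end at a '"' that is
# followed by the end of the tag or by whitespace and another name= pair.
my $ATTR = qr{
    ( [\w:-]+ \s* = \s* )     # capture 1: name and equals sign, spaces allowed
    "
    ( .*? )                   # capture 2: value, possibly with stray quotes
    "
    (?= \s+ [\w:-]+ \s* = \s* ["'] | \s* /? > )
}xs;

# Replace stray double quotes inside each attribute value of a single tag.
sub fix_tag {
    my ($tag) = @_;
    $tag =~ s{$ATTR}{
        my ($name, $value) = ($1, $2);
        $value =~ tr/"/'/;
        qq($name"$value");
    }ge;
    return $tag;
}

# Apply fix_tag to every tag; text between tags is left untouched.
sub fix_attr_quotes {
    my ($html) = @_;
    $html =~ s{(<[^<>]+>)}{ fix_tag($1) }ge;
    return $html;
}

print fix_attr_quotes('<a name="Conductor, "neutral""></a>'), "\n";
# prints: <a name="Conductor, 'neutral'"></a>
```

Because only text inside tags is rewritten, the quoted words in the link text of the third sample are left alone; only the `title` attribute changes.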
Of course, try the substitutions on copies of the files first.