I have converted a Word file into an HTML file, but there is a problem: MS Word automatically adds some styling to the pages.
For example:
<div align="center"></div>
<p style=""></p>
<table cellpadding="0">
<tr><img src="...."></img></tr>
</table>
I want the output to be:
<div></div>
<p></p>
<table>
<tr><img src="...."></img></tr>
</table>
I don't want the inline styles on the img tags to be removed.
Thanks in advance.
Update: if it is very hard to keep the img styles in the file, please give me the code without that part. It is very urgent for me and I can't edit 1000 pages manually.
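If you want to remove styles for a known list of tags, I don’t think it’s necessary to use a full-weight HTML parser. A regular expression that uses a lookbehind for the tag name, applied with re.sub, works just fine. Of course, you would use an array of the tags you want to process to generate the lookbehind part of the expression, the (?<= group.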
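The original expression is not shown here, so the following is only a minimal sketch of the lookbehind idea. The tag list `TAGS` and the function name `strip_attributes` are assumptions made for illustration. Because Python's `re` requires each lookbehind to have a fixed width, the sketch builds one lookbehind per tag and joins them with `|` instead of putting all tag names into a single group.

```python
import re

# Hypothetical list of tags whose attributes should be dropped.
# "img" is deliberately left out so its inline styles are kept.
TAGS = ["div", "p", "table"]


def strip_attributes(html, tags=TAGS):
    """Remove everything between the tag name and '>' for the listed tags."""
    # One fixed-width lookbehind per tag, e.g. (?<=<div)|(?<=<p)|(?<=<table)
    lookbehinds = "|".join(f"(?<=<{re.escape(tag)})" for tag in tags)
    # After the tag name: whitespace, then anything up to (not including) '>'.
    pattern = re.compile(rf"(?:{lookbehinds})\s+[^>]*(?=>)", re.IGNORECASE)
    return pattern.sub("", html)


if __name__ == "__main__":
    print(strip_attributes('<div align="center"></div>'))
    print(strip_attributes('<table cellpadding="0">'))
```

This assumes that attribute values never contain a literal `>`, and it also drops the trailing slash of a self-closing tag, so it is probably only suitable for fairly regular generated markup.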
If you have a list of style attributes that you want to remove, it’s even easier. Just generate an expression that matches any of those attributes together with its value, and apply it with re.sub.
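Again, the original expression is missing, so this is a hedged sketch of what an attribute-based version could look like. The attribute list `ATTRIBUTES` and the name `strip_named_attributes` are assumptions. Note that this variant removes the listed attributes from every tag, including img, so `style` would have to be left out of the list if the img styles must survive.

```python
import re

# Hypothetical list of attribute names to remove wherever they occur.
ATTRIBUTES = ["align", "style", "cellpadding"]


def strip_named_attributes(html, attributes=ATTRIBUTES):
    """Remove the listed attributes (name and value) from the markup."""
    names = "|".join(re.escape(name) for name in attributes)
    # Whitespace, attribute name, '=', then a double-quoted, single-quoted,
    # or unquoted value.
    pattern = re.compile(
        rf"""\s+(?:{names})\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        re.IGNORECASE,
    )
    return pattern.sub("", html)


if __name__ == "__main__":
    print(strip_named_attributes('<p style=""></p>'))
    print(strip_named_attributes('<table cellpadding=0 border="1">'))
```

The pattern does not check that it is inside a tag, so text content that happens to look like `style="..."` would be removed as well.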
If you need a dynamic combination thereof, use a parser.
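For the parser route, one option that needs no third-party package is the standard library's `html.parser`. The sketch below is an assumption about how such a clean-up could be structured, not a tested converter for Word output: the class name `AttributeStripper`, the helper `strip_with_parser`, and the `keep_tags` parameter are made up for illustration. It re-emits the document piece by piece, dropping attributes on every start tag except those listed in `keep_tags`.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-serialize HTML, dropping attributes except on tags in keep_tags."""

    def __init__(self, keep_tags=("img",)):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.keep_tags = set(keep_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            self.parts.append(self.get_starttag_text())  # original text
        else:
            self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        if tag in self.keep_tags:
            self.parts.append(self.get_starttag_text())
        else:
            self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_with_parser(html, keep_tags=("img",)):
    parser = AttributeStripper(keep_tags)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

Tag names come back lowercased from the parser, so the output is normalized in that respect. Constructs not handled here, such as processing instructions, would be silently dropped.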
Edited in response to comments by OP:
Unfortunately, lookbehind needs fixed-size expressions, so <.* or similar won’t work. If you don’t have a fixed tag list, it’s probably better to use a preexisting framework.
An ugly way around this would be something like:
But that’s pretty bad style.
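The workaround referred to above is not preserved, so the following is not necessarily the same one. It is a sketch of one common way to sidestep the fixed-width lookbehind limit: capture the tag name in a group and rebuild the tag in a replacement function. The names `TAG_RE`, `strip_all_attributes`, and the `keep` parameter are assumptions.

```python
import re

# Opening tag: '<', a tag name (captured), then anything up to '>'.
# Closing tags and comments do not match because of the leading letter.
TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def strip_all_attributes(html, keep=("img",)):
    """Drop all attributes from every opening tag except those in keep."""

    def replace(match):
        name = match.group(1)
        if name.lower() in keep:
            return match.group(0)  # leave this tag exactly as it was
        return f"<{name}>"

    return TAG_RE.sub(replace, html)


if __name__ == "__main__":
    sample = '<table cellpadding="0"><tr><img src="x.png" style="a"></img></tr></table>'
    print(strip_all_attributes(sample))
```

This strips every attribute, including useful ones such as `href`, from any tag not listed in `keep`, and like the other regex variants it assumes no `>` appears inside attribute values.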