I’m trying to write a Python script that modifies the contents of the <script> tags in files I’m parsing. I’m using lxml.html (as opposed to BeautifulSoup, etc.) for its speed. The contents of each script tag are wrapped in comment markers (<!-- and -->):
<script>
<!--
...
-->
</script>
The problem is that when I try something like scriptNode.text = '<!-- ...', lxml converts the angle brackets to their HTML entities (&lt; and &gt;) when I write the HTML back to the file. I tried escaping them in the string ('\< ...'), but that doesn’t help.
Judging by most modern websites, those comment markers are no longer needed. I can remove them, but many of the scripts also contain HTML of their own, and if that gets converted to entities as well, that’s a problem.
I’m surprised that lxml is modifying this data at all; last I heard, HTML parsers are designed to avoid modifying or interpreting data within <script> tags.
Is there a setting/command I can use to prevent this from happening?
Thanks
Put them in a CDATA section.
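A minimal sketch of what that could look like with lxml, assuming the file is written out with etree.tostring (which defaults to XML serialization and is the usual source of the &lt; / &gt; escaping). The sample markup and script body are made up for illustration. The last step also tries lxml.html.tostring, which uses the HTML serializer and may make the CDATA wrapper unnecessary if plain HTML output is acceptable.

```python
import lxml.html
from lxml import etree

# Hypothetical input; real files and script bodies will differ.
doc = lxml.html.fromstring("<div><script></script></div>")
script = doc.find(".//script")

body = "\n<!--\nif (a < b) { document.write('<b>hi</b>'); }\n-->\n"

# 1. Plain text + XML serialization (etree.tostring's default method):
#    the angle brackets in the script body are turned into entities.
script.text = body
escaped = etree.tostring(doc, encoding="unicode")

# 2. Same serializer, but with the text wrapped in a CDATA section:
#    the body is written verbatim inside <![CDATA[ ... ]]>.
script.text = etree.CDATA(body)
cdata = etree.tostring(doc, encoding="unicode")

# 3. Plain text again, serialized with the HTML method instead:
#    script contents are written out without entity escaping.
script.text = body
as_html = lxml.html.tostring(doc, encoding="unicode")

for label, out in (("xml", escaped), ("cdata", cdata), ("html", as_html)):
    print(label, "->", out)
```

Note that a literal <![CDATA[ ... ]]> wrapper is only meaningful to XML/XHTML consumers; a browser parsing the result as plain HTML treats it as part of the script text, so which variant fits depends on how the files are served.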