This small program:
from lxml.html import tostring, fromstring
e = fromstring('''
<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>''')
print(tostring(e, encoding=str))  # use encoding=unicode on Python 2
will print:
<html><head><link href="/comments.css" rel="stylesheet" type="text/css"><link href="/index.css" rel="stylesheet" type="text/css"></head><body>
<span></span>
<span></span>
</body></html>
The spaces and line breaks inside <head> have been removed.
This happens even if we place the two <link> elements in <body>.
It seems that whitespace-only text nodes (\s*) between the elements in <head> are dropped.
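One way to check this guess is to look at the .text and .tail attributes, which is where lxml keeps the text between elements. The snippet below is only a rough sketch: the helper name link_tails is made up for illustration, and the values it prints may differ between lxml/libxml2 versions.

```python
from lxml.html import fromstring

DOC = '''<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>'''


def link_tails(html):
    """Return the .tail (text following the element) of each <link> in <head>.

    Hypothetical helper, not part of lxml. A tail of None means the parser
    kept no text node after that element.
    """
    head = fromstring(html).find("head")
    return [link.tail for link in head.findall("link")]


tails = link_tails(DOC)
for tail in tails:
    print(repr(tail))
```

If the whitespace had been kept, each tail would be a string containing the line break; None would indicate that the parser discarded it.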
How can I preserve the spaces and line breaks between the <link> elements? (I expect the output to be exactly the same as the input.)
In the end, I used html5lib to parse the HTML and build an lxml tree with it:
import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)
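For completeness, here is roughly how that parser can be used end to end. This is a sketch rather than a definitive recipe: the names HTML, root and link_tails are mine, and I am assuming that lxml.html.tostring is acceptable for serializing the tree html5lib builds (they are plain lxml elements).

```python
import html5lib
from lxml.html import tostring

HTML = '''<html><head>
<link href="/comments.css" rel="stylesheet" type="text/css">
<link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
<span></span>
<span></span>
</body>
</html>'''

# Same parser setup as above: build an lxml tree, without the XHTML namespace
# on element names, so tags are plain "head", "link", and so on.
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("lxml"),
    namespaceHTMLElements=False,
)

tree = parser.parse(HTML)  # an lxml ElementTree
root = tree.getroot()      # the <html> element

# The whitespace after each <link> is stored in that element's .tail.
link_tails = [link.tail for link in root.find("head").findall("link")]

print(tostring(root, encoding=str))
```

With this approach the line breaks after each <link> end up in link_tails instead of being discarded. The serialized output may still not be byte-for-byte identical to the input, since an HTML5 parser normalizes some things on its own.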