An example of fragments that have identical hierarchical structure:
(1)
<div>
<span>It's a message</span>
</div>
(2)
<div>
<span class='bold'>This is a new text</span>
</div>
An example of fragments that have different structure:
(1)
<div>
<span><b>It's a message</b></span>
</div>
(2)
<div>
<span>This is a new text</span>
</div>
So, fragments have the same structure when they correspond to the same hierarchical tree: the same tag names in the same nesting. Attributes and text content do not matter.
How can I detect whether two elements (HTML fragments) have the same structure using lxml?
I have a function that does not work properly for cases more complex than the examples above:
def _is_equal( el1, el2 ):
    # input: 2 elements with possible equal structure and tag names
    # e.g. root = lxml.html.fromstring( buf )
    #      el1 = root[ 0 ]
    #      el2 = root[ 1 ]
    # move from top to bottom, compare elements
    result = False
    if el1.tag == el2.tag:
        # has no children
        if len( el1 ) == len( el2 ):
            if len( el1 ) == 0:
                return True
            else:
                # iterate one of them, for example el1
                i = 0
                for child1 in el1:
                    child2 = el2[ i ]
                    is_equal2 = _is_equal( child1, child2 )
                    if not is_equal2:
                        return False
                return True
        else:
            return False
    else:
        return False
The code fails to detect that the two divs with class="tovar2" have an identical structure:
<body>
<div class="tovar2">
<h2 class="new">
<a href="http://modnyedeti-krsk.ru/magazin/product/333193003">
Куртка д/д
</a>
</h2>
<ul class="art">
<li>
Артикул: <span>1759</span>
</li>
</ul>
<div>
<div class="wrap" style="width:180px;">
<div class="new">
<img src="shop_files/new-t.png" alt="">
</div>
<a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)">
<img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="Куртка д/д" height="160" width="180">
</a>
</div>
</div>
<form action="" onsubmit="return addProductForm(17094601,333193003,3150.00,this,false);">
<ul class="bott ">
<li class="price">Цена:<br>
<span>
<b>
3 150
</b> руб.
</span>
</li>
<li class="amount">Кол-во:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
</li>
<li class="buy"><input value="" type="submit">
</li>
</ul>
</form>
</div>
<div class="tovar2">
<h2 class="new">
<a href="http://modnyedeti-krsk.ru/magazin/product/333124803">Куртка д/д</a>
</h2>
<ul class="art">
<li>
Артикул: <span>1759</span>
</li>
</ul>
<div>
<div class="wrap" style="width:180px;">
<div class="new">
<img src="shop_files/new-t.png" alt="">
</div>
<a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)">
<img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="Куртка д/д" height="160" width="180">
</a>
</div>
</div>
<form action="" onsubmit="return addProductForm(17094601,333124803,3150.00,this,false);">
<ul class="bott ">
<li class="price">Цена:<br>
<span>
<b>3 150</b> руб.
</span>
</li>
<li class="amount">Кол-во:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
</li>
<li class="buy">
<input value="" type="submit">
</li>
</ul>
</form>
</div>
</body>
You are complicating things a little; you only need to return False at the end when things have been proven to not be True.
Two elements are equal when their tags match, their lengths match, and each paired child element is the same.
Python makes testing if all elements in a sequence are True really easy with the all() function, and by using zip() we can pair up the element children nicely. all() will terminate early if any child pair is not equal.