I’m trying to achieve the following replacement in python. Replace all html tags with {n}
& create a hash of [tag, {n}]
Original string -> “<h> This is a string. </H><P> This is another part. </P>“
Replaced text -> “{0} This is a string. {1}{2} This is another part. {3}”
Here’s my code. I’ve started with replacement but I’m stuck at the replacement logic as I cannot figure out the best way to replace each occurrence in a consecutive manner i.e with {0}, {1} and so on:
import re
text = "<h> This is a string. </H><p> This is another part. </P>"
num_mat = re.findall(r"(?:<(\/*)[a-zA-Z0-9]+>)",text)
print(str(len(num_mat)))
reg = re.compile(r"(?:<(\/*)[a-zA-Z0-9]+>)",re.VERBOSE)
phctr = 0
#for phctr in num_mat:
# phtxt = "{" + str(phctr) + "}"
phtxt = "{" + str(phctr) + "}"
newtext = re.sub(reg,phtxt,text)
print(newtext)
Can someone help with a better way of achieving this? Thank you!
prints
Works for simple tags only (no attributes/spaces in tags). For extended tags, you have to adapt the regexp.
Also, not reinitializing
cnt = it.count()will keep the numbering going on.UPDATE to get a mapping dict:
prints: