I have the following input and want to parse it into comma-delimited (CSV) rows. I can already extract the SKUs with a regex pattern, but I am new to regular expressions and don’t know how to write more complex patterns. I would appreciate any help with this.
Thanks!
charset="iso-8859-1"
BODY {
}
TD {
}
TH {
}
H1 {
}
TABLE,IMG,A {
}
**PO Number:** 35102
**Ship To:**
Georgie Clements
6902 Stonegate Drive
Odessa, TX 79765
432-363-8459
SKU
Product
Qty
JJ-Rug-Zebra-PK
Zebra Pink Rug
1
JJ-Zebra-PK-Twin-4
Zebra Pink 4 Piece Twin Comforter Set
1
JJ-TwinSheets-Zebra-PK
Zebra Pink 3 Piece Twin Sheet Set
1
JJ-Memo-Zebra-PK
Zebra Pink Memory Board
1
I want the output formatted like this:
PONumber, Shipping info, SKU, Product, Qty
'35102', '[ShipToAddress]', 'JJ-Rug-Zebra-PK', 'Zebra Pink Rug', '1'
'35102', '[ShipToAddress]', 'JJ-Zebra-PK-Twin-4', 'Zebra Pink 4 Piece Twin Comforter Set', '1'
'35102', '[ShipToAddress]', 'JJ-TwinSheets-Zebra-PK', 'Zebra Pink 3 Piece Twin Sheet Set', '1'
'35102', '[ShipToAddress]', 'JJ-Memo-Zebra-PK', 'Zebra Pink Memory Board', '1'
The current code is the following:
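import re

# msgStr holds the e-mail text shown above
pattern = re.compile(r'(\b\w*JJ-\S*)')
pos = 0
while True:
    # look for the next SKU, starting where the previous match ended
    match = pattern.search(msgStr, pos)
    if not match:
        break
    a = match.start()
    e = match.end()
    print(' %2d : %2d = %s' % (a, e - 1, msgStr[a:e]))
    pos = e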
Here’s another solution, not using regular expressions:
which gives the final result
Edit: second data file returns
which on inspection appears to be correct?
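As a rough illustration of the non-regex approach mentioned above, here is a minimal Python 3 sketch. It assumes the text always follows the layout in the question: a `**PO Number:**` line, a `**Ship To:**` line followed by the address, the three header lines `SKU`, `Product`, `Qty`, and then the items in groups of three lines. The names `parse_order`, `to_csv` and `SAMPLE` are made up for this example, and the code is not necessarily what the answer originally used.

```python
import csv
import io

# Shortened copy of the input from the question (assumed layout).
SAMPLE = """\
**PO Number:** 35102
**Ship To:**
Georgie Clements
6902 Stonegate Drive
Odessa, TX 79765
432-363-8459
SKU
Product
Qty
JJ-Rug-Zebra-PK
Zebra Pink Rug
1
JJ-Zebra-PK-Twin-4
Zebra Pink 4 Piece Twin Comforter Set
1
"""

PO_LABEL = "**PO Number:**"
SHIP_LABEL = "**Ship To:**"
HEADER = ["SKU", "Product", "Qty"]


def parse_order(text):
    """Return (po_number, ship_to, items) from the html2text-style layout."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # The PO number shares a line with its label.
    po_line = next(ln for ln in lines if ln.startswith(PO_LABEL))
    po_number = po_line[len(PO_LABEL):].strip()

    # The address runs from the "Ship To" label up to the column headers.
    ship_idx = lines.index(SHIP_LABEL)
    hdr_idx = next(i for i in range(ship_idx + 1, len(lines) - 2)
                   if lines[i:i + 3] == HEADER)
    ship_to = ", ".join(lines[ship_idx + 1:hdr_idx])

    # Everything after the headers is SKU / product / quantity triples;
    # an incomplete trailing group is ignored.
    body = lines[hdr_idx + 3:]
    items = [tuple(body[i:i + 3]) for i in range(0, len(body) - 2, 3)]
    return po_number, ship_to, items


def to_csv(text):
    """Build one CSV row per item, repeating the PO number and address."""
    po_number, ship_to, items = parse_order(text)
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["PONumber", "Shipping info", "SKU", "Product", "Qty"])
    for sku, product, qty in items:
        writer.writerow([po_number, ship_to, sku, product, qty])
    return out.getvalue()
```

Calling `print(to_csv(SAMPLE))` prints a header row followed by one row per item. The `csv` module quotes fields with double quotes by default; passing `quotechar="'"` to `csv.writer` would produce single quotes closer to the format requested in the question.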
Final Summary: It turned out that the asker was using html2text to convert the HTML email to text and then trying to parse that text. The solution was to parse the HTML directly with BeautifulSoup instead, taking advantage of the page structure to identify the fields he wanted.
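To give an idea of what parsing the HTML directly can look like, here is a small Python 3 sketch. It uses the standard library's `html.parser` rather than BeautifulSoup, purely so that it runs without extra packages; the idea of walking table rows and cells is the same. The HTML in `SAMPLE_HTML` is an assumption (the real email markup is not shown in the question), and `TableRows` and `item_rows` are hypothetical names for this example.

```python
from html.parser import HTMLParser

# Assumed markup: the real email's HTML is not shown in the question.
SAMPLE_HTML = """
<table>
  <tr><th>SKU</th><th>Product</th><th>Qty</th></tr>
  <tr><td>JJ-Rug-Zebra-PK</td><td>Zebra Pink Rug</td><td>1</td></tr>
  <tr><td>JJ-Memo-Zebra-PK</td><td>Zebra Pink Memory Board</td><td>1</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (such as whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so wrapped text becomes one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def item_rows(html):
    """Return the [sku, product, qty] rows that follow the header row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    start = parser.rows.index(["SKU", "Product", "Qty"]) + 1
    return [row for row in parser.rows[start:] if len(row) == 3]
```

Because the cells arrive already separated, there is no need to guess where a product name ends and a quantity begins, which is the main advantage over parsing the flattened text.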