I’m writing a script to go through a product database with poorly, inconsistently formatted

Question

Asked: June 3, 2026 (2026-06-03T08:27:08+00:00)

I’m writing a script to go through a product database with poorly, inconsistently formatted product descriptions and make its HTML uniform. One problem I’m having is capturing and replacing runs of consecutive lines that are formatted the same way. For example, I’d like to replace every block like

&bull; item 1
&bull; item 2
&bull; item 3

with

<ul>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 3</li>
</ul>

Replacing each &bull; line with a <li>content</li> line is easy enough, but I can’t for the life of me figure out a regex that finds where the list begins and ends. My thought is to capture everything starting with &bull; until there is a new line that does not start with &bull;. Here’s my latest try (Python):

In  : p = re.compile(
        r'&bull;.*(?!^&bull;)'
      )

In  : p.findall(text, re.MULTILINE, re.DOTALL)
Out : []

In  : p.findall(text, re.MULTILINE)
Out : ['&bull; item 1', '&bull; item 2', '&bull; item 3']

In  : p.findall(text, re.DOTALL)
Out : ['&bull; item 1', '&bull; item 2', '&bull; item 3']

In  : p.findall(text)
Out : ['&bull; item 1', '&bull; item 2', '&bull; item 3']

Any ideas on how to capture something like ['&bull; item 1\n&bull; item 2\n&bull; item 3']?

1 Answer

Editorial Team · Answer 1 · 2026-06-03T08:27:09+00:00

Here’s a non-regex based solution:

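# Read the input; each line keeps its trailing newline.
with open('/tmp/example.txt') as f:
  lines_in = f.readlines()

inside_block = False  # True while we are inside a run of bullet lines
lines_out = []

for line in lines_in:
  if line.startswith('&bull; '):
    if not inside_block:
      # First bullet of a run: open the list.
      lines_out.append('<ul>\n')
      inside_block = True
    # Drop the '&bull; ' prefix and surrounding whitespace.
    lines_out.append('<li>{}</li>\n'.format(line[len('&bull; '):].strip()))
  else:
    if inside_block:
      # First non-bullet line after a run: close the list.
      lines_out.append('</ul>\n')
      inside_block = False
    lines_out.append(line)

# Close a list that runs to the very end of the file.
if inside_block:
  lines_out.append('</ul>\n')

print(''.join(lines_in))
print('-' * 78)
print(''.join(lines_out))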

Test run:

[~/Desktop]
|7>run /tmp/spam.py
spam
&bull; item 1
&bull; item 2
&bull; item 3
and eggs

------------------------------------------------------------------------------
spam
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
and eggs
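
If a regex is still preferred, here is one possible sketch. It assumes every bullet line begins with the literal text '&bull; ' at the start of a line; the names BULLET_RUN and bullets_to_ul are made up for this illustration. One thing worth knowing about the attempts in the question: flags belong in re.compile(). On a compiled pattern, findall(string, pos, endpos) treats extra positional arguments as start and end offsets, so p.findall(text, re.MULTILINE, re.DOTALL) only searches text[8:16], which would explain the empty result.

```python
import re

# One or more consecutive lines that each start with '&bull; '.
# re.MULTILINE lets ^ match at the start of every line; '.' does not
# cross newlines because re.DOTALL is deliberately left off.
BULLET_RUN = re.compile(r'(?:^&bull; .*\n?)+', re.MULTILINE)


def bullets_to_ul(text):
    """Replace each run of '&bull; ' lines with a <ul> block (illustrative helper)."""
    def build_list(match):
        # Strip the '&bull; ' prefix from every line in the matched run.
        items = [line[len('&bull; '):].strip()
                 for line in match.group(0).splitlines()]
        body = ''.join('  <li>{}</li>\n'.format(item) for item in items)
        return '<ul>\n' + body + '</ul>\n'

    return BULLET_RUN.sub(build_list, text)


sample = 'spam\n&bull; item 1\n&bull; item 2\n&bull; item 3\nand eggs\n'
print(BULLET_RUN.findall(sample))  # each run comes back as one string
print(bullets_to_ul(sample))
```

Because the pattern has no capturing groups, findall() returns each whole run as a single string, which is the shape asked for in the question (with the trailing newline included). Note that this sketch always ends the generated block with a newline, even when the last bullet line in the input has none.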
