I am writing a python script to parse the content of WordPress Export XML

Asked: June 16, 2026

I am writing a Python script that parses the content of a WordPress Export XML file (WP XML) to generate a LaTeX document. So far the WP XML is parsed via lxml.etree, and the code generates a new XML tree to be processed by TeXml, which in turn generates the .tex file.

Currently I extract each post along with certain metadata (title, publication date, tags, content). The metadata poses no problem, but the content part is a bit problematic. Inside the WP XML the content is included as a CDATA section containing plain HTML/WordPress markup. To convert it into LaTeX I chose pandoc. TeXml supports inline LaTeX, so the converted content is added to the tree as plain LaTeX.

I decided to use pandoc here because it already converts most of the HTML tags nicely (a, strong, em…); the only problem I have is how it deals with images.

I use a subprocess to interface with pandoc:

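from subprocess import Popen, PIPE

# Send the HTML to pandoc on stdin and read the LaTeX back from stdout
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=PIPE, stdin=PIPE, stderr=PIPE)
tex_result = p.communicate(input=my_html_string.encode('utf-8'))[0]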
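The call above discards pandoc's exit status and error output, so a failed conversion silently yields an empty result. A slightly more defensive variant could look like the sketch below. The names `run_filter` and `html_to_latex` are hypothetical helpers, not part of pandoc or of the original script, and the pandoc arguments are the same ones used above.

```python
import subprocess


def run_filter(args, text):
    """Feed `text` to the command `args` on stdin and return its stdout as str.

    Raises RuntimeError carrying the command's stderr if it exits non-zero.
    """
    proc = subprocess.run(
        args,
        input=text.encode('utf-8'),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.decode('utf-8', errors='replace'))
    return proc.stdout.decode('utf-8')


def html_to_latex(html):
    # Same pandoc arguments as in the snippet above (requires pandoc on PATH)
    return run_filter(['pandoc', '-f', 'html', '-t', 'latex'], html)
```

Keeping the command list as a parameter makes it possible to swap in other pandoc formats, or a different command entirely, without touching the error handling.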

A sample post might look like this:

<strong>Lorem ipsum dolor</strong>  sit amet, consectetur adipiscing elit.

<a href="http://link_to_source_image.jpg"><img class="alignnone size-medium wp-image-id" title="Title_text" src="http://link_to_scaled_down_version.jpg" alt="Some alt text" width="262" height="300" /></a>

Nam nulla ante, vestibulum a euismod sed, accumsan at magna. Cras non augue risus, vitae gravida quam.

I need images with captions embedded as figures, e.g.:

\begin{figure}
\includegraphics{link_to_image.jpg}
\caption{Some alt text}
\label{fig:some_label}
\end{figure}

However, pandoc seems to convert HTML img tags into a simple inline image, discarding any title or alt text:

\href{http://link\_to\_source\_image.jpg}{\includegraphics{http://link_to_scaled_down_version.jpg}}

I peeked into the source (the pandoc parsing function), and it looks like img is only treated as an inline element. I don't know Haskell, so this is as far as I got.

If you convert the HTML into Markdown, though, pandoc keeps the alt and title, and the result is similar to:

![Some alt text](http://link_to_scaled_down_version.jpg "Title_text")

With Markdown you can have either inline images or figures in the resulting LaTeX document. If you convert this Markdown into LaTeX, the result is:

\begin{figure}[htbp]
\centering
\includegraphics{http://link_to_scaled_down_version.jpg}
\caption{Some alt text}
\end{figure}

At first pandoc seemed like a simple way to parse the content, but now I am a bit stuck: pandoc also doesn't support inline LaTeX in HTML input, so I can't convert the images to figures myself first and then run the rest through pandoc.

Does anyone have an idea how to (better) process img tags in HTML so that they end up in a LaTeX figure environment with a caption?

1 Answer

Editorial Team · Answer 1 · 2026-06-16T14:24:23+00:00

Pandoc gives special treatment to a paragraph that contains nothing but an image: it is read as an image with a caption and written to LaTeX as a figure, with the alt text as the caption. Thus:

% pandoc -f html -t latex
<p><img src="myimg.jpg" alt="my text" title="my title"/></p>
^D
\begin{figure}[htbp]
\centering
\includegraphics{myimg.jpg}
\caption{my text}
\end{figure}

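So if you make sure each image sits alone in its own p element before you call pandoc, you should get the figure environment you are after. In your sample the img is wrapped in an a element, so you would probably need to unwrap it first.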
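To apply this to WordPress content like the sample in the question, each img has to end up alone in its own paragraph. The following is a minimal regex-based sketch of that preprocessing step, not a general HTML rewriter. `isolate_images` is a hypothetical helper name. The sketch assumes the simple `<a ...><img ... /></a>` shape from the sample, with no `>` inside attribute values, and it drops the wrapping link, so the URL of the full-size image is lost. For arbitrary markup an HTML parser such as lxml.html would likely be more robust.

```python
import re

# Either an <img> wrapped directly in a single <a>...</a>, or a bare <img>.
# Assumption: attribute values contain no '>' character.
IMG_PATTERN = re.compile(
    r'<a\b[^>]*>\s*(<img\b[^>]*>)\s*</a>|(<img\b[^>]*>)',
    re.IGNORECASE,
)


def isolate_images(html):
    """Wrap every <img> in a <p> of its own, dropping a directly wrapping link."""
    def wrap(match):
        # Group 1 is set for a linked image, group 2 for a bare one
        img = match.group(1) or match.group(2)
        return '<p>' + img + '</p>'
    return IMG_PATTERN.sub(wrap, html)


sample = (
    '<strong>Lorem ipsum dolor</strong> sit amet.\n\n'
    '<a href="http://link_to_source_image.jpg">'
    '<img src="http://link_to_scaled_down_version.jpg" alt="Some alt text" />'
    '</a>\n\nNam nulla ante.'
)
prepared = isolate_images(sample)
```

The string in `prepared` can then be passed to pandoc in place of the original HTML. Ordinary text links are left untouched, since only an a element whose sole content is an img matches the first alternative.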
