The Archive Base

Editorial Team
Asked: June 16, 2026

I am writing a python script to parse the content of WordPress Export XML

I am writing a Python script to parse the content of a WordPress export XML file (WP XML) and generate a LaTeX document. So far the WP XML is parsed via lxml.etree, and the code generates a new XML tree to be processed by TeXML, which in turn generates the .tex file.

Currently I extract each post along with certain metadata (title, publication date, tags, content). The metadata poses no problem, but the content is problematic. Inside the WP XML the content is included as a CDATA section in plain HTML/WordPress markup. To convert it into LaTeX I chose pandoc. TeXML supports inline LaTeX, so the converted content is added to the tree as plain LaTeX.

I decided to use pandoc here because it already converts most HTML tags nicely (a, strong, em…); the only problem is how it deals with images.

I use a subprocess to interface with pandoc:

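from subprocess import Popen, PIPE

# Pipe the post's HTML through pandoc and collect the LaTeX from stdout
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=PIPE, stdin=PIPE, stderr=PIPE)
tex_result = p.communicate(input=my_html_string.encode('utf-8'))[0]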

A sample post might look like this:

<strong>Lorem ipsum dolor</strong>  sit amet, consectetur adipiscing elit.

<a href="http://link_to_source_image.jpg"><img class="alignnone size-medium wp-image-id" title="Title_text" src="http://link_to_scaled_down_version.jpg" alt="Some alt text" width="262" height="300" /></a>

Nam nulla ante, vestibulum a euismod sed, accumsan at magna. Cras non augue risus, vitae gravida quam.

I need images with captions embedded as figures e.g.

\begin{figure}
\includegraphics{link_to_image.jpg}
\caption{Some alt text}
\label{fig:some_label}
\end{figure}
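
One way to get exactly this output is to bypass pandoc for images and build the environment from the img attributes directly. The following is only a minimal sketch using the standard library's html.parser; the names `ImgCollector`, `escape_latex` and `figure_from_img` are made up for this illustration, the label is supplied by the caller, and the src value is passed through unchanged (a real document would probably need a local file path instead of a URL). The `\label` is placed after the `\caption` so that references resolve to the figure number.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            self.images.append(dict(attrs))


def escape_latex(text):
    """Escape characters that are special in LaTeX text mode."""
    replacements = {
        "\\": r"\textbackslash{}",
        "&": r"\&",
        "%": r"\%",
        "$": r"\$",
        "#": r"\#",
        "_": r"\_",
        "{": r"\{",
        "}": r"\}",
        "~": r"\textasciitilde{}",
        "^": r"\textasciicircum{}",
    }
    # Character-by-character replacement avoids escaping the escapes.
    return "".join(replacements.get(ch, ch) for ch in text)


def figure_from_img(attrs, label):
    """Build a figure environment from one img attribute dict (hypothetical helper)."""
    caption = escape_latex(attrs.get("alt") or "")
    lines = [
        r"\begin{figure}",
        r"\includegraphics{%s}" % (attrs.get("src") or ""),
        r"\caption{%s}" % caption,
        r"\label{fig:%s}" % label,
        r"\end{figure}",
    ]
    return "\n".join(lines)
```

Feeding a post's HTML to `ImgCollector().feed(...)` fills `images` with one dict per img tag, and each dict can then be passed to `figure_from_img` together with a label of your choice.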

Pandoc, however, seems to convert HTML img tags into simple inline images, discarding any title or alt text:

\href{http://link\_to\_source\_image.jpg}{\includegraphics{http://link_to_scaled_down_version.jpg}}

I peeked into the source, and it looks like img is only treated as an inline element (in pandoc's HTML parsing function). I don't know Haskell, so this is as far as I got.

If you convert the HTML into Markdown, though, pandoc keeps the alt and title, and the result is similar to:

![Some alt text](http://link_to_scaled_down_version.jpg "Title_text")

With markdown you can either have inlined images or figures in the resulting latex document. If you convert this markdown into latex the result is

\begin{figure}[htbp]
\centering
\includegraphics{http://link_to_scaled_down_version.jpg}
\caption{Some alt text}
\end{figure}
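
The two-pass route described above (HTML to Markdown, then Markdown to LaTeX) could be scripted roughly as follows. This is a sketch, not a tested pipeline: `run_pandoc` and `html_to_latex_via_markdown` are names invented here, pandoc is assumed to be on the PATH, and only the `-f` and `-t` options already used in the question appear. The converter is passed in as a parameter so the chaining can be exercised without pandoc installed.

```python
import subprocess


def run_pandoc(text, source_format, target_format):
    """Pipe text through the pandoc CLI and return its output as a string."""
    result = subprocess.run(
        ["pandoc", "-f", source_format, "-t", target_format],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pandoc exits non-zero
    )
    return result.stdout.decode("utf-8")


def html_to_latex_via_markdown(html, convert=run_pandoc):
    """Convert HTML to LaTeX in two passes, with Markdown in between."""
    markdown = convert(html, "html", "markdown")
    return convert(markdown, "markdown", "latex")
```

Note that an image wrapped in a link, as in the sample post, may come out of the first pass as a linked image rather than a paragraph holding only an image, in which case the second pass would not necessarily produce a figure.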

At first pandoc seemed like a simple solution for parsing the content, but I am a bit stuck: pandoc does not support inline LaTeX in HTML input, so I cannot convert the images myself first and then run the rest through pandoc.

Does anyone have an idea how to (better) process img tags in HTML so that they end up in a LaTeX figure environment with a caption?

1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 2:24 pm

    Pandoc treats paragraphs containing only an image specially, as images with captions. These will be turned into LaTeX figures with captions. Thus:

    % pandoc -f html -t latex
    <p><img src="myimg.jpg" alt="my text" title="my title"/></p>
    ^D
    \begin{figure}[htbp]
    \centering
    \includegraphics{myimg.jpg}
    \caption{my text}
    \end{figure}
    

    This might help you.
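
In the sample post from the question the image is wrapped in a link and is not inside a p element, so it would not count as a paragraph containing only an image. A possible preprocessing step, sketched here under the assumption that the linked images look like the sample, is to replace each `<a><img/></a>` with `<p><img/></p>` before calling pandoc. `isolate_images` is a made-up helper name, and the regular expression is deliberately simple: it will miss unusual markup (for example attribute values containing `>`), so a real HTML parser such as lxml.html would be more robust. Pandoc's handling of image-only paragraphs may also differ between versions.

```python
import re

# Matches an anchor whose only content is a single img tag.
LINKED_IMG = re.compile(
    r"<a\b[^>]*>\s*(<img\b[^>]*>)\s*</a>",
    re.IGNORECASE,
)


def isolate_images(html):
    """Replace <a><img/></a> with <p><img/></p> so the image stands alone."""
    return LINKED_IMG.sub(r"<p>\1</p>", html)
```

The result of `isolate_images(my_html_string)` can then be piped to `pandoc -f html -t latex` as before. The link to the full-size image is dropped in the process; if it is needed, it would have to be captured separately.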


© 2021 The Archive Base. All Rights Reserved