I’m dealing with a lot of .xml files. (Millions – an .xml formatted dump

Question

Asked: June 9, 2026 (2026-06-09T02:12:37+00:00)

I’m dealing with a lot of .xml files (millions: an XML-formatted dump of Wikipedia), and they’re a lot less readable than I imagined.

For the time being, I’ve written a .css file to display them in a readable manner in a browser, and wrote a script to plug a reference to this .css into all the files.

(I know there are other solutions, like XSLT, but all the information I found made it seem document-level, which didn’t suit. I’m really trying not to expand the size of these files if possible.)

The .css works fine for some of the files, but many contain entities like &nbsp; and I get errors like:

“XML Parsing Error: undefined entity” with a nice little illustration pointing to &nbsp; or its kin within a quote.

There is an articles.dtd file, which seems like it should connect the dots ( keyword -> Unicode ) for the browser. It is referenced in each file like:

 <!DOCTYPE article SYSTEM "../article.dtd">

and contains a lot of entries like:

<!ENTITY nbsp   "&#160;"> <!-- no-break space = non-breaking space,
                              U+00A0 ISOnum -->

but either I’m entirely misunderstanding what this file is for, or it’s not working correctly.

In any case, how can I make these documents display? Either by:

Displaying the entities as plain text (like “&nbsp;”)
Removing the entities altogether (by any means other than just a linear search/removal of them in the actual files)
Interpreting the entities as Unicode, as they were intended

Naturally, the last option is preferable; absolutely ideally, by referencing some sort of external file that maps entities to Unicode (if that’s not what the articles.dtd file is for).

EDIT: I’m not working with a powerful machine here; extracting the .rar archives took days. Any sort of edit to each file would take a very long time.

1 Answer

Editorial Team · Answer 1 · 2026-06-09T02:12:39+00:00

I’ve since solved my problem. In case it helps anyone in future:

It turned out the crux of my problem was that browsers no longer load external .dtd files. (External DTDs are still valid XML; it is browser support for fetching them that is missing.)

The function of the .dtd was in fact to declare the entities I was having trouble with (&nbsp; etc.), as I thought. But because browsers no longer fetch or parse external .dtd files (and the only way to force them to depends on files in the browser’s install on the client machine), the entities went undeclared.
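The effect can be reproduced outside a browser. The following is only a rough sketch, using Python's standard-library `xml.etree.ElementTree` as a stand-in for a parser that does not fetch external DTDs; it is an analogy, not a statement about any particular browser's internals. The sample documents and the `try_parse` helper are made up for illustration. It shows the same entity failing when declared only in an external DTD, and resolving when declared in the document's internal subset.

```python
import xml.etree.ElementTree as ET

# Mirrors the question: &nbsp; is declared only in an external DTD,
# which this parser does not fetch.
EXTERNAL_ONLY = (
    '<!DOCTYPE article SYSTEM "../article.dtd">'
    "<article>a&nbsp;b</article>"
)

# Same content, but the declaration sits in the internal subset,
# so the parser sees it without fetching anything.
INTERNAL_SUBSET = (
    '<!DOCTYPE article [<!ENTITY nbsp "&#160;">]>'
    "<article>a&nbsp;b</article>"
)


def try_parse(source):
    """Return the root element's text, or None if the parser rejects it."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    print(repr(try_parse(EXTERNAL_ONLY)))    # undefined entity, so None
    print(repr(try_parse(INTERNAL_SUBSET)))  # entity expanded to U+00A0
```

Declaring every entity inline in each file would of course grow the files, which is the trade-off the question was trying to avoid.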

I had sourced an XML collection that was simply too old to work with current browsers, without realizing it.

The solution that best suited my circumstances turned out to be lazy processing of each file as it was requested, with a simple flag to differentiate processed files from unprocessed ones.
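For anyone wanting a starting point, here is a minimal sketch of what such lazy processing could look like. The details are assumptions, not a record of what was actually run: it assumes the "processing" is replacing named entities with numeric character references (which need no DTD), that the files are UTF-8, and that the flag is a trailing XML comment. The names `MARKER`, `to_numeric`, and `serve` are hypothetical.

```python
import re
from html.entities import name2codepoint
from pathlib import Path

# Hypothetical flag: a trailing comment marks a file as already processed.
MARKER = "<!-- entities-resolved -->"

# The five entities predefined by XML itself must be left untouched.
XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric(match):
    """Turn a known named entity into a numeric character reference."""
    name = match.group(1)
    if name in XML_BUILTINS or name not in name2codepoint:
        return match.group(0)  # leave built-in or unknown entities as-is
    return f"&#{name2codepoint[name]};"


def serve(path):
    """Return the file's text, rewriting it on first request only."""
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")  # assumption: UTF-8 files
    if text.rstrip().endswith(MARKER):
        return text  # already processed on an earlier request
    processed = ENTITY_RE.sub(to_numeric, text).rstrip() + "\n" + MARKER + "\n"
    file_path.write_text(processed, encoding="utf-8")
    return processed
```

Only files that are actually requested get rewritten, so the cost is spread over time instead of paid up front across millions of files. The entity table here comes from Python's HTML 4 list in `html.entities`; entities that article.dtd defines beyond that list would be left unchanged by this sketch.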
