The Archive Base

Editorial Team
Asked: May 22, 2026 (2026-05-22T23:29:07+00:00)

I’m using php to take xml files and convert them into single line tab

I’m using PHP to take XML files and convert them into single-line, tab-delimited plain text with fixed columns (i.e. it ignores tags the database does not need, and some tags will be empty). The problem I ran into is that it took 13 minutes to go through 56k (plus change) files, which I think is ridiculously slow (the average folder has upwards of a million XML files). I’ll probably run it as an overnight cron job anyway, but at that pace I can’t reasonably test it while I’m at work for things like missing or corrupt files.

Here’s hoping someone can help me make it faster. The XML files themselves are not big (<1k lines) and I don’t need every data tag, just some. Here’s my data node method:

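// Builds "<value>[ATTRIBS]<name>=<value>..." for every node in $entries.
function dataNode (DOMNodeList $entries) {
    $out = "";

    foreach ($entries as $e) {
        $out .= $e->nodeValue."[ATTRIBS]";
        // Only element nodes carry attributes; the property is null otherwise.
        foreach ($e->attributes ?? [] as $name => $node) {
            $out .= $name."=".$node->nodeValue;
        }
    }

    return $out;
}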

where $entries is a DOMNodeList generated from XPath queries for the nodes I need. So the question is: what is the fastest way to go to a target data node or nodes (if my XPath query returns 10 keyword nodes, all of them need to be printed by that function) and output the nodeValue and all its attributes?

I read here that iterating through a DOMNodeList isn’t constant time, but I can’t really use the solution given. A sibling of the node I want might be one that I don’t need, or one that needs a different format function before I write it to file, and I really don’t want to run the node through a gigantic switch statement on every iteration just to format the data.

Edit: I’m an idiot. I had my write function inside my processing loop, so on every iteration it had to reopen the file I was writing to. Thanks to both of you for your help. I’m trying to learn XSLT right now, as it seems very useful.
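For readers who hit the same slowdown, the fix described in the edit might look roughly like the sketch below: the output file is opened once before the loop and closed once after it. The function name `convertFiles` and the `//keyword` XPath query are illustrative assumptions, not the original code.

```php
<?php
// Hypothetical sketch: convert each XML file to one tab-delimited line,
// opening the output file once instead of once per input file.
function convertFiles(array $xmlPaths, string $outPath): int
{
    $out = fopen($outPath, 'w');          // opened once, outside the loop
    if ($out === false) {
        throw new RuntimeException("Cannot open $outPath for writing");
    }

    $written = 0;
    foreach ($xmlPaths as $path) {
        $doc = new DOMDocument();
        // Skip missing or corrupt files; warnings are silenced for brevity.
        if (!is_file($path) || !@$doc->load($path)) {
            continue;
        }
        $xpath = new DOMXPath($doc);
        $cols = [];
        // '//keyword' is an assumed query; substitute the nodes you need.
        foreach ($xpath->query('//keyword') as $node) {
            $cols[] = $node->nodeValue;
        }
        fwrite($out, implode("\t", $cols) . "\n");
        $written++;
    }

    fclose($out);                         // closed once, after the loop
    return $written;
}
```

The return value is the number of files actually converted, which makes skipped files easy to notice.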

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 11:29 pm

    A comment would be a little short, so I’m writing this as an answer:

    It’s hard to say where your setup would actually benefit from optimization. Perhaps it’s possible to join several of your many XML files together before loading.

    From the information in your question, I would assume that the disk operations take more time than the XML parsing. I have found DOMDocument and XPath quite fast even on large files: an XML file of up to 60 MB takes about 4-6 seconds to load, a 2 MB file only a fraction of that.

    Having many small files (< 1k lines) means a lot of work on the disk, opening and closing files. Additionally, I have no idea how you iterate over directories and files; sometimes this can be sped up dramatically as well, especially since you say you have millions of files.

    So perhaps concatenating or merging files is an option for you. It can be done quite safely and would reduce the time needed to test your converter.
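One way to check that assumption is to time the disk read and the parse separately. The helper below is only a sketch: the name `profileXmlFiles` is made up, and the numbers it reports will be distorted by the operating system's file cache on repeated runs.

```php
<?php
// Hypothetical helper: time disk reads and XML parsing separately.
function profileXmlFiles(array $paths): array
{
    $totals = ['files' => 0, 'read' => 0.0, 'parse' => 0.0];

    foreach ($paths as $path) {
        $t0 = microtime(true);
        $xml = @file_get_contents($path);   // disk I/O only
        $t1 = microtime(true);
        if ($xml === false || $xml === '') {
            continue;                        // unreadable or empty file
        }

        $doc = new DOMDocument();
        $ok = @$doc->loadXML($xml);          // parsing only, no disk access
        $t2 = microtime(true);
        if (!$ok) {
            continue;                        // not well-formed
        }

        $totals['files']++;
        $totals['read']  += $t1 - $t0;
        $totals['parse'] += $t2 - $t1;
    }

    return $totals;
}
```

If `read` dominates `parse` on a cold cache, the file handling rather than the DOM work is the place to look.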
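As an illustration of the directory iteration point, a generator over `FilesystemIterator` hands out one path at a time instead of building an array with a million entries first. This is a sketch under assumptions: the function name `xmlFilesIn` is invented, and it only looks at one directory level.

```php
<?php
// Hypothetical generator: yield XML file paths one at a time.
function xmlFilesIn(string $dir): Generator
{
    $flags = FilesystemIterator::SKIP_DOTS
           | FilesystemIterator::CURRENT_AS_PATHNAME;

    foreach (new FilesystemIterator($dir, $flags) as $path) {
        // Cheap extension check; no stat() call per file.
        if (strtolower(substr($path, -4)) === '.xml') {
            yield $path;
        }
    }
}
```

A caller would simply write `foreach (xmlFilesIn($dir) as $path) { ... }`. The order of the entries is whatever the filesystem returns, so sort separately if the order matters.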
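A merge step could look something like the following sketch, which copies the root element of every small file into one wrapper document. The `<batch>` wrapper element, the `source` attribute and the function name are assumptions made for the example.

```php
<?php
// Hypothetical merge step: wrap many small XML files in one <batch> document.
function mergeXmlFiles(array $paths, string $outPath): int
{
    $batch = new DOMDocument('1.0', 'UTF-8');
    $root = $batch->appendChild($batch->createElement('batch'));

    $merged = 0;
    foreach ($paths as $path) {
        $doc = new DOMDocument();
        if (!is_file($path) || !@$doc->load($path)) {
            continue;                                   // skip missing/corrupt
        }
        // Deep-copy the file's root element into the batch document.
        $copy = $batch->importNode($doc->documentElement, true);
        // Remember where the record came from (overwrites any existing
        // "source" attribute on that element).
        $copy->setAttribute('source', basename($path));
        $root->appendChild($copy);
        $merged++;
    }

    $batch->save($outPath);
    return $merged;
}
```

The converter can then be tested against a handful of batch files rather than tens of thousands of tiny ones.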

    If you encounter missing or corrupt files, you should catch these errors and write them to a log. That way you can let the job run through and check for errors later.

    Additionally, if possible, try to make your workflow resumable: if an error occurs, the current state is saved, and next time you can continue from that state.

    The suggestion in a comment above to run an XSLT on the files to transform them first is a good idea as well. Adding a new layer in the middle to transform the data can shrink the overall problem dramatically, because it reduces complexity.
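A minimal sketch of such logging, assuming a tab-separated log file and a made-up function name `loadOrLog`: libxml's internal error buffer is used so that parse problems end up in the log instead of being printed as warnings.

```php
<?php
// Hypothetical loader: returns a DOMDocument, or null after logging the problem.
function loadOrLog(string $path, string $logPath): ?DOMDocument
{
    if (!is_readable($path)) {
        error_log(date('c') . "\tMISSING\t$path\n", 3, $logPath);
        return null;
    }

    // Collect parse errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->load($path);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        $first = $errors ? trim($errors[0]->message) : 'unknown parse error';
        error_log(date('c') . "\tCORRUPT\t$path\t$first\n", 3, $logPath);
        return null;
    }

    return $doc;
}
```

The job keeps running on a `null` result, and the log can be reviewed afterwards.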
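Resuming can be as simple as remembering how many files have been completed. The sketch below assumes the list of paths is in the same stable order on every run; `processResumable` and the state file format (a plain integer) are invented for illustration.

```php
<?php
// Hypothetical resumable loop: remembers how many files were completed.
function processResumable(array $paths, string $stateFile, callable $handler): int
{
    // Resume from the saved offset, or start at 0 on the first run.
    $start = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

    for ($i = $start, $n = count($paths); $i < $n; $i++) {
        $handler($paths[$i]);   // may throw; the saved offset then still points at $i
        file_put_contents($stateFile, (string) ($i + 1), LOCK_EX);
    }

    return $i;                  // number of files completed so far
}
```

Writing the state after every single file is itself disk work, so for millions of files it is probably better to save it only every few thousand iterations and accept reprocessing a few files after a crash.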
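To give an idea of what the XSLT route could look like, here is a sketch that turns every `<keyword>` element of a document into one tab-delimited line. It assumes PHP's xsl extension is installed; the `keyword` element name and the function name `xmlToTabLine` are placeholders.

```php
<?php
// Hypothetical sketch: let an XSLT stylesheet produce the tab-delimited line.
// Requires the PHP "xsl" extension (XSLTProcessor).
function xmlToTabLine(string $xml): string
{
    $stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:for-each select="//keyword">
      <xsl:value-of select="."/>
      <xsl:if test="position() != last()"><xsl:text>&#9;</xsl:text></xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
XSL;

    $xsl = new DOMDocument();
    $xsl->loadXML($stylesheet);

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);

    // With method="text" the result is the plain line, not XML markup.
    return (string) $proc->transformToXml($doc);
}
```

In a real job the stylesheet would be loaded and imported once and the same `XSLTProcessor` reused for every file.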

    This workflow on XML files has helped me so far:

    1. Preprocess the file (plain-text filters, optional).
    2. Parse the XML: load it into DOMDocument, iterate with XPath, etc.
    3. My parser sends out events with the parsed data when it is found.
    4. The parser throws a specific exception if it encounters data that is not in the expected format. That makes it possible to spot errors in my own parser.
    5. All other errors are converted to exceptions as well.
    6. Exceptions can be caught and operations finished, e.g. by moving on to the next file.
    7. Logger, Resumer and Exporter (file export) can hook onto the events. Sort of the visitor pattern.

    I’ve built such a system to process larger XML files whose formats change. It’s flexible enough to deal with changes (e.g. replacing the parser with a new version while keeping logging and exporting). The event system really made the difference for me.

    Instead of a gigantic switch statement, I normally use a $state variable for the parser’s state while iterating over a DOMNodeList. $state is also handy for resuming operations later: restore the state, go to the last known position, then continue.
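The steps above could be sketched roughly like this. It is not the system described in this answer, only a small illustration: the class names, the event names (`record`, `done`, `error`) and the `//keyword` query are all assumptions.

```php
<?php
// All class, event and element names here are hypothetical.
class UnexpectedFormatException extends RuntimeException {}

class XmlEventParser
{
    private $listeners = [];

    // Step 7: loggers, resumers, exporters register for the events they need.
    public function on(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    private function emit(string $event, ...$args): void
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener(...$args);
        }
    }

    // Steps 2-4: parse one XML string, emit "record" per <keyword> node,
    // throw a specific exception on unexpected input.
    public function parse(string $xml): void
    {
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $ok = $xml !== '' && $doc->loadXML($xml);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        if (!$ok) {
            throw new UnexpectedFormatException('Not well-formed XML');
        }

        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//keyword') as $node) {
            if (trim($node->nodeValue) === '') {
                throw new UnexpectedFormatException('Empty <keyword> node');
            }
            $this->emit('record', $node->nodeValue);
        }
    }

    // Steps 5-6: convert PHP errors to exceptions, catch per input, continue.
    public function run(iterable $inputs): void
    {
        set_error_handler(function ($severity, $message, $file, $line) {
            throw new ErrorException($message, 0, $severity, $file, $line);
        });
        try {
            foreach ($inputs as $id => $xml) {
                try {
                    $this->parse($xml);
                    $this->emit('done', $id);
                } catch (Exception $e) {
                    $this->emit('error', $id, $e);
                }
            }
        } finally {
            restore_error_handler();
        }
    }
}
```

An exporter would listen for `record`, a logger for `error`, and a resumer for `done`, so the parser itself can be swapped out without touching any of them.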
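As a rough illustration of the $state idea, the sketch below walks sibling nodes and lets the current state decide what to do with each one. The element names (`section`, `item`, `footer`), the state names and the function name are invented for the example.

```php
<?php
// Hypothetical state-driven walk over a DOMNodeList.
function walkWithState(DOMNodeList $nodes, string $state = 'header', int $position = 0): array
{
    $items = [];

    foreach ($nodes as $i => $node) {
        if ($i < $position) {
            continue;                 // already handled by an earlier run
        }

        if ($node->nodeName === 'section') {
            $state = 'items';         // following <item> siblings are wanted
        } elseif ($node->nodeName === 'footer') {
            $state = 'done';          // anything after this is ignored
        } elseif ($state === 'items' && $node->nodeName === 'item') {
            $items[] = $node->nodeValue;
        }

        $position = $i + 1;
    }

    // Returning state and position lets a later call pick up where this one stopped.
    return ['items' => $items, 'state' => $state, 'position' => $position];
}
```

Saving the returned `state` and `position` and passing them back in on the next call is what makes the walk resumable.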
