I have a huge XML(>400MB) containing products. Using a DOM parser is therefore excluded,

Question

Asked: May 14, 2026

I have a huge XML file (>400 MB) containing products. Using a DOM parser is therefore out of the question, so I tried to parse and process it with a pull parser. Below is a snippet from the each_product(&block) method, where I iterate over the product list.

Basically, using a stack, I transform each <product> ... </product> node into a hash and process it.

while (reader.read)
  case reader.node_type
    #start element
    when Nokogiri::XML::Node::ELEMENT_NODE
      elem_name = reader.name.to_s
      stack.push([elem_name, {}])

    #text element
    when Nokogiri::XML::Node::TEXT_NODE, Nokogiri::XML::Node::CDATA_SECTION_NODE
      stack.last[1] = reader.value

    #end element (Reader::TYPE_END_ELEMENT has the same value, 15, as Node::ELEMENT_DECL)
    when Nokogiri::XML::Reader::TYPE_END_ELEMENT
      return if stack.empty?

      elem = stack.pop
      parent = stack.last
      if parent.nil?
        yield(elem[1])
        elem = nil
        next
      end

      key = elem[0]
      parent_childs = parent[1]
      # ...
      parent_childs[key] = elem[1]
  end
end

The issue is with self-closing tags (e.g. <country/>): I cannot tell a ‘normal’ tag from a ‘self-closing’ one. Both are reported as Nokogiri::XML::Node::ELEMENT_NODE, and I cannot find any other discriminator in the documentation. Since a self-closing tag is never followed by an end-element event, it is pushed onto the stack but never popped.

Any ideas on how to solve this issue?

1 Answer

Editorial Team · Answer 1 · 2026-05-14T22:07:31+00:00

There is a feature request on the project page regarding this issue (with a corresponding failing test).

Until it is fixed and released, we’ll stick with the good old workaround of expanding self-closing tags before parsing:

# capture the tag name separately so attributes are not copied into the closing tag
input_text.gsub!(/<([^\s<>\/]+)([^<>]*)\/>/, '<\1\2></\1>')
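
To see what this kind of preprocessing does, here is a minimal, self-contained sketch. The helper name expand_self_closing and the sample document are made up for illustration, and the pattern is one possible variant that keeps attributes in the opening tag only. It is not part of Nokogiri.

```ruby
# Hypothetical helper (not a Nokogiri API): rewrites <tag .../> as <tag ...></tag>
# so that a pull parser emits both a start and an end event for the element.
#   group 1: the tag name (no whitespace, no slash, no angle brackets)
#   group 2: everything between the name and the closing "/>", i.e. the attributes
SELF_CLOSING_TAG = /<([^\s<>\/]+)([^<>]*)\/>/

def expand_self_closing(xml)
  # Non-destructive gsub: the caller's string is left untouched.
  xml.gsub(SELF_CLOSING_TAG, '<\1\2></\1>')
end

sample = '<product><country/><price currency="EUR"/><name>Foo</name></product>'
puts expand_self_closing(sample)
```

This is a plain-text rewrite, so it knows nothing about XML structure: it will also rewrite anything that looks like a self-closing tag inside comments or CDATA sections, and it will miss tags whose attribute values contain < or >. It also needs the text as a Ruby string, so for a file of this size it would have to be applied chunk by chunk rather than to the whole document at once.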
