I’m having problems parsing the SEC Edgar files Here is an example of this

Question

Editorial Team

Asked: May 21, 2026 (2026-05-21T18:26:19+00:00)

I’m having problems parsing the SEC EDGAR files.

Here is an example of this file.

The end result I want is the content between <XML> and </XML> in a format I can access.

Here is my code so far that doesn’t work:

scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)

1 Answer

Editorial Team · Answer 1 · 2026-05-21T18:26:19+00:00

OK, there are a couple of things wrong:

sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off everything from the closing </XML> tag onward so that what remains is valid XML. You need to attack that problem first.
You don’t say what you want from the file. Without that information we can’t recommend a real solution. Take more time to define the question.

Here’s a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

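require 'nokogiri'
require 'open-uri'

# Kernel#open no longer opens URLs on Ruby 3.0+; open-uri provides URI.open.
raw = URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read

# Drop everything up to and including the <XML> line, then everything from </XML> on.
doc = Nokogiri::XML(raw.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, ''))
puts doc.at('//schemaVersion').text
# >> X0603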
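
The pattern in the question, /<XML>(.*)<\/XML>/, has no /m flag, so "." stops at line breaks and never spans the block. A minimal sketch of the extraction step on its own, with no network access, might look like the following. The SAMPLE text and the extract_xml_blocks helper are hypothetical, made up for illustration; a real submission has a much longer header and may contain more than one <XML> block, which is why the sketch uses a non-greedy match with scan.

```ruby
# Hypothetical sample standing in for an EDGAR .txt submission; the real
# file has a much longer SGML-style header and body.
SAMPLE = <<~TEXT
  <SEC-DOCUMENT>0001475481-09-000001.txt
  <DOCUMENT>
  <TYPE>D
  <TEXT>
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  </TEXT>
  </DOCUMENT>
  </SEC-DOCUMENT>
TEXT

# Return the contents of every <XML>...</XML> block in the text.
# /m lets "." match newlines; ".*?" is non-greedy so each block is
# captured separately; /i tolerates <xml> as well as <XML>.
def extract_xml_blocks(text)
  text.scan(/<XML>\s*(.*?)\s*<\/XML>/mi).flatten
end

blocks = extract_xml_blocks(SAMPLE)
puts blocks.length
puts blocks.first
```

Each returned string can then be handed to Nokogiri::XML in place of the gsub result above, assuming the block itself is well-formed XML.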
