I’m using Nokogiri code to extract text between HTML nodes, and getting these errors


Asked: May 26, 2026 (2026-05-26T07:35:43+00:00)


I’m using Nokogiri to extract text between HTML nodes. When I read in a list of files, I get the warnings below; I didn’t get them when I tested with simple embedded HTML. I’d like to eliminate or suppress the warnings but don’t know how. They appear at the end of each block:

extract.rb:18: warning: already initialized constant EXTRACT_RANGES
extract.rb:25: warning: already initialized constant DELIMITER_TAGS

Here is my code:

#!/usr/bin/env ruby -wKU
require 'rubygems'
require 'nokogiri'
require 'fileutils'

source = File.open('/documents.txt')
source.readlines.each do |line|
  line.strip!
  if File.exists? line
    file = File.open(line)

    doc = Nokogiri::HTML(File.read(line))

    # suggested by dan healy, stackoverflow
    # Specify the range between delimiter tags that you want to extract
    # triple dot is used to exclude the end point
    # 1...2 means 1 and not 2
    EXTRACT_RANGES = [
      1...2
    ]

    # Tags which count as delimiters, not to be extracted
    DELIMITER_TAGS = [
      "h1",
      "h2",
      "h3"
    ]

    extracted_text = []

    i = 0
    # Change /"html"/"body" to the correct path of the tag which contains this list
    (doc/"html"/"body").children.each do |el|

      if (DELIMITER_TAGS.include? el.name)
        i += 1
      else
        extract = false
        EXTRACT_RANGES.each do |cur_range|
          if (cur_range.include? i)
            extract = true
            break
          end
        end

        if extract
          s = el.inner_text.strip
          unless s.empty?
            extracted_text << el.inner_text.strip
          end
        end
      end
    end

    print("\n")
    puts line
    print(",\n")
    # Print out extracted text (each element's inner text is separated by newlines)
    puts extracted_text.join("\n\n")
  end
end

1 Answer

Editorial Team · Answer 1 · 2026-05-26T07:35:44+00:00

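Didn’t notice this earlier. Ruby prints that warning whenever a constant is assigned a second time. EXTRACT_RANGES and DELIMITER_TAGS are assigned inside the each block, so they are reassigned on every pass through the loop, and the warnings start from the second file onward. That is also why the embedded-HTML test, which had no loop, was quiet. Just move the constants out of the each block.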
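For anyone curious why the loop matters, here is a minimal sketch that reproduces the warning without Nokogiri. `capture_stderr`, `DEMO_INSIDE` and `DEMO_OUTSIDE` are made-up names for this illustration only; the helper temporarily redirects `$stderr`, which is where Ruby writes its warnings.

```ruby
require 'stringio'

# Hypothetical helper: run a block and return whatever it wrote to $stderr.
# $VERBOSE is set to false (not nil) so ordinary warnings are still emitted.
def capture_stderr
  original_stderr = $stderr
  original_verbose = $VERBOSE
  $stderr = StringIO.new
  $VERBOSE = false
  yield
  $stderr.string
ensure
  $stderr = original_stderr
  $VERBOSE = original_verbose
end

# Constant assigned inside a block: the second pass assigns it again.
inside = capture_stderr do
  2.times { DEMO_INSIDE = %w[h1 h2 h3] }
end

# Constant assigned once, then only read inside the block.
DEMO_OUTSIDE = %w[h1 h2 h3]
outside = capture_stderr do
  2.times { DEMO_OUTSIDE.include?("h1") }
end

puts inside
puts "warnings with the constant outside the loop: #{outside.empty? ? 'none' : outside}"
```

The first capture should contain an "already initialized constant DEMO_INSIDE" line, while the second stays empty.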

# Constants are assigned once, before the loop, so Ruby never sees a
# second assignment.
# Ranges between delimiter tags to extract. The triple dot excludes the
# end point: 1...2 means 1 and not 2.
EXTRACT_RANGES = [
  1...2
]

# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2",
  "h3"
]

# The requires and `source` are the same as in the question's script
source.readlines.each do |line|
  line.strip!
  # File.exists? is deprecated and was removed in Ruby 3.2; use File.exist?
  next unless File.exist? line

  # File.read opens and closes the file itself, so the separate
  # File.open(line) from the original is not needed
  doc = Nokogiri::HTML(File.read(line))

  extracted_text = []

  # Number of delimiter tags seen so far
  i = 0
  # Change /"html"/"body" to the correct path of the tag which contains this list
  (doc/"html"/"body").children.each do |el|
    if DELIMITER_TAGS.include? el.name
      i += 1
    elsif EXTRACT_RANGES.any? { |cur_range| cur_range.include? i }
      s = el.inner_text.strip
      extracted_text << s unless s.empty?
    end
  end

  print("\n")
  puts line
  print(",\n")
  # Print out extracted text (each element's inner text is separated by newlines)
  puts extracted_text.join("\n\n")
end
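If you want to check the section-counting logic without touching any files, it can be pulled into a method. This is only a sketch under a few assumptions: `extract_sections` and the `FakeNode` struct are hypothetical names that are not part of the original script or of Nokogiri. `FakeNode` simply stands in for anything that responds to `name` and `inner_text`, as Nokogiri nodes do.

```ruby
# Defined once at the top level; freezing is optional but guards against
# accidental mutation.
EXTRACT_RANGES = [1...2].freeze
DELIMITER_TAGS = %w[h1 h2 h3].freeze

# Hypothetical stand-in for a Nokogiri node (assumption for this sketch).
FakeNode = Struct.new(:name, :inner_text)

# Returns the stripped, non-empty text of every node that falls in one of
# the requested sections. A section number is the count of delimiter tags
# seen so far.
def extract_sections(nodes, ranges: EXTRACT_RANGES, delimiters: DELIMITER_TAGS)
  section = 0
  nodes.each_with_object([]) do |node, out|
    if delimiters.include?(node.name)
      section += 1
    elsif ranges.any? { |range| range.include?(section) }
      text = node.inner_text.strip
      out << text unless text.empty?
    end
  end
end

nodes = [
  FakeNode.new("p",    "intro, before any heading"),
  FakeNode.new("h1",   "First heading"),
  FakeNode.new("p",    "  wanted text  "),
  FakeNode.new("text", "   "),
  FakeNode.new("h2",   "Second heading"),
  FakeNode.new("p",    "after the second heading")
]

puts extract_sections(nodes).join("\n\n")
```

With real documents you would pass `(doc/"html"/"body").children` instead of the `nodes` array. Because the method only reads the constants, it can be called once per file without triggering the warning.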
