I have a directory containing over 100 html files. I need to extract only

Question

Asked: May 31, 2026 (2026-05-31T21:23:40+00:00)

I have a directory containing over 100 html files. I need to extract only the contents inside <TITLE></TITLE> and <BODY></BODY> tags and then format them as:

TITLE, “BODY CONTENT” (That is one line per document)

It would be helpful if the results from every file could be written to one large text file. I have found the following command, which formats a document as a single line:

grep '^[^<]' test.txt | tr -d '\n' > test.txt

No specific programming language is required, but one of the following would help if I need to modify the solution later: Perl, shell (.sh), or sed.


1 Answer

Editorial Team · Answer 1 · 2026-05-31T21:23:41+00:00

Here’s something in Ruby using Nokogiri.

require 'rubygems' # This line isn't needed on Ruby 1.9 or later
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text
  puts %Q(#{title}, "#{body}")
end

Save that to a .rb file, for example extractor.rb. Then make sure Nokogiri is installed by running gem install nokogiri.

Use this script like so:

ruby extractor.rb /path/to/yourhtmlfiles/*.html > out.txt

Note that I don’t handle newlines in this script, but you seem to have that figured out.

UPDATE:

This time it strips newlines and beginning/ending spaces.

require 'rubygems' # This line isn't needed on Ruby 1.9 or later
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text.gsub("\n", '').strip
  puts %Q(#{title}, "#{body}")
end
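
One thing to watch with the updated script: deleting newlines outright can fuse the last word of one line with the first word of the next, and a double quote inside the body would end the quoted field early. Below is a minimal sketch of one way to handle both, using only core Ruby. The helper name format_record is hypothetical (it is not part of Nokogiri or of the script above), and doubling embedded quotes CSV-style is an assumption about how the output will be consumed.

```ruby
# Hypothetical helper, not part of the original answer or of Nokogiri.
# Builds one output line of the form: TITLE, "BODY CONTENT"
def format_record(title, body)
  # Collapse every run of whitespace (newlines, tabs, repeated spaces)
  # into a single space, then trim both ends. to_s turns nil into "".
  clean = ->(text) { text.to_s.gsub(/\s+/, ' ').strip }

  # Assumption: double embedded quotes CSV-style so the body remains
  # a single quoted field.
  quoted_body = clean.call(body).gsub('"', '""')

  %Q(#{clean.call(title)}, "#{quoted_body}")
end
```

Inside the loop, the puts line could then become puts format_record(doc.title, doc.xpath('//body').inner_text). Because of to_s, a document without a title produces an empty title field rather than an error.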
