I have a directory containing over 100 HTML files. I need to extract only the contents of the <TITLE></TITLE> and <BODY></BODY> tags and then format them as:
TITLE, “BODY CONTENT” (That is one line per document)
It would be beneficial if the results from each file in the array could be written to one giant text file. I have found the following command to format a document as one line:
grep '^[^<]' test.txt | tr -d '\n' > test.txt
Although no specific programming language is preferred, the following would be helpful if I need to modify it further: perl, shell (.sh), sed
Here’s something in Ruby using Nokogiri.
Save that to a .rb file, for example extractor.rb. Then you need to make sure Nokogiri is installed by running gem install nokogiri.
Use this script like so:
Note that I don’t handle newlines in this script, but you seem to have that figured out.
UPDATE:
This time it strips newlines and beginning/ending spaces.
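As a rough illustration of just the cleanup step, here is a minimal sketch in plain Ruby with no gems. It is not the script from this answer: the helper names squish and format_record are hypothetical, and it assumes the title and body text have already been pulled out of the document (for instance with Nokogiri). It also assumes each newline should become a single space, so that words on adjacent lines do not run together.

```ruby
# Hypothetical helper: collapse every run of whitespace (newlines included)
# into a single space, then trim leading and trailing spaces.
def squish(text)
  text.gsub(/\s+/, ' ').strip
end

# Hypothetical helper: build one output line in the form TITLE, "BODY".
# Assumption: double quotes inside the body are left untouched.
def format_record(title, body)
  %(#{squish(title)}, "#{squish(body)}")
end
```

For example, format_record("  My Page\n", "\n  First line\n  second line  \n") returns My Page, "First line second line". Writing one such line per file to a single output file gives the one-line-per-document layout asked for in the question.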