I’m trying to create a master index file for a bunch of HTML files


Asked: May 19, 2026 (2026-05-19T22:42:39+00:00)


I’m trying to create a master index file for a bunch of HTML files sitting in a directory. There could be anywhere from 5 to 5000. These files aren’t clean or nice, so some of the libs I looked at don’t seem like they would play nice. Many of these files come from the temp directory or are carved out of the file slack (ergo incomplete files in many cases). Plus, sometimes people just write sloppy HTML.

I’ve basically decided to enumerate through the directory and use something like

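    // requires: using System.IO;
    string[] FileEntries = Directory.GetFiles(WhichDirectory);

    foreach (string FileName in FileEntries)
    {
        string HTMLContents;

        // ReadToEnd returns the whole file as one string
        using (StreamReader sr = new StreamReader(FileName))
        {
            HTMLContents = sr.ReadToEnd();
        }

        // parse HTMLContents here
    }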

I’m hoping that the StreamReader can dump the contents into a character array the same way it would a text file.

Anyways, given that this might not be the cleanest HTML in the world, there are a few things I’d like to parse out of the array.

Any instance of a date in ANY format (e.g. 1/1/11, January 1st, 2011, 1-1-11, Jan-1-2011, etc.), dumped into a string to be read back later. Hopefully there is a lib or something for finding "instances" of dates.
Read a text file line by line with various "keywords" to look for in the mess of HTML. Things like "Bob Evans" or "Sausage Factory Ltd" etc. I then want to count the number of times each "keyword" shows up. The problem is I don’t want the user to have to know regular expressions.
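For the keyword half of this, one possible approach is to treat every line of the keyword file as a literal phrase and count it with a case-insensitive IndexOf loop, so the user never has to write a regex. This is only a sketch: the class and method names (KeywordCounter, CountOccurrences, CountAll) are made up for illustration, and a phrase that is split across a tag or a line break in the HTML will not be counted.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordCounter
{
    // Counts non-overlapping, case-insensitive occurrences of a literal phrase.
    public static int CountOccurrences(string text, string phrase)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(phrase)) return 0;

        int count = 0;
        int index = 0;
        while ((index = text.IndexOf(phrase, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += phrase.Length; // skip past this hit so matches do not overlap
        }
        return count;
    }

    // Takes one keyword per line (blank lines are ignored) and counts each in the text.
    public static Dictionary<string, int> CountAll(string text, IEnumerable<string> keywordLines)
    {
        var hits = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in keywordLines)
        {
            string keyword = line.Trim();
            if (keyword.Length == 0) continue;
            hits[keyword] = CountOccurrences(text, keyword);
        }
        return hits;
    }
}
```

Inside the file loop this could be called as KeywordCounter.CountAll(HTMLContents, File.ReadLines("keywords.txt")), where keywords.txt is a hypothetical file name.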

So, the desired output would be something like this:

BobEvans9304902.html
Title: Bob Evans Secret Sausage Recipe

Dates Found: "October 2nd, 2009" , "7/22/09"

"Bob Evans Sausage" : 30 hits

"Paprika" : 2 hits

"Don’t overwork it" : 5 hits
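Assuming the title, the dates and the keyword counts have already been collected for a file, an entry in the layout above could be assembled roughly like this. IndexReport and FormatEntry are hypothetical names, and the exact spacing is just my reading of the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class IndexReport
{
    // Builds one index entry: file name, title, quoted dates, then one line per keyword.
    public static string FormatEntry(string fileName, string title,
        IEnumerable<string> dates, IDictionary<string, int> keywordHits)
    {
        var sb = new StringBuilder();
        sb.AppendLine(fileName);
        sb.AppendLine("Title: " + title);
        sb.AppendLine();
        sb.AppendLine("Dates Found: " + string.Join(" , ", dates.Select(d => "\"" + d + "\"")));

        foreach (KeyValuePair<string, int> pair in keywordHits)
        {
            sb.AppendLine();
            sb.AppendLine("\"" + pair.Key + "\" : " + pair.Value + " hits");
        }
        return sb.ToString();
    }
}
```

Appending each entry to the master index with File.AppendAllText would keep memory use flat even with 5000 files.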

All the solutions I have seen so far seem like they only work for single characters or words (LINQ) or split a "neat" sentence into words. I’m hoping I won’t have to create a new copy of the string and strip out all the HTML tags, since it’s not always going to be neat and I don’t want to add another step to mass file processing. If that’s the only way to do it, though, so be it.
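If stripping the tags does turn out to be necessary, a crude regex pass that copes with truncated files might look like the following. It is a sketch, not a real parser: TagStripper and ToPlainText are invented names, and a literal "<" in the text (as in "a < b") will be mistaken for the start of a tag.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Drops script/style blocks, including ones cut off before their closing tag.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b.*?(</\1\s*>|$)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches a complete tag, or an unterminated tag at the very end of the input.
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string text = ScriptOrStyle.Replace(html, " ");
        text = Tag.Replace(text, " ");
        text = WebUtility.HtmlDecode(text);          // turns &amp; into & and so on
        return Whitespace.Replace(text, " ").Trim(); // collapse runs of whitespace
    }
}
```

This does make one extra copy of each file's text, but collapsing whitespace means a phrase broken across lines in the markup still matches as one phrase afterwards.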


1 Answer

Editorial Team · Answer 1 · 2026-05-19T22:42:40+00:00


You probably want to look at an HTML parser that tolerates malformed markup, such as the HTML Agility Pack. It builds a document tree that you can query with XPath, so you can focus on the content when searching for and counting keywords. I expect you’ll still need regex to handle the dates, though.
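As a rough idea of what the date regex could look like, here is a sketch that covers the formats mentioned in the question (1/1/11, 1-1-11, January 1st, 2011, Jan-1-2011). DateFinder and FindDates are hypothetical names, the pattern assumes English month names with the month first, and it will also pick up things that merely look like dates, such as 10-20-30 in a part number.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateFinder
{
    // English month names, full or abbreviated.
    private const string Month =
        @"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|" +
        @"Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)";

    private static readonly Regex DatePattern = new Regex(
        // numeric dates: 1/1/11, 1-1-11, 7/22/2009
        @"\b\d{1,2}[/-]\d{1,2}[/-]\d{2}(?:\d{2})?\b" +
        "|" +
        // named-month dates: January 1st, 2011 or Jan-1-2011
        @"\b" + Month + @"\.?[ -]\d{1,2}(?:st|nd|rd|th)?,?[ -]\d{2}(?:\d{2})?\b",
        RegexOptions.IgnoreCase);

    // Returns each distinct date string in the order it first appears.
    public static List<string> FindDates(string text)
    {
        var found = new List<string>();
        if (string.IsNullOrEmpty(text)) return found;

        foreach (Match m in DatePattern.Matches(text))
        {
            if (!found.Contains(m.Value)) found.Add(m.Value);
        }
        return found;
    }
}
```

The matches are kept as raw strings, which fits the "Dates Found" line in the question; passing each one through DateTime.TryParse afterwards would be one way to weed out false positives.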
