First, the IDE I am using is Visual C# with the .NET Framework.
I have about 20,000 HTML documents containing information that I need to extract and sort into date order.
The date in each file is stored within this HTML tag:
<td valign="top" class="createdate">
Tuesday, 03 April 2012 20:39
</td>
Note: all of the dates are in that format within each HTML file.
I want to extract the date, then automatically read through each HTML document and count the occurrences of a word or phrase.
I am not asking anyone to write the entire program for me, but I would be very grateful for as much detail as possible on how to go through these 20,000 HTML files, extract the date and the number of occurrences of a word or phrase, and then export that information to Word or Excel.
Oh, and I am using the data for research for my dissertation. I know how to do string manipulation and am familiar with the string methods, such as finding the occurrences of a word.
The problem I am having is how to get at the HTML data (or maybe just the content) and then sort it into a usable format. Thank you.
Are you sure that all of the HTML documents have that exact format? If so, the string containing the date can be extracted with simple string operations or with a regular expression. (Side note: in general, regular expressions are not suited to parsing HTML, but for this use case keeping it simple sounds like the way to go.) If you need to do heavier parsing, consider HtmlAgilityPack.
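As a rough illustration of the regular-expression route, here is a minimal sketch that pulls the text out of the createdate cell. It assumes every file contains exactly the td shown in the question; the names CreateDateExtractor and ExtractRawDate are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CreateDateExtractor
{
    // Matches the contents of <td ... class="createdate"> ... </td>.
    // Assumption: the class attribute is written exactly as in the question.
    static readonly Regex CreateDateCell = new Regex(
        @"<td[^>]*class=""createdate""[^>]*>\s*(?<date>.*?)\s*</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text inside the cell without surrounding whitespace,
    // or null when the cell is not found.
    public static string ExtractRawDate(string html)
    {
        Match m = CreateDateCell.Match(html);
        return m.Success ? m.Groups["date"].Value : null;
    }
}
```

To run it over all the documents, you could loop over Directory.EnumerateFiles(folder, "*.html") and pass File.ReadAllText(path) to ExtractRawDate. If the markup turns out to vary between files, this pattern will miss some of them, and that is the point where a real HTML parser becomes worth it.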
Then use DateTime.TryParse to convert the date string into a DateTime object.
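Putting the parsing and sorting together might look like the sketch below. Note that it uses DateTime.TryParseExact, a stricter sibling of DateTime.TryParse, with a format string that I am assuming matches "Tuesday, 03 April 2012 20:39". The class and method names are hypothetical, and the occurrence counts are expected to come from the string methods you already know.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class DateSortSketch
{
    // Parses text such as "Tuesday, 03 April 2012 20:39".
    // Assumption: English day and month names, 24-hour clock.
    public static bool TryParseCreateDate(string raw, out DateTime date)
    {
        if (raw == null)
        {
            date = default(DateTime);
            return false;
        }
        return DateTime.TryParseExact(
            raw.Trim(),
            "dddd, dd MMMM yyyy HH:mm",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out date);
    }

    // Turns (raw date text, occurrence count) pairs into CSV lines in date order.
    // Entries whose date cannot be parsed are skipped.
    public static List<string> ToCsvLines(IEnumerable<KeyValuePair<string, int>> rows)
    {
        var parsed = new List<KeyValuePair<DateTime, int>>();
        foreach (var row in rows)
        {
            DateTime date;
            if (TryParseCreateDate(row.Key, out date))
                parsed.Add(new KeyValuePair<DateTime, int>(date, row.Value));
        }

        var lines = new List<string> { "Date,Occurrences" };
        lines.AddRange(parsed
            .OrderBy(p => p.Key)
            .Select(p => p.Key.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture) + "," + p.Value));
        return lines;
    }
}
```

Writing the result with File.WriteAllLines("report.csv", lines) gives you a file that Excel opens directly, which is probably the simplest export path. You may also want to log the files that get skipped so you can check whether their dates really differ from the expected format.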