I have about 2000 documents from which I’m trying to pull metadata. Right now, the metadata is hardcoded as content at the top of the document.
Some givens:
Each page is generated with a <script>...</script> block in the head, and I don’t need anything from the first instance of <p style=... onward, so I can use those tags as “start” and “end” markers.
I don’t need the tags, just the text, and I’d prefer delimited text output with 9 columns, one per metadata field (e.g., Desc, RefNum, Replaces, SpecCond, States, How, When, Owner, ChgDate), and one line per HTML document.
I’m also trying to automate this as much as possible, so I’d like a tool that will crawl a path and its subdirectories, find every *.html file, and scrape its content.
I’m not really sure where to start. Thoughts?
</script>
<!-- -->
<!-- BEGIN CAPTURE HERE -->
<!-- -->
<h1>Additional Deposit Warning</h1>
<p class="Plain_Text"><font style="font-family:'Arial';">Description: Additional Deposit</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">Reference Number: 897</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">Replaces Letter: CIBS 417</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">Special Conditions: NA</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">States Applicable: WI, MI</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">How Generated: User Selects In CSS</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">When Generated: Additional deposit may be needed</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">Owner: Credit - Deposits</font></p>
<p class="Plain_Text"><font style="font-family:'Arial';">Last change letter: March 27, 2003</font></p>
<!-- -->
<!-- END CAPTURE HERE -->
<!-- -->
<p style="margin-top:0;margin-bottom:0"> </p>
<p><font style="font-family:'Times New Roman'; font-size:12pt;">#Mdate</font></p>
<p><font style="font-family:'Times New Roman'; font-size:12pt;"><br />
I ended up using JavaScript. It took a few rewrites to account for anomalous data, but all in all it worked well.