So I found out it's possible to use a buffered reader/writer to copy an XML file word for word to a new XML file. However, I was wondering whether it would be possible to scrape out only a portion of the document.
For example, take this sample:
<?xml version="1.0" encoding="UTF-8"?>
<BookCatalogue xmlns="http://www.publishing.org">
<w:pStyle w:val="TOAHeading" />
<Book>
<Title>Yogasana Vijnana: the Science of Yoga</Title>
<Author>Dhirendra Brahmachari</Author>
<Date>1966</Date>
<ISBN>81-40-34319-4</ISBN>
<Publisher>Dhirendra Yoga Publications</Publisher>
<Cost currency="INR">11.50</Cost>
</Book>
<Book>
<Title>The First and Last Freedom</Title>
<v:imagedata r:id="rId7" o:title="" croptop="10523f" cropbottom="11721f" />
<Author>J. Krishnamurti</Author>
<Date>1954</Date>
<ISBN>0-06-064831-7</ISBN>
<Publisher>Harper &amp; Row</Publisher>
<Cost currency="USD">2.95</Cost>
</Book>
<w:pStyle w:val="TOAHeading2" />
</BookCatalogue>
Sorry if this is not proper XML; I just added tidbits from the document I was looking at to a sample I found. Basically, I want to look for an instance of “heading” (in this case, the 3rd line -> TOAHeading), then scrape everything from that heading down until another instance of “heading” is found, and copy it to another XML file. Is that possible? Furthermore, I'd like to write that to a temporary file and only keep the file if an instance of “image” (in this case, the 14th line) is found. Is that possible as well? I’m trying to do this in the simplest way possible, so does anyone have any ideas or experience with this? Thanks in advance.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class IPDriver
{
    public static void main(String[] args) throws IOException
    {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("C:/Documents and Settings/user/workspace/Intern Project/Proposals/Converted Proposals/Extracted Items/ProposalOne/word/document.xml"), "UTF-8"));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:/Documents and Settings/user/workspace/Intern Project/Proposals/Converted Proposals/Extracted Items/ProposalOne/word/tempdocument.xml"), "UTF-8"));
        String line = null;
        while ((line = reader.readLine()) != null)
        {
            writer.write(line);
            // readLine() strips the line terminator, so add it back.
            writer.newLine();
        }
        // Close to unlock.
        reader.close();
        // Close to unlock and flush to disk.
        writer.close();
    }
}
Example From My Actual XML Document
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="address">
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="Street">
<w:r w:rsidRPr="00822244">
<w:t>6841 Benjamin Franklin Drive</w:t>
</w:r>
</w:smartTag>
</w:smartTag>
</w:p>
<w:p w:rsidR="00B41602" w:rsidRPr="00822244" w:rsidRDefault="00B41602" w:rsidP="007C3A42">
<w:pPr>
<w:pStyle w:val="Address" />
</w:pPr>
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="City">
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
Just your basic document.xml file from a .docx
I’ve seen a lot of technically correct suggestions, but your request (taken as written) suggests to me that you have the following requirement:
If I understood you correctly, you basically want to do a totally unstructured parse of a very structured piece of data (XML markup). In that case, using an XML parser, XSLT, a DOM parser, or anything else written against the XML spec is going to be a pain in the ass to bend to your needs.
You’ll need to do a case-insensitive scan of your document contents until you get your match, then pull all the characters between that match and an ending match.
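To make that scan concrete, here is a minimal sketch of the idea, not a definitive implementation. The class name `HeadingScraper` and the helpers `indexOfIgnoreCase` and `extractBetween` are hypothetical names made up for this example; the only library call relied on is `String.regionMatches` with its ignore-case flag.

```java
// Sketch only: HeadingScraper and its method names are hypothetical.
final class HeadingScraper {

    // Case-insensitive indexOf: index of the first match of needle in
    // haystack at or after 'from', or -1 if there is none.
    static int indexOfIgnoreCase(String haystack, String needle, int from) {
        int last = haystack.length() - needle.length();
        for (int i = Math.max(from, 0); i <= last; i++) {
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text from the first occurrence of marker up to (but not
    // including) the next occurrence, or to the end of the text if the
    // marker never appears again. Returns null if there is no match at all.
    static String extractBetween(String text, String marker) {
        int start = indexOfIgnoreCase(text, marker, 0);
        if (start < 0) {
            return null;
        }
        int end = indexOfIgnoreCase(text, marker, start + marker.length());
        return text.substring(start, end < 0 ? text.length() : end);
    }

    public static void main(String[] args) {
        String xml = "<a/><w:pStyle w:val=\"TOAHeading\" /><Book>x</Book>"
                   + "<w:pStyle w:val=\"TOAHeading2\" /></a>";
        System.out.println(extractBetween(xml, "heading"));
    }
}
```

Be aware that a raw character scan like this starts and stops in the middle of a tag, so the extracted slice is not well-formed XML on its own. If you need whole elements, you would have to widen the slice to the nearest `<` and `>` yourself.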
If the documents aren’t huge (say 1 MB or smaller), just read the whole thing into memory as a String and either make quick-and-dirty use of “indexOf” for the different cased versions of what you want, OR read the whole thing into a char[] and write some more efficient scanning code that does a case-insensitive match for the starting value you want to begin parsing at.
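For the "only keep the temp file if an image is found" part of the question, a rough sketch of the in-memory approach might look like the following. `TempKeeper`, `readAll`, and `saveIfContains` are hypothetical names for illustration, and the `"scrape"` temp-file prefix is an arbitrary choice; the file calls are the standard `java.nio.file.Files` methods.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

// Sketch only: TempKeeper and its method names are hypothetical.
final class TempKeeper {

    // Reads the whole file into memory as UTF-8 text (fine for small files).
    static String readAll(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    // Writes content to a temp file, then keeps it (moved to target) only
    // if content contains keyword, ignoring case. Otherwise the temp file
    // is deleted. Returns true if the file was kept.
    static boolean saveIfContains(String content, String keyword, Path target)
            throws IOException {
        Path temp = Files.createTempFile("scrape", ".xml");
        Files.write(temp, content.getBytes(StandardCharsets.UTF_8));
        boolean keep = content.toLowerCase(Locale.ROOT)
                              .contains(keyword.toLowerCase(Locale.ROOT));
        if (keep) {
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } else {
            Files.delete(temp);
        }
        return keep;
    }
}
```

You would call it with something like `TempKeeper.saveIfContains(extractedText, "image", targetPath)`. Since the text is already in memory, you could just as well check for the keyword first and skip writing the temp file entirely; the temp file is kept here only to mirror the workflow described in the question.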
If I misunderstood your requirement and it is actually much more structured than it sounded in your description above, then please use one of the other suggestions that is more focused on true XML parsing. I am just putting this solution out there on the off chance that it was as random as you made it out to seem.
(NOTE: I’m not saying it’s BAD, I've just never seen that request before. You have your own reasons for needing to do that and we’ll just try and help 😉)