So, I’ve URL-encoded a Word doc and am trying to parse certain fields out of it, which is a pain. There are some “unexpected” results, but I’ve got everything running great except for this one-off case.
Here is an example of the output from Word for 99.8% of the results:
%13+FORMTEXT+%01%14wes%15
Normally, the regex I set up grabs all the fields exactly as I need, as it does for the example above. But the example below is a weird one: I’m trying to parse out “wes” from it.
%13+FORMTEXT+%01%15%86%15%9A%9C%9E%A0%F2%F4%0A%1A%1C%1E+468%3A%3C%3E%40TVXZ%5C%15%60bvxz%FC%F0%E0%14%D4%C1%06%14wes%15
Mind you, this is one big string, so it would continue on in this fashion:
%13+FORMTEXT+%01%15%86%15%9A%9C%9E%A0%F2%F4%0A%1A%1C%1E+468%3A%3C%3E%40TVXZ%5C%15%60bvxz%FC%F0%E0%14%D4%C1%06%14wes%15%13+FORMTEXT+%01%14wess%15
Notice the huge gap between %01 and %14, and then the text between %14 and %15. Usually %01 and %14 sit side by side; in this case there is nonsense between them, and lots of it (this is shortened for the example). The nonsense even contains stray %14 and %15 sequences of its own.
Cheers,
Wes
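One way to handle both shapes with a single regex, offered as a rough sketch rather than a definitive fix: assume %13, %14 and %15 are Word's field begin, separator and end marks, and that the value you want is whatever sits between the last %14 of a field and the %15 that follows it. The function name extract_formtext and the latin-1 decoding are my own assumptions, not anything from the original setup.

```python
import re
from urllib.parse import unquote_plus

# Assumption: %13 = field begin, %14 = field separator, %15 = field end.
# After "%01", greedily skip anything that does not start a new field (%13),
# then backtrack to the last %14 that is followed by a value and a closing %15.
FORMTEXT_RE = re.compile(
    r"%13\+FORMTEXT\+%01"        # field header as it appears in the encoded string
    r"(?:(?!%13).)*"             # junk between %01 and the separator, if any
    r"%14((?:(?!%1[345]).)*)"    # captured value: no field marks inside it
    r"%15"                       # field end
)

def extract_formtext(encoded):
    """Return the decoded value of every FORMTEXT field in the encoded string."""
    # latin-1 is an assumption about how the original bytes should be read.
    return [unquote_plus(value, encoding="latin-1")
            for value in FORMTEXT_RE.findall(encoded)]

sample = (
    "%13+FORMTEXT+%01%15%86%15%9A%9C%9E%A0%F2%F4%0A%1A%1C%1E+468%3A%3C%3E%40"
    "TVXZ%5C%15%60bvxz%FC%F0%E0%14%D4%C1%06%14wes%15"
    "%13+FORMTEXT+%01%14wess%15"
)
print(extract_formtext(sample))
```

Because the junk is skipped greedily and only up to the next %13, the stray %14 and %15 inside it should not end the match early. This has only been reasoned through against the two strings shown above, so other field layouts may need a different pattern.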
In the end I went a different route: I converted the doc to docx (OOXML) and ran the regex on the XML instead.
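For anyone taking the docx route, here is a minimal sketch of the same extraction. It uses Python's XML parser rather than regex, so it is not the approach described above, just an alternative. It assumes the form fields are stored as complex fields in word/document.xml (w:fldChar begin/separate/end runs around a w:instrText containing FORMTEXT) and that fields are not nested. The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def formtext_values(document_xml):
    """Return the result text of each FORMTEXT field, in document order."""
    values = []
    state = None              # None, "instr" (begin..separate) or "result"
    instr, result = [], []
    for el in ET.fromstring(document_xml).iter():
        if el.tag == W + "fldChar":
            kind = el.get(W + "fldCharType")
            if kind == "begin":
                state, instr, result = "instr", [], []
            elif kind == "separate" and state == "instr":
                state = "result"
            elif kind == "end":
                # Keep only fields whose instruction text mentions FORMTEXT.
                if state == "result" and "FORMTEXT" in "".join(instr):
                    values.append("".join(result))
                state = None
        elif el.tag == W + "instrText" and state == "instr":
            instr.append(el.text or "")
        elif el.tag == W + "t" and state == "result":
            result.append(el.text or "")
    return values

def formtext_values_from_docx(path):
    """Read word/document.xml out of a .docx file and extract the field values."""
    with zipfile.ZipFile(path) as docx:
        return formtext_values(docx.read("word/document.xml"))
```

Usage would be something like formtext_values_from_docx("form.docx"), where the file name is just a placeholder. Nested fields and fields with no separator are not handled here.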