In .NET is there a way to enumerate all the values for \w?
As for why: I am parsing words from unknown files, and I will come across some files that use embedding that is nothing but non-standard. See the sample below.
“PK!RýëÙ*[Content_Types].xml ¢( Ì?ÍNã0?÷Hó?·£Æ530̨)?Y!@?ycß6VÛò5о=7)T*””””áM«üø?ïºÕ?Ïä|ÙØâ” “ï*&Ê1+À)¯?Wìÿý¿Ñ+0I§¥õ*¶dçÓoG?ûU,hµÃ?Õ)???£ª¡?Xú??Ì|ld¢Ë8çAª???O¹ò.K£Ôj°éä/Ìä£MÅå?n¯I?cÅÅú½Öªb2k?LÊ??~g2ò³?Q ½zlHºÄAj¬RcË 9Æ;H?CÆwzF°ØÏôuª?Vv`X??ßiôÚ’Oõºî?~?h4·2¦kÙÐì|iù³?ïå~?¾[ÓmQÙHãÞ¸÷øw/#ï¾ÄÀ í|pO?ãL8~dÂñ3??L8N3áø? ÇY&¿3áã\@rIT?K¤?\2Uäª?T¹ÄªÈ%WÅW+Щ9:i¯?[
I think this was an output-to-printer file.
I need to somehow eliminate what I am calling trash words. It does not need to be perfect. The plan is to flag documents that contain trash words left out of the index, so the user has an easy means for manual review.
What I may end up doing is counting against a list of safe chars (a, b, c, …). For example, a word must have at least one safe char, or more than 1/2 safe chars, to be kept. I want to keep Café. Trash words tend to be all trash. This is a trash word, ª’_LLýú, that happens to have some safe chars.
At this point I am evaluating the battlefield.
The nature of the business is that we may intentionally get sent trash files.
In case anyone cares, I went with:
rSafeChar = new Regex(@"[-_'@A-Za-z0-9]");
Toying with safeCharCount > unsafeCharCount or safeCharCount >= unsafeCharCount
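A minimal sketch of how that comparison could look, assuming a hypothetical helper `KeepWord` (not part of the original code) and an `allowTie` flag to switch between `>` and `>=`. The words are written with Unicode escapes so the result does not depend on the source file's encoding.

```csharp
using System;
using System.Text.RegularExpressions;

// Same character class as above.
Regex rSafeChar = new Regex(@"[-_'@A-Za-z0-9]");

// Hypothetical helper: keep a word when safe chars outnumber the unsafe
// ones (or at least equal them, when allowTie is true).
bool KeepWord(string word, bool allowTie)
{
    int safeCharCount = rSafeChar.Matches(word).Count;
    int unsafeCharCount = word.Length - safeCharCount;
    return allowTie ? safeCharCount >= unsafeCharCount
                    : safeCharCount > unsafeCharCount;
}

// "Café": 3 safe chars, 1 unsafe.
Console.WriteLine(KeepWord("Caf\u00E9", false));                    // True
// "ª’_LLýú": 3 safe chars (_, L, L), 4 unsafe.
Console.WriteLine(KeepWord("\u00AA\u2019_LL\u00FD\u00FA", false));  // False
```

The two variants only differ on exact ties, such as a two-character word with one safe and one unsafe char.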
To check what can be matched by \w, one could use a string containing the whole ASCII table with the following regex:
The resulting groups should contain the list of characters matched and not matched by \w.
Here is an example :
Shows :
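A minimal sketch of that approach, assuming a pattern of the form `(\w)|(\W)`. The pattern and the variable names are my assumptions, not necessarily the regex referred to above. Group 1 collects the characters matched by \w, group 2 the ones that are not.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Build a string holding every ASCII character (code points 0 to 127).
string ascii = new string(Enumerable.Range(0, 128).Select(i => (char)i).ToArray());

// Assumed pattern: group 1 captures a \w character, group 2 anything else.
Regex rWord = new Regex(@"(\w)|(\W)");

var matched = new StringBuilder();
var notMatched = new StringBuilder();
foreach (Match m in rWord.Matches(ascii))
{
    if (m.Groups[1].Success) matched.Append(m.Value);
    else notMatched.Append(m.Value);
}

// Prints the ASCII characters matched by \w:
// 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
Console.WriteLine(matched.ToString());
```

Note that the ASCII table covers only part of the answer: in .NET, \w is Unicode-aware by default and also matches letters such as é or ª. Passing `RegexOptions.ECMAScript` restricts it to `[a-zA-Z_0-9]`.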