In .NET is there a way to enumerate all the values for \w?

Asked: June 10, 2026 (2026-06-10T16:18:24+00:00)

In .NET is there a way to enumerate all the values for \w?

As for why: I am parsing words from unknown files, and I will come across some files that use an embedding that is anything but standard. See the sample below.

“PK!RýëÙ*[Content_Types].xml ¢( Ì?ÍNã0?÷Hó?·£Æ530Ì¨)?Y!@?ycß6VÛò5Ð¾=7)T*””””áM«üø?ïºÕ?Ïä|ÙØâ” “ï*&Ê1+À)¯?Wìÿý¿Ñ+0I§¥õ*¶dçÓoG?ûU,hµÃ?Õ)???£ª¡?Xú??Ì|ld¢Ë8çAª???O¹ò.K£Ôj°éä/Ìä£MÅå?n¯I?cÅÅú½Öªb2k?LÊ??~g2ò³?Q ½zlHºÄAj¬RcË 9Æ;H?CÆwzF°ØÏôuª?Vv`X??ßiôÚ’Oõºî?~?h4·2¦kÙÐì|iù³?ïå~?¾[ÓmQÙHãÞ¸÷øw/#ï¾ÄÀ í|pO?ãL8~dÂñ3??L8N3áø? ÇY&¿3áã\@rIT?K¤?\2Uäª?T¹ÄªÈ%WÅW+Ð©9:i¯?[

I think this was an output-to-printer file.

I need to somehow eliminate what I am calling trash words. It does not need to be perfect. The plan is to flag documents that contain trash words, and to leave those words out of the index, so the user has an easy means for manual review.

What I may end up doing is counting characters against a list of safe chars (a, b, c, …). For example, a word must have at least one safe char, or more than 1/2 of its chars must be safe, to be kept. I want to keep Café, for instance. Trash words tend to be all trash, but this trash word, ª’_LLýú, happens to contain some safe chars.

At this point I am evaluating the battlefield.

The nature of the business is that we may intentionally be sent trash files.

In case anyone cares, I went with:

rSafeChar = new Regex(@"[-_'@A-Za-z0-9]");

I am toying with either safeCharCount > unsafeCharCount or safeCharCount >= unsafeCharCount as the rule for keeping a word.
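
As a rough illustration of the counting rule described above, here is a minimal sketch. It assumes the stricter safeCharCount > unsafeCharCount variant. The class name TrashWordFilter and the method IsTrashWord are hypothetical names made up for this example, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrashWordFilter
{
    // Same safe-character class as the regex above.
    private static readonly Regex rSafeChar = new Regex(@"[-_'@A-Za-z0-9]");

    // Hypothetical helper: a word is kept only when safe characters
    // strictly outnumber unsafe ones (the ">" variant of the rule).
    public static bool IsTrashWord(string word)
    {
        if (string.IsNullOrEmpty(word))
        {
            return true; // assumption: nothing worth indexing in an empty word
        }

        int safeCharCount = rSafeChar.Matches(word).Count;
        int unsafeCharCount = word.Length - safeCharCount; // counts UTF-16 code units

        return safeCharCount <= unsafeCharCount;
    }
}
```

Under this rule Café is kept (3 safe chars, 1 unsafe) and ª’_LLýú is flagged (3 safe chars, 4 unsafe). Switching to the >= variant only changes the outcome for words that are exactly half safe.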

1 Answer

Editorial Team · Answer 1 · 2026-06-10T16:18:25+00:00

To check what \w can match, one could build a string containing the whole ASCII table and run the following regex against it:

(?:(?<wmatch>\w)*(?<wnotmatch>[^\w]*))*

The resulting groups should contain the list of characters matched and not matched by \w.

Here is an example:

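// Requires: using System; using System.Text.RegularExpressions; using System.Windows.Forms;
private void TestMatch()
{
  // Sample input; substitute a string holding the full ASCII table to test every character.
  string ascii = "abcdef0934+_!1@_$14-195djsjfke1058446541";
  // Each word character is captured on its own in "wmatch";
  // each run of non-word characters is captured in "wnotmatch".
  Regex r = new Regex(@"(?:(?<wmatch>\w)*(?<wnotmatch>[^\w]*))*");
  Match m = r.Match(ascii);
  if (m.Success)
  {
    string msg = "\\w matches :";
    // Captures holds every capture a group made, not just the last one.
    foreach (Capture cap in m.Groups["wmatch"].Captures)
    {
      msg += cap.Value + ", ";
    }
    msg += Environment.NewLine + "\\w does not match: ";
    foreach (Capture cap in m.Groups["wnotmatch"].Captures)
    {
      msg += cap.Value + ", ";
    }
    MessageBox.Show(msg);
  }
}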

Shows:

\w matches :a, b, c, d, e, f, 0, 9, 3, 4, _, 1, _, 1, 4, 1, 9, 5, d, j, s, j, f, k, e, 1, 0, 5, 8, 4, 4, 6, 5, 4, 1,  
\w does not match: +, !, @, $, -, "
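
The example above only probes a sample ASCII string. By default, \w in .NET also matches Unicode letters, decimal digits, connector punctuation and combining marks, so a fuller enumeration has to go beyond ASCII. One brute-force way to list the matches, sketched here under the assumption that testing each UTF-16 code unit on its own is good enough, is to loop over every char value and ask the regex. The class name WordCharEnumerator and the method Enumerate are hypothetical names for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordCharEnumerator
{
    // Returns every UTF-16 code unit (U+0000..U+FFFF) that \w matches
    // when tested as a one-character string.
    public static List<char> Enumerate(RegexOptions options = RegexOptions.None)
    {
        var wordChar = new Regex(@"\w", options);
        var result = new List<char>();

        // Loop on an int so the counter can pass char.MaxValue and terminate.
        for (int code = char.MinValue; code <= char.MaxValue; code++)
        {
            char c = (char)code;
            if (wordChar.IsMatch(c.ToString()))
            {
                result.Add(c);
            }
        }

        return result;
    }
}
```

Calling WordCharEnumerator.Enumerate(RegexOptions.ECMAScript) restricts \w to [a-zA-Z_0-9], which gives 63 characters. The default call returns a much larger set that includes accented letters such as é. Characters outside the Basic Multilingual Plane are stored as surrogate pairs and are not covered by this sketch.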
