Question

Asked: May 16, 2026

In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:

get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing

In brackets there are preprocessing steps. The application runs in about 10.265s, but preprocessing takes 9.18s! This is time for preprocessing 50 HTML pages (excluding downloading).

I use the HtmlAgilityPack library to convert HTML into plain text. This is quite fast: it takes 2.5ms to convert 1 document, so it’s relatively OK.

The problem comes later. Stemming one document takes up to 120ms. Unfortunately, those HTML pages are in Polish, and no stemmer for Polish written in C# exists. I know of only 2 free ones, both written in Java: stempel and morfologic. I precompiled stempel.jar to stempel.dll with the help of IKVM, so there is nothing more to do there.

Eliminating stop words also takes a lot of time (~70ms for 1 doc). It is done like this:


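// Replace numbers, 1- and 2-letter words and non-word characters with spaces.
result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
// Then run one more regex pass per stop word (about 270 passes per document).
while (stopwords.MoveNext())
{
   string stopword = stopwords.Current.ToString();
   // Regex.Escape keeps the stop word from being interpreted as a pattern.
   result = Regex.Replace(result, "(\\b" + Regex.Escape(stopword) + "\\b)", " ");
}
return result;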

First I remove all numbers, special characters, and 1- and 2-letter words. Then, in a loop, I remove the stop words. There are about 270 stop words.

Is it possible to make it faster?

Edit:

What I want to do is remove everything that is not a word longer than 2 letters. So I want to get rid of all special chars (including ‘.’, ‘,’, ‘?’, ‘!’, etc.), numbers, and stop words. I need only pure words that I can use for Data Mining.

1 Answer

Editorial Team · Answer 1 · 2026-05-16T06:13:31+00:00

OK, I know that SO is not a pure forum and maybe I shouldn’t answer my own question, but I’d like to share my results.

Finally, thanks to you guys, I managed to optimize my text preprocessing much better. First of all, I simplified the long expression from my question (following Josh Kelley’s answer):

[0-9]|[^\w]|(\b\w{1,2}\b)

It does the same as the first one but is much simpler. Then, following Josh Kelley’s suggestion again, I compiled this regex into an assembly. I found a great example of compiling expressions into an assembly here. I did that because this regex is used many, many times; after reading a few articles about compiled regexes, that was my decision. I also removed the last expression after eliminating stop words (it made no real sense there).

So the execution time on a 12KiB text file was ~15ms. This is only for the expression mentioned above.
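
For anyone who wants to try the simplified expression, here is a minimal sketch. Note that it is not what I described above: it uses RegexOptions.Compiled (compiled to IL at run time) rather than compiling into a separate assembly, since as far as I know Regex.CompileToAssembly is only supported on .NET Framework. The class and method names (TextCleaner, Clean) are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Same pattern as above: digits, non-word characters, and 1-2 character words.
    // RegexOptions.Compiled is an assumption of this sketch, not the
    // compile-to-assembly approach from the answer.
    private static readonly Regex Noise =
        new Regex(@"[0-9]|[^\w]|(\b\w{1,2}\b)", RegexOptions.Compiled);

    public static string Clean(string text)
    {
        // Lower-case first, as in the question, then replace every match with a space.
        return Noise.Replace(text.ToLower(), " ");
    }
}
```

The regex is built once and reused for every document, which is the point of compiling it. The output still contains runs of spaces, so split it with StringSplitOptions.RemoveEmptyEntries before further processing.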

The last step was the stop words. I decided to test 3 different options (execution times are for the same 12KiB text file).

One big Regular Expression

with all stop words, compiled into an assembly (mquander’s suggestion). Nothing to explain here.

Execution time: ~215ms
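
A rough sketch of how such a single expression can be built. The names (BigRegexStopWords, Build, Remove) are hypothetical, and again it uses RegexOptions.Compiled instead of a precompiled assembly, so treat it as an illustration of the idea rather than the exact code I timed.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BigRegexStopWords
{
    // Builds \b(?:word1|word2|...)\b, escaping each word so it is matched literally.
    public static Regex Build(IEnumerable<string> stopWords)
    {
        string alternation = string.Join("|", stopWords.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(?:" + alternation + @")\b", RegexOptions.Compiled);
    }

    // Replaces every whole-word stop word match with a single space.
    public static string Remove(Regex stopWordRegex, string text)
    {
        return stopWordRegex.Replace(text, " ");
    }
}
```

Build the regex once and reuse it for every document; constructing it per call would dominate the run time.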

String.Replace()

People say that this can be faster than Regex. So for each stop word I used the string.Replace() method. That means many passes over the text, with this result:

Execution time: ~65ms
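
The loop can be sketched roughly like this. string.Replace() matches substrings, not whole words, so this version pads the text and each stop word with spaces; that padding, the single-space assumption, and the names (ReplaceStopWords, Remove) are my additions for the sketch and not necessarily what was timed.

```csharp
using System.Collections.Generic;

public static class ReplaceStopWords
{
    // Assumes words in the input are separated by single spaces.
    public static string Remove(string text, IEnumerable<string> stopWords)
    {
        // Pad with spaces so words at the start and end also look like " word ".
        string result = " " + text + " ";
        foreach (string stopWord in stopWords)
        {
            // The surrounding spaces stand in for word boundaries.
            result = result.Replace(" " + stopWord + " ", " ");
        }
        return result.Trim();
    }
}
```

One known weakness: when the same stop word appears twice in a row, the two matches share a space, so the second occurrence survives a single pass.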

LINQ

The method presented by LBushkin. Nothing more to say.

Execution time: ~2.5ms
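
LBushkin’s code is not reproduced here, so the following is only one plausible shape of a LINQ-based approach: split the text into words, drop those found in a HashSet of stop words, and join the rest. The names (LinqStopWords, Remove) and the choice of HashSet are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqStopWords
{
    public static string Remove(string text, HashSet<string> stopWords)
    {
        // Split on spaces, dropping the empty tokens left by the earlier regex pass.
        IEnumerable<string> kept = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // HashSet.Contains is a constant-time lookup on average.
            .Where(word => !stopWords.Contains(word));
        return string.Join(" ", kept);
    }
}
```

This makes a single pass over the text with one hash lookup per word, instead of one pass over the whole text per stop word, which would explain the kind of gap seen in the timings.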

I can only say wow. Just compare the execution time of the first option with the last one! Big thanks, LBushkin!
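
If you want to repeat this kind of comparison on your own text, a small timing helper along these lines should be enough. It is a sketch, not the harness I used: the names (Timing, AverageMilliseconds) are made up, and the warm-up call is my assumption about how to keep JIT and regex construction costs out of the measurement.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Calls the action once as a warm-up, then returns the average
    // time in milliseconds over the given number of timed runs.
    public static double AverageMilliseconds(Action action, int runs)
    {
        action(); // warm-up, not timed
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            action();
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds / runs;
    }
}
```

Pass each stop word option as a lambda over the same input text so that the numbers are comparable.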
