In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:
get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing
The steps in brackets are the preprocessing steps. The application runs in about 10.265s, but preprocessing takes 9.18s! This is the time for preprocessing 50 HTML pages (excluding downloading).
I use the HtmlAgilityPack library to convert HTML into plain text. This is quite fast: it takes 2.5ms to convert 1 document, so it’s relatively OK.
The problem comes later. Stemming one document takes up to 120ms. Unfortunately those HTML pages are in Polish, and there is no stemmer for the Polish language written in C#. I know of only 2 free ones, both written in Java: Stempel and Morfologik. I precompiled stempel.jar to stempel.dll with the help of IKVM, so there is nothing more I can do there.
Eliminating stop words also takes a lot of time (~70ms for 1 doc). It is done like this:
result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
    string stopword = stopwords.Current.ToString();
    result = Regex.Replace(result, "(\\b" + stopword + "\\b)", " ");
}
return result;
First I remove all numbers, special characters, and 1- and 2-letter words. Then, in a loop, I remove the stop words. There are about 270 stop words.
Is it possible to make it faster?
Edit:
What I want to do is remove everything that is not a word longer than 2 letters. So I want to strip out all special characters (including ‘.’, ‘,’, ‘?’, ‘!’, etc.), numbers, and stop words. I need only plain words that I can use for data mining.
OK, I know that SO is not a pure forum and maybe I shouldn’t answer my own question, but I’d like to share my results.
Finally, thanks to you guys, I managed to optimize my text preprocessing much further. First of all, I simplified the long expression from my question (following Josh Kelley’s answer):
[0-9]|[^\w]|(\b\w{1,2}\b)
It does the same as the first one but is much simpler. Then, following Josh Kelley’s suggestion again, I compiled this regex into an assembly. I found a great example of compiling expressions into an assembly here. I did that because this regex is used many, many times; that was my decision after reading a few articles about compiled regexes. I also removed the last expression that ran after eliminating stop words (it made no real sense).
With that, the execution time on a 12KiB text file was ~15ms. This is only for the expression mentioned above.
The last step was the stop words. I decided to test 3 different options (execution times are for the same 12KiB text file).
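For readers who cannot build a separate regex assembly, the same "compile once, reuse many times" idea can be sketched with `RegexOptions.Compiled` and a cached static `Regex`. This is a swapped-in technique, not the compile-to-assembly route described above, and the names `TextCleaner` and `Clean` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Built once and reused for every document.
    // RegexOptions.Compiled costs more at start-up but usually matches faster afterwards.
    private static readonly Regex Junk = new Regex(
        @"[0-9]|[^\w]|(\b\w{1,2}\b)",
        RegexOptions.Compiled);

    // Lower-cases the text and replaces every digit, non-word character,
    // and standalone 1- or 2-letter word with a space.
    public static string Clean(string text)
    {
        return Junk.Replace(text.ToLower(), " ");
    }
}
```

Calling `TextCleaner.Clean("Ala ma 2 koty, a to jest pies!")` leaves only the words `ala`, `koty`, `jest` and `pies`, separated by runs of spaces.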
One big Regular Expression
with all stop words, compiled into an assembly (mquander’s suggestion). Nothing to explain here.
String.Replace()
People say that this can be faster than Regex, so for each stop word I used the string.Replace() method. Many loops to run, with result:
LINQ
method presented by LBushkin. Nothing more to say.
I can only say wow. Just compare the execution time of the first one with the last one! Big thanks, LBushkin!
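To make the three options concrete, here is a rough sketch of how each might look. It is not the code that was timed above. The LINQ variant in particular is only an assumption about the general idea (split into words, filter against a `HashSet<string>`), not necessarily LBushkin’s exact code. All class and method names are hypothetical, and the input is assumed to be already lower-cased and cleaned so that words are separated by spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordRemoval
{
    // Option 1: one alternation covering every stop word, built and compiled once.
    public static Regex BuildStopWordRegex(IEnumerable<string> stopWords)
    {
        string alternation = string.Join("|",
            stopWords.Select(w => Regex.Escape(w)).ToArray());
        return new Regex(@"\b(?:" + alternation + @")\b", RegexOptions.Compiled);
    }

    public static string WithRegex(string text, Regex stopWordRegex)
    {
        return stopWordRegex.Replace(text, " ");
    }

    // Option 2: one String.Replace pass per stop word.
    // Padding with spaces only approximates whole-word matching; it can miss
    // a stop word that is repeated back to back (for example "to to").
    public static string WithReplace(string text, IEnumerable<string> stopWords)
    {
        string result = " " + text + " ";
        foreach (string word in stopWords)
        {
            result = result.Replace(" " + word + " ", " ");
        }
        return result.Trim();
    }

    // Option 3: split once, keep the words that are not in the set, join again.
    // Each word costs one hash lookup, however many stop words there are.
    public static string WithLinq(string text, HashSet<string> stopWords)
    {
        IEnumerable<string> kept = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !stopWords.Contains(word));
        return string.Join(" ", kept.ToArray());
    }
}
```

With the stop words `ma` and `i`, `StopWordRemoval.WithLinq("ala ma kota i psa", stopWords)` returns `"ala kota psa"`. The set-based version makes a single pass over the text, while the loop versions rescan the whole string once per stop word, which is one plausible reason for the gap in execution times.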