Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 992179
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 16, 20262026-05-16T06:13:31+00:00 2026-05-16T06:13:31+00:00

In my project I work with text in general. I found that preprocessing can

  • 0

In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:

get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing

In brackets there are preprocessing steps. The application runs in about 10.265s, but preprocessing takes 9.18s! This is time for preprocessing 50 HTML pages (excluding downloading).

I use HtmlAgilityPack library to convert HTML into plain text. This is quite fast. It takes 2.5ms to convert 1 document, so it’s relatively OK.

Problem comes later. Stemming one document takes up to 120ms. Unfortunately those HTML pages are in Polish. There no exists stemmer for Polish language written in C#. I know only 2 free to use written in Java: stempel and morfologic. I precompiled stempel.jar to stempel.dll with help of IKVM software. So there is nothing more to do.

Eliminating stop words takes also a lot of time (~70ms for 1 doc). It is done like this:


result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
   string stopword = stopwords.Current.ToString();                
   result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");                               
}
return result;

First i remove all numbers, special characters, 1- and 2-letter words. Then in loop I remove stop words. There are about 270 stop words.

Is it possible to make it faster?

Edit:

What I want to do is remove everything which is not a word longer than 2 letters. So I want to get out all special chars (including ‘.’, ‘,’, ‘?’, ‘!’, etc.) numbers, stop words. I need only pure words that I can use for Data Mining.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-16T06:13:31+00:00Added an answer on May 16, 2026 at 6:13 am

    OK, I know that SO is not a pure forum and maybe I shouldn’t answer my own question but I’d like to share with my results.

    Finally, thanks to you guys, I managed to get better optimization of my text preprocessing. First of all I made simpler that long expression from my question (following Josh Kelley’s answer):

    [0-9]|[^\w]|(\b\w{1,2}\b)

    It does the same as first one but is very simple. Then following Josh Kelley’s suggestion again I put this regex into assembly. Great example of compiling expressions into assembly I found here. I did that, because this regex is used many, many times. After lecture of few articles about compiled regex, that was my decision. I removed the last expression after eliminating stop words (no real sense with that).

    So the execution time on 12KiB text file was ~15ms. This is only for expression mentioned above.

    Last step were stop words. I decided to make a test for 3 different options (Execution times are for the same 12KiB text file).

    One big Regular Expression

    with all stop words and compiled into assembly (mquander’s suggestion). Nothing to clear here.

    • Execution time: ~215ms

    String.Replace()

    People say that this can be faster than Regex. So for each stop word I used string.Replace() method. Many loops to take with result:

    • Execution time: ~65ms

    LINQ

    method presented by LBushkin. Nothing to say more.

    • Execution time: ~2.5ms

    I can only say wow. Just compare execution times of first one with the last one! Big thanks LBushkin!

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Ask A Question

Stats

  • Questions 499k
  • Answers 500k
  • Best Answers 0
  • User 1
  • Popular
  • Answers
  • Editorial Team

    How to approach applying for a job at a company ...

    • 7 Answers
  • Editorial Team

    What is a programmer’s life like?

    • 5 Answers
  • Editorial Team

    How to handle personal stress caused by utterly incompetent and ...

    • 5 Answers
  • Editorial Team
    Editorial Team added an answer This is not pretty but it works: rm -R $(ls… May 16, 2026 at 12:45 pm
  • Editorial Team
    Editorial Team added an answer Yes. Override the base1 and base2 methods in Derived to… May 16, 2026 at 12:45 pm
  • Editorial Team
    Editorial Team added an answer No, you can't. Unfortunately, UIEvent doesn't expose any public way… May 16, 2026 at 12:45 pm

Trending Tags

analytics british company computer developers django employee employer english facebook french google interview javascript language life php programmer programs salary

Top Members

Related Questions

I have a project ongoing at the moment that uses dynamic text blocks to
I just got a project in Borland JBuilder 2006 that I cannot even build.
I have a wierd problem that i need to work out how to resolve,
I'm showing a line of text using a typewriter effect [ http://plugins.jquery.com/project/jTypeWriter] . I
I want use jQuery in my project. I know the javascript_include_tag calls the jQuery
Can somebody explain why my Debug.Write stops working for no apparent reason - no
I am pretty new to WCF in general. What little experience I have comes
So we're writing a full-text search framework MongoDb . MongoDB is pretty much javascript-native,
I'm working on a project at the moment where I need to pick out
The project I'm working on allows an end-user to modify CSS code to integrate

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.