The Archive Base


Editorial Team
Asked: May 13, 2026 (2026-05-13T17:30:54+00:00)

A lot of Natural Language Processing (NLP) algorithms and libraries have a hard time


A lot of Natural Language Processing (NLP) algorithms and libraries have a hard time working with random texts from the web, usually because they presuppose clean, articulate writing. I can understand why that would be easier than parsing YouTube comments.

My question is: given a random piece of text, is there a process to determine whether that text is well written and is a good candidate for use in NLP? What is the general name for these algorithms?

I would appreciate links to articles, algorithms or code libraries, but I would settle for good search terms.


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 5:30 pm (2026-05-13T17:30:55+00:00)

    ‘Well written’ and ‘good for NLP’ may go together but don’t have to. For a text to be ‘good for NLP’, it should probably contain whole sentences with a verb and a period at the end, and it should convey some meaning. For a text to be well written, it should also be well-structured, cohesive, and coherent, use pronouns in place of nouns correctly, etc. What you need depends on your application.

    The chances that a sentence will be processed properly by an NLP tool can often be estimated with some simple heuristics: Is it too long (>20 or 30 words, depending on the language)? Too short? Does it contain many weird characters? Does it contain URLs or email addresses? Does it have a main verb? Is it just a list of something? To my knowledge, there is no specific name for this, nor any particular algorithm for this kind of filtering; it is usually just treated as part of ‘preprocessing’.
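The heuristics above could be sketched roughly as follows. This is only a minimal illustration, not an established algorithm: the function names, the thresholds, and the set of "normal" punctuation are all assumptions made for the example, and the main-verb and "is it just a list" checks are left out because they would need a part-of-speech tagger.

```python
import re
from typing import List

# Hypothetical thresholds; tune them per language and application.
MIN_WORDS = 3
MAX_WORDS = 30
MAX_ODD_CHAR_RATIO = 0.1

# Deliberately simple patterns; real URLs and addresses are messier.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Punctuation treated as unremarkable (an assumption for this sketch).
NORMAL_PUNCT = set(".,;:!?'\"()-")


def is_odd_char(ch: str) -> bool:
    """True for characters that are not letters, digits, spaces, or common punctuation."""
    return not (ch.isalnum() or ch.isspace() or ch in NORMAL_PUNCT)


def sentence_problems(sentence: str) -> List[str]:
    """Return the names of the heuristics that the sentence fails."""
    problems = []
    stripped = sentence.strip()
    words = stripped.split()
    if len(words) < MIN_WORDS:
        problems.append("too_short")
    if len(words) > MAX_WORDS:
        problems.append("too_long")
    if URL_RE.search(stripped):
        problems.append("contains_url")
    if EMAIL_RE.search(stripped):
        problems.append("contains_email")
    if stripped:
        odd = sum(1 for ch in stripped if is_odd_char(ch))
        if odd / len(stripped) > MAX_ODD_CHAR_RATIO:
            problems.append("odd_characters")
    if not stripped.endswith((".", "!", "?")):
        problems.append("no_final_punctuation")
    return problems


def is_candidate(sentence: str) -> bool:
    """A sentence is kept only if no heuristic flags it."""
    return not sentence_problems(sentence)
```

Returning the list of failed checks rather than a single yes/no makes it easier to see which heuristic is throwing away most of the data, and to relax that one for a given application.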

    As to a sentence being well written: some work has been done on automatically evaluating readability, cohesion, and coherence, e.g. the articles by Miltsakaki (Evaluation of text coherence for electronic essay scoring systems and Real-time web text classification and analysis of reading difficulty) or Higgins (Evaluating multiple aspects of coherence in student essays). These approaches are all based on one theory of discourse structure or another, such as Centering Theory. The articles are rather theory-heavy and assume knowledge of both Centering Theory and machine learning. Nonetheless, some of these techniques have been applied successfully by ETS to automatically score student essays, and I think this is quite similar to what you are trying to do, or at least you may be able to adapt a few ideas.

    All this being said, I believe that within the next few years, NLP will have to develop techniques to process language that is not well-formed with respect to current standards. There is a massive amount of extremely valuable data out there on the web, consisting of exactly the kinds of text you mentioned: YouTube comments, chat messages, Twitter and Facebook status messages, etc. All of them potentially contain very interesting information. So, who should adapt: the people writing that way, or NLP?



© 2021 The Archive Base. All Rights Reserved