Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 951339
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 15, 20262026-05-15T23:43:12+00:00 2026-05-15T23:43:12+00:00

Bit of a random one, i am wanting to have a play with some

  • 0

Bit of a random one, i am wanting to have a play with some NLP stuff and I would like to:

Get all the text that will be displayed to the user in a browser from HTML.

My ideal output would not have any tags in it and would only have fullstops (and any other punctuation used) and new line characters, though i can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in output).

If there was a way of inserting a newline or full stop in situations where the content was likely not to continue on then that would be considered an added bonus. e.g:

items in an ul or option tag could be separated by full stops (or to be honest just ignored).

I am working Java, but would be interested in seeing any code that does this.

I can (and will if required) come up with something to do this, just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-).

An example of the code I might write if I do end up doing this would be to use a SAX parser to find content in p tags, strip it of any span or strong etc tags, and add a full stop if I hit a div or another p without having had a fullstop.

Any pointers or suggestions very welcome.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-15T23:43:13+00:00Added an answer on May 15, 2026 at 11:43 pm

    HTML parsers seem to be a reasonable starting point for this.

    there are a number of them for example: HTMLCleaner and Nekohtml seem to work fine.

    They are good as they fix the tags to allow you to more consistently process them, even if you are just removing them.

    But as it turns out you probably want to get rid of script tags meta data etc. And in that case you are better working with well formed XML which these guy get for you from “wild” html.

    there are many SO questions relating to this (like this one) you should search for “HTML parsing” though 😉

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a text file that looks a bit like: random text random text,
Assuming I have a function that returns a random bit, is it possible to
I have this code for AES encryption, can some one verify that this code
I would like to open a web page that simply contains one of three
I want to generate eight bit (uint8_t) random numbers so that I exclude a
I have a bit of code that basically displays the last x (variable, but
Bit of a strange question. I have a website which has some pages in
This one has me stumped. I have two Windows installations of PHP: 32-bit on
I have a 32-bit app that makes use of Java Accessibility (WindowsAccessBridge-32.dll, via the
if you are to choose a random a 512-bit integer N that is not

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.