I have a string which contains some HTML code. I would like to find


Asked: June 15, 2026

I have a string that contains some HTML code. I would like to find out whether that HTML represents visible text or an image. I tried to solve this in Java with the following regular expressions (I know you cannot parse HTML with regular expressions, but I thought they would be enough for what I am trying to do).

public static String regex_html_tags_1 = "<\\s*br\\s*[/]?>";
public static String regex_html_tags_2 = "<\\s*([a-zA-Z0-9]+)\\s*([^=/>]+\\s*=\\s*[^/>]+\\s*)*\\s*/>"; 
public static String regex_html_tags_3 = "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>"; 

public static String[] HTMLWhiteSpaces = {"&nbsp;", "&#160;"};

The code using these RegExps works fine for strings like

<h2></h2>

or similar. But a string such as

<img src="someImage.png"></img>

is also treated as empty.
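The behaviour described above can be reproduced in isolation. The sketch below reuses regex_html_tags_3 unchanged and shows that it accepts any open/close pair with nothing in between, regardless of tag name or attributes, which is why the img element slips through. The class name EmptyTagCheck, the helper methods, and the VISIBLE_WITHOUT_TEXT set are hypothetical names introduced only for illustration; checking the captured tag name is one possible workaround, not a general solution.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyTagCheck {
    // regex_html_tags_3 from the question, unchanged.
    static final Pattern EMPTY_PAIR = Pattern.compile(
            "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>");

    // Assumption: tags that render something even when they contain no text.
    // Only "img" comes from the question; extend the set as needed.
    static final Set<String> VISIBLE_WITHOUT_TEXT = Set.of("img");

    // True if the whole string is an open/close pair with nothing in between.
    static boolean matchesEmptyPair(String html) {
        return EMPTY_PAIR.matcher(html).matches();
    }

    // Same check, but a pair whose tag name is in VISIBLE_WITHOUT_TEXT
    // is not counted as empty.
    static boolean isEmpty(String html) {
        Matcher m = EMPTY_PAIR.matcher(html);
        if (!m.matches()) {
            return false;
        }
        String tag = m.group(1).toLowerCase(Locale.ROOT);
        return !VISIBLE_WITHOUT_TEXT.contains(tag);
    }
}
```

With this sketch, matchesEmptyPair accepts both example strings from the question, while isEmpty accepts the h2 pair and rejects the img pair.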

Does anyone have a better idea than regular expressions for finding out whether some HTML code actually represents human-readable text when a browser renders it? Or do you think my approach can eventually succeed?

Thanks a lot in advance.


1 Answer

Editorial Team · Answer 1 · 2026-06-15T23:52:38+00:00

Try using jsoup. It lets you parse HTML documents and query them with CSS selectors (jQuery-style).

A very simple example to select all non-empty elements would be:

Document doc = Jsoup.connect("http://my.awesome.site.com").get();
Elements nonEmpties = doc.select(":not(:empty)");

A full solution will of course require some extra work, such as

iterating over lists of elements,
checking CSS styles (for display, visibility, sizes or overlaying elements),
checking src attributes for images,
etc.,

but it’s definitely worth it. You’ll learn a new framework, discover the ways content can be ‘hidden’ in HTML / CSS and, most importantly, stop using regular expressions for HTML parsing 😉
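Since the question starts from an HTML string rather than a URL, a minimal sketch of the text and src checks might look like the following. It assumes jsoup is on the classpath; the class name VisibleContentCheck and the method hasVisibleContent are hypothetical names introduced for illustration. It covers only text and img src attributes and does not inspect CSS, so content hidden via display or visibility would still count as visible here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class VisibleContentCheck {

    // Returns true if the fragment contains non-blank text
    // or at least one <img> with a non-empty src attribute.
    static boolean hasVisibleContent(String html) {
        // Parse the string as a body fragment instead of fetching a URL.
        Document doc = Jsoup.parseBodyFragment(html);
        Element body = doc.body();

        // Entities such as &nbsp; and &#160; decode to U+00A0; treat it
        // like an ordinary space before trimming.
        String text = body.text().replace('\u00A0', ' ').trim();
        if (!text.isEmpty()) {
            return true;
        }

        // No text: look for images that actually point at something.
        for (Element img : body.select("img[src]")) {
            if (!img.attr("src").trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, the h2 example from the question is reported as having no visible content, while the img example is reported as visible.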
