I have a string that contains some HTML code. I would like to find out whether that HTML represents visible text or an image. I tried to solve this in Java with the following regular expressions (I know you cannot really parse HTML with RegExps, but I thought they would be enough for what I am doing).
public static String regex_html_tags_1 = "<\\s*br\\s*[/]?>";
public static String regex_html_tags_2 = "<\\s*([a-zA-Z0-9]+)\\s*([^=/>]+\\s*=\\s*[^/>]+\\s*)*\\s*/>";
public static String regex_html_tags_3 = "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>";
public static String[] HTMLWhiteSpaces = {" ", " "};
The code using these RegExps works fine for strings like
<h2></h2>
or similar. But a string like
<img src="someImage.png"></img>
is also treated as empty.
Does anyone have a better idea than RegExps for finding out whether some HTML code actually represents human-readable text when it is rendered by a browser? Or do you think my approach will eventually work?
Thanks a lot in advance.
Try using jsoup. It lets you parse HTML documents using CSS selectors (jQuery-style).
A very simple example to select all non-empty elements would be:
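A minimal sketch of such a selection, assuming jsoup is on the classpath. The class name `NonEmptyElements` and the selector `body *:not(:empty)` are illustrative choices, not necessarily what this answer originally showed:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NonEmptyElements {

    // Returns every element below <body> that has at least one child node.
    static Elements nonEmpty(String html) {
        // parseBodyFragment wraps the snippet in an html/head/body skeleton,
        // so the selector is limited to descendants of <body>.
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.select("body *:not(:empty)");
    }

    public static void main(String[] args) {
        for (Element el : nonEmpty("<h2></h2><p>Hello</p>")) {
            System.out.println(el.tagName() + ": " + el.text());
        }
    }
}
```

Note that an `<img>` element has no child nodes, so a selector based on `:empty` alone still skips images. They need a separate check.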
The full-blown solution will of course require some extra work, like
- checking for hidden elements (through display or visibility, sizes, or overlaying elements)
- checking the src attributes of images
but it's definitely worth it. You'll learn a new framework, discover ways to 'hide' content in HTML / CSS and, most important, stop using regular expressions for HTML parsing 😉
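A rough sketch of how those extra checks could start, again assuming jsoup. The names `VisibleContentCheck`, `isHiddenInline`, and `hasVisibleContent` are made up for illustration. It only looks at inline `style` attributes, because jsoup does not evaluate stylesheets, element sizes, or overlapping elements, so treat it as a first approximation rather than a complete visibility test:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class VisibleContentCheck {

    // Hypothetical helper: true if the element or any ancestor carries an
    // inline style with display:none or visibility:hidden.
    static boolean isHiddenInline(Element el) {
        for (Element cur = el; cur != null; cur = cur.parent()) {
            String style = cur.attr("style").toLowerCase().replaceAll("\\s+", "");
            if (style.contains("display:none") || style.contains("visibility:hidden")) {
                return true;
            }
        }
        return false;
    }

    // True if the snippet contains non-blank text or an <img> with a src attribute
    // that is not hidden by an inline style.
    static boolean hasVisibleContent(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element el : doc.body().getAllElements()) {
            if (isHiddenInline(el)) {
                continue;
            }
            // ownText() is the text directly inside this element, without child elements;
            // non-breaking spaces (&nbsp;) are counted as whitespace here.
            boolean hasText = !el.ownText().replace('\u00a0', ' ').trim().isEmpty();
            boolean isImage = el.tagName().equals("img") && !el.attr("src").trim().isEmpty();
            if (hasText || isImage) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleContent("<h2></h2>"));
        System.out.println(hasVisibleContent("<img src=\"someImage.png\"></img>"));
    }
}
```

With this approach the two strings from the question are told apart: the empty heading yields no text and no image, while the `<img>` element is recognized through its `src` attribute.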