I have a lab assignment and I'm stuck on removing HTML tags. Here is my method for removing them:
public String getFilteredPageContents() {
    String str = getUnfilteredPageContents();
    String temp = "";
    boolean b = false;
    for (int i = 0; i < str.length(); i++) {
        if (str.charAt(i) == '&' || str.charAt(i) == '<') {
            b = true;
        }
        if (b == false) {
            temp += str.charAt(i);
        }
        if (str.charAt(i) == '>' || str.charAt(i) == ';') {
            b = false;
        }
    }
    return temp;
}
And here is the original, unfiltered text:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 2.0">
<title>A Shropshire Lad</title>
</head>
<body bgcolor="#008000" text="#FFFFFF" topmargin="10"
leftmargin="20">
<p align="center"><font size="6"><strong></strong></font> </p>
<div align="center"><center>
<pre><font size="7"><strong>A Shropshire Lad
</strong></font><strong>
by A.E. Housman
Published by Dover 1990</strong></pre>
</center></div>
<p><strong>This collection of sixty three poems appeared in 1896.
Many of them make references to Shrewsbury and Shropshire,
however, Housman was not a native of the county. The Shropshire
of his book is a mindscape in which he blends old ballad meters,
classical reminiscences and intense emotional experiences
"recollected in tranquility." Although they are not
particularly to my taste, their style, simplicity and
timelessness are obvious even to me. Below are two short poems
which amused me, I hope you find them interesting too.</strong></p>
<hr size="8" width="80%" color="#FFFFFF">
<div align="left">
<pre><font size="5"><strong><u>
XIII</u></strong></font><font size="4"><strong>
When I was one-and-twenty
I heard a wise man say,
'Give crowns and pounds and guineas
But not your heart away;</strong></font></pre>
</div><div align="left">
<pre><font size="4"><strong>Give pearls away and rubies
But keep your fancy free.
But I was one-and-twenty,
No use to talk to me.</strong></font></pre>
</div><div align="left">
<pre><font size="4"><strong>When I was one-and-twenty
I heard him say again,
'The heart out of the bosom
Was never given in vain;
'Tis paid with sighs a plenty
And sold for endless rue'
And I am two-and-twenty,
And oh, 'tis true 'tis true.
</strong></font><strong></strong></pre>
</div>
<hr size="8" width="80%" color="#FFFFFF">
<pre><font size="5"><strong><u>LVI . The Day of Battle</u></strong></font><font
size="4"><strong>
'Far I hear the bugle blow
To call me where I would not go,
And the guns begin the song,
"Soldier, fly or stay for long."</strong></font></pre>
<pre><font size="4"><strong>'Comrade, if to turn and fly
Made a soldier never die,
Fly I would, for who would not?
'Tis sure no pleasure to be shot.</strong></font></pre>
<pre><font size="4"><strong>'But since the man that runs away
Lives to die another day,
And cowards' funerals, when they come,
Are not wept so well at home,</strong></font></pre>
<pre><font size="4"><strong>'Therefore, though the best is bad,
Stand and do the best, my lad;
Stand and fight and see your slain,
And take the bullet in your brain.'</strong></font></pre>
<hr size="8" width="80%" color="#FFFFFF">
</body>
</html>
And this is what I get when I run my method on this text:
charset=iso-8859-1">
A Shropshire Lad
A Shropshire Lad
by A.E. Housman
Published by Dover 1990
This collection of sixty three poems appeared in 1896.
Many of them make references to Shrewsbury and Shropshire,
however, Housman was not a native of the county. The Shropshire
of his book is a mindscape in which he blends old ballad meters,
classical reminiscences and intense emotional experiences
recollected in tranquility. Although they are not
particularly to my taste, their style, simplicity and
timelessness are obvious even to me. Below are two short poems
which amused me, I hope you find them interesting too.
.
.
.
My question is: how can I get rid of the leftover fragment charset=iso-8859-1"> at the very beginning of the output? I can't figure out how to remove it. Thanks…
I can see that your intent is to remove stuff that looks like <xxx> and &xxx;. You're using the variable b to remember whether you're currently skipping stuff or not.

Did you notice that your algorithm will also skip things of the form <xxx; and &xxx>? Namely, & or < will cause skipping to begin, and > or ; will cause skipping to end, but you don't have to match < with >, or & with ;. In your input, the ; after text/html inside the <meta> tag ends the skip early, which is why charset=iso-8859-1"> survives. So how about implementing code to remember which character started the skip?

A further complication, though, is that &xxx; stuff can be embedded in <xxx> stuff, like this: <p title="&amp;">

Incidentally, temp += str.charAt(i); will make your program very slow when the string is long. Look at using StringBuilder instead.

Here is some code that should solve your problem or nearly so:
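A minimal sketch of the idea described above, not necessarily the code originally posted. It remembers which character started the skip, so a < is only closed by > and a & only by ;, and it builds the result with a StringBuilder. The class name TagStripper and the method name stripTags are hypothetical, and the method takes the page contents as a parameter instead of calling getUnfilteredPageContents(). It assumes that every & in the text starts an entity and that attribute values never contain a literal >; real-world HTML (comments, scripts, stray ampersands) would probably need an actual parser.

```java
// Sketch only: TagStripper and stripTags are illustrative names, not from the original post.
final class TagStripper {

    /** Removes <...> tags and &...; entities, matching each opener with its own closer. */
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        // 0 means "not skipping"; otherwise this holds the character that ends the current skip.
        char closer = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (closer == 0) {
                if (c == '<') {
                    closer = '>';      // a tag ends only at '>'
                } else if (c == '&') {
                    closer = ';';      // an entity ends only at ';'
                } else {
                    out.append(c);     // ordinary text: keep it
                }
            } else if (c == closer) {
                closer = 0;            // matching closer found: stop skipping
            }
            // Any other character seen while skipping is dropped, so a ';' inside a tag
            // or an entity inside an attribute value does not end the skip early.
        }
        return out.toString();
    }
}
```

With this version, calling TagStripper.stripTags on the <meta ... content="text/html; charset=iso-8859-1"> line should leave nothing behind, because the ; inside the tag no longer ends the skip.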