I am trying to extract the URLs from a given String, which contains an HTTP response with href attributes. I can find the beginning of each link, but I need to cut the string off as soon as the href value ends. How can this be achieved?
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

public class Extracturl {
    public static void main(String[] args) throws IOException {
        String line;
        InputStream is = null;
        try {
            String u = "http://en.wikipedia.org/wiki/china";
            String fileName = "e:\\test.txt";
            BufferedWriter writer = new BufferedWriter(new FileWriter(fileName, true));
            URL url = new URL(u);
            is = url.openStream(); // throws an IOException
            DataInputStream dis = new DataInputStream(new BufferedInputStream(is));
            String w = new String();
            while ((line = dis.readLine()) != null) {
                try {
                    if (line.contains("href=\"/wiki") && line.contains("\" />") && (!line.contains("File"))) {
                        // substring(int) runs from the href to the END of the line
                        if (!w.contains(line.substring(line.indexOf("href=\"/")))) {
                            w = w + line.substring(line.indexOf("href=\"/"));
                            System.out.println(line.substring(line.indexOf("href=\"/")));
                            writer.write(w);
                            writer.newLine();
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                if (is != null) {
                    is.close();
                }
                // writer.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}
I also tried
w=w+line.substring(line.indexOf("href=\"/"),line.indexOf("\">"));
but this gave me an error.
My aim is to get all the URLs which are linked from the page.
Use an HTML parser for that purpose. Here is an example with the HTML parser built into Java. There are other alternatives like JSoup, but for basic HTML handling, this one does a pretty good job:
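A minimal sketch of that idea, assuming the parser in question is the one in javax.swing.text.html.parser that ships with the JDK. The class name LinkExtractor, the helper extractLinks, and the sample HTML are made up for illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {

    // Hypothetical helper: collects the href value of every <a> tag
    // found in the HTML supplied by 'reader'.
    static List<String> extractLinks(Reader reader) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    // The parser hands over the attribute value already
                    // delimited, so no manual substring/indexOf is needed.
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // 'true' tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(reader, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Sample input standing in for the downloaded page (an assumption).
        String html = "<html><body>"
                + "<a href=\"/wiki/Asia\">Asia</a> "
                + "<a href=\"/wiki/Beijing\">Beijing</a>"
                + "</body></html>";
        for (String link : extractLinks(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```

To run it against the page from the question, pass an InputStreamReader wrapped around url.openStream() instead of the StringReader. Only anchor tags are handled here; tags written in the self-closing form, such as link, would arrive through handleSimpleTag instead.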