I’m trying to read a web page using the following code:
URL url = new URL("somewebsitecomeshere");
URLConnection c = url.openConnection();
if (getHttpResponseCode(c) == 200)
{
    if (isContentValid(c)) // accept html/xml only!
    {
        InputStream is = c.getInputStream();
        Reader r = new InputStreamReader(is);
        System.out.println(r.toString());
        //after commenting this everything works great!
        setHTMLString(getStringFromReader(r));
        System.out.println(getHTMLString());
        ParserDelegator parser = new ParserDelegator();
        parser.parse(r, new Parser(url), true);
        r.close();
        is.close();
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    else
        log("content is not valid!");
}
else
{
    System.out.println("ERROR" + c.getContentType() + c.getURL());
}
//---------------------------------------------------
private String getStringFromReader(Reader reader) throws IOException {
    char[] arr = new char[8*1024]; // 8K at a time
    StringBuffer buf = new StringBuffer();
    int numChars;
    while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
        buf.append(arr, 0, numChars);
    }
    //Reset position to 0
    reader.reset();
    return buf.toString();
}
If I try to read the string using getStringFromReader(), the rest of the code is effectively skipped, because the Reader's position has moved to EOF. So I tried to reset the position to 0, but I got the following error:
java.io.IOException: reset() not supported
at java.io.Reader.reset(Unknown Source)
at sample.getStringFromReader(Spider.java:248)
at default(sample.java:286)
at default.main(sample.java:130)
How can I reset the Reader position to 0?
Short answer: your stream doesn’t support the mark and reset methods. Check the result of:
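A check along these lines would presumably do. This is only a minimal sketch: `markSupported()` is a standard method on both `InputStream` and `Reader`, while the class and method names (`MarkSupportCheck`, `canReset`) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;

class MarkSupportCheck {
    // True only if the stream can be rewound with mark()/reset().
    static boolean canReset(InputStream in) {
        return in.markSupported();
    }

    // Same check for character streams.
    static boolean canReset(Reader reader) {
        return reader.markSupported();
    }

    public static void main(String[] args) {
        InputStream bytes = new ByteArrayInputStream("<html/>".getBytes());
        System.out.println(canReset(bytes));                        // in-memory stream: true
        System.out.println(canReset(new InputStreamReader(bytes))); // plain InputStreamReader: false
        System.out.println(canReset(new StringReader("<html/>")));  // StringReader: true
    }
}
```

In the question's code, calling `r.markSupported()` on the `InputStreamReader` before `reset()` would return false, which matches the exception in the stack trace.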
Long answer: an InputStream is a flow of bytes. The bytes can come from a file, a network resource, a string, etc. Some streams support moving the read position back to the start of the stream (a random access file, for example), while others don't.
A stream from a web site normally reads its data from an underlying network connection. That means it is up to the underlying network protocol (TCP/IP, for example) whether resetting the stream is supported, and normally it isn't.
In order to reset any stream you would have to know the entire flow, from start to end. Network communication transfers data as a series of packets, which may arrive out of order. Packets may get lost or even be duplicated, so the information is normally buffered and interpreted as it is received. It would be very expensive to reconstruct all messages at the network level, so that is normally left to the receiver, if it wants to do that.
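On the receiver side, one way to get "rewindable" content is to read the whole response into memory once and then open a fresh reader over that copy for each consumer. The following is only a sketch under assumptions: the names `BufferOnce` and `readAll` are hypothetical, a `ByteArrayInputStream` stands in for `c.getInputStream()` so it runs offline, and UTF-8 is assumed as the page encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

class BufferOnce {
    // Drains the reader into a String; the reader is at EOF afterwards.
    static String readAll(Reader reader) throws IOException {
        char[] chunk = new char[8 * 1024];
        StringBuilder buf = new StringBuilder();
        int n;
        while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
            buf.append(chunk, 0, n);
        }
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for c.getInputStream(); UTF-8 is an assumption.
        InputStream is = new ByteArrayInputStream(
                "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8));
        String html;
        try (Reader r = new InputStreamReader(is, StandardCharsets.UTF_8)) {
            html = readAll(r); // the network stream is consumed exactly once
        }
        System.out.println(html);

        // A StringReader over the buffered copy starts at position 0 again.
        try (Reader again = new StringReader(html)) {
            System.out.println(readAll(again).equals(html));
        }
    }
}
```

With this approach the second reader, rather than the original one, would be the one passed to `parser.parse(...)`. The cost is holding the whole page in memory.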
In your case, if what you want is to print the input stream, I would recommend creating a custom InputStream that wraps the original InputStream and, whenever it is read, prints the value read and returns it at the same time. For example:
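A wrapper of that kind might look roughly like the sketch below. It is not a definitive implementation: the class name `PrintingInputStream` and the choice to echo into an arbitrary `OutputStream` (for instance `System.out`) are assumptions made for illustration.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class PrintingInputStream extends FilterInputStream {
    private final OutputStream echo; // where a copy of every byte read is written

    PrintingInputStream(InputStream in, OutputStream echo) {
        super(in);
        this.echo = echo;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            echo.write(b); // copy the byte, then hand it to the caller unchanged
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            echo.write(buf, off, n); // copy only the bytes actually read
        }
        return n;
    }
}
```

Because the bytes are echoed as they pass through, nothing has to be rewound. Note that bytes skipped with `skip()` are not echoed in this sketch.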
Then wrap your original InputStream with that:
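The wrapping step could be sketched as follows. To keep the block self-contained, an anonymous `FilterInputStream` stands in for the echoing wrapper, a `ByteArrayInputStream` stands in for `c.getInputStream()`, and `consume` stands in for the parser. All of these names, and the UTF-8 charset, are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class WrapExample {
    // Inline stand-in for the echoing wrapper: copies every byte read to System.out.
    static InputStream echoing(InputStream source) {
        return new FilterInputStream(source) {
            @Override
            public int read() throws IOException {
                int b = super.read();
                if (b != -1) System.out.write(b);
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                int n = super.read(buf, off, len);
                if (n > 0) System.out.write(buf, off, n);
                return n;
            }
        };
    }

    // Drains the reader the way a parser would and returns what it saw.
    static String consume(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[1024];
        int n;
        while ((n = r.read(chunk)) != -1) sb.append(chunk, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for c.getInputStream() so the sketch runs offline.
        InputStream is = new ByteArrayInputStream(
                "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8));
        try (Reader r = new InputStreamReader(echoing(is), StandardCharsets.UTF_8)) {
            consume(r); // in the question this would be parser.parse(r, new Parser(url), true)
        }
        System.out.flush();
    }
}
```

The page is printed as a side effect of the single pass the parser makes over the stream, so the separate `getStringFromReader` call and the `reset()` are no longer needed.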
Hope it helps.