I need to fetch HTML from Turkish webpages using Java. However, I am finding that my Java code is not able to pick up certain Turkish characters. Here is the Java code I am using:
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.InputStream;
import java.net.URL;
public class fetchHTML {
    public static void main(String[] args) throws Exception {
        URL urls = new URL("http://www.parkbravo.com.tr/pantolon.php");
        InputStream is = urls.openStream();
        DataInputStream dis = new DataInputStream(new BufferedInputStream(is));
        String line;
        while ((line = dis.readLine()) != null) {
            System.out.println(line);
        }
    }
}
The first few lines of output of this code are:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" />
<html lang="tr" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<title>ParkBravo - Ãrünler - Pantolonlar</title>
You can see that the title is incorrect: Ãrünler should be Ürünler.
If I use the following Python code to get the HTML:
import urllib2
url = 'http://www.parkbravo.com.tr/pantolon.php'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
then the output is correct. Title comes out as:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" />
<html lang="tr" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<title>ParkBravo - Ürünler - Pantolonlar</title>
But I want to be able to get the HTML with Java. Does anyone know how I can get this working?
Thanks!
readLine() in DataInputStream is deprecated because it does not properly convert bytes to characters. You should use a Reader, which handles the conversion from bytes to characters correctly.

If you use InputStreamReader, you can specify the encoding in the constructor, and if you wrap it in a BufferedReader, you can read lines.

Instead of wrapping the stream in a DataInputStream, you can wrap it in new BufferedReader(new InputStreamReader(is, "UTF-8")), where "UTF-8" can be replaced by whatever encoding you need.
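To make the suggestion concrete, here is a minimal sketch of reading lines through an InputStreamReader with an explicit charset. The class name FetchHtmlSketch and the helper readLines are made up for illustration, and the sample bytes stand in for the page body so the example runs without a network connection. It assumes the page is served as UTF-8, which the question does not confirm.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FetchHtmlSketch {

    // Hypothetical helper: decodes the stream with the given charset
    // and returns its lines. Closing the reader also closes the stream.
    static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the page body: the title line encoded as UTF-8 bytes.
        // Unicode escapes keep the source file independent of its own encoding.
        byte[] utf8 = "<title>ParkBravo - \u00DCr\u00FCnler - Pantolonlar</title>\n"
                .getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset preserves the Turkish characters.
        for (String line : readLines(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            System.out.println(line);
        }

        // Decoding the same bytes one byte per character (ISO-8859-1) splits each
        // multi-byte character, which is the kind of garbling seen in the question.
        for (String line : readLines(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1)) {
            System.out.println(line);
        }
    }
}
```

For the page in the question, the same helper could be called with new URL("http://www.parkbravo.com.tr/pantolon.php").openStream() in place of the ByteArrayInputStream. If the page turns out to use a different encoding, pass the matching Charset instead. What the console actually displays also depends on the terminal's own encoding.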