I am trying to use boilerpipe java library, to extract news articles from a

Question

0

Asked: May 31, 20262026-05-31T05:19:38+00:00 2026-05-31T05:19:38+00:00

I am trying to use boilerpipe java library, to extract news articles from a

0

I am trying to use boilerpipe java library, to extract news articles from a set of websites.
It works great for texts in english, but for text with special characters, for example, words with accent marks (história), this special characters are not extracted correctly. I think it is an encoding problem.

In the boilerpipe faq, it says “If you extract non-English text you might need to change some parameters” and then refers to a paper. I found no solution in this paper.

My question is, are there any params when using boilerpipe where i can specify the encoding? Is there any way to go around and get the text correctly?

How i’m using the library:
(first attempt based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second on the HTLM source code)

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-31T05:19:38+00:00

Ok, got a solution.
As Andrei said, i had to change the class HTMLFecther, which is in the package de.l3s.boilerpipe.sax
What i did was to convert all the text that was fetched, to UTF-8.
At the end of the fetch function, i had to add two lines, and change the last one:

final byte[] data = bos.toByteArray(); //stays the same
byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); //new one (convertion)
cs = Charset.forName("UTF-8"); //set the charset to UFT-8
return new HTMLDocument(utf8, cs); // edited line

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am trying to use boilerpipe java library, to extract news articles from a

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply