I have a Solr servlet with 600,000 documents; each document contains about 10-30 multivalued fields. I'm having trouble updating a document because Solr 3.6 has no update function. What I want is an app that only needs the field that should be inserted into the document. For example:
Document1( field1 / value1, field2 / value2)
I want to insert field3 / value3 into this document. At the moment I have to resend the whole document:
Document1( field1 / value1, field2 / value2, field3 / value3)
Because of the high number of fields in each document, I just want to add field3 / value3 without having to write out all the other fields as shown above:
Document1( field3 / value3)
That's why I wrote an application that first fetches all the data from Solr automatically and then adds the one field that should be inserted into the document. Everything went fine until I worked with documents whose fields contain values like ‘ä’, ‘ö’, ‘ü’ and so on. Solr then returns an error:
org.apache.solr.common.SolrException: Invalid UTF-8 start byte 0xfc
I figured out that this is caused by the characters mentioned above. I therefore wanted to know which encoding my input stream uses (I used juniversalchardet for this), and it turned out that the encoding is WINDOWS-1252. My application is written in Java without any Solr libraries (just the standard HTTP libraries and javax for XML handling). Do you have any idea where the encoding gets changed and how I can avoid it? Is it Java, or is it because the servlet is running on a Windows machine?
Thanks for any help!
Edit: Should I use the SolrJ library? Does anyone know whether this would avoid my problem?
After some research I found my problem, and I want to share the solution for anyone who might run into the same thing. The encoding of the input stream seems to depend on the system you are running on. As you may guess, I'm using a Windows machine. The only thing you have to do is set the encoding of the output stream, which Solr will use to reindex your document, to UTF-8. I used a FileOutputStream because I needed to document the change. So my missing code was:
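The exact lines are not reproduced here. A minimal sketch of the idea, assuming the update XML is already held in a String and the FileOutputStream is wrapped in an OutputStreamWriter with an explicit UTF-8 charset (the class name, method name and sample field value are made up for illustration):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative names only; not taken from the original application.
class Utf8UpdateWriter {

    // Writes the update XML with an explicit charset instead of relying on
    // the platform default charset.
    static void writeUtf8(Path target, String xml) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(target.toFile()), StandardCharsets.UTF_8)) {
            writer.write(xml);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("solr-update", ".xml");
        // The escape for u-umlaut keeps this source file pure ASCII.
        writeUtf8(file, "<add><doc><field name=\"field3\">M\u00fcnchen</field></doc></add>");

        // Print every non-ASCII byte that ended up in the file. In UTF-8 the
        // u-umlaut is the byte pair 0xC3 0xBC; a lone 0xFC (its WINDOWS-1252
        // form) is the byte Solr complained about.
        for (byte b : Files.readAllBytes(file)) {
            if ((b & 0xFF) > 0x7F) {
                System.out.printf("0x%02X ", b & 0xFF);
            }
        }
        System.out.println();
        Files.delete(file);
    }
}
```

The same explicit charset argument exists on InputStreamReader, so the response read from Solr can be decoded as UTF-8 in the same way instead of with the platform default.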
You can choose the encoding of nearly all streams. I didn't know about this parameter, so for everyone who faces this problem: just set the encoding of your output stream.