I am using Solr 3.3 to index stuff from my database. I compose the

Question

Asked: May 24, 2026 (2026-05-24T22:54:35+00:00)

I am using Solr 3.3 to index data from my database. I compose the JSON content in Python. I managed to upload 2126 records, which add up to 523246 characters (approx. 511 KB). But when I try 2127 records, Python gives me this error:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
    request_string.append(param_list)
  File "C:\Python27\lib\json\__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "C:\Python27\lib\json\encoder.py", line 203, in encode
    chunks = list(chunks)
  File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
    for chunk in _iterencode_list(o, _current_indent_level):
  File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
    for chunk in chunks:
  File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
    yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte

Ouch. Is 512 KB worth of bytes a fundamental limit? Is there a high-volume alternative to the standard json module?

Update: It is the fault of some of the data, since trying to encode *biz_list[2126:]* causes an immediate error. Here is the offending piece:

'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'

How can I handle this string so that it can be encoded as JSON?

Update 2: The answer worked as expected: the data came from a MySQL table using the “latin1_swedish_ci” collation, so it was never UTF-8. I had seen significance in a coincidental number. Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.
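The first update narrowed the fault down by slicing the list. A minimal sketch of doing the same scan automatically is shown below. It assumes the records are dicts whose values may be byte strings; the real shape of biz_list is not shown in the question, so `find_non_utf8` and `sample` are hypothetical names used only for illustration.

```python
def find_non_utf8(records):
    """Return (index, key, value) for each byte-string value that is not valid UTF-8."""
    bad = []
    for index, record in enumerate(records):
        for key, value in record.items():
            # On Python 2.7, bytes is an alias of str, so this also covers plain strings.
            if isinstance(value, bytes):
                try:
                    value.decode("utf-8")
                except UnicodeDecodeError:
                    bad.append((index, key, value))
    return bad


# Hypothetical sample data standing in for biz_list.
sample = [
    {"address": b"Kaloor, Cochin 682 017"},
    {"address": b"Kaloor,\nCochin \x96 682 017"},
]

for index, key, value in find_non_utf8(sample):
    print(index, key, repr(value))
```

Only the second record is reported, because the lone 0x96 byte is not a valid UTF-8 sequence. This points at bad data rather than a size limit.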

1 Answer

Editorial Team · Answer 1 · 2026-05-24T22:54:36+00:00

Simple: just don’t use the UTF-8 encoding if your data is not in UTF-8.

>>> json.loads('["\x96"]')
....
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

>>> json.loads('["\x96"]', encoding="latin-1")
[u'\x96']

From the json.loads documentation:

If s is a str instance and is encoded with an ASCII based
encoding other than utf-8 (e.g. latin-1) then an appropriate
encoding name must be specified. Encodings that are not ASCII
based (such as UCS-2) are not allowed and should be decoded to
unicode first.
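The last sentence of that passage, about encodings that are not ASCII based, can be illustrated with a short sketch. The sample string is made up for the example, and UTF-16 stands in for UCS-2: the bytes are decoded to text first, and only the text is handed to json.loads.

```python
import json

# Hypothetical input: JSON text stored in a non-ASCII-based encoding.
raw = u'["caf\u00e9"]'.encode("utf-16")

# Decode to text first, then parse the text.
text = raw.decode("utf-16")
print(json.loads(text))
```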

Edit: To get the proper Unicode value of “\x96”, use “cp1252”, as Eli Collins mentioned:

>>> json.loads('["\x96"]', encoding="cp1252")
[u'\u2013']
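The question itself fails on the encoding side (json.dumps) rather than in json.loads. A minimal sketch of one way to handle that is to decode every byte string to text before serializing. The helper name `to_text` and the sample record are hypothetical, and cp1252 is assumed as the source encoding based on the edit above. On Python 2.7, passing `encoding="cp1252"` to json.dumps should have a similar effect; the explicit decode below is written so that it also runs on Python 3.

```python
import json


def to_text(value, encoding="cp1252"):
    """Recursively decode byte strings so json.dumps only sees text."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, dict):
        return {to_text(k, encoding): to_text(v, encoding) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_text(item, encoding) for item in value]
    return value


# Hypothetical record containing the offending byte from the question.
record = {"address": b"Cochin \x96 682 017"}
payload = json.dumps(to_text(record))
print(payload)
```

With the default `ensure_ascii=True`, the en dash is written as the escape `\u2013`, so the resulting payload is plain ASCII and can be sent to Solr without further encoding concerns.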
