I use the Ruby gem Sequel to read UTF-8-encoded data from an MSSQL Server table.
The fields of the table are defined as nvarchar, and they look correct in Microsoft SQL Server Management Studio (Cyrillic is Cyrillic, Chinese looks Chinese).
I connect to my database with:
    db = Sequel.connect(
      :adapter  => 'ado',
      :host     => connectiondata[:server],
      :database => connectiondata[:dsn],
      #Login via SSO
    )
    sel = db[:TEXTE].filter(:language => 'EN')
    sel.each{|data|
      data.each{|key, val|
        puts "#{val.encoding}: #{val.inspect}" #-> CP850: ....
        puts val.encode('utf-8')
      }
    }
This works fine for English, and German also returns a usable result:
CP850: "(2 St\x81ck) f\x81r
(2 Stück) für ...
But the result has been converted to CP850; it is not the original UTF-8.
Cyrillic languages (I tested with Bulgarian) and Chinese produce only ‘?’
(reasonable, because CP850 doesn’t include Chinese and Bulgarian characters).
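The loss can be reproduced in plain Ruby without any database. This is only a minimal sketch: the sample strings and the helper `to_cp850` are hypothetical and not part of my real code, and it assumes the driver behaves like a lossy transcode to CP850.

```ruby
# Hypothetical sample strings (not rows from the TEXTE table),
# written with \u escapes so the file itself stays pure ASCII.
german    = "(2 St\u00FCck) f\u00FCr"
bulgarian = "\u0411\u044A\u043B\u0433\u0430\u0440\u0441\u043A\u0438"

# Transcode to CP850, substituting '?' for every character
# that the code page cannot represent.
def to_cp850(text)
  text.encode('CP850', :undef => :replace, :replace => '?')
end

[german, bulgarian].each do |text|
  lossy = to_cp850(text)
  puts "#{lossy.encoding}: #{lossy.inspect}"
  # Converting back to UTF-8 cannot restore replaced characters.
  puts lossy.encode('UTF-8')
end
```

The German text survives the round trip because CP850 contains "ü"; the Bulgarian text comes back as question marks only, so the information is already gone before `encode('utf-8')` is called.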
I also connected via an ODBC connection:
    db = Sequel.odbc(odbckey,
      :db_type => 'mssql', #necessary
      #:encoding => 'utf-8', #Only MySQL-Adapter
    )
The result is ASCII-8BIT; I have to convert the data with force_encoding to CP1252 (not CP850!).
But Cyrillic and Chinese are still not possible.
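The conversion for the ODBC result looks roughly like the sketch below. The `raw` value is a hypothetical stand-in for what the driver returns, assuming it delivers CP1252 bytes tagged as binary.

```ruby
# Hypothetical raw value: CP1252 bytes labelled ASCII-8BIT,
# as an ODBC driver might hand them over (0xFC is "ü" in CP1252).
raw = "(2 St\xFCck) f\xFCr".dup.force_encoding('ASCII-8BIT')

# force_encoding only relabels the bytes without changing them;
# encode then transcodes the relabelled string to UTF-8.
text = raw.dup.force_encoding('CP1252').encode('UTF-8')

puts text.encoding
puts text.valid_encoding?
```

This only helps for characters that exist in CP1252. Presumably the Cyrillic and Chinese characters have already been replaced by '?' before the bytes reach Ruby, so no relabelling can bring them back.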
What I tried already:
- The MySQL adapter seems to have an encoding option, but with MSSQL I detected no effect.
- I did similar tests with SQLite and Sequel and had no problem with Unicode.
- I installed SQLNCLI10.dll and used it as the provider. But I get an "Invalid connection string attribute" error (same with sqlncli).
So my closing question: How can I read UTF-8 data in MS-SQL via Ruby and Sequel?
My environment:
Client:
- Windows 7
- Ruby 1.9.2
- sequel-3.33.0
Database:
- SQL Server 2005
- Database has collation Latin1_General_CI_AS
After preparing my question I found a solution. I will post it as an answer.
But I still hope there is a better way.
If you can avoid it, you really don’t want to use the ado adapter (it’s OK for read-only workloads, but I wouldn’t recommend it for other workloads). I would try the tinytds adapter, as I believe that will handle encodings properly, and it defaults to UTF-8.
Sequel itself does not do any transcoding; it leaves the handling of encodings to the lower-level driver.
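A connection through the tinytds adapter might look like the following sketch. The server name, database name and credentials are placeholders, and I am assuming the `:encoding` option is passed through to TinyTDS (where it defaults to UTF-8 anyway); the question uses SSO, which this sketch does not cover. The Sequel calls are left as comments because they need the sequel and tiny_tds gems plus a reachable server.

```ruby
# Hypothetical connection data, mirroring the connectiondata hash
# from the question; all values here are placeholders.
connectiondata = { :server => 'dbserver', :database => 'mydb' }

tinytds_options = {
  :adapter  => 'tinytds',
  :host     => connectiondata[:server],
  :database => connectiondata[:database],
  :user     => 'placeholder_user',     # assumption: SQL login instead of SSO
  :password => 'placeholder_password',
  :encoding => 'UTF-8'                 # assumption: handed on to TinyTDS
}

# With the sequel and tiny_tds gems installed:
#   require 'sequel'
#   db = Sequel.connect(tinytds_options)
#   db[:TEXTE].filter(:language => 'EN').each do |row|
#     row.each { |key, val| puts "#{val.encoding}: #{val.inspect}" }
#   end
```

If the driver handles encodings as expected, the values should arrive tagged as UTF-8, and no force_encoding or encode step should be needed on the Ruby side.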