I have to read a file containing Cyrillic text, randomly pick a random number of lines, and write the modified text to a different file. There is no problem with Latin letters, but with Cyrillic text I get rubbish. This is how I tried to do it.
Say the file input.txt contains:
ааааааа
ббббббб
ввввввв
I have to read it, and put every line into a vector:
vector<wstring> inputVector;
wstring inputString, result;
wifstream inputStream;
inputStream.open("input.txt");
while(!inputStream.eof())
{
    getline(inputStream, inputString);
    inputVector.push_back(inputString);
}
inputStream.close();
srand(time(NULL));
int numLines = rand() % inputVector.size();
for(int i = 0; i < numLines; i++)
{
    int randomLine = rand() % inputVector.size();
    result += inputVector[randomLine];
}
wofstream resultStream;
resultStream.open("result.txt");
resultStream << result;
resultStream.close();
So how can I work with Cyrillic text so that the output is readable, not just garbage symbols?
Because you saw something like ■a a a a a a a 1♦1♦1♦1♦1♦1♦1♦ 2♦2♦2♦2♦2♦2♦2♦ printed to the console, it appears that input.txt is encoded in a UTF-16 encoding, probably UTF-16 LE + BOM. You can use your original code if you change the encoding of the file to UTF-8.
The reason for using UTF-8 is that, regardless of the character type of the file stream, basic_fstream's underlying basic_filebuf uses a codecvt object to convert a stream of char objects to/from a stream of objects of the stream's character type; i.e. when reading, the char stream that is read from the file is converted to a wchar_t stream, and when writing, a wchar_t stream is converted to a char stream that is then written to the file. In the case of std::wifstream, the codecvt object is an instance of the standard std::codecvt<wchar_t, char, mbstate_t>, which generally converts UTF-8 to UTF-16.
As explained on the MSDN documentation page for basic_filebuf:
"Similarly, when reading a Unicode string (containing wchar_t characters), the basic_filebuf converts the ANSI string read from the file to the wchar_t string returned to getline and other read operations."
If you change the encoding of input.txt to UTF-8, your original program should work correctly.
For reference, this works for me:
Note that the encoding of result.txt will also be UTF-8 (generally).