I am trying to read characters from a file and, after removing punctuation, store the words in an array and finally write them to another file. The contents of the file are:
“यौ ता बाबू उदयभाहू उपेक्षा औंर अपमान्नकीपीड््ा ढोये जैसेतैस्ये वहबाबाके आश्रम म्पें पहैच गया ।
बाबा मान्नो उसी की प्रतीक्षा म्पें वैठे थे । वह ज्योही दण्डवत की मुदा म्पें हुभ्रा त्योंही
बाबा का गभ्रीर स्वर उसके कानों म्पे टकराया ‘ आभ्रो, ञैं तुम्हारे लिए ही बैठा हूें । ‘
अमित न्ने मस्तक ऊैंचा उठाया औंर एकाम्र भाव न्से बाबा को देखता रहा । बाबा
के पास वह अनेकों बार आ चुका था परन्तु. आज जैसी व्यथा, थकान्न औंर प्तानता
इससे दूर्व नहीं थी आदमी कभ्रीकभी इतना टूट ञाता ड़ँ कि ठसे अपने अस्तिल्द
के प्रति भ्री शंका होन्ने लगती न्है वह अनेक विचारों म्पें खो गया उसके नेत्र बाबा
कौ देख रहे थे परन्तु उस्यका मन कहीं औंर भ्रटक रद्दा था ।”……..
I tried to read these characters (Hindi, UTF-8) using old Turbo C++, with the plain char data type.
The program compiled, but the contents were not written to the file properly.
Then I used the same code in Visual C++ and got this error:
"Debug assertion failed ... unsigned(c+1) <=256"
Next I tried the wide character data type, using the <wchar.h> and <cwchar> headers, wchar_t, and the wide character functions, but the output is still wrong: “���त �ྤ���ྤ�”
Is there an alternative, or any other method, to solve this problem?
Please answer with a complete code segment, and also tell me what the alternative to the getline function is for wchar_t. This is what I have tried:
#include <sstream>
#include <iostream>
#include <fstream>
#include <istream>
#include <string>
#include <vector>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <conio.h>
#include <wchar.h>
#include <cwchar>
#include <locale.h>
using namespace std;
unsigned char line[1000],storech[2000],storech1[20000];
wchar_t word[50];
std::vector< wchar_t* > storewrd;
void main()
{
    FILE * file3 = fopen("H:\\myfile.txt" , "w");
    cout << "check" << endl;
    FILE *stream;
    stream = fopen( "H:\\ocr.txt", "r" );
    setlocale(LC_ALL,"");
    int ch;
    int test;
    wchar_t temp1;
    wchar_t buffer[500];
    wchar_t temp[500];
    int x=0,j=0;
    do
    {
        int loop = 0;
        ch = fgetwc(stream);
        //read word
        while( (ch != '\n') && (ch != WEOF) )
        {
            buffer[loop] = ch;
            loop++;
            test = fgetwc(stream);
            temp1 = (wchar_t) test;
            if(!iswpunct(test))
                fputwc( test , file3);
            wcout << temp1 << " ";
        }
        int t;
        if (ch!= WEOF)
        {
            for(t=0;t<loop;t++)
            {
                temp[t] = buffer[t];
            }
            temp[loop++] = '\0';
            j++;
            //cout << buffer[loop] << " ";
        }
    }while(ch != WEOF);
    cout << "check";
    _getch();
}
It’s not really clear to me what you’re trying to do: where did the
assertion failure occur? How are you trying to determine whether the
characters are punctuation or not?
UTF-8 is a multibyte encoding, which means that the single byte
functions like ispunct don’t work on it. It is a variable length
encoding, however, and all of the characters in the original ASCII code
set have single byte encodings. If the only punctuation you are
concerned with are characters in the original ASCII, you can
“cheat” a bit, and use something like:
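A minimal sketch of that idea, assuming the file has been read as raw bytes into a std::string. The name stripAsciiPunct is made up for illustration. Note the cast to unsigned char: passing a negative char to ispunct is undefined behaviour, and that is a plausible source of an assertion like the one quoted in the question.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: removes ASCII punctuation from a UTF-8 byte string.
// Every byte of a multibyte UTF-8 sequence is >= 0x80, so such bytes can
// never be mistaken for an ASCII character and are copied through untouched.
std::string stripAsciiPunct(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (char c : utf8) {
        // Convert to unsigned char before calling any <cctype> function.
        unsigned char uc = static_cast<unsigned char>(c);
        if (uc < 0x80 && std::ispunct(uc))
            continue;               // drop ASCII punctuation
        out.push_back(c);           // keep everything else byte for byte
    }
    return out;
}
```

This only removes characters like the comma and the full stop. Non-ASCII punctuation, such as the danda (।) or the curly quotes in the sample text, passes through unchanged.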
I put “cheat” in quotes, because one of the goals of Unicode
and UTF-8 is that code that looks for things like ASCII punctuation
should work unchanged.
If you need to recognize more than just ASCII punctuation (e.g. things
like «, ¿ or —), and you want to use wchar_t (which is usually, but not
always UTF-16 or UTF-32), and the file is
UTF-8, you’ll need to use an appropriate locale which does the code
translation. In this case, you should definitely use iostream, and
not C style IO; iostream will allow you to imbue the stream with the
appropriate locale, and C++ locales will allow you to create locales on
the fly, by changing a single facet (codecvt, in this case) from
another locale (probably the global one). (Under Linux, the global
locale, particularly in non-English speaking areas, is often a UTF-8
locale, which can be used directly. Under Windows, I would expect it to
be a UTF-16 locale, which will not translate UTF-8 correctly.) If you
don’t want to get involved with locales, read your UTF-8 directly into a
char buffer, and use the iconv library or something similar to
translate it within your program. Be aware, however, that there might
be some rare punctuation outside of the basic plane, which will be
encoded using two surrogate characters in UTF-16; iswpunct will not
work for these if your wchar_t uses UTF-16 (Windows and AIX). (Most
of the characters outside the basic plane are CJK or from historic
scripts not used today, so this might not be an issue for you.)
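As a rough sketch of the iostream route, the code below builds a locale on the fly by swapping in a UTF-8 codecvt facet, imbues the file streams with it, and reads lines with std::getline, which works on wide streams and std::wstring just as it does on narrow ones. The names isPunctuation, readWords and writeWords are made up for illustration, and the list of non-ASCII punctuation is a hand-picked assumption, used because what iswpunct reports outside ASCII depends on the active locale. std::codecvt_utf8 has been deprecated since C++17, although common standard libraries still ship it.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <cwctype>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical predicate: ASCII punctuation plus a few hand-picked code points.
bool isPunctuation(wchar_t c)
{
    if (c >= 0 && c < 0x80)
        return std::iswpunct(static_cast<std::wint_t>(c)) != 0;
    switch (c) {
    case 0x00AB: case 0x00BB:   // « »
    case 0x00BF:                // ¿
    case 0x0964: case 0x0965:   // Devanagari danda and double danda
    case 0x2014:                // em dash
    case 0x2018: case 0x2019:   // single curly quotes
    case 0x201C: case 0x201D:   // double curly quotes
    case 0x2026:                // ellipsis
        return true;
    default:
        return false;
    }
}

// Copy of the global locale with its codecvt facet replaced by a UTF-8 one.
// The locale takes ownership of the facet.
std::locale utf8Locale()
{
    return std::locale(std::locale(), new std::codecvt_utf8<wchar_t>);
}

// Reads a UTF-8 file line by line and returns its words without punctuation.
std::vector<std::wstring> readWords(const std::string& path)
{
    std::wifstream in;
    in.imbue(utf8Locale());     // imbue before opening, so no I/O has happened yet
    in.open(path);

    std::vector<std::wstring> words;
    std::wstring line;
    while (std::getline(in, line)) {
        // Turn punctuation into spaces so it also separates words.
        for (wchar_t& c : line)
            if (isPunctuation(c))
                c = L' ';
        std::wistringstream split(line);
        std::wstring word;
        while (split >> word)
            words.push_back(word);
    }
    return words;
}

// Writes one word per line, encoded as UTF-8.
void writeWords(const std::vector<std::wstring>& words, const std::string& path)
{
    std::wofstream out;
    out.imbue(utf8Locale());
    out.open(path);
    for (const std::wstring& w : words)
        out << w << L'\n';
}
```

With the paths from the question, this would be called as writeWords(readWords("H:\\ocr.txt"), "H:\\myfile.txt"). Where wchar_t is 16 bits, as on Windows, I would not expect this facet to handle characters outside the basic plane, which is the same limitation as described above.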