I am trying to read characters from a file and after removing punctuations. I

Question

Asked: June 7, 2026

I am trying to read characters from a file and, after removing punctuation, store the words in an array and finally write them to another file. The contents of the file are:

“यौ ता बाबू उदयभाहू उपेक्षा औंर अपमान्नकीपीड््ा ढोये जैसेतैस्ये वहबाबाके आश्रम म्पें पहैच गया ।
बाबा मान्नो उसी की प्रतीक्षा म्पें वैठे थे । वह ज्योही दण्डवत की मुदा म्पें हुभ्रा त्योंही
बाबा का गभ्रीर स्वर उसके कानों म्पे टकराया ‘ आभ्रो, ञैं तुम्हारे लिए ही बैठा हूें । ‘
अमित न्ने मस्तक ऊैंचा उठाया औंर एकाम्र भाव न्से बाबा को देखता रहा । बाबा
के पास वह अनेकों बार आ चुका था परन्तु. आज जैसी व्यथा, थकान्न औंर प्तानता
इससे दूर्व नहीं थी आदमी कभ्रीकभी इतना टूट ञाता ड़ँ कि ठसे अपने अस्तिल्द
के प्रति भ्री शंका होन्ने लगती न्है वह अनेक विचारों म्पें खो गया उसके नेत्र बाबा
कौ देख रहे थे परन्तु उस्यका मन कहीं औंर भ्रटक रद्दा था ।”……..

I first tried to read these characters (Hindi, UTF-8) with old Turbo C++, using the plain char data type.

The program compiled, but the contents were not written to the file properly.
Then I built the same code in Visual C++ and got this error:

"Debug assertion failed ... unsigned(c+1) <=256"

Next I tried the wide character data type, using the <wchar.h> and <cwchar> headers, wchar_t, and the other wide character functions, but the output is still not correct: “��त �ྤ��௤ྤ�”

Is there an alternative or some other method to solve this problem?

Please answer with a complete code segment, and also tell me what the alternative to the getline function is for wchar_t. This is what I have tried:

#include <sstream>
#include <iostream>
#include <fstream>
#include <ctype.h>
#include <string>
#include <stdio.h>
#include <conio.h>
#include <istream>
#include <vector>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <cwchar>
#include <locale.h>
using namespace std;
unsigned char line[1000],storech[2000],storech1[20000];
wchar_t word[50];
std::vector< wchar_t* > storewrd;

void main()
{
    FILE * file3 = fopen("H:\\myfile.txt" , "w");
    cout << "check" << endl;
    FILE *stream;
    stream = fopen( "H:\\ocr.txt", "r" );
    setlocale(LC_ALL,"");
    int ch;
    int test;
    wchar_t temp1;
    wchar_t buffer[500];
    wchar_t temp[500];

    int x=0,j=0;
    do
    {
        int loop = 0;
        ch = fgetwc(stream);

        //read word
        while( (ch != '\n') && (ch != WEOF) )
        {
            buffer[loop] = ch;
            loop++;

            test = fgetwc(stream);
            temp1 = (wchar_t) test;
            if(!iswpunct(test))
                fputwc( test , file3);
            wcout << temp1 << "  ";
        }

        int t;
        if (ch!= WEOF)
        {
            for(t=0;t<loop;t++)
            {
                temp[t] = buffer[t];
            }
            temp[loop++] = '\0';

            j++;
            //cout << buffer[loop] << "  ";
        }
    }while(ch != WEOF);

    cout << "check";

    _getch();
}


1 Answer

Editorial Team · Answer 1 · 2026-06-07T10:59:10+00:00

It’s not really clear to me what you’re trying to do: where did the
assertion failure occur? How are you trying to determine whether the
characters are punctuation or not?

UTF-8 is a multibyte encoding, which means that the single-byte
functions like ispunct don’t work on it. It is a variable-length
encoding, however, and all of the characters in the original ASCII code
set have single-byte encodings. If the only punctuation you are
concerned with is in the original ASCII set, you can
“cheat” a bit, and use something like:

if ( (ch & 0x80) == 0 && ispunct( ch ) ) {
    //  is ASCII punctuation
} else {
    //  is something else
}

I put “cheat” in quotes, because one of the goals of Unicode
and UTF-8 is that code that looks for things like ASCII punctuation
should work unchanged.
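
As a rough sketch of how that byte-level check could be applied to a whole line, the function below copies a UTF-8 string and drops only the ASCII punctuation. The name strip_ascii_punct is made up for this illustration and is not part of any library. It assumes the default "C" locale, and it deliberately leaves non-ASCII punctuation, such as the Devanagari danda (।) in the sample text, untouched.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not a library function): remove ASCII punctuation
// from a UTF-8 encoded string by inspecting one byte at a time.
std::string strip_ascii_punct(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char byte : utf8) {
        // Bytes with the high bit set are part of a multibyte sequence,
        // so they are never passed to ispunct and are copied unchanged.
        const bool is_ascii = (byte & 0x80) == 0;
        if (is_ascii && std::ispunct(byte)) {
            continue;  // drop ASCII punctuation such as ',' or '.'
        }
        out.push_back(static_cast<char>(byte));
    }
    return out;
}
```

Iterating over unsigned char values matters here: passing a negative char to ispunct is undefined behavior, which is most likely what triggers the Visual C++ debug assertion quoted in the question.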

If you need to recognize more than just ASCII punctuation (e.g. things
like «, ¿ or —), and you want to use wchar_t
(which is usually, but not always UTF-16 or UTF-32), and the file is
UTF-8, you’ll need to use an appropriate locale which does the code
translation. In this case, you should definitely use iostream, and
not C style IO; iostream will allow you to imbue the stream with the
appropriate locale, and C++ locales will allow you to create locales on
the fly, by changing a single facet (codecvt, in this case) from
another locale (probably the global one). (Under Linux, the global
locale, particularly in non-English speaking areas, is often a UTF-8
locale, which can be used directly. Under Windows, the default locale
normally uses a legacy code page rather than UTF-8, so I would not
expect it to translate UTF-8 correctly.)

If you don’t want to get involved with locales, read your UTF-8 directly
into a char buffer, and use the iconv library or something similar to
translate it within your program.

Be aware, however, that there might be some rare punctuation outside of
the basic plane, which will be encoded using two surrogate characters in
UTF-16; iswpunct will not work for these if your wchar_t uses UTF-16
(Windows and AIX). (Most of the characters outside the basic plane are
CJK or from historic scripts not used today, so this might not be an
issue for you.)
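
One way the locale-based approach might look is sketched below, assuming a standard library that still ships std::codecvt_utf8 (it is deprecated since C++17 but widely available). The names read_utf8_lines, is_punct and split_words are invented for this example. Because iswpunct is locale-dependent for non-ASCII characters, the sketch only trusts it for ASCII and lists the Devanagari danda and double danda explicitly; which characters count as punctuation is an assumption you would adjust for your own text.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <cwctype>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Read a UTF-8 file line by line into wide strings. The stream is imbued
// with a locale whose codecvt facet decodes UTF-8 before the file is opened.
std::vector<std::wstring> read_utf8_lines(const std::string& path)
{
    std::wifstream in;
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(in, line)) {  // std::getline also works on wide streams
        lines.push_back(line);
    }
    return lines;
}

// Assumption for this sketch: punctuation means ASCII punctuation plus the
// Devanagari danda (U+0964) and double danda (U+0965).
bool is_punct(wchar_t ch)
{
    if (ch == 0x0964 || ch == 0x0965) {
        return true;
    }
    return ch < 0x80 && std::iswpunct(static_cast<std::wint_t>(ch)) != 0;
}

// Drop punctuation and split the remaining text on whitespace.
std::vector<std::wstring> split_words(const std::wstring& line)
{
    std::vector<std::wstring> words;
    std::wstring current;
    for (wchar_t ch : line) {
        if (is_punct(ch)) {
            continue;
        }
        if (std::iswspace(static_cast<std::wint_t>(ch))) {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(ch);
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

This also covers the getline part of the question: std::getline is a template, so the same call reads into a std::wstring from a std::wifstream. Writing the words back out as UTF-8 would use a std::wofstream imbued with the same kind of locale before it is opened.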
