Editorial Team
Asked: May 23, 2026 (2026-05-23T09:03:23+00:00)

I’m trying to use the Japanese morphological analyzer MeCab in a C# program (Visual Studio 2010 Express, Windows 7), and something’s going wrong with the encoding. If my input (pasted into a textbox) is this:

一方、広義の「ネコ」は、ネコ類(ネコ科動物)の一部、あるいはその全ての獣を指す包括的名称を指す。

Then my output (in another textbox) looks like this:

?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
(   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
)   å詞,サ変接続,*,*,*,*,*
?   å詞,サ変接続,*,*,*,*,*
?????????????????????????   å詞,サ変接続,*,*,*,*,*
EOS

I would guess that that’s text in some other encoding being mistaken for UTF-8-encoded text. But assuming that it’s EUC-JP and using Encoding.Convert to turn it into UTF-8 doesn’t change the output; assuming that it’s Shift-JIS and doing the same gives different gibberish. Also, while it’s definitely processing the text – that’s how MeCab output is supposed to be formatted – it doesn’t appear to be interpreting the input as UTF-8, either. If it were doing so, there wouldn’t be all those identical lines in the output starting with one-character “compounds,” which it’s clearly unable to identify.

I get yet another set of gibberish when I run the sentence through MeCab’s command-line tool. But again, there is a column of single question marks and parentheses down the left side, so the problem isn’t just that the Windows command line lacks fonts with Japanese characters; the input simply isn’t being read as UTF-8. (I did install MeCab in UTF-8 mode.)

The relevant parts of the code look like this:

[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static IntPtr mecab_new2(string arg);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.AnsiBStr)]
private extern static string mecab_sparse_tostr(IntPtr m, string str);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static void mecab_destroy(IntPtr m);

private string meCabParse(string jpnText)
{
    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    mecab_destroy(mecab);
    return parsedText;
}

(In terms of fiddling with plausible-looking things to see if they make a difference, I’ve tried switching “UnmanagedType.AnsiBStr” to “UnmanagedType.BStr,” which gives the error “AccessViolationException was unhandled,” and adding “CharSet=CharSet.Unicode” to the DllImport parameters, which turned the output into just “EOS”.)
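Both symptoms fit how P/Invoke marshals a `string` parameter. With the default (ANSI) setting, characters the system code page cannot represent are replaced by `?`. With `CharSet.Unicode`, the native side sees UTF-16LE bytes, and the first character of the sample sentence, 一 (U+4E00), begins with a zero byte, which a C library reads as an empty string. The sketch below only illustrates this and does not call MeCab. `MarshalingSketch` and its methods are hypothetical names, and Latin-1 (code page 28591) stands in for whatever ANSI code page the machine actually uses, which is an assumption.

```csharp
using System;
using System.Text;

public static class MarshalingSketch
{
    // Assumption: Latin-1 as a stand-in for a Western ANSI code page.
    private static readonly Encoding AnsiStandIn = Encoding.GetEncoding(28591);

    // Roughly what native code receives for a string under default (ANSI) marshaling.
    public static byte[] AsAnsi(string text)
    {
        return AnsiStandIn.GetBytes(text);
    }

    // What native code receives under CharSet.Unicode: UTF-16LE code units.
    public static byte[] AsUtf16(string text)
    {
        return Encoding.Unicode.GetBytes(text);
    }

    // Length of the buffer when read as a C string, i.e. up to the first zero byte.
    public static int CStringLength(byte[] bytes)
    {
        int firstZero = Array.IndexOf(bytes, (byte)0);
        return firstZero < 0 ? bytes.Length : firstZero;
    }
}
```

Under these assumptions, `AsAnsi("\u4E00\u65B9")` (the text 一方) yields two `?` bytes, and `CStringLength(AsUtf16("\u4E00\u65B9"))` is 0, which would match the column of question marks and the bare `EOS` respectively.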

This is how I’ve been doing the conversion:

// 65001 = UTF-8 codepage, 20932 = EUC-JP codepage
private string convertEncoding(string sourceString, int sourceCodepage, int targetCodepage)
{
    Encoding sourceEncoding = Encoding.GetEncoding(sourceCodepage); 
    Encoding targetEncoding = Encoding.GetEncoding(targetCodepage);

    // convert source string into byte array
    byte[] sourceBytes = sourceEncoding.GetBytes(sourceString);

    // convert those bytes into target encoding
    byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);

    // byte array to char array
    char[] targetChars = new char[targetEncoding.GetCharCount(targetBytes, 0, targetBytes.Length)];

    // char array to target-encoded string
    targetEncoding.GetChars(targetBytes, 0, targetBytes.Length, targetChars, 0);
    string targetString = new string(targetChars);

    return targetString;
}

private string meCabParse(string jpnText)
{
    // convert the text from the string from UTF-8 to EUC-JP
    jpnText = convertEncoding(jpnText, 65001, 20932);

    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    // annnd convert back to UTF-8
    parsedText = convertEncoding(parsedText, 20932, 65001);

    mecab_destroy(mecab);
    return parsedText;
}
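One likely reason the conversion has no visible effect: a .NET `string` is always UTF-16, so encoding it to bytes and then decoding those bytes with the same target encoding hands back the original text. A minimal sketch of that round trip follows. `ConversionSketch` is a hypothetical name, and the usage below passes UTF-16 or UTF-32 as the target purely because they are built into every .NET runtime (the assumption is that EUC-JP behaves the same way for text it can represent).

```csharp
using System;
using System.Text;

public static class ConversionSketch
{
    // Same steps as convertEncoding above, taking Encoding objects instead of code page numbers.
    public static string ConvertViaBytes(string source, Encoding from, Encoding to)
    {
        byte[] sourceBytes = from.GetBytes(source);
        byte[] targetBytes = Encoding.Convert(from, to, sourceBytes);
        // Decoding with the target encoding turns the bytes back into the same UTF-16 string.
        return to.GetString(targetBytes);
    }

    // An actual change of encoding ends at the byte array; those bytes are what differ.
    public static byte[] EncodeFor(string source, Encoding target)
    {
        return target.GetBytes(source);
    }
}
```

For example, `ConvertViaBytes("\u732B", Encoding.UTF8, Encoding.Unicode)` returns the input unchanged, whereas `EncodeFor("\u732B", Encoding.UTF8)` produces the three UTF-8 bytes that a native library would need to receive.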

Suggestions/taunts?

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 9:03 am

    I came across this thread while looking for a way to do the same thing. I used your code as a starting point, and this blog post to figure out how to marshal UTF-8 strings.

    The following code gives me properly encoded output:

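    using System;
    using System.Runtime.InteropServices;
    using System.Text;

    public class Mecab
    {
        // libmecab takes and returns plain char* buffers, so CharSet.Unicode is left out:
        // it would marshal string arguments as UTF-16.
        [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
        private extern static IntPtr mecab_new2(string arg);
        [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
        private extern static IntPtr mecab_sparse_tostr(IntPtr m, byte[] str);
        [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
        private extern static void mecab_destroy(IntPtr m);

        public static String Parse(String input)
        {
            IntPtr mecab = mecab_new2("");
            if (mecab == IntPtr.Zero)
                throw new InvalidOperationException("mecab_new2 failed");
            try
            {
                // Pass UTF-8 bytes with an explicit trailing '\0'; a byte[] is not null-terminated for us.
                IntPtr nativeStr = mecab_sparse_tostr(mecab, Encoding.UTF8.GetBytes(input + "\0"));
                if (nativeStr == IntPtr.Zero)
                    throw new InvalidOperationException("mecab_sparse_tostr failed");

                // Copy the result before mecab_destroy, since the buffer is owned by the mecab instance.
                // Dropping the last byte removes what is normally the newline after "EOS".
                int size = Math.Max(nativeArraySize(nativeStr) - 1, 0);
                byte[] data = new byte[size];
                Marshal.Copy(nativeStr, data, 0, size);
                return Encoding.UTF8.GetString(data);
            }
            finally
            {
                mecab_destroy(mecab);
            }
        }

        // Length in bytes of a null-terminated native string (like strlen).
        private static int nativeArraySize(IntPtr ptr)
        {
            int size = 0;
            while (Marshal.ReadByte(ptr, size) != 0)
                size++;

            return size;
        }
    }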
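
The two marshaling steps in the answer (sending null-terminated UTF-8 bytes, then reading a null-terminated UTF-8 buffer back from an `IntPtr`) can be tried without libmecab installed. The sketch below is only an illustration under that assumption: `Utf8Interop` and its methods are hypothetical names, and a block of unmanaged memory from `Marshal.AllocHGlobal` stands in for the buffer a native library would return.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8Interop
{
    // Encode a managed string as UTF-8 plus the trailing zero byte a C API expects.
    public static byte[] ToNullTerminatedUtf8(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] result = new byte[utf8.Length + 1];   // last element stays 0
        Array.Copy(utf8, result, utf8.Length);
        return result;
    }

    // Decode a null-terminated UTF-8 buffer that lives in unmanaged memory.
    public static string FromNullTerminatedUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero) return null;
        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0) length++;
        byte[] data = new byte[length];
        Marshal.Copy(ptr, data, 0, length);
        return Encoding.UTF8.GetString(data);
    }

    // Stand-in for a native call: copy the bytes to unmanaged memory and read them back.
    public static string RoundTrip(string text)
    {
        byte[] bytes = ToNullTerminatedUtf8(text);
        IntPtr buffer = Marshal.AllocHGlobal(bytes.Length);
        try
        {
            Marshal.Copy(bytes, 0, buffer, bytes.Length);
            return FromNullTerminatedUtf8(buffer);
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}
```

`RoundTrip` should return its input unchanged for text without embedded null characters. Newer runtimes also offer `Marshal.PtrToStringUTF8`, but it is not available on the .NET Framework version that ships with Visual Studio 2010, so the manual loop is kept here.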
    