The Archive Base


Editorial Team
Asked: May 25, 2026


I have the following task: Build a personal dictionary for Chinese characters. Users choose single Chinese characters from a list. The software then goes through a list of combinations of characters and filters out all that contain characters that are not in the user's list of single characters. So if the user studied 1 (一) and 10 (十), then 11 (十一) should be shown, but not 12 (十二).

The next issue is that there are about 12k single characters and 100k combinations, so the whole list can become very long. MySQL does not seem to be able to do proper REGEX matching with Unicode characters, whereas PHP can. When I run a MySQL query (see below), I get a lot of false positives and have to filter the results again with PHP afterwards. The whole thing takes a lot of time.

I now have a sample list of 180 single characters that are matched in a SQL regex as below. The SQL returns over 30,000 combinations, and the call takes about 6 seconds on the machine I am running on. When I check the results with PHP afterwards, only 1182 combinations remain. That’s a lot of false positives. On top of that, checking the results takes another couple of seconds, and with each single character I add to the list, the time increases by about half a second. A more efficient method is urgently needed.

To tackle the issue, I first need to figure out why MySQL has so many false positives:

If I do regular expressions with PHP, I use /regex/u to indicate that the subject is Unicode, and this gives me correct results.

In MySQL, however, I do not know how to set such a flag. All REGEXP 'regex' results come back the same way as if I had used PHP preg_match('/regex/', $subject) instead of /regex/u.

I tried changing the collation of the result to various utf8_* collations, but that did not change the result. Adding a fulltext index to the table did not do anything either.

Here is a test function that I wrote to highlight the issue. If you have any other ideas for checks to build in there to drill down on the problem, please tell me.

$db = mysql_connect('localhost', 'kanji', '************');
$link = mysql_select_db('kanji_data', $db);

// Force UTF-8 on every leg of the connection.
mysql_query('SET NAMES utf8');
mysql_query('SET character_set_client=utf8');
mysql_query('SET character_set_connection=utf8');
mysql_query('SET character_set_results=utf8');
mysql_query('SET collation_connection=utf8_general_ci');
mysql_set_charset('utf8');

echo '<pre>debug: encoding=' . mysql_client_encoding() . '</pre>';

// Characters the user knows, joined as regex alternatives.
$string = '三|二|四|一|五';
// MySQL length() counts bytes; it is selected for comparison with mb_strlen() below.
$sql = "SELECT simplified, length(simplified), searchindex FROM chinese WHERE strlen>0 AND simplified REGEXP '($string)+';";
$sql_encoding = mb_detect_encoding($sql);
echo '<pre>debug: sql string encoding: ' . $sql_encoding . '</pre>';
echo '<pre>debug: sql string: ' . $sql . '</pre>';
$rst = mysql_query($sql);
echo mysql_errno($db) . ": " . mysql_error($db) . "\n";

while ($row = mysql_fetch_array($rst, MYSQL_NUM)) {
    // Character count as PHP sees it, assuming the result is UTF-8.
    $len = mb_strlen($row[0], 'UTF-8');
    $result_encoding = mb_detect_encoding($row[0]);
    // Re-check the row in PHP with the /u flag; rows that fail are false positives.
    $pattern = "/^(三|二|四|一|五)+$/u";
    preg_match($pattern, $row[0], $matches);
    if (count($matches) == 0) {
        echo "ERROR: ";
    }
    echo 'string: ' . $row[0] . ' (' . $row[1] . ' long mysql, ' . $len . ' long php, encoding: ' . $result_encoding . ')' . $row[2] . "<br>\n\n\n";
}

The result of the function can be seen on this website.

If I am doing something completely wrong to achieve the required result, I am also happy to tackle this one differently.


1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 8:48 pm

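    The issue at hand is that MySQL's REGEXP operator does not handle Unicode characters properly. It matches byte by byte, whereas a UTF-8 character is a group of several bytes. This applies to the regex engine in MySQL versions before 8.0; MySQL 8.0 replaced it with a Unicode-aware one. With the older engine there is no real solution, only workarounds.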
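
To make the byte-versus-character distinction concrete, here is a small PHP sketch. It only illustrates how byte-wise matching sees a multi-byte character, using PHP's own regex functions with and without the /u flag; it does not run MySQL's engine. It assumes PHP 7 or later (for the "\u{...}" escape) and the mbstring extension, and the variable names are made up for the example.

```php
<?php
// U+4E00 (一) is encoded as three bytes in UTF-8.
$char = "\u{4E00}";

// strlen() counts bytes, mb_strlen() counts characters.
$byteCount = strlen($char);
$charCount = mb_strlen($char, 'UTF-8');

// Without /u, "." consumes one byte, so three dots are needed to span the character.
$bytewise = preg_match('/^...$/', $char);

// With /u, "." consumes one whole code point.
$unicode = preg_match('/^.$/u', $char);

printf("bytes=%d chars=%d bytewise=%d unicode=%d\n", $byteCount, $charCount, $bytewise, $unicode);
```

A byte-wise engine therefore treats a single Chinese character as a sequence of three units, which is why constructs such as "." or character classes can behave unexpectedly on multi-byte text.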

    One workaround that worked for me was to index every character occurrence in a separate table and then run the checks against that index instead of the original text.
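
A minimal sketch of what such an index table might look like follows. It is an illustration under assumptions, not the exact schema used above: the table and column names (combo, combo_char, ch) are hypothetical, and an in-memory SQLite database accessed through PDO stands in for MySQL so that the example is self-contained (it needs the pdo_sqlite extension). The idea is to store one row per distinct character of each combination and then keep only the combinations that have no indexed character outside the user's list.

```php
<?php
// Hypothetical schema; SQLite in memory stands in for MySQL here.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE combo (id INTEGER PRIMARY KEY, simplified TEXT NOT NULL)');
$pdo->exec('CREATE TABLE combo_char (
    combo_id INTEGER NOT NULL,
    ch TEXT NOT NULL,
    PRIMARY KEY (combo_id, ch)
)');

$insertCombo = $pdo->prepare('INSERT INTO combo (simplified) VALUES (?)');
$insertChar  = $pdo->prepare('INSERT INTO combo_char (combo_id, ch) VALUES (?, ?)');

// Build the index once: split each combination into code points using /u.
foreach (['十一', '十二', '一', '二十一'] as $word) {
    $insertCombo->execute([$word]);
    $id = (int) $pdo->lastInsertId();
    $chars = array_unique(preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY));
    foreach ($chars as $ch) {
        $insertChar->execute([$id, $ch]);
    }
}

// Characters the user has studied.
$known = ['一', '十'];
$marks = implode(',', array_fill(0, count($known), '?'));

// Keep a combination only if none of its indexed characters is unknown.
$sql = "SELECT c.simplified
        FROM combo c
        WHERE NOT EXISTS (
            SELECT 1 FROM combo_char cc
            WHERE cc.combo_id = c.id AND cc.ch NOT IN ($marks)
        )
        ORDER BY c.id";
$stmt = $pdo->prepare($sql);
$stmt->execute($known);
$matches = $stmt->fetchAll(PDO::FETCH_COLUMN);

print_r($matches); // with the sample data: 十一 and 一
```

Because the comparison is plain equality on whole characters, no regular expression runs on the database side. When carrying this over to MySQL, a binary collation on the character column is probably the safer choice so that distinct characters are never treated as equal.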


© 2021 The Archive Base. All Rights Reserved