The Archive Base

Editorial Team
Asked: May 27, 2026

I am inserting soft hyphens into long words programmatically, and am having problems with unusual characters, specifically: ■

Any word over 10 characters gets the soft hyphen treatment. Words are defined with a regex: [A-Za-z0-9,.]+ (to include long numbers). If I split a string containing two of the above Unicode characters with that regex, I get a ‘word’ like this: ■■

My script then goes through each word, measures the length (mb_strlen($word, 'UTF-8')), and if it is over an arbitrary number of characters, loops through the letters and inserts soft hyphens all over the place (every third character, not in the last five characters).

With the ■■, the word length is coming out as high enough to trigger the replacement (10). So soft hyphens are inserted, but they are inserted within the characters. So what I get out is something like:

�­�■

In the database, these ■ characters are being stored (in a json_encoded block) as “\u2002”, so I can see where the string length is coming from. What I need is a way to identify these characters, so I can avoid adding soft hyphens to words that contain them. Any ideas, anyone?

(Either that, or a way to measure the length of a string, counting these as single characters, and then a way to split that string into characters without splitting it part-way through a multi-byte character.)

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 6:00 pm

    With the same caveats as listed in the comments about guessing without seeing the code:

    “mb_strlen($word, 'UTF-8'), and if it is over an arbitrary number of characters, loops through the letters”

    I suspect you are actually looping through bytes. This is what will happen if you use array-access notation on a string.

    When you are using a multibyte encoding like UTF-8, a letter (or more generally ‘character’) may take up more than one byte of storage. If you insert or delete in the middle of a byte sequence you will get mangled results.

    This is why you must use mb_strlen and not plain old strlen. Some languages have a native Unicode string type where each item is a character, but in PHP strings are completely byte-based, and if you want to interact with them in a character-by-character way you must use the mbstring functions. In particular, to read a single character from a string you use mb_substr, and you’d loop your index from 0 up to, but not including, mb_strlen.
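To make the byte/character distinction concrete, here is a minimal sketch. It assumes PHP 7 or later with the mbstring extension loaded. The helper name `utf8_chars` and the sample word (two U+2002 characters, as mentioned in the question) are chosen only for illustration.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters
// using only mb_strlen() and mb_substr(), so no multi-byte sequence is cut.
function utf8_chars(string $s): array
{
    $chars = [];
    $length = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($s, $i, 1, 'UTF-8');
    }
    return $chars;
}

// Two U+2002 EN SPACE characters, written as raw UTF-8 bytes.
$word = "\xE2\x80\x82\xE2\x80\x82";

echo strlen($word), "\n";              // 6: bytes
echo mb_strlen($word, 'UTF-8'), "\n";  // 2: characters
echo count(utf8_chars($word)), "\n";   // 2: one array entry per character
```

Indexing the same string with `$word[$i]` would instead walk over the six individual bytes, which is where the mangled output comes from.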

    It would probably be simpler to take the matched word and use a regular expression replacement to insert the soft hyphen between each sequence. You can get multibyte string support for regex by using the u flag. (This only works for UTF-8, but UTF-8 is the only multibyte encoding you’d ever actually want to use.)

    const SHY= "\xC2\xAD"; // U+00AD Soft Hyphen encoded as UTF-8
    // $0 is the whole match; the pattern has no capture group, so $1 would be empty
    $wrappableword= preg_replace('/.{3}\B/u', '$0'.SHY, $longword);
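As a rough sketch of how this replacement might be applied to a whole string, the following wraps it in `preg_replace_callback` so that only long words are touched. The function name `soften_long_words`, the word pattern `[\pL\pN]+`, and the default threshold of 10 are assumptions for illustration, not the asker's actual code. It uses `$0` (the whole match) as the replacement reference and does not implement the "not in the last five characters" rule from the question.

```php
<?php
const SHY = "\xC2\xAD"; // U+00AD SOFT HYPHEN encoded as UTF-8

// Hypothetical helper: add a soft hyphen after every third character of
// any run of letters/digits longer than $minLength characters.
function soften_long_words(string $text, int $minLength = 10): string
{
    return preg_replace_callback(
        '/[\pL\pN]+/u',
        function (array $m) use ($minLength): string {
            $word = $m[0];
            if (mb_strlen($word, 'UTF-8') <= $minLength) {
                return $word; // short enough, leave untouched
            }
            // Three characters not followed by a word boundary, then SHY.
            // Fall back to the unchanged word if PCRE reports an error.
            return preg_replace('/.{3}\B/u', '$0' . SHY, $word) ?? $word;
        },
        $text
    ) ?? $text;
}

$out = soften_long_words('an extraordinarily long word');
// Show the soft hyphens as visible dashes for inspection.
echo str_replace(SHY, '-', $out), "\n"; // an ext-rao-rdi-nar-ily long word
```

Because both patterns use the u flag, the `.` matches a whole UTF-8 character rather than a single byte, so the soft hyphen can never land inside a multi-byte sequence.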
    