I'm doing some data cleansing on some messy data which is being imported into

Question

Editorial Team

Asked: May 24, 2026 (2026-05-24T19:24:29+00:00)

I'm doing some data cleansing on some messy data which is being imported into MySQL.

The data contains ‘pseudo’ Unicode chars, which are actually embedded into the strings as ‘u00e9’ etc.

So one field might be ‘Jalostotitlu00e1n’.
I need to rip out that clumsy ‘u00e1’ and replace it with the corresponding UTF-8 character.

I could do this in MySQL, maybe using SUBSTRING and CHAR, but I'm preprocessing the data via PHP, so I could do it there as well.

I already know all about how to configure MySQL and PHP to work with UTF-8 data. The problem is really just in the source data I'm importing.

Thanks

1 Answer

Editorial Team · Answer 1 · 2026-05-24T19:24:30+00:00

There’s a way. Replace every uXXXX sequence with its HTML numeric entity and run the result through html_entity_decode().

I.e. echo html_entity_decode("Jalostotitl&#x00e1;n", ENT_QUOTES, 'UTF-8');

Every character in the form u1234 can be written in HTML as &#x1234;. But doing the replace reliably is quite hard, because there can be many false positives if no other character marks the beginning of a sequence: any ordinary "u" that happens to be followed by four hex digits will match as well. A simple regex could be

preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str)
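
Putting the two steps together, a minimal sketch might look like the following. The function name decode_pseudo_unicode is hypothetical and not part of PHP. Unlike the one-line version, this variant decodes only the matched sequence inside a callback, so entities that already exist in the data (such as &amp;) are left untouched; that is a design choice added here, not something the original answer specifies.

```php
<?php
// Hypothetical helper (name is an assumption): replaces bare "uXXXX"
// sequences with the UTF-8 character for that code point.
function decode_pseudo_unicode(string $str): string
{
    return preg_replace_callback(
        '/u([0-9a-fA-F]{4})/',
        function (array $m): string {
            // Build a numeric entity such as &#x00e1; and decode only that.
            return html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES, 'UTF-8');
        },
        $str
    );
}

echo decode_pseudo_unicode('Jalostotitlu00e1n'), "\n";
```

The false-positive caveat still applies: a word like "subdeacon" contains "ubdea", which matches the pattern and would be rewritten. If the source data only ever contains accented Latin characters, narrowing the pattern (for example to '/u(00[0-9a-fA-F]{2})/') should reduce accidental matches, at the cost of skipping code points above U+00FF.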
