I want to split a multilingual string into monolingual tokens using a regex.
For example, given this English-Arabic string:
‘his name was محمد, and his mother name was آمنه.’
The result should be:
- ‘his name was ‘
- ‘محمد,’
- ‘ and his mother name was ‘
- ‘آمنه.’
It’s not perfect (you definitely need to try it on some real-world examples to see if it fits), but it’s a start: split on a run of whitespace/punctuation if the preceding character is from the Arabic block and the following character is from the Basic Latin block (or vice versa).
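As a rough illustration of that idea, here is a minimal Python sketch. It is not the original snippet. Python's built-in `re` module has no `\p{...}` block syntax, so the Arabic block (U+0600 to U+06FF) and the Basic Latin block (U+0000 to U+007F) are spelled out as explicit ranges. The names `ARABIC`, `BASIC_LATIN`, `SEPARATOR`, `BOUNDARY` and `split_by_script` are made up for this example, and the punctuation set (ASCII punctuation plus three common Arabic marks) is an assumption you may need to widen.

```python
import re
import string

# Assumed character classes (hypothetical names, explicit ranges instead of \p{...}).
ARABIC = r"[\u0600-\u06FF]"        # Arabic block
BASIC_LATIN = r"[\u0000-\u007F]"   # Basic Latin block (ASCII)

# A run of whitespace and/or punctuation: ASCII punctuation plus the
# Arabic comma, semicolon and question mark (an assumption, extend as needed).
SEPARATOR = r"[\s" + re.escape(string.punctuation) + r"\u060C\u061B\u061F]+"

# Split only where the separator run sits between the two scripts.
BOUNDARY = re.compile(
    rf"(?<={ARABIC}){SEPARATOR}(?={BASIC_LATIN})"    # Arabic, then Latin
    rf"|(?<={BASIC_LATIN}){SEPARATOR}(?={ARABIC})"   # Latin, then Arabic
)

def split_by_script(text):
    """Split text at whitespace/punctuation runs that separate the two scripts."""
    return BOUNDARY.split(text)

sample = "his name was محمد, and his mother name was آمنه."
print(split_by_script(sample))
# ['his name was', 'محمد', 'and his mother name was', 'آمنه.']
```

Note that the separators at each boundary are consumed by the split, so the space and the comma between the two scripts do not appear in the tokens; the final full stop survives only because nothing Latin follows it. If the separators need to stay attached to a token, as in the expected output in the question, the pattern would have to be adjusted.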