I have a UTF-8 encoded file containing Arabic text and I have to search

Editorial Team

Asked: May 11, 2026 (2026-05-11T18:25:16+00:00)

I have a UTF-8 encoded file containing Arabic text and I have to search it.

My problem is the diacritics: how can I search while skipping them?

For example, if you load that text in Internet Explorer (after converting it to HTML, of course), IE skips those diacritics when you search.

Any help?

Edit 1: The search is simply performed by the following code:

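 var
   m1: TMemo;  // contains the UTF-8 data
   m2: TMemo;  // receives the results
   s: string;

 ...

   m2.Lines.BeginUpdate;
   try
     for s in m1.Lines do
       // Pos needs an exact match, so diacritics in s prevent a hit
       if Pos(eSearch.Text, s) > 0 then
         m2.Lines.Add(s);
   finally
     m2.Lines.EndUpdate;  // always re-enable repainting
   end;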

Edit 2: Example of the Unicode data:

قُلْ هُوَ اللَّهُ أَحَدٌ
If you search for only the letters without diacritics (قل), the word قُلْ won't be found.

1 Answer

Editorial Team · Answer 1 · 2026-05-11T18:25:17+00:00

I find that diacritics are not the only problem.

I would strip the diacritics with character replacements, replacing them with empty strings. I would also normalize the text so that ‘أ’, ‘إ’ and ‘آ’ are all converted to ‘ا’, and do the same for ى ئ ي ؤ و ة ه …
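As a rough illustration of the replacement idea, here is a minimal Delphi sketch, assuming a Unicode Delphi (2009 or later) where string is UTF-16. NormalizeArabic is a made-up helper name, and the set of code points it drops (U+064B to U+0652, U+0670 and the tatweel U+0640) is an assumption about what should count as a diacritic here. It only folds the three alif forms and leaves the other letters alone. The Arabic strings are written as #$ escapes so the encoding of the source file does not matter.

```pascal
program StripDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper, not an RTL routine: drops Arabic diacritics
// and folds the hamza/madda alif forms to a bare alif.
function NormalizeArabic(const S: string): string;
var
  C: Char;
begin
  Result := '';
  for C in S do
    case Ord(C) of
      // harakat, shadda, sukun (U+064B..U+0652), superscript alif, tatweel:
      // copy nothing, which removes the mark
      $064B..$0652, $0670, $0640: ;
      // alif with madda, hamza above, hamza below -> bare alif (U+0627)
      $0622, $0623, $0625: Result := Result + #$0627;
    else
      Result := Result + C;
    end;
end;

begin
  // query: qaf + lam, text: the same word with damma and sukun
  if Pos(NormalizeArabic(#$0642#$0644),
         NormalizeArabic(#$0642#$064F#$0644#$0652)) > 0 then
    Writeln('found');
end.
```

In the loop from the question, the test would then become Pos(NormalizeArabic(eSearch.Text), NormalizeArabic(s)) > 0, so both the query and the line are compared in the same stripped form.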

For search I’d also use a light stemmer such as the “Khoja stemmer” (Java source is available).

A more advanced way is to do it like TREC:

Remove punctuation.
Remove diacritics (mainly weak vowels). Most of the corpus did not contain weak vowels, but some of the dictionary entries did; removing them made everything consistent.
Remove non-letters.
Replace initial إ or أ with bare alif (ا).
Replace آ with ا.
Replace the sequence ىء with ئ.
Replace final ى with ي.
Replace final ة with ه.
Strip 6 prefixes from the beginnings of normalized words: the definite articles (ال، وال، بال، كال، فال) and و (and).
Strip 10 suffixes from the ends of words: ها، ان، ات، ون، ين، يه، ية، ه، ة، ي.
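The letter replacements and the prefix stripping from the list above could be sketched per word roughly as follows. This is only an illustration under stated assumptions: NormalizeWord is a hypothetical helper, it expects a single word that already has punctuation and diacritics removed, it skips the suffix step, and the MinStemLen guard is my own addition because the list gives no minimum length for what remains after stripping.

```pascal
program WordNormDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: applies the letter replacements and strips at most
// one prefix from a single, already diacritic-free word.
function NormalizeWord(const W: string): string;
const
  Alif = #$0627;
  MinStemLen = 2;  // assumption, not part of the list above
  // longer prefixes first, so that waw+alif+lam wins over a lone waw
  Prefixes: array[0..5] of string = (
    #$0648#$0627#$0644,  // waw + alif + lam
    #$0628#$0627#$0644,  // ba  + alif + lam
    #$0643#$0627#$0644,  // kaf + alif + lam
    #$0641#$0627#$0644,  // fa  + alif + lam
    #$0627#$0644,        // alif + lam
    #$0648);             // waw
var
  P: string;
begin
  Result := W;
  if Result = '' then
    Exit;
  // initial alif with hamza above or below -> bare alif
  if (Result[1] = #$0623) or (Result[1] = #$0625) then
    Result[1] := Alif;
  // alif with madda -> bare alif
  Result := StringReplace(Result, #$0622, Alif, [rfReplaceAll]);
  // alif maksura + hamza -> ya with hamza
  Result := StringReplace(Result, #$0649#$0621, #$0626, [rfReplaceAll]);
  // final alif maksura -> ya
  if Result[Length(Result)] = #$0649 then
    Result[Length(Result)] := #$064A;
  // final ta marbuta -> ha
  if Result[Length(Result)] = #$0629 then
    Result[Length(Result)] := #$0647;
  // strip the first matching prefix if enough letters remain
  for P in Prefixes do
    if (Copy(Result, 1, Length(P)) = P) and
       (Length(Result) - Length(P) >= MinStemLen) then
    begin
      Delete(Result, 1, Length(P));
      Break;
    end;
end;

begin
  // alif lam kaf ta alif ba ("the book") -> kaf ta alif ba
  if NormalizeWord(#$0627#$0644#$0643#$062A#$0627#$0628) =
     #$0643#$062A#$0627#$0628 then
    Writeln('ok');
end.
```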

I would build the search index from this normalized text (for memos I’d store the position of each word in the original text), and apply the same normalization to the search query.
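One way to keep track of original positions, sketched here at character level rather than word level, is to build a map while stripping. StripWithMap and TIntArray are hypothetical names, the code assumes a Unicode Delphi (2009 or later), and the set of skipped code points is the same assumption as for any diacritic filter.

```pascal
program MapDemo;
{$APPTYPE CONSOLE}

type
  TIntArray = array of Integer;

// Hypothetical helper: removes the diacritics and records, for every kept
// character, its 1-based position in the original string.
function StripWithMap(const S: string; out Map: TIntArray): string;
var
  I, N: Integer;
begin
  SetLength(Result, Length(S));
  SetLength(Map, Length(S));
  N := 0;
  for I := 1 to Length(S) do
    case Ord(S[I]) of
      $064B..$0652, $0670, $0640: ;  // diacritic or tatweel: skip it
    else
      Inc(N);
      Result[N] := S[I];
      Map[N - 1] := I;  // Map is 0-based, string positions are 1-based
    end;
  SetLength(Result, N);
  SetLength(Map, N);
end;

var
  Norm: string;
  Map: TIntArray;
  Hit: Integer;
begin
  // text: the first two words of the example, with diacritics
  Norm := StripWithMap(
    #$0642#$064F#$0644#$0652' '#$0647#$064F#$0648#$064E, Map);
  // query: ha + waw, without diacritics
  Hit := Pos(#$0647#$0648, Norm);
  if Hit > 0 then
    Writeln('match starts at original position ', Map[Hit - 1]);
end.
```

A hit found in the stripped text can then be translated back through Map, for example to select the matching text in the memo.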

I would also search in Memo1.Text rather than line by line, because the query may contain several words that sit at the end of one line and wrap onto the next.
