I am working with a Chinese database stored as plain text, where each entry has this format:
Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/
I’ve tried parsing it using delimiters (in Java).
This is what I have so far:
String delims = "[\\[\\]/]+";
String tokens[] = str.split(delims);
The problem is that the English equivalent also contains delimiter tokens.
For instance:
⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/
How would someone parse this String?
I’m trying to get the following information from the String:
Simplified: ⿔
Traditional: ⿔
Pinyin: gui1
English Equivalent: variant of 龜|龟[gui1]
Try using a regex to clean up the whole string.
(\\S+) -> ⿔
Finds a continuous group of non-whitespace characters.
\\s*
Finds a continuous run of whitespace.
\\[(.+?)\\] -> gui1
Finds everything inside [ bla bla bla ].
The '?' makes the match lazy, so it takes the shortest possible match,
e.g. [ bla bla ] rather than [ bla bla] [ble ble ].
/(.+?)/ -> variant of 龜|龟[gui1]
Same as above, but finds everything inside / bla bla /.
Again, the '?' makes it take the shortest match.
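A minimal sketch of how the pieces above might be joined into a single replaceAll call. The exact assembly, the class name EntryCleanup, the method name cleanup, and the ";" separator in the replacement are assumptions for illustration:

```java
class EntryCleanup {
    // Assumed assembly of the four pieces described above into one pattern.
    static final String ENTRY_REGEX =
            "(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/";

    // Rewrites one entry as "first;second;pinyin;english".
    // If the pattern does not match, the text is returned unchanged.
    static String cleanup(String text) {
        return text.replaceAll(ENTRY_REGEX, "$1;$2;$3;$4");
    }

    public static void main(String[] args) {
        String text = "⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/";
        System.out.println(cleanup(text));
    }
}
```

One caveat: because /(.+?)/ is lazy, it stops at the first closing slash. For an entry with several equivalents (/equivalent 1/equivalent 2/), only the first would be captured and the rest would be left behind in the string. A greedy /(.+)/ would likely be the better choice there, since it keeps "equivalent 1/equivalent 2" together in one group.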
Now text becomes:
⿔;⿔;gui1;variant of 龜|龟[gui1]
Next, you can split on ; as the delimiter to get the individual fields.
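The final split could look roughly like this. It is a sketch under assumptions: the class name EntryFields and the method name fields are hypothetical, and the field order (Traditional first, then Simplified) follows the entry format given in the question. Passing a limit of 4 to split is an addition here, so that a ";" occurring inside the English text does not produce extra fields:

```java
class EntryFields {
    // Splits a cleaned entry "first;second;pinyin;english" into its fields.
    static String[] fields(String cleaned) {
        // The limit of 4 stops splitting after the third ';', so any ';'
        // inside the English text stays part of the last field.
        return cleaned.split(";", 4);
    }

    public static void main(String[] args) {
        String[] f = fields("⿔;⿔;gui1;variant of 龜|龟[gui1]");
        System.out.println("Traditional: " + f[0]);
        System.out.println("Simplified: " + f[1]);
        System.out.println("Pinyin: " + f[2]);
        System.out.println("English Equivalent: " + f[3]);
    }
}
```

Checking that the returned array has exactly four elements before indexing into it would guard against lines that did not match the expected format.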