I have some text that was generated by another system. It combined some words

Question

Asked: May 20, 2026 (2026-05-20T18:09:12+00:00)

I have some text that was generated by another system. It combined some words together in what I assume was some sort of word-wrap by-product. So something simple like ‘the dog’ is combined into ‘thedog’.

I checked the ASCII and Unicode strings to see if there was some unseen character in there, but there wasn’t. A confounding problem is that this is medical text, and a corpus to check against isn’t readily available. So, a real example is ‘…test to rule out SARS versus pneumonia’, which ends up as ‘… versuspneumonia.’

Does anyone have a suggestion for finding and separating these?


1 Answer

Editorial Team · Answer 1 · 2026-05-20T18:09:13+00:00

Here is what I did. I combined a couple of ideas and, using a general bootstrapping methodology, came up with a pretty good solution. I used Python for all of this.

1. Took a sample of reports, tokenized all the words, and created a frequency table.
2. For words with a frequency of 3 or under (a frequency of 4 or more was deemed common enough to be correct), I spell-checked them using the PyEnchant package (Enchant library).
3. Built a medical dictionary from the ‘misspelled’ words in step 2 that were clinical terms.
4. For all the reports, created a frequency table.
5. For words with a frequency under 4, I spell-checked each using PyEnchant and my medical dictionary.
6. Took each misspelled word and split it in all possible ways. Each split was tested to see whether it produced 2 correctly spelled words, and any successful split was kept.
7. Where a word had several potential solutions, the highest-weighted solution was used.
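
The splitting and selection steps above might look roughly like the sketch below. This is not the exact code I ran, only an illustration under some assumptions: the dictionary is a plain Python set standing in for PyEnchant plus the medical word list (with PyEnchant you could pass something like `enchant.Dict("en_US").check` as `is_known` instead), the function names and the `min_len` guard are made up for the example, and the weighting shown (prefer the split whose rarer half is most frequent in the corpus) is just one plausible choice.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens; a simplification of real clinical tokenizing."""
    return re.findall(r"[a-z]+", text.lower())


def build_frequencies(reports):
    """Count how often each token appears across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(tokenize(report))
    return counts


def candidate_splits(word, is_known, min_len=2):
    """Return every (left, right) split where both halves are known words.

    min_len is a hypothetical guard against trivial one-letter halves.
    """
    return [
        (word[:i], word[i:])
        for i in range(min_len, len(word) - min_len + 1)
        if is_known(word[:i]) and is_known(word[i:])
    ]


def best_split(word, is_known, counts):
    """Pick one split; the weighting here is an assumption, not a fixed rule."""
    splits = candidate_splits(word, is_known)
    if not splits:
        return None
    # Weight a split by the corpus frequency of its rarer half.
    return max(splits, key=lambda pair: min(counts[pair[0]], counts[pair[1]]))


def repair(reports, dictionary, rare_below=4):
    """Map each rare, unknown token to its best two-word split, if any."""
    counts = build_frequencies(reports)

    def is_known(word):
        return word in dictionary

    fixes = {}
    for word, n in counts.items():
        if n < rare_below and not is_known(word):
            split = best_split(word, is_known, counts)
            if split:
                fixes[word] = " ".join(split)
    return fixes


reports = [
    "test to rule out SARS versuspneumonia",
    "pneumonia versus influenza",
]
dictionary = {"test", "to", "rule", "out", "sars", "versus", "pneumonia", "influenza"}
fixes = repair(reports, dictionary)
print(fixes)
```

The resulting mapping can then be applied back to the reports with a simple token replacement. Words that are rare and unknown but have no valid split are left alone, so they can be reviewed by hand or added to the medical dictionary.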
