I have some text that was generated by another system. It ran some words together, which I assume was a by-product of some sort of word wrap. So something simple like ‘the dog’ is combined into ‘thedog’.
I checked the ASCII and Unicode strings to see whether there was some unseen character in there, but there wasn’t. A confounding problem is that this is medical text, and corpora to check against aren’t readily available. So, in a real example, ‘…test to rule out SARS versus pneumonia’ ends up as ‘… versuspneumonia.’
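For anyone who wants to repeat the hidden-character check, here is a minimal sketch of one way to do it in Python. The function name `find_hidden_chars` is made up for illustration, and the choice of Unicode categories to flag is my assumption, not a standard recipe.

```python
import unicodedata

def find_hidden_chars(text):
    """List (index, code point, name) for characters that are likely invisible.

    Flags control (Cc) and format (Cf) characters, plus any separator
    other than a plain space. Ordinary newlines and tabs are ignored.
    """
    hidden = []
    for i, ch in enumerate(text):
        if ch in " \n\t\r":
            continue  # everyday whitespace, not what we are hunting for
        cat = unicodedata.category(ch)
        if cat in ("Cc", "Cf") or cat.startswith("Z"):
            name = unicodedata.name(ch, "UNNAMED")
            hidden.append((i, f"U+{ord(ch):04X}", name))
    return hidden
```

An empty list for a fused token such as `versuspneumonia` means there is no zero-width space, soft hyphen, or similar character hiding at the join.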
Anyone have a suggestion for finding and separating these?
Here is what I did. I combined a couple of ideas and, using a general bootstrapping methodology, came up with a pretty good solution. I used Python for all of this.
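To give a rough idea of the approach, here is a simplified sketch rather than the exact code. It assumes the vocabulary can be bootstrapped from the text itself: words that occur at least `min_count` times are treated as real, and any unknown token that splits cleanly into two known words is treated as fused. The names `build_vocab`, `split_fused`, `min_count`, and `min_len` are placeholders for illustration.

```python
import re
from collections import Counter

def build_vocab(texts, min_count=2):
    """Count lowercase alphabetic tokens; keep those seen at least min_count times.

    Assumption: fused tokens are rare, so the frequency cutoff keeps
    most of them out of the vocabulary.
    """
    counts = Counter(
        word for text in texts for word in re.findall(r"[a-z]+", text.lower())
    )
    return {word: n for word, n in counts.items() if n >= min_count}

def split_fused(token, vocab, min_len=2):
    """Return the best (left, right) split of an unknown token, or None.

    A split qualifies only if both halves are in vocab. Candidates are
    ranked by the frequency of their rarer half.
    """
    token = token.lower()
    if token in vocab:
        return None  # already a known word, leave it alone
    best, best_score = None, 0
    for i in range(min_len, len(token) - min_len + 1):
        left, right = token[:i], token[i:]
        if left in vocab and right in vocab:
            score = min(vocab[left], vocab[right])
            if score > best_score:
                best, best_score = (left, right), score
    return best
```

The bootstrapping part is to apply the splits, rebuild the vocabulary from the corrected text, and run it again until no new splits turn up. It only handles two words fused together, and a short `min_len` will produce false splits on real words, so the results need a manual check.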