I have the following task: to fill a spell-check dictionary (a simple txt file), I need a parser
that parses a text file (or another type of document), extracts
each word, and then creates a text file with a simple list of words like this:
adfadf
adfasdfa
adfasfdasdf
adsfadf
…
etc
Which scripting language and library would you suggest? If possible, please give a code example (especially for extracting each word). Thanks!
What you want is not a parser, just a tokenizer. This can be done in any language with a handful of regular expressions, but I recommend Python with NLTK.
Generally, just about any NLP toolkit will include a tokenizer, so there’s no need to reinvent the wheel; tokenizing isn’t hard, but it involves writing a lot of heuristics to handle all the exceptions such as abbreviations, acronyms, etc.
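As a rough illustration of the regular-expression route, here is a minimal sketch that uses only the Python standard library. The names `WORD_RE`, `extract_words`, and `build_dictionary` are made up for this example, and the definition of a "word" (a run of letters, optionally joined by apostrophes or hyphens) is an assumption you would likely need to adjust for your language and documents. If NLTK is installed, its `nltk.word_tokenize` function could stand in for the regex; it needs a one-time download of NLTK's tokenizer data.

```python
import re
from pathlib import Path

# Assumption: a "word" is a run of letters, optionally joined by
# internal apostrophes or hyphens (e.g. "don't", "well-known").
# [^\W\d_] matches any Unicode letter (a word character that is
# neither a digit nor an underscore).
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def extract_words(text):
    """Return the unique words in `text`, lowercased and sorted."""
    return sorted({match.group(0).lower() for match in WORD_RE.finditer(text)})


def build_dictionary(source_path, target_path):
    """Read a UTF-8 text file and write one unique word per line.

    Returns the number of words written.
    """
    text = Path(source_path).read_text(encoding="utf-8")
    words = extract_words(text)
    Path(target_path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return len(words)
```

Calling `build_dictionary("input.txt", "dictionary.txt")` (both file names are placeholders) would produce the one-word-per-line list described in the question. Note that this simple pattern splits an abbreviation such as "U.S." into "u" and "s", which is exactly the kind of exception a toolkit tokenizer handles with extra heuristics.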