I am working with text which is, unfortunately, given in ALL CAPS. The default nltk.pos_tag function does not do a very good job on this text (it thinks everything is a proper noun).
What is the best way to deal with this issue?
The best approach would be to apply truecasing to your text before POS tagging.
If that is too much effort, you can convert your Python string `x` to lowercase using `x.lower()`. That should at least avoid the problem of getting only proper-noun tags, although the tagger may then assign too few proper-noun tags. You could also train a POS tagger on a tagged corpus that you have first converted to lowercase, but if you want the best results you probably want to go for truecasing.
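To make the truecasing idea concrete, here is a minimal sketch of a frequency-based truecaser. It is only an illustration, not a production approach: the function names `build_truecase_model` and `truecase` and the tiny `reference` text are made up for this example, and in practice you would learn the casing from a large, normally cased corpus or use a dedicated truecasing tool.

```python
import re
from collections import Counter, defaultdict


def build_truecase_model(cased_text):
    """Map each lowercased word to its most frequent casing in cased_text."""
    counts = defaultdict(Counter)
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", cased_text):
        tokens = re.findall(r"[A-Za-z']+", sentence)
        # Skip the sentence-initial token: its capital letter is uninformative.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(tokens, model):
    """Restore the likely casing; unknown words fall back to lowercase."""
    return [model.get(tok.lower(), tok.lower()) for tok in tokens]


# Hypothetical reference text standing in for a real cased corpus.
reference = (
    "Yesterday we met John in Paris. "
    "People say John likes the city. "
    "Everyone thinks Paris is the best city."
)
model = build_truecase_model(reference)

tokens = "JOHN LIKES THE CITY OF PARIS".split()
restored = truecase(tokens, model)
print(restored)  # ['John', 'likes', 'the', 'city', 'of', 'Paris']
```

The restored tokens can then be passed to `nltk.pos_tag(restored)`, assuming NLTK and its tagger model are installed. The simpler fallback from the answer corresponds to tagging `[t.lower() for t in tokens]` instead.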