Assume I have a string text = "A compiler translates code from a source language". I want to do two things:
- I need to iterate through each word and stem it using the NLTK library. The function for stemming is `PorterStemmer().stem_word(word)`, which takes the argument `word`. How can I stem each word and get back the stemmed sentence?
- I need to remove certain stop words from the `text` string. The list containing the stop words is stored in a text file (space separated): `stopwordsfile = open('c:/stopwordlist.txt','r+')` followed by `stopwordslist=stopwordsfile.read()`. How can I remove those stop words from `text` and get a cleaned new string?

I posted this as a comment, but thought I might as well flesh it out into a full answer with some explanation:
You want to use `str.split()` to split the string into words, and then stem each word.

As you want to get a string of all the stemmed words together, it's trivial to then join these stems back together. To do this easily and efficiently, we use `str.join()` and a generator expression.

Edit:
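The split, stem, and join steps described above can be sketched roughly as follows. This is a minimal sketch, not the only way to do it: `stem_sentence` is a name made up for illustration, and it assumes a recent NLTK release, where (as far as I know) the method is called `stem` rather than the `stem_word` mentioned in the question. If NLTK is not installed, the sketch falls back to a stand-in that only lowercases, purely so the example still runs.

```python
def stem_sentence(text, stem):
    """Split on whitespace, stem each word, and rejoin with single spaces."""
    # The generator expression feeds the stems straight into str.join()
    # without building an intermediate list.
    return " ".join(stem(word) for word in text.split())


try:
    # Assumption: NLTK is installed and exposes PorterStemmer().stem.
    from nltk.stem.porter import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    # Stand-in so the sketch runs without NLTK; it does no real stemming.
    stem = str.lower

text = "A compiler translates code from a source language"
print(stem_sentence(text, stem))
```

Passing the stemming function in as an argument keeps the splitting and joining logic independent of which stemmer you end up using.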
For your other problem:
We open the file using the `with` statement (which is the best way to open files, as it handles closing them correctly, even on exceptions, and is more readable) and read the contents into a set. We use a set as we don't care about the order of the words, or duplicates, and it will be more efficient later. I am presuming one word per line; if this isn't the case, and they are comma separated or whitespace separated, then using `str.split()` as we did before (with appropriate arguments) is probably a good plan.

We then use the `if` clause of a generator expression to ignore words that are in the set of words we loaded from the file. Membership checks on a set are O(1) on average, so this should be relatively efficient.
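A rough sketch of those two steps, loading the stop words into a set and then filtering with an `if` clause, might look like this. The function names `load_stop_words` and `remove_stop_words` are invented for illustration. Since the question says the file is space separated, the sketch calls `str.split()` with no arguments on the whole file contents, which handles spaces and newlines alike; a comma-separated file would need `split(",")` instead.

```python
def load_stop_words(path):
    """Read a whitespace-separated stop-word file into a set."""
    # The with statement closes the file even if reading raises.
    with open(path) as f:
        return set(f.read().split())


def remove_stop_words(text, stop_words):
    """Return text with every word found in stop_words dropped."""
    # The if clause skips stop words; set membership is O(1) on average.
    return " ".join(word for word in text.split() if word not in stop_words)


# With the file from the question, this would be:
#   stop_words = load_stop_words('c:/stopwordlist.txt')
# A small hard-coded set (made up for the demo) keeps the sketch runnable:
stop_words = {"a", "from"}
text = "A compiler translates code from a source language"
print(remove_stop_words(text, stop_words))
```

Note that the comparison is case-sensitive, so the capitalised "A" at the start of the sentence is kept here. Comparing `word.lower()` against a lowercased set would be one way around that, if it matters for your data.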
Edit 2:
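To remove the words before they are stemmed, it's even simpler: apply the same `if` clause to the words as they come out of `str.split()`, before they reach the stemmer.
The removal of the given words on its own is simply that filter without the stemming step.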
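As a hedged sketch of filtering before stemming: `remove_then_stem` is a hypothetical helper name, and the `stem` argument stands for whatever stemming function you use (for example NLTK's `PorterStemmer().stem`, assuming a recent NLTK release). The demo passes `str.lower` as a stand-in so it runs without NLTK.

```python
def remove_then_stem(text, stop_words, stem):
    """Drop stop words first, then stem the remaining words and rejoin them."""
    # The membership test is applied to the raw word, before stemming.
    return " ".join(
        stem(word) for word in text.split() if word not in stop_words
    )


# Stand-in values for the demo; swap in your own set and stemmer.
stop_words = {"a", "from"}
text = "A compiler translates code from a source language"
print(remove_then_stem(text, stop_words, str.lower))
```

Filtering on the unstemmed word means the stop-word file can contain ordinary words rather than their stems, and the stemmer never has to process words you are going to discard anyway.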