I have been working on a priority email inbox written in Python, with the ultimate aim of using a machine learning algorithm to label (or classify) a selection of emails as either important or unimportant. I will begin with some background information and then move on to my question.
I have so far developed code to extract data from emails and process it to identify the most important ones. This is achieved using the following email features:
- Sender's Address Frequency
- Thread Activity
- Date Received (time between replies)
- Common Words in body/subject
The code I currently have assigns a rank (or weighting) between 0.1 and 1 to each email based on its importance and then applies a label of either ‘important’ or ‘un-important’ (in this case just 1 or 0). An email is given priority status if its rank is > 0.5. This data is stored in a CSV file (as below).
From,Subject,Body,Date,Rank,Priority
test@test.com,HelloWorld,Body Words,10/10/2012,0.67,1
rest@test.com,ByeWorld,Body Words,10/10/2012,0.21,0
best@test.com,SayWorld,Body Words,10/10/2012,0.91,1
just@test.com,HeyWorld,Body Words,10/10/2012,0.48,0
etc …………………………………………………………………………
I have two sets of email data (one for training, one for testing). The above applies to my training email data. I am now attempting to train a learning algorithm so that I can predict the importance of the testing data.
To do this I have been looking at both scikit-learn and NLTK. However, I am having trouble taking what I have learnt in the tutorials and applying it to my project. I have no particular requirements regarding which learning algorithm is used. Is this as simple as applying the following? And if so, how?
X, y = email.data, email.target
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf = clf.fit(X, y)
X_new = [Testing Email Data]
clf.predict(X_new)
The easiest (though probably not the fastest) solution(*) is to use scikit-learn’s DictVectorizer. First, read in each sample with Python’s csv module, and build a dict containing (feature, value) pairs, while keeping the priority separate in a list y.
After passing the list of dicts to the vectorizer’s fit_transform method, you have a sparse matrix X_train that, together with y, you can feed to a scikit-learn classifier.
Be aware:
When you want to make predictions on unseen data, you must apply the same procedure and the exact same vectorizer object to it. I.e., you have to build a test_dicts object using the same row-reading loop as for the training data, then do X_test = vectorizer.transform(test_dicts).
I’ve assumed you want to predict the priority directly. Predicting the “rank” instead would be a regression problem, rather than a classification one. Some scikit-learn classifiers have a predict_proba method which will produce the probability that an email is important, but you can’t train those on the ranks.
(*) I am the author of scikit-learn’s DictVectorizer, so this is not unbiased advice. It is from the horse’s mouth, though 🙂
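A minimal sketch of the procedure described above, under some assumptions: the CSV files have a header row with the column names from the question, only From, Subject and Body are used as features, and the in-memory strings TRAIN_CSV and TEST_CSV and the helper rows_to_dicts are hypothetical stand-ins for your own files and loop. LinearSVC is used only because it appears in the question; other classifiers should work the same way.

```python
import csv
import io

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Hypothetical in-memory stand-ins for the training and testing CSV files.
TRAIN_CSV = """From,Subject,Body,Date,Rank,Priority
test@test.com,HelloWorld,Body Words,10/10/2012,0.67,1
rest@test.com,ByeWorld,Body Words,10/10/2012,0.21,0
best@test.com,SayWorld,Body Words,10/10/2012,0.91,1
just@test.com,HeyWorld,Body Words,10/10/2012,0.48,0
"""
TEST_CSV = """From,Subject,Body,Date
test@test.com,HelloWorld,Body Words,11/10/2012
nobody@test.com,OtherWorld,Body Words,11/10/2012
"""


def rows_to_dicts(csv_file, with_labels):
    """Build one {feature: value} dict per row; collect labels separately."""
    dicts, labels = [], []
    for row in csv.DictReader(csv_file):
        # Assumed feature choice: the raw string columns, one-hot encoded
        # by DictVectorizer as e.g. "From=test@test.com".
        dicts.append({"From": row["From"],
                      "Subject": row["Subject"],
                      "Body": row["Body"]})
        if with_labels:
            labels.append(int(row["Priority"]))
    return dicts, labels


train_dicts, y = rows_to_dicts(io.StringIO(TRAIN_CSV), with_labels=True)
test_dicts, _ = rows_to_dicts(io.StringIO(TEST_CSV), with_labels=False)

# Fit the vectorizer on the training data only.
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_dicts)  # sparse matrix

clf = LinearSVC()
clf.fit(X_train, y)

# Reuse the very same vectorizer object for the unseen data.
X_test = vectorizer.transform(test_dicts)
predictions = clf.predict(X_test)
print(predictions)
```

Rank is left out of the feature dicts on purpose in this sketch: since the priority is derived from it (rank > 0.5), using it as a feature would hand the classifier the answer, and it would not be available for new emails anyway.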