I am working on a small project which involves a dictionary based text searching within a collection of documents. My dictionary has positive signal words (a.k.a good words) but in the document collection just finding a word does not guarantee a positive result as there may be negative words for example (not, not significant) that may be in the proximity of these positive words. I want to construct a matrix such that it contains the document number,positive word and its proximity to negative words.
Can anyone please suggest a way to do that. My project is at a very very early stage so I am giving a basic example of my text.
No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.
This is my example document in which candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide are my positive words and no significant is my negative word. I want to do a proximity (word based) mapping between my positive and nevative words.
Can anyone give some helpful pointers?
First of all I would suggest not to use R for this task. R is great for many things, but text manipulation is not one of those. Python could be a good alternative.
That said, if I were to implement this in R, I would probably do something like (very very rough):