I’m working on the final project for my studies,
and I’m trying to build sentiment analysis of Twitter messages.
I’m using a Bayesian algorithm with a bag of words.
Do you have an example of a bag-of-words algorithm in PHP?
I can’t find anything. Maybe a list of positive and negative words, or something similar, would help.
I haven’t implemented bag of words in PHP, but I have done it in Java. A simple approach is to take the training data and tokenize it (for example, with the Stanford Tokenizer). Once all of your training data is tokenized, extract the 1-grams from it. I use http://homepages.inf.ed.ac.uk/lzhang10/ngram.html to extract the n-grams, then strip the counts from the output and keep only the words. That word list becomes your bag-of-words corpus, which is used during both training and classification. Make sure you use the same tokenizer for training, testing and classification, and the same corpus for every model you train.
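The corpus-building step can be sketched roughly as follows. This is only a minimal illustration in Python (not PHP, though the logic should port directly): the regex-based `tokenize` is a simplified stand-in for a real tokenizer such as the Stanford Tokenizer, and `build_vocabulary` is a hypothetical helper name, not part of the n-gram tool linked above.

```python
import re


def tokenize(text):
    # Assumption: a simplified tokenizer that splits into runs of word
    # characters and single punctuation marks. A real project would use
    # a proper tokenizer and apply the same one everywhere.
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocabulary(training_texts):
    # Collect every distinct 1-gram, keeping first-seen order so that
    # each word gets a stable position in the corpus.
    vocabulary = []
    seen = set()
    for text in training_texts:
        for token in tokenize(text):
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


if __name__ == "__main__":
    print(build_vocabulary(["hello, world", "hello the name"]))
```

The order of the list matters only in that it must stay fixed once training starts, because the positions are what the classifier sees.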
Implementing it is then fairly easy: take a string of data and tokenize it with the same tokenizer that was used to create the bag-of-words corpus. For each token, check whether it appears in your corpus and, if so, at what position. For example, suppose your corpus contains the following words:
a
name
the
hello
world
,
Now suppose you have the string “hello, my name is Jas”. Tokenizing it gives the tokens {“hello”, “,”, “my”, “name”, “is”, “Jas”}, and matching these tokens against the corpus gives:
2:1 4:1 6:1
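This means that the words “name” and “hello” and the comma, which sit at positions 2, 4 and 6 of your corpus (counting from 1), are present in the incoming test string. The 1 after each colon marks the token as present. The tokens “my”, “is” and “Jas” are not in the corpus, so they produce no entry.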
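The matching step above could look something like this. Again, this is just a sketch in Python under a few assumptions: `tokenize` is a simplified regex tokenizer, `encode` is a hypothetical helper name, and the value after the colon is taken to be the number of times the token occurs (replace it with a constant 1 if you only want presence).

```python
import re
from collections import Counter

# The example corpus from above, in the same order (positions start at 1).
VOCABULARY = ["a", "name", "the", "hello", "world", ","]


def tokenize(text):
    # Assumption: simplified tokenizer; use the same one as for training.
    return re.findall(r"\w+|[^\w\s]", text)


def encode(text, vocabulary):
    # Map each word to its 1-based position in the corpus.
    position = {word: i for i, word in enumerate(vocabulary, start=1)}
    # Count only the tokens that exist in the corpus; others are ignored.
    counts = Counter(position[t] for t in tokenize(text) if t in position)
    # Emit "position:count" pairs sorted by position.
    return " ".join(f"{idx}:{n}" for idx, n in sorted(counts.items()))


if __name__ == "__main__":
    print(encode("hello, my name is Jas", VOCABULARY))  # prints: 2:1 4:1 6:1
```

Matching is case-sensitive here, so “Hello” would not match “hello” unless you lowercase the text consistently in both training and classification.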