object NGram {
  def main(args: Array[String]) {
    // args(0) = text file
    // args(1) = size of the n-grams
    // args(2) = number of words to generate
    val F = scala.io.Source.fromFile(args(0)) // read from args(0)
    for (line <- F.getLines()) {
      val words = line.split("[ ,:;.?!-]+") map (_.toLowerCase)
      var ngram: Set[String] = Set()
      // build the n-grams
      for (i <- 0 to words.size - args(1)) {
        // first build a sequence of args(1) words
        for (j <- i until i + args(1)) {
          ngram = ngram + words(j) // this does not work; this is where I am stuck
        }
      }
    }
  }
}
I wrote an n-gram algorithm in Scala. At first I:
- built each string sequence and checked whether it occurs in the original string;
- and it worked efficiently.
I want the sequences of n strings to contain no duplicates (because it must work efficiently).
How can I build the sequences of n strings using map?
Am I correct that :
There is a routine that will give you n-grams: sliding.
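As a minimal sketch of how sliding could be applied to the question's input, the snippet below tokenizes a line with the same separator pattern used in the question and returns every window of n consecutive words. The names SlidingSketch, tokenize and ngrams are hypothetical helpers introduced only for this illustration.

```scala
object SlidingSketch {
  // Split a line into lowercase words, using the separator pattern from the question.
  // Empty tokens (e.g. from a leading separator) are dropped.
  def tokenize(line: String): Vector[String] =
    line.split("[ ,:;.?!-]+").filter(_.nonEmpty).map(_.toLowerCase).toVector

  // Every window of n consecutive words, in order of appearance.
  // Duplicates are kept at this stage.
  def ngrams(words: Seq[String], n: Int): List[Seq[String]] =
    words.sliding(n).toList
}
```

For example, `SlidingSketch.ngrams(SlidingSketch.tokenize("a b c d"), 2)` yields the three bigrams (a, b), (b, c) and (c, d).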
There is a caveat: if you have only p words and want n-grams with n > p, sliding will return one p-gram (not an n-gram, obviously) rather than none. So you have to check for that.
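One possible way to guard against the short-input case is to keep only the windows that really contain n words. This is a sketch, and exactNgrams is a hypothetical name, not something from the standard library.

```scala
object SlidingGuard {
  // Keep only windows of exactly n words. When the input has fewer than
  // n words, sliding yields a single shorter window, which is filtered out here.
  def exactNgrams(words: Seq[String], n: Int): Iterator[Seq[String]] =
    words.sliding(n).filter(_.size == n)
}
```

With two words and n = 3, plain `sliding(3)` gives one window of two words, while `SlidingGuard.exactNgrams` gives an empty iterator.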
You can use toSet rather than toSeq to eliminate duplicates.
There is one last point: you want only a certain number of n-grams (your last argument). You did not specify how you want to select them. The simple way would be a take. To avoid going through the whole list of words, take only the first count distinct ones and stop there.
If you want to take them at random positions, that is a different story, and maybe sliding is not the way to go.
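The deduplication and the count-limited selection described above might look like the following sketch. It is not necessarily the code the answer originally showed: DistinctNgrams, allDistinct and firstDistinct are hypothetical names, and the early-stopping variant uses a mutable set to track n-grams already seen, which is one of several ways to do it.

```scala
import scala.collection.mutable

object DistinctNgrams {
  // All distinct n-grams. This walks the whole word list and builds the full set.
  def allDistinct(words: Seq[String], n: Int): Set[Seq[String]] =
    words.sliding(n).filter(_.size == n).toSet

  // The first `count` distinct n-grams in order of appearance.
  // The iterator is lazy, so the scan stops once `count` n-grams are collected.
  def firstDistinct(words: Seq[String], n: Int, count: Int): List[Seq[String]] = {
    val seen = mutable.HashSet.empty[Seq[String]]
    words.sliding(n)
      .filter(g => g.size == n && seen.add(g)) // add returns false for a repeat
      .take(count)
      .toList
  }
}
```

For the words a, b, a, b, c with n = 2, `allDistinct` returns three bigrams, and `firstDistinct` with count = 2 returns (a, b) and (b, a).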