I would like to remove duplicates from the data in my CSV file.
The first column is the year, and the second is the sentence. I would like to remove any duplicate sentences, regardless of the year.
Is there a command that I can insert in val text = { } to remove these duplicates?
My script is:
val source = CSVFile("science.csv");
val text = {
  source ~>
  Column(2) ~>
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>
  TermMinimumDocumentCountFilter(30) ~>
  TermDynamicStopListFilter(10) ~>
  DocumentMinimumLengthFilter(5)
}
Thank you!
Essentially you want a version of distinct where you can specify what makes an object (row) unique (the second column).
Given the code (a modified version of SeqLike.distinct):
If you had a list of rows (where a row is a tuple) you could get the filtered/unique ones based on the second column with
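The helper referred to above is not shown in this post, so what follows is only a minimal sketch of what such a function might look like, not the original code. The name distinctBy, the (year, sentence) tuple layout, and the sample rows are all assumptions made for illustration. It keeps the first row seen for each key and preserves the input order, which is the same idea the standard distinct uses with a set of already-seen elements.

```scala
import scala.collection.mutable

// Hypothetical helper (not from the original post): keeps the first element
// seen for each key and preserves the original order.
def distinctBy[A, K](xs: Seq[A])(key: A => K): Seq[A] = {
  val seen = mutable.HashSet.empty[K]
  val out = Seq.newBuilder[A]
  for (x <- xs) {
    // HashSet.add returns true only if the key was not already in the set
    if (seen.add(key(x))) out += x
  }
  out.result()
}

// Made-up sample rows in the assumed layout: (year, sentence)
val rows = Seq(
  ("2001", "Cells divide."),
  ("2003", "Cells divide."),
  ("2005", "Light bends.")
)

// Uniqueness is decided by the second column only; the year is ignored
val uniqueRows = distinctBy(rows)(_._2)
// uniqueRows == Seq(("2001", "Cells divide."), ("2005", "Light bends."))
```

On Scala 2.13 or later the standard library already offers this as rows.distinctBy(_._2), so the helper is only needed on older versions. How to feed the de-duplicated rows back into the val text = { } pipeline is not covered by this sketch.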