I have a CSV file that looks like:
Lorem ipsum dolor sit amet , 12:01
consectetuer adipiscing elit, sed , 12:02
etc…
It is quite a large file (approx. 10,000 rows).
I would like to get the total vocabulary size of all the rows of text together. That is, ignoring the second column (the time), lowercasing everything and then counting the number of different words.
Issues:
1) how to separate each word within each row
2) how to lowercase everything and remove non-alphabetical characters.
So far I have the following code:
import csv
with open('/Users/file.csv', 'rb') as file:
    vocabulary = []
    i = 0
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        for word in row:
            if row in vocabulary:
                break
            else:
                vocabulary.append(word)
                i = i + 1
print i
Thank you for your help!
You have pretty much what you need. One missing point is lowercase conversion, which can simply be done with word.lower().
Another thing you're missing is splitting into words. You should use .split() for this task, which by default splits on any run of whitespace characters, i.e., spaces, tabs, etc.
One problem you will have is distinguishing between commas within the text and the column-separating comma. Maybe don't use the csv reader but simply read each line, remove the time, and then split it into words.
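A minimal sketch of that approach for a single line, assuming the time always appears at the end of the line as ", HH:MM". The function name words_in_line and both patterns are illustrative assumptions, not taken from the question:

```python
import re

def words_in_line(line):
    """Return the lowercased words of one line, ignoring the time column."""
    # First regular expression (assumed format): drop the trailing ", HH:MM" column.
    text = re.sub(r'\s*,\s*\d{1,2}:\d{2}\s*$', '', line)
    # Second regular expression (assumed character list): replace digits and
    # common punctuation with a space so that "elit,sed" still splits in two.
    text = re.sub(r'[0-9.,;:!?()"-]', ' ', text.lower())
    # split() without arguments splits on any run of whitespace.
    return text.split()
```

Replacing unwanted characters with a space rather than deleting them keeps words apart when a comma is not followed by a space.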
If you want to remove other characters, include them in the second regular expression. If performance matters to you, you should compile the two regular expressions once before the for loop.
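As a rough illustration of precompiling, the sketch below builds both patterns once with re.compile and collects words in a set, so duplicates are discarded automatically and no manual counter is needed. The names TIME_COLUMN, UNWANTED and vocabulary_size, as well as the patterns themselves, are assumptions made for this example:

```python
import re

# Compiled once, outside the loop (both patterns are assumed, adjust as needed).
TIME_COLUMN = re.compile(r'\s*,\s*\d{1,2}:\d{2}\s*$')  # trailing ", HH:MM"
UNWANTED = re.compile(r'[0-9.,;:!?()"-]')              # characters to strip

def vocabulary_size(lines):
    """Count the distinct lowercased words over an iterable of lines."""
    vocabulary = set()
    for line in lines:
        # Remove the time column, then lowercase the remaining text.
        text = TIME_COLUMN.sub('', line).lower()
        # Replace unwanted characters with spaces and add the words to the set.
        vocabulary.update(UNWANTED.sub(' ', text).split())
    return len(vocabulary)
```

Since a file object iterates over its lines, the function can be called directly on the opened file, e.g. vocabulary_size(f) inside a with open('/Users/file.csv') as f: block. Membership tests on a set take roughly constant time, whereas checking a list gets slower as the vocabulary grows.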