I have a CSV file that looks like: Lorem ipsum dolor sit amet ,

Asked: June 17, 2026, 16:46:20 UTC

I have a CSV file that looks like:

Lorem ipsum dolor sit amet , 12:01
consectetuer adipiscing elit, sed , 12:02

etc…

It is quite a large file (approx. 10,000 rows).
I would like to get the total vocabulary size of all the rows of text together. That is, ignoring the second column (the time), lowercasing everything, and then counting the number of distinct words.

Issues:
1) how to separate each word within each row
2) how to lowercase everything and remove non-alphabetical characters.

So far I have the following code:

import csv
with open('/Users/file.csv', 'rb') as file:
    vocabulary = []
    i = 0
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        for word in row:
            if row in vocabulary:
                break
            else:
                vocabulary.append(word)
                i = i +1
print i

Thank you for your help!

1 Answer

Editorial Team · Answer 1 · 2026-06-17T16:46:21+00:00

You have most of what you need. One missing piece is lowercase conversion, which can be done with word.lower().

Another thing you're missing is splitting the text into words. Use .split() for this; by default it splits on any run of whitespace (spaces, tabs, etc.).
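As a rough sketch of those two steps together, the words can be collected in a set, which keeps only distinct entries and so needs no manual duplicate check. The function name `vocabulary_of` and the sample rows are made up for illustration.

```python
def vocabulary_of(rows):
    """Return the set of distinct lowercased words across all rows."""
    vocabulary = set()
    for text in rows:
        # split() with no arguments splits on any run of whitespace
        vocabulary.update(word.lower() for word in text.split())
    return vocabulary


sample = ["Lorem ipsum dolor", "lorem IPSUM sit amet"]
vocab = vocabulary_of(sample)
print(len(vocab))
```

The vocabulary size is then simply the length of the set.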

One problem you will run into is distinguishing between commas within the text and the column-separating comma. It may be easier not to use csv.reader at all: read each line, remove the time, then split what is left into words.
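One possible way to drop the time without confusing it with commas in the text is to split each line once, from the right, at the last comma. This is only a sketch: it assumes every row ends in a comma followed by the time, as in the sample rows, and `strip_time` is a hypothetical helper name.

```python
def strip_time(line):
    """Return the text part of a row, dropping what follows the last comma."""
    # rsplit with maxsplit=1 cuts only at the final comma, so commas
    # inside the text itself are left untouched.
    return line.rsplit(",", 1)[0].strip()


row = "consectetuer adipiscing elit, sed , 12:02\n"
print(strip_time(row))
```

A row with no comma at all is returned unchanged (apart from surrounding whitespace), so the helper does not fail on malformed lines.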

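import re

# Text mode, so each line is a str that the regexes below can work on
with open('/Users/file.csv', 'r') as file:
    for line in file:
        # Drop the trailing time column, e.g. " , 12:01"
        line = re.sub(r' , [0-2][0-9]:[0-5][0-9]', '', line)
        # Drop punctuation; inside [...] no "|" separators are needed
        line = re.sub(r'[,!.?"]', '', line)
        words = [w.lower() for w in line.split()]
        for word in words:
            ...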

If you want to remove other characters, add them to the second regular expression. If performance matters, compile the two regular expressions once before the for loop.
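Putting the pieces together, a minimal sketch with both patterns compiled once might look like the following. The names `TIME_RE`, `NON_ALPHA_RE` and `count_vocabulary` are illustrative, the time pattern assumes the `, HH:MM` layout of the sample rows, and the second pattern keeps only ASCII letters a-z, so accented letters would be dropped as well.

```python
import io
import re

# Compiled once, outside the loop
TIME_RE = re.compile(r"\s*,\s*[0-2][0-9]:[0-5][0-9]\s*$")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def count_vocabulary(lines):
    """Count distinct lowercased words in an iterable of text lines."""
    vocabulary = set()
    for line in lines:
        line = TIME_RE.sub("", line)               # drop the time column
        line = NON_ALPHA_RE.sub("", line.lower())  # keep letters and whitespace
        vocabulary.update(line.split())
    return len(vocabulary)


# io.StringIO stands in for an open file here
sample = io.StringIO(
    "Lorem ipsum dolor sit amet , 12:01\n"
    "consectetuer adipiscing elit, sed , 12:02\n"
    "Lorem, IPSUM! , 12:03\n"
)
print(count_vocabulary(sample))
```

For the three sample rows this prints 9, since the third row only repeats words already seen. With a real file, an open file object can be passed in place of the `io.StringIO` stand-in.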
