I’m working on a data.frame with about 700 000 rows. It contains the ids of status updates and the corresponding usernames from Twitter. I just want to know how many different users are in there and how many times each of them has tweeted. I thought this was a very simple task using table(), but now I have noticed that I’m getting different results.
Recently I did it by converting the column to character, like this:
>freqs <- as.data.frame(table(as.character(w_dup$from_user)))
>nrow(freqs)
[1] 239678
Two months ago I did it like this:
>freqs <- as.data.frame(table(w_dup$from_user))
>nrow(freqs)
[1] 253594
I noticed that this way the data frame contains usernames with a frequency of 0. How can that be? If a username is in the dataset, it must occur at least once.
?table didn’t help me, and I wasn’t able to reproduce the issue on smaller datasets.
What am I doing wrong? Or am I misunderstanding the use of table()?
The type of the column is the problem here: it is a factor. Also keep in mind that the levels of a factor stay the same when you subset the data frame, so usernames that no longer occur in the subset are still listed as levels.
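A minimal sketch of that behaviour, assuming a small made-up data frame (`df`, `sub` and the user names "a", "b", "c" are hypothetical stand-ins, not the asker's data):

```r
# Toy data: from_user is stored as a factor (hypothetical example data)
df <- data.frame(from_user = factor(c("a", "b", "b", "c")),
                 id = 1:4)

# Keep only the rows that do not belong to user "c"
sub <- df[df$from_user != "c", ]

# "c" no longer occurs among the values ...
unique(as.character(sub$from_user))

# ... but it is still registered as a level of the factor
levels(sub$from_user)
```

Comparing the two results shows that subsetting removes rows but leaves the set of levels untouched.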
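So the first column is a factor. In this case all the levels (for example "a", "b", "c") are taken into consideration by table(), including levels that no longer occur in the data, which is where the rows with frequency 0 come from. After as.character() the values are not factors anymore, so only the usernames that are actually present get counted.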
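The difference between the two calls in the question can be sketched like this, again on made-up data (`df`, `sub` and the `freq_*` names are assumptions for illustration). The last variant uses `droplevels()`, which may be a convenient alternative if you want to keep the column as a factor:

```r
# Toy data with a factor column, then a subset without user "c"
df <- data.frame(from_user = factor(c("a", "b", "b", "c")),
                 id = 1:4)
sub <- df[df$from_user != "c", ]

# Factor input: one row per level, so the unused level "c" appears with Freq 0
freq_factor <- as.data.frame(table(sub$from_user))

# Character input: only values that actually occur are tabulated
freq_char <- as.data.frame(table(as.character(sub$from_user)))

# Dropping unused levels first gives the same rows while keeping a factor
freq_dropped <- as.data.frame(table(droplevels(sub$from_user)))

nrow(freq_factor)   # 3
nrow(freq_char)     # 2
nrow(freq_dropped)  # 2
```

If the zero-frequency rows are not wanted, filtering with `freq_factor[freq_factor$Freq > 0, ]` should give the same set of users as the character version.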