The Archive Base

Editorial Team
Asked: June 5, 2026

I have almost 3.000 CSV files (containing tweets) with the same format, I want

I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions, but the number of files there is usually quite small. I hope you can help me write R code that does this job both efficiently and effectively.

The CSV files have the following format:

[Image: example CSV files]

In columns 2 and 3 I replaced the Twitter usernames with A-E and the ‘actual names’ with A1-E1.

Raw text file:

"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 @A (A1):  Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 @C (C1):  Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 @D (D1):  LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"

Somehow my headers are messed up: they should obviously move one column to the right. Each CSV file contains up to 1,500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets), simply because these should be unique, whereas the author column can contain repeated values (e.g. one author posting multiple tweets).

Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the two steps be separated? As a starting point I have included two links to blog posts by Hayward Godwin that discuss three approaches for merging CSV files.

http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/

http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/

Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R) but I haven’t found anything that discusses both merging and removing the duplicates. I really hope you can help me and my limited R knowledge deal with this challenge!

Although I have tried some code I found on the web, it didn’t actually produce an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):

filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))              

This results in the following error:

Error in file(file, "rt") : cannot open the connection 
In addition: Warning message: 
In file(file, "rt") : 
  cannot open file '..': No such file or directory 

Update

I have tried the following code:

 # grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',     col.names=c('ID','tweet','author','local.time'), colClasses=rep('character', 4)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

But I run into the following errors:

After the 3rd line I get:

  Error in read.table(file = file, header = header, sep = sep, quote = quote,  :  more columns than column names

After the 4th line I get:

  Error: object 'my.df' not found

I suspect that these errors are caused by mistakes made when the CSV files were written, since there are some cases of the author/local.time being in the wrong column, either to the left or the right of where they are supposed to be, which results in an extra column. I manually fixed 5 files and tested the code on them, and I didn’t get any errors. However, it seemed like nothing happened at all: I didn’t get any output from R.

To solve the extra column problem I adjusted the code slightly:

 #grab our list of filenames
 filenames <- list.files(path = ".", pattern='^.*\\.csv$')
 # write a special little read.csv function to do exactly what we want
 my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',   col.names=c('ID','tweet','author','local.time','extra'), colClasses=rep('character', 5)) }
 # read in all those files into one giant data.frame
 my.df <- do.call("rbind", lapply(filenames, my.read.csv))
 # remove the duplicate tweets
 my.new.df <- my.df[!duplicated(my.df$tweet),]

I tried this code on all the files. Although R clearly started processing, I eventually got the following errors:

 Error in read.table(file = file, header = header, sep = sep, quote = quote,  : more columns than column names
 In addition: Warning messages:
 1: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_22_30 2012 .csv'
 2: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_24_31 2012 .csv'

 Error: object 'my.df' not found

What did I do wrong?

1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 8:18 am

    First, simplify matters by working in the folder where the files are (set it as the working directory), and set the pattern so that only files ending in ‘.csv’ are read, so something like

    filenames <- list.files(path = ".", pattern='^.*\\.csv$')
    my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
    

    This should get you a data.frame with the contents of all the tweets.
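If you would rather not change the working directory, note that list.files() returns names relative to `path` unless you ask for full paths, so read.csv() may not find the files. A minimal sketch, assuming a throwaway demo folder with made-up file names (`tweets_demo`, `a.csv`, `b.csv` and `notes.txt` are not from the question):

```r
# Hypothetical demo folder with two tiny ';'-separated files (names are made up)
demo.dir <- file.path(tempdir(), "tweets_demo")
dir.create(demo.dir, showWarnings = FALSE)
writeLines(c('"tweet";"author"', '"1";"hello";"A"'), file.path(demo.dir, "a.csv"))
writeLines(c('"tweet";"author"', '"1";"world";"B"'), file.path(demo.dir, "b.csv"))
writeLines("not a csv", file.path(demo.dir, "notes.txt"))

# full.names = TRUE prepends the folder, so the files can be opened from any
# working directory; the pattern keeps only names ending in .csv
filenames <- list.files(path = demo.dir, pattern = "\\.csv$", full.names = TRUE)
print(basename(filenames))
print(all(file.exists(filenames)))
```

With your real data you would replace `demo.dir` by the folder that holds the tweet files.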

    A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I’d handle those something like this:

    read.csv('fred.csv', header=FALSE, skip=1, sep=';',
        col.names=c('ID','tweet','author','local.time'),
        colClasses=rep('character', 4))
    

    N.B. I changed this so that all columns are read as character and the fields are ‘;’ separated.

    I’d parse out the time later if it was needed…
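To illustrate the two points above, here is a small sketch that reads two rows copied from the sample in the question and then converts the time column. The names `sample.lines` and `demo.df` are made up for the example, and the time zone ("UTC") is purely an assumption, since the question does not say which zone the timestamps are in:

```r
# Two rows copied from the sample in the question; the header has three
# names but each data row has four fields
sample.lines <- c(
  '"tweet";"author";"local.time"',
  '"2";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"',
  '"5";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"'
)

# Skip the misaligned header and supply our own four column names
demo.df <- read.csv(text = sample.lines, header = FALSE, skip = 1, sep = ";",
                    col.names = c("ID", "tweet", "author", "local.time"),
                    colClasses = rep("character", 4))

# Parse the time afterwards; tz = "UTC" is an assumption for this sketch
demo.df$local.time <- as.POSIXct(demo.df$local.time,
                                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
str(demo.df)
```

Reading everything as character first keeps the import step from failing on odd values; the conversion can then be checked separately (unparseable timestamps become NA).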

    A further separate issue is the uniqueness of the tweets within the data.frame – but I’m not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like

    my.new.df <- my.df[!duplicated(my.df$tweet),]
    

    For unique by author, I’d append the two fields – hard to know what works without the real data though!

    my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
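The difference between the two options can be seen on a toy data.frame (the contents of `toy.df` are invented for illustration). As one possible alternative to paste(), this sketch passes both columns to duplicated() directly, which avoids accidental collisions when pasted strings happen to line up:

```r
# Made-up data: author A posts the same tweet twice, author B posts it once
toy.df <- data.frame(
  tweet  = c("goal!", "goal!", "goal!", "nice show"),
  author = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Globally unique: keep the first occurrence of each tweet text
global.unique <- toy.df[!duplicated(toy.df$tweet), ]

# Unique per author: a row is a duplicate only if tweet AND author repeat
per.author <- toy.df[!duplicated(toy.df[, c("tweet", "author")]), ]

nrow(global.unique)  # 2
nrow(per.author)     # 3
```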
    

    So bringing it all together and assuming a few things along the way…

    # grab our list of filenames
    filenames <- list.files(path = ".", pattern='^.*\\.csv$')
    # write a special little read.csv function to do exactly what we want
    my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
        col.names=c('ID','tweet','author','local.time'),
        colClasses=rep('character', 4)) }
    # read in all those files into one giant data.frame
    my.df <- do.call("rbind", lapply(filenames, my.read.csv))
    # remove the duplicate tweets
    my.new.df <- my.df[!duplicated(my.df$tweet),]
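On the "nothing happened" observation in the question: an assignment such as `my.new.df <- ...` prints nothing, so no visible output is expected when the code succeeds. To end up with one merged file you still have to write the result out. A minimal sketch, assuming a made-up output name (`merged_tweets.csv`) and a tiny stand-in for the real `my.new.df`:

```r
# Stand-in for the de-duplicated data.frame (contents are invented)
my.new.df <- data.frame(
  ID         = c("1", "2"),
  tweet      = c("first tweet", "second tweet"),
  author     = c("A (A1)", "B (B1)"),
  local.time = c("2012-06-05 00:01:45", "2012-06-05 00:01:41"),
  stringsAsFactors = FALSE
)

# Hypothetical output location; use a path of your own choosing
out.file <- file.path(tempdir(), "merged_tweets.csv")

# Same ';' separator as the input files; embedded quotes are doubled
write.table(my.new.df, file = out.file, sep = ";",
            row.names = FALSE, qmethod = "double")

# Read it back to check that the header now lines up with the columns
check.df <- read.csv(out.file, sep = ";", colClasses = "character")
print(names(check.df))
```

Dropping the row names when writing avoids recreating the shifted-header problem that the input files have.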
    

    Based on the revised error after line 3 (‘more columns than column names’), the problem is files with different numbers of columns. This is not easy to fix in general, except as you have suggested by specifying more columns than you expect. If you remove the column specification then you will run into problems when you try to rbind() the data.frames together…

    Here is some code using a for() loop and some debugging cat() statements to make more explicit which files are broken so that you can fix things:

    filenames <- list.files(path = ".", pattern='^.*\\.csv$')

    if (exists('my.df')) rm(my.df) # start clean so results of earlier runs are not reused
    n.files.processed <- 0 # how many files did we process?
    for (fnam in filenames) {
      cat('about to read from file:', fnam, '\n')
      # tryCatch() returns NULL for a broken file instead of stopping the loop
      tmp.df <- tryCatch(
        read.csv(fnam, header=FALSE, skip=1, sep=';',
                 col.names=c('ID','tweet','author','local.time','extra'),
                 colClasses=rep('character', 5)),
        error = function(e) {
          cat('  ERROR reading', fnam, ':', conditionMessage(e), '\n')
          NULL
        })
      n.files.processed <- n.files.processed + 1
      if (!is.null(tmp.df) && (nrow(tmp.df) > 0)) {
        cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
        # now lets append a column containing the originating file name
        # so that debugging the file contents is easier
        tmp.df$fnam <- fnam

        # now lets rbind everything together
        if (exists('my.df')) {
          my.df <- rbind(my.df, tmp.df)
        } else {
          my.df <- tmp.df
        }
      } else {
        cat('  read NO rows from ', fnam, '\n')
      }
    }
    cat('processed ', n.files.processed, ' files\n')
    # remove the duplicate tweets
    my.new.df <- my.df[!duplicated(my.df$tweet),]
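As a complement to the loop above, it may help to find the malformed files before reading anything. The sketch below uses count.fields() to get the number of fields on each line and flags files where any line has more than the expected four. The demo folder and the files `good.csv` and `bad.csv` are invented for illustration; `comment.char = ""` is set because tweets contain '#':

```r
# Hypothetical demo folder: one well-formed file and one with extra fields
demo.dir <- file.path(tempdir(), "fields_demo")
dir.create(demo.dir, showWarnings = FALSE)
writeLines(c('"tweet";"author";"local.time"',
             '"1";"some tweet";"A (A1)";"2012-06-05 00:01:45"'),
           file.path(demo.dir, "good.csv"))
writeLines(c('"tweet";"author";"local.time"',
             '"1";"some tweet";"stray";"B (B1)";"2012-06-05 00:01:41";"more"'),
           file.path(demo.dir, "bad.csv"))

filenames <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)

# Largest number of fields on any data line of each file (header skipped)
max.fields <- sapply(filenames, function(fnam) {
  max(count.fields(fnam, sep = ";", quote = "\"", skip = 1,
                   comment.char = ""), na.rm = TRUE)
})

# Files with more than the expected 4 fields need a closer look
suspect <- basename(filenames[max.fields > 4])
print(suspect)
```

Files flagged this way can be inspected or repaired by hand, after which the stricter four-column specification may be enough.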
    