I have 1000 CSV files from dfr.jstor.org with two columns, KEYWORDS and WEIGHT. The

Question

Asked: May 26, 2026

I have 1000 CSV files from dfr.jstor.org with two columns, KEYTERMS and WEIGHT. The length of each column varies from file to file. Here’s a snippet of one CSV file:

KEYTERMS  WEIGHT
canoe     1
archaic   0.273
pinus     0.191
florida   0.164

I want to use R to get the KEYTERMS column from each CSV file and merge it into a single data frame like this:

KEYTERMS_CSVFILENAME1 KEYTERMS_CSVFILENAME2 KEYTERMS_CSVFILENAME3
thwart                newsom                period 
dugout                site                  cypress 
sigma                 date                  hartmann 
precontact            NA                    florida 
orange                NA                    NA

Where CSVFILENAME1 is the name of the CSV file where those keywords came from and NA is an empty cell.

I think my problem is very similar to this one, with the difference that I have varying column lengths. This may also be relevant to a solution, and this looks right on topic, but I need a bit of hand-holding to make it suit my situation. Thanks in advance!

1 Answer

Editorial Team · Answer 1 · 2026-05-26T17:32:56+00:00

To save a LITTLE memory/time you could modify the solution from @Ben Bolker like this:

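# csvnames: character vector of paths to the CSV files
# Read only the first column (KEYTERMS); colClasses "NULL" skips WEIGHT
datlist <- lapply(csvnames, read.csv, colClasses = c("character", "NULL"))
# Row indices running up to the length of the longest file
rowseq <- seq_len(max(vapply(datlist, nrow, integer(1))))
# Indexing past the end of a character vector pads with NA
keylist <- lapply(datlist, function(x) x[[1]][rowseq])
names(keylist) <- paste("KEYTERMS", csvnames, sep = "_")
# do.call(cbind, keylist)  # would give a character matrix instead
do.call(data.frame, c(keylist, stringsAsFactors = FALSE))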

I changed it so that only the first column is read, and simplified the NA padding by relying on the fact that indexing a character vector beyond its length pads the result with NA automatically.
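As a small illustration of that padding behaviour, here is a sketch with a made-up vector (`keys` and the target length 5 are just example values, not taken from the JSTOR files):

```r
# Hypothetical example vector with three key terms
keys <- c("canoe", "archaic", "pinus")

# Asking for positions 1..5 returns the three values plus two NAs
padded <- keys[seq_len(5)]
print(padded)

# The padding is already of character type, so no coercion is needed
print(typeof(padded))
```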

If you kept the old way of padding, you should at least pad with NA_character_ instead of NA to avoid unnecessary coercion.
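A rough sketch of the difference, assuming the old approach appended the missing values with `c()` (the names `keys` and `n` are hypothetical):

```r
# Hypothetical short column that must be padded to length n
keys <- c("canoe", "archaic")
n <- 4

# Plain NA is a logical value; NA_character_ is already a character value
pad_logical <- rep(NA, n - length(keys))
pad_char <- rep(NA_character_, n - length(keys))
print(typeof(pad_logical))
print(typeof(pad_char))

# Both give the same result, but the first has to coerce logical to character
padded_a <- c(keys, pad_logical)
padded_b <- c(keys, pad_char)
print(identical(padded_a, padded_b))
```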

I also index the KEYTERMS column by number instead of name (since there should be only one). I also changed sapply to vapply because I like it better 🙂 – it actually is faster too.
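Apart from speed, one practical reason to prefer vapply is that its return type is fixed by the template argument. A minimal sketch with invented data frames (`datlist` here is a stand-in, not the real files):

```r
# Stand-in for the list of data frames read from the CSV files
datlist <- list(
  data.frame(KEYTERMS = c("canoe", "archaic", "pinus")),
  data.frame(KEYTERMS = "florida")
)

# vapply checks that every result is a single integer
lens <- vapply(datlist, nrow, integer(1))
print(lens)

# With no files at all, sapply silently returns an empty list,
# while vapply still returns an (empty) integer vector
empty_s <- sapply(list(), nrow)
empty_v <- vapply(list(), nrow, integer(1))
print(class(empty_s))
print(class(empty_v))
```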

Finally you said you wanted a data.frame. The last line produces that instead of a matrix.
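For anyone who wants to try the whole thing without the JSTOR data, here is a self-contained sketch. It writes two small made-up CSV files to a temporary directory; the file names, the use of list.files() to build `csvnames`, and stripping the path and extension from the column names are my assumptions, not part of the question.

```r
# Create two small example CSV files in a temporary directory (made-up data)
tmpdir <- tempfile("keyterms_")
dir.create(tmpdir)
write.csv(data.frame(KEYTERMS = c("canoe", "archaic", "pinus"),
                     WEIGHT = c(1, 0.273, 0.191)),
          file.path(tmpdir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(KEYTERMS = c("newsom", "site"),
                     WEIGHT = c(1, 0.5)),
          file.path(tmpdir, "file2.csv"), row.names = FALSE)

# Collect the file paths (assumption: all CSVs live in one directory)
csvnames <- list.files(tmpdir, pattern = "\\.csv$", full.names = TRUE)

# Read only the first column of each file and pad to the longest length
datlist <- lapply(csvnames, read.csv, colClasses = c("character", "NULL"))
rowseq <- seq_len(max(vapply(datlist, nrow, integer(1))))
keylist <- lapply(datlist, function(x) x[[1]][rowseq])

# Name the columns after the files, without directory or extension
names(keylist) <- paste("KEYTERMS",
                        tools::file_path_sans_ext(basename(csvnames)),
                        sep = "_")

# cbind gives a character matrix; data.frame gives a data frame
as_matrix <- do.call(cbind, keylist)
result <- do.call(data.frame, c(keylist, stringsAsFactors = FALSE))
print(result)
```

The stringsAsFactors = FALSE argument only matters on R versions before 4.0, where data.frame() would otherwise turn the key terms into factors.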
