I am currently analysing a rather large dataset (22k+ records) and am having some trouble

Question

Asked: May 20, 2026

I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).

The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.

I have used the reshape package (and reshape2, for speed) to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape).

dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")

Now, the format I would like my data to be in is as follows:
User ID, Qid1, …, Qid255, Time, with the probability for each question in that question's corresponding column.

I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.

In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).

Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.


1 Answer

Editorial Team · Answer 1 · 2026-05-20T21:29:54+00:00

Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want the probabilities to be the cell values, with time added as a column on the right.

I think the difficulty in making this work is that what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get the data into your preferred format (UID-Time-X### with probabilities as the values). It seems to me that you're treating time as an ID variable, but it's storing values like a content variable.

To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3

That is simply the same data reshaped to wide format.
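As a rough illustration of that wide layout, here is a minimal sketch using the base reshape function. The toy data frame below is made up for the example: I'm assuming columns named UID, variable, and X1 to X3, which may not match what foobar.csv actually contains, so adjust the names to your file.

```r
# Hypothetical toy data standing in for foobar.csv (column names are assumptions):
# 6 rows, 3 with variable == "probability" and 3 with variable == "time".
dtf <- data.frame(
  UID      = rep(c(1, 2, 3), each = 2),
  variable = rep(c("probability", "time"), times = 3),
  X1       = c(0.2, 10, 0.5, 12, 0.9, 15),
  X2       = c(0.4, 11, 0.6, 13, 0.1, 16),
  X3       = c(0.7, 14, 0.3, 17, 0.8, 18),
  stringsAsFactors = FALSE
)

# One row per UID; every X### column is split into a ".probability"
# and a ".time" column, so no information is discarded.
wide <- reshape(dtf, idvar = "UID", timevar = "variable", direction = "wide")

print(names(wide))
print(wide)
```

The resulting columns are named like X1.probability and X1.time rather than Prob1 and Time1, but the shape is the same: one row per UID, with both the probability and the time kept for every question.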
