I have large, messy data files that look something like this:
1 2 3 4 5 6 7 8 . .
aa bb ccc d eee ffff gg h i jj
6 6 5 1 2 3 4 5i 734
33 44x 1234 12 1 9 888 345 12 987765
Most, but not all, lines in a data file have the same number of elements. What is the best way to read such a data file and convert it to a matrix or data frame?
I have been using readLines to read the file.
Also, I know from an answer to one of my earlier questions (R: convert asymmetric list to matrix – number of elements in each sub-list differ) that an asymmetric list can be converted to a matrix using the following three lines:
max.len <- max(sapply(my.data, length))
corrected.list <- lapply(my.data, function(x) {c(x, rep(NA, max.len - length(x)))})
mat <- do.call(rbind, corrected.list)
I was thinking maybe I could:
- read the data file with readLines,
- split each row in the data set into its separate elements, and then
- convert the entire data set into a list, and then
- use the three lines above to create a matrix
However, I get stuck on Step 2. I cannot figure out how to split each line into separate elements because the number of empty spaces between elements varies. Further, I suspect the proposed 4-step strategy is not efficient.
Thank you for any help with this problem.
EDIT
Sorry, I forgot to post the desired result. I would like the data to look something like this once it is in the matrix or data frame:
1 2 3 4 5 6 7 8 . .
aa bb ccc d eee ffff gg h i jj
6 6 5 1 2 3 4 5i 734 NA
33 44x 1234 12 1 9 888 345 12 987765
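One possible shortcut, sketched here under assumptions rather than offered as a definitive solution: read.table with fill = TRUE pads short rows, and count.fields can supply the widest row so that the column count does not depend only on the first five lines. The temporary file and the column names V1..V10 are made up for illustration.

```r
# Write the sample data to a temporary file so the sketch is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 2 3 4 5 6 7 8 . .",
             "aa bb ccc d eee ffff gg h i jj",
             "6 6 5 1 2 3 4 5i 734",
             "33 44x 1234 12 1 9 888 345 12 987765"), tmp)

# Number of whitespace-separated fields on the widest line.
n.col <- max(count.fields(tmp, sep = ""))

# fill = TRUE pads short rows; col.names fixes the number of columns.
# Everything is read as character because the columns mix numbers and text.
df <- read.table(tmp, header = FALSE, fill = TRUE,
                 col.names = paste0("V", seq_len(n.col)),
                 colClasses = "character")

# Padded cells may come back as empty strings; convert them to NA explicitly.
df[] <- lapply(df, function(x) replace(x, x %in% "", NA))
df
```

Reading every column as character keeps values such as 5i and 44x intact; columns that turn out to be purely numeric can be converted afterwards.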
Could you use strsplit to achieve part 2?
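A minimal sketch of the strsplit idea, assuming the lines have already been read with readLines (the sample vector below stands in for that call, and extra spaces were added to show that uneven spacing is handled). Splitting on the regular expression "\\s+" treats any run of whitespace as one separator; the padding step reuses the three lines from the question.

```r
# Stand-in for: lines <- readLines("my_file.txt")  (hypothetical file name)
lines <- c("1 2 3 4 5 6 7 8 . .",
           "aa bb  ccc d eee ffff gg h i jj",
           "6 6 5 1 2   3 4 5i 734",
           "33 44x 1234 12 1 9 888 345 12 987765")

# Split on runs of whitespace; trimws() avoids an empty first element
# when a line begins with blanks.
my.data <- strsplit(trimws(lines), "\\s+")

# Pad shorter rows with NA, then bind the rows into a character matrix.
max.len <- max(sapply(my.data, length))
corrected.list <- lapply(my.data, function(x) c(x, rep(NA, max.len - length(x))))
mat <- do.call(rbind, corrected.list)
mat
```

The result is a character matrix; as.data.frame(mat) would turn it into a data frame if that form is preferred.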