How can one read in a text file in which each record is a paragraph and each newline denotes separate field. The complication is that some records have 4 lines and some have 6. @DWin nailed my questions when the the difference in number of fields was 1 but it all fell apart when it was two. You can have a look at his answer here.
So here is my latest simulation of the starting text
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:56
blay blay blah who knows what, but anyway it may have a comma
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:58
blay blay blah who knows what
TheInstitute 5467
telephone line 412552999 x 4999
bump phone line 4125527777
bump pony pony oops 4125527777
datetime 2011110516 12:59
blay blay blah who knows what
TheInstitute 5467
telephone line 4125526987 x 4567
bump phone line 4125527777
bump pony pony oops 4125527777
datetime 2011110516 13:51
blay blay blah who knows what, but anyway it may have a comma
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 14:56
blay blay blah who knows what
Here is what the output should look like. In fact this is one step removed from what I need. I am placing a ASCII text representation of an R data.frame below. You will see that everything is in a data frame but the field values are shifted by two columns because some records have two extra fields.
structure(list(institution = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "TheInstitute 5467", class = "factor"),
telephoneline = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c("telephone line 4125526987 x 4567",
"telephone line 412552999 x 4999"), class = "factor"), date.or.bump = structure(c(2L,
3L, 1L, 1L, 4L), .Label = c("bump phone line 4125527777",
"datetime 2011110516 12:56", "datetime 2011110516 12:58",
"datetime 2011110516 14:56"), class = "factor"), field4 = structure(c(2L,
1L, 3L, 3L, 1L), .Label = c("blay blay blah who knows what",
"blay blay blah who knows what, but anyway it may have a comma",
"bump pony pony oops 4125527777"), class = "factor"), field5 = structure(c(1L,
1L, 2L, 3L, 1L), .Label = c("", "datetime 2011110516 12:59",
"datetime 2011110516 13:51"), class = "factor"), field6 = structure(c(1L,
1L, 2L, 3L, 1L), .Label = c("", "blay blay blah who knows what",
"blay blay blah who knows what, but anyway it may have a comma"
), class = "factor")), .Names = c("institution", "telephoneline",
"date.or.bump", "field4", "field5", "field6"), class = "data.frame", row.names = c(NA,
-5L))
PS: Am I correct to believe that one posts a data frame by using dput or can one save a .Rdata file direclty here.
There is probably a more elegant way, but this should get the job done.
Update:
Here’s another solution using
plyr::rbind.fill: