I’ve got a fixed-width flat file with no line terminators at all, neither carriage return nor line feed (a dump from an AS400).
How do I load this file into an R data.frame?
I’ve tried different combinations of textConnection and read.fwf, to no avail.
The code below crashes RStudio, so I assume I’m overloading the system.
len below is 24376400 bytes, which is tame compared with the files I usually load using read.table.
Record length is 400.
Is there any RECLEN parameter I should set, similar to SAS?
Is there an option to set EOL = "\n" or "\r\n"? Thank you.
fname <- "AS400FILE.TXT"
len <- file.info(fname)$size
conn <- file(fname, 'r')
contents <- readChar(conn, len)
close(conn)
df <- read.fwf( textConnection(contents) , widths=layout$length , sep="")
> dput(layout)
structure(list(start = c(1L, 41L, 81L, 121L, 161L, 201L, 224L,
226L, 231L, 235L, 237L, 238L, 240L, 280L, 290L, 300L, 305L, 308L,
309L, 330L, 335L, 337L, 349L, 350L, 351L, 355L, 365L), end = c(40L,
80L, 120L, 160L, 200L, 223L, 225L, 230L, 234L, 236L, 237L, 239L,
279L, 289L, 299L, 304L, 307L, 308L, 329L, 334L, 336L, 348L, 349L,
350L, 354L, 364L, 400L), length = c(40L, 40L, 40L, 40L, 40L,
23L, 2L, 5L, 4L, 2L, 1L, 2L, 40L, 10L, 10L, 5L, 3L, 1L, 21L,
5L, 2L, 12L, 1L, 1L, 4L, 10L, 36L), label = c("TITLE", "SUFFIX",
"ADDRESS1", "ADDRESS2", "ADDRESS3", "CITY", "STATE",
"ZIP", "ZIP+4", "DELIVERY", "CHECKD", "FILLER", "NAME",
"SOURCECODE", "ID", "FILLER", "BATCH", "FILLER", "FILLER",
"GRID", "LOT", "FILLER", "CONTROL",
"ZIPIND", "TROUTE", "SOURCEA", "FILLER")), .Names = c("start",
"end", "length", "label"), class = "data.frame", row.names = c(NA,
-27L))
> dim(layout)
[1] 27 4
>
You could use readChar for this. First, make up some sample data (as far as I can tell from the question, the format is a wall of text with a specified width per column and no newlines anywhere in the file):
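A minimal sketch of such sample data, assuming a made-up three-field layout rather than the 27-field one from the question. The field widths, labels, and record contents here are hypothetical and only stand in for the real file:

```r
# Hypothetical layout: three fields of widths 3, 5 and 2, so each record is 10 characters.
layout <- data.frame(length = c(3L, 5L, 2L),
                     label  = c("CODE", "NAME", "QTY"),
                     stringsAsFactors = FALSE)
records <- c("AAAalpha01", "BBBbravo02", "CCCdelta03")

# Write the records back to back into a temporary file.
fname <- tempfile(fileext = ".txt")
conn <- file(fname, "wb")
# eos = NULL suppresses the terminator, so the file is one unbroken run of text.
writeChar(paste(records, collapse = ""), conn, eos = NULL)
close(conn)

# Size in bytes: number of records times the record length.
file.info(fname)$size
```

The connection is opened in binary mode because readChar and writeChar are intended for binary-mode connections.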
I can think of three ways to do it, each with its own trade-offs:
If you know the number of rows in advance you can do:
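A sketch of this first approach, assuming the row count is already known. It relies on readChar accepting a vector of lengths and returning one string per length. The small layout and file are hypothetical stand-ins, rebuilt here so the block runs on its own:

```r
# Hypothetical sample: three fields (widths 3, 5, 2), three records, no line breaks.
layout <- data.frame(length = c(3L, 5L, 2L),
                     label  = c("CODE", "NAME", "QTY"),
                     stringsAsFactors = FALSE)
fname <- tempfile(fileext = ".txt")
conn <- file(fname, "wb")
writeChar("AAAalpha01BBBbravo02CCCdelta03", conn, eos = NULL)
close(conn)

nFields <- nrow(layout)
nRows   <- 3L  # assumed to be known in advance

# Ask readChar for every field of every record in a single call.
conn <- file(fname, "rb")
fields <- readChar(conn, rep(layout$length, nRows))
close(conn)

# The fields arrive record by record, so fill the matrix row-wise.
df <- as.data.frame(matrix(fields, ncol = nFields, byrow = TRUE),
                    stringsAsFactors = FALSE)
names(df) <- layout$label
```

All columns come back as character; converting them with something like type.convert would be a separate step.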
Otherwise you can use a loop (why read in the file once to work out the number of rows and then again to parse it?).
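The loop could look roughly like the sketch below, which reads one record's worth of fields per iteration and stops when readChar returns fewer fields than expected. The sample layout and file are hypothetical, and the sketch assumes the file contains only complete records:

```r
# Hypothetical sample: three fields (widths 3, 5, 2), three records, no line breaks.
layout <- data.frame(length = c(3L, 5L, 2L),
                     label  = c("CODE", "NAME", "QTY"),
                     stringsAsFactors = FALSE)
fname <- tempfile(fileext = ".txt")
conn <- file(fname, "wb")
writeChar("AAAalpha01BBBbravo02CCCdelta03", conn, eos = NULL)
close(conn)

nFields <- nrow(layout)
rows <- list()

conn <- file(fname, "rb")
repeat {
  # Read one set of nFields fields, i.e. one record.
  fields <- readChar(conn, layout$length)
  # At end of file readChar returns fewer items than requested.
  if (length(fields) < nFields) break
  rows[[length(rows) + 1L]] <- fields
}
close(conn)

# Stack the records into a character matrix, then convert to a data.frame.
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- layout$label
```

Only one record is held in the read buffer at a time, although the accumulated rows still end up in memory.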
Or, since you already have all of contents in a string, you can just split the string using substring:
If you want to do it in as few file reads as possible, I suggest the second or third method.
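As a sketch of the third approach (splitting with substring), the start and end position of every field can be computed from the cumulative widths and passed to substring in one vectorised call. The sample layout and file are again hypothetical, and the sketch assumes single-byte text so that characters and bytes coincide:

```r
# Hypothetical sample: three fields (widths 3, 5, 2), three records, no line breaks.
layout <- data.frame(length = c(3L, 5L, 2L),
                     label  = c("CODE", "NAME", "QTY"),
                     stringsAsFactors = FALSE)
fname <- tempfile(fileext = ".txt")
conn <- file(fname, "wb")
writeChar("AAAalpha01BBBbravo02CCCdelta03", conn, eos = NULL)
close(conn)

# Read the whole file into a single string, as in the question.
conn <- file(fname, "rb")
contents <- readChar(conn, file.info(fname)$size)
close(conn)

recLen <- sum(layout$length)            # record length
nRows  <- nchar(contents) %/% recLen    # number of complete records

# End position of each field is the running total of widths; start follows from it.
widths <- rep(layout$length, nRows)
ends   <- cumsum(widths)
starts <- ends - widths + 1L

fields <- substring(contents, starts, ends)
df <- as.data.frame(matrix(fields, ncol = nrow(layout), byrow = TRUE),
                    stringsAsFactors = FALSE)
names(df) <- layout$label
```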
The third method “feels” most elegant to me, but requires you to read in the entire contents all at once, which, depending on file size, may not be viable.
If that’s the case I’d go for the second, which only reads in one set of nFields fields at a time.
I don’t recommend the first unless you already know the number of rows by some other means; it was just my first attempt. Without that, you have to read the file once to determine the number of rows, then close it and read it again, and if you are going down that route you may as well use method 3.