I need to read a .dat file using a .dct file. Has anyone done that using R?
The format is:
dictionary {
# how many lines per record
_lines(1)
# start defining the first line
_line(1)
# starting column / storage type / variable name / read format / variable label
_column(1) str8 aid %8s "respondent identifier"
...
}
"Read formats" look like this:
%2f    2-column integer variable
%12s   12-column string variable
%8.2f  8-column number with 2 implied decimal places
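To make the read formats concrete, here is a minimal R sketch that splits a format such as %8.2f into its width, its number of implied decimal places, and its type letter. The helper name parse_read_format() is made up for illustration, and the pattern only covers the simple forms listed above.

```r
# parse_read_format() is a hypothetical helper, not part of any package.
# It only understands formats of the form %<width>[.<decimals>]<letters>.
parse_read_format <- function(fmt) {
  pattern <- "^%([0-9]+)(\\.([0-9]+))?([a-z]+)$"
  m <- regmatches(fmt, regexec(pattern, fmt))[[1]]
  if (length(m) == 0) stop("unrecognised read format: ", fmt)
  list(
    width    = as.integer(m[2]),
    # the decimals group is empty when the format has no ".<n>" part
    decimals = if (nzchar(m[4])) as.integer(m[4]) else 0L,
    type     = m[5]
  )
}

fmt_num <- parse_read_format("%8.2f")
fmt_str <- parse_read_format("%12s")
```

The width is what a fixed-width reader needs; the decimals value only matters if the raw field contains no decimal point of its own.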
Storage types are described here: http://www.stata.com/help.cgi?datatypes
Other sites used for info:
http://library.columbia.edu/indiv/dssc/technology/stata_write.html
http://www.stata.com/support/faqs/data-management/reading-fixed-format-data/
The .dat file is a block of values corresponding to the variables specified in the .dct file (presumably data in fixed-width columns).
Here is a real example:
.dct file
http://goo.gl/qHZOk
data
http://goo.gl/FRGRF
A specific example from the stata site is:
The .dat file (“test.raw” in this instance)
C1245A101George Costanza
B1223B011Cosmo Kramer
The .dct file
dictionary using test2.raw {
_column(1) str5 code %5s
_column(2) int call %4f
_column(6) str1 city %1s
_column(7) int neigh %3f
_column(10) str16 name %16s
}
The resulting data file:
     +-----------------------------------------------+
     |  code   call   city   neigh              name |
     |-----------------------------------------------|
  1. | C1245   1245      A     101   George Costanza |
  2. | B1223   1223      B      11      Cosmo Kramer |
     +-----------------------------------------------+
@thelatemail is spot-on about how to proceed. Here’s a small function I threw together to get you started on a more robust solution:
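As a rough illustration of the approach (a sketch under stated assumptions, not necessarily the function originally posted here): parse the _column() lines of the dictionary, turn them into a widths vector, and hand that to read.fwf. The name read_dct_sketch() is hypothetical. It assumes one line per record, ignores labels and decimals, and refuses overlapping columns.

```r
# read_dct_sketch() is a hypothetical name, not a function from any package.
# Assumptions: one line per record, and every variable line looks like
#   _column(start) type name %<width>...   (labels are ignored)
read_dct_sketch <- function(dct_file, dat_file) {
  lines <- trimws(readLines(dct_file))
  lines <- lines[grepl("^_column\\(", lines)]
  if (length(lines) == 0) stop("no _column() lines found in ", dct_file)

  sp <- "[[:space:]]+"
  word <- "([^[:space:]]+)"
  pattern <- paste0("^_column\\(([0-9]+)\\)", sp, word, sp, word, sp, "%([0-9]+)")
  parts <- regmatches(lines, regexec(pattern, lines))
  unparsed <- lengths(parts) == 0
  if (any(unparsed)) stop("could not parse: ", lines[unparsed][1])

  field <- function(i) vapply(parts, function(p) p[i], character(1))
  meta <- data.frame(start = as.integer(field(2)), type = field(3),
                     name = field(4), width = as.integer(field(5)),
                     stringsAsFactors = FALSE)
  meta <- meta[order(meta$start), ]

  # Unread columns before each field become negative widths,
  # which read.fwf treats as columns to skip.
  end <- meta$start + meta$width - 1L
  gap <- meta$start - c(0L, end[-length(end)]) - 1L
  if (any(gap < 0)) stop("overlapping columns are not supported")
  widths <- as.vector(rbind(-gap, meta$width))
  widths <- widths[widths != 0]

  data <- utils::read.fwf(dat_file, widths = widths, col.names = meta$name,
                          stringsAsFactors = FALSE)
  list(data = data, metadata = meta)
}
```

Calling read_dct_sketch("test.dct", "test.dat") would return a list holding the data frame and the parsed dictionary, so the storage types stay available for later use.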
There is still a lot you would have to do with respect to error checking, generalizing the function, and so on. For example, this function does not work with overlapping columns, as are present in the example that @thelatemail added to your question. A check that StartPos[n] + ColWidth[n] equals StartPos[n+1] could be used to stop reading the file with an error message when that is not true. Additionally, the classes of the resulting data can be extracted from the "metadata" list generated by the function and assigned in read.fwf using the colClasses argument.
Here is a dat file and a dct file to demonstrate:
Copy and paste the following two lines into a text editor and save them in your working directory as "test.dat".
Copy and paste the following lines into a text editor and save them in your working directory as "test.dct".
Now, run the function:
Update: An improved function (with still a lot of room for improvement)
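Two of the improvements suggested earlier, the StartPos/ColWidth check and the colClasses mapping, could look roughly like this. Both helper names are made up, and the mapping from Stata storage types to R classes is an assumption that only covers the common types.

```r
# Hypothetical helpers; the names and the type mapping are assumptions.

# Stop with an error unless every column starts where the previous one ends.
check_contiguous <- function(start, width) {
  n <- length(start)
  if (n < 2) return(invisible(TRUE))
  expected_next <- start[-n] + width[-n]
  bad <- which(expected_next != start[-1])
  if (length(bad) > 0) {
    i <- bad[1]
    stop(sprintf("column %d ends at %d but column %d starts at %d",
                 i, expected_next[i] - 1L, i + 1L, start[i + 1L]))
  }
  invisible(TRUE)
}

# Map Stata storage types to values usable in read.fwf's colClasses argument.
storage_to_class <- function(type) {
  ifelse(grepl("^str", type), "character",
         ifelse(type %in% c("byte", "int", "long"), "integer", "numeric"))
}

classes <- storage_to_class(c("str5", "int", "str1", "float", "str16"))
```

With the Stata FAQ dictionary, check_contiguous(c(1, 2, 6, 7, 10), c(5, 4, 1, 3, 16)) stops with an error because the first two columns overlap, which is exactly the case read.fwf cannot handle.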
How does it work with your data?
What about with the original example?
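For the original Stata example, where code and call overlap, read.fwf is not a good fit, but substring() does not care about overlaps. The sketch below types the dictionary in by hand rather than parsing it; read_overlapping() is a hypothetical name, and treating every non-string type as numeric is a simplifying assumption.

```r
# The two records and the dictionary from the Stata FAQ example, entered by hand.
raw <- c("C1245A101George Costanza", "B1223B011Cosmo Kramer")
dict <- data.frame(
  start = c(1L, 2L, 6L, 7L, 10L),
  type  = c("str5", "int", "str1", "int", "str16"),
  name  = c("code", "call", "city", "neigh", "name"),
  width = c(5L, 4L, 1L, 3L, 16L),
  stringsAsFactors = FALSE
)

# read_overlapping() is a hypothetical helper: each field is cut out
# independently, so overlapping column ranges are not a problem.
read_overlapping <- function(raw, dict) {
  cols <- lapply(seq_len(nrow(dict)), function(i) {
    last <- dict$start[i] + dict$width[i] - 1L
    field <- trimws(substring(raw, dict$start[i], last))
    # assumption: anything that is not a strN type is read as a number
    if (grepl("^str", dict$type[i])) field else as.numeric(field)
  })
  names(cols) <- dict$name
  as.data.frame(cols, stringsAsFactors = FALSE)
}

result <- read_overlapping(raw, dict)
```

The resulting data frame has the same five variables as the Stata listing, with neigh read as 101 and 11.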