I have a Word document that contains data dictionaries.
For example, a variable called FUEL is described as follows:
FUEL -- What type of fuel does it take?
1 Gas
2 Diesel
3 Hybrid
4 Flex fuel
7 OTHER, SPECIFY
I want to convert the document into a PostgreSQL table. Do you have any suggestions?
In general, this sort of thing takes two stages. First, massage the data into a sane tabular format using text-processing tools and scripting, or with something like Excel. Second, once you have a tabular format, output the data as CSV (say, with Save As in Excel) and load it into PostgreSQL using the COPY command or psql's \copy, after running appropriate CREATE TABLE commands to define a table structure that matches the structure of the CSV.
Edit: Given the updated post, I'd say you probably have to write a simple parser for this, unless the document contains internal structured markup. Save the document as plain text. Now write a script in a language like Perl or Python that looks for the heading that defines the variable, extracts the capitalized variable name and the description from that line, then reads numbered options until it runs out and is ready to read the next variable. If the document is uniformly structured this should only take a few lines of code with some basic regular expressions; you could probably even do it in awk. Have the script either write CSV ready for importing later, or use database interfaces like DBD::Pg (Perl) or psycopg2 (Python) to store the data directly.
If you don't know any scripting tools, you'll either need to learn or get very good at copy and paste.
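As a rough illustration of the parser described above, here is a minimal Python sketch. It assumes the plain-text export looks exactly like the FUEL example in the question: a heading of the form NAME -- description, followed by lines that start with an integer code. The function names, the CSV column names (variable, description, code, label), and the rule that any other non-blank line ends the current variable are my own assumptions, not something taken from your document, so adjust the two regular expressions to match what your file really contains.

```python
import csv
import io
import re

# Assumed heading format: capitalized NAME, then "--", then the description.
HEADING = re.compile(r"^([A-Z][A-Z0-9_]*)\s+--\s+(.*)$")
# Assumed option format: an integer code, whitespace, then the label.
OPTION = re.compile(r"^(\d+)\s+(.*)$")


def parse_dictionary(lines):
    """Yield (variable, description, code, label) tuples from plain-text lines."""
    variable = description = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        heading = HEADING.match(line)
        if heading:
            variable, description = heading.groups()
            continue
        option = OPTION.match(line)
        if option and variable is not None:
            yield variable, description, int(option.group(1)), option.group(2)
        else:
            # Anything else ends the current variable's option list.
            variable = description = None


def to_csv(rows):
    """Return the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["variable", "description", "code", "label"])
    writer.writerows(rows)
    return buf.getvalue()


SAMPLE = """\
FUEL -- What type of fuel does it take?
1 Gas
2 Diesel
3 Hybrid
4 Flex fuel
7 OTHER, SPECIFY
"""

if __name__ == "__main__":
    print(to_csv(parse_dictionary(SAMPLE.splitlines())), end="")
```

The csv module quotes labels that contain commas, such as OTHER, SPECIFY, so the output should survive a CSV import. Once it is saved to a file, you could load it with psql's \copy into a table whose columns match the header, using the CSV format with the HEADER option so the first row is skipped.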