I am helping a non-profit organization move their existing data into a database. There are about 200 records, which they have been keeping in a simple Word document. I am starting to structure the raw data so I can load it into the database. I copied the data into Textpad, and it came across cleanly. That said, the data is structured but not consistently: for example, some organizations have a website and some don’t. Here is a sample that is representative of the rest of the data:
I have created an ERD; it has gone through a number of revisions and was given the green light by my mentor. At this point I am at the ETL (extract, transform, load) stage:
- Clean up the remaining partially structured but messy data.
- Put it into an Excel-readable file format, arranged into the applicable tables.
- Create a data input SQL script.
- Run the script.
I have done this already with some of the other data and it worked perfectly.
Cleaning up the data and getting it into an Excel-readable format (CSV or tab-delimited) is where I need guidance. Or is it better to convert it to XML? If I manually go through the text file to ensure all the headers (for lack of a better word) match like this:
Is there a way to transfer it?
I have researched this, and I was surprised that I could not find any good information. [Updated] I just found the actual term: ETL process. If I just have to start retyping and/or cutting and pasting, let me know.


Those two “records” are significantly different. For example, some contacts have multiple phone numbers and others have only one, and the number of contacts per organization may vary. That lends itself to a relational schema with multiple tables. However, with the data laid out this way, you are going to have a hard time fully automating the population of several related tables. How much data are you dealing with? If it is not an enormous amount, you may be better off doing this semi-manually: reformat parts of your Textpad document into INSERT statements (using plenty of regular-expression search and replace), then take some time to run the queries.
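To give a rough idea of the regular-expression approach, here is a minimal sketch in Python. The `PHONE: number (type)` line layout and the `phone` table with `number` and `phone_type` columns are assumptions made up for illustration, since they are not taken from your actual data; the same capture-and-replace pattern can be used in Textpad's own regex search and replace.

```python
import re

# Hypothetical cleaned-up lines of the form "PHONE: <number> (<type>)".
lines = [
    "PHONE: 555-0100 (office)",
    "PHONE: 555-0101 (cell)",
]

# Capture the number and the type. Table and column names are assumptions.
pattern = re.compile(r"^PHONE:\s*([\d-]+)\s*\((\w+)\)$")
template = r"INSERT INTO phone (number, phone_type) VALUES ('\1', '\2');"

# Lines that do not match the pattern are returned unchanged by sub().
statements = [pattern.sub(template, line) for line in lines]

for statement in statements:
    print(statement)
```

Lines that do not fit the pattern pass through untouched, which makes the irregular ones easy to spot and fix by hand.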
If it’s a truly large amount of data, then you might want to write a little program in the language of your choice to parse the file and create an output file containing the appropriate insert statements to populate all the data tables.
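Such a program could look something like the sketch below. It assumes each record is a block of `HEADER: value` lines separated by a blank line, and that the target tables are called `organization` and `phone`; those names, the headers, and the sample text are all hypothetical, so adjust them to your own file and ERD.

```python
import re

def sql_quote(value):
    """Quote a string as a SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"

def parse_records(text):
    """Split blank-line-separated blocks into lists of (header, value) pairs."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = []
        for line in block.splitlines():
            header, sep, value = line.partition(":")
            if sep:  # skip lines that have no "HEADER:" prefix
                fields.append((header.strip().upper(), value.strip()))
        if fields:
            records.append(fields)
    return records

def to_inserts(records):
    """One organization INSERT per record, plus one phone INSERT per PHONE line."""
    out = []
    for org_id, fields in enumerate(records, start=1):
        single = {h: v for h, v in fields if h != "PHONE"}
        # A missing website becomes NULL rather than an empty string.
        website = sql_quote(single["WEBSITE"]) if "WEBSITE" in single else "NULL"
        name = sql_quote(single.get("NAME", ""))
        out.append(
            "INSERT INTO organization (id, name, website) "
            f"VALUES ({org_id}, {name}, {website});"
        )
        for header, value in fields:
            if header == "PHONE":
                out.append(
                    "INSERT INTO phone (organization_id, number) "
                    f"VALUES ({org_id}, {sql_quote(value)});"
                )
    return out

# Made-up sample text in the assumed layout.
sample = """NAME: Friends of the Library
WEBSITE: http://example.org
PHONE: 555-0100
PHONE: 555-0101

NAME: Joe's Food Bank
PHONE: 555-0200
"""

inserts = to_inserts(parse_records(sample))
print("\n".join(inserts))
```

Generating explicit ids in the script keeps the child rows tied to the right parent without relying on the database's auto-increment values. Writing the statements to a file first also lets you review them before running anything.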
For a robust relational database, you’d want a schema that includes, at a minimum, the following tables:
You could get away without the Types and Categories tables, but they may prove useful depending on the volume of data and how the organization plans to query it in the future. For example, if at some point they want to find all organizations in a particular category of a particular group type, and there are thousands of organizations or more, the extra tables will prove worthwhile.
Since the contact/phone information appears to be so flexible, you’re better off putting it into separate tables. Otherwise you’d have to add contactN/phoneN/phoneTypeN columns to the main organization table for the maximum possible number of contacts and phones, which would also put a hard limit on how many contact/phone associations could be stored.
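As an illustration of the separate-tables approach, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and sample rows are assumptions for demonstration only and are not meant to match your ERD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one organization has many contacts,
# and one contact has many phone numbers.
conn.executescript("""
CREATE TABLE organization (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE contact (
    id              INTEGER PRIMARY KEY,
    organization_id INTEGER NOT NULL REFERENCES organization(id),
    name            TEXT NOT NULL
);
CREATE TABLE phone (
    id         INTEGER PRIMARY KEY,
    contact_id INTEGER NOT NULL REFERENCES contact(id),
    number     TEXT NOT NULL,
    phone_type TEXT
);
""")

conn.execute("INSERT INTO organization (id, name) VALUES (1, 'Example Org')")
conn.execute("INSERT INTO contact (id, organization_id, name) VALUES (1, 1, 'Pat')")

# Any number of phones per contact: each one is just another row.
conn.executemany(
    "INSERT INTO phone (contact_id, number, phone_type) VALUES (?, ?, ?)",
    [(1, "555-0100", "office"), (1, "555-0101", "cell"), (1, "555-0102", "home")],
)

rows = conn.execute("""
    SELECT o.name, c.name, p.number, p.phone_type
    FROM organization AS o
    JOIN contact AS c ON c.organization_id = o.id
    JOIN phone   AS p ON p.contact_id = c.id
    ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

Adding a fourth phone number for the same contact is one more row in `phone`, with no change to the schema.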
You should also make sure that none of the records need multiple entries for any of the other fields (MEETINGS, EMAIL, …). If that is a possibility, you again have to choose between adding more relational tables and adding enough repeated columns to the organization table to cover the maximum possible number of entries.