I have some data (text files) that is formatted in the most uneven manner one could think of. I am trying to minimize the amount of manual work on parsing this data.
Sample Data :
Name Degree CLASS CODE EDU Scores
--------------------------------------------------------------------------------------
John Marshall CSC 78659944 89989 BE 900
Think Code DB I10 MSC 87782 1231 MS 878
Mary 200 Jones CIVIL 98993483 32985 BE 898
John G. S Mech 7653 54 MS 65
Silent Ghost Python Ninja 788505 88448 MS Comp 887
Conditions :
- More than one space should be compressed to a delimiter (is a pipe better? The end goal is to store these files in the database).
- Except for the first column, the other columns won’t have any spaces in them, so all those spaces can be compressed to a pipe.
- Only the first column can have multiple words with spaces (Mary K Jones). The rest of the columns are mostly numbers and some letters.
- First and second columns are both strings. They almost always have more than one space between them, so that is how we can differentiate between the 2 columns. (If there is a single space, that is a risk I am willing to take given the horrible formatting!).
- The number of columns varies, so we don’t have to worry about column names. All we want is to extract each column’s data.
Hope I made sense! I have a feeling that this task can be done in a one-liner. I don’t want to loop, loop, loop 🙁
Muchas gracias “Pythonistas” for reading all the way and not quitting before this sentence!
It still seems to me that there’s some format in your files:
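A minimal sketch of a regex-based parse along these lines. It assumes the first column ends in a letter or digit and is followed by at least two whitespace characters; ROW_RE and parse_line are hypothetical names, and the spacing in the example line is assumed, since only the collapsed sample is shown above:

```python
import re

# Hypothetical pattern: first column (may contain single spaces), a word
# break, a run of 2+ whitespace characters, then the rest of the line.
ROW_RE = re.compile(r"^\s*(.+?)\b\s{2,}(\S.*?)\s*$")

def parse_line(line):
    """Return the columns of one line, or [] if the line does not match."""
    lst = ROW_RE.findall(line)
    if not lst:
        return []  # empty list: a questionable line
    first, rest = lst[0]
    # Columns after the first contain no spaces, so a plain split is enough.
    return [first] + rest.split()

line = "John Marshall      CSC   78659944  89989  BE   900"
print("|".join(parse_line(line)))  # John Marshall|CSC|78659944|89989|BE|900
```

Joining the result with "|" gives the pipe-delimited row asked for in the question. A first column that ends in punctuation (for example "John G.") would not satisfy the word break, so it would need a looser pattern.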
The regex is quite straightforward; the only things you need to pay attention to are the delimiters (\s) and the word breaks (\b) in the case of the first delimiter. Note that when a line doesn’t match, you get an empty list as lst. That would be a red flag to bring up the user interaction described below. Also, you could skip the header lines by doing:

Previous variants:
Another thing to do is simply to allow the user to decide what to do with questionable entries:
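One way this could look, as a rough sketch: a line counts as questionable when it has no run of two or more spaces, and the user is then asked how many words belong to the first column. The function name split_or_ask and the ask parameter are hypothetical; ask defaults to the built-in input and can be swapped out for testing.

```python
import re

def split_or_ask(line, ask=input):
    """Split a line into columns, asking the user when the split is doubtful."""
    words = line.split()
    if not words:
        return []
    parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
    if len(parts) == 2:
        # Clear case: 2+ spaces mark the end of the first column.
        return [parts[0]] + parts[1].split()
    # Questionable case: only single spaces, so let the user decide.
    answer = ask(f"How many words belong to the first column in {line!r}? ")
    n = int(answer)
    return [" ".join(words[:n])] + words[n:]
```

For "John G. S Mech 7653 54 MS 65" with only single spaces, answering 3 would yield "John G. S" as the first column and the remaining words as separate columns.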
Hmm, I was under the impression the whole time that the second element might contain spaces. Since that’s not the case, you could just do:
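A sketch of that simpler approach, assuming the first run of two or more spaces ends the first column and that the file starts with two header lines (the title row and the dashed rule). The names to_pipe_delimited and convert, and the header_lines parameter, are made up for illustration:

```python
import re

def to_pipe_delimited(line):
    """Turn one raw line into a pipe-delimited row."""
    # Split once at the first run of 2+ spaces: first column vs. the rest.
    head, *tail = re.split(r"\s{2,}", line.strip(), maxsplit=1)
    # The remaining columns contain no spaces, so split them on whitespace.
    fields = [head] + (tail[0].split() if tail else [])
    return "|".join(fields)

def convert(text, header_lines=2):
    """Convert a whole file's text, skipping header and blank lines."""
    rows = text.splitlines()[header_lines:]
    return [to_pipe_delimited(row) for row in rows if row.strip()]
```

The only loop left is the comprehension over lines; the per-line work is a single split plus a join.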