I am a nurse and I know Python, but I am not an expert; I have only used it to process DNA sequences.
We have hospital records written in natural language, and I am supposed to insert this data into a database or CSV file. There are more than 5000 lines, so doing it by hand would be very hard. All the data is written in a consistent format. Here is an example:
11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later
I should get the following data
Sex: Male
Symptoms: Nausea
Vomiting
Death: True
Death Time: 11/11/2010 - 01:00pm
Another example
11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room
And I get
Sex: Female
Symptoms: Heart burn
Vomiting of blood
Death: True
Death Time: 11/11/2010 - 10:00am
The order is not consistent, but when I say "in ...", the word "in" is a keyword and all the text after it is a place, until I find another keyword.
At the beginning, He or She determines the sex. After "got", whatever follows is a group of symptoms that I should split on the separator, which can be a comma, a hyphen or something else, but is consistent within the same line.
For "died ... hours later" I should also extract how many hours. Sometimes the patient is still alive and discharged, etc.
That is to say, we have a lot of conventions, and I think that if I can tokenize the text with keywords and patterns I can get the job done. So please tell me if you know of a useful function, module, tutorial or tool for doing that, preferably in Python (if not Python, a GUI tool would be nice).
Some more information:
There are a lot of rules to express various medical data, but here are a few examples:
- Start with the same date/time format, followed by a space, a colon, a space, He/She, a space, and then rules separated by "and"
- Rules:
* got <symptoms>,<symptoms>,....
* investigations were done <investigation>,<investigation>,<investigation>,......
* received <drug or procedure>,<drug or procedure>,.....
* discharged <digit> (hour|hours) later
* kept under observation
* died <digit> (hour|hours) later
* died <digit> (hour|hours) later in <place>
Other rules do exist, but they follow the same idea.
This uses dateutil to parse the date (e.g. '11/11/2010 - 09:00am'), and parsedatetime to parse the relative time (e.g. '4 hours later'):
yields:
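For comparison, here is a minimal sketch of the keyword-and-pattern idea using only the standard library (re and datetime) instead of dateutil and parsedatetime. It covers just the rules listed in the question. The function name parse_record, the dictionary keys, and the day-first reading of the date are all assumptions made for illustration, and it assumes that " and " never appears inside a symptom or drug name.

```python
import re
from datetime import datetime, timedelta

# Assumed record layout: "<date> - <time> : <He|She> <clauses joined by ' and '>"
RECORD_RE = re.compile(
    r"^(?P<stamp>\d{1,2}/\d{1,2}/\d{4} - \d{1,2}:\d{2}[ap]m) : "
    r"(?P<who>He|She) (?P<body>.+)$"
)
# Assumption: the day comes before the month. Check this against real records.
STAMP_FORMAT = "%d/%m/%Y - %I:%M%p"

# "died 4 hours later", "discharged 1 hour later", optionally "... in <place>"
OUTCOME_RE = re.compile(
    r"^(?P<what>died|discharged) (?P<hours>\d+) hours? later(?: in (?P<place>.+))?$"
)
# Leading keyword of a list-style rule -> hypothetical output field name
LIST_RULES = {
    "got": "symptoms",
    "investigations were done": "investigations",
    "received": "treatments",
}
# Items are split on commas or on hyphens surrounded by spaces
ITEM_SEP = re.compile(r"\s*,\s*|\s+-\s+")


def parse_record(line):
    """Parse one record line into a dictionary of extracted fields."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised record: {line!r}")
    when = datetime.strptime(match["stamp"], STAMP_FORMAT)
    record = {
        "time": when,
        "sex": "Male" if match["who"] == "He" else "Female",
        "symptoms": [], "investigations": [], "treatments": [],
        "outcome": None, "outcome_time": None, "place": None,
        "unparsed": [],  # clauses that matched no rule, kept for manual review
    }
    for clause in match["body"].split(" and "):
        clause = clause.strip()
        outcome = OUTCOME_RE.match(clause)
        if outcome:
            record["outcome"] = outcome["what"]
            record["outcome_time"] = when + timedelta(hours=int(outcome["hours"]))
            record["place"] = outcome["place"]
            continue
        if clause == "kept under observation":
            record["outcome"] = clause
            continue
        for keyword, field in LIST_RULES.items():
            if clause.startswith(keyword + " "):
                items = ITEM_SEP.split(clause[len(keyword) + 1:])
                record[field] = [i.strip().capitalize() for i in items if i.strip()]
                break
        else:
            record["unparsed"].append(clause)
    return record


example = "11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later"
print(parse_record(example))
```

Anything that ends up in the "unparsed" list is a clause no rule recognised, which is a convenient way to find conventions that still need a pattern.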
Note: Be careful parsing dates. Does ‘8/9/2010’ mean August 9th, or September 8th? Do all the record keepers use the same convention? If you choose to use dateutil (and I really think that’s the best option if the date string is not rigidly structured) be sure to read the section on “Format precedence” in the dateutil documentation so you can (hopefully) resolve ‘8/9/2010’ properly.
If you can’t guarantee that all the record keepers use the same convention for specifying dates, then the results of this script would have to be checked manually. That might be wise in any case.
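One way to narrow down the manual check is to flag only the dates that really are ambiguous. The sketch below uses plain datetime.strptime rather than dateutil, tries both the day-first and the month-first reading, and reports a date as ambiguous when both readings are valid but disagree. The helper names interpret_date and is_ambiguous are hypothetical.

```python
from datetime import datetime


def interpret_date(text):
    """Return every valid reading of a slash-separated date, keyed by convention."""
    readings = {}
    for name, fmt in (("day-first", "%d/%m/%Y"), ("month-first", "%m/%d/%Y")):
        try:
            readings[name] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # this convention gives an impossible date, e.g. month 25
    return readings


def is_ambiguous(text):
    """True when the two conventions produce different valid dates."""
    return len(set(interpret_date(text).values())) > 1


for sample in ("8/9/2010", "11/11/2010", "25/12/2010"):
    print(sample, "ambiguous" if is_ambiguous(sample) else "unambiguous")
```

Dates such as 25/12/2010 can only be read one way, so a batch of records containing them also hints at which convention a given record keeper used.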