Background: I have a fixed-width flat file with about 94 million rows of data. The file is from the HCUP Nationwide Inpatient Sample (NIS, http://www.hcup-us.ahrq.gov/nisoverview.jsp), which provides information about hospitalizations over the past 12 years, with each row representing a separate hospitalization. For my analyses, I will be querying diagnostic codes (ICD-9-CM) to identify patients with various diagnoses.
The fixed-width file contains information on up to 15 diagnostic codes, which are provided as columns dx1 through dx15.
create table `core` (`key` char (14),
`dx1` char (5),
`dx10` char (5),
`dx11` char (5),
`dx12` char (5),
`dx13` char (5),
`dx14` char (5),
`dx15` char (5),
`dx2` char (5),
`dx3` char (5),
`dx4` char (5),
`dx5` char (5),
`dx6` char (5),
`dx7` char (5),
`dx8` char (5),
`dx9` char (5),
plus various other columns of patient demographics...);
I loaded all of the data into a MySQL table named core and can index the 15 columns. However, it seems advantageous to normalize the dx* columns into a separate dx table, such as:
create table `dx` (
`key` char (14),
`icd9` char (5)
);
where key is a foreign key to the main core table. To load the data quickly into dx, I use:
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 74, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 79, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 84, 5);
etc for all 15 columns...
The catch is that each row in the fixed-width file has a median of only 3 diagnosis codes, so most of the dx* columns are just blank ('     ' [five blank characters]). So, while the dx table has 1.41 billion (94 million * 15) rows after loading the data, about 1.13 billion (94 million * 12) of them are blank diagnostic codes.
I’ve been simply removing them afterwards and optimizing, prior to indexing:
DELETE FROM `dx` WHERE `icd9` = " ";
OPTIMIZE TABLE `dx`;
CREATE INDEX `icd9` ON `dx` (`icd9`);
However, this takes a lot of time.
Question: Is it possible to modify the LOAD DATA INFILE statement to skip a row if icd9 = '     ' [five blank characters], and would this be significantly faster than my current DELETE and OPTIMIZE method? If so, I would like to pass this information on to future researchers working with these data.
No. There is an IGNORE option. However, it uses line numbers (IGNORE n LINES skips a fixed number of lines at the start of the file), not inline logical comparisons.
Likely. But, as it's not an option, it doesn't matter.
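Since LOAD DATA cannot skip rows conditionally, one possible workaround is to drop the blank codes before MySQL ever sees them. The Python below is only a minimal sketch of that idea, not something taken from the question or from MySQL itself. The names nonblank_dx_rows, filter_file and dx.tsv are hypothetical, and the column offsets assume that the 15 dx fields are contiguous 5-character fields continuing the 74, 79, 84 pattern shown in the question; check them against the NIS file layout before relying on them.

```python
KEY_START, KEY_LEN = 1, 14   # 1-based, mirroring substr(@var1, 1, 14)
DX_LEN = 5

# ASSUMPTION: dx1..dx15 are contiguous and continue the 74, 79, 84 pattern.
DX_STARTS = [74 + DX_LEN * i for i in range(15)]


def nonblank_dx_rows(lines, dx_starts=DX_STARTS):
    """Yield (key, icd9) for every diagnosis field that is not all blanks."""
    for line in lines:
        line = line.rstrip("\r\n")
        key = line[KEY_START - 1:KEY_START - 1 + KEY_LEN]
        for start in dx_starts:
            # Convert the 1-based substr() offset to a 0-based slice.
            code = line[start - 1:start - 1 + DX_LEN]
            if code.strip():          # skip blank (or missing) fields
                yield key, code


def filter_file(src_path, dst_path, dx_starts=DX_STARTS):
    """Write key<TAB>icd9 lines for non-blank codes; return the row count."""
    count = 0
    with open(src_path, encoding="ascii", errors="replace") as src, \
         open(dst_path, "w", encoding="ascii", errors="replace") as dst:
        for key, code in nonblank_dx_rows(src, dx_starts):
            dst.write(f"{key}\t{code}\n")
            count += 1
    return count
```

Calling filter_file('data.ASC', 'dx.tsv') would produce a tab-delimited file that a single statement such as LOAD DATA LOCAL INFILE 'dx.tsv' INTO TABLE `dx` (`key`, `icd9`); should be able to load using the default tab and newline terminators, so the DELETE and OPTIMIZE steps are no longer needed. Whether this ends up faster overall depends on how long the extra pass over the 94 million rows takes, which I have not measured.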