Background: I have a fixed-width flat file with about 94 million rows of data.

Asked: May 26, 2026 (2026-05-26T10:35:33+00:00)

Background: I have a fixed-width flat file with about 94 million rows of data. The file is from the HCUP Nationwide Inpatient Sample (NIS http://www.hcup-us.ahrq.gov/nisoverview.jsp), which provides information about hospitalizations over the past 12 years, each row a separate hospitalization. For my analyses, I will be querying diagnostic codes (ICD9-CM) to identify patients with various diagnoses.

The fixed-width file contains information on up to 15 diagnostic codes, which are provided as columns dx1 through dx15.

create table `core` (`key` char (14),
`dx1` char (5),
`dx10` char (5),
`dx11` char (5),
`dx12` char (5),
`dx13` char (5),
`dx14` char (5),
`dx15` char (5),
`dx2` char (5),
`dx3` char (5),
`dx4` char (5),
`dx5` char (5),
`dx6` char (5),
`dx7` char (5),
`dx8` char (5),
`dx9` char (5),
plus various other columns of patient demographics...);

I loaded all of the data into a MySQL table named core, and I can index the 15 dx columns. However, it seems advantageous to normalize the dx* columns into a separate dx table, such as:

create table `dx` (
`key` char (14),
`icd9` char (5)
);

where key is a foreign key to the main core table. To load the data quickly into dx, I use:

LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 74, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 79, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 84, 5);
etc for all 15 columns...

The catch is that each row in the fixed-width file has a median of only 3 diagnosis codes, so most of the dx* columns are just blank ('     ' [five blank characters]). So, while the dx table has 1.41 billion (94 million * 15) rows after loading the data, about 1.13 billion (94 million * 12) of them are blank diagnostic codes.

I’ve been simply removing them afterwards and optimizing, prior to indexing:

DELETE FROM `dx` WHERE `icd9` = "     ";
OPTIMIZE TABLE `dx`;
CREATE INDEX `icd9` ON `dx` (`icd9`);

However, this takes a lot of time.

Question: Is it possible to modify the LOAD DATA INFILE statement to skip a row if ICD9 = '     ' [five blank characters], and would this be significantly faster than my current DELETE and OPTIMIZE method? If so, I would like to pass this information on to future researchers working with these data.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T10:35:33+00:00

"Is it possible to modify the LOAD DATA INFILE statement to skip the row if ICD9 is blank?"

No. LOAD DATA does have an IGNORE n LINES option, but it skips a fixed number of lines at the start of the file; it does not evaluate a logical comparison for each row.
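Since LOAD DATA cannot filter rows itself, one possible workaround is to drop the blank codes before the data reaches MySQL. The following is a minimal Python sketch of that idea, not a benchmarked solution. It assumes the key occupies characters 1 to 14 and that the diagnosis fields start at the offsets used in the question (74, 79, 84); the remaining offsets are not given in the question and would have to be added from the file layout. The function names and the output file name dx.tsv are hypothetical.

```python
# Assumption: 0-based slice positions derived from the question's
# substr(@var1, 1, 14) and substr(@var1, 74 / 79 / 84, 5) calls.
KEY_SLICE = slice(0, 14)
DX_STARTS = [73, 78, 83]  # extend with the remaining dx offsets from the file layout
DX_WIDTH = 5


def extract_dx_rows(lines, dx_starts=DX_STARTS, width=DX_WIDTH):
    """Yield (key, icd9) pairs, skipping diagnosis fields that are blank."""
    for line in lines:
        line = line.rstrip("\r\n")
        key = line[KEY_SLICE]
        for start in dx_starts:
            code = line[start:start + width]
            # A field made only of spaces (or missing on a short line) is dropped.
            if code.strip():
                yield key, code


def write_dx_file(src_path, dst_path):
    """Stream the fixed-width file and write one tab-separated row per code."""
    # latin-1 maps every byte to a character, so unexpected bytes do not raise.
    with open(src_path, encoding="latin-1") as src, \
            open(dst_path, "w", encoding="latin-1", newline="\n") as dst:
        for key, code in extract_dx_rows(src):
            dst.write(f"{key}\t{code}\n")


if __name__ == "__main__":
    # Hypothetical file names; data.ASC is the input named in the question.
    write_dx_file("data.ASC", "dx.tsv")
```

The resulting file should then load with a single statement such as LOAD DATA LOCAL INFILE 'dx.tsv' INTO TABLE `dx` (`key`, `icd9`); because tab-separated fields and newline-terminated lines are the LOAD DATA defaults. This replaces the 15 separate loads and the DELETE and OPTIMIZE pass with one scan of the flat file, though whether it is faster overall would need to be measured on the real data.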
