I’ve got a series of items in a MySQL database.
Each item has four characteristics associated with it:
abs_left (the item’s leftmost position)
abs_center (the item’s center position)
abs_right (the item’s rightmost position)
row (the item’s vertical position)
Within a chunk of data I know that the items are aligned in columns, but I do not know how many columns there are. The values of abs_left, abs_center, and abs_right are not precise and vary by a significant amount (e.g. the abs_right of one item might slightly overlap the abs_left of another item, even though they are in different columns). The item’s row does not vary and should be correct. However, not every row within the chunk of data has an element in every column. As a result, given any single row of data within the chunk, I cannot tell how many columns there are.
I would like to determine two things:
1) The number of columns within the chunk of data in question.
2) The approximate bounds of each one of these columns.
I’m pretty sure math can be applied to help me do this, but I’m not really sure how to go about it conceptually. I’m thinking standard deviation might be useful, but I’m not sure how to apply it when the number of columns is unknown.
Any help you guys can provide, or ideas on how to attack it, would be greatly appreciated!
[EDITED To Add Sample Data]
Below is summary data from queries that have already been used in an attempt to round the values. As summary data, it’s less precise, but it will probably give an idea of what I’m running into. The “row” column is left out of the summary data because rows were combined, but I do have a concept of row within the full dataset.
“section_id” “abs_left” “abs_right” “count”
“1” “0” “4” “144”
“1” “1” “4” “4”
“1” “8” “12” “152”
“1” “40” “59” “4”
“1” “41” “57” “2”
“1” “41” “60” “45”
“1” “43” “44” “2”
“1” “48” “63” “88”
“1” “50” “65” “1”
“1” “54” “64” “11”
“3” “0” “15” “2”
“3” “1” “10” “4”
“3” “58” “60” “1”
“3” “58” “69” “3”
“3” “63” “70” “5”
“3” “66” “72” “10”
“3” “67” “73” “5”
“3” “82” “87” “3”
“3” “96” “104” “6”
“3” “100” “104” “2”
“3” “114” “122” “25”
“3” “129” “137” “15”
“3” “130” “137” “20”
“3” “133” “137” “1”
“3” “143” “151” “38”
“3” “146” “151” “1”
“3” “165” “172” “3”
“3” “168” “175” “36”
“4” “4” “10” “6”
“4” “4” “21” “18”
“4” “5” “25” “9”
“4” “5” “30” “10”
“4” “5” “34” “21”
“4” “6” “41” “7”
“4” “6” “43” “1”
“4” “55” “64” “3”
“4” “70” “76” “3”
“4” “75” “83” “42”
“4” “76” “84” “4”
“4” “77” “82” “11”
“4” “93” “100” “16”
“4” “95” “101” “13”
“4” “95” “101” “7”
“4” “104” “110” “2”
“4” “108” “116” “27”
“4” “123” “130” “37”
“4” “139” “143” “1”
“4” “139” “146” “75”
“4” “143” “147” “2”
Section 1 has 3 columns.
Section 3 has 7 columns.
Section 4 has 7 columns.
Plan of attack
First, create a ‘number line’ of integers that spans the domain of the input data (0-255 in this example).
Then, for each section_id, project the range (abs_left, abs_right) of every item onto the number line and store the result in a temporary table.
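To make these first two steps concrete, here is a minimal Python sketch (not the SQL from the answer). It assumes the summary rows are available as tuples; the names `SECTION_1` and `project` are made up for illustration, and a dict of sets stands in for the temporary table.

```python
# Rows of (section_id, abs_left, abs_right, count), copied from section 1
# of the sample data in the question.
SECTION_1 = [
    (1, 0, 4, 144), (1, 1, 4, 4), (1, 8, 12, 152),
    (1, 40, 59, 4), (1, 41, 57, 2), (1, 41, 60, 45), (1, 43, 44, 2),
    (1, 48, 63, 88), (1, 50, 65, 1), (1, 54, 64, 11),
]


def project(rows, lo=0, hi=255):
    """Return {section_id: set of integer positions covered by any item}."""
    number_line = range(lo, hi + 1)  # the 'number line' of integers
    covered = {}
    for section_id, abs_left, abs_right, _count in rows:
        positions = covered.setdefault(section_id, set())
        # Project the inclusive range [abs_left, abs_right] onto the line.
        positions.update(x for x in number_line if abs_left <= x <= abs_right)
    return covered


covered = project(SECTION_1)
print(sorted(covered[1]))
```

For section 1 the covered positions fall into three separate stretches, which is what the later steps turn into columns.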
Next, find and label the leftmost edge of each column: a covered position whose left-hand neighbour on the number line is not covered.
Using these labels, we can number and label the columns themselves.
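A rough Python sketch of the edge-finding and numbering, under the assumption that a column is simply a maximal run of consecutive covered integers. `label_columns` is a hypothetical helper name, and the `covered` set below is the section 1 projection written out by hand.

```python
def label_columns(positions):
    """Group covered integer positions into contiguous runs, one per column."""
    columns = []
    for x in sorted(positions):
        if columns and x == columns[-1][1] + 1:
            columns[-1][1] = x       # x extends the current run to the right
        else:
            columns.append([x, x])   # x - 1 is not covered, so x is a left edge
    return [tuple(c) for c in columns]


# Positions covered by the section 1 items from the question.
covered = set(range(0, 5)) | set(range(8, 13)) | set(range(40, 66))
print(label_columns(covered))  # [(0, 4), (8, 12), (40, 65)]
```

The position of each tuple in the returned list (starting from 1) serves as the column number.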
Now we can map the items back onto the columns.
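The mapping could look something like this in Python. It assumes each column is given as an inclusive (start, end) run on the number line and that an item belongs to the run that fully contains it; `assign_column` is an illustrative name, not something from the original answer.

```python
def assign_column(abs_left, abs_right, columns):
    """Return the 1-based number of the column containing the item, or None."""
    for number, (start, end) in enumerate(columns, start=1):
        if start <= abs_left and abs_right <= end:
            return number
    return None


# Column runs for section 1, and a few items from the sample data.
columns = [(0, 4), (8, 12), (40, 65)]
items = [(0, 4), (8, 12), (41, 60), (54, 64)]
print([assign_column(left, right, columns) for left, right in items])  # [1, 2, 3, 3]
```

Because the runs are built from the items themselves, every item should land in exactly one run; `None` only shows up for items that were not part of the projection.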
Now we can actually answer the question (!): the number of columns in each section.
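As an end-to-end check, here is a compact Python sketch that counts columns per section straight from the summary rows. Instead of materialising the number line it merges ranges that overlap or touch on the integer line, which should give the same runs as the projection; `ROWS` and `count_columns` are names invented for this example.

```python
from collections import defaultdict

# (section_id, abs_left, abs_right) from the summary data in the question.
ROWS = [
    (1, 0, 4), (1, 1, 4), (1, 8, 12), (1, 40, 59), (1, 41, 57), (1, 41, 60),
    (1, 43, 44), (1, 48, 63), (1, 50, 65), (1, 54, 64),
    (3, 0, 15), (3, 1, 10), (3, 58, 60), (3, 58, 69), (3, 63, 70), (3, 66, 72),
    (3, 67, 73), (3, 82, 87), (3, 96, 104), (3, 100, 104), (3, 114, 122),
    (3, 129, 137), (3, 130, 137), (3, 133, 137), (3, 143, 151), (3, 146, 151),
    (3, 165, 172), (3, 168, 175),
    (4, 4, 10), (4, 4, 21), (4, 5, 25), (4, 5, 30), (4, 5, 34), (4, 6, 41),
    (4, 6, 43), (4, 55, 64), (4, 70, 76), (4, 75, 83), (4, 76, 84), (4, 77, 82),
    (4, 93, 100), (4, 95, 101), (4, 95, 101), (4, 104, 110), (4, 108, 116),
    (4, 123, 130), (4, 139, 143), (4, 139, 146), (4, 143, 147),
]


def count_columns(rows):
    """Return {section_id: number of contiguous runs of covered integers}."""
    by_section = defaultdict(list)
    for section_id, abs_left, abs_right in rows:
        by_section[section_id].append((abs_left, abs_right))
    counts = {}
    for section_id, ranges in by_section.items():
        runs = 0
        run_end = None
        for abs_left, abs_right in sorted(ranges):
            if run_end is None or abs_left > run_end + 1:
                runs += 1                # gap on the number line: new column
                run_end = abs_right
            else:
                run_end = max(run_end, abs_right)
        counts[section_id] = runs
    return counts


print(count_columns(ROWS))  # {1: 3, 3: 8, 4: 7}
```

On the summary data this gives 3 and 7 for sections 1 and 4, matching the question, but 8 for section 3 where the question says 7. So with data this noisy, a plain merge can over-count; ignoring runs backed by very few items might help, though that is only a guess.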
And the edges of each column:
(Using a 1-standard-deviation interval, which covers about 68% of the values for a normal distribution. See: http://en.wikipedia.org/wiki/Standard_deviation)
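One possible reading of the 1-standard-deviation bounds, sketched in Python. The exact formula here is an assumption, not something stated above: the left bound is taken as mean(abs_left) minus one standard deviation, the right bound as mean(abs_right) plus one standard deviation, both weighted by the summary `count`. It uses the population standard deviation, which is what MySQL's STDDEV() computes; `weighted_mean_std` and `column_bounds` are hypothetical helpers.

```python
import math


def weighted_mean_std(values, weights):
    """Weighted mean and population standard deviation of values."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


def column_bounds(items):
    """items: (abs_left, abs_right, count) rows already assigned to one column.

    Assumed rule: widen the mean edges outward by one standard deviation.
    """
    lefts = [item[0] for item in items]
    rights = [item[1] for item in items]
    counts = [item[2] for item in items]
    left_mean, left_std = weighted_mean_std(lefts, counts)
    right_mean, right_std = weighted_mean_std(rights, counts)
    return left_mean - left_std, right_mean + right_std


# The two summary rows that make up the first column of section 1.
first_column = [(0, 4, 144), (1, 4, 4)]
low, high = column_bounds(first_column)
print(low, high)
```

Widening outward errs on the side of capturing stray items. If the columns sit close together it may be safer to narrow the bounds inward instead, so neighbouring intervals do not overlap.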