Intro
I’ve been given a messy Excel dump loaded straight into a table. Now I need to turn that mess into something useful.
The dump has duplicates and inconsistencies… good times!
I’ve been striking out on every approach so far 🙁 – Hope you can help me out.
Given this example data set:
ExcelDump
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 |      |      | C    |
|  1 |      | B    | C    |
|  1 | A    | B    | D    |
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | A    | B    | C    |
|  4 | G    | B    | C    |
+----+------+------+------+
One possible result could be:
OutputTable
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+
Nice and neat.
Unique ID key and data merged together in a way that makes sense.
How to choose which data is correct?
You’ve probably noticed that another possible result could be:
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | G    | B    | C    |
+----+------+------+------+
This is where it gets complicated. I want to be able to choose the set that makes the most sense based on some conditions I can manipulate.
For instance, I want to set up a condition that says: “Choose the most common non-null value; if there is no single most common value, take the first non-null value found.”
This condition should be applied column by column to each group of rows that share an ID.
The result of that condition would be:
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+
If I later find out that this assumption was wrong and the rule should instead be “Choose the most common non-null value; if there is no single most common value, take the last non-null value found”, the result would be:
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | G    | B    | C    |
+----+------+------+------+
So basically I want to select values based on a set of conditions applied to each group of rows with the same ID.
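To pin down what the rule means before writing any SQL, here is a minimal Python sketch of it. It is only an illustration of the selection logic, not a database solution. The names `EXCEL_DUMP`, `pick_value` and `merge_rows` are made up for this example, `None` stands for an empty cell, and list order stands in for "first/last found" (an assumption: a real table would need an explicit ordering column for that to be well defined).

```python
from collections import Counter

# Rows from the ExcelDump example; None stands for an empty cell.
# List order is assumed to be the order in which rows were "found".
EXCEL_DUMP = [
    (1, None, None, "C"),
    (1, None, "B", "C"),
    (1, "A", "B", "D"),
    (1, "E", "B", "C"),
    (2, "A", "B", "C"),
    (2, "A", "B", "C"),
    (3, "A", "B", "C"),
    (3, "A", "B", "F"),
    (4, "A", "B", "C"),
    (4, "G", "B", "C"),
]


def pick_value(values, fallback="first"):
    """Return the single most common non-null value, else the first/last non-null one."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    ranked = Counter(present).most_common()
    # A unique winner exists if there is only one distinct value,
    # or if the top count strictly beats the runner-up.
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return present[0] if fallback == "first" else present[-1]


def merge_rows(rows, fallback="first"):
    """Collapse rows sharing an ID into one row, applying pick_value per column."""
    groups = {}
    for row_id, *cols in rows:
        groups.setdefault(row_id, []).append(cols)
    return [
        (row_id, *(pick_value(column, fallback) for column in zip(*group)))
        for row_id, group in groups.items()
    ]


if __name__ == "__main__":
    for row in merge_rows(EXCEL_DUMP, fallback="first"):
        print(row)
```

Switching `fallback` from `"first"` to `"last"` is the only change needed to move between the two result tables shown earlier, which is the kind of adjustable condition I am after.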
I’ve modified my solution to take into account the extra information added to the question. The query below gives you the second sort priority you specified. To get the first one, change the “max” in the outer apply to “min” and change “sortOrder desc” to “sortOrder asc”. Keep in mind that if several values tie for most frequent, say A, A, B, B, C with A appearing first, the code below would go with B, because B ties for the highest count and comes after the two A’s.
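The tie behaviour described above can be illustrated with a small Python sketch. This is not the query itself, only a model of the tie-breaking it describes: the choice is made among the values tied for the highest count, by position. The function name `pick_among_most_frequent` and the `prefer` parameter are hypothetical, and list order plays the role of sortOrder.

```python
from collections import Counter


def pick_among_most_frequent(values, prefer="last"):
    """Pick from the values tied for the highest count, by order of occurrence."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    counts = Counter(present)
    top = max(counts.values())
    # Keep only occurrences of values that reach the highest count,
    # preserving their original order.
    tied = [v for v in present if counts[v] == top]
    return tied[-1] if prefer == "last" else tied[0]


if __name__ == "__main__":
    sample = ["A", "A", "B", "B", "C"]
    print(pick_among_most_frequent(sample, prefer="last"))
    print(pick_among_most_frequent(sample, prefer="first"))
```

For A, A, B, B, C this picks B when preferring the last occurrence and A when preferring the first. Note that C is never chosen, even though it is the last non-null value overall, because it does not tie for the highest count. On the example data in the question both readings of the rule give the same result.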