I have a few ideas for dealing with this, but I expect the guRus can come up with something better yet. I submitted a bunch of row typing-in to Mechanical Turk. I needed a single row from a table, and I had a field into which I asked them to type the row’s values separated by commas. In R I have then strsplit this, and I am now comparing the results of multiple Turkers’ entries.
A common pattern is that one Turker will have missed a single entry, throwing the rest of the entries off by one. So the challenge is to know where to put the missing value. Assume that they only ever miss entering a single entry (I have error-checking code to confirm this), and that I may have obtained up to 3 replicates from each table row (so there could be 1-2 of the proper length, and 1-2 that are too short. Entries are approximately of the size below, and I only have about 50, so computational efficiency is not paramount. Assume that the longest entry is the proper length.
Here is an example of one such row (stored as a list, with each element being a replication by a different Turker):
tt <- list(structure(c(4, 4, 5, 7, 9, 13, 15, 18, 20, 22, 24,
27, 30, 32, 35, 37, 41, 43, 46, 48, 51, 54, 57, 60, 63), .Dim = c(25L,
1L)), structure(c(4, 4, 5, 7, 9, 11, 13, 15, 18, 20, 22, 25,
27, 30, 32, 35, 37, 40, 43, 46, 48, 51, 54, 57, 60, 63), .Dim = c(26L,
1L)), structure(c(4, 4, 5, 7, 9, 11, 13, 15, 19, 20, 22, 25,
27, 30, 32, 35, 37, 42, 43, 46, 48, 51, 54, 57, 61, 63), .Dim = c(26L,
1L)))
lengths <- sapply(tt,length)
longs <- simplify2array(tt[lengths==max(lengths)],FALSE)
shorts <- simplify2array(tt[lengths==max(lengths)-1],FALSE)
Algorithms I have considered are:
- Creating
max(lengths)permutations with the NA in every single possible place, and comparing them simultaneously to the 1-2 ones of appropriate length using some estimate of the total deviation. - Looping through each element and comparing to the 1-2 ones of appropriate length until I find a non-exact match. Then decide how big the difference is as compared to all the subsequent differences with the NA or not. E.g. if they match up to the 5th entry, but putting NA in the 5th entry still leaves the rest off by more than the difference in the 5th entry, keep moving down the vectors.
Curious how everyone would implement this. I’m having a hard time avoiding loops and writing this in an elegant way. Possibly something like filter might help.
Examples of problematic input and desired output
Problematic input (missing one value; no typos in other values)
> tt1 <- list(c(4, 4, 7, 9, 11), c(4, 4, 5, 7, 9, 11), c(4, 4, 5, 7, 9,
11))
> tt1
[[1]]
[1] 4 4 7 9 11
[[2]]
[1] 4 4 5 7 9 11
[[3]]
[1] 4 4 5 7 9 11
Desired output
> tt1
[,1] [,2] [,3]
1 4 4 4
2 4 4 4
3 NA 5 5
4 7 7 7
5 9 9 9
6 11 11 11
Problematic input (missing value + a typo in another value)
> tt2 <- list(c(4, 4, 7, 9, 11), c(4, 3, 5, 7, 9, 11), c(4, 4, 5, 7, 9,
11))
> tt2
[[1]]
[1] 4 4 7 9 11
[[2]]
[1] 4 3 5 7 9 11
[[3]]
[1] 4 4 5 7 9 11
Desired output
> tt2[[1]][4:6] <- tt2[[1]][3:5]
> tt2[[1]][3] <- NA
> simplify2array(tt2,FALSE)
[,1] [,2] [,3]
[1,] 4 4 4
[2,] 4 3 4
[3,] NA 5 5
[4,] 7 7 7
[5,] 9 9 9
[6,] 11 11 11
Other variations of typos should be tolerated gracefully. Note that the vectors are generally increasing (you could view them as monotonically increasing with noise). So if someone mistakes a 7 for a 4, that’s probably a typo. Note also that for most I only did 2 replicates, so there won’t be any way to give one non-missing value any more credence than any other non-missing value. Going to have to look at the whole pattern or at least take advantage of the fact that they’re generally increasing.
Full data frame
Each of the tt examples above is all of the TotalTime entries for a given feet-image level in the data.frame below. This is the whole dataset. Note that the total number of entries potentially changes between image groups. This value is known in advance, or you could just get it from the max of the entries.
dat <- structure(list(feet = c(1, 2, 3, 3, 1, 1, 7, 7, 8, 9, 9, 1, 1,
2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 6, 6, 7, 7, 8, 8, 9, 10, 10
), TotalTime = c("4,3,4,6,6,10,12,14,16,18,20,22,25,28,30,32,34,36,41,44,46,49,51,55,58",
"4,4,5,7,9,11,13,15,18,20,22,25,27,30,32,35,37,41,43,46,48,51,54,57,60,63",
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,38,39,41,44,47,49,52,55,58,61,64,67",
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,36,39,41,44,47,49,52,55,58,61,64,67",
"4,3,4,6,8,20,22,24,26,28,30,31,34,36,38,40,42,44,46,48,50,52,54,56,58,60",
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,38,41,44,46,49,51,55,58",
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84",
"4,4,4,7,10,15,18,21,24,29,32,35,38,43.47,52,56,60,63,67,72,76,82,84",
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91",
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,93,94",
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94",
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,36,41,44,46,49,51,55,58",
"4,3,4,6,8,10,12,14,16,18,20,22,25,28,30,32,34,36,38,41,44,46,49,51,55,58",
"4,4,5,7,9,11,13,15,18,20,22,25,27,31,32,35,37,41,43,46,48,51,54,57,60,63",
"4,4,5,7,9,11,13,15,18,20,22,25,27,30,32,35,37,41,43,46,48,51,54,57,60,63",
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,38,39,41,44,47,49,52,55,58,61,64,67",
"3,4,6,8,11,13,15,17,20,22,25,27,32,34,36,39,41,44,47,49,52,55,58,61,64,67",
"3,5,7,9,12,14,16,19,22,24,29,31,34,36,38,41,44,47,50,53,58,61,64,67,69,72",
"3,5,7,9,12,14,16,19,22,24,29,31,34,36,38,41,44,47,50,53,58,61,64,67,69,72",
"4,6,8,11,13,15,19,21,25,28,30,33,36,38,41,44,49,52,55,58,61,65,68,71,75,79",
"4,6,8,11,13,15,19,21,25,28,30,33,36,38,41,44,49,52,55,58,61,65,68,71,75,79",
"4,6,9,11,14,17,21,24,27,30,33,35,38,42,45,49,52,55,58,63,67,70,73,78,82,85",
"4,6,9,11,14,17,21,24,27,30,33,35,36,42,45,49,52,55,58,63,67,70,73,78,82,85",
"2,4,6,9,11,13,16,16,20,23,24,26,28,29,31,33,35,37,39,40,42,43,45,47,52",
"2,4,6,9,11,13,16,18,20,21,23,24,26,28,29,31,33,35,37,39,40,42,43,45,47,52",
"2,5,7,11,12,14,17,19,21,22,24,26,28,29,31,35,36,39,41,42,44,46,48,50,52,54",
"2,5,7,11,12,14,17,19,21,22,24,26,28,29,31,35,36,39,41,42,44,46,48,50,52,54",
"4,6,9,11,13,16,18,20,22,24,27,29,31,32,35,37,39,41,43,45,46,49,51,53,55,57",
"4,6,9,11,13,16,18,20,22,24,27,29,31,32,35,37,39,41,43,45,46,49,51,53,55,57",
"6,7,10,13,15,18,20,23,24,28,30,32,34,37,39,41,43,45,47,49,54,57,59,61,63",
"6,7,10,13,15,18,20,23,24,26,28,30,32,34,37,39,41,43,45,47,49,54,57,59,61,63",
"6,8,10,14,16,19,21,23,25,28,30,32,36,39,41,43,45,47,49,52,54,57,59,61,63,65",
"6,8,10,14,16,19,21,23,25,28,30,32,36,39,41,43,45,47,49,52,54,57,59,61,63,65",
"7,9,12,14,18,20,23,24,27,31,33,35,38,40,43,45,47,49,51,55,58,60,62,65,67,69",
"7,9,12,14,18,20,23,24,27,31,33,35,38,40,43,45,47,49,51,55,58,60,62,65,67,69",
"4,3,5,7,10,13,17,20,23,26,29,33,36,40,43,48,51,55,60,64,67,72,75,77",
"4,3,5,7,10,13,17,20,23,26,29,33,36,40,43,48,51,55,60,64,67,72,75,77",
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84",
"4,4,4,7,10,15,18,21,24,29,32,35,38,43,47,52,56,60,63,67,72,76,82,84",
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91",
"4,3,5,8,14,16,20,24,27,31,34,37,42,46,49,55,59,64,68,73,77,83,89,91",
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94",
"4,4,6,9,15,18,22,25,28,32,36,40,44,49,53,59,63,68,74,80,85,88,93,94",
"0,0,0,1,1,1,3,3,3,5,5,5,6,6,7,7,8,8,9,10,11,10,11,11", "0,0,0,1,1,1,3,3,3,5,5,6,6,7,7,8,8,9,10,11,10,11,11",
"6,4,7,10,13,16,20,22,25,27,30,32,35,38,43,45,48,52,54,57,60,62,64,67",
"6,4,7,10,13,16,20,22,25,27,30,32,35,38,43,45,48,52,54,57,60,62,64,67",
"6,4,7,10,14,19,21,23,26,28,33,36,39,42,45,47,50,53,56,60,62,65,69,70",
"6,4,7,10,14,19,21,23,26,28,33,36,39,42,45,47,50,53,56,60,62,65,69,70",
"2,5,9,12,14,20,21,24,29,32,34,37,41,44,46,50,53,59,62,65,68,72,75,76",
"2,5,9,12,14,20,21,24,29,32,34,37,41,44,46,50,53,59,62,65,68,72,75,76",
"2,5,9,13,17,20,24,27,30,33,37,42,45,48,52,55,58,62,65,67,72,75,78,80",
"3,6,10,15,18,23,25,26,28,32,36,40,43,47,50,53,58,61,65,67,70,75,78,83,86",
"3,6,10,15,18,23,25,28,32,36,40,43,47,50,53,58,61,65,67,70,75,78,83,86"
), image = c(1, 1, 1, 1, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4)), .Names = c("feet",
"TotalTime", "image"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 14L, 15L, 16L, 17L, 19L, 20L, 22L, 23L,
24L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,
38L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 49L, 50L, 51L, 53L,
54L, 55L, 56L, 57L, 58L, 59L, 61L, 62L, 63L), class = "data.frame")
Here is a solution that aims to be readable. It can be collapsed into a smaller number of of lines of code, no doubt:
The results: