In the sample data below (stored in a file), I need to find the distinct ‘ids’ in each ‘item’ category as quickly as possible. I can do this by going through each line, collecting the ids for each item, and then counting them, but I am looking for a faster method such as ‘Counter’ or ‘itemgetter’.
“infile.txt”
id item
444 Anemia
444 liver
444 Anemia
444 Anemia
222 liver
222 pancreas
222 liver
222 Anemia
444 pancreas
444 pancreas
444 Anemia
001 liver
001 pancreas
111 pancreas
111 liver
111 liver
111 pancreas
555 pancreas
555 liver
555 pancreas
555 liver
555 pancreas
555 liver
I need output something like the following:
item count ids
pancreas 5 001,111,222,444,555
liver 5 111,222,444,555,001
Anemia 2 222,444
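I’d use a defaultdict with a set: map each item to the set of ids seen for it, so repeated ids collapse automatically and the count is just the size of the set.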
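A minimal sketch of that idea, assuming whitespace-separated columns and a header on the first line. The function name `ids_by_item` and the shortened in-memory sample are only illustrative; with the real file you would pass the result of `open("infile.txt")` instead.

```python
from collections import defaultdict
import io


def ids_by_item(lines):
    """Map each item to the set of distinct ids seen with it."""
    groups = defaultdict(set)
    rows = iter(lines)
    next(rows, None)  # assumption: the first line is the "id item" header
    for line in rows:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        id_, item = parts
        # ids stay strings so that leading zeros ("001") are preserved
        groups[item].add(id_)
    return groups


# Shortened stand-in for infile.txt (illustrative data only).
sample = io.StringIO(
    "id item\n"
    "444 Anemia\n"
    "444 liver\n"
    "444 Anemia\n"
    "222 liver\n"
    "222 Anemia\n"
    "001 pancreas\n"
    "111 pancreas\n"
)

groups = ids_by_item(sample)

print("item count ids")
# Largest groups first; ties are broken by item name.
for item, ids in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(item, len(ids), ",".join(sorted(ids)))
```

This is still a single pass over the file; the set does the de-duplication, so there is no separate counting step. ‘Counter’ would count occurrences of each (id, item) pair rather than distinct ids, which is why a set fits better here.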