I’m running a Python script as a Hadoop streaming job, but this post is more related to some core Python concepts than knowledge about Hadoop.
Basically I have a set of lines where I want to find overlap
$ cat sample.txt
ID1 2143,2154,
ID2 2913,14545
ID3 2143,2390,3350,5239,6250
ID4 2143,2154,2163,3340
ID5 2143,2154,2156,2163,3340,3711
I want in the end to find overlapping pairs of records and count them, for example here something like:
2143,2154 3
2143,2163 2
2143,3340 2
2154,2163 2
2154,3340 2
2163,3340 2
The way I do this is by creating a Hadoop streaming job written in Python where the mapper will basically output all pair combinations on a given line which will be processed further by the reducer.
My question is actually quite simple: how can I generate efficiently in Python the combination of all pairs in a given line? Note that in my case a pair (x,y) is the same as a pair (y,x). For example for ID3 i’d like the following list generated in my mapper:
[(2143,2390), (2143,2390), (2143,3350), (2143,5239), (2143,6250), (2390,3350), (2390,5239), (2390,6250), (3350,5239), (3350,6250), (5239,6250)]
I can certainly do this with a bunch of for loops but it’s quite ugly. I’ve tried using itertools but couldn’t get something out of it properly. Any thoughts?
How about:
gives: