I’m running a Python script as a Hadoop streaming job, but this post is


Asked: June 3, 2026


I’m running a Python script as a Hadoop streaming job, but this post is more related to some core Python concepts than knowledge about Hadoop.

Basically, I have a set of lines in which I want to find overlaps:

$ cat sample.txt
ID1    2143,2154,
ID2    2913,14545
ID3    2143,2390,3350,5239,6250
ID4    2143,2154,2163,3340
ID5    2143,2154,2156,2163,3340,3711

In the end I want to find the pairs of values that appear together on more than one line and count them. For the sample above, that would be something like:

2143,2154    3
2143,2163    2
2143,3340    2
2154,2163    2
2154,3340    2
2163,3340    2

The way I do this is by creating a Hadoop streaming job written in Python where the mapper will basically output all pair combinations on a given line which will be processed further by the reducer.

My question is actually quite simple: how can I efficiently generate, in Python, all pair combinations for a given line? Note that in my case a pair (x,y) is the same as a pair (y,x). For example, for ID3 I’d like the following list generated in my mapper:

[(2143,2390), (2143,3350), (2143,5239), (2143,6250), (2390,3350), (2390,5239), (2390,6250), (3350,5239), (3350,6250), (5239,6250)]

I can certainly do this with a bunch of for loops but it’s quite ugly. I’ve tried using itertools but couldn’t get something out of it properly. Any thoughts?


1 Answer

Editorial Team · Answer 1 · 2026-06-03T20:30:09+00:00


How about:

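import itertools

x = [2143, 2390, 3350, 5239, 6250]
# combinations() returns a lazy iterator; wrap it in list() to see the pairs
list(itertools.combinations(x, 2))

gives:

[(2143, 2390), (2143, 3350), (2143, 5239), (2143, 6250), (2390, 3350), (2390, 5239), (2390, 6250), (3350, 5239), (3350, 6250), (5239, 6250)]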
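To tie this back to the sample data in the question, here is a minimal sketch of how the pair generation could be applied per line and checked locally. The helper name pairs_for_line and the in-memory Counter are assumptions for illustration only; they stand in for the mapper and reducer, and the parsing assumes each line is a record ID followed by whitespace and a comma-separated list of integers. Sorting the values first means (x, y) and (y, x) always come out as the same tuple.

```python
import itertools
from collections import Counter


def pairs_for_line(line):
    """Return the sorted, de-duplicated value pairs for one input line.

    Hypothetical helper: assumes "<ID><whitespace><v1,v2,...>" input.
    """
    parts = line.split(None, 1)  # record ID, then the comma-separated values
    if len(parts) < 2:
        return []
    # Skip empty fields (e.g. a trailing comma) and sort the values so that
    # every pair is emitted in one canonical order.
    values = sorted({int(v) for v in parts[1].split(",") if v.strip()})
    return list(itertools.combinations(values, 2))


sample = """\
ID1    2143,2154,
ID2    2913,14545
ID3    2143,2390,3350,5239,6250
ID4    2143,2154,2163,3340
ID5    2143,2154,2156,2163,3340,3711
"""

# Local stand-in for the reducer: count how often each pair was emitted.
counts = Counter()
for line in sample.splitlines():
    counts.update(pairs_for_line(line))

# Keep only the pairs that occur on more than one line.
overlaps = {pair: n for pair, n in counts.items() if n > 1}
for (a, b), n in sorted(overlaps.items()):
    print(f"{a},{b}\t{n}")
```

In an actual streaming job the mapper would presumably read lines from sys.stdin and print one pair per line instead of updating a Counter, leaving the counting to the reducer.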
