I am new to Perl. I need to parse a tab-separated text file. For example:
From name To name Timestamp Interaction
a b Dec 2 06:40:23 IST 2000 comment
c d Dec 1 10:40:23 IST 2001 like
e a Dec 1 16:03:01 IST 2000 follow
b c Dec 2 07:50:29 IST 2002 share
a c Dec 2 08:50:29 IST 2001 comment
c a Dec 11 12:40:23 IST 2008 like
e c Dec 2 07:50:29 IST 2000 like
c b Dec 11 12:40:23 IST 2008 follow
b a Dec 2 08:50:29 IST 2001 share
After parsing, I need to create groups based on the users' interactions. In this example:
a<->b
b<->a
c<->a
a<->c
b<->c
c<->b
For these we can create one group, and we need to display the list of groups.
I need some pointers on how to parse the file and form the groups.
Edit
Constraint: at least 3 users are required to create a group.
An interaction simply means that some communication took place between two users. The kind of communication does not matter.
My approach to solving this is:
- Remove repeated interactions between users. For example, if "a<>b like" is present and "a<>b follow" appears again, we remove that row.
- Create a 2-dimensional array which stores the interactions between two users, i.e.
From \ To   a    b    c    d
a           X    <>   <>   X
b           <>   X    <>   X
c           <>   <>   X    X
d           X    <>   X    X
X = represents no interaction
<> = represents interaction
In this approach we start from the first row, i.e. user "a", and check it against "b". If "a" interacted with "b", then we check the reverse, i.e. whether "b" interacted with "a". The same steps are performed for each column.
But this approach depends on the number of users. If 1000 users are present, then we have to create a 1000 x 1000 matrix. Is there any alternative way to solve this?
I have added sample input:
a c Dec 2 06:40:23 IST 2000 comment
f g Dec 2 06:40:23 IST 2009 like
c a Dec 2 06:40:23 IST 2009 like
g h Dec 2 06:40:23 IST 2008 like
a d Dec 2 06:40:23 IST 2008 like
r t Dec 2 06:40:23 IST 2007 share
d a Dec 2 06:40:23 IST 2007 share
t u Dec 2 06:40:23 IST 2006 follow
a e Dec 2 06:40:23 IST 2006 follow
k l Dec 2 06:40:23 IST 2009 like
e a Dec 2 06:40:23 IST 2009 like
j k Dec 2 06:40:23 IST 2003 like
c d Dec 2 06:40:23 IST 2003 like
l j Dec 2 06:40:23 IST 2002 like
d c Dec 2 06:40:23 IST 2002 like
m n Dec 2 06:40:23 IST 2005 like
c e Dec 2 06:40:23 IST 2005 like
m l Dec 2 06:40:23 IST 2011 like
e c Dec 2 06:40:23 IST 2011 like
h j Dec 2 06:40:23 IST 2010 like
d e Dec 2 06:40:23 IST 2010 like
o p Dec 2 06:40:23 IST 2009 like
e d Dec 2 06:40:23 IST 2009 like
p q Dec 2 06:40:23 IST 2000 comment
q p Dec 2 06:40:23 IST 2009 like
a p Dec 2 06:40:23 IST 2008 like
p a Dec 2 06:40:23 IST 2007 share
l p Dec 2 06:40:23 IST 2003 like
j l Dec 2 06:40:23 IST 2002 like
t r Dec 2 06:40:23 IST 2000 comment
r h Dec 2 06:40:23 IST 2009 like
j f Dec 2 06:40:23 IST 2008 like
g d Dec 2 06:40:23 IST 2007 share
w q Dec 2 06:40:23 IST 2003 like
o y Dec 2 06:40:23 IST 2002 like
x y Dec 2 06:40:23 IST 2000 comment
y x Dec 2 06:40:23 IST 2009 like
x z Dec 2 06:40:23 IST 2008 like
z x Dec 2 06:40:23 IST 2007 share
y z Dec 2 06:40:23 IST 2003 like
z y Dec 2 06:40:23 IST 2002 like
Output should be:
(a,c, d, e)
(x,y,z)
Parsing is easy. Just a split /\t/ might be enough. However, Text::xSV or Text::CSV might be better.
For the connections, you can use the Graph module. To be able to use that module effectively, you need to understand at least the basics of graph theory.
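As a minimal sketch of the split /\t/ approach, assuming each row has exactly four tab-separated fields: the helper name parse_line and the hash keys are illustrative and do not come from any module.

```perl
use strict;
use warnings;

# Parse one tab-separated record into a hash reference.
# The field names (from, to, timestamp, interaction) are illustrative.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    # Limit of 4 keeps any stray tab in the last field from creating extra fields.
    my ($from, $to, $timestamp, $interaction) = split /\t/, $line, 4;
    return {
        from        => $from,
        to          => $to,
        timestamp   => $timestamp,
        interaction => $interaction,
    };
}

# Sample rows built with join so that the tab separators are explicit.
my @lines = (
    join("\t", 'a', 'b', 'Dec 2 06:40:23 IST 2000', 'comment'),
    join("\t", 'c', 'd', 'Dec 1 10:40:23 IST 2001', 'like'),
);

my @records = map { parse_line($_) } @lines;
printf "%s -> %s (%s)\n", @{$_}{qw(from to interaction)} for @records;
```

When reading from a real file, the header row would need to be skipped before calling parse_line on each remaining line.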
Note that a strongly connected component is a maximal set of vertices in a directed graph in which every vertex is reachable from every other vertex.
However, note that if you have a <-> b and b <-> c, then a, b, and c will form a strongly connected component, which means that this is a weaker requirement than all members of a group having interacted with each other in both directions.
We can still use this to reduce the search space. Once you have candidate groups, you can then check each to see if it fits your definition of a group. If a candidate group does not meet your requirements, then you can check all subsets with one fewer member. If you don't find any groups among those, you can then look at all subsets with two fewer members and so on until you hit the minimum group size limit.
The script below uses this idea. However, it very likely won’t scale. I strongly suspect one might be able to put together some SQL magic but my mind is far too limited for that.
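As a rough illustration of the candidate-then-subset idea, here is a sketch in core Perl only. It is an assumption-laden stand-in, not the Graph-based approach itself: candidates are taken as connected components of the mutual-interaction graph rather than strongly connected components from the Graph module, and all sub names (build_mutual, components, is_group, combinations, find_groups) are made up for this example. Interactions are stored in a hash of hashes, so only pairs that actually occur take up memory.

```perl
use strict;
use warnings;

my $MIN_SIZE = 3;    # constraint from the question: at least 3 users per group

# Keep only pairs that interacted in both directions (sparse hash of hashes).
sub build_mutual {
    my (@pairs) = @_;    # each element is [from, to]
    my (%seen, %mutual);
    $seen{ $_->[0] }{ $_->[1] } = 1 for @pairs;
    for my $u (keys %seen) {
        for my $v (keys %{ $seen{$u} }) {
            next if $u eq $v;
            $mutual{$u}{$v} = 1 if $seen{$v} && $seen{$v}{$u};
        }
    }
    return \%mutual;
}

# Connected components of the mutual graph serve as candidate groups.
sub components {
    my ($mutual) = @_;
    my (%visited, @components);
    for my $start (sort keys %$mutual) {
        next if $visited{$start}++;
        my (@stack, @members) = ($start);
        while (defined(my $u = pop @stack)) {
            push @members, $u;
            push @stack, grep { !$visited{$_}++ } keys %{ $mutual->{$u} };
        }
        push @components, [ sort @members ];
    }
    return @components;
}

# True if every pair of members interacted in both directions.
sub is_group {
    my ($mutual, @members) = @_;
    for my $i (0 .. $#members) {
        for my $j ($i + 1 .. $#members) {
            return 0 unless $mutual->{ $members[$i] }{ $members[$j] };
        }
    }
    return 1;
}

# All k-element subsets of @items, preserving order.
sub combinations {
    my ($k, @items) = @_;
    return ([]) if $k == 0;
    return ()   if @items < $k;
    my ($first, @rest) = @items;
    return (
        (map { [ $first, @$_ ] } combinations($k - 1, @rest)),
        combinations($k, @rest),
    );
}

sub find_groups {
    my (@pairs) = @_;
    my $mutual = build_mutual(@pairs);
    my @groups;
    for my $candidate (components($mutual)) {
        # Try the whole candidate, then subsets with one fewer member, and so on.
        for (my $size = @$candidate; $size >= $MIN_SIZE; $size--) {
            my @found = grep { is_group($mutual, @$_) } combinations($size, @$candidate);
            next unless @found;
            push @groups, @found;
            last;
        }
    }
    return @groups;
}

# From/To columns of the first example in the question.
my @pairs = map { [ split ' ' ] } (
    'a b', 'c d', 'e a', 'b c', 'a c', 'c a', 'e c', 'c b', 'b a',
);
print '(', join(', ', @$_), ")\n" for find_groups(@pairs);
```

For the first example in the question this prints (a, b, c). The subset enumeration grows exponentially with the size of a candidate, which is why this approach is unlikely to scale to large inputs.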
Output: