I have a 3 dimensional dataset that describes the gene interactions which can be

Question

0

Asked: May 18, 20262026-05-18T00:48:42+00:00 2026-05-18T00:48:42+00:00

I have a 3 dimensional dataset that describes the gene interactions which can be

0

I have a 3 dimensional dataset that describes the gene interactions which can be formulated as a graph. The sample of dataset is:

a + b  
b + c  
c - f  
b - d  
a + c  
f + g  
g + h  
f + h

‘+’ indicates that a gene on the left side positively regulates the gene on the right. In this data I want to count the sub-graph where a gene (say, x) positively regulates another gene (say, y), y in turn positively regulates yet another gene (say, z). Furthermore, z is also positively regulated by x. There are two such cases in above graph. I want to perform this search preferably using awk but any scripting language is fine. My apologies for being a too specific question and thanks in advance for the help.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-18T00:48:42+00:00

Note: See the information regarding Graphviz below.

This should give you a starting point:

Edit: This version handles genes that are described by more than one character.

awk '
    BEGIN { regdelim = "|" }
    {
        delim=""
        if ($2 == "+") {
            if (plus[$1]) delim=regdelim
            plus[$1]=plus[$1] delim $3
        }
        else
            if ($2 == "-") {
            if (minus[$1]) delim=regdelim
                minus[$1]=minus[$1] delim $3
            }
    }
    END {
        for (root in plus) {
            split(plus[root],regs,regdelim)
            for (reg in regs) {
                if (plus[regs[reg]] && plus[root] ~ plus[regs[reg]]) {
                    print "Match: ", root, "+", regs[reg], "+", plus[regs[reg]]
                }
            }
        }
    }
' inputfile

In the BEGIN clause, set regdelim to a character that doesn’t appear in your data.

I’ve omitted the processing code for the minus data.

Output:

Match:  a + b + c
Match:  f + g + h

Edit 2:

The version below allows you to search for arbitrary combinations. It generalizes the technique used in the original version so no code needs to be duplicated. It also fixes a couple of other ~~bugs~~limitations.

#!/bin/bash
# written by Dennis Williamson - 2010-11-12
# for http://stackoverflow.com/questions/4161001/counting-the-occurrence-of-a-sub-graph-in-a-graph
# A (AB) B, A (AC) C, B (BC) C - where "(XY)" represents a + or a - 
# provided by the positional parameters $1, $2 and $3
# $4 carries the data file name and is referenced at the end of the script
awk -v AB=$1 -v AC=$2 -v BC=$3 '
    BEGIN { regdelim = "|" }
    {
        if ($2 == AB) {
            if (regAB[$1]) delim=regdelim; else delim=""
            regAB[$1]=regAB[$1] delim $3
        }
        if ($2 == AC) {
            if (regAC[$1]) delim=regdelim; else delim=""
            regAC[$1]=regAC[$1] delim $3
        }
        if ($2 == BC) {
            if (regBC[$1]) delim=regdelim; else delim=""
            regBC[$1]=regBC[$1] delim $3
        }
    }
    END {
        for (root in regAB) {
            split(regAB[root],ABarray,regdelim)
            for (ABindex in ABarray) {
                split(regAC[root],ACarray,regdelim)
                for (ACindex in ACarray) {
                    split(regBC[ABarray[ABindex]],BCarray,regdelim)
                    for (BCindex in BCarray) {
                        if (ACarray[ACindex] == BCarray[BCindex]) {
                            print "    Match:", root, AB, ABarray[ABindex] ",", root, AC, ACarray[ACindex] ",", ABarray[ABindex], BC, BCarray[BCindex]
                        }
                    }
                }
            }
        }
    }
' "$4"

This can be called like this to do an exhaustive search:

for ab in + -; do for ac in + -; do for bc in + -; do echo "Searching: $ab$ac$bc"; ./searchgraph $ab $ac $bc inputfile; done; done; done

For this data:

a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k

The output of the shell loop calling the new version of the script would look like this:

Searching: +++
    Match: a + b, a + c, b + c
    Match: a + b, a + i, b + i
    Match: f + g, f + h, g + h
Searching: ++-
Searching: +-+
Searching: +--
    Match: l + m, l - n, m - n
    Match: a + b, a - e, b - e
Searching: -++
Searching: -+-
Searching: --+
Searching: ---
    Match: a - k, a - j, k - j

Edit 3:

Graphviz

Another approach would be to use Graphviz. The DOT language can describe the graph and gvpr, which is an "AWK-like"¹ programming language, can analyze and manipulate DOT files.

Given the input data in the format as shown in the question, you can use the following AWK program to convert it to DOT:

#!/usr/bin/awk -f
BEGIN {
    print "digraph G {"
    print "    size=\"5,5\""
    print "    ratio=.85"
    print "    node [fontsize=24 color=blue penwidth=3]"
    print "    edge [fontsize=18 labeldistance=5 labelangle=-8 minlen=2 penwidth=3]"
    print "    {rank=same; f l}"
    m  = "-"    # ASCII minus/hyphen as in the source data
    um = "−"    # u2212 minus: − which looks better on the output graphic
    p  = "+"
}

{
    if ($2 == m) { $2 = um; c = lbf = "red"; arr=" arrowhead = empty" }
    if ($2 == p) { c = lbf = "green3"; arr="" }
    print "    " $1, "->", $3, "[taillabel = \"" $2 "\" color = \"" c "\" labelfontcolor = \"" lbf "\"" arr "]"
}
END {
    print "}"
}

The command to run would be something like this:

$ ./dat2dot data.dat > data.dot

You can then create the graphic above using:

$ dot -Tpng -o data.png data.dot

I used the extended data as given above in this answer.

To do an exhaustive search for the type of subgraphs you specified, you can use the following gvpr program:

BEGIN {
    edge_t AB, BC, AC;
}

E {
    AB = $;
    BC = fstedge(AB.head);
    while (BC && BC.head.name != AB.head.name) {
        AC = isEdge(AB.tail,BC.head,"");
        if (AC) {
            printf("%s %s %s, ", AB.tail.name, AB.taillabel, AB.head.name);
            printf("%s %s %s, ", AC.tail.name, AC.taillabel, AC.head.name);
            printf("%s %s %s\n", BC.tail.name, BC.taillabel, BC.head.name);
        }
        BC = nxtedge(BC, AB.head);
    }
}

To run it, you could use:

$ gvpr -f groups.g data.dot | sort -k 2,2 -k 5,5 -k 8,8

The output would be similar to that from the AWK/shell combination above (under "Edit 2"):

a + b, a + c, b + c
a + b, a + i, b + i
f + g, f + h, g + h
a + b, a − e, b − e
l + m, l − n, m − n
a − k, a − j, k − j

¹_{Loosely speaking.}

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a 3 dimensional dataset that describes the gene interactions which can be

Leave an answerCancel reply

1 Answer

Graphviz

Leave an answer
Cancel reply