
The Archive Base Latest Questions

Editorial Team
Asked: May 14, 2026 at 2:33 am


I am working on a community detection algorithm for analyzing social network data from Facebook. The first task, detecting all cliques in the graph, can be done efficiently in parallel, and leaves me with an output like this:

17118 17136 17392
17064 17093 17376
17118 17136 17356 17318 12345
17118 17136 17356 17283
17007 17059 17116

Each of these lines represents a unique clique (a collection of node ids), and I want to sort these lines in descending order by the number of ids per line. In the case of the example above, here’s what the output should look like:

17118 17136 17356 17318 12345
17118 17136 17356 17283
17118 17136 17392
17064 17093 17376
17007 17059 17116

(Ties—i.e., lines with the same number of ids—can be sorted arbitrarily.)

What is the most efficient way of sorting these lines?

Keep the following points in mind:

  1. The file I want to sort could be larger than the physical memory of the machine.
  2. Most of the machines I’m running this on have several processors, so a parallel solution would be ideal.
  3. An ideal solution would just be a shell script (probably using sort), but I’m open to simple solutions in Python or Perl (or any language, as long as it makes the task simple).
  4. This task is in some sense very easy: I’m not just looking for any old solution, but for a simple and above all efficient one.

UPDATE 2: Best solution

Based on benchmarking the solutions proposed (see below), here is the best solution (taken from Vlad, who in turn adapted it from other solutions proposed here). It’s quite clever and does not even use sort:

for FILE in infile.* ; do
  awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
    FILE=`basename $FILE` $FILE&
done
wait
ls -1r tmpfile.* | xargs cat >outfile
rm -f tmpfile.*
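As a sanity check, the bucketing trick can be exercised end-to-end on the five example lines from the question (the /tmp paths and shard names here are illustrative): each awk appends every line to a temp file named after its zero-padded field count, and a reverse ls then concatenates the buckets from widest clique to narrowest.

```shell
# Recreate the question's example as two input shards (illustrative paths).
mkdir -p /tmp/clique_demo
cd /tmp/clique_demo
rm -f infile.* tmpfile.* outfile
printf '17118 17136 17392\n17064 17093 17376\n17118 17136 17356 17318 12345\n' >infile.1
printf '17118 17136 17356 17283\n17007 17059 17116\n' >infile.2
# One awk per shard buckets lines into tmpfile.<NF, zero-padded>.<shard>.
for FILE in infile.* ; do
  awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
    FILE=`basename $FILE` $FILE &
done
wait
# Reverse-lexicographic listing yields descending field counts.
ls -1r tmpfile.* | xargs cat >outfile
```

The first line of outfile is the 5-id clique; within a bucket, ties keep (reversed) shard order, which the question explicitly allows.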

UPDATE 1: Benchmarking results of proposed solutions

For benchmarking I took the cliques found in an Oklahoma State Facebook network. The unsorted files containing these cliques look just like the first example above; together they hold 46,362,546 lines, for a total size of 6.4 GB. The cliques are spread almost evenly over 8 files. The system I am testing on has 4 physical processors, each with 6 cores and a 12 MB L2 cache, for a total of 24 cores, plus 128 GB of physical memory. Because the lines to be sorted were split into 8 files, most of these solutions used 8 (or 16) concurrent processes.

Ignoring the first naive approach, I benchmarked the 5 suggestions of Vlad Romascanu (whose solution I selected).

The first solution was not too efficient:

real    6m35.973s
user    26m49.810s
sys     2m14.080s

I tried using solutions 2, 3, and 4, which use FIFO files, but they each used only one sort process and were thus taking a long time (so I killed them before they could finish).

The last solution was the quickest:

real    1m3.272s
user    1m21.540s
sys     1m22.550s

Note that the user time of this solution is 1m21s, much better than the first solution’s 26 minutes.



1 Answer

  1. Editorial Team
     Answered: May 14, 2026 at 2:33 am

    A naive approach could be simply:

    awk '{ print NF " " $0 }' infile | sort -k1,1nr |
      awk '{ $1=""; print $0 }' >outfile
    

    This will keep up to 3 CPUs busy. sort is not limited by the amount of physical memory available: use the -S and -T switches to configure how much memory to use (-S) before resorting to temporary files (-T) in a temp directory on a big enough (and ideally fast) partition.
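For example, with GNU sort (coreutils), -S caps the in-memory buffer, -T picks the spill directory, and --parallel (available since coreutils 8.6) adds threads within the single sort process. The file path below is illustrative:

```shell
# Tiny illustrative input; in practice this would be the multi-GB clique file.
printf '17118 17136 17392\n17118 17136 17356 17283\n17007 17059 17116\n' >/tmp/demo_cliques
awk '{ print NF " " $0 }' /tmp/demo_cliques |
  sort -S 64M -T /tmp --parallel=2 -k1,1nr |   # 64 MiB buffer, spill to /tmp, 2 threads
  cut -d' ' -f2- >/tmp/demo_sorted
```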

    If you can produce several input files by subdividing the work leading up to the sort phase, you would then be able to do:

    for FILE in infile.* ; do
      awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.tmp&
    done
    wait
    sort -k1,1nr -m infile.*.tmp | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.tmp
    

    This will use up to N*2 CPUs; moreover, the last sort (merge-sort) is highly efficient.

    Refining further to improve parallelism to N*2+1 by using FIFOs instead of intermediate files, again assuming multiple input files are possible:

    for FILE in infile.* ; do
      mkfifo $FILE.fifo
      awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.fifo&
    done
    sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.fifo
    

    If multiple input files are not possible, you can simulate them (adding I/O overhead which will hopefully be amortized by the number of processes available):

    PARALLELISM=5 # I want 5 parallel instances
    for N in `seq $PARALLELISM` ; do
      mkfifo infile.$N.fifo
  awk '(NR - 1) % '$PARALLELISM' == '$((N - 1))' { print NF " " $0 }' infile |
        sort -k1,1nr >infile.$N.fifo&
    done
    sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.fifo
    

    Because we use modulo-line-number we have good locality and the filesystem cache should ideally bring the cost of reading the input file over and over in $PARALLELISM processes closer to zero.

    Even better, reading the input file only once and round-robin-ing input lines into several sort pipes:

    PARALLELISM=5 # I want 5 parallel instances
    for N in `seq $PARALLELISM` ; do
      mkfifo infile.$N.fifo1
      mkfifo infile.$N.fifo2
      sort -k1,1nr infile.$N.fifo1 >infile.$N.fifo2&
    done
    awk '{ print NF " " $0 >("infile." (NR % '$PARALLELISM' + 1) ".fifo1") }' infile&
    sort -k1,1nr -m infile.*.fifo2 | awk '{ $1=""; print $0 }' >outfile
    rm -f infile.*.fifo[12]
    

    You should measure performance for various values of $PARALLELISM then pick the optimal one.
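A minimal sweep along those lines (illustrative sizes, /tmp paths, and GNU sort's --parallel; date +%s gives only coarse whole-second timing) could look like:

```shell
# Generate ~50k lines with 3-10 random ids each, then time the naive
# pipeline at a few parallelism levels and compare the reported seconds.
awk 'BEGIN { srand(42)
  for (i = 0; i < 50000; i++) {
    n = 3 + int(rand() * 8); line = ""
    for (j = 0; j < n; j++) line = line (j ? " " : "") int(rand() * 99999)
    print line
  } }' >/tmp/sweep_infile
for P in 1 2 4 ; do
  t0=$(date +%s)
  awk '{ print NF " " $0 }' /tmp/sweep_infile |
    sort --parallel=$P -k1,1nr | cut -d' ' -f2- >/tmp/sweep_outfile
  t1=$(date +%s)
  echo "--parallel=$P: $((t1 - t0))s"
done
```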

    EDIT

    As shown in other posts, you can of course use cut instead of the final awk (the one that strips the first column) for potentially better efficiency. 🙂
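One subtle difference worth knowing: clearing $1 in awk rebuilds the record with the output field separator, so every output line keeps a leading space where the count used to be, while cut drops the prefix cleanly. A quick comparison:

```shell
# awk leaves a leading space behind where field 1 used to be...
printf '3 17118 17136 17392\n' | awk '{ $1=""; print $0 }'
# ...whereas cut removes the count prefix exactly.
printf '3 17118 17136 17392\n' | cut -d' ' -f2-
```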

    EDIT2

    Updated all scripts for the filename convention you provided, and fixed a bug in the last version.

    Also, using the new filename convention, if I/O is not the bottleneck then a very slight variation on dave/niry's solutions should probably be even more efficient:

       for FILE in infile.* ; do
         awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
           FILE=`basename $FILE` $FILE&
       done
       wait
       ls -1r tmpfile.* | xargs cat >outfile
       rm -f tmpfile.*
    

© 2021 The Archive Base. All Rights Reserved