Editorial Team
Asked: May 28, 2026 (2026-05-28T11:16:06+00:00)

I am using a cluster of computers to do some parallel computation. My home directory is shared across the cluster. On one machine, I have Ruby code that creates bash scripts containing the computation commands and writes them to, say, the ~/q/ directory. The scripts are named *.worker1.sh, *.worker2.sh, etc.

On 20 other machines, I have Python code running (one instance per machine) that constantly checks the ~/q/ directory and looks for jobs that belong to that machine, using Python code like this:

jobs = glob.glob('q/*.worker1.sh')
[os.system('sh ' + job + ' &') for job in jobs]

For some additional control, the Ruby code creates an empty file like workeri.start (i = 1..20) in the q directory after it writes the bash script there, and the Python code checks for that ‘start’ file before it runs the code above. In the bash script, if the command finishes successfully, the script creates an empty file like ‘workeri.success’. The Python code checks for this file after it runs the code above to make sure the computation finished successfully. If it did, Python removes the ‘start’ file from the q directory, so the Ruby code knows that the job finished successfully. After all 20 bash scripts have finished, the Ruby code creates new bash scripts, Python reads and executes them, and so on.

I know this is not an elegant way to coordinate the computation, but I haven’t figured out a better way to communicate between different machines.

Now the question is: I expect the 20 jobs to run more or less in parallel, so the total time to finish all 20 should not be much longer than the time to finish one. However, it seems that the jobs run sequentially, and the total time is much longer than I expected.

I suspect that part of the reason is that multiple programs are reading and writing the same directory at once, but the Linux system or Python locks the directory and only allows one process to operate on it at a time. This would make the jobs execute one at a time.

I am not sure if this is the case. If I split the bash scripts into different directories, and let the Python code on different machines read and write different directories, will that solve the problem? Or are there other reasons that could cause it?

Thanks a lot for any suggestions! Let me know if I didn’t explain anything clearly.

Some additional info:
My home directory is at /home/my_group/my_home. Here is its mount info:
:/vol/my_group on /home/my_group type nfs (rw,nosuid,nodev,noatime,tcp,timeo=600,retrans=2,rsize=65536,wsize=65536,addr=…)

By “constantly check the q directory” I mean a Python loop like this:

while True:
    if os.path.exists('q/worker1.start'):
        # find the scripts and execute them as shown above
        jobs = glob.glob('q/*.worker1.sh')
        [os.system('sh ' + job + ' &') for job in jobs]
1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 11:16 am

    I know this is not an elegant way to coordinate the computation, but I
    haven’t figured out a better way to communicate between different
    machines.

    While this isn’t directly what you asked, you should really, really consider fixing your problem at this level. Using some sort of shared message queue is likely to be a lot simpler to manage and debug than relying on the locking semantics of a particular networked filesystem.

    The simplest solution to set up and run, in my experience, is Redis on the machine currently running the Ruby script that creates the jobs. It should literally be as simple as downloading the source, compiling it, and starting it up. Once the Redis server is up and running, you change your code to append your computation commands to one or more Redis lists. In Ruby you would use the redis-rb library like this:

    require "redis"
    
    redis = Redis.new
    # Your other code to build up command lists...
    # lpush accepts a single value or an array of values
    redis.lpush 'commands', [command1, command2]
    

    If the computations need to be handled by certain machines, use a list per-machine like this:

    redis.lpush 'jobs:machine1', command1
    # etc.
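
As a rough illustration of the per-machine lists, the sketch below spreads a batch of commands over worker queues in round-robin order. It is written in Python so it can sit next to the consumer code. The helper names `queue_name` and `distribute` are made up for this example, and `client` is assumed to be any object with a redis-py style `rpush(key, value)` method.

```python
from itertools import cycle


def queue_name(worker_id):
    """Return the per-machine list key, e.g. 'jobs:machine1'."""
    return f"jobs:machine{worker_id}"


def distribute(client, commands, worker_ids):
    """Push each command onto one worker's list, round-robin.

    `client` is assumed to expose rpush(key, value), as redis.Redis does.
    rpush appends to the tail of the list, so a consumer that pops from
    the head (lpop or blpop) sees jobs in submission order.
    """
    assignment = {}
    for command, worker_id in zip(commands, cycle(worker_ids)):
        client.rpush(queue_name(worker_id), command)
        assignment.setdefault(worker_id, []).append(command)
    return assignment


# Usage, assuming redis-py is installed and the server is reachable:
#   from redis import Redis
#   client = Redis(host="hostname-of-machine-running-redis")
#   distribute(client, list_of_commands, range(1, 21))
```

Note that the Ruby snippets here use lpush, which together with lpop hands out the newest job first. Using rpush on the producer side, as in this sketch, gives first-in, first-out order instead.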
    

    Then in your Python code, you can use redis-py to connect to the Redis server and pull jobs off the list like so:

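    import os
    from redis import Redis

    # decode_responses=True makes lpop return str instead of bytes
    r = Redis(host="hostname-of-machine-running-redis", decode_responses=True)
    while True:
        job = r.lpop('jobs:machine1')  # same key the Ruby producer pushes to
        if job is None:  # the list is empty
            break
        os.system('sh ' + job + ' &')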
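
The loop above stops as soon as the list is empty, while the question describes workers that keep watching for new jobs. One possible variation, sketched here under assumptions, is to block on the list with BLPOP so the worker waits for work without polling. The function names `start_script` and `run_worker` and the `max_idle_polls` parameter are invented for this example. `client` is assumed to behave like redis-py, where `blpop(keys, timeout)` returns a `(key, value)` pair or `None` on timeout, and each job is assumed to be the path of a script.

```python
import subprocess


def start_script(script):
    """Start `sh script` in the background and return the Popen handle."""
    return subprocess.Popen(["sh", script])


def run_worker(client, queue, launch=start_script, timeout=5, max_idle_polls=None):
    """Pop script paths from `queue` and hand each one to `launch`.

    With max_idle_polls=None the loop runs forever. Otherwise it returns
    after that many consecutive timeouts with no job.
    """
    started = []
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        item = client.blpop([queue], timeout=timeout)
        if item is None:  # nothing arrived before the timeout expired
            idle += 1
            continue
        idle = 0
        _key, script = item
        if isinstance(script, bytes):  # redis-py returns bytes by default
            script = script.decode()
        # Popen returns immediately, so jobs run concurrently without "&".
        started.append(launch(script))
    return started


# Usage, assuming redis-py is installed and the server is reachable:
#   from redis import Redis
#   run_worker(Redis(host="hostname-of-machine-running-redis"), "jobs:machine1")
```

With a real Redis server, a `timeout` of 0 means BLPOP blocks until a job arrives.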
    

    Of course, you could just as easily pull jobs off the queue and execute them in Ruby:

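    require 'redis'
    redis = Redis.new(:host => 'hostname-of-machine-running-redis')
    # lpop returns nil once the list is empty, which ends the loop
    # (llen returns 0 for an empty list, and 0 is truthy in Ruby)
    while (job = redis.lpop('jobs:machine1'))
        # system does not capture output, so it returns once the job is backgrounded
        system("sh #{job} &")
    end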
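
The ‘start’ and ‘success’ marker files from the question could probably move into Redis too. The sketch below is one way a worker might report each outcome on a shared results list that the job-creating script reads before it builds the next batch. The `jobs:done` key, the `worker:status:script` message format, and the function name `run_and_report` are all made up for illustration, and `client` is assumed to expose a redis-py style `rpush(key, value)`.

```python
import subprocess


def run_and_report(client, script, worker_id, done_queue="jobs:done",
                   run=subprocess.call):
    """Run one script to completion and push its outcome onto a results list.

    `run` takes an argument list and returns the exit code. It defaults to
    subprocess.call, which waits for the script to finish.
    """
    returncode = run(["sh", script])
    status = "success" if returncode == 0 else "failed"
    client.rpush(done_queue, f"{worker_id}:{status}:{script}")
    return status


# Usage, assuming redis-py is installed and the server is reachable:
#   from redis import Redis
#   client = Redis(host="hostname-of-machine-running-redis")
#   run_and_report(client, "q/job.worker1.sh", worker_id=1)
```

The producer could then pop 20 entries from the results list before creating the next round of scripts, instead of watching for the ‘start’ files to disappear.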
    

    With some more details about the needs of the computation and the environment it’s running in, it would be possible to recommend even simpler approaches to managing it.
