The Archive Base


Editorial Team
Asked: June 12, 2026 (2026-06-12T11:48:56+00:00)


Background:
Python 2.6.6 on Linux. First part of a DNA sequence analysis pipeline.
I want to read a possibly gzipped file from mounted remote storage (LAN). If it is gzipped, I want to gunzip it to a stream (i.e. using gunzip FILENAME -c). If the first character of the stream (file) is “@”, the entire stream should be routed into a filtering program that takes its input on standard input; otherwise it should be piped directly to a file on local disk. I’d like to minimize the number of file reads/seeks from remote storage (a single pass through the file should be enough).

Contents of an example input file, first four lines corresponding to one record in FASTQ format:

@I328_1_FC30MD2AAXX:8:1:1719:1113/1                                        
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
+I328_1_FC30MD2AAXX:8:1:1719:1113/1                                        
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhahhhhhhfShhhYhhQhh]hhhhffhU\UhYWc

Files that should not be piped into the filtering program contain records that look like this (first two lines corresponding to one record in FASTA format):

>I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG

Some made up semi-pseudo code effort to visualize what I want to do (I know this isn’t possible the way I’ve written it). I hope it makes some sense:

if gzipped:
    gunzip = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
    if gunzip.stdout.peek(1) == "@": # This isn't possible
        fastq = True
    else:
        fastq = False
if fastq:
    filter = Popen(["filter", "localstorage/outputfile.fastq"], stdin=gunzip.stdout).communicate()
else:
    # Send the gunzipped stream to another file

Disregard the fact that the code won’t run the way I’ve written it here and that it has no error handling etc.; all of that is already in my other code. I just want help with peeking into the stream or finding a way around that. It would be great if you could call gunzip.stdout.peek(1), but I realize that’s not possible.

What I’ve tried so far:
I figured subprocess.Popen might help me achieve this, and I’ve tried a lot of different ideas, amongst others trying to use some kind of io.BufferedRandom() object to write the stream to but I can’t figure out how that would work. I know streams are non-seekable but maybe a workaround might be to read the first character of the gunzip-stream and then create a new stream where you first input a “@” or “>” depending on file contents and then stuff the rest of the gunzip.stdout-stream into the new stream. This new stream would then be fed into filter’s Popen stdin.

Note that the file sizes might be several times larger than available memory. I do not want to perform more than one single read of the source file from remote storage and no unnecessary file accessing.

Any ideas are welcome! Please ask me questions so I can clarify if I didn’t make it clear enough.


1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 11:48 am (2026-06-12T11:48:57+00:00)

    Here is an implementation of your proposal: first input a “@” or “>” depending on file contents, and then stuff the rest of the gunzip.stdout stream into the new stream. I only tested the local-file branch, but it should be enough to demonstrate the concept.

    from subprocess import Popen, PIPE

    # bufsize=0 keeps the pipe unbuffered, so read(1) consumes exactly one byte.
    if gzipped:
        source = Popen(["gunzip", "-c", "remotestorage/file.gz"],
                       stdout=PIPE, bufsize=0)
    else:
        source = Popen(["cat", "remotestorage/file"], stdout=PIPE, bufsize=0)
    firstchar = source.stdout.read(1)  # empty for empty input; not handled here
    # "Unread" the byte we've just read: printf re-emits it, then cat copies the rest.
    # An octal escape is used because POSIX printf does not define \xHH.
    unread = Popen(r"(printf '\%03o' && cat)" % ord(firstchar),
                   shell=True, stdin=source.stdout, stdout=PIPE)
    source.stdout.close()  # the shell child now holds the read end of the pipe

    # Now feed the output to a filter or to a local file.
    flocal = None
    try:
        if firstchar == b"@":
            consumer = Popen(["filter", "localstorage/outputfile.fastq"],
                             stdin=unread.stdout)
        else:
            flocal = open('localstorage/outputfile.stream', 'wb')
            consumer = Popen(["cat"], stdin=unread.stdout, stdout=flocal)
        unread.stdout.close()
        consumer.communicate()
    finally:
        if flocal is not None:
            flocal.close()

    The idea is to read a single byte from the source command’s output and then recreate the original output using (printf '\ooo' && cat), where ooo is the byte’s octal value, effectively implementing the peek. The replacement stream passes shell=True to Popen, leaving the heavy lifting to the shell and cat. The data stays in the pipeline at all times and is never read entirely into memory. Note that the shell is only used for the single Popen call that unreads the peeked byte, not for the calls that involve user-supplied file names. Even there, the byte is passed as a numeric escape so that the shell cannot mangle it when invoking printf.

    The code could be further cleaned up to implement an actual function named peek that returns the peeked contents and a replacement new_source.
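
As a rough sketch of that cleanup, the peek and unread steps could be wrapped as below. The names peek and new_source come from the suggestion above; everything else is an assumption: a POSIX sh with printf and cat on the PATH, a source process created with stdout=PIPE and bufsize=0 (so read(1) takes exactly one byte), and an octal printf escape for the unread step.

```python
from subprocess import Popen, PIPE


def peek(source):
    """Return (first_byte, new_source) for a Popen object with a piped stdout.

    new_source is a process whose stdout replays the peeked byte followed by
    the rest of the stream. For empty input the original process is returned.
    """
    first = source.stdout.read(1)
    if not first:
        return first, source
    # printf re-emits the peeked byte as an octal escape, cat copies the rest.
    new_source = Popen(r"printf '\%03o' && cat" % ord(first),
                       shell=True, stdin=source.stdout, stdout=PIPE, bufsize=0)
    # Drop our copy of the read end; the shell child keeps its own.
    source.stdout.close()
    return first, new_source


if __name__ == "__main__":
    # Hypothetical stand-in for the gunzip/cat source process.
    src = Popen(["echo", "@I328_1"], stdout=PIPE, bufsize=0)
    first, new_source = peek(src)
    is_fastq = (first == b"@")
    rest = new_source.stdout.read()  # in real use, hand this pipe to the filter
    new_source.wait()
    src.wait()
```

The caller would then pass new_source.stdout as the stdin of either the filter or the local-file writer, exactly as in the code above.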



© 2021 The Archive Base. All Rights Reserved