The following code allows me to extract .tgz files. However, it stops extracting after

Asked: May 23, 2026 (2026-05-23T08:40:56+00:00)

The following code allows me to extract .tgz files. However, it stops extracting after about two levels down; there are other subfolders that have .tgz files that need extracting. Additionally, when I extract a file, I have to manually move it to another path or it will get overwritten by other .tgz files that I extract to that location (all .tgz that I’m using have the same file structure/folder names once extracted). Any help is appreciated. Thanks!

import os, sys, tarfile

def extract(tar_url, extract_path='.'):
    print tar_url
    tar = tarfile.open(tar_url, 'r')
    for item in tar:
        tar.extract(item, extract_path)
        if item.name.find(".tgz") != -1 or item.name.find(".tar") != -1:
            extract(item.name, "./" + item.name[:item.name.rfind('/')])
try:

    extract(sys.argv[1] + '.tgz')
    print 'Done.'
except:
    name = os.path.basename(sys.argv[0])
    print name[:name.rfind('.')], '<filename>'

1 Answer

Editorial Team · Answer 1 · 2026-05-23T08:40:56+00:00

If I have understood your question correctly, here is what you want to do:

Extract a .tgz file that may contain more .tgz files, which themselves need extracting (and so on).
While extracting, avoid replacing a directory that already exists in the target folder.

If so, here is what my code does:

Extracts every .tgz file (recursively) into a separate folder named after the .tgz file (without its extension), in the same directory as that file.
While extracting, it makes sure it does not overwrite or replace any existing file or folder.

So if this is the directory structure of the .tgz file –

parent/
    xyz.tgz/
        a
        b
        c
        d.tgz/
            x
            y
            z
        a.tgz/                  # note if I extract this directly, it will replace/overwrite contents of the folder 'a'
            m
            n
            o
            p

After extraction, the directory structure will be –

parent/
    xyz.tgz
    xyz/
        a
        b
        c
        d/
            x
            y
            z
        a 1/                  # it extracts 'a.tgz' to the folder 'a 1' as folder 'a' already exists in the same folder.
            m
            n
            o
            p
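
To make the renaming rule in the listing above concrete, here is a minimal sketch of it. The helper name unique_folder_name is hypothetical and is not the function from my program; it is an iterative variant that only appends a counter, so it may behave differently for names that already end in a number.

```python
import os
import tempfile


def unique_folder_name(folder_name, parent_dir):
    """Return folder_name, or folder_name plus ' 1', ' 2', ... if taken.

    Hypothetical helper for illustration only.
    """
    candidate = folder_name
    counter = 0
    # Keep trying 'name 1', 'name 2', ... until nothing with that
    # name exists in parent_dir.
    while os.path.exists(os.path.join(parent_dir, candidate)):
        counter += 1
        candidate = '%s %d' % (folder_name, counter)
    return candidate


if __name__ == '__main__':
    parent = tempfile.mkdtemp()
    os.mkdir(os.path.join(parent, 'a'))        # 'a' is already taken
    print(unique_folder_name('a', parent))     # a 1
    print(unique_folder_name('d', parent))     # d
```

This is why 'a.tgz' ends up in 'a 1' while 'd.tgz' can use 'd' directly.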

The code below is documented in detail, but here is a brief outline of its structure. These are the functions I have defined:

FileExtension --> returns the extension of a file
AppropriateFolderName --> helps prevent overwriting/replacing existing folders (how? you will see in the program)
Extract --> extracts a .tgz file (safely)
WalkTreeAndExtract --> walks down a directory (passed as a parameter) and extracts all .tgz files (recursively) on the way down

I cannot suggest changes to your code, as my approach is a bit different. I have used the extractall method of the tarfile module instead of the somewhat more complicated extract method that you used. (Have a glance at http://docs.python.org/library/tarfile.html#tarfile.TarFile.extractall and read the warning about using extractall: an archive from an untrusted source can contain members with absolute paths or ".." components, which end up outside the target directory. I don't think we will run into that problem in general, but keep it in mind.)
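
If your archives might come from an untrusted source, the extractall warning can be handled with a check before extracting. The following is only a sketch under my own assumptions: the names is_inside and safe_extractall are hypothetical, and the sketch rejects link members outright to keep the check simple. Newer Python versions (3.12 and later) also accept a filter argument to extractall that serves a similar purpose.

```python
import os
import tarfile


def is_inside(directory, target):
    """Return True if target resolves to a path inside directory."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(target)
    return target == directory or target.startswith(directory + os.sep)


def safe_extractall(tar_path, dest_dir):
    """Extract tar_path into dest_dir, refusing members that would escape it.

    Hypothetical helper, not part of the tarfile module.
    """
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            # Assumption: links are not needed, so refuse them entirely.
            if member.issym() or member.islnk():
                raise ValueError('refusing link member: %s' % member.name)
            # Absolute names and '..' components resolve outside dest_dir.
            if not is_inside(dest_dir, os.path.join(dest_dir, member.name)):
                raise ValueError('refusing unsafe member: %s' % member.name)
        tar.extractall(dest_dir)
```

Nothing is written to disk unless every member passes the check, because the loop finishes before extractall is called.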

Here is the complete script that worked for me.
(I tried it on .tar files nested 5 levels deep (i.e. .tar within .tar within .tar … 5 times), but it should work for any depth* and also for .tgz files.)

# extracting_nested_tars.py

import os
import re
import tarfile

file_extensions = ('tar', 'tgz')
# Edit this according to the archive types you want to extract. Keep in
# mind that these should be extractable by the tarfile module.

def FileExtension(file_name):
    """Return the file extension of file_name.

    file_name should be a string. It can be either the full path of
    the file or just its name (or any string, as long as it contains
    the file extension).

    Examples:
    input (file_name) -->  'abc.tar'
    return value      -->  'tar'

    """
    match = re.compile(r"^.*[.](?P<ext>\w+)$",
      re.VERBOSE|re.IGNORECASE).match(file_name)

    if match:           # if match != None:
        ext = match.group('ext')
        return ext
    else:
        return ''       # there is no file extension to file_name

def AppropriateFolderName(folder_name, parent_fullpath):
    """Return a folder name such that it can be safely created in
    parent_fullpath without replacing any existing folder in it.

    Check if a folder named folder_name exists in parent_fullpath. If no,
    return folder_name (without changing, because it can be safely created 
    without replacing any already existing folder). If yes, append an
    appropriate number to the folder_name such that this new folder_name
    can be safely created in the folder parent_fullpath.

    Examples:
    folder_name = 'untitled folder'
    return value = 'untitled folder' (if no such folder already exists
                                      in parent_fullpath.)

    folder_name = 'untitled folder'
    return value = 'untitled folder 1' (if a folder named 'untitled folder'
                                        already exists but no folder named
                                        'untitled folder 1' exists in
                                        parent_fullpath.)

    folder_name = 'untitled folder'
    return value = 'untitled folder 2' (if folders named 'untitled folder'
                                        and 'untitled folder 1' both
                                        already exist but no folder named
                                        'untitled folder 2' exists in
                                        parent_fullpath.)

    """
    if os.path.exists(os.path.join(parent_fullpath, folder_name)):
        match = re.compile(r'^(?P<name>.*)[ ](?P<num>\d+)$').match(folder_name)
        if match:                           # name already ends in a number
            name = match.group('name')
            number = match.group('num')
            new_folder_name = '%s %d' % (name, int(number) + 1)
        else:
            new_folder_name = '%s 1' % folder_name
        # Recurse to check whether a folder named new_folder_name
        # already exists in parent_fullpath.
        return AppropriateFolderName(new_folder_name, parent_fullpath)
    else:
        return folder_name

def Extract(tarfile_fullpath, delete_tar_file=True):
    """Extract the tarfile_fullpath to an appropriate* folder of the same
    name as the tar file (without an extension) and return the path
    of this folder.

    If delete_tar_file is True, the tar file is deleted after
    extraction; if False, it is kept. The default is True because you
    would normally want to delete the (nested) tar files after
    extraction. Pass False if you don't want to delete the tar file
    you are passing.

    """
    tarfile_name = os.path.basename(tarfile_fullpath)
    parent_dir = os.path.dirname(tarfile_fullpath)

    # Strip the extension (e.g. '.tar') from the file name. If there is
    # no extension, keep the name unchanged.
    extension = FileExtension(tarfile_name)
    if extension:
        base_name = tarfile_name[:-len(extension) - 1]
    else:
        base_name = tarfile_name
    # Get a folder name (from the function AppropriateFolderName)
    # in which the contents of the tar file can be extracted,
    # so that it doesn't replace an already existing folder.
    extract_folder_name = AppropriateFolderName(base_name, parent_dir)
    # The full path to this new folder.
    extract_folder_fullpath = os.path.join(parent_dir, extract_folder_name)
    # The full path to this new folder.

    try:
        # The with-block closes the archive even if extraction fails
        # (needs Python 2.7 or later).
        with tarfile.open(tarfile_fullpath) as tar:
            tar.extractall(extract_folder_fullpath)
        if delete_tar_file:
            os.remove(tarfile_fullpath)
        return extract_folder_name
    except Exception as e:
        # Exceptions can occur while opening a damaged tar file.
        print('Error occurred while extracting %s\n'
              'Reason: %s' % (tarfile_fullpath, e))
        return

def WalkTreeAndExtract(parent_dir):
    """Recursively descend the directory tree rooted at parent_dir
    and extract each tar file on the way down (recursively).
    """
    try:
        dir_contents = os.listdir(parent_dir)
    except OSError as e:
        # This can happen if the program lacks permission to open
        # the folder.
        print('Error occurred. Could not open folder %s\n'
              'Reason: %s' % (parent_dir, e))
        return

    for content in dir_contents:
        content_fullpath = os.path.join(parent_dir, content)
        if os.path.isdir(content_fullpath):
            # If content is a folder, walk it down completely.
            WalkTreeAndExtract(content_fullpath)
        elif os.path.isfile(content_fullpath):
            # If content is a file, check if it is a tar file.
            # If so, extract its contents to a new folder.
            if FileExtension(content_fullpath) in file_extensions:
                extract_folder_name = Extract(content_fullpath)
                if extract_folder_name:     # if extract_folder_name != None:
                    dir_contents.append(extract_folder_name)
                    # Append the newly extracted folder to dir_contents
                    # so that it can be later searched for more tar files
                    # to extract.
        else:
            # Unknown file type.
            print('Skipping %s. <Neither file nor folder>' % content_fullpath)

if __name__ == '__main__':
    tarfile_fullpath = 'fullpath_path_of_your_tarfile'    # pass the path of your tar file here.
    extract_folder_name = Extract(tarfile_fullpath, False)

    # tarfile_fullpath is extracted to extract_folder_name. Now descend
    # its directory structure and extract all other tar files
    # (recursively). Extract returns None on failure, so check first.
    if extract_folder_name:
        extract_folder_fullpath = os.path.join(
            os.path.dirname(tarfile_fullpath), extract_folder_name)
        WalkTreeAndExtract(extract_folder_fullpath)
    # To extract all tar files in a directory, call only
    # WalkTreeAndExtract on that directory and nothing else.

I have not added a command-line interface to the script. You can add one if you find it useful.
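
If you do want one, a command-line interface could look roughly like this. It is only a sketch using argparse: the function names build_parser and main and the --delete-outer flag are my own assumptions, not part of the script above, and the wiring to Extract and WalkTreeAndExtract is shown only in a comment.

```python
import argparse


def build_parser():
    """Build a command-line parser for a nested-archive extractor."""
    parser = argparse.ArgumentParser(
        description='Extract a tar archive and any archives nested inside it.')
    parser.add_argument('archive',
                        help='path to the outermost .tar or .tgz file')
    parser.add_argument('--delete-outer', action='store_true',
                        help='delete the outermost archive after extracting it')
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and return the chosen options."""
    args = build_parser().parse_args(argv)
    # Wiring into the script above would look roughly like:
    #   folder = Extract(args.archive, args.delete_outer)
    #   if folder:
    #       WalkTreeAndExtract(
    #           os.path.join(os.path.dirname(args.archive), folder))
    return args.archive, args.delete_outer
```

Calling main(['data.tgz']) returns the archive path with delete_outer set to False, which matches the False passed to Extract in the __main__ block above.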

Here is a slightly better version of the above program –
http://guanidene.blogspot.com/2011/06/nested-tar-archives-extractor.html
