The Archive Base

Editorial Team
Asked: June 13, 2026

I have more than 20,000 files like the ones below, all in the same directory:

8003825.pdf
8003825.tif
8006826.tif

How does one find all duplicate filenames while ignoring the file extension?

Clarification: by a duplicate I mean a file with the same filename once the extension is ignored. I do not care whether the file contents are 100% identical (e.g., same hash, size, or anything like that).

For example:

"8003825" appears twice

Then look at the metadata of each duplicate file and only keep the newest one.

Similar to this post:

Keep latest file and delete all other

I think I have to create a list of all files and check whether each filename already exists. If it does, should I use os.stat to determine the modification date?

I’m a little concerned about loading all those filenames into memory, and I’m wondering if there is a more Pythonic way of doing things…

Python 2.6
Windows 7

1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 8:08 pm

    You can do this in O(n) time by using a dictionary. Solutions based on sorting have O(n*log(n)) complexity.

    import os
    from collections import namedtuple
    
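    directory = os.path.abspath('.') #placeholder: replace '.' with your file directory
    os.chdir(directory) #work inside it so that bare file names resolve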
    
    newest_files = {}
    Entry = namedtuple('Entry',['date','file_name'])
    
    for file_name in os.listdir(directory):
        name,ext = os.path.splitext(file_name)
        cashed_file = newest_files.get(name)
        this_file_date = os.path.getmtime(file_name)
        if cashed_file is None:
            newest_files[name] = Entry(this_file_date,file_name)
        else:
            if this_file_date > cashed_file.date: #replace with the newer one
                newest_files[name] = Entry(this_file_date,file_name)
    

    newest_files is a dictionary whose keys are file names without extensions and whose values are named tuples holding the full file name and the modification date. When a file whose base name is already in the dictionary is encountered, its date is compared with the stored one, and the entry is replaced if the new file is newer.

    In the end you have a dictionary with the most recent files.
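If you only want to report which names are duplicated before deleting anything, a minimal sketch along the same lines could group the file names by their extension-less stem. The function name `find_duplicate_stems` is hypothetical (it is not part of the original answer), and it takes a plain list of names so it can be tried without touching the disk:

```python
import os
from collections import defaultdict


def find_duplicate_stems(file_names):
    """Group names by stem (name without extension); keep stems seen more than once."""
    groups = defaultdict(list)
    for file_name in file_names:
        stem, _ext = os.path.splitext(file_name)
        groups[stem].append(file_name)
    # Only stems that map to two or more files count as duplicates.
    return dict((stem, names) for stem, names in groups.items() if len(names) > 1)


duplicates = find_duplicate_stems(['8003825.pdf', '8003825.tif', '8006826.tif'])
print(duplicates)  # {'8003825': ['8003825.pdf', '8003825.tif']}
```

On a real directory you would pass `os.listdir(directory)` instead of the literal list. Holding 20,000 short file names in memory takes very little space, so memory should not be a concern at this scale.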

    You can then use newest_files for a second pass. Note that a dictionary lookup is O(1) on average, so looking up all n files is O(n) overall.

    For example, if you want to keep only the newest file for each name and delete the others, you can do it as follows:

    for file_name in os.listdir(directory):
        name,ext = os.path.splitext(file_name)
        cashed_file_name = newest_files.get(name).file_name
        if file_name != cashed_file_name: #it's not the newest with this name
            os.remove(file_name)
    

    As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as a duplicate is encountered. Remove the cached file when the current one is newer, and remove the current file otherwise:

        else:
            if this_file_date > cashed_file.date: #replace with the newer one
                newest_files[name] = Entry(this_file_date,file_name)
                os.remove(cashed_file.file_name) #this line added
            else:
                os.remove(file_name) #this line added: the current file is the older one
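The single-pass idea can also be wrapped in a function that avoids `os.chdir` by building full paths and that supports a dry run. This is only a sketch under stated assumptions: the name `keep_newest` and the `dry_run` flag are hypothetical, subdirectories are skipped, and when two files have identical modification times the one seen first is kept.

```python
import os


def keep_newest(directory, dry_run=True):
    """Keep the newest file per stem in `directory`; return the paths of the older ones.

    With dry_run=True (the default) nothing is deleted.
    """
    newest = {}     # stem -> (mtime, path) of the newest file seen so far
    to_remove = []  # paths of the files that lost the comparison
    for file_name in os.listdir(directory):
        path = os.path.join(directory, file_name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        stem, _ext = os.path.splitext(file_name)
        mtime = os.path.getmtime(path)
        cached = newest.get(stem)
        if cached is None:
            newest[stem] = (mtime, path)
        elif mtime > cached[0]:
            to_remove.append(cached[1])  # the cached file is older
            newest[stem] = (mtime, path)
        else:
            to_remove.append(path)  # the current file is older (or the same age)
    if not dry_run:
        for path in to_remove:
            os.remove(path)
    return to_remove
```

Calling `keep_newest(directory)` first and inspecting the returned list, then rerunning with `dry_run=False`, gives you a chance to check what would be deleted.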
    
© 2021 The Archive Base. All Rights Reserved