Using Python, I’m trying to rename a series of .txt files in a directory according to a specific phrase in each given text file. Put differently and more specifically, I have a few hundred text files with arbitrary names but within each file is a unique phrase (something like No. 85-2156). I would like to replace the arbitrary file name with that given phrase for every text file. The phrase is not always on the same line (though it doesn’t deviate that much) but it always is in the same format and with the No. prefix.
I’ve looked at the os module and I understand how
could be useful but I don’t understand how to combine those functions with intratext manipulation functions like linecache or general line reading functions.
I’ve thought through many ways of accomplishing this task, but it seems like the easiest and most efficient way would be to create a loop that finds the unique phrase in a file, assigns it to a variable, and uses that variable to rename the file before moving to the next file.
This seems like it should be easy, so much so that I feel silly writing this question. I’ve spent the last few hours reading documentation and searching StackOverflow, but it doesn’t seem like anyone has quite had this issue before, or at least they haven’t asked about it.
Can anyone point me in the right direction?
EDIT 1: When I create the regex pattern using this website, it creates bulky but seemingly workable code:
import re
txt='No. 09-1159'
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
name = m.group(0)
print(name)
When I manipulate that to fit the glob.glob structure, and make it like this:
import glob
import os
import re
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
for fname in glob.glob(r"\file\structure\here\*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = rg.search(contents)
    print(tname)
Then this prints out a match object for the pattern, which suggests that the regex pattern is correct. However, when I add the line nname = tname.group(0) after the original tname = rg.search(contents) and change the print call to reflect the change, it gives me the following error: AttributeError: 'NoneType' object has no attribute 'group'. When I tried copying and pasting @joaquin’s code line for line, it came up with the same error. I was going to post this as a comment on @spatz’s answer, but I wanted to include so much code that this seemed to be a better way to express the ‘new’ problem. Thank you all for the help so far.
Edit 2: This is for the @joaquin answer below:
import glob
import os
import re
for fname in glob.glob("/directory/structure/here/*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = re.search(r'No\. (\d\d\-\d\d\d\d)', contents)
    nname = tname.group(1)
    print(nname)
Last Edit: I got it to work using mostly the code as written. What was happening is that some files didn’t contain a match for the regex, and I had assumed Python would skip them. Silly me. So I spent three days learning to write two lines of code (I know the lesson is more than that). I also used the error catching method recommended here. I wish I could check all of you as the answer, but I bothered @Joaquin the most so I gave it to him. This was a great learning experience. Thank you all for being so generous with your time. The final code is below.
import os
import re
# "No. " followed by 2+4 digits separated by a dash; group 1 is the number
pat3 = r"No\. (\d\d-\d\d\d\d)"
ext = '.txt'
mydir = '/directory/files/here'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat3, txt)
    if s is None:
        # No phrase in this file, so leave it alone
        continue
    name = s.group(1)
    # Include the extension so the existence check matches the rename target
    newpath = os.path.join(mydir, name + ext)
    if not os.path.exists(newpath):
        os.rename(archpath, newpath)
    else:
        print('{} already exists, passing'.format(newpath))
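The s is None check is what prevents the AttributeError described earlier: re.search returns None when the pattern does not occur anywhere in the text, and None has no group method. A minimal sketch of that behaviour, assuming the 2+4 digit format from the question; the extract_number helper and both sample strings are made up for illustration:

```python
import re

# Assumed pattern: "No. " followed by 2+4 digits separated by a dash.
PATTERN = re.compile(r"No\. (\d\d-\d\d\d\d)")


def extract_number(text):
    """Return the captured number, or None if the phrase is absent."""
    match = PATTERN.search(text)
    if match is None:
        # Calling match.group(1) here would raise AttributeError.
        return None
    return match.group(1)


with_phrase = "Some header\nNo. 85-2156\nBody text"      # hypothetical content
without_phrase = "A file that never mentions a number"   # hypothetical content

print(extract_number(with_phrase))
print(extract_number(without_phrase))
```

Returning None from the helper leaves the decision to the caller, which can skip the file (as the loop above does) or log it for manual review.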
There is no checking or protection for failures (checking whether archpath is a file, whether newpath already exists, whether the search is successful, etc.), but this should work:
Edit: I tested the regex to show how it works:
The regex is very simple:
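Assuming the pattern is the one quoted in the question's Edit 2 (written here as a raw string and without the unneeded backslash before the dash), a quick check of what it matches and captures might look like this. The sample sentence is made up:

```python
import re

pattern = r"No\. (\d\d-\d\d\d\d)"   # assumed pattern, adapted from Edit 2
sample = "Docket No. 09-1159 was argued in March"  # hypothetical text

m = re.search(pattern, sample)
whole = m.group(0)    # the full match, including the "No. " prefix
number = m.group(1)   # only the part inside the parentheses
print(whole)
print(number)
```

Here group(0) keeps the prefix while group(1) holds just the number, which is the part that makes a sensible file name.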
So, it says: search for the string "No. " followed by 2+4 decimal digits separated by a dash. The parentheses are to create a group that I can recover with s.group(1) and that contains the code number.
And that is what you get, before and after:
Text of files one.txt, two.txt and three.txt is always the same, only the number changes:
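As a sketch of such a test, the block below builds three files in a temporary directory that share the same text and differ only in the number, then renames them. Everything in it is an assumption made for illustration: the sentence in the files, the three numbers, and the rename_by_number helper, whose logic follows the final code in the question.

```python
import os
import re
import tempfile

PATTERN = re.compile(r"No\. (\d\d-\d\d\d\d)")


def rename_by_number(directory, ext=".txt"):
    """Rename each ext file in directory after the number found inside it."""
    for entry in sorted(os.listdir(directory)):
        old_path = os.path.join(directory, entry)
        if not (os.path.isfile(old_path) and entry.endswith(ext)):
            continue
        with open(old_path) as f:
            match = PATTERN.search(f.read())
        if match is None:
            continue  # no phrase in this file, leave it alone
        new_path = os.path.join(directory, match.group(1) + ext)
        if os.path.exists(new_path):
            print("{} already exists, skipping".format(new_path))
            continue
        os.rename(old_path, new_path)


# Hypothetical test data: same text everywhere, only the number changes.
workdir = tempfile.mkdtemp()
samples = {"one.txt": "85-2156", "two.txt": "09-1159", "three.txt": "11-0042"}
for filename, number in samples.items():
    with open(os.path.join(workdir, filename), "w") as f:
        f.write("This is a test file.\nCase No. {} follows.\n".format(number))

before = sorted(os.listdir(workdir))
rename_by_number(workdir)
after = sorted(os.listdir(workdir))
print(before)
print(after)
```

Working in a temporary directory keeps the experiment away from the real files until the pattern and the collision handling have been checked.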