EDIT: I ran the Python profiler after commenting out the webbrowser and Firefox portions of the code, because I already knew they were going to be the slowest part. In what remains, the most time-consuming calls are re.findall and re.compile, followed by len and appending to lists.
I don’t know if I should post all of my code on here at once because I worked really hard on my program (even if it isn’t too good), so for now I’m just going to ask: how do I make my Python program faster?
I have 3 suspects right now for it being so slow:
- Maybe my computer is just slow
- Maybe my internet is too slow (sometimes my program has to download the html of web pages and then it searches through the html for a specific piece of text)
- My code is slow (too many loops maybe? something else? I’m new to this so I wouldn’t know!)
If anyone could offer me advice, I would greatly appreciate it!
Thanks!
EDIT:
My code uses lots of loops, I think. Also, for the program to work you have to be logged in to this website: http://www.locationary.com/
from urllib import urlopen
from gzip import GzipFile
from cStringIO import StringIO
import re
import urllib
import urllib2
import webbrowser
import time
from difflib import SequenceMatcher
import os

def download(url):
    s = urlopen(url).read()
    if s[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
            s = ifh.read()
    return s

for t in range(3,39):
    print t
    s = download('http://www.locationary.com/place/en/US/Utah/Provo-page' + str(t) + '/?ACTION_TOKEN=NumericAction')
    findLoc = re.compile('http://www\.locationary\.com/place/en/US/.{1,50}/.{1,50}/.{1,100}\.jsp')
    findLocL = re.findall(findLoc,s)

    W = []
    X = []
    XA = []
    Y = []
    YA = []
    Z = []
    ZA = []

    for i in range(0,25):
        b = download(findLocL[i])
        findYP = re.compile('http://www\.yellowpages\.com/')
        findYPL = re.findall(findYP,b)
        findTitle = re.compile('<title>(.*) \(\d{1,10}.{1,100}\)</title>')
        getTitle = re.findall(findTitle,b)
        findAddress = re.compile('<title>.{1,100}\((.*), .{4,14}, United States\)</title>')
        getAddress = re.findall(findAddress,b)
        if not findYPL:
            if not getTitle:
                print ""
            else:
                W.append(findLocL[i])
                b = download(findLocL[i])
                if not getTitle:
                    print ""
                else:
                    X.append(getAddress)
                    b = download(findLocL[i])
                    if not getTitle:
                        print ""
                    else:
                        Y.append(getTitle)

    sizeWXY = len(W)

    def XReplace(text, dic):
        for i, j in dic.iteritems():
            text = text.replace(i, j)
        XA.append(text)

    def YReplace(text2, dic2):
        for k, l in dic2.iteritems():
            text2 = text2.replace(k, l)
        YA.append(text2)

    for d in range(0,sizeWXY):
        old = str(X[d])
        reps = {' ':'-', ',':'', '\'':'', '[':'', ']':''}
        XReplace(old, reps)
        old2 = str(Y[d])
        YReplace(old2, reps)

    count = 0

    for e in range(0,sizeWXY):
        newYPL = "http://www.yellowpages.com/" + XA[e] + "/" + YA[e] + "?order=distance"
        v = download(newYPL)
        abc = str('<h3 class="business-name fn org">\n<a href="')
        dfe = str('" class="no-tracks url "')
        findFinal = re.compile(abc + '(.*)' + dfe)
        getFinal = re.findall(findFinal, v)
        if not getFinal:
            W.remove(W[(e-count)])
            X.remove(X[(e-count)])
            count = (count+1)
        else:
            for f in range(0,1):
                Z.append(getFinal[f])

    XA = []

    for c in range(0,(len(X))):
        aGd = re.compile('(.*), .{1,50}')
        bGd = re.findall(aGd, str(X[c]))
        XA.append(bGd)

    LenZ = len(Z)
    V = []

    for i in range(0,(len(W))):
        if i == 0:
            countTwo = 0
        gda = download(Z[i-(countTwo)])
        ab = str('"street-address">\n')
        cd = str('\n</span>')
        ZAddress = re.compile(ab + '(.*)' + cd)
        ZAddress2 = re.findall(ZAddress, gda)
        for b in range(0,(len(ZAddress2))):
            if not ZAddress2[b]:
                print ""
            else:
                V.append(str(ZAddress2[b]))
        a = str(W[i-(countTwo)])
        n = str(Z[i-(countTwo)])
        c = str(XA[i])
        d = str(V[i])
        #webbrowser.open(a)
        #webbrowser.open(n)
        m = SequenceMatcher(None, c, d)
        if m.ratio() < 0.50:
            Z.remove(Z[i-(countTwo)])
            W.remove(W[i-(countTwo)])
            countTwo = (countTwo+1)

    def ZReplace(text3, dic3):
        for p, q in dic3.iteritems():
            text3 = text3.replace(p, q)
        ZA.append(text3)

    for y in range(0,len(Z)):
        old3 = str(Z[y])
        reps2 = {':':'%3A', '/':'%2F', '?':'%3F', '=':'%3D'}
        ZReplace(old3, reps2)

    for z in range(0,len(ZA)):
        findPID = re.compile('\d{5,20}')
        getPID = re.findall(findPID,str(W[z]))
        newPID = re.sub("\D", "", str(getPID))
        finalURL = "http://www.locationary.com/access/proxy.jsp?ACTION_TOKEN=proxy_jsp$JspView$SaveAction&inPlaceID=" + str(newPID) + "&xxx_c_1_f_987=" + str(ZA[z])
        webbrowser.open(finalURL)
        time.sleep(5)

    os.system("taskkill /F /IM firefox.exe")
The first thing to do when a program is slow is to identify bottlenecks: you want to optimize the things that actually take a long time, not things that may already be fast. In Python, the most efficient way to do this is with one of the Python profilers, which are dedicated tools for performance analysis. Here is a quickstart:

    python -m cProfile -o prof.dat <program> <args>

runs your program and stores profiling information in prof.dat. Then,

    python -m pstats prof.dat

runs the profiling information analysis tool pstats. Important pstats commands include:

    sort time

which sorts functions by the time spent in them, and which you can use with a different key instead of time (cumulative, …). Another important command is

    stats

which prints statistics (or stats 10 to print the first 10 most time-consuming functions). You can obtain help with ? or help <command>.

The way to optimize your program then consists of dealing with the particular code that causes the bottlenecks. You can post the timing results and maybe get some more specific help on the sections of the program that could be most usefully optimized.
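The same kind of measurement can also be taken from inside a script with the standard cProfile and pstats modules. The following is only a minimal sketch, written for Python 3 (the posted program is Python 2). SAMPLE_HTML, find_titles and profile_call are hypothetical names made up for the illustration, and the sample text merely stands in for a downloaded page. The title pattern is the one from the question, compiled once at module level instead of inside the loop.

```python
import cProfile
import io
import pstats
import re

# Hypothetical stand-in for a downloaded page (not real data from the site).
SAMPLE_HTML = (
    "<title>Some Place (12345 Main St, Provo, United States)</title>\n" * 2000
)

# Title pattern from the question, compiled a single time at import.
TITLE_RE = re.compile(r"<title>(.*) \(\d{1,10}.{1,100}\)</title>")


def find_titles(html):
    """Collect the title captured by TITLE_RE on every line of html."""
    found = []
    for line in html.splitlines():
        found.extend(TITLE_RE.findall(line))
    return found


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Same idea as "sort time" followed by "stats 10" in the pstats shell.
    stats.sort_stats("time").print_stats(10)
    return result, out.getvalue()


if __name__ == "__main__":
    titles, report = profile_call(find_titles, SAMPLE_HTML)
    print(len(titles), "titles found")
    print(report)
```

The report lists one row per function with its call count and the time spent in it, so the rows at the top are the ones worth looking at first. Whether hoisting re.compile out of a loop helps noticeably in a given program is something only the profile of that program can show.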