I have downloaded a large number of HTML files and stored them on disk. Now I need to read their contents, extract the data I need, and persist it to MySQL.
I currently load the files one by one in the traditional way, which is inefficient: it takes nearly 8 minutes.
Any advice is welcome.
g_fields = [
    'name',
    'price',
    'productid',
    'site',
    'link',
    'smallImage',
    'bigImage',
    'description',
    'createdOn',
    'modifiedOn',
    'size',
    'weight',
    'wrap',
    'material',
    'packagingCount',
    'stock',
    'location',
    'popularity',
    'inStock',
    'categories',
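]

import glob
import mmap
import os.path
from pyquery import PyQuery as pq  # assumption: pq is pyquery's PyQuery

@cost_time
def batch_xml2csv():
    "Batch-import the XML files into a single CSV file."
    delete(g_xml2csv_file)
    f = open(g_xml2csv_file, "a")
    for file in glob.glob(g_filter):
        print "Reading %s" % file
        # Read the whole file through mmap, then parse it.
        ff = open(file, "r+")
        size = os.path.getsize(file)
        data = mmap.mmap(ff.fileno(), size)
        s = pq(data.read(size))
        data.close()
        ff.close()
        #s=pq(open(file,"r").read())
        line = []
        for field in g_fields:
            r = s("field[@name='%s']" % field).text()
            if r is None:
                # \N is how MySQL's LOAD DATA represents NULL.
                line.append("\\N")
            else:
                # Escape embedded quotes with a backslash ('\"' alone is just '"').
                line.append('"%s"' % r.replace('"', '\\"'))
        # One CSV row per input file.
        f.write(",".join(line) + "\n")
    f.close()
    print "done!"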
I tried mmap, but it didn't seem to help.
If you've got 25,000 text files on disk, 'you're doing it wrong'. Depending on how they are stored on disk, the slowness could literally be the disk seeking to find each file.
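A rough way to check whether disk access or parsing dominates is to time the two stages separately. This is only a sketch in Python 3 (the code in the question is Python 2), and `time_stages` and `read_bytes` are hypothetical helper names, not part of the original script.

```python
import time


def read_bytes(path):
    """Read one file from disk and return its raw bytes."""
    with open(path, "rb") as fh:
        return fh.read()


def time_stages(paths, read, parse):
    """Return (read_seconds, parse_seconds) summed over all paths.

    `read` takes a path and returns the file contents;
    `parse` takes those contents and does whatever extraction is needed.
    """
    read_s = 0.0
    parse_s = 0.0
    for path in paths:
        t0 = time.perf_counter()
        raw = read(path)
        t1 = time.perf_counter()
        parse(raw)
        t2 = time.perf_counter()
        read_s += t1 - t0
        parse_s += t2 - t1
    return read_s, parse_s
```

Passing the script's own parsing step as `parse` would show which of the two totals accounts for most of the 8 minutes. If the read total is small, changing how the bytes are loaded (mmap or otherwise) is unlikely to help much.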
If you've got 25,000 of anything, it'll be faster to put it in a database with an intelligent index. Even if you make the filename the index field, it'll be faster.
If you have multiple directories that descend N levels deep, a database would still be faster.
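As an illustration of the database idea, here is a minimal sketch that loads a directory tree into SQLite with the relative path as the indexed key. SQLite, the `pages` table, and the `load_tree` / `iter_pages` names are all assumptions made for the example; the same approach would apply to MySQL or any other database. It is written for Python 3.

```python
import os
import sqlite3


def load_tree(conn, root):
    """Store every file under root as a row keyed by its relative path.

    Returns the number of files stored.
    """
    # PRIMARY KEY on path gives an index on the filename.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, body BLOB)"
    )
    count = 0
    with conn:  # commit the whole batch as one transaction
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                with open(full, "rb") as fh:
                    body = fh.read()
                rel = os.path.relpath(full, root)
                conn.execute(
                    "INSERT OR REPLACE INTO pages (path, body) VALUES (?, ?)",
                    (rel, body),
                )
                count += 1
    return count


def iter_pages(conn):
    """Return a cursor over (path, body) rows in path order."""
    return conn.execute("SELECT path, body FROM pages ORDER BY path")
```

After a one-time `load_tree(sqlite3.connect("pages.db"), root)`, later runs can loop over `iter_pages(conn)` instead of opening each file, so nested directories no longer matter to the extraction step.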