I'm new to programming. I wrote a Python script (with help from Stack Overflow) that reads the XML files in a directory, uses BeautifulSoup to copy the text of one element, and writes that text to an output directory. Here is my code:
from bs4 import BeautifulSoup
import os, sys
source_folder='/Python27/source_xml'
for article in os.listdir(source_folder):
    soup=BeautifulSoup(open(source_folder+'/'+article))
    # I think this strips any tags that are nested in the sample_tag
    clean_text=soup.sample_tag.get_text(" ",strip=True)
    # This grabs an ID which I use as the output file name
    article_id=soup.article_id.get_text(" ",strip=True)
    with open(article_id,"wb") as f:
        f.write(clean_text.encode("UTF-8"))
The problem is that I need this to run on 100,000 XML files. I started the program last night and, by my estimate, it took about 15 hours to run. Is there a way, instead of reading and writing one file at a time, to read and write 100 or 1,000 at a time, for example?
The multiprocessing module is your friend. You can automatically scale the code across all your CPUs with a process pool like this:
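A minimal sketch of such a pool, assuming the loop body from the question is moved into a function so that each worker process can call it. The names process_article, list_articles, convert_all and OUTPUT_FOLDER are hypothetical, and the parser argument and the chunksize value are assumptions rather than tuned settings:

```python
import os
from multiprocessing import Pool

SOURCE_FOLDER = '/Python27/source_xml'   # path from the question
OUTPUT_FOLDER = '/Python27/output_text'  # hypothetical output directory


def list_articles(source_folder):
    """Return the full path of every regular file in source_folder, sorted."""
    paths = (os.path.join(source_folder, name)
             for name in os.listdir(source_folder))
    return sorted(p for p in paths if os.path.isfile(p))


def process_article(path):
    """Extract the text of one XML file and write it out (runs in a worker)."""
    # Third-party import kept local so the rest of the sketch needs only
    # the standard library.
    from bs4 import BeautifulSoup

    with open(path, 'rb') as f:
        # Parser choice is an assumption; the question uses bs4's default.
        soup = BeautifulSoup(f, 'html.parser')
    clean_text = soup.sample_tag.get_text(' ', strip=True)
    article_id = soup.article_id.get_text(' ', strip=True)
    with open(os.path.join(OUTPUT_FOLDER, article_id), 'wb') as out:
        out.write(clean_text.encode('UTF-8'))
    return article_id


def convert_all(source_folder, processes=None, chunksize=100):
    """Run process_article over a pool; processes=None means one per CPU."""
    pool = Pool(processes)
    try:
        # imap_unordered hands batches of files to whichever worker is free.
        results = pool.imap_unordered(
            process_article, list_articles(source_folder), chunksize)
        done = sum(1 for _ in results)
    finally:
        pool.close()
        pool.join()
    return done


if __name__ == '__main__':
    # The guard keeps worker processes from re-running this block on import.
    if os.path.isdir(SOURCE_FOLDER):
        if not os.path.isdir(OUTPUT_FOLDER):
            os.makedirs(OUTPUT_FOLDER)
        print('converted %d files' % convert_all(SOURCE_FOLDER))
    else:
        print('source folder not found: %s' % SOURCE_FOLDER)
```

How much this helps depends on where the time goes: if most of the 15 hours is spent parsing, the speedup should be roughly proportional to the number of cores, but if the disk is the bottleneck, extra processes will gain much less.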