New to programming. I wrote a Python code (with help from stackoverflow) to read

Question 1

I'm new to programming. I wrote a Python script (with help from Stack Overflow) that reads XML files in a directory, uses BeautifulSoup to copy the text of an element, and writes it to an output directory. Here is my code:

from bs4 import BeautifulSoup
import os, sys

source_folder='/Python27/source_xml'

for article in os.listdir(source_folder):
    soup=BeautifulSoup(open(source_folder+'/'+article))

    #I think this strips any tags that are nested in the sample_tag
    clean_text=soup.sample_tag.get_text("  ",strip=True)

    #This grabs an ID which I used as the output file name
    article_id=soup.article_id.get_text("  ",strip=True)

    with open(article_id,"wb") as f:
        f.write(clean_text.encode("UTF-8"))

The problem is that I need this to run on 100,000 XML files. I started the program last night and, by my estimate, it took 15 hours to run. Instead of reading and writing one file at a time, is there a way to read and write 100 or 1,000 at a time, for example?

Editorial Team · Answer 1 · 2026-06-06T05:46:06+00:00

The multiprocessing module is your friend. You can automatically scale the code across all your CPUs with a process pool like this:

from bs4 import BeautifulSoup
import os
from multiprocessing import Pool

source_folder = '/Python27/source_xml'

def extract_text(article):
    # Parse one XML file; the with-block closes the file handle afterwards
    with open(os.path.join(source_folder, article)) as xml_file:
        soup = BeautifulSoup(xml_file)

    # Text of sample_tag with nested tags removed, pieces joined by two spaces
    clean_text = soup.sample_tag.get_text("  ", strip=True)

    # The article ID is used as the output file name
    article_id = soup.article_id.get_text("  ", strip=True)

    with open(article_id, "wb") as f:
        f.write(clean_text.encode("UTF-8"))

def main():
    # Pool() starts one worker process per CPU by default
    pool = Pool()
    pool.map(extract_text, os.listdir(source_folder))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
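
On the part of the question about handling 100 or 1,000 files at a time: the pool methods accept a chunksize argument that hands tasks to each worker in batches, which may reduce inter-process overhead when there are many small tasks. Here is a minimal sketch of that idea. Note that process_one is a hypothetical stand-in for extract_text so the example runs without any XML files, and the worker count and chunk size of 100 are arbitrary assumptions you would want to tune:

```python
from multiprocessing import Pool


def process_one(name):
    # Hypothetical stand-in for extract_text: the real function would
    # parse one XML file and write one output file.
    return name.upper()


def process_all(names, workers=2, chunksize=100):
    # chunksize controls how many tasks are sent to a worker at once.
    pool = Pool(workers)
    try:
        # imap_unordered yields results as workers finish, in no fixed
        # order, so the results are sorted before being returned.
        return sorted(pool.imap_unordered(process_one, names, chunksize))
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    files = ['article_%d.xml' % i for i in range(1000)]
    done = process_all(files)
    print(len(done))
```

Whether batching helps depends on where the time goes. If most of the 15 hours is spent parsing, more processes should help; if the disk is the bottleneck, the gain may be small.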
