I’m trying to create a script that takes a .txt file with one YouTube username per line, appends each username to the YouTube user homepage URL, and crawls the resulting pages to get profile data.
The code below gives me the info I want for one user, but I have no idea where to start for importing and iterating through multiple URLs.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
import urllib2
# download the page
response = urllib2.urlopen("http://youtube.com/user/alxlvt")
html = response.read()
# create a beautiful soup object
soup = BeautifulSoup(html)
# find the profile info & display it
profileinfo = soup.findAll("div", { "class" : "user-profile-item" })
for info in profileinfo:
    print info.get_text()
Does anyone have any recommendations?
E.g., if I had a .txt file that read:
username1
username2
username3
etc.
How could I go about iterating through those, appending them to http://youtube.com/user/%s, and creating a loop to pull all the info?
If you don’t want to use an actual scraping module (like Scrapy, mechanize, Selenium, etc.), you can just keep iterating on what you’ve written.
Use "for line in file_obj" to go line by line in a document.
Use + to join the base URL and each username, but you can also use a concatenation function.
Make a list of URLs: this will let you stagger your requests, so you can do compassionate screen scraping.
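Putting those three tips together, here is a minimal sketch rather than a definitive implementation. It is written for Python 3, so urllib.request stands in for the urllib2 used in the question. The function names (read_usernames, build_urls, crawl, main), the file name usernames.txt, and the 2-second delay are all my own assumptions for illustration. The "user-profile-item" selector is copied from the question and may not match whatever YouTube serves today.

```python
import time
import urllib.request

# URL pattern taken from the question.
BASE_URL = "http://youtube.com/user/%s"


def read_usernames(path):
    """Return the non-empty lines of a text file, one username per line."""
    with open(path) as file_obj:
        # Iterating over the file object yields one line at a time.
        return [line.strip() for line in file_obj if line.strip()]


def build_urls(usernames):
    """Build the full list of profile URLs up front."""
    return [BASE_URL % name for name in usernames]


def fetch(url):
    """Download one page and return the raw HTML bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def profile_items(html):
    """Extract the profile text using the selector from the question."""
    # Third-party dependency, imported here so the URL-building
    # helpers above work even without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": "user-profile-item"})
    return [div.get_text() for div in divs]


def crawl(urls, delay=2.0, fetch=fetch, parse=profile_items):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # stagger requests to be kind to the server
        results[url] = parse(fetch(url))
    return results


def main(path="usernames.txt"):
    """Hypothetical entry point: print the profile text for every user."""
    urls = build_urls(read_usernames(path))
    for url, items in crawl(urls).items():
        print(url)
        for text in items:
            print(text)
```

Calling main() with the path to your .txt file runs the whole thing. Building the URL list first, instead of fetching inside the file loop, is what makes it easy to add the pause between requests. A user whose page fails to load will raise an exception here, so you would probably want a try/except around the fetch call in real use.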
EDIT: Andrew G’s string format is clearer. 🙂