tl;dr: I’m looking for a way to find entries in our database which are

Question

Asked: June 4, 2026 (2026-06-04T20:02:36+00:00)

tl;dr: I’m looking for a way to find entries in our database which are missing information, getting that information from a website and adding it to the database entry.

We have a media management program which uses a MySQL table to store the information. When employees download media (video files, images, audio files) and import it into the media manager, they are supposed to also copy the description of the media (from the source website) and add it to the description in the Media Manager. However, this has not been done for thousands of files.

The file name (e.g. file123.mov) is unique, and the details page for that file can be accessed by going to a URL on the source website:

website.com/content/file123

The information we want to scrape from that page has an element ID which is always the same.

In my mind the process would be:

Connect to database and Load table

Filter: "format" is "Still Image (JPEG)"

Filter: "description" is "NULL"

Get first result

Get "FILENAME" (without extension)

Load the URL: website.com/content/FILENAME

Copy contents of the element "description" (on website)

Paste contents into the "description" (SQL entry)

Get 2nd result

Rinse and repeat until last result is reached

My questions are:

Is there software that could perform such a task, or is this something that would need to be scripted?
If scripted, what would be the best type of script (e.g. could I achieve this using AppleScript, or would it need to be written in Java, PHP, etc.)?

1 Answer

Editorial Team · Answer 1 · 2026-06-04T20:02:40+00:00

I am not aware of any existing software package that will do everything you’re looking for. However, Python can connect to your database, make web requests easily, and handle messy HTML. Assuming you already have Python installed, you’ll need three packages:

MySQLdb for connecting to the database.
Requests for making HTTP requests.
BeautifulSoup for robust parsing of HTML.

You can install these packages with pip commands or Windows installers. Appropriate instructions are on each site. The whole process won’t take more than 10 minutes.
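Before running the script, it can help to confirm that the three packages are actually importable. The following is a minimal sketch, assuming Python 3 and the import names `MySQLdb`, `requests` and `bs4`; the `missing_packages` helper is a hypothetical name used only for illustration.

```python
import importlib.util

# Import names used by the script. These are assumptions about how the
# packages are exposed (for example, BeautifulSoup 4 is imported as "bs4").
REQUIRED = ["MySQLdb", "requests", "bs4"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed: " + ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as not installed, install it before continuing.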

import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup

# Connect to the database. Fill in these fields as necessary.

con = db.connect(host='hostname', user='username', passwd='password',
                 db='dbname')

# Create and execute our SELECT sql statement.

select = con.cursor()
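# MySQLdb uses %s placeholders, and NULL must be tested with IS NULL
# because "= NULL" never matches any row.

select.execute('SELECT filename FROM table_name '
               'WHERE format = %s AND description IS NULL',
               ('Still Image (JPEG)',))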

while True:
    # Fetch a row from the result of the SELECT statement.

    row = select.fetchone()
    if row is None: break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.

    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = 'http://www.website.com/content/' + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #               
    # import time   # You can put this at the top with other imports
    # time.sleep(1) # This will wait 1 second.

    response = requests.get(url)
    if response.status_code != 200:

        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.

        print('Error accessing: %s SKIPPING' % url)
        continue

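    # Parse the result. BeautifulSoup does a great job handling
    # malformed input. Skip the page if the element is missing.

    soup = BeautifulSoup(response.content, 'html.parser')
    element = soup.find('div', {'id': 'description'})
    if element is None:
        print('No description found at: %s SKIPPING' % url)
        continue
    description = element.get_text().strip()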

    # And finally, update the database with another query, then
    # commit so that the change is actually saved.

    update = con.cursor()
    update.execute('UPDATE table_name SET description = %s '
                   'WHERE filename = %s',
                   (description, filename))
    con.commit()

I’ll warn that I’ve made a good effort to make that code “look right” but I haven’t actually tested it. You’ll need to fill in the private details.
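Since the script above is untested, one way to check the select-and-update logic without touching the real database or website is a dry run against a throwaway table. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL (so the placeholders are `?`, whereas MySQLdb expects `%s`), the rows are made up, and `details_url`, `fill_descriptions`, `demo` and `fake_fetch` are hypothetical names. Note that it tests for missing descriptions with `IS NULL`, because a comparison written as `= NULL` never matches any row in SQL.

```python
import os.path
import sqlite3


def details_url(filename):
    """Build the details-page URL from a file name such as 'file123.mov'."""
    return 'http://www.website.com/content/' + os.path.splitext(filename)[0]


def fill_descriptions(con, fetch_description):
    """Fill in missing descriptions and return the number of rows updated."""
    select = con.cursor()
    select.execute('SELECT filename FROM table_name '
                   'WHERE format = ? AND description IS NULL',
                   ('Still Image (JPEG)',))
    updated = 0
    for (filename,) in select.fetchall():
        description = fetch_description(details_url(filename))
        if description is None:
            continue  # Skip pages that could not be read.
        con.execute('UPDATE table_name SET description = ? '
                    'WHERE filename = ?',
                    (description, filename))
        updated += 1
    con.commit()
    return updated


def demo():
    """Run the loop against an in-memory table holding made-up rows."""
    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE table_name '
                '(filename TEXT, format TEXT, description TEXT)')
    con.executemany('INSERT INTO table_name VALUES (?, ?, ?)', [
        ('file123.jpg', 'Still Image (JPEG)', None),
        ('file124.jpg', 'Still Image (JPEG)', 'already described'),
        ('file125.mov', 'Video', None),
    ])

    # Hypothetical stand-in for the web request and HTML parsing.
    def fake_fetch(url):
        return 'description from ' + url

    count = fill_descriptions(con, fake_fetch)
    rows = con.execute('SELECT filename, description FROM table_name '
                       'ORDER BY filename').fetchall()
    return count, rows
```

Calling `demo()` should update only the JPEG row whose description was missing, leaving the already-described row and the video row untouched. Passing the fetch step in as a function keeps the database logic testable without any network access.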
