O community, I’m in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (web-server, not my own) by reading the page’s source code.
This is assuming that:
- Lyrics are being displayed in plaintext
- Portion of source code containing lyrics is readable by Java front-end application
I’m not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could set the pointer of the URL object to the appropriate webpage, right? The only limitations would be irregular capitalization, and would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for “Buck 65”, singing “I look good”
- URL url = new URL(http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html);
- I could substitute “buck-65-lyrics” & “i-look-good-lyrics” to reflect user input?
- Input re-directed to PostgreSQL table
Current objective:
- User will request name of {song, artist, album}, Java front-end will query remote webpage
- Full source code (containing plaintext) will be extracted with Java front-end
- Lyrics will be extracted from source code (somehow)
- If song is not currently indexed by PostgreSQL server, will be added to table.
- Operations will be made on the plaintext to suit the objectives of the program
I’m only looking for direction. If I’m headed completely in the wrong direction, please let me know. This is only for the pseudocode. I’m not looking for answers, or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I’m trying to accomplish?
Thanks, Tyler
This approach is referred to as screen or data scraping. Note that employing it often breaks the target service’s terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but a few things to keep in mind.