I’m new to bash scripting, and the previous answers didn’t help me.
I am trying to harvest ids from web pages and I need to parse page1, get a list of ids, and use them to parse corresponding web pages.
The thing is I’m not sure how to write the script…
Here’s what I would like to do:
- Parse url1 according to regexp. Output: list of extracted ids (101, 102, 103, etc.).
- Parse each url with an output id, for example: parse (http://someurl/101), then parse (http://someurl/102), etc.
So far, I have come up with this command:
curl 'http://subtitle.co.il/browsesubtitles.php?cs=movies' | grep -o -P '(?<=list.php\?mid=)\d+'
The command above works, and gives a list of ids.
Any advice for the next steps? Am I on the right track?
Thanks!
Your next step would probably be to loop over all the ids:
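A minimal sketch of such a loop, assuming the ids are found with the same grep pattern as in the question and that http://someurl/ stands in for the real base URL. The body only prints each generated URL, so it illustrates the iteration rather than the full solution:

```shell
# Iterate over every id found in the HTML file given as the first argument.
# "http://someurl" is a placeholder base URL taken from the question's example.
parse_url() {
    local id
    grep -o -P '(?<=list\.php\?mid=)\d+' "$1" | while read -r id; do
        # Build the URL of the page that belongs to this id.
        echo "http://someurl/$id"
    done
}
```

Calling it as parse_url file.html prints one URL per id found in file.html. Note that -P needs a grep built with PCRE support, as GNU grep usually is.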
Here we have defined a function called parse_url that iterates over all the ids it finds in a file passed as an argument (i.e. $1 is the first argument passed to the function).
We can then use the ID to generate a URL, or we can grep the URL from the same file and extract the ID from it afterwards. Note that the regular expression for finding the URL assumes that the URL has a specific format:
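As a rough illustration of the second approach, assuming the links have the form list.php?mid=<digits> (the format implied by the pattern in the question), one could grep the link first and strip the prefix afterwards. The helper names extract_urls and url_to_id are made up for this sketch:

```shell
# Print every link target of the assumed form list.php?mid=<digits>
# found in the file given as the first argument.
extract_urls() {
    grep -o -E 'list\.php\?mid=[0-9]+' "$1"
}

# Recover the id from one such link by removing everything
# up to and including "mid=".
url_to_id() {
    echo "${1##*mid=}"
}
```

If the real pages use a different link format, the pattern in extract_urls has to change accordingly.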
To download the page, we create a temporary file with the mktemp command. Since you said you’re new to bash scripting, I’ll give a quick explanation of the $(...) constructs that appear. Each one runs the command or series of commands between the parentheses, captures their standard output, and places it where the $(...) was. In this case, it is placed inside the double quotes assigned to the new_page_file variable. Therefore $new_page_file contains the name of the randomly named file created to store the downloaded page.
We can then download the URL into that temporary file, call the function to parse it, and then delete it.
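The download step could look roughly like the sketch below. The function name download_and_parse and the idea of passing the parsing command as a second argument are assumptions made for this example; only mktemp, curl, and rm are real commands:

```shell
# Download the URL in "$1" into a temporary file, run the command named
# in "$2" on that file, then delete the file again.
download_and_parse() {
    local new_page_file
    # $(mktemp) runs mktemp and substitutes the file name it prints.
    new_page_file="$(mktemp)" || return 1
    # -s silences the progress meter, -o writes the body to the given file.
    if curl -s -o "$new_page_file" "$1"; then
        "$2" "$new_page_file"
    fi
    rm -f "$new_page_file"
}
```

For instance, download_and_parse 'http://someurl/101' cat would fetch the page and print it, leaving no temporary file behind.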
To call the function initially, we download the initial URL into a file, file.html, and then call the function, passing that file name as the argument.
EDIT: Added recursion, based on Barmar’s answer.
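Putting the pieces together, a recursive version might look like the following sketch. BASE_URL is the placeholder from the question, and MAX_DEPTH and the main wrapper are assumptions added here so that the recursion cannot run forever if pages link back to each other:

```shell
BASE_URL="http://someurl"   # placeholder base URL from the question
MAX_DEPTH=2                 # assumed limit on how deep the recursion goes

# Parse the file in "$1", print each id found, then download the page
# for that id and parse it the same way. "$2" is the current depth.
parse_url() {
    local file="$1" depth="${2:-0}" id new_page_file
    [ "$depth" -ge "$MAX_DEPTH" ] && return 0
    for id in $(grep -o -P '(?<=list\.php\?mid=)\d+' "$file"); do
        echo "$id"
        new_page_file="$(mktemp)" || return 1
        if curl -s -o "$new_page_file" "$BASE_URL/$id"; then
            parse_url "$new_page_file" "$((depth + 1))"
        fi
        rm -f "$new_page_file"
    done
}

# Download the initial page into file.html and start parsing from there.
main() {
    curl -s -o file.html 'http://subtitle.co.il/browsesubtitles.php?cs=movies'
    parse_url file.html
}
```

Adding a final line that just says main would start the crawl. The depth limit is only one way to bound the recursion; keeping a list of ids already visited would be another.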
Hope this helps a little =)