I’m currently creating a Node.js webscraper/proxy, but I’m having trouble parsing relative Urls found in the scripting part of the source, I figured REGEX would do the trick.
Although it is unknown how I would achieve this.
Is there anyway I can go about this?
Also I’m open to an easier way of doing this, as I’m quite baffle about how other proxies parse websites. I figured that most are just glorified site scrapers that can read a site’s source a relay all links/forms back to the proxy.
Advanced HTML string replacement functions
Note for OP, because he requested such a function: Change
base_urlto your proxy’s basE URL in order to achieve the desired results.Two functions will be shown below (the usage guide is contained within the code). Make sure that you don’t skip any part of the explanation of this answer to fully understand the function’s behaviour.
rel_to_abs(urL)– This function returns absolute URLs. When an absolute URL with a commonly trusted protocol is passed, it will immediately return this URL. Otherwise, an absolute URL is generated from thebase_urland the function argument. Relative URLs are correctly parsed (../;./;.;//).replace_all_rel_by_abs– This function will parse all occurences of URLs which have a significant meaning in HTML, such as CSSurl(), links and external resources. See the code for a full list of parsed instances. See this answer for an adjusted implementation to sanitise HTML strings from an external source (to embed in the document).rel_to_abs– Parsing relative URLsCases / examples:
http://foo.bar. Already an absolute URL, thus returned immediately./dooRelative to the root: Returns the current root + provided relative URL../mehRelative to the current directory.../boohRelative to the parent directory.The function converts relative paths to
../, and performs a search-and-replace (http://domain/sub/anything-but-a-slash/../metohttp://domain/sub/me).replace_all_rel_by_abs– Convert all relevant occurences of URLsURLs inside script instances (
<script>, event handlers are not replaced, because it’s near-impossible to create a fast-and-secure filter to parse JavaScript.This script is served with some comments inside. Regular Expressions are dynamically created, because an individual RE can have a size of 3000 characters.
<meta http-equiv=refresh content=.. >can be obfuscated in various ways, hence the size of the RE.A short summary of the private functions:
rel_to_abs(url)– Converts relative / unknown URLs to absolute URLsreplace_all_rel_by_abs(html)– Replaces all relevant occurences of URLs within a string of HTML by absolute URLs.ae– Any Entity – Returns a RE-pattern to deal with HTML entities.by– replace by – This short function request the actual url replace (rel_to_abs). This function may be called hundreds, if not thousand times. Be careful to not add a slow algorithm to this function (customisation).cr– Create Replace – Creates and executes a search-and-replace.Example:
href="..."(within any HTML tag).cri– Create Replace Inline – Creates and executes a search-and-replace.Example:
url(..)within the allstyleattribute within HTML tags.Test case
Open any page, and paste the following bookmarklet in the location bar:
The injected code contains the two functions, as defined above, plus the test case, shown below. Note: The test case does not modify the HTML of the page, but shows the parsed results in a textarea (optionally).
See also: