Could you help me figure out a regular expression that would extract the following from a URL:
- the host name, when there is no folder specified in the path that follows it, e.g.
  'http://jj.com/' -> 'jj.com'
  'http://jj.com/index.php' -> 'jj.com'
  'http://jj.com/query?q=http://kk.uk' -> 'jj.com'
- the host name + one folder from the path, when there is at least one folder specified in the path, e.g.
  'http://jj.com/site/index.php' -> 'jj.com/site'
  'http://jj.com/site/second/aldldls.html' -> 'jj.com/site'
Is it possible to do that with just one regular expression?
BTW, I will be using the regexp_extract function from Hive, but any regex flavor (e.g. Perl regex) that can do this would be extremely useful.
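For illustration, here is one possible single-pattern approach, sketched in Python and checked only against the example URLs above. It assumes the URL starts with http:// or https://, and it treats a path segment as a "folder" only when a slash follows it. The names URL_PREFIX and host_and_first_folder are made up for this sketch.

```python
import re

# Group 1 captures the host, plus the first path segment only when that
# segment is followed by another "/" (i.e. it is a folder, not a file).
# [^/?#]+ stops at "/", "?" and "#", so a URL embedded in the query
# string cannot be mistaken for part of the path.
URL_PREFIX = re.compile(r'^https?://([^/?#]+(?:/[^/?#]+(?=/))?)')

def host_and_first_folder(url):
    """Return 'host' or 'host/folder', or None if the URL does not match."""
    m = URL_PREFIX.match(url)
    return m.group(1) if m else None

if __name__ == "__main__":
    for u in ("http://jj.com/",
              "http://jj.com/index.php",
              "http://jj.com/query?q=http://kk.uk",
              "http://jj.com/site/index.php",
              "http://jj.com/site/second/aldldls.html"):
        print(u, "->", host_and_first_folder(u))
```

The pattern uses only a non-capturing group and a lookahead, both of which Java and Perl regexes also support, so something like regexp_extract(url, '^https?://([^/?#]+(?:/[^/?#]+(?=/))?)', 1) may work in Hive as well. I have not tested that in Hive.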