I am new to python and was wondering if there was a better solution to match all forms of URLs that might be found in a given string. Upon googling, there seems to a lot of solutions that extract domains, replace it with links etc, but none that removes / deletes them from a string. I have mentioned some examples below for reference. Thanks!
str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|
(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)
print '==' + URLless_string + '=='
Error Log:
C:\Python27>python test.py
File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Include encoding line at the top of your source file(the regex string contains non-ascii symbols like
»), e.g.:Also surround your regex string in triple single(or double)quotes –
'''or"""instead of single as this string already contains quote symbols itself('and").