Replacing URLs with links… halp?
I found this crazy regex for matching URLs in python. Even though some very smart people went through the trouble of concocting that regex, I can’t seem to use it for all test cases… hmm.
I’m trying to turn bare urls into links for our Twitter sidebar widget, and it almost works, except for the last crazy case below (query string + anchor):
#!/usr/bin/python
import re

def convertLinks(str):
    #crazy regex from Shag, based on http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
    prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')
    return prog.sub(r'<a href="\g<0>">\g<0></a>', str)

#single url
print convertLinks('http://tikirobot.net\n')

#string
print convertLinks('See http://tikirobot.net for more info.\n')

#string w/ anchor
print convertLinks('See http://tikirobot.net/#foo for more info.\n')

#two urls in a string
print convertLinks('See http://tikirobot.net or http://wikipedia.org for more info.\n')

#some test urls from flanders.co.nz
print convertLinks('This is a google search: http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1\n')
print convertLinks('ftp://joe:password@ftp.filetransferprotocal.com is a ftp url\n')
print convertLinks('There is a bare url google.ru somewhere in this sentence\n')

#test cases from shag
print convertLinks('query string: https://some-url.com/?query=&name=joe&filter=*.*\n')
print convertLinks('both a query string and an anchor with no host name separator slash: https://some-url.com?query=&name=joe?filter=*.*#some_anchor\n')
print convertLinks('both a query string and an anchor: https://some-url.com/?query=&name=joe&filter=*.*#some_anchor\n')
print convertLinks('DNS name with a concluding period: http://some-url.com./\n')
print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')
print convertLinks('single-component DNS name plus root: http://to./\n')
print convertLinks('single-component DNS name: http://to/\n')
print convertLinks('words with slashes: unread/unregistered\n')
Also note my double-grouping on the regex.. There must be a better way!
Update, figured out that the group zero backreference is \g<0> (\0 doesn’t work, so I was double-grouping so that I could use \1).
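For anyone else tripping over this, here is a minimal sketch of the two approaches side by side. The URL pattern is a deliberately tiny stand-in (an assumption for illustration, not the big regex above), and it is written for Python 3. As far as I can tell, \0 fails because Python reads it in a replacement string as an octal escape rather than as a group reference.

```python
import re

text = 'See http://tikirobot.net for more info.'

# Tiny stand-in pattern, only here to keep the example readable.
pattern = re.compile(r'https?://\S+')

# \g<0> refers to the whole match, so no extra capturing group is needed.
linked = pattern.sub(r'<a href="\g<0>">\g<0></a>', text)

# The double-grouping workaround: wrap the whole pattern in one group and use \1.
wrapped = re.compile(r'(https?://\S+)')
linked_workaround = wrapped.sub(r'<a href="\1">\1</a>', text)

print(linked)
print(linked_workaround)
# Both print:
# See <a href="http://tikirobot.net">http://tikirobot.net</a> for more info.
```

Both give the same result; \g<0> just saves the extra set of parentheses.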
Little help, regex ninjas? Update 2: Shag to the rescue!
Updates 3, 4, and 5: Shag has provided us with a even moar better regex in the comments.. Yay Shag!!!!!
Filed under: python, regex, support · 17 Comments

took a quick look. it’s not the anchor that causes it to break. it’s that there is no slash between the .com and the first question mark. haven’t looked at the RFCs but probably it is not a well-formed URL. putting in the slash into the URL makes it work. if necessary, the regex can probably be hacked to deal with that…
btw, that URL has another big problem – it’s got a second question mark in the query string after ‘joe’ that should be an ampersand.
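A quick way to check the well-formedness question without digging through the RFCs is to see what the standard library makes of that URL. This is only a sanity check in Python 3 (urllib.parse), not proof of what the RFC says, but urlsplit does accept a host followed directly by a query string and simply reports an empty path:

```python
from urllib.parse import urlsplit

# The test URL with no slash between the host and the first question mark.
parts = urlsplit('https://some-url.com?query=&name=joe?filter=*.*#some_anchor')

print(repr(parts.netloc))    # 'some-url.com'
print(repr(parts.path))      # ''
print(repr(parts.query))     # 'query=&name=joe?filter=*.*'
print(repr(parts.fragment))  # 'some_anchor'
```

The second question mark just ends up inside the query string, so a parser tolerates it, even though an ampersand was almost certainly what was meant.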
Yay!!! Thanks for the debugging help, shag :)
This is why I love the internet. You need some help and then motherfuckin’ SHAG shows up and fixes your shit! w00t!
yay!
another proj idea for the twitter sidebar: link the #whatever and the @whomever to the pertinent twitter feeds…
btw here’s a version of the regex that handles the case anyway. something else may subtly be broken by this, but at least it doesn’t show up in the test cases above.
prog = re.compile(r'(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?')
bah, guess we need to fix it for “http://to./”
here’s a version that adds support for a concluding period on the DNS name, which is valid:
prog = re.compile(r'(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?')
This passes the following test cases:
print convertLinks('DNS name with a concluding period: http://some-url.com./\n')
print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')
Supporting single DNS name component URLs like “http://to/” looks to be much harder, though, since that regex tries to support “bare addresses” like google.ru. A hacked version of the regex tends to think that every two letter word, plus TLDs that may appear in a sentence, is a valid URL. I suppose they all could be.
Anyway, if that bare URL support is removed, the project seems much easier.
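To make the trade-off concrete, here is a small Python 3 sketch. Both patterns are simplified stand-ins written for this example (hypothetical, host part only), not pieces lifted from the regexes above. The bare branch happily matches a file name, while the scheme-only branch can afford to accept a single-component host like “to” because the scheme anchors it.

```python
import re

# Simplified "bare URL" branch: dotted labels plus a TLD that is either
# from a short list or any two letters.
BARE = re.compile(r'\b(?:[-\w]+\.)+(?:com|org|net|[a-z]{2})\b')

# Simplified scheme-required branch: the host may be a single component.
WITH_SCHEME = re.compile(r'\b(?:https?|ftp)://(?:[-\w]+\.)*[a-z]+')

text = 'Save it as notes.py, then see http://to/ or google.ru for details.'

bare_hits = BARE.findall(text)
scheme_hits = WITH_SCHEME.findall(text)

print(bare_hits)    # ['notes.py', 'google.ru']
print(scheme_hits)  # ['http://to']
```

So notes.py gets linked right alongside google.ru, which is exactly the kind of false positive that goes away when bare URLs are not supported.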
Okay, this one passes all of the current test cases, plus some new ones:
prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')
It is fugly but it seems to work!
A single regex doesn’t seem like a good approach for this sort of thing; it seems better to break it up into separate regexes, conditionals, etc., from the code reuse and maintainability perspectives.
Also it seems fragile to keep a list of TLDs: it will need to change as TLDs are added, and it is only really needed for the bare URL portion of the regex…
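One way the “break it up” idea could look, as a rough Python 3 sketch rather than a drop-in replacement for convertLinks: a loose regex finds candidates that start with a scheme, plain code trims trailing punctuation, and urllib.parse.urlsplit decides whether there is a host at all. The names link_urls, CANDIDATE and TRAILING are made up for this example, there is no TLD list, and bare URLs like google.ru are deliberately not handled.

```python
import html
import re
from urllib.parse import urlsplit

# Step 1: a loose candidate pattern; anything after a scheme up to whitespace.
CANDIDATE = re.compile(r'\b(?:https?|ftp)://[^\s<>"]+')

# Step 2: punctuation that is more likely sentence text than part of the URL.
TRAILING = '.,;:!?)'

def link_urls(text):
    def repl(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING)
        tail = raw[len(url):]
        # Step 3: let the standard library judge whether a host is present.
        try:
            host = urlsplit(url).hostname
        except ValueError:
            host = None
        if not host:
            return raw
        safe = html.escape(url, quote=True)
        return '<a href="%s">%s</a>%s' % (safe, safe, tail)
    return CANDIDATE.sub(repl, text)

print(link_urls('See http://tikirobot.net.'))
# See <a href="http://tikirobot.net">http://tikirobot.net</a>.
print(link_urls('words with slashes: unread/unregistered'))
# words with slashes: unread/unregistered
```

The trimming step is a judgment call: it keeps a sentence-ending period out of the link, but it would also strip the concluding period from a URL that ends in “http://some-url.com.” and the closing parenthesis some URLs legitimately end with.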
Here are the test cases I used, most from your earlier tests, Raj:
Finally found these in the akismet spam panel.. sigh
Converting <code> to <pre lang=”python”> on the above…
This should convert the #foo and @bar…
yuck, wordpress totally mangled teh python! lame :-( that <1 rel nofollow> business should just be <1>
stupid wordpress is stupid
Thanks again, shag! The hashtags should now be linked!
Stupid WP keeps re-inserting the rel=nofollow stuff. I couldn’t find how to turn it off.. will keep looking.
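Since the hashtag code itself got eaten by WordPress, here is a rough Python 3 sketch of the idea for anyone reading along. It is not Shag’s actual code: the function name link_twitter_entities is made up, and the twitter.com search and profile URL shapes are assumptions. The lookbehinds are there so that an anchor in an already-linked URL (like /#foo) and the @ in an email address are left alone.

```python
import re

def link_twitter_entities(text):
    # Hashtags: skip "#" that follows a word character or a slash,
    # so URL anchors such as http://tikirobot.net/#foo are not touched.
    text = re.sub(
        r'(?<![\w/])#(\w+)',
        r'<a href="https://twitter.com/search?q=%23\g<1>">#\g<1></a>',
        text)
    # Mentions: skip "@" that follows a word character (email addresses).
    text = re.sub(
        r'(?<![\w/])@(\w+)',
        r'<a href="https://twitter.com/\g<1>">@\g<1></a>',
        text)
    return text

print(link_twitter_entities('thanks @shag for the #regex help'))
```

Running it after the URL pass seems like the safer order, since the URL regex would otherwise try to link the twitter.com addresses this step inserts.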
Looks like there is a bug: it turned the ‘unresponded’ in
“unread/unresponded”
into a link to /responded … hmmm ….
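A likely culprit, judging from the regexes in this thread: the earlier versions accept a lone “/” (or “~/”) in place of a real protocol, and the regex in the post no longer has those alternatives. The Python 3 sketch below reproduces the effect with two cut-down stand-in patterns (protocol and host only, written for this example), so treat it as an illustration of the cause rather than the exact regex.

```python
import re

# Stand-in for the older protocol group, which also allows "~/" or "/".
LOOSE = re.compile(r'(?:(?:ht|f)tps?://|~/|/)(?:[-\w]+\.)*[a-z]+')

# Same host part, but a real scheme is mandatory.
STRICT = re.compile(r'(?:ht|f)tps?://(?:[-\w]+\.)*[a-z]+')

text = 'words with slashes: unread/unresponded'

loose_hits = LOOSE.findall(text)
strict_hits = STRICT.findall(text)

print(loose_hits)   # ['/unresponded']
print(strict_hits)  # []
```

With the lone-slash alternative in place, the slash in the middle of the phrase counts as a protocol and the word after it counts as a host name.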
Ok, here’s a revised regex:
that handles this new test case now:
Thanks AGAIN! Updated!!