Replacing URLs with links… halp?
I found this crazy regex for matching URLs in Python. Even though some very smart people went through the trouble of concocting that regex, I can’t seem to use it for all test cases… hmm.
I’m trying to turn bare urls into links for our Twitter sidebar widget, and it almost works, except for the last crazy case below (query string + anchor):
#!/usr/bin/python
import re

def convertLinks(text):
    # crazy regex from Shag, based on
    # http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
    prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')
    # \g<0> is the whole match, so the URL becomes both the href and the link text
    return prog.sub(r'<a href="\g<0>">\g<0></a>', text)

# single url
print(convertLinks('http://tikirobot.net\n'))

# string
print(convertLinks('See http://tikirobot.net for more info.\n'))

# string w/ anchor
print(convertLinks('See http://tikirobot.net/#foo for more info.\n'))

# two urls in a string
print(convertLinks('See http://tikirobot.net or http://wikipedia.org for more info.\n'))

# some test urls from flanders.co.nz
print(convertLinks('This is a google search: http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1\n'))
print(convertLinks('ftp://joe:password@ftp.filetransferprotocal.com is a ftp url\n'))
print(convertLinks('There is a bare url google.ru somewhere in this sentence\n'))

# test cases from shag
print(convertLinks('query string: https://some-url.com/?query=&name=joe&filter=*.*\n'))
print(convertLinks('both a query string and an anchor with no host name separator slash: https://some-url.com?query=&name=joe?filter=*.*#some_anchor\n'))
print(convertLinks('both a query string and an anchor: https://some-url.com/?query=&name=joe&filter=*.*#some_anchor\n'))
print(convertLinks('DNS name with a concluding period: http://some-url.com./\n'))
print(convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n'))
print(convertLinks('single-component DNS name plus root: http://to./\n'))
print(convertLinks('single-component DNS name: http://to/\n'))
print(convertLinks('words with slashes: unread/unregistered\n'))
Also note my double-grouping on the regex. There must be a better way!
Update: I figured out that the group-zero backreference is \g<0>. (\0 doesn’t work, which is why I was double-grouping so that I could use \1.)
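For anyone who hits the same wall, here is a rough sketch of the difference. The pattern `simple_url` is a deliberately tiny stand-in I made up for illustration (it is not the regex above and misses plenty of real URLs), and the three function names are hypothetical too. As far as I can tell, Python's `re` reads `\0` in a replacement string as an octal escape for a NUL character rather than as "group zero", which is why the whole-match reference has to be spelled `\g<0>`.

```python
import re

# Hypothetical, deliberately simplified stand-in pattern (NOT the big regex
# from this post): a scheme followed by anything that is not whitespace,
# an angle bracket, or a double quote.
simple_url = re.compile(r'https?://[^\s<>"]+')

def link_with_group_zero(text):
    # \g<0> refers to the entire match, so no extra parentheses are needed.
    return simple_url.sub(r'<a href="\g<0>">\g<0></a>', text)

# The double-grouping workaround: wrap the whole pattern in one capturing
# group so that it can be referenced as \1.
wrapped_url = re.compile('(' + simple_url.pattern + ')')

def link_with_group_one(text):
    return wrapped_url.sub(r'<a href="\1">\1</a>', text)

def link_with_callback(text):
    # A third option: pass a function and build the link from match.group(0).
    def make_link(match):
        url = match.group(0)
        return '<a href="%s">%s</a>' % (url, url)
    return simple_url.sub(make_link, text)

if __name__ == '__main__':
    sample = 'See http://tikirobot.net or http://wikipedia.org for more info.'
    print(link_with_group_zero(sample))
    print(link_with_group_one(sample))
    print(link_with_callback(sample))
```

All three should produce the same markup on simple input like the sample line. The callback version is handy if you later want to tweak the link text (say, truncating long URLs) without touching the regex.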
Little help, regex ninjas? Update 2: Shag to the rescue!
Updates 3, 4, and 5: Shag has provided us with an even moar better regex in the comments. Yay Shag!!!!!

