Replacing URLs with links… halp?

I found this crazy regex for matching URLs in Python. Even though some very smart people went through the trouble of concocting that regex, I can’t seem to get it to work for all test cases… hmm.

I’m trying to turn bare URLs into links for our Twitter sidebar widget, and it almost works, except for the last crazy case below (query string + anchor):

#!/usr/bin/python
 
import re
 
def convertLinks(str):
    #crazy regex from Shag, based on http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
 
    prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')
 
    return prog.sub(r'<a href="\g<0>">\g<0></a>', str)
 
#single url
print convertLinks('http://tikirobot.net\n')
 
#string
print convertLinks('See http://tikirobot.net for more info.\n')
 
#string w/ anchor
print convertLinks('See http://tikirobot.net/#foo for more info.\n')
 
#two urls in a string
print convertLinks('See http://tikirobot.net or http://wikipedia.org for more info.\n')
 
#some test urls from flanders.co.nz
print convertLinks('This is a google search: http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1\n')
 
print convertLinks('ftp://joe:password@ftp.filetransferprotocal.com is a ftp url\n')
 
print convertLinks('There is a bare url google.ru somewhere in this sentence\n')
 
#test cases from shag
print convertLinks('query string: https://some-url.com/?query=&name=joe&filter=*.*\n')
 
print convertLinks('both a query string and an anchor with no host name separator slash: https://some-url.com?query=&name=joe?filter=*.*#some_anchor\n')
 
print convertLinks('both a query string and an anchor: https://some-url.com/?query=&name=joe&filter=*.*#some_anchor\n')
 
print convertLinks('DNS name with a concluding period: http://some-url.com./\n')
 
print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')
 
print convertLinks('single-component DNS name plus root: http://to./\n')
 
print convertLinks('single-component DNS name: http://to/\n')
 
print convertLinks('words with slashes: unread/unregistered\n')

Also note my double-grouping on the regex… There must be a better way!
Update: I figured out that the group zero backreference is \g<0> (\0 doesn’t work, so I was double-grouping so that I could use \1).
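
To make the difference concrete, here is a small sketch of the two approaches. The `https?://\S+` pattern and the function names are illustrative assumptions, much cruder than the regex in this post. It is written with Python 3's print(), but the re calls should be the same on Python 2:

```python
import re

# Stand-in pattern for illustration only; far cruder than the real URL regex.
PLAIN = re.compile(r'https?://\S+')
# The same pattern wrapped in an extra capturing group (the "double-grouping").
GROUPED = re.compile(r'(https?://\S+)')

def link_with_group_zero(text):
    # \g<0> names the whole match, so the pattern needs no extra group.
    return PLAIN.sub(r'<a href="\g<0>">\g<0></a>', text)

def link_with_double_grouping(text):
    # \1 only works because of the wrapping parentheses in GROUPED.
    return GROUPED.sub(r'<a href="\1">\1</a>', text)

sample = 'See http://tikirobot.net for more info.'
print(link_with_group_zero(sample))
print(link_with_double_grouping(sample))
```

Both calls should print the same linked sentence. As far as the re module's replacement syntax goes, a plain `\0` is read as an octal escape (a NUL character) rather than a backreference, which would explain why it didn't work.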

Little help, regex ninjas? Update 2: Shag to the rescue!

Updates 3, 4, and 5: Shag has provided us with a even moar better regex in the comments.. Yay Shag!!!!!
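
For the test case that originally failed (no slash between the host name and the `?`), it can help to see how Python's own URL parser splits the string. This is only a sketch and was not part of the original debugging. It assumes Python 3, where the functions live in `urllib.parse` (on Python 2 they are in the `urlparse` module):

```python
from urllib.parse import urlsplit, parse_qsl

# The troublesome test URL from the post, copied as-is.
url = 'https://some-url.com?query=&name=joe?filter=*.*#some_anchor'

parts = urlsplit(url)
print(parts.netloc)      # host name
print(repr(parts.path))  # empty string: there is no slash after the host
print(parts.query)       # everything between the first '?' and the '#'
print(parts.fragment)    # the anchor

# keep_blank_values=True keeps the empty "query=" parameter in the result.
pairs = parse_qsl(parts.query, keep_blank_values=True)
print(pairs)
```

The standard-library parser accepts the URL with an empty path, and the second `?` simply ends up inside the value of `name`. So a URL like this may be less malformed than it looks, even though it is probably not what its author meant.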

17 Responses to “Replacing URLs with links… halp?”

  1. shag
    January 28th, 2010 | 6:03 pm

    took a quick look. it’s not the anchor that causes it to break. it’s that there is no slash between the .com and the first question mark. haven’t looked at the RFCs but probably it is not a well-formed URL. putting in the slash into the URL makes it work. if necessary, the regex can probably be hacked to deal with that…

    btw, that URL has another big problem – it’s got a second question mark in the query string after ‘joe’ that should be an ampersand.

  2. January 28th, 2010 | 7:25 pm

    Yay!!! Thanks for the debugging help, shag :)

    This is why I love the internet. You need some help and then motherfuckin’ SHAG shows up and fixes your shit! w00t!

  3. shag
    January 29th, 2010 | 3:47 pm

    yay!

    another proj idea for the twitter sidebar: link the #whatever and the @whomever to the pertinent twitter feeds…

  4. shag
    January 30th, 2010 | 5:48 pm

    btw here’s a version of the regex that handles the case anyway. something else may subtly be broken by this, but at least it doesn’t show up in the test cases above.

    prog = re.compile(r'(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?')

  5. shag
    February 2nd, 2010 | 4:31 pm

    bah, guess we need to fix it for “http://to./”

  6. shag
    February 2nd, 2010 | 5:50 pm

    here’s a version that adds support for a concluding period on the DNS name, which is valid:

    prog = re.compile(r'(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?')

    This passes the following test cases:


    print convertLinks('DNS name with a concluding period: http://some-url.com./\n')


    print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')

    Supporting single DNS name component URLs like “http://to/” looks to be much harder, though, since that regex tries to support “bare addresses” like google.ru. A hacked version of the regex tends to think that every two-letter word, plus TLDs that may appear in a sentence, is a valid URL. I suppose they all could be.

    Anyway, if that bare URL support is removed, the project seems much easier.

  7. shag
    February 4th, 2010 | 11:46 am

    Okay, this one passes all of the current test cases, plus some new ones:

    prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')

    It is fugly but it seems to work!

    Single regexes don’t seem like a good approach for this sort of thing; it seems better to break it up into separate regexes, conditionals, etc., from the code reuse and maintainability perspectives.

    Also, it seems fragile to keep a list of TLDs: it will need to change as TLDs are added, and it is only really needed for the bare URL portion of the regex…

  8. shag
    February 4th, 2010 | 11:49 am

    Here are the test cases I used, most from your earlier tests Raj:

     
    #single url        
    print convertLinks('http://tikirobot.net\n')
     
    #string
    print convertLinks('See http://tikirobot.net for more info.\n')
     
    #string w/ anchor
    print convertLinks('See http://tikirobot.net/#foo for more info.\n')
     
    #two urls in a string
    print convertLinks('See http://tikirobot.net or http://wikipedia.org for more info.\n')
     
    #some test urls from flanders.co.nz
    print convertLinks('This is a google search: http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1\n')
     
    print convertLinks('ftp://joe:password@ftp.filetransferprotocal.com is a ftp url\n')
     
    print convertLinks('There is a bare url google.ru somewhere in this sentence\n')
     
    print convertLinks('query string: https://some-url.com/?query=&name=joe&filter=*.*\n')
     
    print convertLinks('both a query string and an anchor with no host name separator slash: https://some-url.com?query=&name=joe?filter=*.*#some_anchor\n')
     
    print convertLinks('both a query string and an anchor: https://some-url.com/?query=&name=joe&filter=*.*#some_anchor\n')
     
    print convertLinks('DNS name with a concluding period: http://some-url.com./\n')
     
    print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')
     
    print convertLinks('single-component DNS name plus root: http://to./\n')
     
    print convertLinks('single-component DNS name: http://to/\n')

  9. February 4th, 2010 | 12:13 pm

    Finally found these in the akismet spam panel.. sigh

  10. February 4th, 2010 | 12:16 pm

    Converting <code> to <pre lang=”python”> on the above…

  11. shag
    February 4th, 2010 | 1:47 pm

    This should convert the #foo and @bar…

    #!/usr/bin/python
     
    import re
     
    def convertTopicsAndTwitterers(str):
        prog = re.compile(r'#(\w+)')
        str = prog.sub(r'#<a href="http://twitter.com/search?q=%23\g<1>">\g<1></a>', str)
        prog2 = re.compile(r'@(\w+)')
        return prog2.sub(r'@<a href="http://twitter.com/\g<1>">\g<1></a>', str)
     
    #
    print convertTopicsAndTwitterers('@peliom: #svlug foo #bar')

  12. shag
    February 4th, 2010 | 1:50 pm

    yuck, wordpress totally mangled teh python! lame :-( that <1 rel nofollow> business should just be <1>

  13. February 4th, 2010 | 2:52 pm

    stupid wordpress is stupid

  14. February 4th, 2010 | 4:28 pm

    Thanks again, shag! The hashtags should now be linked!

    Stupid WP keeps re-inserting the rel=nofollow stuff. I couldn’t find how to turn it off.. will keep looking.

  15. shag
    February 8th, 2010 | 5:07 pm

    Looks like there is a bug: it turned the ‘unresponded’ in

    “unread/unresponded”

    into a link to /responded … hmmm ….

  16. shag
    February 8th, 2010 | 8:46 pm

    Ok, here’s a revised regex:

        prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&amp;(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&amp;(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')

    that handles this new test case now:

    print convertLinks('words with slashes: unread/unregistered\n')

  17. February 25th, 2010 | 12:34 pm

    Thanks AGAIN! Updated!!
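
Shag's suggestions in comments 6 and 7 (drop bare-URL support, and split the work between a simpler regex and ordinary conditionals) might look something like the sketch below. It is not the regex used on this site. The names `CANDIDATE`, `TRAILING_PUNCTUATION`, and `convert_links` are made up for illustration, the set of trailing punctuation is a guess, and the code assumes Python 3:

```python
import re

# Loose candidate finder: a scheme followed by a run of characters that are
# not whitespace, angle brackets, or double quotes. Bare URLs such as
# google.ru are deliberately NOT matched.
CANDIDATE = re.compile(r'(?:https?|ftp)://[^\s<>"]+')

# Trailing characters assumed to be sentence punctuation, not URL text.
TRAILING_PUNCTUATION = '.,;:!?'

def convert_links(text):
    def to_anchor(match):
        candidate = match.group(0)
        # Plain string handling instead of more regex: peel punctuation off
        # the end of the candidate and put it back after the closing tag.
        url = candidate.rstrip(TRAILING_PUNCTUATION)
        tail = candidate[len(url):]
        return '<a href="{0}">{0}</a>{1}'.format(url, tail)
    return CANDIDATE.sub(to_anchor, text)

print(convert_links('See http://tikirobot.net/#foo for more info.'))
print(convert_links('More at http://tikirobot.net.'))
print(convert_links('words with slashes: unread/unregistered'))
```

The trade-offs are real. This version no longer links bare addresses like google.ru, a DNS name with a concluding period and no slash after it (say `http://to.`) would lose that period to the punctuation trimming, and the `&` characters in the href are not HTML-escaped. In exchange, there is no TLD list to maintain, and words with slashes are left alone because they carry no scheme.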
