Replacing URLs with links… halp?

I found this crazy regex for matching URLs in Python. Even though some very smart people went through the trouble of concocting that regex, I can’t seem to get it to work for all test cases… hmm.

I’m trying to turn bare URLs into links for our Twitter sidebar widget, and it almost works, except for the last crazy case below (query string + anchor):

#!/usr/bin/python
 
import re
 
def convertLinks(str):
    #crazy regex from Shag, based on http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
 
    prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')
 
    return prog.sub(r'<a href="\g<0>">\g<0></a>', str)
 
#single url
print convertLinks('http://tikirobot.net\n')
 
#string
print convertLinks('See http://tikirobot.net for more info.\n')
 
#string w/ anchor
print convertLinks('See http://tikirobot.net/#foo for more info.\n')
 
#two urls in a string
print convertLinks('See http://tikirobot.net or http://wikipedia.org for more info.\n')
 
#some test urls from flanders.co.nz
print convertLinks('This is a google search: http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1\n')
 
print convertLinks('ftp://joe:password@ftp.filetransferprotocal.com is a ftp url\n')
 
print convertLinks('There is a bare url google.ru somewhere in this sentence\n')
 
#test cases from shag
print convertLinks('query string: https://some-url.com/?query=&name=joe&filter=*.*\n')
 
print convertLinks('both a query string and an anchor with no host name separator slash: https://some-url.com?query=&name=joe?filter=*.*#some_anchor\n')
 
print convertLinks('both a query string and an anchor: https://some-url.com/?query=&name=joe&filter=*.*#some_anchor\n')
 
print convertLinks('DNS name with a concluding period: http://some-url.com./\n')
 
print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')
 
print convertLinks('single-component DNS name plus root: http://to./\n')
 
print convertLinks('single-component DNS name: http://to/\n')
 
print convertLinks('words with slashes: unread/unregistered\n')

Also note my double-grouping on the regex.. There must be a better way!
Update, figured out that the group zero backreference is \g<0> (\0 doesn’t work, so I was double-grouping so that I could use \1).

Little help, regex ninjas? Update 2: Shag to the rescue!

Updates 3, 4, and 5: Shag has provided us with a even moar better regex in the comments.. Yay Shag!!!!!
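
For anyone who only wants to see the group-zero backreference in action, here is a minimal Python 3 sketch. The pattern is a deliberately simplified stand-in (an assumption for illustration, not the regex above or Shag's improved one), and the names SIMPLE_URL and link_urls are made up. It only catches URLs that start with a scheme, so bare URLs like google.ru are ignored.

```python
import re

# Simplified stand-in pattern (NOT the big regex from this post): a scheme,
# then a run of non-space characters that does not end in punctuation.
SIMPLE_URL = re.compile(r'(?:https?|ftp)://[^\s<>"]*[^\s<>".,;:!?)]')

def link_urls(text):
    # \g<0> refers to the whole match, so no extra capturing group is needed.
    return SIMPLE_URL.sub(r'<a href="\g<0>">\g<0></a>', text)

print(link_urls('See http://tikirobot.net/#foo for more info.'))
# See <a href="http://tikirobot.net/#foo">http://tikirobot.net/#foo</a> for more info.
```

The trailing character class is there so that a sentence-ending period or comma stays outside the link.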

Signing Amazon Web Services API Requests in Python

I wanted to ping the “Amazon Product Advertising API” which now requires an HMAC signature, and the pyAWS library doesn’t sign requests and is no longer maintained. Here is some Python code to create a signed request:

import base64
import hashlib
import hmac
import time
import urllib

# pyAWS no longer works with the AWS signed request requirement
# Sign an AWS REST request using the method described here
# http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/index.html?RequestAuthenticationArticle.html
#_______________________________________________________________________________
def getSignedUrl(accessKey, secretKey, params):
 
    #Step 0: add accessKey, Service, Timestamp, and Version to params
    params['AWSAccessKeyId'] = accessKey
    params['Service']        = 'AWSECommerceService'
 
    #Amazon adds thousandths of a second to the timestamp (always .000), so we do too.
    #(see http://associates-amazon.s3.amazonaws.com/signed-requests/helper/index.html)
    params['Timestamp']      = time.strftime("%Y-%m-%dT%H:%M:%S.000Z", time.gmtime())
    params['Version']        = '2009-03-31'
 
    #Step 1a: sort params
    paramsList = params.items()
    paramsList.sort()
 
    #Step 1b-d: create canonicalizedQueryString
    # This code comes from http://blog.umlungu.co.uk/blog/2009/jul/12/pyaws-adding-request-authentication/
    # and the resulting discussion
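    # The signing rules call for RFC 3986 encoding: escape everything except A-Z a-z 0-9 - _ . ~
    # (urllib.quote's default would leave '/' alone and escape '~')
    canonicalizedQueryString = '&'.join(['%s=%s' % (k,urllib.quote(str(v), safe='-_.~')) for (k,v) in paramsList if v])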
 
    #Step 2: create string to sign
    host          = 'ecs.amazonaws.com'
    requestUri    = '/onca/xml'
    stringToSign  = 'GET\n'
    stringToSign += host +'\n'
    stringToSign += requestUri+'\n'
    stringToSign += canonicalizedQueryString.encode('utf-8')
 
    #Step 3: create HMAC
    digest = hmac.new(secretKey, stringToSign, hashlib.sha256).digest()
 
    #Step 4: base64 the hmac
    sig = base64.b64encode(digest)
 
    #Step 5: append signature to query
    url  = 'http://' + host + requestUri + '?'
    url += canonicalizedQueryString + "&Signature=" + urllib.quote(sig)
 
    return url
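
The function above is Python 2. As a rough sketch, the same steps in Python 3 might look like the block below. The host, request URI, and version are copied from the code above. The snake_case names, the optional timestamp argument (added so the output is repeatable), and the placeholder keys are my own assumptions, not part of any Amazon library.

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def get_signed_url(access_key, secret_key, params, timestamp=None):
    # Work on a copy so the caller's dict is left alone.
    params = dict(params)
    params['AWSAccessKeyId'] = access_key
    params['Service'] = 'AWSECommerceService'
    params['Timestamp'] = timestamp or time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime())
    params['Version'] = '2009-03-31'

    # Sort by name; percent-encode everything except RFC 3986 unreserved characters.
    query = '&'.join(
        '%s=%s' % (quote(str(k), safe='-_.~'), quote(str(v), safe='-_.~'))
        for k, v in sorted(params.items()) if v
    )

    host = 'ecs.amazonaws.com'
    request_uri = '/onca/xml'
    string_to_sign = '\n'.join(['GET', host, request_uri, query])

    # Python 3's hmac wants bytes for both the key and the message.
    digest = hmac.new(secret_key.encode('utf-8'),
                      string_to_sign.encode('utf-8'),
                      hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode('ascii')

    return 'http://%s%s?%s&Signature=%s' % (
        host, request_uri, query, quote(signature, safe='-_.~'))

# Placeholder credentials and parameters, purely for illustration.
url = get_signed_url('EXAMPLE_ACCESS_KEY', 'EXAMPLE_SECRET_KEY',
                     {'Operation': 'ItemLookup', 'ItemId': '0679722769'},
                     timestamp='2009-01-01T12:00:00.000Z')
print(url)
```

Passing a fixed timestamp makes the signature deterministic, which is handy for checking the function against Amazon's signed-requests helper page linked in the comments above.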

How To Pretty-Print a Python ElementTree Structure

ElementTree doesn’t support pretty-printing XML. lxml does, but isn’t installed on our system. minidom’s toprettyxml() is seriously fucked up. What to do? Turned out PyXML was installed, so I took some advice from here and came up with this function, which takes an ET node and returns a pretty-printed string:

import xml.etree.ElementTree as ET
 
from xml.dom.ext.reader import Sax2
from xml.dom.ext import PrettyPrint
from StringIO import StringIO
 
def prettyPrintET(etNode):
    reader = Sax2.Reader()
    docNode = reader.fromString(ET.tostring(etNode))
    tmpStream = StringIO()
    PrettyPrint(docNode, stream=tmpStream)
    return tmpStream.getvalue()
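
If PyXML isn't around either, a different approach is to rewrite the whitespace in the tree yourself. Here is a Python 3 sketch that uses nothing but ElementTree. It is a swapped-in technique, not what the function above does, and the names indent_tree and pretty_print_et are made up. Note that it modifies the tree in place.

```python
import xml.etree.ElementTree as ET

def indent_tree(node, level=0, step='  '):
    # Rewrite whitespace-only text/tail in place so each child gets its own line.
    pad = '\n' + level * step
    children = list(node)
    if not children:
        return
    if not (node.text and node.text.strip()):
        node.text = pad + step
    for child in children:
        indent_tree(child, level + 1, step)
        if not (child.tail and child.tail.strip()):
            child.tail = pad + step
    # The last child's tail sits before the parent's closing tag: one level less.
    last = children[-1]
    if not (last.tail and last.tail.strip()):
        last.tail = pad

def pretty_print_et(node):
    indent_tree(node)
    return ET.tostring(node, encoding='unicode')

root = ET.fromstring('<a><b>1</b><c><d/></c></a>')
print(pretty_print_et(root))
```

On Python 3.9 and later, the standard library's ET.indent() does much the same job.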

Announcing YouTubeFilter

YouTubeFilter is a simple tool that scrapes the MetaFilter RSS feed and embeds the YouTube videos inline. I wrote it to make it easier to find cool videos to watch on my Wii.

Unfortunately, the Wii runs out of memory when loading YouTubeFilter! And of course, Firefox bugs on the Mac prevent some of the embedded videos from showing up unless you resize the window just right. Stupid Firefox.

The code is checked into SourceForge. I use Beautiful Soup for parsing the RSS. Someone please help me make it work on the Wii!
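
The basic idea (pull items out of an RSS feed, look for YouTube links, emit embed markup) can be sketched in a few lines of Python 3. This is not the YouTubeFilter code: it uses the standard library's ElementTree instead of Beautiful Soup, the sample feed and video ID are invented, and the iframe embed URL format is an assumption about how you'd embed a video.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample feed, standing in for the real RSS.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>Cool video</title>
<description>Watch &lt;a href="http://www.youtube.com/watch?v=abc123DEF_-"&gt;this&lt;/a&gt;</description></item>
<item><title>No video</title><description>Just text</description></item>
</channel></rss>"""

# Assumes watch-style links with an 11-character video ID.
VIDEO_ID = re.compile(r'youtube\.com/watch\?v=([\w-]{11})')

def find_videos(rss_text):
    # Return (item title, video id) pairs for every YouTube link found.
    videos = []
    for item in ET.fromstring(rss_text).iter('item'):
        title = item.findtext('title', default='')
        description = item.findtext('description', default='')
        for video_id in VIDEO_ID.findall(description):
            videos.append((title, video_id))
    return videos

def embed_html(video_id):
    # Assumed embed markup; adjust to whatever player markup you prefer.
    return '<iframe src="https://www.youtube.com/embed/%s"></iframe>' % video_id

for title, video_id in find_videos(SAMPLE_RSS):
    print(title, embed_html(video_id))
```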

Announcing ChatBubble!

I’ve finally made it easy to post good-looking iChat transcripts to the blog! We use CSS to style DIVs to look like iChat speech balloons.

Cool! Where can I get the CSS?

All the code is checked into SourceForge. You can browse it here.

But how does it work?

A brief description is here. Scott Schiller came up with the Even More Rounded Corners technique that we use. There is a CSS file to include and a Python script that turns transcripts into HTML that you can paste into a blog post. We need more documentation, CSS cleanup, cross-browser support, and more speech balloon colors, if you feel like contributing patches.
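
To give a flavor of the transcript-to-HTML step, here is a toy Python 3 sketch. It is not the ChatBubble script: the "speaker: message" input format, the CSS class names, and the function name transcript_to_html are all assumptions made up for this example.

```python
import html

def transcript_to_html(transcript, me):
    # One DIV per "speaker: message" line; lines without a colon are skipped.
    blocks = []
    for line in transcript.splitlines():
        if ':' not in line:
            continue
        speaker, message = line.split(':', 1)
        speaker = speaker.strip()
        # Hypothetical class names: my own lines go right, everyone else left.
        side = 'bubble-right' if speaker == me else 'bubble-left'
        blocks.append('<div class="bubble %s"><span class="speaker">%s</span> %s</div>'
                      % (side, html.escape(speaker), html.escape(message.strip())))
    return '\n'.join(blocks)

print(transcript_to_html('alice: hi <there>\nbob: hello', me='alice'))
```

Escaping the message text matters here, since chat transcripts are full of angle brackets and ampersands that would otherwise end up as markup.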

Doesn’t WordPress completely bork the formatting in Safari by adding unmatched </p> tags?

Yup! WordPress is crap! You can use the wp-unformatted plugin to disable autop() on posts that contain ChatBubbles.

Example Scripts: REST web services and system calls

I’ve been translating all the new Perl and PHP scripts I write into Python and Ruby in order to learn more about those two languages. I checked some more example scripts into SourceForge, which might be useful for others who know one of these languages and want to learn a new one.

These scripts are available in Perl, PHP, Python, and Ruby:


The REST Web Service PHP and Perl scripts don’t work in Mac OS X, because OS X doesn’t ship with Perl’s XML::Simple or PHP’s simplexml. More surprisingly, OS X doesn’t ship with Perl’s LWP module.

I’m starting to like Ruby more every day. It would be nice if Ruby and Python had an XML::Simple equivalent in their standard distributions.
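
In the meantime, a poor man's version isn't hard to write. Here is a Python 3 sketch of an XML::Simple-ish helper built on ElementTree. The function name xml_to_dict and the sample document are made up, and it only covers the easy cases: text inside an element that also has attributes or children is dropped.

```python
import xml.etree.ElementTree as ET

def xml_to_dict(node):
    # Leaf elements without attributes collapse to their text.
    children = list(node)
    if not children and not node.attrib:
        return (node.text or '').strip()
    # Otherwise start from the attributes and fold the children in.
    result = dict(node.attrib)
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:
            # Repeated tags become a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

doc = ET.fromstring('<response status="ok"><item>a</item><item>b</item><total>2</total></response>')
print(xml_to_dict(doc))
# {'status': 'ok', 'item': ['a', 'b'], 'total': '2'}
```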

Example scripts: directory listing in perl, php, python, and ruby

I remember when I fell in love with Perl. It was the summer of 1995, and peliom and I had just met, and were working at the Lab. PostScript hacking using MacPerl on OS 8. It was beautiful.

That was more than ten years ago, and even though I’ve remained a Perl hacker the whole time, I see massive amounts of development happening on Python and Ruby, and the Perl community seems to be slowing down (what’s up with Perl 6 anyway?), so, despite the lack of block-level scope, I think it might finally be time to move on.

I don’t know enough about either Python or Ruby to figure out which to learn, so I’ll learn them both, and deal with choosing one later. Along the way I’ll post some example scripts. Anyone else making the jump from Perl or PHP to something modern might find these useful. Here is the first example: printing out a directory listing using readdir and glob in your favorite scripting language:
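
In Python, for instance, the two approaches might look roughly like this. This is a Python 3 sketch, not the script from the post: it uses os.listdir as the readdir equivalent and the glob module for wildcards, and the temporary directory and file names are only there so the example is self-contained.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Create a few empty files to list.
    for name in ('a.txt', 'b.txt', 'c.log'):
        open(os.path.join(directory, name), 'w').close()

    # readdir-style: every entry in the directory (order is arbitrary, so sort).
    all_entries = sorted(os.listdir(directory))
    print(all_entries)

    # glob-style: only the paths that match a shell wildcard.
    text_files = sorted(os.path.basename(path)
                        for path in glob.glob(os.path.join(directory, '*.txt')))
    print(text_files)
```

os.listdir returns bare names, while glob.glob returns paths that include the directory you passed in, hence the basename call.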

(more…)