
Replacing URLs with links… halp?

I found this crazy regex for matching URLs in Python. Even though some very smart people went through the trouble of concocting that regex, I can’t seem to use it for all test cases… hmm.

I’m trying to turn bare urls into links for our Twitter sidebar widget, and it almost works, except for the last crazy case below (query string + anchor):

#!/usr/bin/python
 
import re
 
def convertLinks(str):
    #crazy regex from Shag, based on http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
 
    prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')
 
    return prog.sub(r'<a href="\g<0>">\g<0></a>', str)
 
#single url
print convertLinks('http://tikirobot.net\n')
 
#string
print convertLinks('See http://tikirobot.net for more info.\n')
 
#string w/ anchor
print convertLinks('See http://tikirobot.net/#foo for more info.\n')
 
#two urls in a string
print convertLinks('See http://tikirobot.net or http://wikipedia.org for more info.\n')
 
#some test urls from flanders.co.nz
print convertLinks('This is a google search: http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1\n')
 
print convertLinks('ftp://joe:password@ftp.filetransferprotocal.com is a ftp url\n')
 
print convertLinks('There is a bare url google.ru somewhere in this sentence\n')
 
#test cases from shag
print convertLinks('query string: https://some-url.com/?query=&name=joe&filter=*.*\n')
 
print convertLinks('both a query string and an anchor with no host name separator slash: https://some-url.com?query=&name=joe?filter=*.*#some_anchor\n')
 
print convertLinks('both a query string and an anchor: https://some-url.com/?query=&name=joe&filter=*.*#some_anchor\n')
 
print convertLinks('DNS name with a concluding period: http://some-url.com./\n')
 
print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')
 
print convertLinks('single-component DNS name plus root: http://to./\n')
 
print convertLinks('single-component DNS name: http://to/\n')
 
print convertLinks('words with slashes: unread/unregistered\n')

Also note my double-grouping on the regex.. There must be a better way!
Update, figured out that the group zero backreference is \g<0> (\0 doesn’t work, so I was double-grouping so that I could use \1).
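
To show the `\g<0>` backreference on its own, here is a minimal sketch. The pattern `simple_url` and the function `link_whole_match` are made up for illustration and are far less thorough than the regex above:

```python
import re

# Deliberately simple pattern for illustration only; it is NOT the full URL regex.
simple_url = re.compile(r'https?://[^\s<>"]+')

def link_whole_match(text):
    # \g<0> stands for the entire match, so no extra capturing group is needed.
    return simple_url.sub(r'<a href="\g<0>">\g<0></a>', text)

print(link_whole_match('See http://tikirobot.net for more info.'))
```

Because `\g<0>` always means "the whole match", the pattern itself can stay free of a wrapping group that exists only to feed `\1`.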

Little help, regex ninjas? Update 2: Shag to the rescue!

Updates 3, 4, and 5: Shag has provided us with an even moar better regex in the comments.. Yay Shag!!!!!

Fever Ray WANTS YOUR BRAIN

Karin Dreijer Andersson made it on to Resident Advisor’s Top 100 albums of the ’00s list twice, once for her solo album as Fever Ray and again for her collaboration with her brother as The Knife.

Here is an acceptance speech that Fever Ray gave at an awards show in Sweden. It is one of the best acceptance speeches ever given:

Signing Amazon Web Services API Requests in Python

I wanted to ping the “Amazon Product Advertising API” which now requires an HMAC signature, and the pyAWS library doesn’t sign requests and is no longer maintained. Here is some Python code to create a signed request:

# pyAWS no longer works with the AWS signed request requirement
# Sign an AWS REST request using the method described here
# http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/index.html?RequestAuthenticationArticle.html
import base64
import hashlib
import hmac
import time
import urllib

#_______________________________________________________________________________
def getSignedUrl(accessKey, secretKey, params):
 
    #Step 0: add accessKey, Service, Timestamp, and Version to params
    params['AWSAccessKeyId'] = accessKey
    params['Service']        = 'AWSECommerceService'
 
    #Amazon adds thousandths of a second to the timestamp (always .000), so we do too.
    #(see http://associates-amazon.s3.amazonaws.com/signed-requests/helper/index.html)
    params['Timestamp']      = time.strftime("%Y-%m-%dT%H:%M:%S.000Z", time.gmtime())
    params['Version']        = '2009-03-31'
 
    #Step 1a: sort params
    paramsList = params.items()
    paramsList.sort()
 
    #Step 1b-d: create canonicalizedQueryString
    # This code comes from http://blog.umlungu.co.uk/blog/2009/jul/12/pyaws-adding-request-authentication/
    # and the resulting discussion
    canonicalizedQueryString = '&'.join(['%s=%s' % (k,urllib.quote(str(v))) for (k,v) in paramsList if v])
 
    #Step 2: create string to sign
    host          = 'ecs.amazonaws.com'
    requestUri    = '/onca/xml'
    stringToSign  = 'GET\n'
    stringToSign += host +'\n'
    stringToSign += requestUri+'\n'
    stringToSign += canonicalizedQueryString.encode('utf-8')
 
    #Step 3: create HMAC
    digest = hmac.new(secretKey, stringToSign, hashlib.sha256).digest()
 
    #Step 4: base64 the hmac
    sig = base64.b64encode(digest)
 
    #Step 5: append signature to query
    url  = 'http://' + host + requestUri + '?'
    url += canonicalizedQueryString + "&Signature=" + urllib.quote(sig)
 
    return url
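
The listing above is Python 2 (`urllib.quote`, byte-string keys). For Python 3, a rough sketch of steps 1 through 5 might look like the following. The function name `sign_query`, the example key and parameters, and the choice to percent-encode everything except RFC 3986 unreserved characters (in both names and values) are assumptions made for illustration, not something taken from the original code:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign_query(secret_key, host, request_uri, params):
    """Return the canonical query string with a Signature parameter appended."""
    # Assumption: encode everything except RFC 3986 unreserved characters.
    def enc(value):
        return quote(str(value), safe='-_.~')

    # Step 1: sort by parameter name, skip empty values, join with '&'.
    canonical = '&'.join('%s=%s' % (enc(k), enc(v))
                         for k, v in sorted(params.items()) if v)

    # Step 2: the string to sign is four newline-separated parts.
    string_to_sign = '\n'.join(['GET', host, request_uri, canonical])

    # Steps 3-4: HMAC-SHA256 over UTF-8 bytes, then base64.
    digest = hmac.new(secret_key.encode('utf-8'),
                      string_to_sign.encode('utf-8'),
                      hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode('ascii')

    # Step 5: append the percent-encoded signature.
    return canonical + '&Signature=' + enc(signature)

# Hypothetical key and parameters, for illustration only.
query = sign_query('not-a-real-secret', 'ecs.amazonaws.com', '/onca/xml',
                   {'Service': 'AWSECommerceService', 'Operation': 'ItemLookup'})
print('http://ecs.amazonaws.com/onca/xml?' + query)
```

Unlike `getSignedUrl`, this sketch does not add `AWSAccessKeyId`, `Timestamp`, or `Version` for you; the caller would put those into `params` first.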

Fast manipulation of tar files using 7zip

tar always reads every byte in an archive (never calls seek()) and is very slow when trying to extract a single file from a large archive.

One solution is to use 7z instead, which is found in the p7zip-full debian package. For some operations, 7z is three orders of magnitude faster than tar. Here are some timings that illustrate how much faster 7z is:

#4.5GB archive listing using tar
time tar tvf fifteenthcensus00reel2149_jp2.tar 
 
(output suppressed, timing stable whether file cache is warmed or not)
 
real	2m1.876s
user	0m0.444s
sys	0m7.740s
 
#4.5GB archive listing using 7z with cold file cache
time 7z l fifteenthcensus00reel2149_jp2.tar 
 
(output suppressed)
 
real	0m10.419s
user	0m0.080s
sys	0m0.124s
 
#4.5GB archive listing using 7z with hot file cache
time 7z l fifteenthcensus00reel2149_jp2.tar 
 
(output suppressed)
 
real	0m0.145s
user	0m0.052s
sys	0m0.040s
 
 
#extraction of last file in a 4.5GB archive using tar
time tar xvf fifteenthcensus00reel2149_jp2.tar fifteenthcensus00reel2149_jp2/fifteenthcensus00reel2149_0185.jp2
 
real	2m3.545s
user	0m0.436s
sys	0m7.824s
 
#extraction of last file in a 4.5GB archive using 7z and a hot file cache
time 7z e fifteenthcensus00reel2149_jp2.tar fifteenthcensus00reel2149_jp2/fifteenthcensus00reel2149_0185.jp2
 
real	0m0.104s
user	0m0.036s
sys	0m0.036s
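
For some intuition about why seeking matters: an uncompressed tar file is a sequence of 512-byte headers, each followed by the file's data padded to a multiple of 512 bytes. A reader that trusts the size field in the header can jump straight over the data. Below is a minimal sketch of that idea. It says nothing about how 7z is actually implemented, `list_members_by_seeking` is a made-up name, and it ignores long-name extensions, the ustar prefix field, and very large files:

```python
import io
import tarfile

BLOCK = 512  # tar headers and data padding both use 512-byte blocks

def list_members_by_seeking(fileobj):
    """Yield (name, size) per entry, seeking past file data instead of reading it."""
    while True:
        header = fileobj.read(BLOCK)
        # An all-zero block marks the end of the archive.
        if len(header) < BLOCK or header == b'\0' * BLOCK:
            return
        name = header[0:100].split(b'\0', 1)[0].decode('utf-8', 'replace')
        size_field = header[124:136].strip(b'\0 ').decode('ascii')
        size = int(size_field, 8) if size_field else 0  # size is stored as octal text
        yield name, size
        # Skip the data, rounded up to the next 512-byte boundary.
        padded = (size + BLOCK - 1) // BLOCK * BLOCK
        fileobj.seek(padded, io.SEEK_CUR)

# Build a tiny in-memory archive so the sketch can run without any files on disk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w', format=tarfile.USTAR_FORMAT) as tar:
    for member_name, payload in [('a.txt', b'hello'), ('b.bin', b'x' * 1000)]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

members = list(list_members_by_seeking(buf))
print(members)
```

On a real 4.5GB archive the same loop would touch only one 512-byte header per file, which is the kind of access pattern that makes listing so much cheaper than reading every byte.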

Happy Belated Birthday!

The Imaginarium of Dr Parnassus

Speaking of Mr. Tom Waits, the new Terry Gilliam film The Imaginarium of Doctor Parnassus stars Tom Waits as THE DEVIL. It is playing at the Kabuki RIGHT NOW and all this week. When shall we go?

A ramp for Mom!



A ramp for Mom!, originally uploaded by tiki.robot.

-raj

What’s he building in there?

- Tom Waits, via reddit