
Replacing URLs with links… halp?

I found this crazy regex for matching URLs in Python. Even though some very smart people went through the trouble of concocting that regex, I can’t seem to get it to work for all test cases… hmm.

I’m trying to turn bare urls into links for our Twitter sidebar widget, and it almost works, except for the last crazy case below (query string + anchor):

#!/usr/bin/python
 
import re
 
def convertLinks(str):
    #crazy regex from Shag, based on http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
 
    prog = re.compile(r'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)*(?#TopLevel Domains)(?:[a-z]+\.?))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?|(?#BareURL)(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?\.?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/|)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)')
 
    return prog.sub(r'<a href="\g<0>">\g<0></a>', str)
 
#single url
print convertLinks('http://tikirobot.net\n')
 
#string
print convertLinks('See http://tikirobot.net for more info.\n')
 
#string w/ anchor
print convertLinks('See http://tikirobot.net/#foo for more info.\n')
 
#two urls in a string
print convertLinks('See http://tikirobot.net or http://wikipedia.org for more info.\n')
 
#some test urls from flanders.co.nz
print convertLinks('This is a google search: http://www.google.com/search?q=good+url+regex&rls=com.microsoft:*&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1\n')
 
print convertLinks('ftp://joe:password@ftp.filetransferprotocal.com is a ftp url\n')
 
print convertLinks('There is a bare url google.ru somewhere in this sentence\n')
 
#test cases from shag
print convertLinks('query string: https://some-url.com/?query=&name=joe&filter=*.*\n')
 
print convertLinks('both a query string and an anchor with no host name separator slash: https://some-url.com?query=&name=joe?filter=*.*#some_anchor\n')
 
print convertLinks('both a query string and an anchor: https://some-url.com/?query=&name=joe&filter=*.*#some_anchor\n')
 
print convertLinks('DNS name with a concluding period: http://some-url.com./\n')
 
print convertLinks('DNS name with a concluding period and query string: http://some-url.com./?foo=bar\n')
 
print convertLinks('single-component DNS name plus root: http://to./\n')
 
print convertLinks('single-component DNS name: http://to/\n')
 
print convertLinks('words with slashes: unread/unregistered\n')

Also note my double-grouping on the regex.. There must be a better way!
Update: figured out that the group zero backreference is \g<0> (\0 doesn’t work, so I was double-grouping so that I could use \1).
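
Here is a minimal sketch of the \g<0> trick on its own. The pattern below is a deliberately simple stand-in that I made up for illustration (it only handles URLs with an explicit http, https, or ftp scheme, and is nowhere near as thorough as the regex above); the linkify name is hypothetical too.

```python
import re

# Simplified stand-in pattern (an assumption for illustration, not the regex above):
# a scheme followed by anything that is not whitespace, a quote, or an angle bracket.
SIMPLE_URL = re.compile(r'(?:https?|ftp)://[^\s<>"]+')

def linkify(text):
    # \g<0> refers to the entire match, so the pattern needs no extra
    # capturing group wrapped around it just to get a \1.
    return SIMPLE_URL.sub(r'<a href="\g<0>">\g<0></a>', text)

print(linkify('See http://tikirobot.net or http://wikipedia.org for more info.'))
```

A pattern this loose will also swallow punctuation that directly follows a URL, so it is only meant to show the backreference, not to replace the real thing.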

Little help, regex ninjas? Update 2: Shag to the rescue!

Updates 3, 4, and 5: Shag has provided us with a even moar better regex in the comments.. Yay Shag!!!!!

Fever Ray WANTS YOUR BRAIN

Karin Dreijer Andersson made it on to Resident Advisor’s Top 100 albums of the ’00s list twice, once for her solo album as Fever Ray and again for her collaboration with her brother as The Knife.

Here is an acceptance speech that Fever Ray gave at an awards show in Sweden. It is one of the best acceptance speeches ever given:

Signing Amazon Web Services API Requests in Python

I wanted to ping the “Amazon Product Advertising API” which now requires an HMAC signature, and the pyAWS library doesn’t sign requests and is no longer maintained. Here is some Python code to create a signed request:

import base64
import hashlib
import hmac
import time
import urllib

# pyAWS no longer works with the AWS signed request requirement
# Sign an AWS REST request using the method described here
# http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/index.html?RequestAuthenticationArticle.html
#_______________________________________________________________________________
def getSignedUrl(accessKey, secretKey, params):
 
    #Step 0: add accessKey, Service, Timestamp, and Version to params
    params['AWSAccessKeyId'] = accessKey
    params['Service']        = 'AWSECommerceService'
 
    #Amazon adds milliseconds to the timestamp (always .000), so we do too.
    #(see http://associates-amazon.s3.amazonaws.com/signed-requests/helper/index.html)
    params['Timestamp']      = time.strftime("%Y-%m-%dT%H:%M:%S.000Z", time.gmtime())
    params['Version']        = '2009-03-31'
 
    #Step 1a: sort params
    paramsList = params.items()
    paramsList.sort()
 
    #Step 1b-d: create canonicalizedQueryString
    # This code comes from http://blog.umlungu.co.uk/blog/2009/jul/12/pyaws-adding-request-authentication/
    # and the resulting discussion
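    # Amazon wants everything except RFC 3986 unreserved characters percent-encoded (including '/')
    canonicalizedQueryString = '&'.join(['%s=%s' % (k,urllib.quote(str(v), safe='-_.~')) for (k,v) in paramsList if v])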
 
    #Step 2: create string to sign
    host          = 'ecs.amazonaws.com'
    requestUri    = '/onca/xml'
    stringToSign  = 'GET\n'
    stringToSign += host +'\n'
    stringToSign += requestUri+'\n'
    stringToSign += canonicalizedQueryString.encode('utf-8')
 
    #Step 3: create HMAC
    digest = hmac.new(secretKey, stringToSign, hashlib.sha256).digest()
 
    #Step 4: base64 the hmac
    sig = base64.b64encode(digest)
 
    #Step 5: append signature to query
    url  = 'http://' + host + requestUri + '?'
    url += canonicalizedQueryString + "&Signature=" + urllib.quote(sig)
 
    return url
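
The code above is Python 2. Here is a rough sketch of the same five steps for Python 3, assuming the host, request URI, and Version string from the original still apply (I have not checked that they do). The function name, the snake_case argument names, and the optional timestamp argument are my own additions for illustration; the timestamp argument is only there so the output can be made repeatable.

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def get_signed_url(access_key, secret_key, params, timestamp=None):
    # Step 0: add access key, Service, Timestamp, and Version (work on a copy)
    params = dict(params)
    params['AWSAccessKeyId'] = access_key
    params['Service'] = 'AWSECommerceService'
    params['Timestamp'] = timestamp or time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime())
    params['Version'] = '2009-03-31'

    # Step 1: sort params and percent-encode everything except unreserved characters
    canonical = '&'.join(
        '%s=%s' % (quote(str(k), safe='-_.~'), quote(str(v), safe='-_.~'))
        for k, v in sorted(params.items()) if v
    )

    # Step 2: create the string to sign
    host = 'ecs.amazonaws.com'
    request_uri = '/onca/xml'
    string_to_sign = '\n'.join(['GET', host, request_uri, canonical])

    # Step 3: HMAC-SHA256 (hmac wants bytes in Python 3)
    digest = hmac.new(secret_key.encode('utf-8'),
                      string_to_sign.encode('utf-8'),
                      hashlib.sha256).digest()

    # Step 4: base64 the hmac
    signature = base64.b64encode(digest).decode('ascii')

    # Step 5: append the percent-encoded signature to the query
    return 'http://%s%s?%s&Signature=%s' % (
        host, request_uri, canonical, quote(signature, safe='-_.~'))
```

Calling it with something like get_signed_url('AKID', 'secret', {'Operation': 'ItemLookup', 'ItemId': '0679722769'}) returns the full URL to fetch; the key, secret, and item values here are placeholders.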

Fast manipulation of tar files using 7zip

tar always reads every byte in an archive (it never calls seek()), so it is very slow when extracting a single file from a large archive.

One solution is to use 7z instead, which is found in the p7zip-full Debian package. For some operations, 7z is three orders of magnitude faster than tar. Here are some timings that illustrate how much faster 7z is:

#4.5GB archive listing using tar
time tar tvf fifteenthcensus00reel2149_jp2.tar 
 
(output suppressed, timing stable whether file cache is warmed or not)
 
real	2m1.876s
user	0m0.444s
sys	0m7.740s
 
#4.5GB archive listing using 7z with cold file cache
time 7z l fifteenthcensus00reel2149_jp2.tar 
 
(output suppressed)
 
real	0m10.419s
user	0m0.080s
sys	0m0.124s
 
#4.5GB archive listing using 7z with hot file cache
time 7z l fifteenthcensus00reel2149_jp2.tar 
 
(output suppressed)
 
real	0m0.145s
user	0m0.052s
sys	0m0.040s
 
 
#extraction of last file in a 4.5GB archive using tar
time tar xvf fifteenthcensus00reel2149_jp2.tar fifteenthcensus00reel2149_jp2/fifteenthcensus00reel2149_0185.jp2
 
real	2m3.545s
user	0m0.436s
sys	0m7.824s
 
#extraction of last file in a 4.5GB archive using 7z and a hot file cache
time 7z e fifteenthcensus00reel2149_jp2.tar fifteenthcensus00reel2149_jp2/fifteenthcensus00reel2149_0185.jp2
 
real	0m0.104s
user	0m0.036s
sys	0m0.036s
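
If you would rather stay in Python than shell out to 7z, here is a small sketch that pulls one member out of an uncompressed tar with the standard tarfile module. As far as I know, tarfile skips over member data with seek() when the archive is an uncompressed file on disk, but I have not timed it against 7z, so treat any speed expectation as an assumption. The function name is made up for this example.

```python
import tarfile

def extract_member_bytes(tar_path, member_name):
    """Return the contents of a single member of an uncompressed tar as bytes."""
    # mode 'r:' means "read, no compression", so a compressed file is rejected
    with tarfile.open(tar_path, mode='r:') as tar:
        member = tar.getmember(member_name)   # raises KeyError if not present
        fileobj = tar.extractfile(member)     # None for directories and the like
        if fileobj is None:
            raise ValueError('%s is not a regular file' % member_name)
        return fileobj.read()
```

For the archive above that would be extract_member_bytes('fifteenthcensus00reel2149_jp2.tar', 'fifteenthcensus00reel2149_jp2/fifteenthcensus00reel2149_0185.jp2').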

The Imaginarium of Dr Parnassus

Speaking of Mr. Tom Waits, the new Terry Gilliam film The Imaginarium of Doctor Parnassus stars Tom Waits as THE DEVIL. It is playing at the Kabuki RIGHT NOW and all this week. When shall we go?

What’s he building in there?

- Tom Waits, via reddit

Look what Zara got me!

How To Teach Physics To Your Dog

Grayscale Santa

At this year’s Santacon, Brody dressed up as a grayscale Santa. Here is a picture by NV6V:

Another pic from sfgate…

Update: more info in the description of this pic:

“Custom gray & white Santa suit, made by me. Wig + contacts + Kryolan body paint”

Unreal…

Copenhagen’s Awesome Bicycle Infrastructure

StreetFilms has posted this video showing off the bicycle infrastructure in Copenhagen, where 37% of all commute trips are made on bike.

Update, somewhat related: The Senseable City Lab at MIT has unveiled their Copenhagen Wheel.

It transforms ordinary bicycles quickly into hybrid e-bikes that also function as mobile sensing units. The Copenhagen Wheel allows you to capture the energy dissipated while cycling and braking and save it for when you need a bit of a boost. It also maps pollution levels, traffic congestion, and road conditions in real-time.

Death of the Newspaper Greatly Exaggerated


I just picked up the latest issue of the San Francisco Panorama today from Green Apple. McSweeney’s has proven that the American newspaper is still viable. It just takes a team of writers five months to produce a single issue, which sells for $17.52 :)

On Bikes and Buses

This was a good week for bicycling in San Francisco! We got our first new bike lane in 3 years, our first physically-separated bike lane, and our very first bike box! Streetfilms has a video covering the Bike Celebration Press Conference:

In other transportation news, several Muni routes will be discontinued tomorrow. Mission Local has made a farewell video to the 26 Valencia, which is ending its 108-year run:

On Market Street

From Shag:


At the K-Mart/BigLots parking lot

Pics from Rolla, Missouri:


Circuit-Bent Tabla Machine

Check out this drum machine. It’s from India. I circuit-bent it!

An Anthropological Introduction to YouTube

An Anthropological Introduction to YouTube is an hour-long presentation that Prof. Michael Wesch gave at the Library of Congress. Prof. Wesch researches digital ethnography at KSU and previously made the famous The Machine is Us/ing Us video in 2007.

Panic Attack

via reddit:

Ataque de panico! (Panic Attack!) Is a short film describing an invasion of giant robots. What is impressive is knowing that it was carried by two men from Uruguay in six months with $ 300 budget.

I used to have that awesome robot in the opening shot..

This film would have been scarier if it included this giant flame-throwing robot baby:

Happy Halloween!


Remembering GeoCities and KickTam

After fifteen years, GeoCities is shutting down for good today. The Internet Archive has been working with Yahoo to make sure that the Wayback Machine has a complete, final snapshot of GeoCities before it goes offline.

The Archive Team, another archivist group run by Jason Scott of textfiles.com, is also archiving GeoCities. Jason created an under construction animated gif gallery to show the important cultural artifacts that we are going to lose with the GeoCities closure. I was browsing the gallery and found this little guy:


That penguin looks a lot like Tux, the Linux penguin, but he’s actually the old-school mascot of QuickTime. I think his name was KickTam, which was how someone’s kid pronounced QuickTime. I forget the details.

KickTam isn’t used by Apple marketing, but you can sometimes find him hanging out with the QuickTime developers. He’s seen less and less every day, and I was surprised to stumble upon him wearing a hard hat.

I think the only place that Apple still has KickTam up on apple.com is on the Letters from the Ice Floe page. Icefloe were tech notes that were useful to developers. I remember pointing people to Icefloe #19 every once in a while, and was surprised to see it is ten years old now. It seems the last Icefloe was written in 2001, and I imagine these will slip off apple.com soon, and KickTam will be gone from the net forever.

Crickets Available in a Library

Silent Library is a segment of Downtown no Gaki no Tsukai ya Arahende!! It is a little bit like Crickets Available Here, but played in a library.

This one involves no crickets, but does have a giant millipede:

This one involves spoons!

E-reader Taste Test

We’ve tried a bunch, but have yet to find one that is actually tasty.

Also, I started a new blog about the Archive.

Painting Over Banksy’s Street Art

I read about another Banksy piece being painted over today. It’s unreal that graffiti removal crews will paint over pieces worth more than US $500,000, and then keep on doing it. Here’s a small list.. I’m sure there are a ton more that have been painted over:

  • February 2007: “Workers from Network Rail painted over a set of doors to an electricity generator on which the secretive artist had sprayed a monkey preparing to blow up a bunch of bananas.”
  • April 2007: “Transport workers in London have painted over a mural by world-renowned graffiti artist Banksy, erasing a piece of art estimated to be worth $500,000 (250,000 pounds).”
  • March 2007: “A group of bungling council workers have painted over one of the earliest surviving murals by guerrilla graffiti artist Banksy.
    The 25ft x 4ft design, thought to be worth more than £100,000, was mistaken as vandalism by workmen who slapped thick black paint over it.”
  • February 2008: “The famous maid stencil by Banksy is in a parlous state. Someone has painted over most of the image, and stenciled the words ‘all the best’ over the top.”
  • September 2008: “The painting of a child flying a refrigerator-shaped kite by fabled British graffiti artist Banksy on St. Claude Ave. near St. Anthony St. has been painted over.”
  • March 2009: “As of very recently, Banksy’s “One Nation Under CCTV” image in Westminister has been painted over with a single coat of very grey paint.”
  • July 2009: “I’ve seen this one in the flesh. Its in the suburbs of Bamako in Mali, West Africa. Banksy was there about 4-5 months ago. Sadly this has been partially painted over.”
  • September 2009: “Council officials have painted over a Banksy graffito sketch from which a reworked version was derived as the cover artwork for the 2003 single Crazy Beat by the band Blur.”

See also: Barcelona’s disappearing street scene.

The Great Bayview Warehouse Fire

There was a terrible fire down by the RaNCh this weekend. Five neighboring warehouses burned to the ground. Somehow the ranch survived. dreameleven has a writeup of what happened. Craig started a flickr pool of photos of the fire:

Travel The World, Meet Interesting People, and Blog Them

I was talking to Bobslobster about the crazy scheme that some people on Reddit came up with to buy some unemployed person a $600 Jet Blue All-You-Can-Jet pass and have them run wacky missions around the country for a month. Mister Lobster had no idea what I was talking about, because he doesn’t read the AskReddit portion of the site. The fact that this ridiculousness was organized in two days is stunning. For those who missed it, here is a recap:

  • On August 17, Reddit user hiS_oWn posts this question, and offers to kick in $100 towards the ticket:
    Anyone else remember that JetBlue $600 for a month deal? What if we sponsor some unemployed redditor to travel around and do stuff for us, like courier packages, or do requests for us as compensation?

  • Redditor mr-oblivious creates a subreddit where candidates can post their resumes for consideration to be the Reddit Traveler.
  • Redditor mustardhamsters creates a KickStarter project to start to raise funds for the ticket.
  • On August 18, Redditor Saydrah offers to screen the candidates for the Reddit Traveler position.
  • On August 19, the KickStarter Project receives the required $680 in funds, about three hours after it was started.
  • Saydrah starts a voting thread, where redditors are asked to select their top choice among six candidates that have been screened over the phone.
  • Redditor arunan volunteers to tally the votes. The results are posted here. Redditors draynen and 77or88 are almost tied for the most votes.
  • Saydrah asks for more donations, so that two travelers can be sent, instead of just one.
  • An hour later, Foodproof donates $500, bringing the total to more than $1600.
  • hiS_oWn posts the official update, announcing both draynen, a filmmaker from Seattle, and 77or88, a recent college grad from Ohio as winners of the Travel Challenge. Donors are invited to suggest “reddit missions” and redditors are invited to arrange hosting in the /r/reddittraveljetblue subreddit.

Hooray for the internet!

Giro d’Italia 1974: A Time Before Camelbaks

via reddit

How To Pretty-Print a Python ElementTree Structure

ElementTree doesn’t support pretty-printing XML. lxml does, but isn’t installed on our system. minidom’s toprettyxml() is seriously fucked up. What to do? Turned out PyXML was installed, so I took some advice from here and came up with this function, which takes an ET node and returns a pretty-printed string:

import xml.etree.ElementTree as ET
 
from xml.dom.ext.reader import Sax2
from xml.dom.ext import PrettyPrint
from StringIO import StringIO
 
def prettyPrintET(etNode):
    reader = Sax2.Reader()
    docNode = reader.fromString(ET.tostring(etNode))
    tmpStream = StringIO()
    PrettyPrint(docNode, stream=tmpStream)
    return tmpStream.getvalue()
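
PyXML is long gone on newer systems, so here is a hedged alternative sketch that sticks to the standard library. It assumes Python 3.9 or later, where ElementTree has an indent() function; the pretty_print_et name and the space argument are just names picked for this example.

```python
import copy
import xml.etree.ElementTree as ET

def pretty_print_et(et_node, space='  '):
    """Return an indented XML string for an ElementTree node (Python 3.9+)."""
    # ET.indent() rewrites whitespace in place, so work on a copy
    # to leave the caller's tree untouched.
    node = copy.deepcopy(et_node)
    ET.indent(node, space=space)
    return ET.tostring(node, encoding='unicode')

root = ET.fromstring('<a><b>x</b></a>')
print(pretty_print_et(root))
```

Like the PyXML version, it takes an ET node and returns a string, so it should drop in wherever prettyPrintET was used.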