<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>TikiRobot! &#187; regex</title>
	<atom:link href="http://www.tikirobot.net/wp/tag/regex/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tikirobot.net/wp</link>
	<description>Mai Tais and Blinky Lights, Ahoy!</description>
	<lastBuildDate>Sun, 25 Jul 2010 18:54:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Replacing URLs with links&#8230; halp?</title>
		<link>http://www.tikirobot.net/wp/2010/01/26/replacing-urls-with-links-halp/</link>
		<comments>http://www.tikirobot.net/wp/2010/01/26/replacing-urls-with-links-halp/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 04:16:03 +0000</pubDate>
		<dc:creator>rajbot</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[support]]></category>

		<guid isPermaLink="false">http://www.tikirobot.net/wp/?p=2973</guid>
		<description><![CDATA[I found this crazy regex for matching URLs in Python. Even though some very smart people went through the trouble of concocting that regex, I can&#8217;t seem to get it to work for all test cases&#8230; hmm.
I&#8217;m trying to turn bare URLs into links for our Twitter sidebar widget, and it almost works, except for the last crazy [...]]]></description>
			<content:encoded><![CDATA[<p>I found <a href="http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/">this crazy regex</a> for matching URLs in Python. Even though some very smart people went through the trouble of concocting that regex, I can&#8217;t seem to get it to work for all test cases&#8230; hmm.</p>
<p>I&#8217;m trying to turn bare URLs into links for our Twitter sidebar widget, and it almost works, except for the last crazy case below (query string + anchor):</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #008000;">str</span><span style="color: black;">&#41;</span>:
    <span style="color: #808080; font-style: italic;">#crazy regex from Shag, based on http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/</span>
&nbsp;
    prog = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">'(?#FQURL)(?:(?#Protocol)(?:(?:ht|f)tp(?:s?)<span style="color: #000099; font-weight: bold;">\:</span><span style="color: #000099; font-weight: bold;">\/</span><span style="color: #000099; font-weight: bold;">\/</span>)(?#Username:Password)(?:<span style="color: #000099; font-weight: bold;">\w</span>+:<span style="color: #000099; font-weight: bold;">\w</span>+@)?(?#Subdomains)(?:(?:[-<span style="color: #000099; font-weight: bold;">\w</span>]+<span style="color: #000099; font-weight: bold;">\.</span>)*(?#TopLevel Domains)(?:[a-z]+<span style="color: #000099; font-weight: bold;">\.</span>?))(?#Port)(?::[<span style="color: #000099; font-weight: bold;">\d</span>]{1,5})?<span style="color: #000099; font-weight: bold;">\.</span>?(?#Directories)(?:(?:(?:<span style="color: #000099; font-weight: bold;">\/</span>(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})+)+|<span style="color: #000099; font-weight: bold;">\/</span>|)+|<span style="color: #000099; font-weight: bold;">\?</span>|#)?(?#Query)(?:(?:<span style="color: #000099; font-weight: bold;">\?</span>(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>{2}])+=?(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})*)(?:&amp;amp;(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>{2}])+=?(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})*)*)*(?#Anchor)(?:#(?:[-<span style="color: #000099; font-weight: 
bold;">\w</span>~!$+|.,*:=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})*)?|(?#BareURL)(?#Username:Password)(?:<span style="color: #000099; font-weight: bold;">\w</span>+:<span style="color: #000099; font-weight: bold;">\w</span>+@)?(?#Subdomains)(?:(?:[-<span style="color: #000099; font-weight: bold;">\w</span>]+<span style="color: #000099; font-weight: bold;">\.</span>)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[<span style="color: #000099; font-weight: bold;">\d</span>]{1,5})?<span style="color: #000099; font-weight: bold;">\.</span>?(?#Directories)(?:(?:(?:<span style="color: #000099; font-weight: bold;">\/</span>(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})+)+|<span style="color: #000099; font-weight: bold;">\/</span>|)+|<span style="color: #000099; font-weight: bold;">\?</span>|#)?(?#Query)(?:(?:<span style="color: #000099; font-weight: bold;">\?</span>(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>{2}])+=?(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})*)(?:&amp;amp;(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>{2}])+=?(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})*)*)*(?#Anchor)(?:#(?:[-<span style="color: #000099; font-weight: bold;">\w</span>~!$+|.,*:=]|%[a-f<span style="color: #000099; font-weight: bold;">\d</span>]{2})*)?)'</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">return</span> prog.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">'&lt;a href=&quot;<span style="color: #000099; font-weight: bold;">\g</span>&lt;0&gt;&quot;&gt;<span style="color: #000099; font-weight: bold;">\g</span>&lt;0&gt;&lt;/a&gt;'</span>, <span style="color: #008000;">str</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#single url</span>
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'http://tikirobot.net<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#string</span>
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'See http://tikirobot.net for more info.<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#string w/ anchor</span>
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'See http://tikirobot.net/#foo for more info.<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#two urls in a string</span>
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'See http://tikirobot.net or http://wikipedia.org for more info.<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#some test urls from flanders.co.nz</span>
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'This is a google search: http://www.google.com/search?q=good+url+regex&amp;amp;rls=com.microsoft:*&amp;amp;ie=UTF-8&amp;amp;oe=UTF-8&amp;amp;startIndex=&amp;amp;startPage=1<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'ftp://joe:password@ftp.filetransferprotocal.com is a ftp url<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'There is a bare url google.ru somewhere in this sentence<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#test cases from shag</span>
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'query string: https://some-url.com/?query=&amp;amp;name=joe&amp;amp;filter=*.*<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'both a query string and an anchor with no host name separator slash: https://some-url.com?query=&amp;amp;name=joe?filter=*.*#some_anchor<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'both a query string and an anchor: https://some-url.com/?query=&amp;amp;name=joe&amp;amp;filter=*.*#some_anchor<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'DNS name with a concluding period: http://some-url.com./<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'DNS name with a concluding period and query string: http://some-url.com./?foo=bar<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'single-component DNS name plus root: http://to./<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'single-component DNS name: http://to/<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> convertLinks<span style="color: black;">&#40;</span><span style="color: #483d8b;">'words with slashes: unread/unregistered<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span></pre></div></div>
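
<p>For comparison, here is a rough sketch of the same linkifying idea with a much smaller pattern, laid out with <code>re.VERBOSE</code> so the comments sit next to each piece instead of inside <code>(?#...)</code> groups. The pattern and the name <code>convert_links</code> are made up for this example. This is not Shag&#8217;s regex: it assumes every URL starts with an explicit scheme, so it will miss bare URLs like google.ru, and it has only been checked against the two samples shown.</p>

```python
import re

# Hypothetical, simplified pattern for illustration only (NOT the regex from the post).
# Assumption: every URL starts with an explicit scheme such as http:// or ftp://.
URL = re.compile(r'''
    (?:https?|ftp)://            # protocol
    (?:\w+:\w+@)?                # optional username:password
    [-\w]+(?:\.[-\w]+)*\.?       # host name, optional trailing dot
    (?::\d{1,5})?                # optional port
    (?:/[-\w~!$+|.,=%]*)*        # optional path segments
    (?:\?[-\w~!$+|.,*:=%&;]*)?   # optional query string
    (?:\#[-\w~!$+|.,*:=%]*)?     # optional anchor; the hash must be escaped in verbose mode
''', re.VERBOSE | re.IGNORECASE)


def convert_links(text):
    """Wrap every URL matched by the simplified pattern in an anchor tag."""
    # \g<0> is the whole match, so the pattern needs no outer capturing group.
    return URL.sub(r'<a href="\g<0>">\g<0></a>', text)


# The query string + anchor case from the post, with the ampersands
# already HTML-escaped the way the post's test strings have them.
sample = ('both a query string and an anchor: '
          'https://some-url.com/?query=&amp;name=joe&amp;filter=*.*#some_anchor')
print(convert_links(sample))
print(convert_links('See http://tikirobot.net for more info.'))
```

<p>The trade-off is the usual one: a short pattern is easier to read and debug, but it covers far fewer cases than the big regex above.</p>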

<p><del datetime="2010-01-27T04:27:08+00:00">Also note my double-grouping on the regex.. There must be a better way!</del><br />
Update: I figured out that the backreference for group zero is \g&lt;0&gt;. \0 doesn&#8217;t work, which is why I was double-grouping the regex so that I could use \1.</p>
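
<p>Here is a small sketch of that backreference behaviour in isolation. The pattern is a deliberately simple stand-in, not the regex from the post, and the names <code>URL</code>, <code>wrap</code>, and <code>nul</code> are only for illustration.</p>

```python
import re

# Hypothetical, deliberately simple pattern for illustration only;
# it is not the regex from the post.
URL = re.compile(r'https?://\S+')

text = 'See http://tikirobot.net/#foo for more info.'

# \g<0> refers to the entire match, so no extra capturing group is needed.
linked = URL.sub(r'<a href="\g<0>">\g<0></a>', text)
print(linked)


# A callable replacement is an equivalent alternative: group(0) is the whole match.
def wrap(match):
    url = match.group(0)
    return '<a href="%s">%s</a>' % (url, url)


assert URL.sub(wrap, text) == linked

# By contrast, \0 in a replacement template is read as an octal escape
# (a NUL character), not as a reference to group 0.
nul = URL.sub(r'\0', text)
print(repr(nul))
```

<p>The callable form is handy when the replacement needs more than plain substitution, for example escaping the URL before it goes into the <code>href</code>.</p>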
<p><del datetime="2010-01-29T02:20:35+00:00">Little help, regex ninjas?</del> Update 2: Shag to the rescue!</p>
<p><strong>Updates 3, 4, and 5</strong>: Shag has provided us with an even moar better regex in the comments. Yay Shag!!!!!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tikirobot.net/wp/2010/01/26/replacing-urls-with-links-halp/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
	</channel>
</rss>
