I’ve converted my Retlang spider example over to IronPython in order to get a feel for the differences. Here are some notes:

Python list comprehensions can do the same as LINQ for simple cases, but LINQ is much more powerful, and it supports deferred execution, while list comprehensions are evaluated greedily. UPDATE: Thanks to Mark for pointing out that generators support deferred execution. There’s still no syntax for grouping or ordering, but these are relatively rare cases.
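A minimal sketch of the difference (plain Python, nothing IronPython-specific): the list comprehension is evaluated on the spot, while the generator expression isn’t evaluated until you iterate it, so it sees changes made to the source list in the meantime.

```python
nums = [1, 2, 3]
squares_list = [n * n for n in nums]   # list comprehension: evaluated greedily
squares_gen = (n * n for n in nums)    # generator expression: deferred execution

nums.append(4)

eager = squares_list        # snapshot from before the append
lazy = list(squares_gen)    # evaluated now, so it sees the appended element
print(eager)                # [1, 4, 9]
print(lazy)                 # [1, 4, 9, 16]
```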

Every description of the Python syntax I ever see emphasizes the fact that you don’t need to put in braces. Pity they don’t spend more time telling you that you have to put in colons; that would actually be useful knowledge. This really bit me when I learnt Boo, which is syntactically very similar.
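To illustrate the point: the colon after a block header isn’t optional, and leaving it out is a straight syntax error, which you can see by compiling the two snippets below.

```python
ok = "if x > 0:\n    x = x - 1"          # colon present: compiles
missing_colon = "if x > 0\n    x = x - 1"  # colon missing: SyntaxError

compile(ok, "<snippet>", "exec")

try:
    compile(missing_colon, "<snippet>", "exec")
    colon_required = False
except SyntaxError:
    colon_required = True
print(colon_required)  # True
```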

IronPython 1.0 targets CPython 2.4. This sounds fine until you realize that this was released in 2004. A fair bit has happened since then, not least the introduction of an inline if syntax.

While we’re on the subject, the inline if (defaultValue if condition else exceptionValue) is actually quite cool.
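A quick sketch of it in use (the `parse_timeout` helper is just a made-up example):

```python
def parse_timeout(raw):
    # inline if: <value-if-true> if <condition> else <value-if-false>
    return int(raw) if raw.isdigit() else 30  # fall back to a default

print(parse_timeout("15"))    # 15
print(parse_timeout("oops"))  # 30
```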

The fairly lax approach to types means I don’t need to call .Cast<Match> anymore.

Tabs are the devil as far as indents in Boo and Python are concerned. I highly recommend using something that shows tabs explicitly, and then eliminating the lot.

Instance methods that explicitly take self feel awkward to me.

Static methods that require an incantation such as “ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)” also feel pretty awkward. UPDATE: I’ve now learnt about decorators, thanks to Ken. @staticmethod is relatively readable.
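The two spellings side by side (the `UrlHelper` classes and `Normalize` method are hypothetical names, not from the spider):

```python
class UrlHelper:
    # Old style: define the method, then rebind it through staticmethod()
    def Normalize(url):
        return url.rstrip("/").lower()
    Normalize = staticmethod(Normalize)

class UrlHelper2:
    # Python 2.4+ decorator style: same effect, declared where the method is defined
    @staticmethod
    def Normalize(url):
        return url.rstrip("/").lower()

print(UrlHelper.Normalize("HTTP://Example.com/"))   # http://example.com
print(UrlHelper2.Normalize("HTTP://Example.com/"))  # http://example.com
```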

The casting to delegate isn’t quite as slick as it is in C#. Passing in “spiderTracker.FoundUrl” instead of “lambda url: spiderTracker.FoundUrl(url)” results in an extremely unhelpful runtime error.

The lambda syntax is pretty elegant, but so is C#3’s. Indeed, C#3 seems to have the edge.

Python’s regular expressions are powerful, but not quite as good as .NET’s. In particular, search does what you’d expect match to do, and findall doesn’t return matches. Rather, it returns the value of the first group of each match. This is often what you actually wanted, but it’s a bit peculiar. The semantics of Matches in .NET are much easier to understand.
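A small demonstration of both quirks, using a made-up HTML snippet:

```python
import re

content = '<a href="/about">About</a> <a href="/blog">Blog</a>'

# findall returns the captured group's value for each hit, not match objects
hrefs = re.findall('href="([^"]+)"', content)

# finditer yields match objects, which is closer to .NET's Regex.Matches
spans = [(m.group(1), m.start()) for m in re.finditer('href="([^"]+)"', content)]

# match only looks at the start of the string; search scans anywhere
anchored = re.match("blog", content)   # None - the string starts with '<'
scanned = re.search("blog", content)   # finds "blog" inside "/blog"

print(hrefs)
print(anchored is None, scanned is not None)
```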

It did seem rather slow compared to the C# version. There are, of course, way too many variables here to make a final judgement, but it was disappointing.

So, I’m still waiting for my ESR moment: I can’t say my program ran correctly the first time I ran it.

The code

A couple of notes if you’re looking at this in detail:

You’ll need to explicitly declare where the Retlang DLL is.

I’ve inlined a couple of functions, since the lambda syntax seemed less work than the class method syntax.

It handles proxies better than the original version.

There’s a minor bug fix from the previous version to encourage it to ignore hrefs in the HTML. I’m not intending to work this up into a working HTML parser, so there will definitely be other bugs in this space.

import clr
import re
from System import *
from System.Net import *
from System.IO import *
from System.Threading import *

clr.AddReferenceToFileAndPath("""c:\WhereeverRetlangIs\Retlang.dll""")

from Retlang import *

def Search(baseUrl, spiderThreadsCount):
    queues = []
    spiderChannel = QueueChannel[str]()
    spiderTrackerChannel = Channel[str]()
    finishedTrackerChannel = Channel[str]()

    waitHandle = AutoResetEvent(False)
    spiderTracker = SpiderTracker(spiderChannel, waitHandle)

    spiderTrackerQueue = PoolQueue()
    spiderTrackerQueue.Start()
    spiderTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FoundUrl(url))
    finishedTrackerChannel.Subscribe(spiderTrackerQueue,
        lambda url: spiderTracker.FinishedWithUrl(url))
    for index in range(spiderThreadsCount):
        queue = PoolQueue()
        queues.append(queue)
        queue.Start()
        spider = Spider(spiderTrackerChannel, finishedTrackerChannel, baseUrl)
        # Bind spider as a default argument so each subscription keeps its own instance
        spiderChannel.Subscribe(queue,
            lambda url, spider=spider: spider.FindReferencedUrls(url))
    spiderTrackerChannel.Publish(baseUrl)

    waitHandle.WaitOne()
    return spiderTracker.FoundUrls()

class Spider:
    def __init__(self, spiderTracker, finishedTracker, baseUrl):
        self._spiderTracker = spiderTracker
        self._finishedTracker = finishedTracker
        self._baseUrl = baseUrl.lower()

    def FindReferencedUrls(self, pageUrl):
        content = self.GetContent(pageUrl)
        searchUrls = lambda pattern: [match for match in re.findall(pattern, content)]

        # Three patterns: single-quoted, double-quoted, and unquoted href values
        urls = [self.ToAbsoluteUrl(pageUrl, url)
            for url in searchUrls("href=[']([^'<>]+)[']")
                + searchUrls('href=["]([^"<>]+)["]')
                + searchUrls("href=([^'\"<>]+)")
            if url is not None and url.Length > 0
                and self.IsInternalLink(url)
                and url[0] != '#'
                and not url.endswith(".css")
                and not re.search("css[.]axd", url)
            ]
        for newUrl in urls:
            self._spiderTracker.Publish(newUrl)
        self._finishedTracker.Publish(pageUrl)

    def IsInternalLink(self, url):
        url = url.lower()
        if url == '"' or url == "'":
            return False
        if url.startswith(self._baseUrl):
            return True
        if url.startswith("http") or url.startswith("ftp") or url.startswith("javascript"):
            return False
        if re.search("javascript-error", url) or re.search("lt;", url):
            return False
        return True

    def ToAbsoluteUrl(url, relativeUrl):
        if re.search("//", relativeUrl):
            return relativeUrl
        BaseUrlIndex = lambda u: u.find('/', u.find("//") + 2)
        hashIndex = relativeUrl.find('#')
        if hashIndex >= 0:
            relativeUrl = relativeUrl[0:hashIndex]
        if len(relativeUrl):
            isRoot = relativeUrl.startswith("/")
            if isRoot:
                index = BaseUrlIndex(url)
            else:
                index = url.LastIndexOf('/') + 1
            if index < 0:
                raise Exception("The url %s is not correctly formatted." % url)
            return url[0:index] + relativeUrl
        return None

    def GetContent(self, url):
        # print "Request: " + url
        request = WebRequest.Create(url)
        request.Proxy = WebRequest.DefaultWebProxy
        response = request.GetResponse()
        try:
            reader = StreamReader(response.GetResponseStream())
            try:
                return reader.ReadToEnd()
            finally:
                reader.Dispose()
        finally:
            response.Dispose()

    ToAbsoluteUrl = staticmethod(ToAbsoluteUrl)

class SpiderTracker:
    def __init__(self, spider, waitHandle):
        self._spider = spider
        self._waitHandle = waitHandle
        self._knownUrls = set()
        self._urlsInProcess = 0

    def FoundUrls(self):
        return sorted(self._knownUrls)

    def FoundUrl(self, url):
        if url not in self._knownUrls:
            self._knownUrls.add(url)
            # Path.GetExtension returns the extension with its leading dot
            if Path.GetExtension(url) != ".css":
                self._urlsInProcess = self._urlsInProcess + 1
                self._spider.Publish(url)

    def FinishedWithUrl(self, url):
        self._urlsInProcess = self._urlsInProcess - 1
        print self._urlsInProcess
        if self._urlsInProcess == 0:
            self._waitHandle.Set()

for url in Search("http://www.yourtargeturl.com/", 5):
    print url

