[Olpcaustria] linkchecker script

Thomas Perl thp at perli.net
Sat Dec 1 09:16:14 CET 2007


Hey Chris!

Chris Hager wrote:
> I've just written a simple linkchecker script in Python, and thought 
> someone could perhaps use it. It recurses through the links of a given 
> domain and stays inside that domain. For each link it finds, it reports 
> one of: ok, outside, 404, https
> 
>   http://wiki.laptop.org/go/Linkchecker.py

I could imagine that
  "if url in history:"
is faster than
  "if history.count(url) > 0:"
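A tiny sketch of the difference (the list contents here are made up, not taken from the script): "in" can stop at the first match, while count() always walks the whole list just to decide whether the result is greater than zero.

```python
# Hypothetical history list, for illustration only.
history = ["http://wiki.laptop.org/", "http://wiki.laptop.org/go/Home"]
url = "http://wiki.laptop.org/"

slow = history.count(url) > 0   # scans the entire list, then compares
fast = url in history           # short-circuits at the first hit

# Both give the same answer; "in" just does less work on average.
```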

Also, you should probably use BeautifulSoup[1] to parse the HTML and
extract the links from there instead of trying to extract the URLs "by
hand". That helps in situations where the markup is so awkward that
there are no quotes around the value of the href= attribute, for
example. Your current code would also probably do the wrong thing when
the href attribute is something like href="http://example.org/bla'blub.html".
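To illustrate the idea without assuming BeautifulSoup is installed (its import path has also changed since 2007), here is a sketch using Python's standard-library HTML parser; a real parser copes with both the unquoted attribute and the embedded apostrophe that a hand-rolled string search would trip over:

```python
from html.parser import HTMLParser  # stdlib; BeautifulSoup does the same with less code


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags via a real HTML parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Two cases that break naive string matching: an unquoted attribute
# and a quoted value containing a single quote.
html = ('<a href=http://example.org/unquoted>one</a>'
        '<a href="http://example.org/bla\'blub.html">two</a>')

parser = LinkExtractor()
parser.feed(html)
# parser.links now holds both URLs, extracted intact.
```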

You also often concatenate strings with the "+" operator.
''.join([s1, s2, ...]) might be faster, but given that the script is
doing network access, I don't know if it would bring much of a
performance improvement.
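For example (the URL pieces here are invented for illustration), join() builds the result in one pass instead of allocating an intermediate string at every "+":

```python
# Hypothetical URL fragments, not taken from the script.
parts = ["http://", "wiki.laptop.org", "/go/", "Linkchecker.py"]

url = "".join(parts)
# equivalent to parts[0] + parts[1] + parts[2] + parts[3],
# but without building a throwaway string at each "+"
```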

For the URL history list, maybe a set[2] might be better suited?
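A set gives constant-time membership tests (a hash lookup instead of a list scan) and ignores duplicate URLs for free; a minimal sketch with made-up URLs:

```python
history = set()
history.add("http://wiki.laptop.org/")
history.add("http://wiki.laptop.org/")   # duplicate, silently ignored

seen = "http://wiki.laptop.org/" in history   # O(1) hash lookup
```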

Nice and useful utility, btw.. :)

[1] http://www.crummy.com/software/BeautifulSoup/
[2] http://docs.python.org/lib/types-set.html


Thomas
