Announcing Django Crawler and django-test-utils

Today I'm releasing a new project called django-test-utils. It's rather empty at the moment, but it does have one cool feature: my Django Crawler. I have some big plans for this little guy, but for the moment it has enough functionality to make it pretty useful.

Usage

The crawler has a handful of options implemented at the moment. I will outline them below and show an example of the output. It is implemented as a management command named crawlurls: add test_utils to your INSTALLED_APPS and run ./manage.py crawlurls. It crawls your site using the Django Test Client, so no network traffic is required! This also gives the crawler intimate knowledge of your Django code, which allows features that other crawlers can't have.
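As a rough sketch of the crawl itself (not the actual code in django-test-utils): start at the root URL, fetch each page, and follow any local links found in the response. The fetch callable, the PAGES dictionary, and fake_fetch are hypothetical stand-ins for the Django Test Client so the example runs without a Django project. This sketch also drops query strings, which the real crawler appears to keep.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(fetch, start="/"):
    """Breadth-first crawl; fetch(url) must return (status_code, html)."""
    seen = {start}
    queue = deque([start])
    results = {}
    while queue:
        url = queue.popleft()
        status, html = fetch(url)
        results[url] = status
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            parsed = urlparse(urljoin(url, href))
            # Stay on the local site: skip anything with a scheme or host.
            if parsed.scheme or parsed.netloc:
                continue
            path = parsed.path or "/"
            if path not in seen:
                seen.add(path)
                queue.append(path)
    return results


# Hypothetical pages standing in for a real site. With Django you would
# fetch via django.test.Client().get(url) and read status_code and content.
PAGES = {
    "/": '<a href="/blog/">Blog</a> <a href="http://example.com/">Out</a>',
    "/blog/": '<a href="/blog/1/">Post</a> <a href="/">Home</a>',
    "/blog/1/": "<p>No links here</p>",
}


def fake_fetch(url):
    return (200, PAGES[url]) if url in PAGES else (404, "")


crawled = crawl(fake_fetch)
print(crawled)
```

Passing the fetch function in keeps the crawl loop independent of how pages are retrieved, which is what makes swapping in the test client (instead of real HTTP) straightforward.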

Core features

When it starts, the crawler loops through all of your URLConfs and loads up their regular expressions to examine later. Once the crawler is done crawling your site, it tells you which URLConf entries were never hit.
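The bookkeeping can be sketched roughly like this, assuming a flat list of regexes rather than real nested URLConfs. URL_PATTERNS and unmatched_patterns are made-up names for illustration, not part of django-test-utils.

```python
import re

# Hypothetical URLConf regexes, written without the leading slash.
URL_PATTERNS = [
    r"^blog/$",
    r"^blog/(?P<year>\d{4})/(?P<month>\w{3})/$",
    r"^resume/$",
    r"^account/register/$",
]


def unmatched_patterns(patterns, crawled_urls):
    """Return the patterns that no crawled URL matched."""
    compiled = [re.compile(p) for p in patterns]
    hit = set()
    for url in crawled_urls:
        path = url.lstrip("/")
        for regex in compiled:
            if regex.search(path):
                hit.add(regex.pattern)
                break  # first match wins, as in URL resolution
    return [p for p in patterns if p not in hit]


crawled = ["/blog/", "/blog/2007/oct/", "/blog/2008/feb/"]
for pattern in unmatched_patterns(URL_PATTERNS, crawled):
    print("NOT MATCHED:", pattern)
```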

-v --verbosity [0,1,2]

Same as most Django apps. Set it to 2 to get a lot of output. The default is 1, which only outputs errors.

-t --time

As the help text for the -t option says: "Pass -t to time your requests." This outputs the time it takes to run each request on your site. This option also tells the crawler to output the top 10 URLs that took the most time at the end of its run. Here is an example output from running it on my site with -t -v 2:

Getting /blog/2007/oct/17/umw-blog-ring/ ({}) from (/blog/2007/oct/17/umw-blog-ring/)
Time Elapsed: 0.256254911423 
Getting /blog/2007/dec/20/logo-lovers/ ({}) from (/blog/2007/dec/20/logo-lovers/)
Time Elapsed: 0.06906914711 
Getting /blog/2007/dec/18/getting-real/ ({}) from (/blog/2007/dec/18/getting-real/)
Time Elapsed: 0.211296081543 
Getting /blog/ ({u'page': u'5'}) from (/blog/?page=4)
Time Elapsed: 0.165636062622 
NOT MATCHED: account/email/
NOT MATCHED: account/register/
NOT MATCHED: admin/doc/bookmarklets/
NOT MATCHED: admin/doc/tags/
NOT MATCHED: admin/(.*)
NOT MATCHED: admin/doc/views/
NOT MATCHED: account/signin/complete/
NOT MATCHED: account/password/
NOT MATCHED: resume/
/blog/2008/feb/9/another-neat-ad/ took 0.743204
/blog/2007/dec/20/browser-tabs/#comments took 0.637164
/blog/2008/nov/1/blog-post-day-keeps-doctor-away/ took 0.522269
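A minimal sketch of how timing like this could be collected, assuming each request is just a function call. timed_crawl, slowest, and fake_fetch are hypothetical names, and fake_fetch simply sleeps in place of a real test-client request.

```python
import time


def timed_crawl(fetch, urls):
    """Call fetch(url) for each URL and record the elapsed seconds."""
    timings = {}
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        timings[url] = time.perf_counter() - start
    return timings


def slowest(timings, count=10):
    """Return the `count` slowest (url, seconds) pairs, slowest first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)[:count]


def fake_fetch(url):
    # Stand-in for a test-client request; sleeps longer for longer URLs.
    time.sleep(0.001 * len(url))


timings = timed_crawl(fake_fetch, ["/", "/blog/", "/blog/2008/feb/9/another-neat-ad/"])
for url, seconds in slowest(timings, count=2):
    print(f"{url} took {seconds:f}")
```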

-p --pdb

This option allows you to drop into pdb on an error in your site. This lets you look around the response, context, and other things to see what happened to cause the error. I don't know how useful this will be, but it seems like a neat feature to be able to have. I stole this idea from nose tests.
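The general pattern looks something like the sketch below, which uses the standard library's pdb.post_mortem. The function names are hypothetical and the debugger is passed in as a parameter so the example can run unattended; this is not necessarily how crawlurls wires it up.

```python
import pdb
import sys


def fetch_with_debugger(fetch, url, debugger=pdb.post_mortem):
    """Call fetch(url); on an exception, hand the traceback to the debugger."""
    try:
        return fetch(url)
    except Exception:
        traceback = sys.exc_info()[2]
        # With the default debugger this opens an interactive pdb prompt
        # at the frame where the exception was raised.
        debugger(traceback)
        return None


def broken_view(url):
    # Hypothetical view that always fails.
    raise ValueError(f"view for {url} is broken")


# A non-interactive stand-in for pdb so the example does not block.
captured = []
fetch_with_debugger(broken_view, "/resume/", debugger=captured.append)
print(len(captured))
```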

-s --safe

This option alerts you when you have escaped HTML fragments in your templates. This is useful for tracking down places where you aren't applying safe correctly, and other HTML related failures. This isn't implemented well, and might be buggy because I didn't have any broken pages on my site to test on :)
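One simple way to look for this, sketched under the assumption that an escaped tag shows up in the rendered page as &lt;tagname ...&gt;. The regex and the find_escaped_html name are illustrative only and will miss some cases, such as tags whose attributes contain other entities.

```python
import re

# Escaped tags such as &lt;p&gt; or &lt;/div&gt; that ended up in rendered output.
ESCAPED_TAG = re.compile(r"&lt;/?[a-zA-Z][^&]*?&gt;")


def find_escaped_html(html):
    """Return the escaped tag fragments found in a rendered page."""
    return ESCAPED_TAG.findall(html)


page = "<div>&lt;strong&gt;Hello&lt;/strong&gt; world, 1 &lt; 2</div>"
print(find_escaped_html(page))
```

Requiring a letter right after &lt; keeps a legitimately escaped less-than sign, as in "1 &lt; 2", from being reported.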

-r --response

This tells the crawler to store the response object for each URL it requests. This used to be the default behavior, but doing this bloats up memory. There isn't anything useful implemented on top of this feature, but with it you get a dictionary of request URLs with responses as their values. You can then go through and do whatever you want (including examining the Templates rendered and Contexts).

Considerations

At the moment, this crawler doesn't have a lot of end-user functionality. However, you can go in and edit the script at the end of the crawl to do almost anything. You are left with a dictionary of the URLs crawled, the time each took, and the response (if you use the -r option).

Future improvements

There are a lot of future improvements that I have planned. I want to enable the test client to log in as a user passed in from the command line. This should be pretty simple; I just haven't implemented it yet.

Another thing that I want to add, but haven't implemented yet, is fixtures. I want to be able to output a copy of the data returned from the crawler run. This will allow future runs of the crawler to diff against previous runs, creating a kind of regression test.
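Since this isn't implemented, the following is only a guess at what such a diff could look like. It assumes the saved data is a mapping of URL to status code stored as JSON; snapshot and diff_runs are hypothetical names.

```python
import json


def snapshot(results):
    """Serialize {url: status_code} so a later run can be compared against it."""
    return json.dumps(results, sort_keys=True, indent=2)


def diff_runs(previous_json, current):
    """Return URLs that were added, removed, or changed status between runs."""
    previous = json.loads(previous_json)
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "changed": sorted(u for u in shared if previous[u] != current[u]),
    }


old = snapshot({"/": 200, "/blog/": 200, "/resume/": 200})
new = {"/": 200, "/blog/": 500, "/about/": 200}
print(diff_runs(old, new))
```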

A third thing I want to implement is an option to only evaluate each URLConf entry X times, so that you could say "only hit /blog/[year]/[month]/ 10 times". This goes on the assumption that you are looking for errors in your views or templates, and you only need to hit each URL a couple of times. This also shouldn't be hard, but isn't implemented yet.
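A sketch of how such a cap might work, again assuming a flat list of regexes. limit_per_pattern is a hypothetical helper, not an existing option.

```python
import re
from collections import Counter


def limit_per_pattern(patterns, urls, max_hits):
    """Keep only the first max_hits URLs matching each pattern."""
    compiled = [re.compile(p) for p in patterns]
    hits = Counter()
    kept = []
    for url in urls:
        path = url.lstrip("/")
        for regex in compiled:
            if regex.search(path):
                if hits[regex.pattern] < max_hits:
                    hits[regex.pattern] += 1
                    kept.append(url)
                break
        # URLs that match no pattern are skipped in this sketch.
    return kept


patterns = [r"^blog/\d{4}/\w{3}/$", r"^blog/$"]
urls = ["/blog/", "/blog/2007/oct/", "/blog/2007/dec/", "/blog/2008/feb/", "/blog/"]
print(limit_per_pattern(patterns, urls, max_hits=2))
```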

The big pony that I want to make is to use multiprocessing on the crawler. The crawler doesn't hit the network, so it is CPU-bound, and on a machine with multiple cores multiprocessing should speed it up. One problem is that some of the timing and pdb features won't be as useful.

I would love to hear some people's feedback and thoughts on this. I think that this could be made into a really awesome tool. At the moment it works well for smaller sites, but it would be nice to be able to test only certain URLs in an app. There are lots of neat things I have planned, but I like following the release early, release often mantra.




Comments

1 JH says...

What's up kansas.

I'm not a django guy - but I'm guessing this means that you can do some interesting profiling and code introspection as well?

Posted at 9:45 p.m. on November 10, 2008

2 Adrian Holovaty says...

Hey, cool! I've wanted to do something like this for a very long time.

One quick question that I can't seem to answer by reading this entry or skimming the code: for URLpatterns that have wildcards in them, how does it decide what values to insert into the wildcards? Do you have to specify some sample values manually, or does it guess data, or does it just skip URLpatterns that have wildcards? That's always been the non-starter that has prevented me from implementing this.

Again, nice work.

Adrian

Posted at 2:38 a.m. on November 11, 2008

3 Tom Davis says...

Very clever idea and should be quite useful with some improvements. I wanted to comment briefly on your desire to add multiprocessing support.

Be very careful about making queries from within workers! I have annotated my issue in a (probably unrelated) ticket here: http://code.djangoproject.com/ticket/9409#comment:13 but to explain briefly, there seems to be a problem where workers do not open their database connections if a connection is already established in another process. This causes multiple workers to use the same connection which of course will cause nasty errors. The only workaround I've found for this is calling connection.close() before spawning workers, but even this doesn't entirely fix the issue. This never occurred when using Parallel Python (pp), though it has its own issues. Using locks does not appear to fix the issue; instead, it causes transactions to get "stuck" and never complete.

This has become such a serious problem for me that I've completely given up trying to fix it and am now taking steps to ensure that no workers ever make database queries unless those queries are done from a fresh connection through the base db library (psycopg2, in my case) and don't make use of Django models or the connection object.

Posted at 4:05 a.m. on November 11, 2008

4 Jeremy Dunck says...

Adrian, It's not that smart. ;-) Here: http://github.com/ericholscher/django-test-utils/tree/mas...

It starts at root ('/') and crawls any links found in the responses. It keeps track of hits based on regex of the crawled URLs. It doesn't generate any URLs.

Posted at 4:43 a.m. on November 11, 2008

5 Eric Holscher says...

@Adrian: No. It doesn't generate URLs. It simply crawls the site and sees which ones you are using. Then it compares this list against all of the possible ones, outputting ones that aren't used on your site (which presumably means they don't need to exist).

I've been thinking about creating heuristics from the URLs. So after running this I will have a list of all of the strings that matched each URLConf entry, and what value was used for each. It would be neat to have some kind of way to have it learn what was needed for each URL.

Another neat thing that I've thought about doing is learning what is normally used for the names of URLConf entries. For example, ?P<year> is usually \d{4} and between 2000-2008 or something like that. We could have a (possibly) distributed effort to learn what these things commonly map to. I don't know why this would be useful, but it seems neat.

@Tom thanks for the heads up! I haven't started playing with it, but your comments will really help if I start running into problems. If I can figure anything out, I'll let you know.

Posted at 5:03 a.m. on November 11, 2008

6 Tom Davis says...

Eric,

After another day of testing, anger and a lot of cigarettes, here are my findings... such as they are:

  1. Use django.db.connection.close() to rid yourself of any existing database connection. I mentioned this before, but it's an issue I was not able to get around.

  2. Import anything that may make db connections as local imports in workers. Multiprocessing uses some fancy magic (I guess) to give workers access to global imports even though this makes no sense to me since they shouldn't be allowed to share stuff from other processes.

  3. For more advanced db operations, specifically transactional ones, use a multiprocessing.Semaphore. After realizing my strategy of "no db access in workers" wasn't going to fly because something from Django was using a thread.Lock (these can't be pickled) I went back to the drawing board in a final, desperate attempt to find a locking mechanism that actually worked. The Semaphore was my only successful test.

After so much anguish, I am taking a "I'll believe it when I see it" stance on this strategy, i.e. it must hold up under heavy production load. That being said, preliminary tests have been promising.

Keep in mind these troubles could be all my own; if you find yourself in similar circumstances and don't come across these gnarly ghosts in the machine, do let me know!

(P.S. comment previews aren't sent through markdown; this may be deliberate, just felt like letting you know!)

Posted at 4:33 p.m. on November 12, 2008


Comments are closed.
