Using Haystack to index non-database content
Over on ReadTheDocs, I wanted to build search around the documentation that we're hosting. I chose Haystack and Solr for this, because together they're the best way to do search in Django these days. However, I've only ever used Haystack to index content that is in the database. I thought about trying to add all the rendered HTML from the documentation into the database, but that was a non-starter.
I ended up adding an ImportedFile model to the database, which would contain the metadata for the HTML file:
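from django.db import models

class ImportedFile(models.Model):
    # Metadata only; the rendered HTML itself stays on disk.
    project = models.ForeignKey(Project, related_name='imported_files')
    name = models.CharField(max_length=255)
    slug = models.SlugField()
    # Joined onto the project's HTML root when the file is indexed.
    path = models.CharField(max_length=255)
    md5 = models.CharField(max_length=255)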
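As a rough illustration of where those rows might come from, here is a minimal sketch that walks a directory of rendered HTML and collects the name, relative path, and MD5 digest of each file. The function name `collect_html_metadata` and the dictionary layout are hypothetical and not taken from the ReadTheDocs code; the sketch only shows one plausible way to gather the fields the model stores.

```python
import hashlib
import os


def collect_html_metadata(html_root):
    """Yield one metadata dict per .html file found under html_root.

    Hypothetical helper: the keys mirror the ImportedFile fields,
    but this is not the actual ReadTheDocs import code.
    """
    for dirpath, _dirnames, filenames in os.walk(html_root):
        for filename in sorted(filenames):
            if not filename.endswith('.html'):
                continue
            full_path = os.path.join(dirpath, filename)
            # Hash the raw bytes so the digest does not depend on encoding.
            with open(full_path, 'rb') as handle:
                digest = hashlib.md5(handle.read()).hexdigest()
            yield {
                'name': filename,
                # Stored relative to the root, so it can be re-joined later.
                'path': os.path.relpath(full_path, html_root),
                'md5': digest,
            }
```

Each dictionary could then be saved as an ImportedFile row for the project, for example with Django's `get_or_create`. Presumably the digest is there so that unchanged files can be detected, though the post does not say how it is used.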
Having a model for each file gives the Haystack SearchIndex something to attach to. The interesting part is in the SearchIndex itself, where I override the prepare_text method so that the text is read from the filesystem instead of from the database.
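import codecs
import os

from haystack import site
from haystack.indexes import CharField, SearchIndex

class ImportedFileIndex(SearchIndex):
    text = CharField(document=True)
    author = CharField(model_attr='project__user')
    project = CharField(model_attr='project__name')
    title = CharField(model_attr='name')

    def prepare_text(self, obj):
        # Read the rendered HTML from disk rather than from a model field.
        full_path = obj.project.full_html_path
        # lstrip('/') keeps os.path.join from treating obj.path as absolute.
        to_read = os.path.join(full_path, obj.path.lstrip('/'))
        try:
            with codecs.open(to_read, encoding="utf-8", mode='r') as f:
                return f.read()
        except IOError:
            # Report the file that was actually missing.
            print("%s not found" % to_read)

site.register(ImportedFile, ImportedFileIndex)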
This means that I don't have to bloat my database with all my rendered HTML, but the full HTML is still stored in Solr, where it can be queried.
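One detail in prepare_text is worth spelling out: the `lstrip('/')` on the stored path. The sketch below shows what happens without it, then wraps the read in a small standalone function. The paths and the `read_html` name are made up for illustration, and `posixpath` is used in the first part only so that the demonstration behaves the same on any operating system.

```python
import codecs
import os
import posixpath

root = '/var/docs/project'   # hypothetical HTML root
stored = '/api/index.html'   # hypothetical stored path with a leading slash

# An absolute second argument makes join() discard the root entirely.
naive = posixpath.join(root, stored)
# Stripping the leading slash keeps the result underneath the root.
safe = posixpath.join(root, stored.lstrip('/'))


def read_html(html_root, stored_path):
    """Return the file's text, or None if it cannot be read."""
    to_read = os.path.join(html_root, stored_path.lstrip('/'))
    try:
        with codecs.open(to_read, encoding='utf-8', mode='r') as handle:
            return handle.read()
    except IOError:
        return None
```

Here `naive` ends up pointing at `/api/index.html`, outside the project directory, while `safe` stays under the root. Returning None for a missing file mirrors the original method, which falls through without a return value when the read fails.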
Posted at 12:01 a.m. on November 17, 2010
Comments: 4
Welcome to the home of Eric Holscher on the web. I talk about software development, mostly in the realm of Django. I am interested in the real time web, testing, mobile apps, and other things.


Comments
1 Stefan says...
Bookmarked. Might use this for future versions of django-sphinxdoc. Thank you. :)
Posted at 2:34 a.m. on November 17, 2010
2 apollo13 says...
http://readthedocs.org/search/?q=search Oh noes it fails :)
Posted at 3:40 a.m. on November 17, 2010
3 Eric Holscher says...
apollo13: Heh. That's me not handling deleting of projects gracefully. Should be fixed now.
Note that update_index doesn't remove things from the index, so deleted files hang around. rebuild_index does, but it requires deleting your index first.
Posted at 9:38 a.m. on November 17, 2010
4 Thejaswi Puthraya says...
Interesting way to solve the problem. For a similar situation, I've used ElasticSearch (http://www.elasticsearch.com/).
And for database content, I've used the serializer to output JSON that is indexed by elastic search.
Posted at 4:14 a.m. on November 27, 2010