Nov 17

Using Haystack to index non-database content

Over on ReadTheDocs, I wanted to build search around the documentation that we're hosting. I chose Haystack and Solr because that combination is the best way to do search in Django these days. However, I had only ever used Haystack to index content that lives in the database. I thought about loading all the rendered HTML from the documentation into the database, but that was a non-starter: it would bloat the database with content that already exists on disk.

I ended up adding an ImportedFile model to the database, which contains only the metadata for each HTML file:

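from django.db import models

# Project is the existing model for a hosted documentation project.
class ImportedFile(models.Model):
    """Metadata for one rendered HTML file; the HTML itself stays on disk."""
    project = models.ForeignKey(Project, related_name='imported_files')
    name = models.CharField(max_length=255)
    slug = models.SlugField()
    path = models.CharField(max_length=255)  # joined onto the project's HTML directory
    md5 = models.CharField(max_length=255)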
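
As a rough illustration of what that metadata might look like for a single file, here is a minimal sketch in plain Python with no Django involved. The file_metadata helper is hypothetical, and it assumes that name is the file's base name and that md5 holds the hex digest of the file's bytes; the post does not say how either field is actually filled in, and slug is left out for the same reason.

```python
import hashlib
import os
import tempfile


def file_metadata(html_root, relative_path):
    """Collect name, path and md5 for one rendered file (hypothetical helper)."""
    full_path = os.path.join(html_root, relative_path.lstrip('/'))
    with open(full_path, 'rb') as handle:
        # Assumption: md5 is the hex digest of the raw file contents.
        digest = hashlib.md5(handle.read()).hexdigest()
    return {
        'name': os.path.basename(relative_path),  # assumption: base name
        'path': relative_path,
        'md5': digest,
    }


# Throwaway directory standing in for a project's rendered HTML.
html_root = tempfile.mkdtemp()
with open(os.path.join(html_root, 'index.html'), 'wb') as handle:
    handle.write(b'hello')

meta = file_metadata(html_root, '/index.html')
```

A digest like this would let an import step notice when a file's contents have changed, though whether ReadTheDocs uses it that way is a guess.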

Having a model gives the Haystack SearchIndex something to attach to. The interesting part is in the SearchIndex itself, where I override the prepare_text method so that the text is read from the filesystem instead of from the database.

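import codecs
import os

from haystack import site
from haystack.indexes import CharField, SearchIndex

class ImportedFileIndex(SearchIndex):
    text = CharField(document=True)
    author = CharField(model_attr='project__user')
    project = CharField(model_attr='project__name')
    title = CharField(model_attr='name')

    def prepare_text(self, obj):
        # Read the rendered HTML from disk instead of from a model field.
        full_path = obj.project.full_html_path
        # Strip the leading slash so os.path.join keeps full_path.
        to_read = os.path.join(full_path, obj.path.lstrip('/'))
        try:
            with codecs.open(to_read, encoding="utf-8", mode='r') as f:
                return f.read()
        except IOError:
            # Report the file that was missing, not just its directory.
            print("%s not found" % to_read)
            # Fall back to an empty document rather than returning None.
            return u''

site.register(ImportedFile, ImportedFileIndex)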

This means that I don't have to bloat my database with all my rendered HTML, but the full HTML is still stored in Solr, where it can be queried.
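
The file-reading part of prepare_text can be tried outside of Django and Haystack. The following is a minimal sketch, not the ReadTheDocs code: read_rendered_file is a hypothetical stand-in for the body of prepare_text, and the temporary directory stands in for a project's full_html_path. It also shows why the lstrip('/') is there: os.path.join throws away the earlier components when a later one is an absolute path.

```python
import codecs
import os
import tempfile


def read_rendered_file(html_root, relative_path):
    """Return the text of a rendered file under html_root, or '' if missing."""
    # Without lstrip('/'), a path like '/index.html' would be treated as
    # absolute and html_root would be dropped from the joined path.
    to_read = os.path.join(html_root, relative_path.lstrip('/'))
    try:
        with codecs.open(to_read, encoding='utf-8', mode='r') as handle:
            return handle.read()
    except IOError:
        # Assumption: a missing file is indexed as an empty document.
        return u''


# Throwaway directory standing in for a project's HTML output.
html_root = tempfile.mkdtemp()
index_path = os.path.join(html_root, 'index.html')
with codecs.open(index_path, encoding='utf-8', mode='w') as handle:
    handle.write(u'<h1>Hello</h1>')

found = read_rendered_file(html_root, '/index.html')
missing = read_rendered_file(html_root, '/nope.html')
```

Here found holds the contents of the file and missing is an empty string, so a file that has disappeared from disk does not raise while the index is being built.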


Comments

1 Stefan says...

Bookmarked. Might use this for future versions of django-sphinxdoc. Thank you. :)

Posted at 2:34 a.m. on November 17, 2010

2 apollo13 says...

http://readthedocs.org/search/?q=search Oh noes it fails :)

Posted at 3:40 a.m. on November 17, 2010

3 Eric Holscher says...

apollo13: Heh. That's me not handling deleting of projects gracefully. Should be fixed now.

Note that update_index doesn't remove things from the index, so deleted files hang around. rebuild_index does, but it requires deleting your index first.

Posted at 9:38 a.m. on November 17, 2010

4 Thejaswi Puthraya says...

Interesting way to solve the problem. For a similar situation, I've used ElasticSearch (http://www.elasticsearch.com/).

And for database content, I've used the serializer to output JSON that is indexed by elastic search.

Posted at 4:14 a.m. on November 27, 2010
