Get search done with PostgreSQL
A few words of wisdom…
A search feature is now a part of almost every website regardless of its size or purpose.
If you’re a Django developer, then you’re lucky to have django-haystack at your disposal to hook up any of its available search engines, such as ElasticSearch or Solr. There’s also a library called ElasticUtils, for dealing exclusively with the ElasticSearch engine (if you wish to have more control over your searching power).
But what if you only need to plug in simple search functionality to your awesome new blogging website?
Well, if you’re using Django together with PostgreSQL as your database engine (which you should!), you can actually use some of PostgreSQL’s built-in features to get search running quickly.
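"Using PostgreSQL as your database engine" here just means pointing Django’s DATABASES setting at the PostgreSQL backend, roughly like this (all of the names below are placeholders, so adjust them to your own setup):

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "blog",
        "USER": "blog",
        "PASSWORD": "secret",
        "HOST": "localhost",
        "PORT": "5432",
    }
}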
I’m about to show you how to set it up in a few simple steps. Also, I will add a quick tip on how to add autocomplete super-powers.
Setup
First things first, you need a small third-party app called djorm-ext-pgfulltext:
pip install djorm-ext-pgfulltext
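Depending on the version, you may also need to add it to INSTALLED_APPS so that its update_search_field management command (used below) is picked up:

INSTALLED_APPS = (
    # ... your other apps ...
    'djorm_pgfulltext',
)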
Make sure the unaccent extension is enabled in your database and marked as immutable, so it can be used when building the search index:

CREATE EXTENSION unaccent;
ALTER FUNCTION unaccent(text) IMMUTABLE;
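If you’d rather keep that in version control than run it by hand, on Django 1.7+ you could wrap it in a migration; here’s a minimal sketch (the app name and dependency are placeholders):

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        # e.g. ('search', '0001_initial') -- adjust to your app
    ]

    operations = [
        # note: CREATE EXTENSION usually requires superuser privileges
        migrations.RunSQL("CREATE EXTENSION IF NOT EXISTS unaccent;"),
        migrations.RunSQL("ALTER FUNCTION unaccent(text) IMMUTABLE;"),
    ]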
Next, let’s assume you have a simple blog post model and you want to be able to search by its content field:
from django.db import models
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField


class BlogPost(models.Model):
    title = models.CharField(max_length=255, default="")
    content = models.TextField(default="")
    search_index = VectorField()

    objects = SearchManager(
        fields=('content',),
        config='pg_catalog.english',
        search_field='search_index',
        auto_update_search_field=True
    )
This setup will automatically update the index field when the object is saved.
If you already had some data, you would need to run the following to rebuild the index fields:

./manage.py update_search_field [appname]
…and we’re done with the setup part.
Search
Now let’s see how we go about searching through the content.
For testing purposes I’ve added ~50k objects with some random text data.
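If you want to generate similar throwaway data yourself, something along these lines works (the word list and counts here are arbitrary, not what I used); note that bulk_create bypasses the regular save path, so you’d still run update_search_field afterwards to populate the index:

import random

WORDS = ["lorem", "ipsum", "dolor", "text1", "text2", "django", "postgres", "search"]

posts = [
    BlogPost(
        title="post %d" % i,
        content=" ".join(random.choice(WORDS) for _ in range(100)),
    )
    for i in range(50000)
]
# bulk_create does not send post_save signals, so rebuild the index afterwards
# with: ./manage.py update_search_field [appname]
BlogPost.objects.bulk_create(posts)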
Let’s say we want to search for either of the keywords text1 or text2. Here’s how we do it:
word_list = ['text1', 'text2']
query = " | ".join(word_list)
qs = BlogPost.objects.search(query, rank_field="rank", raw=True)[:15]
The library includes a custom manager method called search, which takes as arguments:
- the query string
- a rank field to be included in the results, so the search results will be ordered by the highest rank value (see the example after this list)
- a raw option to indicate usage of the to_tsquery function
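Assuming rank_field behaves as described above, each returned object should expose the computed rank under that name, so you can eyeball the ordering with something like this (a quick sketch, not part of the library docs):

for post in qs:
    # "rank" is the attribute named by rank_field; a higher value means a better match
    print("%s -> %s" % (post.title, post.rank))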
Basically, to_tsquery is the more powerful option, since it allows us to define our own search query instead of a plain keyword match. You can read more about the to_tsquery function and ranks here.
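To give a feel for what "defining our own query" means: to_tsquery accepts a small boolean syntax on top of plain lexemes, with & for AND, | for OR, ! for NOT, and :* for prefix matching. Purely as illustrative examples (the query strings below are made up, not from my test data):

# both words must appear
BlogPost.objects.search("quick & fox", rank_field="rank", raw=True)

# either word may appear
BlogPost.objects.search("quick | dog", rank_field="rank", raw=True)

# one word must appear, the other must not
BlogPost.objects.search("quick & !dog", rank_field="rank", raw=True)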
While testing on 50k rows, I got query times of ~130ms, which by search standards is not that great, and it’s also not good enough for a responsive search widget.
So how can we make it faster?
When looking at the database indexes, we notice the search_index field has a btree index type:

"search_blogpost_a71a185f" btree (search_index)
which is not of much use here, since a btree index can’t speed up full-text matching against the search index field.
If you look through the documentation, you’ll find out there are actually better index types that we can use, namely GiST and GIN.
GIN seemed to give the best results in my case. After adding it:
"gin_idx" gin (search_index)
…I got an almost 3x speed-up on the same search (~53ms). Obviously this will make inserts and updates a bit slower, but it’s a small price to pay. Actually, the real performance killer here is the additional rank field, since the ts_rank function needs to go over the tsvector summaries of all returned documents, thereby slowing down the query.
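If ranking isn’t essential for a particular view, one way to sidestep that cost (assuming the manager is happy without a rank_field, which is worth verifying against your version of the library) is simply to skip it:

# no ranking: faster, but matches come back without relevance ordering
qs = BlogPost.objects.search(query, raw=True)[:15]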
Nevertheless, we’ve got pretty decent speeds for such a large amount of data, and it’s only by using PostgreSQL goodies!
But wait! There’s more…
There’s a chance you’ll need to add autocomplete to your search feature. Standard keyword searching won’t really cut it anymore. Here’s where the raw search queries come to the rescue:
q = 'search phrase'
word_list = map(lambda word: word + ':*', tokenize(q))
query = " | ".join(word_list)
qs = BlogPost.objects.search(query, rank_field="rank", raw=True)[:15]
The tokenize function is taken from this snippet, where you can also find a brief explanation of what it does.
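If you’d rather not chase the link, a minimal stand-in for tokenize could look like the function below; this is only an assumption about what such a helper does, not a copy of the linked snippet:

import re


def tokenize(text):
    # split the phrase into lowercase word tokens, dropping punctuation and empty strings
    return [token for token in re.split(r"\W+", text.lower()) if token]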
So there ya go! All that searching functionality just by using PostgreSQL power! All done within a couple of lines of code, together with some database enhancements.