Michael Ovies dot Com


Postgres Document Search: Part One

I've learned over time that Postgres is incredibly powerful. Oftentimes, I don't think engineers appreciate what it's truly capable of. Here I'd like to take a moment to discuss document-based text search within Postgres. What do I mean by "document" exactly? Well, in this case I mean searching through paragraphs and paragraphs of text.

Text search in Postgres, for me, started out quite simply. I'd slap an ILIKE on it and pat myself on the back. Sometime after that I learned about GIN and GiST indexes. Fancy! And then after that I discovered the trigram index (via the pg_trgm extension) - even more fancy! But you can take it even further, and that's what I want to explore here.
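To make that concrete, here's a rough sketch of what those earlier approaches look like. The alerts table and its description column here are hypothetical stand-ins:

```sql
-- A case-insensitive substring match; without a supporting index this
-- forces a sequential scan over every row.
SELECT * FROM alerts WHERE description ILIKE '%pressure%';

-- The pg_trgm extension provides trigram operator classes so a GIN
-- (or GiST) index can accelerate that same kind of pattern match.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX alerts_description_trgm_idx
    ON alerts USING GIN (description gin_trgm_ops);
```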

Before we go further, why might we need something more than just a GIN, GiST, or trigram? Well, like I said previously, imagine we have paragraphs of text. A good example to pull from is weather alerts from the United States' National Oceanic and Atmospheric Administration (NOAA). If you haven't seen one before, here's a snippet:

At 1100 PM EDT...A low pressure system has been lingering offshore
of the Mid-Atlantic region since Sunday afternoon. Two low
pressure centers have developed approximately 240 miles off the
Mid-Atlantic and New England coasts. An occluded front is draped
on the northern side of both lows and extends northeast out into
the open Atlantic. National Weather Service Doppler Radar and
observations indicate onshore flow along the Southern New England
coast, pushing light to moderate rain showers across the
Northeast.

The weather alert I pulled that from has at least eight paragraphs of text and additional info for cities. There's a lot of data there. So let's take a look at how we can search through it.

Before we go further, though, let's cover some terminology. For this type of text search in Postgres we have these important terms: Documents, Tokens, Lexemes, and Dictionaries. A document is the body of text you want to search (a whole weather alert, in our case). Tokens are the raw fragments - words, numbers, whitespace, punctuation - that a parser chops the document into. Lexemes are tokens normalized into a searchable form: lowercased, stemmed, and with stop words like "the" filtered out. And dictionaries are what carry out that normalization.

Okay, those are the basics. You start with a "document" of text which is opaquely processed into tokens and then finally persisted as lexemes. It is these lexemes which you'll eventually search against.

Before we move into Part Two, I want to show how this works under the hood in Postgres. This process is normally opaque, but I find exploring this sort of thing handy for building understanding.

Getting Our Hands Dirty

First, let's start with Postgres' ts_parse function. This function takes text and converts it into tokens - the first step in the process of converting documents into lexemes. To follow along, pop open psql from the command line and try this SELECT:

SELECT * FROM ts_parse('default', 'the dog likes to run!');
This will output the following tokens:
 tokid | token
-------+-------
     1 | the
    12 |
     1 | dog
    12 |
     1 | likes
    12 |
     1 | to
    12 |
     1 | run
    12 | !
(10 rows)
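Those tokid values are the parser's token types: 1 is asciiword and 12 is blank (spaces and punctuation). You can list every type the default parser knows about with ts_token_type:

```sql
SELECT tokid, alias, description
  FROM ts_token_type('default')
 WHERE tokid IN (1, 12);
```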
Okay, cool, now we have our tokens. Let's try turning these into lexemes. Again from a psql terminal, enter this:
SELECT to_tsvector('pg_catalog.english','the dog likes to run!');
And this will output what we're after:
       to_tsvector
--------------------------
 'dog':2 'like':3 'run':5
Notice how "likes" was shortened to "like"? That's the dictionary normalization (stemming) at work. Also notice that the stop words "the" and "to" were dropped entirely, though their positions (1 and 4) still count toward the numbering. Cool, eh? And it's these lexemes which, ultimately, we'll be searching against in Part Two.
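If you want to poke at the dictionary step by itself, ts_lexize runs a single token through a single dictionary:

```sql
SELECT ts_lexize('english_stem', 'likes');  -- {like}
SELECT ts_lexize('english_stem', 'the');    -- {} (a stop word, so it's discarded)
```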

Before we wrap, let's loop this back to our original text from NOAA. Let's see what Postgres' function to_tsvector will give us:

SELECT to_tsvector('pg_catalog.english','At 1100 PM EDT...A low pressure system
has been lingering offshore of the Mid-Atlantic region since Sunday afternoon.
Two low pressure centers have developed approximately 240 miles off the
Mid-Atlantic and New England coasts. An occluded front is draped on the
northern side of both lows and extends northeast out into the open Atlantic.
National Weather Service Doppler Radar and observations indicate onshore flow
along the Southern New England coast, pushing light to moderate rain showers
across the Northeast.');
                                  to_tsvector
-------------------------------------------------------------------------
 '1100':2 '240':29 'across':82 'afternoon':21 'along':70 'approxim':28
 'atlant':17,35,59 'center':25 'coast':39,75 'develop':27 'doppler':63
 'drape':44 'edt':4 'england':38,74 'extend':53 'flow':69 'front':42
 'indic':67 'light':77 'linger':11 'low':6,23,51 'mid':16,34
 'mid-atlant':15,33 'mile':30 'moder':79 'nation':60 'new':37,73
 'northeast':54,84 'northern':47 'observ':66 'occlud':41 'offshor':12
 'onshor':68 'open':58 'pm':3 'pressur':7,24 'push':76 'radar':64
 'rain':80 'region':18 'servic':62 'shower':81 'side':48 'sinc':19
 'southern':72 'sunday':20 'system':8 'two':22 'weather':61
That's a lot of lexemes! Take a close look at them, though, and you'll see some shortened ones such as "pressur", "approxim", and "atlant". Isn't that cool? Before we dive into Part Two, you can imagine the performance difference between a basic ILIKE search against the full text of the weather alert and, say, a GIN or GiST index utilizing these lexemes.
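As a small teaser for that, a common pattern is to persist those lexemes in a dedicated tsvector column and index it. This is just a sketch, and again the alerts table and description column are hypothetical:

```sql
-- Postgres 12+ can keep the tsvector up to date automatically
-- with a stored generated column.
ALTER TABLE alerts
  ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (to_tsvector('english', description)) STORED;

-- A GIN index over the lexemes is what the search queries will hit.
CREATE INDEX alerts_search_idx ON alerts USING GIN (search_vector);
```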

Stay tuned :)

P.S. I don't like to leave something like this without references in case someone would like to dig further. Here are some links which may prove useful:
Postgres Text Search Parsers
Text Search Intro
Text Search Types

