Postgres Document Search: Part One
October 5th, 2022
I've learned over time that Postgres is incredibly powerful. Oftentimes, I don't think engineers appreciate what it's truly capable of. Here I'd like to take a moment to discuss document-based text search within Postgres. What do I mean by "document" exactly? Well, in this case I mean searching through paragraphs and paragraphs of text.
Text search in Postgres, for me, started out quite simply. I'd slap an `ILIKE` on it and pat myself on the back. Sometime after that I learned about GIN and GiST indexes. Fancy! And then after that I discovered the trigram index - even more fancy! But you can take it even further, and that's what I want to explore here.
Before we go further, why might we need something more than just a GIN, GiST, or trigram index? Well, like I said previously, imagine we have paragraphs of text. A good example to pull from is weather alerts from the United States' National Oceanic and Atmospheric Administration (NOAA). If you haven't seen one before, here's a snippet:
At 1100 PM EDT...A low pressure system has been lingering offshore of the Mid-Atlantic region since Sunday afternoon. Two low pressure centers have developed approximately 240 miles off the Mid-Atlantic and New England coasts. An occluded front is draped on the northern side of both lows and extends northeast out into the open Atlantic. National Weather Service Doppler Radar and observations indicate onshore flow along the Southern New England coast, pushing light to moderate rain showers across the Northeast.
The weather alert I pulled that from has at least eight paragraphs of text and additional info for cities. There's a lot of data there. So let's take a look at how we can search through it.
Before we go further, though, let's cover some terminology. For this type of text search in Postgres we have these important terms: Documents, Tokens, Lexemes, and Dictionaries.
- Documents are easy, these are your original source texts.
- Tokens are composed from a single document and are the individual words, spaces, and punctuation. For example, the document "the dog likes to run!" will be split into ten tokens.
- Dictionaries "normalize" the tokens from the document and eliminate such things as "stop words." Stop words are words which should not be considered as part of a search - like "it" or "a." Finally, the dictionary will normalize words, so if you start with "clouds" the resulting lexeme will be "cloud."
- Lexemes are the secret sauce. Dictionaries parse tokens into a small set of lexemes, and this small set of strings is what gets indexed and searched against. From here you can begin to imagine how the search will improve when you have eight paragraphs of text versus, say, nine string values.
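You can poke at a dictionary directly with Postgres' `ts_lexize` function - hand it a dictionary name and a single token and it gives back the lexeme(s). Here `english_stem` is the built-in English Snowball stemmer dictionary:

```sql
-- Ask the English Snowball stemmer to normalize a single token.
SELECT ts_lexize('english_stem', 'clouds');
--  ts_lexize
-- -----------
--  {cloud}

-- Stop words come back as an empty array rather than a lexeme.
SELECT ts_lexize('english_stem', 'a');
--  ts_lexize
-- -----------
--  {}
```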
Before we move into Part Two, I want to show how this works under the hood in Postgres. This is normally an opaque process outside of Postgres but I find exploring this sort of thing handy for understanding.
Getting Our Hands Dirty
First, let's start with Postgres' `ts_parse` function. This function will take text and convert it into tokens. This is the first step in the process of converting documents into lexemes. To follow along, pop open `psql` from the command line and try this select:
```sql
SELECT * FROM ts_parse('default', 'the dog likes to run!');
```

This will output the following tokens:

```
 tokid | token
-------+-------
     1 | the
    12 |
     1 | dog
    12 |
     1 | likes
    12 |
     1 | to
    12 |
     1 | run
    12 | !
(10 rows)
```

Okay, cool, now we have our tokens. Let's try turning this into lexemes. Again from a `psql` terminal, enter this:
```sql
SELECT to_tsvector('pg_catalog.english', 'the dog likes to run!');
```

And this will output what we're after:

```
       to_tsvector
--------------------------
 'dog':2 'like':3 'run':5
(1 row)
```

Notice how "likes" was shortened to "like"? That's the dictionary normalization in process. Cool, eh? And it's this sequence of strings which is, ultimately, what we'll be searching against in Part Two.
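If you're curious how the tokens from `ts_parse` and the lexemes from `to_tsvector` connect, Postgres ships a `ts_debug` function that walks the whole pipeline at once - each token, which dictionary handled it, and the lexeme(s) it produced:

```sql
-- Show the full token-to-lexeme pipeline for our sample document.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('pg_catalog.english', 'the dog likes to run!');
```

In the output you'll see the stop words "the" and "to" map to empty lexeme arrays, while "likes" maps to `{like}` - the same normalization we saw above.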
Before we wrap, let's loop this back to our original text from NOAA. Let's see what Postgres' `to_tsvector` function will give us:

```sql
SELECT to_tsvector('pg_catalog.english', 'At 1100 PM EDT...A low pressure system has been lingering offshore of the Mid-Atlantic region since Sunday afternoon. Two low pressure centers have developed approximately 240 miles off the Mid-Atlantic and New England coasts. An occluded front is draped on the northern side of both lows and extends northeast out into the open Atlantic. National Weather Service Doppler Radar and observations indicate onshore flow along the Southern New England coast, pushing light to moderate rain showers across the Northeast.');
```
```
                                              to_tsvector
------------------------------------------------------------------------------------------------------
 '1100':2 '240':29 'across':82 'afternoon':21 'along':70 'approxim':28 'atlant':17,35,59 'center':25
 'coast':39,75 'develop':27 'doppler':63 'drape':44 'edt':4 'england':38,74 'extend':53 'flow':69
 'front':42 'indic':67 'light':77 'linger':11 'low':6,23,51 'mid':16,34 'mid-atlant':15,33 'mile':30
 'moder':79 'nation':60 'new':37,73 'northeast':54,84 'northern':47 'observ':66 'occlud':41
 'offshor':12 'onshor':68 'open':58 'pm':3 'pressur':7,24 'push':76 'radar':64 'rain':80 'region':18
 'servic':62 'shower':81 'side':48 'sinc':19 'southern':72 'sunday':20 'system':8 'two':22 'weather':61
```

Scrooooooooll around on this one - there are a lot of lexemes! Take a close look at those lexemes, though; you'll see some shortened ones such as "pressur", "approxim", and "atlant." Isn't that cool? Before we dive into Part Two, you can imagine the performance difference of a basic `ILIKE` search against the full text of the weather alert versus, say, a GiST index utilizing these lexemes.
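To make that comparison a bit more concrete, here's a rough sketch of what that setup could look like. The `alerts` table and its `body` column are hypothetical, invented for illustration:

```sql
-- A hypothetical table holding NOAA alert text.
CREATE TABLE alerts (
    id   bigserial PRIMARY KEY,
    body text NOT NULL
);

-- An expression index over the lexemes rather than the raw text.
-- The two-argument form of to_tsvector is used because expression
-- indexes require an immutable function.
CREATE INDEX alerts_body_search_idx
    ON alerts
    USING gist (to_tsvector('english', body));

-- Search against the lexemes instead of ILIKE-scanning the full text.
-- to_tsquery normalizes 'pressure' to the lexeme 'pressur' for us.
SELECT id
FROM alerts
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'pressure');
```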
Stay tuned :)
P.S. I don't like to leave something like this without references in case someone would like to dig further. Here are some links which may prove useful:
- Postgres Text Search Parsers
- Text Search Intro
- Text Search Types