Full text search with postgreSQL
At Nous we are working on a web portal for students, which makes sharing and authoring study materials easy. Effective full text search is a necessary feature and we started looking for viable solutions quite early.
The essential requirements were ease of deployment and integration with our pylons-base application. The tools we considered were:
But they all had a serious drawback. Stemming is required for effective text search and since we are planning to deploy this system in several languages, we needed good stemmers. And they are not that easy to find, especially for eastern european languages. Writing the stemmer ourselves was not an option..
PostgreSQL's Text Search
One of the major changes in the 8.3 release of postgresql was the integration of the tsearch2 extension. This makes using a database-level full text search implementation easier than even before.
Among the features that make full text search in postgresql shine are:
- ability to have multiple configurations, used interchangeably (e.g. indexing multilingual content with different stemmers)
- stop-word and normalization dictionaries
- ispell dictionaries for normalizing different forms of a word into the same lexeme
The ability to use ispell dictionaries was especially desired, as they are available for a much wider variety of languages than stemmers.
Setting up full text search on a drupal database
Since full text search is part of core as of postgresql 8.3, configuring it is much easier than in previous versions (no time wasted installing the extension itself). Here I will show a simple example configuration using myspell dictionaries for string normalization.
Installing the desired dictionary
If you are using ubuntu or debian, installing the dictionary is as simple as
sudo apt-get install myspell-{mylang}
substituting {mylang} for the language you need. When the dictionary is installed, a script 'pg_updatedicts' should start automatically and link the dictionary's files to a postgresql folder.
Configuring postgresql
Having installed the dictionary, we need to configure the database to use it. This is done by creating a text search dictionary in the database:
CREATE TEXT SEARCH DICTIONARY lithuanian (
TEMPLATE = ispell,
DictFile = system_lt_lt,
AffFile = system_lt_lt
);
Here the DictFile and AffFile options set the dictionary and affix files of the ispell dictionary to be used. If it is a available, a stop-word file can also be specified with the option StopWords.
The next step in setting full text search on the database is creating the search configuration. Text search configurations define the parser and the set of dictionaries that normalize and filter the tokens returned by the parser.
A simple configuration may look like this:
CREATE TEXT SEARCH CONFIGURATION public.lt ( COPY = pg_catalog.english );
ALTER TEXT SEARCH CONFIGURATION lt
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH lithuanian, english, simple;
What this does is create a configuration by copying the default english search configuration and then altering it by chaining several dictionaries for all tokens that look like text. Chaining dictionaries gives us a nice effect where only the words not matched by the first dictionary are passed on to the second one. In the previous example we identify lithuanian words first, then english ones (normalizing them with an ispell dictionary) and at last we turn all the unmatched tokens into lexemes without modifying them.
Testing the configuration
To test the configuration we have just created, we need to explicitly enable it. This can be done either globally in the postgresql.conf file, per database:
ALTER DATABASE ... SET
or per session:
SET default_text_search_config = 'public.lt';
Having done this in any of the above mentioned ways, we can test the database:
SELECT plainto_tsquery('some text to be indexed);
This should return a ts_vector containing the lexemes ready either for indexing or searching.
Full text search queries
Setting up a table (or an additional column) for storing text search vectors is completely up to the developer (as is setting up triggers to maintain these vectors), but, assuming the text search vector column is there, simple querying is accomplished by:
SELECT * FROM table WHERE ts_column @@ plainto_tsquery('query text')
Further reading
All the essential information can be found on the postgresql documentation page as well as on a couple of blogs, discussing using this search method with django or solving various search related problems.






Comments
Post new comment