Document similarity in postgresql

Once you have a full text search index available, it seems only natural to be able to find similar documents. Doing this with postgresql’s full text search requires a few tricks though.

When considering search in the vector space model, documents and queries are very similar (identical if we have unweighted terms). Hence comparing a document to a query or to another document is basically the same operation: finding the cosine of the angle between two vectors. With postgresql it’s not that easy.

The main (and most apparent) difficulty comes from the fact that there is no way to match two tsvectors: the @@ (full text search) operator and the ts_rank function take two parameters that have to be a tsvector and a tsquery. Which means that we need to convert a tsvector to a tsquery.

Type casting (vector::tsquery) is not supported, so this is the solution I came up with:

    select plainto_tsquery(strip(t.terms)::text) from search_items t;

The query that ranks documents by their similarity to the target document (e.g. the page the user is viewing at the moment) looks like this:

    select s.content_item_id,
    ts_rank(s.terms, replace(strip(original.terms)::text, ' ', ' | ')::tsquery) as similarity

    from search_items s,
      (select terms, content_item_id from search_items limit 1) as original
        where s.content_item_id != original.content_item_id

      order by similarity desc;

Here I use replace(strip(original.terms)::text, ‘ ‘, ’ | ‘)::tsquery instead of plainto_tsquery(strip(t.terms)::text) because I want to find similar documents even if they do not contain all of the original terms. The method plainto_tsquery creates a query for the conjunction of all terms, which is a bit too restrictive when searching for similar documents. Because of that I go the extra length and create a disjunctive query.

One problem I have stumbled upon is that the query ranking is not global. Thus it is hard to find the two most similar documents in the index.

As for the performance, mangling the tsvector into a query does not seem to make a big impact: ranking documents by their similarity in an index of ~2000 items averaging at 22 terms per item (~700 terms maximum) takes 46 ms compared to 44 ms for a query constructed from text.