
Incremental Collaborative Filtering in SQL

November 13, 2010

Collaborative filtering typically means deriving recommendations for a user from the preferences of people with similar interests within a larger group. The technique has been popularized by sites like amazon.com, which provide personalized recommendations based on purchase history.

This post proposes a rating-based implementation of collaborative filtering under two constraints. First, the ratings must be calculated and persisted in an SQL database as opposed to residing in memory. Second, the recommendations must remain current at all times, which requires maintaining the ratings incrementally. This should be contrasted with systems where recommendations are recalculated periodically in a batch process.

History

In 1992, David Goldberg first used the term collaborative filtering in his paper Using collaborative filtering to weave an information tapestry. At the time, he implemented a system that enabled users to annotate documents as either interesting or uninteresting. These ratings were then used to calculate recommendations for the active user, an approach which falls under the category of user-based filtering.

In 2000, Vucetic and Obradovic proposed an alternative to address the inherent performance problems of user-based filtering in their paper A Regression-Based Approach for Scaling-Up Personalized Recommender Systems in E-Commerce. The most significant difference is the item-based approach, which infers the interest of the current user from the relationships between each pair of items.

In 2005, Daniel Lemire and Anna Maclachlan proposed another alternative to address the computational complexity of existing predictor functions in their paper Slope One Predictors for Online Rating-Based Collaborative Filtering. Whereas typical predictor functions use linear regression of the form f(x) = ax + b, which can suffer from a large number of regressors, the Slope One predictor uses the simpler form f(x) = x + b, which is sometimes faster while needing only half the regressors.

Tables

To provide concrete examples, PostgreSQL will be used. These statements create the initial tables to represent people and items:

CREATE TABLE person (id SERIAL NOT NULL PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE item (id SERIAL NOT NULL PRIMARY KEY, name TEXT NOT NULL);

Next, the rating table is created to represent the ratings collected from people across a range of items. The data type of the value column is defined as real to store single precision floating-point numbers, but this could be changed to any numerical type:

CREATE TABLE rating (
    person_id INTEGER NOT NULL REFERENCES person,
    item_id INTEGER NOT NULL REFERENCES item,
    value REAL NOT NULL,
    PRIMARY KEY (person_id, item_id));

Finally, the frequency table is created to represent the popularity differential between each pair of items, resulting in a square matrix. The size of the table can be reduced by observing that the lower triangular matrix is the negative transposition of the upper one, making the matrix skew-symmetric. The size can be further reduced by removing the main diagonal, because the value in the sum column would always be zero there. These optimizations can be enforced with a constraint that ensures rows are only inserted in the lower triangular matrix:

CREATE TABLE frequency (
    item1_id INTEGER NOT NULL REFERENCES item,
    item2_id INTEGER NOT NULL REFERENCES item,
    count INTEGER NOT NULL DEFAULT 0,
    sum REAL NOT NULL DEFAULT 0.0,
    PRIMARY KEY (item1_id, item2_id),
    CONSTRAINT valid_item_id CHECK (item1_id > item2_id));

If the size of the table is still too large, it would also be possible to insert rows only once a pair of items has been co-rated a certain number of times, resulting in a sparse matrix. The tradeoff is that the triggers that maintain the table become more complex because some entries might be missing (see the sketch below). So, the right solution really depends on context.
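
As a minimal sketch of the simplest case, a threshold of a single co-rating, the missing rows could be created lazily instead of being pre-populated when an item is inserted. This assumes INSERT ... ON CONFLICT, which requires PostgreSQL 9.5 or later; the statement would run at the start of the rating_insert() function defined in the next section:

-- Hypothetical sketch: create any missing frequency rows for the pairs
-- involving the new rating (requires PostgreSQL 9.5+ for ON CONFLICT).
INSERT INTO frequency (item1_id, item2_id)
    SELECT greatest(NEW.item_id, r.item_id), least(NEW.item_id, r.item_id)
    FROM rating r
    WHERE r.person_id = NEW.person_id AND r.item_id <> NEW.item_id
    ON CONFLICT (item1_id, item2_id) DO NOTHING;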

Triggers

Database triggers can be used to automatically maintain the integrity of the frequency table when changes are made to the other tables. The corresponding procedures will be defined in PL/pgSQL so that they can continue to be expressed in SQL.

The first trigger consists of initializing the lower triangular matrix of the frequency table when a row is inserted in the item table. Even though this is expressed in a single statement, it results in i-1 new rows, where i is the current number of items, which implies that the total size of this table is i(i-1)/2. So, this is where a sparse matrix might reduce storage and improve performance:

CREATE OR REPLACE FUNCTION item_insert() RETURNS TRIGGER AS $$
BEGIN
    -- Pair the new item with every existing item; NEW.id comes from a
    -- sequence, so the item1_id > item2_id constraint is satisfied.
    INSERT INTO frequency (item1_id, item2_id) SELECT NEW.id, id FROM item WHERE id <> NEW.id;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_insert_trigger AFTER INSERT ON item
    FOR EACH ROW EXECUTE PROCEDURE item_insert();
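
For illustration, here is a hypothetical session (assuming the SERIAL ids are generated starting at 1):

INSERT INTO item (name) VALUES ('Item A');  -- id 1: no pairs yet
INSERT INTO item (name) VALUES ('Item B');  -- id 2: creates row (2, 1)
INSERT INTO item (name) VALUES ('Item C');  -- id 3: creates rows (3, 1) and (3, 2)
-- frequency now contains (2, 1), (3, 1) and (3, 2), each with count = 0 and sum = 0.0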

The next triggers consist of maintaining the count and sum columns of the frequency table when changing the rating table. The complexity mostly resides in having to update each pair of items rated by the person in question. Assuming the rows have already been inserted in the frequency table by the above trigger, this results in a single statement updating at most r-1 rows, where r is the number of ratings by that person:

CREATE OR REPLACE FUNCTION rating_insert() RETURNS TRIGGER AS $$
BEGIN
    -- For every other item rated by the same person, update the pair so that
    -- sum accumulates value(item1) - value(item2); the two OR branches cover
    -- the new item falling on either side of the pair.
    UPDATE frequency SET count = count + 1, sum = sum + r2.value - r1.value
    FROM rating r1, rating r2
    WHERE r1.person_id = r2.person_id AND r2.person_id = NEW.person_id
    AND ((r1.item_id = NEW.item_id AND r1.item_id = item2_id AND r2.item_id = item1_id)
        OR (r2.item_id = NEW.item_id AND r2.item_id = item1_id AND r1.item_id = item2_id));
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_insert_trigger AFTER INSERT ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_insert();
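
For illustration, the hypothetical session can be continued with one INSERT per rating, since each AFTER ROW trigger pairs the new rating against the ratings already present:

INSERT INTO person (name) VALUES ('Alice');  -- id 1
INSERT INTO person (name) VALUES ('Bob');    -- id 2
INSERT INTO rating VALUES (1, 1, 4.0);       -- Alice rates item 1
INSERT INTO rating VALUES (1, 2, 3.0);       -- Alice rates item 2: row (2, 1) gets count = 1, sum = 3.0 - 4.0
INSERT INTO rating VALUES (2, 1, 5.0);       -- Bob rates item 1
INSERT INTO rating VALUES (2, 2, 3.0);       -- Bob rates item 2: row (2, 1) gets count = 2, sum = -1.0 + (3.0 - 5.0)
-- The frequency row (2, 1) now holds count = 2 and sum = -3.0, the
-- accumulated value(item 2) - value(item 1) over both people.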

The corresponding triggers for UPDATE and DELETE events on the rating table are quite similar; the only difference is the content of the SET clause, as sketched below.
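
As a minimal sketch (assuming only the value column changes on UPDATE), these could look as follows; note that the sign of the adjustment depends on whether the changed item appears as item1_id or item2_id:

CREATE OR REPLACE FUNCTION rating_update() RETURNS TRIGGER AS $$
BEGIN
    -- The count is unchanged; the sum shifts by the rating difference, with
    -- the sign depending on which side of the pair the updated item occupies.
    UPDATE frequency SET sum = sum
        + CASE WHEN item1_id = NEW.item_id THEN NEW.value - OLD.value
               ELSE OLD.value - NEW.value END
    FROM rating r
    WHERE r.person_id = NEW.person_id AND r.item_id <> NEW.item_id
    AND ((item1_id = NEW.item_id AND item2_id = r.item_id)
        OR (item2_id = NEW.item_id AND item1_id = r.item_id));
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_update_trigger AFTER UPDATE ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_update();

CREATE OR REPLACE FUNCTION rating_delete() RETURNS TRIGGER AS $$
BEGIN
    -- Decrement the count and subtract the contribution of the removed
    -- rating from the sum.
    UPDATE frequency SET count = count - 1, sum = sum
        - CASE WHEN item1_id = OLD.item_id THEN OLD.value - r.value
               ELSE r.value - OLD.value END
    FROM rating r
    WHERE r.person_id = OLD.person_id AND r.item_id <> OLD.item_id
    AND ((item1_id = OLD.item_id AND item2_id = r.item_id)
        OR (item2_id = OLD.item_id AND item1_id = r.item_id));
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_delete_trigger AFTER DELETE ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_delete();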

Recommendations

At this point, the frequency table only contains the lower triangular part of the matrix. Another part is the main diagonal, which is superfluous because the value in its sum column would always be zero. The last part is the upper triangular matrix, which can be computed as the negative transposition of the lower one:

CREATE VIEW frequency_matrix (item1_id, item2_id, count, sum) AS
    SELECT item1_id, item2_id, count, sum FROM frequency
    UNION SELECT item2_id, item1_id, count, -sum FROM frequency;

The resulting matrix can be used to provide non-personalized recommendations, which don't depend on a particular person. For example, the following statement returns an ordered list of related items for the given $item_id based on the ratings from all people:

SELECT item2_id AS item_id, (sum / count) AS average FROM frequency_matrix
    WHERE count > 0 AND item1_id = $item_id ORDER BY average DESC;
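
Continuing the hypothetical session from the previous section, the result can be checked by hand:

-- The frequency row (2, 1) holds count = 2 and sum = -3.0, which the view
-- also exposes as (1, 2, 2, 3.0) by negative transposition, so for $item_id = 1:
SELECT item2_id AS item_id, (sum / count) AS average FROM frequency_matrix
    WHERE count > 0 AND item1_id = 1 ORDER BY average DESC;
--  item_id | average
-- ---------+---------
--        2 |     1.5
-- i.e. item 1 is rated on average 1.5 points higher than item 2 by the
-- people who rated both.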

The same matrix can also be used to provide personalized recommendations for a given person. For example, the following statement returns an ordered list of the most relevant items for the given $person_id:

SELECT f.item1_id AS item_id, sum(f.sum + f.count * r1.value) / sum(f.count) AS average
    FROM frequency_matrix f JOIN rating r1 ON r1.item_id = f.item2_id
    LEFT JOIN rating r2 ON r2.item_id = f.item1_id AND r2.person_id = r1.person_id
    WHERE r1.person_id = $person_id AND f.count > 0 AND r2.item_id IS NULL
    GROUP BY f.item1_id ORDER BY average DESC;

Note that the above statement contains an anti-join to exclude items already rated by the person, which can be expressed with NOT IN, NOT EXISTS or LEFT JOIN. The LEFT JOIN form is used here because it usually results in a more efficient execution plan from the database optimizer; the NOT EXISTS form is sketched below for comparison.
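
For comparison, a sketch of the same query with the anti-join expressed as NOT EXISTS, which should return the same rows although the execution plan may differ:

SELECT f.item1_id AS item_id, sum(f.sum + f.count * r1.value) / sum(f.count) AS average
    FROM frequency_matrix f JOIN rating r1 ON r1.item_id = f.item2_id
    WHERE r1.person_id = $person_id AND f.count > 0
    AND NOT EXISTS (SELECT 1 FROM rating r2
        WHERE r2.person_id = r1.person_id AND r2.item_id = f.item1_id)
    GROUP BY f.item1_id ORDER BY average DESC;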

Summary

This post has provided an implementation of collaborative filtering in SQL that can be maintained incrementally. The most significant bottleneck seems to be the order of growth of the frequency table, which might have both storage and performance implications. If this proves to be a problem, the table can be further optimized by using a sparse matrix.

Comments
  1. Andy Watts
    December 16, 2010 13:34

    Great post. Would you mind explaining the SQL you use to calculate the average rating?

    …. sum(f.sum + f.count * r1.value) / sum(f.count) AS average

    I guess I’m missing something, but the following seemed sufficient..
    …. sum(f.sum) / sum(f.count) AS average

    Andy

    • March 20, 2012 13:19

      The difference is that the former predicts a rating from the user's own rating of each paired item plus the average deviation between the items across all users in the set, whereas the latter only accounts for the average deviation irrespective of the user's own ratings. In other words, the reference to the rating as r1.value is quite important to determine the best recommendation for the user.

  2. Andy Watts
    December 20, 2010 04:14

    Given this constraint on the frequency table..
    CONSTRAINT valid_item_id CHECK (item1_id > item2_id)

    I think your item_insert trigger function would be better changed to
    - INSERT INTO frequency (item1_id, item2_id) SELECT NEW.id, id FROM item WHERE id <> NEW.id;
    + INSERT INTO frequency (item1_id, item2_id) SELECT NEW.id, id FROM item WHERE NEW.id > id;

    • March 20, 2012 13:36

      The value of NEW.id is an increment on a sequence generator, so it will always be greater than or equal to the existing values of id in the database. The only case where it can be equal is the newly inserted row itself, which is already visible because the item_insert trigger executes after insert, so that is the only case that needs to be handled by the WHERE condition with the not-equal operator.

  3. Vasil
    December 23, 2010 01:57

    Is this based on the Slope One algorithm? It looks similar, but I haven't studied it enough to tell if there's any difference.

    Being able to maintain the model incrementally is rarely seen in a CF algorithm.

    • March 20, 2012 13:43

      Yes, this is based on the Slope One algorithm. The History section provides a link to the article used for this implementation.

  4. Asher
    March 20, 2012 02:49

    Great post – what would be the corresponding set clause for the rating_update function?

    • March 20, 2012 13:51

      For the rating_update function, the count remains the same and only the sum in the frequency table needs updating: sum = sum + NEW.value - OLD.value

      • Asher
        March 20, 2012 21:01

        Thanks. I had used something similar, but in my tests it seemed that it would end up with an incorrect value. What finally worked for me was "reverting" the diffs before updating:

        first running

        SET sum = sum - (NEW.value - OLD.value)

        and then

        SET sum = sum + NEW.value - OLD.value

        But obviously running two queries isn't ideal. I'm wondering if I'm missing something and whether it can be done in a single query, although the way I'm using it isn't a stored procedure.
