Incremental Collaborative Filtering in SQL
Collaborative filtering is typically about getting recommendations from people with similar interests within a larger group. The process was popularized by sites like amazon.com, which provide personalized recommendations based on purchase history.
This post proposes a rating-based implementation of collaborative filtering within a couple of constraints. First, the ratings must be calculated and persisted in an SQL database, as opposed to residing in memory. Second, the recommendations must remain current at all times by maintaining the ratings incrementally. This contrasts with other systems where recommendations are recalculated periodically in a batch process.
In 1992, David Goldberg first used the term collaborative filtering in his paper Using collaborative filtering to weave an information tapestry. At the time, he implemented a system which enabled users to annotate documents as either interesting or uninteresting. These ratings were then used to calculate recommendations for the active user, an approach which falls under the category of user-based filtering.
In 2000, Vucetic and Obradovic proposed an alternative to address the inherent performance problems with user-based filtering in their paper A Regression-Based Approach for Scaling-Up Personalized Recommender Systems in E-Commerce. The most significant difference is taking an item-based approach, inferring the interest of the current user from the relationships between each pair of items.
In 2005, Daniel Lemire and Anna Maclachlan proposed another alternative to address the computational complexity of existing predictor functions in their paper Slope One Predictors for Online Rating-Based Collaborative Filtering. Predictor functions typically rely on linear regression, which can suffer from requiring a large number of regressors; the Slope One predictor is often faster while needing only half the regressors.
To provide concrete examples, PostgreSQL will be used. These statements create the initial tables to represent people and items:
CREATE TABLE person (
    id SERIAL NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE item (
    id SERIAL NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);
Next, the rating table is created to represent the value of ratings collected by people across a range of items. The data type of the value column is defined as real to store single precision floating-point numbers, but this could be changed to any numerical type:
CREATE TABLE rating (
    person_id INTEGER NOT NULL REFERENCES person,
    item_id INTEGER NOT NULL REFERENCES item,
    value REAL NOT NULL,
    PRIMARY KEY (person_id, item_id)
);
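For illustration, here is some hypothetical sample data exercising this schema (the names and values are made up for this example; in practice, ratings would be inserted after the triggers defined below are in place, so the frequency table stays consistent):

```sql
-- Hypothetical sample data for illustration only.
INSERT INTO person (name) VALUES ('alice'), ('bob');
INSERT INTO item (name) VALUES ('book'), ('film'), ('game');

-- alice rates two items; bob rates all three.
INSERT INTO rating (person_id, item_id, value) VALUES
    (1, 1, 4.0), (1, 2, 3.0),
    (2, 1, 5.0), (2, 2, 2.0), (2, 3, 4.0);
```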
Finally, the frequency table is created to represent the popularity differential between each pair of items, resulting in a square matrix. The size of the table can be reduced by observing that the lower triangular matrix is the negative transposition of the upper one, resulting in a skew-symmetric matrix. The size can be further reduced by removing the main diagonal, because its value in the sum column would always be zero. These optimizations can be enforced using a constraint to make sure that rows are only inserted in the lower triangular matrix:
CREATE TABLE frequency (
    item1_id INTEGER NOT NULL REFERENCES item,
    item2_id INTEGER NOT NULL REFERENCES item,
    count INTEGER NOT NULL DEFAULT 0,
    sum REAL NOT NULL DEFAULT 0.0,
    PRIMARY KEY (item1_id, item2_id),
    CONSTRAINT valid_item_id CHECK (item1_id > item2_id)
);
If the size of the table is still too large, it is also possible to only insert rows in the matrix above a certain threshold, resulting in a sparse matrix. The tradeoff is that the triggers maintaining the table become more complex, because some entries might be missing. So, the right solution really depends on context.
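With a sparse matrix, a trigger can no longer assume the pair's row already exists. One way to handle both cases in a single statement, assuming PostgreSQL 9.5 or later, is an upsert; this is a sketch rather than the post's actual triggers, and $item1_id, $item2_id and $diff stand in for values the trigger would compute:

```sql
-- Sketch: create the pair on demand, or fold the new rating into it.
-- greatest/least keep the row in the lower triangular matrix (item1_id > item2_id).
INSERT INTO frequency (item1_id, item2_id, count, sum)
VALUES (greatest($item1_id, $item2_id), least($item1_id, $item2_id), 1, $diff)
ON CONFLICT (item1_id, item2_id)
DO UPDATE SET count = frequency.count + 1,
              sum = frequency.sum + EXCLUDED.sum;
```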
Database triggers can be used to automatically maintain the integrity of the frequency table when changes are made to the other tables. The corresponding procedures will be defined in PL/pgSQL, so everything continues to be expressed in SQL.
The first trigger initializes the lower triangular matrix of the frequency table when a row is inserted in the item table. Even though this is expressed in a single statement, it results in i-1 rows, where i is the current number of items, which implies that the total size of this table is i(i-1)/2. For example, 10,000 items produce roughly 50 million rows. So, this is where a sparse matrix might reduce storage and improve performance:
CREATE OR REPLACE FUNCTION item_insert() RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO frequency (item1_id, item2_id)
        SELECT NEW.id, id FROM item WHERE id <> NEW.id;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_insert_trigger
    AFTER INSERT ON item
    FOR EACH ROW EXECUTE PROCEDURE item_insert();
The next triggers maintain the count and sum columns of the frequency table when the rating table changes. The complexity mostly resides in having to update each pair of items rated by the person in the rating. Assuming the rows have already been inserted in the frequency table by the above trigger, this results in a single statement updating at most r-1 rows, where r is the number of ratings by that person:
CREATE OR REPLACE FUNCTION rating_insert() RETURNS TRIGGER AS $$
BEGIN
    UPDATE frequency
    SET count = count + 1, sum = sum + r2.value - r1.value
    FROM rating r1, rating r2
    WHERE r1.person_id = r2.person_id
        AND r2.person_id = NEW.person_id
        AND ((r1.item_id = NEW.item_id AND r1.item_id = item2_id AND r2.item_id = item1_id)
            OR (r2.item_id = NEW.item_id AND r2.item_id = item1_id AND r1.item_id = item2_id));
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_insert_trigger
    AFTER INSERT ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_insert();
The corresponding triggers for UPDATE and DELETE events on the rating table are quite similar; the only difference is the content of the SET clause.
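As an illustration, the DELETE case could be sketched as follows. This is one possible formulation rather than the definitive one: because the trigger fires AFTER DELETE, the deleted row is no longer visible in the rating table, so its values come from OLD and the join structure is slightly different from the INSERT trigger, with a CASE expression choosing the sign of the contribution being removed:

```sql
CREATE OR REPLACE FUNCTION rating_delete() RETURNS TRIGGER AS $$
BEGIN
    UPDATE frequency
    -- Subtract the deleted rating's contribution from each surviving pair.
    SET count = count - 1,
        sum = sum - CASE WHEN OLD.item_id = item1_id
                         THEN OLD.value - r.value
                         ELSE r.value - OLD.value END
    FROM rating r
    WHERE r.person_id = OLD.person_id
        AND ((OLD.item_id = item1_id AND r.item_id = item2_id)
            OR (r.item_id = item1_id AND OLD.item_id = item2_id));
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_delete_trigger
    AFTER DELETE ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_delete();
```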
At this point, the frequency table only contains part of the matrix. Another part is the main diagonal, which is superfluous because its value in the sum column would always be zero. The last part is the upper triangular matrix, which can be computed as the negative transposition of the lower one:
CREATE VIEW frequency_matrix (item1_id, item2_id, count, sum) AS
    SELECT item1_id, item2_id, count, sum FROM frequency
    UNION
    SELECT item2_id, item1_id, count, -sum FROM frequency;
The resulting matrix can be used to provide non-personalized recommendations, which don't depend on a particular person. For example, the following statement returns an ordered list of items related to the given $item_id, based on the ratings from all people:
SELECT item2_id AS item_id, (sum / count) AS average
FROM frequency_matrix
WHERE count > 0 AND item1_id = $item_id
ORDER BY average DESC;
The same matrix can also be used to provide personalized recommendations for a given person. For example, the following statement returns an ordered list of the most relevant items for the given $person_id:
SELECT f.item1_id AS item_id,
    sum(f.sum + f.count * r1.value) / sum(f.count) AS average
FROM frequency_matrix f
JOIN rating r1 ON r1.item_id = f.item2_id
LEFT JOIN rating r2 ON r2.item_id = f.item1_id AND r2.person_id = r1.person_id
WHERE r1.person_id = $person_id
    AND f.count > 0
    AND r2.item_id IS NULL
GROUP BY f.item1_id
ORDER BY average DESC;
Note that the above statement contains an anti-join to exclude items already rated by the person, which can be expressed with NOT IN, NOT EXISTS or LEFT JOIN. The latter is used here because it usually results in a more efficient execution plan from the database optimizer.
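For comparison, the same exclusion written with NOT EXISTS looks like this; it is equivalent in result, and on recent PostgreSQL versions the planner may well produce a comparable plan for both forms:

```sql
SELECT f.item1_id AS item_id,
    sum(f.sum + f.count * r1.value) / sum(f.count) AS average
FROM frequency_matrix f
JOIN rating r1 ON r1.item_id = f.item2_id
WHERE r1.person_id = $person_id
    AND f.count > 0
    -- Anti-join: skip items the person has already rated.
    AND NOT EXISTS (SELECT 1 FROM rating r2
                    WHERE r2.item_id = f.item1_id
                        AND r2.person_id = r1.person_id)
GROUP BY f.item1_id
ORDER BY average DESC;
```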
This post has provided an implementation of collaborative filtering in SQL which can be maintained incrementally. The most significant bottleneck seems to be the order of growth of the frequency table, which might have both storage and performance implications. If this proves to be a problem, the table can be optimized further by using a sparse matrix.