Skip to main content

Given tables tracking users and their hobbies, how do we write a query that ranks pairs of users on hobby similarity?

-- ---------------------------------------------------------------------------
-- Pairwise matchmaking

-- from the Artful Common Queries page
-- http://www.artfulsoftware.com/infotree/qrytip.php?id=1149
-- ---------------------------------------------------------------------------

-- Given tables tracking users and their hobbies, how do we write a query that
-- ranks pairs of users on hobby similarity?

DROP TABLE
IF EXISTS users,
 hobbies,
 users_hobbies;

CREATE TABLE users (id INT, NAME CHAR(16));

INSERT INTO users
VALUES
  (1, 'John'),
  (2, 'Lewis'),
  (3, 'Muhammad');

CREATE TABLE hobbies (id INT, title CHAR(16));

INSERT INTO hobbies
VALUES
  (1, 'Sports'),
  (2, 'Computing'),
  (3, 'Drinking'),
  (4, 'Racing'),
  (5, 'Swimming'),
  (6, 'Photography');

CREATE TABLE users_hobbies (user_id INT, hobby_id INT);

INSERT INTO users_hobbies
VALUES
  (1,2),(1,3),(1,6),(2,1),(2,5),(2,6),(3,2),(3,5),(3,6),(1,2),(1,3),(1,6),
  (2,1),(2,5),(2,6),(3,2),(3,5),(3,6),(1,2),(1,3),(1,6),(2,1),(2,5),(2,6),
  (3,2),(3,5),(3,6);

-- It's a SQL version of a famous computing problem known as nearest neighbour
-- search or similarity search: given a metric space of K vectors, return
-- the N most similar.

-- Start by defining a similarity measure for one pair of users: if user
-- A has x hobbies and user B has y hobbies, one measure of their similarity
-- is the number of hobbies they share, divided by the number of hobbies that
-- either has. Is that plausible? If A and B both have hobbies 1,14,27,
-- they are 100*3/3=100% similar. If A has 9,13 while B has 6,9,15, together
-- they have 4 hobbies, one of which they share, so their similarity is 25%.

-- That's reasonable, but incomplete. If the comparison space has 100 hobbies,
-- and one pair shares 1 of 4 while another pair shares 4 of 16, are those
-- two pairs equally similar? Arguably not, since the second pair shares a
-- greater proportion of the total possible. Then the similarity measure
-- should take this into account, so if N=the total number of hobbies,
-- S=the number of hobbies shared by a pair, and T=the total number of
-- distinct hobbies they have together, their
-- similarity is (S/T) * (T/N), or simply S/N.

-- So from first logical principles, the solution is a three-step:
-- 1) Write the above logic in SQL
-- 2) Write a query that applies that SQL to all pairs
-- 3) Rank the scores that result

-- Fortunately, aggregation collapses these three logical steps to 2:

-- Step 1: count hobbies, then collect hobbies pairwise by user:

SET @N = (Select Count(DISTINCT id) FROM hobbies);
SELECT @N;
-- +------+
-- | @N   |
-- +------+
-- |    6 |
-- +------+
SELECT
  a.user_id,
  b.user_id,
  Group_Concat(DISTINCT a.hobby_id) AS 'Pairwise shared hobbies'
FROM
  users_hobbies a
JOIN users_hobbies b ON a.user_id < b.user_id
AND a.hobby_id = b.hobby_id
GROUP BY
  a.user_id,
  b.user_id;
-- +---------+---------+-------------------------+
-- | user_id | user_id | Pairwise shared hobbies |
-- +---------+---------+-------------------------+
-- |       1 |       2 | 6                       |
-- |       1 |       3 | 2,6                     |
-- |       2 |       3 | 5,6                     |
-- +---------+---------+-------------------------+

-- Step 2: Count, calculate and order by the percentages:

SELECT
  Concat(a.user_id, ' & ', b.user_id) AS Pair,
  Round(Count(DISTINCT a.hobby_id) / @N, 2) AS Pct
FROM
  users_hobbies a
JOIN users_hobbies b ON a.user_id < b.user_id
AND a.hobby_id = b.hobby_id
GROUP BY
  a.user_id,
  b.user_id
ORDER BY
  Pct DESC;
-- +-------+------+
-- | Pair  | Pct  |
-- +-------+------+
-- | 1 & 3 | 0.33 |
-- | 2 & 3 | 0.33 |
-- | 1 & 2 | 0.17 |
-- +-------+------+

-- But there is no hard and fast rule that determines a uniquely correct measure
-- of pairwise similarity. Some circumstances may require the above formula.
-- Others may require simple or complicated weights computed from pair and
-- population sizes. Implement them by applying a corrective calculation to the
-- denominator @N in Count( DISTINCT a.hobby_id ) / @N, 2 ).