Tutorial

Using Multiple Tables: JOIN

So far we've looked at queries restricted to a single table. As we saw in the introductory section, the real power of a database (as compared to a spreadsheet) is in being able to combine tables. We do this with the JOIN clause.

Joining tables works by finding shared values (keys) in two tables. If a post was created by a user with ID 9753, we can find the user in the Users table with an ID of 9753. This allows us to display the name of the user (which comes from the Users table) as part of the results of a query on the Posts table:

SELECT p.Title, p.Id, p.Score, p.ViewCount, u.DisplayName
FROM Posts p
JOIN Users u ON p.OwnerUserId = u.Id
WHERE p.PostTypeId = 1
AND p.Score >= 25
Run Query

If you're new to SQL, the best way to read this statement is from the inside out. The logic is:

  • Start with the Posts table: FROM Posts p. Because we're now talking about more than one table we need to use names; p refers to any row in the Posts table. Think of it sort of like a loop counter; p means "whatever row we're currently looking at". A query will end up looking at every row in the table to decide whether that row is relevant.
  • Now we're going to cross-reference this with the Users table: JOIN Users u ON p.OwnerUserId = u.Id. Just as we used p to mean a row in the Posts table, here we use u to mean a row in the Users table. The JOIN says to take the value in the OwnerUserId column in p (a post), find the same value in the Id column in u (a user), and match them up. JOIN does a lot of the heavy lifting in SEDE (and many other databases).
  • Remember what we said about reading this query from the inside out? Now we go up to the first line: SELECT p.Title, p.Id, p.Score, p.ViewCount, u.DisplayName. Notice that we've now added prefixes to the column names to indicate which table they come from. To the four columns from the Posts table we've added the display name from the Users table.
  • Finally, we go to the bottom for further restrictions. The WHERE clause restricts the values we show. Note that we've also added the prefixes to the values here.

This query produces results in an unspecified order:

results of previous query

We can use any column in either of the joined tables to control the ordering. It doesn't need to be one of the columns listed in the SELECT statement. For the purpose of illustration, we'll use user reputation, starting with the highest, to sort the results. This also has the effect of grouping all questions from the same author together. We add the ORDER BY clause to the previous query:

SELECT p.Title, p.Id, p.Score, p.ViewCount, u.DisplayName
FROM Posts p
JOIN Users u ON p.OwnerUserId = u.Id
WHERE p.PostTypeId = 1
AND p.Score >= 25
ORDER BY u.Reputation DESC
Run Query

See how that changes the results.

(If you wanted the lowest-reputation user first, you would replace DESC (descending) with ASC (ascending) in the ORDER BY clause.)

JOINs are not limited to two tables. They also aren't limited to different tables. Consider the Posts table, which contains both questions and answers (among other things). Suppose you want to access a question and its accepted answer? You can do that by joining Posts to Posts:

SELECT p.Title, p.Id, p.Score, a.Score AS "Accepted Score", 
u.DisplayName AS "Asker", au.DisplayName AS "Answerer"
  
FROM Posts p
JOIN Posts a ON p.AcceptedAnswerId = a.Id
JOIN Users u ON p.OwnerUserId = u.Id
JOIN Users au ON a.OwnerUserId = au.Id
WHERE p.PostTypeId = 1
AND p.Score >= 25
and p.AcceptedAnswerId IS NOT NULL
ORDER BY p.Score DESC
Run Query

Yes, that's a four-way JOIN. For each post (that is a question), we find its accepted answer (if that is not null). The answer, being a post, has an ID. We can therefore join a question to its answer: JOIN Posts a ON p.AcceptedAnswerId = a.Id.

The JOIN of the (question) post to its user is unchanged. We need a separate JOIN to do the same for the answering user, because answers are in a, not p, even though they are both references to the Posts table.

Notice that with all the extra posts and users involved in this query, we've used AS clauses in the SELECT to specify more meaningful column headers:

SELECT p.Title, p.Id, p.Score, a.Score AS "Accepted Score", 
u.DisplayName AS "Asker", au.DisplayName AS "Answerer"

We've also changed the sorting:

results of previous query