Using SELECT DISTINCT ON in PostgreSQL
tl;dr: PostgreSQL’s SELECT DISTINCT ON
is a functional solution for getting top counts of a GROUP BY
.
The Context
I recently did a coding challenge to build a survey application that analyzes responses at the end. The analysis centered around a question along the lines of “What was the most popular answer for people of each group?”, where the groups were decided by one of the questions.
Here’s an example, where the groups might be “0-17” and “18+”.
11. What's your age?
2 A. 0-17 B. 18+
32. What's your favorite season?
4 A. Winter B. Spring C. Summer D. Fall
53. What programming language do you like the best?
6 A. C B. Java C. Python
The following presents my solution (which admittedly could be improved on) and a problem I ran into.
The Database
My DB schema was something like:
1questions (
2 id SERIAL PRIMARY KEY,
3 question VARCHAR(256) NOT NULL,
4 options VARCHAR(64)[] NOT NULL,
5);
6
7answers (
8 response_id SERIAL NOT NULL,
9 question_id INTEGER REFERENCES questions(id),
10 answer VARCHAR(64),
11 PRIMARY KEY (response_id, question_id)
12);
When a survey is submitted, the backend tracks the response_id
generated for the first insert and uses that same response_id
for the other questions in the survey.
Our data might look something like this
id | question | options |
---|---|---|
0 | How old are you? | {0-17,18+} |
1 | What’s your favorite season? | {Winter,Spring,Summer,Fall} |
response_id | question_id | answer |
---|---|---|
0 | 0 | 0-17 |
0 | 1 | Fall |
1 | 0 | 18+ |
1 | 1 | Spring |
The Analysis
Now, to actually analyze the data and answer the big question. Using this example, we’ll try to figure out the responses for those 18+.
First, we’ll start with some subqueries. We want to separate out the group in question.
1WITH adults AS SELECT response_id FROM answers WHERE question_id = 0 AND answer = '18+'
Then, we’ll want to get the count per question_id. We can sort these to get the most popular responses per category first.
1SELECT COUNT(*) AS response_count, question_id, answer FROM
2 (SELECT * FROM adults JOIN answers ON adults.response_id = answers.response_id)
3GROUP BY question_id, answer
4ORDER BY response_count DESC;
This will get us a result like:
response_count | question_id | answer |
---|---|---|
10 | 0 | 18+ |
5 | 1 | Fall |
3 | 1 | Spring |
2 | 1 | Summer |
My Problem
I got stuck at this point. My original thought had been to get the MAX
from the counts, then join it back with the counts in order to get the response text. This might have been a working solution with other versions of SQL, but I was using Postgres.
1WITH
2 adults AS ...,
3 counts AS ...,
4SELECT answer, questions.question, MAX(response_count)
5 FROM counts JOIN questions ON counts.question_id = questions.question_id;
The above query throws an error:
ERROR: column “counts.answer” must appear in the GROUP BY clause or be used in an aggregate function
In the docs, Postgres specifies that:
When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column.
A Solution
Luckily, Postgres provides something called SELECT DISTINCT ON
. According to the docs,
SELECT DISTINCT ON ( expression [, ...] )
keeps only the first row of each set of rows where the given expressions evaluate to equal… Note that the “first row” of each set is unpredictable unlessORDER BY
is used to ensure that the desired row appears first.
Using this, we can write a query to SELECT DISTINCT ON
the question_id. This will drop every row except for the first for every question_id. By the docs, we should use an ORDER BY
with this statement.
This gives us a final query of:
1WITH
2 adults AS (SELECT response_id FROM answers WHERE question_id = 0 AND answer = '18+'),
3 counts AS (SELECT COUNT(*) AS response_count, question_id, answer FROM
4 (SELECT * FROM adults JOIN answers ON adults.response_id = answers.response_id)
5 GROUP BY question_id, answer)
6SELECT DISTINCT ON (question_id) question, answer, response_count
7 FROM counts JOIN questions ON counts.question_id = questions.id
8 ORDER BY question_id, response_count DESC;
Running the query produces a result like:
question | answer | response_count |
---|---|---|
What’s your age? | 18+ | 10 |
What’s your favorite season? | Spring | 5 |
What programming language do you like the best? | C | 7 |
One Last Caveat
You may have noticed that I changed the subquery for counts
and moved the ORDER BY
into the outermost SELECT
. After running a few tests, I found that the sorting within the subquery did not translate to having the correct order in the final result. In fact, it would be completely backwards, working when I used ORDER BY ... ASC
.
The documentation does specify that the behavior of SELECT DISTINCT ON
is unpredictable without an ORDER BY
, which I’ve taken to mean as an ORDER BY
in the same clause. From my observations, subquery orderings are still susceptible to the unpredictability here.