note-to-self

Using SELECT DISTINCT ON in PostgreSQL

tl;dr: PostgreSQL’s SELECT DISTINCT ON is a functional solution for getting top counts of a GROUP BY.

The Context

I recently did a coding challenge to build a survey application that analyzes responses at the end. The analysis centered around a question along the lines of “What was the most popular answer for people of each group?”, where the groups were decided by one of the questions.

Here’s an example, where the groups might be “0-17” and “18+”.

11. What's your age?
2    A. 0-17     B. 18+
32. What's your favorite season?
4    A. Winter   B. Spring   C. Summer   D. Fall
53. What programming language do you like the best?
6    A. C        B. Java     C. Python

The following presents my solution (which admittedly could be improved on) and a problem I ran into.

The Database

My DB schema was something like:

 1questions (
 2    id SERIAL PRIMARY KEY,
 3    question VARCHAR(256) NOT NULL,
 4    options VARCHAR(64)[] NOT NULL,
 5);
 6
 7answers (
 8    response_id SERIAL NOT NULL,
 9    question_id INTEGER REFERENCES questions(id),
10    answer VARCHAR(64),
11    PRIMARY KEY (response_id, question_id)
12);

When a survey is submitted, the backend tracks the response_id generated for the first insert and uses that same response_id for the other questions in the survey.

Our data might look something like this

idquestionoptions
0How old are you?{0-17,18+}
1What’s your favorite season?{Winter,Spring,Summer,Fall}
response_idquestion_idanswer
000-17
01Fall
1018+
11Spring

The Analysis

Now, to actually analyze the data and answer the big question. Using this example, we’ll try to figure out the responses for those 18+.

First, we’ll start with some subqueries. We want to separate out the group in question.

1WITH adults AS SELECT response_id FROM answers WHERE question_id = 0 AND answer = '18+'

Then, we’ll want to get the count per question_id. We can sort these to get the most popular responses per category first.

1SELECT COUNT(*) AS response_count, question_id, answer FROM
2    (SELECT * FROM adults JOIN answers ON adults.response_id = answers.response_id)
3GROUP BY question_id, answer
4ORDER BY response_count DESC;

This will get us a result like:

response_countquestion_idanswer
10018+
51Fall
31Spring
21Summer

My Problem

I got stuck at this point. My original thought had been to get the MAX from the counts, then join it back with the counts in order to get the response text. This might have been a working solution with other versions of SQL, but I was using Postgres.

1WITH
2    adults AS ...,
3    counts AS ...,
4SELECT answer, questions.question, MAX(response_count)
5    FROM counts JOIN questions ON counts.question_id = questions.question_id;

The above query throws an error:

ERROR: column “counts.answer” must appear in the GROUP BY clause or be used in an aggregate function

In the docs, Postgres specifies that:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column.

A Solution

Luckily, Postgres provides something called SELECT DISTINCT ON. According to the docs,

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal… Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

Using this, we can write a query to SELECT DISTINCT ON the question_id. This will drop every row except for the first for every question_id. By the docs, we should use an ORDER BYwith this statement.

This gives us a final query of:

1WITH
2    adults AS (SELECT response_id FROM answers WHERE question_id = 0 AND answer = '18+'),
3    counts AS (SELECT COUNT(*) AS response_count, question_id, answer FROM
4            (SELECT * FROM adults JOIN answers ON adults.response_id = answers.response_id)
5        GROUP BY question_id, answer)
6SELECT DISTINCT ON (question_id) question, answer, response_count
7    FROM counts JOIN questions ON counts.question_id = questions.id
8    ORDER BY question_id, response_count DESC;

Running the query produces a result like:

questionanswerresponse_count
What’s your age?18+10
What’s your favorite season?Spring5
What programming language do you like the best?C7

One Last Caveat

You may have noticed that I changed the subquery for counts and moved the ORDER BY into the outermost SELECT. After running a few tests, I found that the sorting within the subquery did not translate to having the correct order in the final result. In fact, it would be completely backwards, working when I used ORDER BY ... ASC.

The documentation does specify that the behavior of SELECT DISTINCT ON is unpredictable without an ORDER BY, which I’ve taken to mean as an ORDER BY in the same clause. From my observations, subquery orderings are still susceptible to the unpredictability here.

<< Previous Post

|

Next Post >>