Skip to main content

FILTER Clause

The FILTER clause may optionally follow an aggregate function in a SELECT statement. This will filter the rows of data that are fed into the aggregate function in the same way that a WHERE clause filters rows, but localized to the specific aggregate function. FILTERs are not currently able to be used when the aggregate function is in a windowing context.

There are multiple types of situations where this is useful, including when evaluating multiple aggregates with different filters, and when creating a pivoted view of a dataset. FILTER provides a cleaner syntax for pivoting data when compared with the more traditional CASE WHEN approach discussed below.

Some aggregate functions also do not filter out null values, so using a FILTER clause will return valid results when at times the CASE WHEN approach will not. This occurs with the functions FIRST and LAST, which are desirable in a non-aggregating pivot operation where the goal is to simply re-orient the data into columns rather than re-aggregate it. FILTER also improves null handling when using the LIST and ARRAY_AGG functions, as the CASE WHEN approach will include null values in the list result, while the FILTER clause will remove them.

Examples

-- Compare total row count to:
-- The number of rows where i <= 5
-- The number of rows where i is odd
SELECT
count(*) as total_rows,
count(*) FILTER (WHERE i <= 5) as lte_five,
count(*) FILTER (WHERE i % 2 = 1) as odds
FROM generate_series(1,10) tbl(i);
total_rowslte_fiveodds
1055
-- Different aggregate functions may be used, and multiple WHERE expressions are also permitted
-- The sum of i for rows where i <= 5
-- The median of i where i is odd
SELECT
sum(i) FILTER (WHERE i <= 5) as lte_five_sum,
median(i) FILTER (WHERE i % 2 = 1) as odds_median,
median(i) FILTER (WHERE i % 2 = 1 AND i <= 5) as odds_lte_five_median
FROM generate_series(1,10) tbl(i);
lte_five_sumodds_medianodds_lte_five_median
155.03.0

The FILTER clause can also be used to pivot data from rows into columns. This is a static pivot, as columns must be defined prior to runtime in SQL. However, this kind of statement can be dynamically generated in a host programming language to leverage DuckDB's SQL engine for rapid, larger than memory pivoting.

--First generate an example dataset
CREATE TEMP TABLE stacked_data as
SELECT
i,
CASE WHEN i <= rows * 0.25 THEN 2022
WHEN i <= rows * 0.5 THEN 2023
WHEN i <= rows * 0.75 THEN 2024
WHEN i <= rows * 0.875 THEN 2025
ELSE NULL
END as year
FROM (
SELECT
i,
count(*) over () as rows
FROM generate_series(1,100000000) tbl(i)
) tbl;

--"Pivot" the data out by year (move each year out to a separate column)
SELECT
count(i) FILTER (WHERE year = 2022) as "2022",
count(i) FILTER (WHERE year = 2023) as "2023",
count(i) FILTER (WHERE year = 2024) as "2024",
count(i) FILTER (WHERE year = 2025) as "2025",
count(i) FILTER (WHERE year IS NULL) as "NULLs"
FROM stacked_data;

--This syntax produces the same results as the the FILTER clauses above
SELECT
count(CASE WHEN year = 2022 THEN i END) as "2022",
count(CASE WHEN year = 2023 THEN i END) as "2023",
count(CASE WHEN year = 2024 THEN i END) as "2024",
count(CASE WHEN year = 2025 THEN i END) as "2025",
count(CASE WHEN year IS NULL THEN i END) as "NULLs"
FROM stacked_data;
2022202320242025NULLs
2500000025000000250000001250000012500000

However, the CASE WHEN approach will not work as expected when using an aggregate function that does not ignore NULL values. The FIRST function falls into this category, so FILTER is preferred in this case.

--"Pivot" the data out by year (move each year out to a separate column)
SELECT
first(i) FILTER (WHERE year = 2022) as "2022",
first(i) FILTER (WHERE year = 2023) as "2023",
first(i) FILTER (WHERE year = 2024) as "2024",
first(i) FILTER (WHERE year = 2025) as "2025",
first(i) FILTER (WHERE year IS NULL) as "NULLs"
FROM stacked_data;
2022202320242025NULLs
147456125804801507494417643136187500001
--This will produce NULL values whenever the first evaluation of the CASE WHEN clause returns a NULL
SELECT
first(CASE WHEN year = 2022 THEN i END) as "2022",
first(CASE WHEN year = 2023 THEN i END) as "2023",
first(CASE WHEN year = 2024 THEN i END) as "2024",
first(CASE WHEN year = 2025 THEN i END) as "2025",
first(CASE WHEN year IS NULL THEN i END) as "NULLs"
FROM stacked_data;
2022202320242025NULLs
1228801NULLNULLNULLNULL

Aggregate Function Syntax (Including Filter Clause)