If you are in any data-related career, you must be very well-versed in SQL concepts. One such key concept is SQL Window Function. This is a must-know SQL advanced concept for every Data engineer, data analyst, or any other data-related profile. As the name suggests it is just like creating different data windows and applying different SQL functions on those specific windows.
SQL Window Function
SQL Window function is also known as the analytical function. It’s an advanced SQL feature, which allows performing complex calculations on top of a specific set of rows in a table. SQL window function operates on a window of rows that a data engineer defines using an over clause. It helps do complex calculations like running totals, moving averages, ranking, and row numbers. This is a must-know SQL advanced concept for every data engineer.
Understanding SQL Window Function
Let us understand the need for the SQL window function. Before that, we need a data set to run the SQL statements.
DataSet for running the SQL statement
drop table employee;
create table employee
(id int,
name varchar(10),
deptId int,
salary float);
truncate table employee;
insert into employee values (123,'abc',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (124,'bcd',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (125,'cde',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (126,'def',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (127,'efg',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (128,'fgh',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (129,'ghi',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (130,'hij',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (131,'ijk',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (132,'jkl',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (133,'klm',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (134,'lmn',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (135,'mno',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (136,'nop',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (137,'opq',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (138,'pqr',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (139,'qrs',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (140,'rst',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (141,'stu',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (142,'tuv',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (143,'uvw',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (144,'vwx',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (145,'wxy',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (146,'xyz',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (147,'yza',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (148,'zab',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (149,'zab',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
insert into employee values (150,'abc',cast(rand()*5 + 5 as int),cast(rand()*500*rand()*123 + 100000 as int));
drop table department;
create table department (deptid int, deptName varchar(10));
insert into department values (5,'sales');
insert into department values (6,'hr');
insert into department values (7,'IT');
insert into department values (8,'finance');
insert into department values (9,'admin');
select * from Employee;
select * from department;
Traditional SQL aggregate functions vs Window Functions
Here are a few examples of traditional aggregate functions. All of these can be used along with or without group by clause. The point to note here is that all of these functions will return just 1 value for each set.
SUM() returns the sum of all values
COUNT() returns the count of values in the set
AVG() returns the average of all values in the set.
MAX() returns the maximum among all values.
MIN() returns the minimum among all values.
select d.deptName as "Department",Count(1) from employee e inner join
department d on e.deptId=d.deptid
group by d.deptName;
select d.deptName as "Department",sum(e.salary) "Total Salary for dept" from employee e inner join
department d on e.deptId=d.deptid
group by d.deptName;
select d.deptName as "Department",avg(e.salary) "Total Salary for dept" from employee e inner join
department d on e.deptId=d.deptid
group by d.deptName;
It can be seen that for every department, it has created a set and gives a single result value for every set. However, it’s different in the case of the SQL Window function. There will 1 output value for each row. We will see that in more detail in upcoming sections.
Syntax and Basic Structure
For the window function, there is a standard syntax and structure.
select [List column names], [Agg_function] over ( partition by [Column name for window creation] [order by [column name 2]] from [table_name]
partition by [Column name for window creation]: This is defining the window within the resultset on which aggregate function needs to be applied.
Types of Window Functions
There are two main types of windowing functions in SQL.
Aggregate Functions
These functions calculate a single value based on a group of rows within a data set. Examples are like, SUM, AVG, COUNT, MAX, MIN, etc.
Analytical Functions
These functions are used to perform advanced calculations on data coming under a window out of a larger data set. Analytical functions return a single value for each row coming under the selected window. Examples are ROW_NUMBER, LEAD, LAG, RANK, DENSE_RANK, FIRST_VALUE, LAST_VALUE, NTILE, etc.
SQL Window Function Examples
Examples of SQL Window Function using Aggregate Functions
Let’s look at some examples of aggregate functions as window functions.
SQL for showing the average, the sum of salaries, and the count of employees coming under a department.
select e.*,d.deptName,
sum(e.salary) over (partition by d.deptid ) "Total Salary for dept" ,
Avg(e.salary) over (partition by d.deptid ) "Average Salary for dept",
Count(e.id) over (partition by d.deptid) "Count of employees in each dept"
from employee e inner join
department d on e.deptId=d.deptid;
Here every row is showing all 3 values for each department against every employee record of that department. for example under “sales” department, every employee record is showing total and average salary for Sales department.
Now what if we want to show the moving average, and moving total under every department window? The query will change as per below.
select e.*,d.deptName,
sum(e.salary) over (partition by d.deptid order by e.id) "Moving Total Salary for dept" ,
Avg(e.salary) over (partition by d.deptid order by e.id) "Moving Average Salary for dept"
from employee e inner join
department d on e.deptId=d.deptid;
Examples of SQL Window Function using Analytical Functions
Here are few examples of Window functions using analytical functions:
ROW_NUMBER()
select e.*,d.deptName,
ROW_NUMBER() over (partition by d.deptid order by e.salary) "ROW NUMBER"
from employee e inner join
department d on e.deptId=d.deptid
where d.deptid in (8,9);
So every window data set is having a specific value for each row coming under each department. Lets make few changes in the data and then lets revisit the output.
update employee set salary = (select e1.salary from employee e1 where e1.id=134)
where id =141;
update employee set salary = (select e1.salary from employee e1 where e1.id=142)
where id =150;
-- Let's re-run the query
select e.*,d.deptName,
ROW_NUMBER() over (partition by d.deptid order by e.salary) "ROW NUMBER"
from employee e inner join
department d on e.deptId=d.deptid
where d.deptid in (8,9);
So there is no change in the output, even if the salary is same for two consecutive records.
RANK() and DENSE_RANK()
Now let’s check the output with Rank() and dense_rank() function.
select e.*,d.deptName,
ROW_NUMBER() over (partition by d.deptid order by e.salary) "ROW NUMBER",
RANK() over (partition by d.deptid order by e.salary) "RANK",
DENSE_RANK() over (partition by d.deptid order by e.salary) "DENSE RANK"
from employee e inner join
department d on e.deptId=d.deptid
where d.deptid in (8,9);
So in the output it is clearly visible that ROW_NUMBER will keep on incrementing for every next row coming in the order and it’s not dependent on value of sorted column. How ever in case of RANK and DENSE_RANK its little different.
So RANK(), will be same for row having duplicate sorted column value, but the next row having non-duplicate value will have rank incremented by number of duplicate value rows, like in this case it’s 2 for both id as 134 and 141 but for next id i.e. 136 it’s directly jumps to 4.
But in case of DENSE_RANK(), it’s little different, it will remain same for all rows having duplicate value , but it will not skip any rank for next value like in this case dense_rank is 2 for both IDs 134 and 141 but it is having next dense_rank as 3 for id as 136, so no jumping at all.
LEAD(<return value>,<Offset value>): This gives access to next record’s value, based on offset.
LAG(<return value>,<Offset value>): This gives access to previous record’s value, based on offset.
select e.id,e.name,d.deptName,e.salary,
Lead(salary,1) over (partition by d.deptid order by e.salary) "Next Employee salary",
lag(salary,1) over (partition by d.deptid order by e.salary) "Previous Employee salary"
from employee e inner join
department d on e.deptId=d.deptid
where d.deptid in (8,9);
Advantages of SQL Windowing Function
There are many advantages of SQL Window functions. Few are listed below
Ability to operate on a subset of records
The most important advantage is that, it enables us to do calculation on a subset of rows based on certain windowing/partition criteria.
Efficient use of database resources
By using window functions, we may avoid creating multiple temp tables, which will result in less memory and CPU usage.
Simplified Query Logic
Windowing queries are pretty much simple to write and understand. Just imagine if we need to extract same output without using the window functions, it will be really really very complex query.
Conclusion
In this article we tried to understand the usage of SQL window function, how to use it and how it is different from normal aggregation. I hope, this article resolves all your queries, if there are still any remaining, please feel free to comment, I will try my best to resolve your queries related to SQL Window Function.
You may also like to go through below video for understanding window functions.
If you want to learn about Parameter Sniffing, please check out this article.