towardsdatascience.com Open in urlscan Pro
162.159.152.4  Public Scan

Submitted URL: https://towardsdatascience.com/how-to-use-group-by-and-partition-by-in-sql-f3d241846e3e
Effective URL: https://towardsdatascience.com/how-to-use-group-by-and-partition-by-in-sql-f3d241846e3e?gi=dc71f8cb6eab
Submission: On April 22 via manual from IN — Scanned from DE

Form analysis 0 forms found in the DOM

Text Content

Open in app

Sign In

Get started


Home
Notifications
Lists
Stories

--------------------------------------------------------------------------------

Write




RESPONSES



What are your thoughts?

Cancel
Respond

Also publish to my profile

There are currently no responses for this story.

Be the first to respond.

Published in

Towards Data Science

You have 2 free member-only stories left this month.

Sign up for Medium and get an extra one



Chi Nguyen
Follow

Mar 1, 2021

·
5 min read
·

Listen



Save







HOW TO USE GROUP BY AND PARTITION BY IN SQL


Photo by Markus Spiske on Unsplash


INTRODUCTION

When I first learned SQL, I had a problem of differentiating between PARTITION
BY and GROUP BY, as they both have a function for grouping. I believe many
people who begin to work with SQL may encounter the same problem. Therefore, in
this article I want to share with you some examples of using PARTITION BY, and
the difference between it and GROUP BY in a select statement.


EXPLORATION


SAMPLE DATA

Firstly, I create a simple dataset with 4 columns. The df table below describes
the amount of money and type of fruit that each employee in different functions
will bring in their company trip.


Figure 1: df table

Figure 2: df information


PARTITION BY VS GROUP BY

The following is the syntax of Partition By:

SELECT expression 1, expression 2, ...
aggregate function () 
OVER (PARTITION BY expression 1 order_clause frame_clause)
FROM table

When we want to do an aggregation on a specific column, we can apply PARTITION
BY clause with the OVER clause. Let’s look at the example below to see how the
dataset has been transformed.


Figure 3: Output of Partition By Clause

In the example, I want to calculate the total and average amount of money that
each function brings for the trip. What you can see in the screenshot is the
result of my PARTITION BY query.

Now, if I use GROUP BY instead of PARTITION BY in the above case, what would the
result look like?

First, the syntax of GROUP BY can be written as:

SELECT expression 1, expression 2
aggregate function ()
FROM tables
WHERE conditions
GROUP BY expression 1, expression 2

When I apply this to the query to find the total and average amount of money in
each function, the aggregated output is similar to a PARTITION BY clause.
However, as you notice, there is a difference in the figure 3 and figure 4
result.


Figure 4: Output of Group By clause
 * GROUP BY gives per function in the company a result (Figure 4). Meanwhile, as
   we have 7 records in the df table, PARTITION BY retrieves all 7 rows with
   total_amount and average_amount in each row (Figure 3). Therefore, in
   conclusion, the PARTITION BY retrieves all the records in the table, while
   the GROUP BY only returns a limited number.
 * One more thing is that GROUP BY does not allow to add columns which are not
   parts of GROUP BY clause in select statement. However, with PARTITION BY
   clause, we can add required columns.


PARTITION BY WITH ROW NUMBER

We can combine PARTITION BY and ROW NUMBER to have the row number sorted by a
specific value. For example, if I want to see which person in each function
brings the most amount of money, I can easily find out by applying the
ROW_NUMBER function to each team and getting each person’s amount of money
ordered by descending values.


Figure 5: Row Number function example


PARTITION BY WITH CUMULATIVE VALUE

PARTITION BY + ROWS UNBOUNDED PRECEDING

For easier imagination, I will begin with an example to explain the idea of this
section.

I created a new table named df8.


Figure 6: New data frame (df8)

By applying ROW_NUMBER, I got the row number value sorted by amount of money for
each employee in each function. Basically until this step, as you can see in
figure 7, everything is similar to the example above.


Figure 7: Output

However, as I want to calculate one more column, which is the average money
amount of the current row and the higher value amount before the current row in
partition. For example in the figure 8, we can see that:

 * In the Tech team, Sam alone has an average cumulative amount of 400000.
 * However, in row number 2 of the Tech team, the average cumulative amount is
   340050, which equals the average of (Hoang’s amount + Sam’s amount).
 * In row number 3, the money amount of Dung is lower than Hoang and Sam, so his
   average cumulative amount is average of (Hoang’s, Sam’s and Dung’s amount)

=> This is a general idea of how ROWS UNBOUNDED PRECEDING and PARTITION BY
clause are used together.


Figure 8: Cumulative Average Amount using ROWS UNBOUNDED PRECEDING

PARTITION BY + ROWS BETWEEN CURRENT ROW AND 1

The usage of this combination is to calculate the aggregated values (average,
sum, etc) of the current row and the following row in partition. Let’s continue
to work with df9 data to see how this is done.


Figure 9: Cumulative Average Amount using ROWS BETWEEN CURRENT ROW AND 1

From figure 9, we can figure out that:

 * In Tech function row number 1, the average cumulative amount of Sam is
   340050, which equals the average amount of her and her following person
   (Hoang) in row number 2.
 * Then, the average cumulative amount of Hoang is the average of Hoang’s amount
   and Dung’s amount in row number 3.


CONCLUSION

In this article, I provided my understanding of PARTITION BY and GROUP BY along
with some different cases of using PARTITION BY. I hope the above information
will be helpful for you.


REFERENCE


SQL PARTITION BY CLAUSE OVERVIEW


APRIL 9, 2019 BY THIS ARTICLE WILL COVER THE SQL PARTITION BY CLAUSE AND, IN
PARTICULAR, THE DIFFERENCE WITH GROUP BY…

www.sqlshack.com





135





135

135





SIGN UP FOR THE VARIABLE


BY TOWARDS DATA SCIENCE

Every Thursday, the Variable delivers the very best of Towards Data Science:
from hands-on tutorials and cutting-edge research to original features you don't
want to miss. Take a look.

Get this newsletter


MORE FROM TOWARDS DATA SCIENCE

Follow

Your home for data science. A Medium publication sharing concepts, ideas and
codes.

Mathias Kemeter

·Mar 1, 2021


CREATING A PLAYGROUND FOR SPATIAL ANALYTICS

How to setup and use QGIS with SAP HANA Cloud in less than 30 Minutes — You are
interested in Spatial Analytics and looking for some content to get started? The
combination of SAP HANA Cloud Trial with QGIS is providing you a low entry
barrier into spatial data processing at scale. …

Database

12 min read





--------------------------------------------------------------------------------

Share your ideas with millions of readers.

Write on Medium

--------------------------------------------------------------------------------

Mostafa Ibrahim

·Mar 1, 2021


WHAT IS A TENSOR PROCESSING UNIT (TPU) AND HOW DOES IT WORK?

TPUs are over 20x times faster than state-of-art GPUs… But how? — TPUs are
hardware accelerators specialized in deep learning tasks. In this code lab, you
will see how to use them with Keras and Tensorflow 2. Cloud TPUs are available
in a base configuration with 8 cores and also in larger configurations called
“TPU pods” of up to 2048 cores. …

Artificial Intelligence

5 min read





--------------------------------------------------------------------------------

Adam Green

·Mar 1, 2021


SEVEN STRATEGIES TO GENERATE DATA SCIENCE PROJECT IDEAS

Find data science portfolio project ideas with these seven strategies — Personal
projects are one of the best things about data science. Most projects require
large teams & expensive software — try building a bridge as a lone civil
engineer with only a laptop and an internet connection! The openness of the data
science community with data, tools and education means…

Data Science

14 min read





--------------------------------------------------------------------------------

Nik Piepenbreier

·Mar 1, 2021


VISUALIZING COVID-19 VACCINE DATA IN PYTHON (IN 5 SIMPLE STEPS)

Make your MatPlotlib graphs stand out — Matplotlib is one of the most commonly
used data visualization libraries in Python. However, its default settings lead
to drab, uninspiring graphs. This quick tutorial will guide you through some of
the tips and tricks you can use to design much prettier Matplotlib graphs. What
you’ll be making! We’ll be taking some COVID-19 Vaccine…

Python

4 min read





--------------------------------------------------------------------------------

Indhumathy Chelliah

·Mar 1, 2021


LINE OF BEST FIT IN LINEAR REGRESSION

Correlation Coefficient, Coefficient of determination, Model Coefficient —
Linear Regression Linear Regression is one of the most important algorithms in
machine learning. It is the statistical way of measuring the relationship
between one or more independent variables vs one dependent variable. The Linear
Regression model attempts to find the relationship between variables by finding
the best fit line. Let’s learn…

Artificial Intelligence

7 min read





--------------------------------------------------------------------------------

Read more from Towards Data Science


RECOMMENDED FROM MEDIUM

Rendered.ai Marketing

in

RenderedAI

RENDERED.AI PARTICIPATES IN #DATAFORWILDLIFE CHALLENGE TO TACKLE WILDLIFE CRIME
ONLINE



Ajay Mane

10 MOST POPULAR STATISTICAL HYPOTHESIS TESTING METHODS USING PYTHON



Tobiloba Adaramati

in

Udacity Technology Scholarship

RESEARCH METHODS 101



Anmol Hans

in

Analytics Vidhya

IPL ANALYSIS USING PANDAS: READ THIS IF YOU ARE PLANNING TO MAKE A TEAM IN
DREAM11



Andrew Muller

TANZANIAN WATER WELL CLASSIFICATION



Yue Ma & Ryan Deng

RECOMMENDER SYSTEM FOR RESTAURANTS BASED ON COMMENTS



James Reade

COVID 19 TRENDS — DETERMINING THEM AUTOMATICALLY



Mcesnokova

RODRIGUEZ TAVERNA VS LEITE LIVE STREAM]~


Get started

Sign In


CHI NGUYEN



480 Followers



MSc in Statistics. Sharing my leaning tips in the journey of becoming a better
data analyst. Linkedin: https://www.linkedin.com/in/chinguyenphamhai/


Follow



MORE FROM MEDIUM

Vicky Yu

in

Towards Data Science

DATA CLEANING: THE UNSEXY SIDE OF DATA SCIENCE



Ivo Bernardo

in

Geek Culture

5 DATA PATHS TO FOLLOW AFTER LEARNING YOUR FIRST PROGRAMMING LANGUAGE



Soner Yıldırım

SQL FROM ZERO TO HERO IN 30 DAYS — DAY 8 — DATE MANIPULATION



Angelica Lo Duca

HOW TO REPRESENT AND QUERY HIERARCHICAL DATA IN SQL



Help

Status

Writers

Blog

Careers

Privacy

Terms

About

Knowable

To make Medium work, we log user data. By using Medium, you agree to our Privacy
Policy, including cookie policy.