towardsdatascience.com
Open in
urlscan Pro
162.159.152.4
Public Scan
Submitted URL: https://towardsdatascience.com/how-to-use-group-by-and-partition-by-in-sql-f3d241846e3e
Effective URL: https://towardsdatascience.com/how-to-use-group-by-and-partition-by-in-sql-f3d241846e3e?gi=dc71f8cb6eab
Submission: On April 22 via manual from IN — Scanned from DE
Effective URL: https://towardsdatascience.com/how-to-use-group-by-and-partition-by-in-sql-f3d241846e3e?gi=dc71f8cb6eab
Submission: On April 22 via manual from IN — Scanned from DE
Form analysis
0 forms found in the DOMText Content
Open in app Sign In Get started Home Notifications Lists Stories -------------------------------------------------------------------------------- Write RESPONSES What are your thoughts? Cancel Respond Also publish to my profile There are currently no responses for this story. Be the first to respond. Published in Towards Data Science You have 2 free member-only stories left this month. Sign up for Medium and get an extra one Chi Nguyen Follow Mar 1, 2021 · 5 min read · Listen Save HOW TO USE GROUP BY AND PARTITION BY IN SQL Photo by Markus Spiske on Unsplash INTRODUCTION When I first learned SQL, I had a problem of differentiating between PARTITION BY and GROUP BY, as they both have a function for grouping. I believe many people who begin to work with SQL may encounter the same problem. Therefore, in this article I want to share with you some examples of using PARTITION BY, and the difference between it and GROUP BY in a select statement. EXPLORATION SAMPLE DATA Firstly, I create a simple dataset with 4 columns. The df table below describes the amount of money and type of fruit that each employee in different functions will bring in their company trip. Figure 1: df table Figure 2: df information PARTITION BY VS GROUP BY The following is the syntax of Partition By: SELECT expression 1, expression 2, ... aggregate function () OVER (PARTITION BY expression 1 order_clause frame_clause) FROM table When we want to do an aggregation on a specific column, we can apply PARTITION BY clause with the OVER clause. Let’s look at the example below to see how the dataset has been transformed. Figure 3: Output of Partition By Clause In the example, I want to calculate the total and average amount of money that each function brings for the trip. What you can see in the screenshot is the result of my PARTITION BY query. Now, if I use GROUP BY instead of PARTITION BY in the above case, what would the result look like? First, the syntax of GROUP BY can be written as: SELECT expression 1, expression 2 aggregate function () FROM tables WHERE conditions GROUP BY expression 1, expression 2 When I apply this to the query to find the total and average amount of money in each function, the aggregated output is similar to a PARTITION BY clause. However, as you notice, there is a difference in the figure 3 and figure 4 result. Figure 4: Output of Group By clause * GROUP BY gives per function in the company a result (Figure 4). Meanwhile, as we have 7 records in the df table, PARTITION BY retrieves all 7 rows with total_amount and average_amount in each row (Figure 3). Therefore, in conclusion, the PARTITION BY retrieves all the records in the table, while the GROUP BY only returns a limited number. * One more thing is that GROUP BY does not allow to add columns which are not parts of GROUP BY clause in select statement. However, with PARTITION BY clause, we can add required columns. PARTITION BY WITH ROW NUMBER We can combine PARTITION BY and ROW NUMBER to have the row number sorted by a specific value. For example, if I want to see which person in each function brings the most amount of money, I can easily find out by applying the ROW_NUMBER function to each team and getting each person’s amount of money ordered by descending values. Figure 5: Row Number function example PARTITION BY WITH CUMULATIVE VALUE PARTITION BY + ROWS UNBOUNDED PRECEDING For easier imagination, I will begin with an example to explain the idea of this section. I created a new table named df8. Figure 6: New data frame (df8) By applying ROW_NUMBER, I got the row number value sorted by amount of money for each employee in each function. Basically until this step, as you can see in figure 7, everything is similar to the example above. Figure 7: Output However, as I want to calculate one more column, which is the average money amount of the current row and the higher value amount before the current row in partition. For example in the figure 8, we can see that: * In the Tech team, Sam alone has an average cumulative amount of 400000. * However, in row number 2 of the Tech team, the average cumulative amount is 340050, which equals the average of (Hoang’s amount + Sam’s amount). * In row number 3, the money amount of Dung is lower than Hoang and Sam, so his average cumulative amount is average of (Hoang’s, Sam’s and Dung’s amount) => This is a general idea of how ROWS UNBOUNDED PRECEDING and PARTITION BY clause are used together. Figure 8: Cumulative Average Amount using ROWS UNBOUNDED PRECEDING PARTITION BY + ROWS BETWEEN CURRENT ROW AND 1 The usage of this combination is to calculate the aggregated values (average, sum, etc) of the current row and the following row in partition. Let’s continue to work with df9 data to see how this is done. Figure 9: Cumulative Average Amount using ROWS BETWEEN CURRENT ROW AND 1 From figure 9, we can figure out that: * In Tech function row number 1, the average cumulative amount of Sam is 340050, which equals the average amount of her and her following person (Hoang) in row number 2. * Then, the average cumulative amount of Hoang is the average of Hoang’s amount and Dung’s amount in row number 3. CONCLUSION In this article, I provided my understanding of PARTITION BY and GROUP BY along with some different cases of using PARTITION BY. I hope the above information will be helpful for you. REFERENCE SQL PARTITION BY CLAUSE OVERVIEW APRIL 9, 2019 BY THIS ARTICLE WILL COVER THE SQL PARTITION BY CLAUSE AND, IN PARTICULAR, THE DIFFERENCE WITH GROUP BY… www.sqlshack.com 135 135 135 SIGN UP FOR THE VARIABLE BY TOWARDS DATA SCIENCE Every Thursday, the Variable delivers the very best of Towards Data Science: from hands-on tutorials and cutting-edge research to original features you don't want to miss. Take a look. Get this newsletter MORE FROM TOWARDS DATA SCIENCE Follow Your home for data science. A Medium publication sharing concepts, ideas and codes. Mathias Kemeter ·Mar 1, 2021 CREATING A PLAYGROUND FOR SPATIAL ANALYTICS How to setup and use QGIS with SAP HANA Cloud in less than 30 Minutes — You are interested in Spatial Analytics and looking for some content to get started? The combination of SAP HANA Cloud Trial with QGIS is providing you a low entry barrier into spatial data processing at scale. … Database 12 min read -------------------------------------------------------------------------------- Share your ideas with millions of readers. Write on Medium -------------------------------------------------------------------------------- Mostafa Ibrahim ·Mar 1, 2021 WHAT IS A TENSOR PROCESSING UNIT (TPU) AND HOW DOES IT WORK? TPUs are over 20x times faster than state-of-art GPUs… But how? — TPUs are hardware accelerators specialized in deep learning tasks. In this code lab, you will see how to use them with Keras and Tensorflow 2. Cloud TPUs are available in a base configuration with 8 cores and also in larger configurations called “TPU pods” of up to 2048 cores. … Artificial Intelligence 5 min read -------------------------------------------------------------------------------- Adam Green ·Mar 1, 2021 SEVEN STRATEGIES TO GENERATE DATA SCIENCE PROJECT IDEAS Find data science portfolio project ideas with these seven strategies — Personal projects are one of the best things about data science. Most projects require large teams & expensive software — try building a bridge as a lone civil engineer with only a laptop and an internet connection! The openness of the data science community with data, tools and education means… Data Science 14 min read -------------------------------------------------------------------------------- Nik Piepenbreier ·Mar 1, 2021 VISUALIZING COVID-19 VACCINE DATA IN PYTHON (IN 5 SIMPLE STEPS) Make your MatPlotlib graphs stand out — Matplotlib is one of the most commonly used data visualization libraries in Python. However, its default settings lead to drab, uninspiring graphs. This quick tutorial will guide you through some of the tips and tricks you can use to design much prettier Matplotlib graphs. What you’ll be making! We’ll be taking some COVID-19 Vaccine… Python 4 min read -------------------------------------------------------------------------------- Indhumathy Chelliah ·Mar 1, 2021 LINE OF BEST FIT IN LINEAR REGRESSION Correlation Coefficient, Coefficient of determination, Model Coefficient — Linear Regression Linear Regression is one of the most important algorithms in machine learning. It is the statistical way of measuring the relationship between one or more independent variables vs one dependent variable. The Linear Regression model attempts to find the relationship between variables by finding the best fit line. Let’s learn… Artificial Intelligence 7 min read -------------------------------------------------------------------------------- Read more from Towards Data Science RECOMMENDED FROM MEDIUM Rendered.ai Marketing in RenderedAI RENDERED.AI PARTICIPATES IN #DATAFORWILDLIFE CHALLENGE TO TACKLE WILDLIFE CRIME ONLINE Ajay Mane 10 MOST POPULAR STATISTICAL HYPOTHESIS TESTING METHODS USING PYTHON Tobiloba Adaramati in Udacity Technology Scholarship RESEARCH METHODS 101 Anmol Hans in Analytics Vidhya IPL ANALYSIS USING PANDAS: READ THIS IF YOU ARE PLANNING TO MAKE A TEAM IN DREAM11 Andrew Muller TANZANIAN WATER WELL CLASSIFICATION Yue Ma & Ryan Deng RECOMMENDER SYSTEM FOR RESTAURANTS BASED ON COMMENTS James Reade COVID 19 TRENDS — DETERMINING THEM AUTOMATICALLY Mcesnokova RODRIGUEZ TAVERNA VS LEITE LIVE STREAM]~ Get started Sign In CHI NGUYEN 480 Followers MSc in Statistics. Sharing my leaning tips in the journey of becoming a better data analyst. Linkedin: https://www.linkedin.com/in/chinguyenphamhai/ Follow MORE FROM MEDIUM Vicky Yu in Towards Data Science DATA CLEANING: THE UNSEXY SIDE OF DATA SCIENCE Ivo Bernardo in Geek Culture 5 DATA PATHS TO FOLLOW AFTER LEARNING YOUR FIRST PROGRAMMING LANGUAGE Soner Yıldırım SQL FROM ZERO TO HERO IN 30 DAYS — DAY 8 — DATE MANIPULATION Angelica Lo Duca HOW TO REPRESENT AND QUERY HIERARCHICAL DATA IN SQL Help Status Writers Blog Careers Privacy Terms About Knowable To make Medium work, we log user data. By using Medium, you agree to our Privacy Policy, including cookie policy.