
Published in Towards Data Science

Avi Chawla · Jul 12 · 8 min read


POWERFUL ONE-LINERS IN PANDAS EVERY DATA SCIENTIST SHOULD KNOW


THINGS YOU CAN DO IN ONE LINE USING PANDAS


Photo by KirstenMarie on Unsplash

Training data-driven machine learning models has never been easier than it is
today. For instance, assume you are training a vanilla neural network. Here,
adjusting the number of hidden layers and their dimensions, tweaking the
hyperparameters, or changing the loss function can all be done with a slight
modification to the model definition or its optimizer.

On one hand, this is advantageous, as it reduces the heavy lifting of designing
architectures from scratch. On the other, it has often led machine learning
practitioners and researchers to neglect the importance of data visualization
and analysis, training deep models directly without first establishing a clear
understanding of their data.

Therefore, in this post, I would like to introduce you to a handful of
essential and powerful Pandas one-liners for tabular data that will help you
better understand your data and, consequently (and hopefully), help you design
and build better machine learning models.




DATASET

For this post, I will experiment with a dummy dataset of one thousand employees
that I created myself in Python. The image below gives an overview of the
dataset we are experimenting with.


First five rows of the DataFrame (Image by author)

The code block below demonstrates my implementation:
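The original snippet was embedded as a gist and did not survive extraction; a minimal sketch of how such a dummy dataset could be built is shown below. The column names and value pools are assumptions inferred from the columns referenced later in the post, not the author's exact code.

```python
import numpy as np
import pandas as pd

# Build a dummy dataset of 1,000 employees. The value pools below are
# illustrative assumptions, not the post's original data.
rng = np.random.default_rng(0)
n = 1000

df = pd.DataFrame({
    "Name": [f"Employee_{i}" for i in range(n)],
    "Company Name": rng.choice(["Company A", "Company B", "Company C"], size=n),
    "Location": rng.choice(["New York", "London", "Bangalore"], size=n),
    "Role": rng.choice(["Analyst", "Engineer", "Manager", "Director"], size=n),
    "Salary": rng.integers(40_000, 200_000, size=n),
})

print(df.head())
```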




ONE-LINERS IN PANDAS

Next, let’s discuss some popular functions available in Pandas for building a
meaningful understanding of the available data.


#1 N-LARGEST VALUES IN A SERIES

Say we want to start off by finding the top-n paid roles in this dataset. You
can do this using the nlargest() method in Pandas. This method returns the
first n rows with the largest values in the specified column(s), ordered in
descending order.

Note that nlargest() returns the entire DataFrame, i.e., it also returns the
columns not specified for ordering; they simply are not used to order the
DataFrame. The code snippet below depicts the use of the nlargest() method on
our DataFrame.
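A small, self-contained sketch (the toy values are illustrative stand-ins for the post's dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Role":   ["Director", "Manager", "Engineer", "Analyst", "Intern"],
    "Salary": [150_000, 120_000, 120_000, 120_000, 40_000],
})

# Top-3 salaries; keep="all" retains every row tied with the cut-off
# value, so more than 3 rows can come back.
top = df.nlargest(3, "Salary", keep="all")
print(top)
```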



The output of the nlargest method (Image by Author)

When duplicate values exist, we need to specify which particular row(s) we want
in the final output. This is done using the keep argument that can take the
following values:

 1. keep = "first": prioritizes the first occurrence.
 2. keep = "last": prioritizes the last occurrence.
 3. keep = "all": does not drop any duplicates, even if it means selecting more
    than n items (like in the image above).

It is often assumed that nlargest() is precisely equivalent to using the
sort_values() method as follows:
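The seemingly equivalent sort_values() call, sketched with the same toy values assumed above:

```python
import pandas as pd

df = pd.DataFrame({
    "Role":   ["Director", "Manager", "Engineer", "Analyst", "Intern"],
    "Salary": [150_000, 120_000, 120_000, 120_000, 40_000],
})

# Always returns exactly 3 rows: there is no keep= option to expand
# the result when values are tied at the cut-off.
top = df.sort_values("Salary", ascending=False).head(3)
print(top)
```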



Output of sort_values method (Image by Author)

However, the keep argument used in nlargest() makes all the difference.
Considering the example above, nlargest() with keep="all" returns potential
duplicates as well. This cannot be done with the sort_values() method.


#2 N-SMALLEST VALUES IN A SERIES

Similar to the nlargest() method discussed above, you can find the rows
corresponding to the lowest-n values using the nsmallest() method in Pandas.
This method returns the first n rows with the smallest values in the specified
column(s), arranged in ascending order. The arguments here are the same as
those of the nlargest() method. The code snippet below depicts the use of the
nsmallest() method on our DataFrame.
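A sketch with illustrative toy values:

```python
import pandas as pd

df = pd.DataFrame({
    "Role":   ["Director", "Manager", "Engineer", "Analyst", "Intern"],
    "Salary": [150_000, 120_000, 90_000, 60_000, 40_000],
})

# Bottom-2 salaries, in ascending order; accepts the same keep=
# argument as nlargest().
bottom = df.nsmallest(2, "Salary")
print(bottom)
```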



The output of the nsmallest method (Image by Author)


#3 CROSSTABS

Crosstab allows you to compute a cross-tabulation of two (or more)
columns/series and, by default, returns the frequency of each combination. In
other words, crosstab() takes one column/list and displays its unique values as
the index, then takes another column/list and displays its unique values as the
column headers. The values in the individual cells are computed using an
aggregation function; by default, they indicate the co-occurrence frequency.

Say, for instance, we wish to compute the number of employees working from each
location within every company. This can be done as follows:
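A self-contained sketch, using toy data standing in for the post's employee dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B",
                     "Company B", "Company B", "Company C"],
    "Location":     ["New York", "New York", "New York",
                     "London", "London", "New York"],
})

# Frequency of each (company, location) combination.
counts = pd.crosstab(df["Company Name"], df["Location"])
print(counts)
```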



The output of Crosstab to compute the frequency of co-occurrence (Image by
Author)

As numerical values in a crosstab can be hard to interpret (and to make the
result more visually appealing), we can generate a heatmap from the crosstab as
follows:
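One way to do this is with seaborn's heatmap(); the author's exact styling is unknown, so this is only a sketch:

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B", "Company B"],
    "Location":     ["New York", "London", "New York", "New York"],
})

counts = pd.crosstab(df["Company Name"], df["Location"])

# annot=True prints the count inside each cell; fmt="d" renders the
# values as integers.
ax = sns.heatmap(counts, annot=True, fmt="d", cmap="Blues")
```

Call matplotlib.pyplot.show() (or render inline in a notebook) to display the figure.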



Heatmap depicting the co-occurrence dataframe (Image by author)

If you wish to aggregate some column other than the ones that make up the index
and the column headers, you can do so by passing the aggregation column to the
values argument of crosstab() (together with an aggfunc) as shown below:
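A sketch with assumed toy salaries:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B"],
    "Location":     ["New York", "New York", "London"],
    "Salary":       [100_000, 140_000, 90_000],
})

# Aggregate a third column instead of counting: values= supplies the
# column, and aggfunc= is required alongside it.
avg_salary = pd.crosstab(df["Company Name"], df["Location"],
                         values=df["Salary"], aggfunc="mean")
print(avg_salary)
```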



Heatmap depicting the average salary (Image by author)


#4 PIVOT TABLE

Pivot tables are a commonly used data analysis tool in Excel. Similar to
crosstabs discussed above, pivot tables in Pandas provide a way to
cross-tabulate your data.

Although the two share numerous similarities and are conceptually the same in
the context of Pandas, a few implementation differences set them apart (further
reading here). The code snippet below demonstrates the use of the pivot_table()
method to find the frequency of co-occurrence between “Company Name” and
“Location”:
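A sketch of the same co-occurrence count via pivot_table(), on assumed toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B", "Company C"],
    "Location":     ["New York", "New York", "London", "New York"],
})

# aggfunc="size" counts the rows in each (company, location) group;
# fill_value=0 replaces missing combinations.
freq = df.pivot_table(index="Company Name", columns="Location",
                      aggfunc="size", fill_value=0)
print(freq)
```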



The output of the pivot table to compute the frequency of co-occurrence (Image
by Author)

Similar to what we did with the crosstab, we can create a heatmap to make the
result more visually appealing as well as more interpretable, as shown in the
code snippet below:
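As with the crosstab, one possible seaborn sketch (styling assumed):

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B", "Company C"],
    "Location":     ["New York", "New York", "London", "New York"],
})

freq = df.pivot_table(index="Company Name", columns="Location",
                      aggfunc="size", fill_value=0)

# The pivot table is itself a DataFrame, so it feeds straight into
# seaborn's heatmap().
ax = sns.heatmap(freq, annot=True, fmt="d", cmap="Blues")
```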



Heatmap depicting the co-occurrence dataframe (Image by author)


#5 HANDLING DUPLICATED DATA

In addition to regular data analysis, appropriately handling duplicate values
in your data plays a vital role in building your data pipeline. One major
caveat of having duplicates is that they take up unnecessary storage space and
slow down computation by consuming resources. Furthermore, duplicate data can
skew analysis results, leading us to draw wrong insights. Therefore, removing
or handling duplicates in your data is extremely important.

First, let’s look at how you can mark duplicate values in your DataFrame. For
this, we’ll use the duplicated() method in Pandas, which returns a boolean
Series indicating duplicate rows. For demonstration purposes, I’ll only use a
random sample of 10 rows of the original salary dataset, of which the last two
rows have been intentionally duplicated. The sampled rows are shown in the
image below.


A Dataframe with duplicates (Image by author)
 * Mark duplicated rows

Pandas allows you to assign boolean labels to rows that are duplicates based on
all columns (or a subset of columns). This can be done using the duplicated()
method as shown below:
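A sketch on a small frame whose last two rows duplicate the first (toy values assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":   ["Alice", "Bob", "Carol", "Alice", "Alice"],
    "Salary": [100_000, 90_000, 120_000, 100_000, 100_000],
})

# keep="first" by default: the first occurrence is not marked.
mask = df.duplicated()
print(mask)
```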



When there are duplicate values, keep is used to indicate which specific
duplicates to mark.

 1. keep = "first": (Default) Marks all duplicates as True except for the first
    occurrence.
 2. keep = "last": Marks all duplicates as True except for the last occurrence.
 3. keep = False: Marks all duplicates as True.

You can filter down to the rows that appear only once by negating the boolean
series (computed with keep=False) and using it as a mask on the DataFrame as
follows:
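Continuing the assumed toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":   ["Alice", "Bob", "Carol", "Alice", "Alice"],
    "Salary": [100_000, 90_000, 120_000, 100_000, 100_000],
})

# keep=False flags every copy of a duplicated row, so negating the
# mask keeps only the rows that appear exactly once.
unique_rows = df[~df.duplicated(keep=False)]
print(unique_rows)
```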



A filtered Dataframe with no duplicates (Image by author)

To check for duplicates on a subset of columns, pass the list of columns as the
subset argument of the duplicated() method as shown below:
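A sketch on assumed toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B"],
    "Location":     ["New York", "New York", "New York"],
    "Salary":       [100_000, 140_000, 90_000],
})

# Rows count as duplicates when they match on just these two columns,
# even though their salaries differ.
mask = df.duplicated(subset=["Company Name", "Location"])
print(mask)
```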



Filtering the DataFrame using the above boolean series produces the following
output:



A filtered Dataframe with duplicates considering two columns (Image by author)
 * Remove duplicates

In addition to marking potential duplicates using boolean labels discussed
above, one might also need to get rid of duplicates. To reiterate, the data I am
referring to specifically for the “Handling Duplicated Data” section comprises
just ten rows. This is shown below:


A Dataframe with duplicates (Image by author)

You can remove the duplicate rows either based on values in all columns or a
subset of columns using the drop_duplicates() method as shown below:
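A sketch on the same assumed ten-row-style toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":   ["Alice", "Bob", "Carol", "Alice", "Alice"],
    "Salary": [100_000, 90_000, 120_000, 100_000, 100_000],
})

# keep="first" by default: the first occurrence of each duplicated
# row survives, the rest are dropped.
deduped = df.drop_duplicates()
print(deduped)
```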



A DataFrame after dropping duplicate rows (Image by author)

Similar to duplicated(), the keep argument is used to indicate which specific
duplicates you want to keep.

 1. keep = "first": (Default) Drops all duplicates except for the first
    occurrence.
 2. keep = "last": Drops all duplicates except for the last occurrence.
 3. keep = False: Drops all duplicates.

To drop duplicates based on the values in a subset of columns, pass the list of
columns as the subset argument to the drop_duplicates() method:
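A sketch with assumed toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["Company A", "Company A", "Company B"],
    "Location":     ["New York", "New York", "New York"],
    "Salary":       [100_000, 140_000, 90_000],
})

# Only the first row of each (company, location) pair survives.
deduped = df.drop_duplicates(subset=["Company Name", "Location"])
print(deduped)
```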



A DataFrame after dropping duplicate rows considering two columns (Image by
author)


To conclude, in this post, I presented a few popular methods available in
Pandas for effective analysis of tabular data. While this post should make you
comfortable with the syntax of these methods, I would highly recommend
downloading a dataset on your own and experimenting with it in a Jupyter
notebook.

Further, there is no better place to acquire fundamental and practical
knowledge of Pandas' many effective methods than the official documentation,
available here. It provides a detailed explanation of each argument a function
accepts, along with a practical example, which, in my opinion, is an excellent
way to acquire both beginner-level and advanced Pandas expertise.

P.S. I was able to cover only five methods in this post; I’ll release the next
set of Pandas methods for effective data analysis in another post soon :).
Meanwhile, if you enjoyed reading this article, I am sure you would enjoy the
following articles too:


20% OF PANDAS FUNCTIONS THAT DATA SCIENTISTS USE 80% OF THE TIME
Putting Pareto’s Principle to Work on the Pandas Library (towardsdatascience.com)

TOP AI RESOURCES YOU MUST FOLLOW IF YOU ARE INTO AI
How to Keep Up With the Latest Machine Learning Advancements (medium.com)

Thanks for reading.



