POWERFUL ONE-LINERS IN PANDAS EVERY DATA SCIENTIST SHOULD KNOW

THINGS YOU CAN DO IN ONE LINE USING PANDAS

Avi Chawla · Published in Towards Data Science · Jul 12 · 8 min read

Photo by KirstenMarie on Unsplash

Training data-driven machine learning models has never been easier than it is today. For instance, suppose you are training a vanilla neural network. Adjusting the number of hidden layers and their dimensions, tweaking the hyperparameters, or changing the loss function can all be done with a slight modification to the model definition or its optimizer. On one hand, this is advantageous, as it removes the heavy lifting of designing architectures from scratch. On the other hand, it has often led machine learning practitioners and researchers to neglect the importance of data visualization and analysis, training deep models directly without first establishing a clear understanding of their data.

Therefore, in this post, I would like to introduce you to a handful of essential and powerful one-liners for tabular data in Pandas that will help you better understand your data and, consequently (and hopefully), help you design and build better machine learning models.

DATASET

For this post, I will experiment with a dummy dataset of one thousand employees, which I created myself in Python. The image below gives an overview of the dataset we are experimenting with.

First five rows of the DataFrame (Image by author)

The code block below demonstrates my implementation:
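(The original implementation was embedded as an image. What follows is a minimal sketch of how such a dummy dataset could be generated; the column names are inferred from the examples later in the post, while the specific companies, locations, roles, and salary ranges are assumptions.)

```python
import numpy as np
import pandas as pd

# Reproducible dummy data: 1,000 employees. The specific companies,
# locations, roles, and salary range below are assumptions, not the
# author's original values.
rng = np.random.default_rng(42)
n = 1000

companies = ["TCS", "Infosys", "Wipro", "Accenture", "IBM"]
locations = ["Mumbai", "Bangalore", "New Delhi", "Pune", "Hyderabad"]
roles = ["Data Scientist", "Data Engineer", "Analyst", "Manager"]

df = pd.DataFrame({
    "Name": [f"Employee_{i}" for i in range(n)],
    "Company Name": rng.choice(companies, size=n),
    "Location": rng.choice(locations, size=n),
    "Designation": rng.choice(roles, size=n),
    "Salary": rng.integers(40_000, 200_000, size=n),
})

print(df.head())
```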
ONE-LINERS IN PANDAS

Next, let's discuss some popular functions available in Pandas for building a meaningful understanding of the available data.

#1 N-LARGEST VALUES IN A SERIES

Say we want to start off by finding the top-n paid roles in this dataset. You can do this using the nlargest() method in Pandas. This method returns the first n rows with the largest values in the specified column(s), ordered in descending order. Note that nlargest() returns entire rows, i.e., it also returns the columns that were not used for ordering; those extra columns simply play no part in the ordering. The code snippet below depicts the use of the nlargest() method on our DataFrame (runnable sketches of both nlargest() and nsmallest() follow at the end of the next subsection).

The output of the nlargest method (Image by Author)

When duplicate values exist, we need to specify which particular row(s) we want in the final output. This is done using the keep argument, which can take the following values:

1. keep = "first": prioritizes the first occurrence.
2. keep = "last": prioritizes the last occurrence.
3. keep = "all": does not drop any duplicates, even if it means selecting more than n items (as in the image above).

It is often assumed that nlargest() is precisely equivalent to sorting with the sort_values() method and taking the first n rows, as follows:

Output of sort_values method (Image by Author)

However, the keep argument of nlargest() makes all the difference. In the example above, nlargest() with keep="all" returns the potential duplicates as well, which cannot be achieved with sort_values().

#2 N-SMALLEST VALUES IN A SERIES

Similar to the nlargest() method discussed above, you can find the rows corresponding to the lowest n values using the nsmallest() method in Pandas. This method returns the first n rows with the smallest values in the specified column(s), arranged in ascending order. The arguments passed here are the same as those of nlargest(). The code snippet below depicts the use of the nsmallest() method on our DataFrame.

The output of the nsmallest method (Image by Author)
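(The snippets for #1 were embedded as images. Below is a minimal sketch, assuming the dummy dataset built earlier; the "Salary" and "Designation" column names are assumptions.)

```python
# Top-5 salaries; keep="all" retains ties even if that yields more than 5 rows.
top_paid = df.nlargest(5, columns="Salary", keep="all")
print(top_paid[["Name", "Designation", "Salary"]])

# Near-equivalent via sorting: unlike keep="all", this cannot retain
# ties that fall just outside the first 5 rows.
top_paid_sorted = df.sort_values(by="Salary", ascending=False).head(5)
print(top_paid_sorted[["Name", "Designation", "Salary"]])
```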
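(Likewise for #2, a minimal nsmallest() sketch under the same assumptions.)

```python
# Bottom-5 salaries, arranged in ascending order; arguments mirror nlargest().
lowest_paid = df.nsmallest(5, columns="Salary", keep="all")
print(lowest_paid[["Name", "Designation", "Salary"]])
```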
#3 CROSSTABS

Crosstab allows you to compute a cross-tabulation of two (or more) columns/series, returning the frequency of each combination by default. In other words, crosstab() takes one column/list and displays its unique values as the index, then takes another column/list and displays its unique values as the column headers. The values in the individual cells are computed using an aggregation function; by default, they indicate the co-occurrence frequency.

Say, for instance, we wish to compute the number of employees working from each location within every company. This can be done as follows (runnable sketches for both #3 and #4 follow at the end of the pivot-table subsection):

The output of Crosstab to compute the frequency of co-occurrence (Image by Author)

As it can be hard to interpret raw counts in a crosstab (and to make it more visually appealing), we can generate a heatmap from the crosstab:

Heatmap depicting the co-occurrence DataFrame (Image by author)

If you wish to compute an aggregation over some column other than the ones that make up the index and the column headers, you can do so by passing that column to the values argument of crosstab(), together with an aggfunc:

Heatmap depicting the average salary (Image by author)

#4 PIVOT TABLE

Pivot tables are a commonly used data analysis tool in Excel. Similar to the crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate your data. Although the two share numerous similarities and are conceptually the same in the context of Pandas, a few implementation differences set them apart (further reading here). The code snippet below demonstrates the use of the pivot_table() method to find the frequency of co-occurrence between "Company Name" and "Location":

The output of the pivot table to compute the frequency of co-occurrence (Image by Author)

Similar to what we did with the crosstab, we can create a heatmap to make the result more visually appealing and interpretable:

Heatmap depicting the co-occurrence DataFrame (Image by author)
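(The crosstab snippets in #3 were embedded as images. A minimal sketch of all three steps, continuing with the dummy dataset and assuming seaborn for the heatmaps.)

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Frequency of co-occurrence: employees per (Company Name, Location) pair.
counts = pd.crosstab(df["Company Name"], df["Location"])
print(counts)

# Heatmap of the co-occurrence counts.
sns.heatmap(counts, annot=True, fmt="d", cmap="Blues")
plt.show()

# Aggregating a third column instead of counting:
# average salary per (company, location) cell.
avg_salary = pd.crosstab(
    df["Company Name"], df["Location"],
    values=df["Salary"], aggfunc="mean",
)
sns.heatmap(avg_salary, annot=True, fmt=".0f", cmap="Blues")
plt.show()
```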
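(For #4, a minimal pivot_table() sketch producing the same co-occurrence table; counting over the "Name" column is an assumption about the original snippet.)

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Same cross-tabulation via pivot_table: count rows per (company, location).
pivot_counts = df.pivot_table(
    index="Company Name", columns="Location",
    values="Name", aggfunc="count", fill_value=0,
)
print(pivot_counts)

# The heatmap is built exactly as before.
sns.heatmap(pivot_counts, annot=True, fmt=".0f", cmap="Blues")
plt.show()
```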
#5 HANDLING DUPLICATED DATA

In addition to regular data analysis, appropriately handling duplicate values in your data plays a vital role in building your data pipeline. One major caveat of having duplicates in your data is that they take up unnecessary storage space and slow down computation by consuming resources. Furthermore, duplicate data can skew analysis results, leading us to draw wrong conclusions. Therefore, removing or otherwise handling duplicates in your data is extremely important.

First, let's look at how you can mark duplicate values in your DataFrame. For this, we'll use the duplicated() method in Pandas, which returns a boolean Series indicating duplicate rows. For demonstration purposes, I'll use only a random sample of 10 rows of the original salary dataset, of which the last two rows have been intentionally duplicated. The sampled rows are shown in the image below (runnable sketches for marking and dropping duplicates follow at the end of this section).

A DataFrame with duplicates (Image by author)

* Mark duplicated rows

Pandas allows you to assign boolean labels to rows that are duplicates based on all columns (or a subset of columns). This is done with the duplicated() method as shown below. When there are duplicate values, keep indicates which specific occurrences to mark:

1. keep = "first": (Default) marks all duplicates as True except for the first occurrence.
2. keep = "last": marks all duplicates as True except for the last occurrence.
3. keep = False: marks all duplicates as True.

You can filter down to the rows that appear only once by using the boolean Series as a mask on the DataFrame:

A filtered DataFrame with no duplicates (Image by author)

To check for duplicates on a subset of columns, pass the list of columns as the subset argument of the duplicated() method. Filtering the DataFrame with the resulting boolean Series then gives:

A filtered DataFrame with duplicates considering two columns (Image by author)

* Remove duplicates

In addition to marking potential duplicates with boolean labels as discussed above, one might also need to get rid of duplicates altogether. To reiterate, the data I am referring to in this section comprises just ten rows, shown below:

A DataFrame with duplicates (Image by author)

You can remove the duplicate rows based on values in all columns, or in a subset of columns, using the drop_duplicates() method as shown below:

A DataFrame after dropping duplicate rows (Image by author)

Similar to duplicated(), the keep argument indicates which specific duplicates you want to keep:

1. keep = "first": (Default) drops all duplicates except for the first occurrence.
2. keep = "last": drops all duplicates except for the last occurrence.
3. keep = False: drops all duplicates.

To drop duplicates based on the values in a subset of columns, pass the list of columns as the subset argument to the drop_duplicates() method:

A DataFrame after dropping duplicate rows considering two columns (Image by author)
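(The snippets in #5 were embedded as images. A minimal sketch of marking and filtering duplicates; the way the 10-row demo frame is built and the subset columns "Company Name" and "Location" are assumptions consistent with the captions.)

```python
import pandas as pd

# Build a 10-row demo frame: 8 sampled rows plus the last two duplicated.
sample = df.sample(8, random_state=0)
sample = pd.concat([sample, sample.tail(2)], ignore_index=True)

# Mark duplicate rows across all columns; keep=False flags every occurrence.
dup_mask = sample.duplicated(keep=False)
print(dup_mask)

# Keep only the rows that appear exactly once.
print(sample[~dup_mask])

# Mark duplicates considering only a subset of columns.
subset_mask = sample.duplicated(subset=["Company Name", "Location"], keep=False)
print(sample[~subset_mask])
```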
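(And a minimal drop_duplicates() sketch on the same demo frame.)

```python
# Drop duplicate rows across all columns, keeping the first occurrence.
deduped = sample.drop_duplicates(keep="first")
print(deduped)

# Drop duplicates considering only a subset of columns.
deduped_subset = sample.drop_duplicates(
    subset=["Company Name", "Location"], keep="first")
print(deduped_subset)
```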
To conclude, in this post I presented a few popular methods available in Pandas for effective analysis of tabular data. Though this post should make you comfortable with the syntax of these methods, I would highly recommend downloading a dataset of your own and experimenting with it in a Jupyter notebook. Furthermore, there is no better reference than the official Pandas documentation for acquiring fundamental and practical knowledge of Pandas' many effective methods. It provides a detailed explanation of every argument a function accepts, along with practical examples, which in my opinion is an excellent way to build both beginner-level and advanced Pandas expertise.

P.S. I have only been able to cover five methods in this post. I'll release the next set of Pandas methods for effective data analysis in another post soon :). Meanwhile, if you enjoyed reading this article, I am sure you would enjoy the following articles too:

20% OF PANDAS FUNCTIONS THAT DATA SCIENTISTS USE 80% OF THE TIME
PUTTING PARETO'S PRINCIPLE TO WORK ON THE PANDAS LIBRARY (towardsdatascience.com)

TOP AI RESOURCES YOU MUST FOLLOW IF YOU ARE INTO AI
HOW TO KEEP UP WITH THE LATEST MACHINE LEARNING ADVANCEMENTS (medium.com)

Thanks for reading.