www.packtpub.com
Open in
urlscan Pro
2606:4700:10::6816:88b
Public Scan
Submitted URL: https://packt.omeclk.com/portal/wts/ug^cnN2co6ecv64gygEczea8eE8zmq4rhkrt7yCrCka
Effective URL: https://www.packtpub.com/en-us/product/python-data-cleaning-cookbook-9781803239873?utm_source=ProgrammingPro+Newsletter&u...
Submission: On September 17 via api from SA — Scanned from DE
Effective URL: https://www.packtpub.com/en-us/product/python-data-cleaning-cookbook-9781803239873?utm_source=ProgrammingPro+Newsletter&u...
Submission: On September 17 via api from SA — Scanned from DE
Form analysis
3 forms found in the DOMhttps://www.packtpub.com/en-us/search
<form class="search desktop" data-search="https://www.packtpub.com/api/rebuild/header/search" data-method="POST" action="https://www.packtpub.com/en-us/search">
<svg role="presentation" class="icon icon-5 search-box-icon">
<use href="https://www.packtpub.com/rebuild/build/assets/common-CwvaZMrJ.svg#search"></use>
<desc>Search icon</desc>
</svg>
<input type="text" name="query" class="search-input" placeholder="Search..." id="search" autocomplete="off">
<span class="loader d-none"></span>
<svg role="presentation" class="search-close d-none">
<use href="https://www.packtpub.com/rebuild/build/assets/common-CwvaZMrJ.svg#close"></use>
<desc>Close icon</desc>
</svg>
<div class="search-results d-none scrollbar" id="results"></div>
</form>
GET https://www.packtpub.com/checkout/add/ebook/9781803239873
<form action="https://www.packtpub.com/checkout/add/ebook/9781803239873" method="get">
<button id="product-buy-now" type="submit" class="rebuild-btn rebuild-btn-primary" data-analytics-type="add_to_cart" data-analytics-currency="USD" data-analytics-item-id="US-9781803246291-EBOOK"
data-analytics-item-title="Python Data Cleaning Cookbook - Second Edition" data-analytics-item-language="" data-analytics-item-framework="" data-analytics-item-concept="" data-analytics-item-publication-year="2024"
data-analytics-item-quantity="1" data-analytics-item-index="0" data-analytics-item-format="ebook" data-analytics-item-price="27.98" data-analytics-item-discount="12.01"> Buy Now </button>
</form>
GET https://www.packtpub.com/checkout/add/ebook/9781803239873
<form action="https://www.packtpub.com/checkout/add/ebook/9781803239873" method="get">
<button id="product-buy-now" type="submit" class="rebuild-btn rebuild-btn-primary" data-analytics-type="add_to_cart" data-analytics-currency="USD" data-analytics-item-id="US-9781803246291-EBOOK"
data-analytics-item-title="Python Data Cleaning Cookbook - Second Edition" data-analytics-item-language="" data-analytics-item-framework="" data-analytics-item-concept="" data-analytics-item-publication-year="2024"
data-analytics-item-quantity="1" data-analytics-item-index="0" data-analytics-item-format="ebook" data-analytics-item-price="27.98" data-analytics-item-discount="12.01"> Buy Now </button>
</form>
Text Content
Search icon Close icon Search icon CANCEL Subscription 0 Cart icon Cart Close icon You have no products in your basket yet Save more on your purchases! Buy 2 products and save 10% Buy 3 products and save 15% Buy 5 products and save 20% Savings automatically calculated. No voucher code required Profile icon Account Close icon Sign in New User? Create Account Your Subscription Your Owned Titles Your Account Your Orders Country Selection: CHANGE COUNTRY Modal Close icon Country selected Country selected United States Country selected United Kingdom Country selected India Country selected Germany Country selected France Country selected Canada Country selected Russia Country selected Spain Country selected Brazil Country selected Australia Country selected -------------------------------------------------------------------------------- Argentina Country selected Austria Country selected Belgium Country selected Bulgaria Country selected Chile Country selected Colombia Country selected Cyprus Country selected Czechia Country selected Denmark Country selected Ecuador Country selected Egypt Country selected Estonia Country selected Finland Country selected Greece Country selected Hungary Country selected Indonesia Country selected Ireland Country selected Italy Country selected Japan Country selected Latvia Country selected Lithuania Country selected Luxembourg Country selected Malaysia Country selected Malta Country selected Mexico Country selected Netherlands Country selected New Zealand Country selected Norway Country selected Philippines Country selected Poland Country selected Portugal Country selected Romania Country selected Singapore Country selected Slovakia Country selected Slovenia Country selected South Africa Country selected South Korea Country selected Sweden Country selected Switzerland Country selected Taiwan Country selected Thailand Country selected Turkey Country selected Ukraine Country selected Arrow left icon All Products Best Sellers New Releases Books Videos Audiobooks Learning Hub Newsletters Free Learning Arrow right icon Home > Data > Data Analysis > Python Data Cleaning Cookbook - Second Edition PYTHON DATA CLEANING COOKBOOK: PREPARE YOUR DATA FOR ANALYSIS WITH PANDAS, NUMPY, MATPLOTLIB, SCIKIT-LEARN, AND OPENAI, SECOND EDITION Profile Icon Michael Walker By Michael Walker $27.98 $39.99 Full star icon Full star icon Full star icon Full star icon Full star icon 5 (2 Ratings) Book May 2024 486 pages 2nd Edition eBook $27.98 $39.99 Print $39.98 $49.99 Subscription Free Trial Renews at $19.99p/m Profile Icon Michael Walker By Michael Walker $27.98 $39.99 Full star icon Full star icon Full star icon Full star icon Full star icon 5 (2 Ratings) Book May 2024 486 pages 2nd Edition eBook $27.98 $39.99 Print $39.98 $49.99 Subscription Free Trial Renews at $19.99p/m eBook $27.98 $39.99 Print $39.98 $49.99 Subscription Free Trial Renews at $19.99p/m WHAT DO YOU GET WITH EBOOK? Product feature icon Instant access to your Digital eBook purchase Product feature icon Download this book in EPUB and PDF formats Product feature icon Access this title in our online reader with advanced features Product feature icon DRM FREE - Read whenever, wherever and however you want Buy Now ADD TO CART Table of content icon View table of contents Preview book icon Preview Book PYTHON DATA CLEANING COOKBOOK - SECOND EDITION ANTICIPATING DATA CLEANING ISSUES WHEN IMPORTING TABULAR DATA WITH PANDAS Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. The recipes in this book demonstrate how many common data preparation tasks can be done more easily with pandas than with other tools. While we work with pandas extensively in this book, we also use other popular packages such as Numpy, matplotlib, and scipy. A key pandas object is the DataFrame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas DataFrame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate. Before we can make use of this great functionality, we have to import our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages. We examine tools to import tabular data in this recipe. Specifically, we cover the following topics: * Importing CSV files * Importing Excel files * Importing data from SQL databases * Importing SPSS, Stata, and SAS data * Importing R data * Persisting tabular data TECHNICAL REQUIREMENTS The code and notebooks for this chapter are available on GitHub at https://github.com/michaelbwalker/Python-Data-Cleaning-Cookbook-Second-Edition. You can use any IDE (Integrated Development Environment) of your choice – IDLE, Visual Studio, Sublime, Spyder, and so on – or Jupyter Notebook to work with any of the code in this chapter, or any chapter in this book. A good guide to get started with Jupyter Notebook can be found here: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. I used the Spyder IDE to write the code in this chapter. I used pandas 2.2.1 and NumPy version 1.24.3 for all of the code in this chapter and subsequent chapters. I have also tested all code with pandas 1.5.3. IMPORTING CSV FILES The read_csv method of the pandas library can be used to read a file with comma separated values (CSV) and load it into memory as a pandas DataFrame. In this recipe, we import a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data. Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas. All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark. GETTING READY Create a folder for this chapter, and then create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder, and then place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository, including the data files. Here is a screenshot of the beginning of the CSV file: Figure 1.1: Land Temperatures Data Data note This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository. HOW TO DO IT… We will import a CSV file into pandas, taking advantage of some very useful read_csv options: 1. Import the pandas library, and set up the environment to make viewing the output easier: import pandas as pd pd.options.display.float_format = '{:,.2f}'.format pd.set_option('display.width', 85) pd.set_option('display.max_columns', 8) 2. Read the data file, set new names for the headings, and parse the date column. Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False. This will cause pandas to load all of the data into memory at once, rather than in chunks. We do this so that pandas can identify the data type of each column automatically. In the There’s more… section, we see how to set the data type for each column manually: landtemps = pd.read_csv('data/landtempssample.csv', ... names=['stationid','year','month','avgtemp','latitude', ... 'longitude','elevation','station','countryid','country'], ... skiprows=1, ... parse_dates=[['month','year']], ... low_memory=False) type(landtemps) <class 'pandas.core.frame.DataFrame'> Note We have to use skiprows because we are passing a list of column names to read_csv. If we use the column names in the CSV file, we do not need to specify values for either names or skiprows. 1. Get a quick glimpse of the data. View the first few rows. Show the data type for all columns, as well as the number of rows and columns: landtemps.head(7) month_year stationid ... countryid country 0 2000-04-01 USS0010K01S ... US United States 1 1940-05-01 CI000085406 ... CI Chile 2 2013-12-01 USC00036376 ... US United States 3 1963-02-01 ASN00024002 ... AS Australia 4 2001-11-01 ASN00028007 ... AS Australia 5 1991-04-01 USW00024151 ... US United States 6 1993-12-01 RSM00022641 ... RS Russia [7 rows x 9 columns] landtemps.dtypes month_year datetime64[ns] stationed object avgtemp float64 latitude float64 longitude float64 elevation float64 station object countryid object country object dtype: object landtemps.shape (100000, 9) 1. Give the date column a more appropriate name and view the summary statistics for average monthly temperature: landtemps.rename(columns={'month_year':'measuredate'}, inplace=True) landtemps.dtypes measuredate datetime64[ns] stationid object avgtemp float64 latitude float64 longitude float64 elevation float64 station object countryid object country object dtype: object landtemps.avgtemp.describe() count 85,554.00 mean 10.92 std 11.52 min -70.70 25% 3.46 50% 12.22 75% 19.57 max 39.95 Name: avgtemp, dtype: float64 2. Look for missing values for each column. Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missing values for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There’s more... section of this recipe): landtemps.isnull().sum() measuredate 0 stationed 0 avgtemp 14446 latitude 0 longitude 0 elevation 0 station 0 countryid 0 country 5 dtype: int64 1. Remove rows with missing data for avgtemp. Use the subset parameter to tell dropna to drop rows when avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the DataFrame, but the changes we have made would not be retained. Use the shape attribute of the DataFrame to get the number of rows and columns: landtemps.dropna(subset=['avgtemp'], inplace=True) landtemps.shape (85554, 9) That’s it! Importing CSV files into pandas is as simple as that. HOW IT WORKS... Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the Terminal output wide enough to accommodate the number of variables. Much of the work is done by the first line in Step 2. We use read_csv to load a pandas DataFrame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data. read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value. Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first seven rows. We use the dtypes attribute of the DataFrame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the DataFrame (100,000 in this case) and whose second element is the number of columns (9). When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We used the rename method in Step 4 to give that column a more appropriate name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column. Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame, as provided by the shape attribute. The listing of missing values for each column in Step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446. Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missing values. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect, given the previous count from describe. THERE’S MORE... If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas DataFrame, an index was also created. The numbers to the far left of the output when head was run are index values. Any number of rows can be specified for head. The default value is 5. Instead of setting low_memory to False, to get pandas to make good guesses regarding data types, we could have set data types manually: landtemps = pd.read_csv('data/landtempssample.csv', names=['stationid','year','month','avgtemp','latitude', 'longitude','elevation','station','countryid','country'], skiprows=1, parse_dates=[['month','year']], dtype={'stationid':'object', 'avgtemp':'float64', 'latitude':'float64','longitude':'float64', 'elevation':'float64','station':'object', 'countryid':'object','country':'object'}, ) landtemps.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 100000 entries, 0 to 99999 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month_year 100000 non-null datetime64[ns] 1 stationid 100000 non-null object 2 avgtemp 85554 non-null float64 3 latitude 100000 non-null float64 4 longitude 100000 non-null float64 5 elevation 100000 non-null float64 6 station 100000 non-null object 7 countryid 100000 non-null object 8 country 99995 non-null object dtypes: datetime64[ns](1), float64(4), object(4) memory usage: 6.9+ MB The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a DataFrame of True and False values, resulting from testing whether each column value is null. The sum function takes that DataFrame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps: checknull = landtemps.isnull() checknull.sum() There is no hard and fast rule for when to chain methods and when not to do so. I find chaining helpful when the overall operation feels like a single step, even if it’s two or more steps mechanically. Chaining also has the side benefit of not creating extra objects that I might not need. The dataset used in this recipe is just a sample from the full land temperatures database, with almost 17 million records. You can run the larger file if your machine can handle it, with the following code: landtemps = pd.read_csv('data/landtemps.zip', ... compression='zip', names=['stationid','year', ... 'month','avgtemp','latitude','longitude', ... 'elevation','station','countryid','country'], ... skiprows=1, ... parse_dates=[['month','year']], ... low_memory=False) read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression. SEE ALSO Subsequent recipes in this chapter, and in other chapters, set indexes to improve navigation over rows and merging. A significant amount of reshaping of the Global Historical Climatology Network raw data was done before using it in this recipe. We demonstrate this in Chapter 11, Tidying and Reshaping Data. IMPORTING EXCEL FILES The read_excel method of the pandas library can be used to import data from an Excel file and load it into memory as a pandas DataFrame. In this recipe, we import an Excel file and handle some common issues when working with Excel files: extraneous header and footer information, selecting specific columns, removing rows with no data, and connecting to particular sheets. Despite the tabular structure of Excel, which invites the organization of data into rows and columns, spreadsheets are not datasets and do not require people to store data in that way. Even when some data conforms with those expectations, there is often additional information in rows or columns before or after the data to be imported. Data types are not always as clear as they are to the person who created the spreadsheet. This will be all too familiar to anyone who has ever battled with importing leading zeros. Moreover, Excel does not insist that all data in a column be of the same type, or that column headings be appropriate for use with a programming language such as Python. Fortunately, read_excel has a number of options for handling messiness in Excel data. These options make it relatively easy to skip rows, select particular columns, and pull data from a particular sheet or sheets. GETTING READY You can download the GDPpercapita22b.xlsx file, as well as the code for this recipe, from the GitHub repository for this book. The code assumes that the Excel file is in a data subfolder. Here is a view of the beginning of the file (some columns were hidden for display purposes): Figure 1.2: View of the dataset And here is a view of the end of the file: Figure 1.3: View of the dataset Data note This dataset, from the Organisation for Economic Co-operation and Development, is available for public use at https://stats.oecd.org/. HOW TO DO IT… We import an Excel file into pandas and do some initial data cleaning: 1. Import the pandas library: import pandas as pd 2. Read the Excel per capita GDP data. Select the sheet with the data we need, but skip the columns and rows that we do not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4 and skipfooter to 1 to skip the first four rows (the first row is hidden) and the last row. We provide values for usecols to get data from column A and columns C through W (column B is blank). Use head to view the first few rows and shape to get the number of rows and columns: percapitaGDP = pd.read_excel("data/GDPpercapita22b.xlsx", ... sheet_name="OECD.Stat export", ... skiprows=4, ... skipfooter=1, ... usecols="A,C:W") percapitaGDP.head() Year 2000 ... 2019 2020 0 Metropolitan areas ... NaN ... NaN NaN 1 AUS: Australia .. ... ... ... ... ... 2 AUS01: Greater Sydney ... ... ... 45576 45152 3 AUS02: Greater Melbourne ... ... ... 42299 40848 4 AUS03: Greater Brisbane ... ... ... 42145 40741 [5 rows x 22 columns] percapitaGDP.shape (731, 22) Note You may encounter a problem with read_excel if the Excel file does not use utf-8 encoding. One way to resolve this is to save the Excel file as a CSV file, reopen it, and then save it with utf-8 encoding. 1. Use the info method of the DataFrame to view data types and the non-null count. Notice that all columns have the object data type: percapitaGDP.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 731 entries, 0 to 730 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 731 non-null object 1 2000 730 non-null object 2 2001 730 non-null object 3 2002 730 non-null object 4 2003 730 non-null object 5 2004 730 non-null object 6 2005 730 non-null object 7 2006 730 non-null object 8 2007 730 non-null object 9 2008 730 non-null object 10 2009 730 non-null object 11 2010 730 non-null object 12 2011 730 non-null object 13 2012 730 non-null object 14 2013 730 non-null object 15 2014 730 non-null object 16 2015 730 non-null object 17 2016 730 non-null object 18 2017 730 non-null object 19 2018 730 non-null object 20 2019 730 non-null object 21 2020 730 non-null object dtypes: object(22) memory usage: 125.8+ KB 2. Rename the Year column to metro, and remove the leading spaces. Give an appropriate name to the metropolitan area column. There are extra spaces before the metro values in some cases. We can test for leading spaces with startswith(' ') and then use any to establish whether there are one or more occasions when the first character is blank. We can use endswith(' ') to examine trailing spaces. We use strip to remove both leading and trailing spaces. When we test for trailing spaces again, we see that there are none: percapitaGDP.rename(columns={'Year':'metro'}, inplace=True) percapitaGDP.metro.str.startswith(' ').any() True percapitaGDP.metro.str.endswith(' ').any() False percapitaGDP.metro = percapitaGDP.metro.str.strip() percapitaGDP.metro.str.startswith(' ').any() False 1. Convert the data columns to numeric. Iterate over all of the GDP year columns (2000–2020) and convert the data type from object to float. Coerce the conversion even when there is character data – the .. in this example. We want character values in those columns to become missing, which is what happens. Rename the year columns to better reflect the data in those columns: for col in percapitaGDP.columns[1:]: ... percapitaGDP[col] = pd.to_numeric(percapitaGDP[col], ... errors='coerce') ... percapitaGDP.rename(columns={col:'pcGDP'+col}, ... inplace=True) ... percapitaGDP.head() metro pcGDP2000 pcGDP2001 ... \ 0 Metropolitan areas NaN NaN ... 1 AUS: Australia NaN NaN ... 2 AUS01: Greater Sydney NaN 41091 ... 3 AUS02: Greater Melbourne NaN 40488 ... 4 AUS03: Greater Brisbane NaN 35276 ... pcGDP2018 pcGDP2019 pcGDP2020 0 NaN NaN NaN 1 NaN NaN NaN 2 47171 45576 45152 3 43237 42299 40848 4 44328 42145 40741 [5 rows x 22 columns] percapitaGDP.dtypes metro object pcGDP2000 float64 pcGDP2001 float64 abbreviated to save space pcGDP2019 float64 pcGDP2020 float64 dtype: object 1. Use the describe method to generate summary statistics for all numeric data in the DataFrame: percapitaGDP.describe() pcGDP2000 pcGDP2001 pcGDP2002 ... pcGDP2018 \ count 158 450 479 ... 692 mean 33961 38874 39621 ... 41667 std 15155 13194 13061 ... 17440 min 2686 7805 7065 ... 5530 25% 21523 30790 31064 ... 31322 50% 35836 38078 39246 ... 41428 75% 42804 46576 47874 ... 51130 max 95221 96941 98929 ... 147760 pcGDP2019 pcGDP2020 count 596 425 mean 42709 39792 std 18893 19230 min 5698 5508 25% 29760 24142 50% 43505 41047 75% 53647 51130 max 146094 131082 [8 rows x 21 columns] 2. Remove rows where all of the per capita GDP values are missing. Use the subset parameter of dropna to inspect all columns, starting with the second column (it is zero-based) and going through to the last column. Use how to specify that we want to drop rows only if all of the columns specified in subset are missing. Use shape to show the number of rows and columns in the resulting DataFrame: percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how="all", inplace=True) percapitaGDP.shape (692, 22) 1. Set the index for the DataFrame using the metropolitan area column. Confirm that there are 692 valid values for metro and that there are 692 unique values, before setting the index: percapitaGDP.metro.count() 692 percapitaGDP.metro.nunique() 692 percapitaGDP.set_index('metro', inplace=True) percapitaGDP.head() pcGDP2000 pcGDP2001 ... \ metro ... AUS01: Greater Sydney NaN 41091 ... AUS02: Greater Melbourne NaN 40488 ... AUS03: Greater Brisbane NaN 35276 ... AUS04: Greater Perth NaN 43355 ... AUS05: Greater Adelaide NaN 36081 ... pcGDP2019 pcGDP2020 metro AUS01: Greater Sydney 45576 45152 AUS02: Greater Melbourne 42299 40848 AUS03: Greater Brisbane 42145 40741 AUS04: Greater Perth 70970 78489 AUS05: Greater Adelaide 38314 39181 [5 rows x 21 columns] percapitaGDP.loc['AUS02: Greater Melbourne'] pcGDP2000 NaN pcGDP2001 40488 ... pcGDP2019 42299 pcGDP2020 40848 Name: AUS02: Greater Melbourne, dtype: float64 We have now imported the Excel data into a pandas DataFrame and cleaned up some of the messiness in the spreadsheet. HOW IT WORKS… We mostly manage to get the data we want in Step 2 by skipping rows and columns we do not want, but there are still a number of issues – read_excel interprets all of the GDP data as character data, many rows are loaded with no useful data, and the column names do not represent the data well. In addition, the metropolitan area column might be useful as an index, but there are leading and trailing blanks, and there may be missing or duplicated values. read_excel interprets Year as the column name for the metropolitan area data because it looks for a header above the data for that Excel column and finds Year there. We rename that column metro in Step 4. We also use strip to fix the problem with leading and trailing blanks. We could have just used lstrip to remove leading blanks, or rstrip if there had been trailing blanks. It is a good idea to assume that there might be leading or trailing blanks in any character data, cleaning that data shortly after the initial import. The spreadsheet authors used .. to represent missing data. Since this is actually valid character data, those columns get the object data type (that is how pandas treats columns with character or mixed data). We coerce a conversion to numeric type in Step 5. This also results in the original values of .. being replaced with NaN (not a number), how pandas represents missing values for numbers. This is what we want. We can fix all of the per capita GDP columns with just a few lines because pandas makes it easy to iterate over the columns of a DataFrame. By specifying [1:], we iterate from the second column to the last column. We can then change those columns to numeric and rename them to something more appropriate. There are several reasons why it is a good idea to clean up the column headings for the annual GDP columns – it helps us to remember what the data actually is; if we merge it with other data by metropolitan area, we will not have to worry about conflicting variable names; and we can use attribute access to work with pandas Series based on those columns, which I will discuss in more detail in the There’s more… section of this recipe. describe in Step 6 shows us that fewer than 500 rows have valid data for per capita GDP for some years. When we drop all rows that have missing values for all per capita GDP columns in step 7, we end up with 692 rows in the DataFrame. THERE’S MORE… Once we have a pandas DataFrame, we have the ability to treat columns as more than just columns. We can use attribute access (such as percapitaGPA.metro) or bracket notation (percapitaGPA['metro']) to get the functionality of a pandas Series. Either method makes it possible to use Series string inspecting methods, such as str.startswith, and counting methods, such as nunique. Note that the original column names of 20## did not allow attribute access because they started with a number, so percapitaGDP.pcGDP2001.count() works, but percapitaGDP.2001.count() returns a syntax error because 2001 is not a valid Python identifier (since it starts with a number). pandas is rich with features for string manipulation and for Series operations. We will try many of them out in subsequent recipes. This recipe showed those that I find most useful when importing Excel data. SEE ALSO There are good reasons to consider reshaping this data. Instead of 21 columns of GDP per capita data for each metropolitan area, we should have 21 rows of data for each metropolitan area, with columns for year and GDP per capita. Recipes for reshaping data can be found in Chapter 11, Tidying and Reshaping Data. IMPORTING DATA FROM SQL DATABASES In this recipe, we will use pymssql and mysql apis to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured, since it is designed to facilitate simultaneous transactions by members of organizations and those who interact with them. Each transaction is also likely related to some other organizational transaction. This means that although data tables from enterprise systems such as these are more reliably structured than data from CSV files and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data. Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for merchandise at a retail store chain might be different in 1998 than they are in 2024. Similarly, frequently there are codes for missing values, such as 99,999, that pandas will understand as valid values. Since much of this logic is business logic, and implemented in stored procedures or other applications, it is lost when pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. However, it also may involve adding some of the coding logic back after loading the SQL table into a pandas DataFrame. We explore how to do that in this recipe. GETTING READY This recipe assumes you have pymssql and mysql apis installed. If you do not, it is relatively straightforward to install them with pip. From the Terminal, or powershell (in Windows), enter pip install pymssql or pip install mysql-connector-python. We will work with data on educational attainment in this recipe. Data note The dataset used in this recipe is available for public use at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip. HOW TO DO IT... We import SQL Server and MySQL data tables into a pandas DataFrame, as follows: 1. Import pandas, numpy, pymssql, and mysql. This step assumes that you have installed pymssql and mysql apis: import pandas as pd import numpy as np import pymssql import mysql.connector 1. Use pymssql api and read_sql to retrieve and load data from a SQL Server instance. Select the columns we want from the SQL Server data, and use SQL aliases to improve column names (for example, fedu AS fathereducation). Create a connection to the SQL Server data by passing database credentials to the pymssql connect function. Create a pandas DataFrame by passing the SELECT statement and connection object to read_sql. Use close to return the connection to the pool on the server: sqlselect = "SELECT studentid, school, sex, age, famsize,\ ... medu AS mothereducation, fedu AS fathereducation,\ ... traveltime, studytime, failures, famrel, freetime,\ ... goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\ ... g3 AS gradeperiod3 From studentmath" server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com" user = "pdccuser" password = "pdccpass" database = "pdcctest" conn = pymssql.connect(server=server, ... user=user, password=password, database=database) studentmath = pd.read_sql(sqlselect,conn) conn.close() Note Although tools such as pymssql make connecting to a SQL Server instance relatively straightforward, the syntax still might take a little time to get used to if it is unfamiliar. The previous step shows the parameter values you will typically need to pass to a connection object – the name of the server, the name of a user with credentials on the server, the password for that user, and the name of a SQL database on the server. 1. Check the data types and the first few rows: studentmath.dtypes studentid object school object sex object age int64 famsize object mothereducation int64 fathereducation int64 traveltime int64 studytime int64 failures int64 famrel int64 freetime int64 gout int64 gradeperiod1 int64 gradeperiod2 int64 gradeperiod3 int64 dtype: object studentmath.head() studentid school ... gradeperiod2 gradeperiod3 0 001 GP ... 6 6 1 002 GP ... 5 6 2 003 GP ... 8 10 3 004 GP ... 14 15 4 005 GP ... 10 10 [5 rows x 16 columns] 2. Connecting to a MySQL server is not very different from connecting to a SQL Server instance. We can use the connect method of the mysql connector to do that and then use read_sql to load the data. Create a connection to the mysql data, pass that connection to read_sql to retrieve the data, and load it into a pandas DataFrame (the same data file on student math scores was uploaded to SQL Server and MySQL, so we can use the same SQL SELECT statement we used in the previous step): host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com" user = "pdccuser" password = "pdccpass" database = "pdccschema" connmysql = mysql.connector.connect(host=host, \ ... database=database,user=user,password=password) studentmath = pd.read_sql(sqlselect,connmysql) connmysql.close() 1. Rearrange the columns, set an index, and check for missing values. Move the grade data to the left of the DataFrame, just after studentid. Also, move the freetime column to the right after traveltime and studytime. Confirm that each row has an ID and that the IDs are unique, and set studentid as the index: newcolorder = ['studentid', 'gradeperiod1', ... 'gradeperiod2','gradeperiod3', 'school', ... 'sex', 'age', 'famsize','mothereducation', ... 'fathereducation', 'traveltime', ... 'studytime', 'freetime', 'failures', ... 'famrel','goout'] studentmath = studentmath[newcolorder] studentmath.studentid.count() 395 studentmath.studentid.nunique() 395 studentmath.set_index('studentid', inplace=True) 1. Use the DataFrame’s count function to check for missing values: studentmath.count() gradeperiod1 395 gradeperiod2 395 gradeperiod3 395 school 395 sex 395 age 395 famsize 395 mothereducation 395 fathereducation 395 traveltime 395 studytime 395 freetime 395 failures 395 famrel 395 goout 395 dtype: int64 2. Replace coded data values with more informative values. Create a dictionary with the replacement values for the columns, and then use replace to set those values: setvalues= \ ... {"famrel":{1:"1:very bad",2:"2:bad", ... 3:"3:neutral",4:"4:good",5:"5:excellent"}, ... "freetime":{1:"1:very low",2:"2:low", ... 3:"3:neutral",4:"4:high",5:"5:very high"}, ... "goout":{1:"1:very low",2:"2:low",3:"3:neutral", ... 4:"4:high",5:"5:very high"}, ... "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9", ... 3:"3:secondary ed",4:"4:higher ed"}, ... "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9", ... 3:"3:secondary ed",4:"4:higher ed"}} studentmath.replace(setvalues, inplace=True) 1. Change the type for columns with the changed data to category. Check any changes in memory usage: setvalueskeys = [k for k in setvalues] studentmath[setvalueskeys].memory_usage(index=False) famrel 3160 freetime 3160 goout 3160 mothereducation 3160 fathereducation 3160 dtype: int64 for col in studentmath[setvalueskeys].columns: ... studentmath[col] = studentmath[col]. \ ... astype('category') ... studentmath[setvalueskeys].memory_usage(index=False) famrel 607 freetime 607 goout 607 mothereducation 599 fathereducation 599 dtype: int64 1. Calculate percentages for values in the famrel column. Run value_counts, and set normalize to True to generate percentages: studentmath['famrel'].value_counts(sort=False, normalize=True) 1:very bad 0.02 2:bad 0.05 3:neutral 0.17 4:good 0.49 5:excellent 0.27 Name: famrel, dtype: float64 1. Use apply to calculate percentages for multiple columns: studentmath[['freetime','goout']].\ ... apply(pd.Series.value_counts, sort=False, ... normalize=True) freetime goout 1:very low 0.05 0.06 2:low 0.16 0.26 3:neutral 0.40 0.33 4:high 0.29 0.22 5:very high 0.10 0.13 studentmath[['mothereducation','fathereducation']].\ ... apply(pd.Series.value_counts, sort=False, ... normalize=True) mothereducation fathereducation 1:k-4 0.15 0.21 2:5-9 0.26 0.29 3:secondary ed 0.25 0.25 4:higher ed 0.33 0.24 The preceding steps retrieved a data table from a SQL database, loaded that data into pandas, and did some initial data checking and cleaning. HOW IT WORKS… Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. However, some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, and some column names are not intuitive or not ordered in the best way for analysis. The meaningfulness of many data values is not stored in the data table to avoid entry errors and save on storage space. For example, 3 is stored for mother’s education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible. To pull data from a SQL database server, we need a connection object to authenticate us on the server, as well as a SQL select string. These can be passed to read_sql to retrieve the data and load it into a pandas DataFrame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I did that later in this recipe. We set the index in Step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data further down the road. It will also likely improve the speed of the merge. We use the DataFrame’s count function to check for missing values and that there are no missing values – for non-missing values, the count is 395 (the number of rows) for every column. This is almost too good to be true. There may be values that are logically missing – that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9, or 99. We address this possibility in the next step. Step 7 demonstrates a useful technique for replacing data values for multiple columns. We create a dictionary to map original values to new values for each column and then run it using replace. To reduce the amount of storage space taken up by the new verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary – setvalueskeys = [k for k in setvalues] generates [famrel, freetime, goout, mothereducation, and fathereducation]. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially. Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To prevent value_counts sorting by frequency, we set sort to False. The DataFrame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. The 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that the 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work. Users of packages such as SPPS, SAS, and R will notice the difference between this approach and value labels in SPSS and R, as well as the proc format in SAS. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type. This is similar to factors in R. THERE’S MORE… I moved the grade data to near the beginning of the DataFrame. I find it helpful to have potential target or dependent variables in the leftmost columns, keeping them at the forefront of your mind. It is also helpful to keep similar columns together. In this example, personal demographic variables (sex and age) are next to one another, as are family variables (mothereducation and fathereducation), and how students spend their time (traveltime, studytime, and freetime). You could have used map instead of replace in Step 7. Prior to version 19.2 of pandas, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map. SEE ALSO The recipes in Chapter 10, Addressing Data Issues When Combining DataFrames, go into detail on merging data. We will take a closer look at bivariate and multivariate relationships between variables in Chapter 4, Identifying Outliers in Subsets of Data. We will demonstrate how to use some of these same approaches in packages such as SPSS, SAS, and R in subsequent recipes in this chapter. IMPORTING SPSS, STATA, AND SAS DATA We will use pyreadstat to read data from three popular statistical packages into pandas. The key advantage of pyreadstat is that it allows data analysts to import data from these packages without losing metadata, such as variable and value labels. The SPSS, Stata, and SAS data files we receive often come to us with the data issues of CSV and Excel files and SQL databases having been resolved. We do not typically have the invalid column names, changes in data types, and unclear missing values that we can get with CSV or Excel files, nor do we usually get the detachment of data from business logic, such as the meaning of data codes, that we often get with SQL data. When someone or some organization shares a data file from one of these packages with us, they have often added variable labels and value labels for categorical data. For example, a hypothetical data column called presentsat has the overall satisfaction with presentation variable label and 1–5 value labels, with 1 being not at all satisfied and 5 being highly satisfied. The challenge is retaining that metadata when importing data from those systems into pandas. There is no precise equivalent to variable and value labels in pandas, and built-in tools for importing SAS, Stata, and SAS data lose the metadata. In this recipe, we will use pyreadstat to load variable and value label information and use a couple of techniques to represent that information in pandas. GETTING READY This recipe assumes you have installed the pyreadstat package. If it is not installed, you can install it with pip. From the Terminal, or Powershell (in Windows), enter pip install pyreadstat. You will need the SPSS, Stata, and SAS data files for this recipe to run the code. We will work with data from the United States National Longitudinal Surveys (NLS) of Youth. Data note The NLS of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997. Each survey respondent was high school age when they first completed the survey, having been born between 1980 and 1985. There were annual follow-up surveys each year through 2023. For this recipe, I pulled 42 variables on grades, employment, income, and attitudes toward government, from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. The original NLS data can be downloaded from https://www.nlsinfo.org/investigator/pages/search, along with code for creating SPSS, Stata, or SAS files from the ASCII data files included in the download. HOW TO DO IT... We will import data from SPSS, Stata, and SAS, retaining metadata such as value labels: 1. Import pandas, numpy, and pyreadstat. This step assumes that you have installed pyreadstat: import pandas as pd import numpy as np import pyreadstat 1. Retrieve the SPSS data. Pass a path and filename to the read_sav method of pyreadstat. Display the first few rows and a frequency distribution. Note that the column names and value labels are non-descriptive, and that read_sav returns both a pandas DataFrame and a meta object: nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav') nls97spss.dtypes R0000100 float64 R0536300 float64 R0536401 float64 ... U2962900 float64 U2963000 float64 Z9063900 float64 dtype: object nls97spss.head() R0000100 R0536300 ... U2963000 Z9063900 0 1 2 ... nan 52 1 2 1 ... 6 0 2 3 2 ... 6 0 3 4 2 ... 6 4 4 5 1 ... 5 12 [5 rows x 42 columns] nls97spss['R0536300'].value_counts(normalize=True) 1.00 0.51 2.00 0.49 Name: R0536300, dtype: float64 1. Grab the metadata to improve column labels and value labels. The metaspss object created when we called read_sav has the column labels and the value labels from the SPSS file. Use the variable_value_labels dictionary to map values to value labels for one column (R0536300). (This does not change the data. It only improves our display when we run value_counts.) Use the set_value_labels method to actually apply the value labels to the DataFrame: metaspss.variable_value_labels['R0536300'] {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'} nls97spss['R0536300'].\ ... map(metaspss.variable_value_labels['R0536300']).\ ... value_counts(normalize=True) Male 0.51 Female 0.49 Name: R0536300, dtype: float64 nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True) 1. Use column labels in the metadata to rename the columns. To use the column labels from metaspss in our DataFrame, we can simply assign the column labels in metaspss to our DataFrame’s column names. Clean up the column names a bit by changing them to lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters: nls97spss.columns = metaspss.column_labels nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True) Male 0.51 Female 0.49 Name: KEY!SEX (SYMBOL) 1997, dtype: float64 nls97spss.dtypes PUBID - YTH ID CODE 1997 float64 KEY!SEX (SYMBOL) 1997 category KEY!BDATE M/Y (SYMBOL) 1997 float64 KEY!BDATE M/Y (SYMBOL) 1997 float64 CV_SAMPLE_TYPE 1997 category KEY!RACE_ETHNICITY (SYMBOL) 1997 category "... abbreviated to save space" HRS/WK R WATCHES TELEVISION 2017 category HRS/NIGHT R SLEEPS 2017 float64 CVC_WKSWK_YR_ALL L99 float64 dtype: object nls97spss.columns = nls97spss.columns.\ ... str.lower().\ ... str.replace(' ','_').\ ... str.replace('[^a-z0-9_]', '', regex=True) nls97spss.set_index('pubid__yth_id_code_1997', inplace=True) 1. Simplify the process by applying the value labels from the beginning. The data values can actually be applied in the initial call to read_sav by setting apply_value_formats to True. This eliminates the need to call the set_value_labels function later: nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True) nls97spss.columns = metaspss.column_labels nls97spss.columns = nls97spss.columns.\ ... str.lower().\ ... str.replace(' ','_').\ ... str.replace('[^a-z0-9_]', '', regex=True) 1. Show the columns and a few rows: nls97spss.dtypes pubid__yth_id_code_1997 float64 keysex_symbol_1997 category keybdate_my_symbol_1997 float64 keybdate_my_symbol_1997 float64 hrsnight_r_sleeps_2017 float64 cvc_wkswk_yr_all_l99 float64 dtype: object nls97spss.head() pubid__yth_id_code_1997 keysex_symbol_1997 ... \ 0 1 Female ... 1 2 Male ... 2 3 Female ... 3 4 Female ... 4 5 Male ... hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99 0 nan 52 1 6 0 2 6 0 3 6 4 4 5 12 [5 rows x 42 columns] 2. Run frequencies on one of the columns, and set the index: nls97spss.govt_responsibility__provide_jobs_2006.\ ... value_counts(sort=False) Definitely should be 454 Definitely should not be 300 Probably should be 617 Probably should not be 462 Name: govt_responsibility__provide_jobs_2006, dtype: int64 nls97spss.set_index('pubid__yth_id_code_1997', inplace=True) 3. That demonstrated how to convert data from SPSS. Let’s try that with Stata data. 4. Import the Stata data, apply value labels, and improve the column headings. Use the same methods for the Stata data that we used for the SPSS data: nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True) nls97stata.columns = metastata.column_labels nls97stata.columns = nls97stata.columns.\ ... str.lower().\ ... str.replace(' ','_').\ ... str.replace('[^a-z0-9_]', '', regex=True) nls97stata.dtypes pubid__yth_id_code_1997 float64 keysex_symbol_1997 category keybdate_my_symbol_1997 float64 keybdate_my_symbol_1997 float64 hrsnight_r_sleeps_2017 float64 cvc_wkswk_yr_all_l99 float64 dtype: object 1. View a few rows of the data and run frequencies: nls97stata.head() pubid__yth_id_code_1997 keysex_symbol_1997 ... \ 0 1 Female ... 1 2 Male ... 2 3 Female ... 3 4 Female ... 4 5 Male ... hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99 0 -5 52 1 6 0 2 6 0 3 6 4 4 5 12 [5 rows x 42 columns] nls97stata.govt_responsibility__provide_jobs_2006.\ ... value_counts(sort=False) -5.0 1425 -4.0 5665 -2.0 56 -1.0 5 Definitely should be 454 Definitely should not be 300 Probably should be 617 Probably should not be 462 Name: govt_responsibility__provide_jobs_2006, dtype: int64 2. Fix the logical missing values that show up with the Stata data and set an index. We can use the replace method to set any value that is between –9 and –1 in any column to missing: nls97stata.min(numeric_only=True) pubid__yth_id_code_1997 1 keybdate_my_symbol_1997 1 keybdate_my_symbol_1997 1,980 trans_sat_verbal_hstr -4 trans_sat_math_hstr -4 trans_crd_gpa_overall_hstr -9 trans_crd_gpa_eng_hstr -9 trans_crd_gpa_math_hstr -9 trans_crd_gpa_lp_sci_hstr -9 cv_ba_credits_l1_2011 -5 cv_bio_child_hh_2017 -5 cv_bio_child_nr_2017 -5 hrsnight_r_sleeps_2017 -5 cvc_wkswk_yr_all_l99 -4 dtype: float64 nls97stata.replace(list(range(-9,0)), np.nan, inplace=True) nls97stata.min(numeric_only=True) pubid__yth_id_code_1997 1 keybdate_my_symbol_1997 1 keybdate_my_symbol_1997 1,980 trans_sat_verbal_hstr 14 trans_sat_math_hstr 7 trans_crd_gpa_overall_hstr 10 trans_crd_gpa_eng_hstr 0 trans_crd_gpa_math_hstr 0 trans_crd_gpa_lp_sci_hstr 0 cv_ba_credits_l1_2011 0 cv_bio_child_hh_2017 0 cv_bio_child_nr_2017 0 hrsnight_r_sleeps_2017 0 cvc_wkswk_yr_all_l99 0 dtype: float64 nls97stata.set_index('pubid__yth_id_code_1997', inplace=True) The process is fairly similar when working with SAS data files, as the next few steps illustrate. 1. Retrieve the SAS data, using the SAS catalog file for value labels: The data values for SAS are stored in a catalog file. Setting the catalog file path and filename retrieves the value labels and applies them: nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True) nls97sas.columns = metasas.column_labels nls97sas.columns = nls97sas.columns.\ ... str.lower().\ ... str.replace(' ','_').\ ... str.replace('[^a-z0-9_]', '', regex=True) nls97sas.head() pubid__yth_id_code_1997 keysex_symbol_1997 ... \ 0 1 Female ... 1 2 Male ... 2 3 Female ... 3 4 Female ... 4 5 Male ... hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99 0 nan 52 1 6 0 2 6 0 3 6 4 4 5 12 [5 rows x 42 columns] nls97sas.keysex_symbol_1997.value_counts() Male 4599 Female 4385 Name: keysex_symbol_1997, dtype: int64 nls97sas.set_index('pubid__yth_id_code_1997', inplace=True) This demonstrates how to import SPSS, SAS, and Stata data without losing important metadata. HOW IT WORKS... The read_sav, read_dta, and read_sas7bdat methods of Pyreadstat, for SPSS, Stata, and SAS data files, respectively, work in a similar manner. Value labels can be applied when reading in the data by setting apply_value_formats to True for SPSS and Stata files (Steps 5 and 8), or by providing a catalog file path and filename for SAS (Step 12). We can set formats_as_category to True to change the data type to category for those columns where the data values will change. The meta object has the column names and the column labels from the statistical package, so metadata column labels can be assigned to pandas DataFrame column names at any point (nls97spss.columns = metaspss.column_labels). We can even revert to the original column headings after assigning meta column labels to them by setting pandas column names to the metadata column names (nls97spss.columns = metaspss.column_names). In Step 3, we looked at some of the SPSS data before applying value labels. We looked at the dictionary for one variable (metaspss.variable_value_labels['R0536300']), but we could have viewed it for all variables (metaspss.variable_value_labels). When we are satisfied that the labels make sense, we can set them by calling the set_value_labels function. This is a good approach when you do not know the data well and want to inspect the labels before applying them. The column labels from the meta object are often a better choice than the original column headings. Column headings can be quite cryptic, particularly when the SPSS, Stata, or SAS file is based on a large survey, as in this example. However, the labels are not usually ideal for column headings either. They sometimes have spaces, capitalization that is not helpful, and non-alphanumeric characters. We chain some string operations to switch to lowercase, replace spaces with underscores, and remove non-alphanumeric characters. Handling missing values is not always straightforward with these data files, since there are often many reasons why data is missing. If the file is from a survey, the missing value may be because of a survey skip pattern, or a respondent failed to respond, or the response was invalid, and so on. The NLS has nine possible values for missing, from –1 to –9. The SPSS import automatically set those values to NaN, while the Stata import retained the original values. (We could have gotten the SPSS import to retain those values by setting user_missing to True.) For the Stata data, we need to tell it to replace all values from –1 to –9 with NaN. We do this by using the DataFrame’s replace function and passing it a list of integers from –9 to –1 (list(range(-9,0))). THERE’S MORE… You may have noticed similarities between this recipe and the previous one in terms of how value labels are set. The set_value_labels function is like the DataFrame replace operation we used to set value labels in that recipe. We passed a dictionary to replace that mapped columns to value labels. The set_value_labels function in this recipe essentially does the same thing, using the variable_value_labels property of the meta object as the dictionary. Data from statistical packages is often not as well structured as SQL databases tend to be in one significant way. Since they are designed to facilitate analysis, they often violate database normalization rules. There is often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual and event-level data – a person and hospital visits, a brown bear and the date it emerged from hibernation. Often, this data will need to be reshaped for some aspects of the analysis. SEE ALSO The pyreadstat package is nicely documented at https://github.com/Roche/pyreadstat. The package has many useful options for selecting columns and handling missing data that space did not permit me to demonstrate in this recipe. In Chapter 11, Tidying and Reshaping Data, we will examine how to normalize data that may have been flattened for analytical purposes. IMPORTING R DATA We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe. The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. However, there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first and then loads it into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R. When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data. GETTING READY This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From the Terminal, or Powershell (in Windows), enter pip install pyreadr. We will again work with the NLS in this recipe. You will need to download the rds file used in this recipe from the GitHub repository in order to run the code. HOW TO DO IT… We will import data from R without losing important metadata: 1. Load pandas, numpy, pprint, and the pyreadr package: import pandas as pd import numpy as np import pyreadr import pprint 2. Get the R data. Pass the path and filename to the read_r method to retrieve the R data, and load it into memory as a pandas DataFrame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it will return one object, having the key None. We indicate None to get the pandas DataFrame: nls97r = pyreadr.read_r('data/nls97.rds')[None] nls97r.dtypes R0000100 int32 R0536300 int32 ... U2962800 int32 U2962900 int32 U2963000 int32 Z9063900 int32 dtype: object nls97r.head(10) R0000100 R0536300 ... U2963000 Z9063900 0 1 2 ... -5 52 1 2 1 ... 6 0 2 3 2 ... 6 0 3 4 2 ... 6 4 4 5 1 ... 5 12 5 6 2 ... 6 6 6 7 1 ... -5 0 7 8 2 ... -5 39 8 9 1 ... 4 0 9 10 1 ... 6 0 [10 rows x 42 columns] 1. Set up dictionaries for value labels and column headings. Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows: with open('data/nlscodes.txt', 'r') as reader: ... setvalues = eval(reader.read()) ... pprint.pprint(setvalues) {'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}, 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'}, 'S8646900': {1.0: '1. Definitely', 2.0: '2. Probably ', 3.0: '3. Probably not', 4.0: '4. Definitely not'}} ...abbreviated to save space newcols = ['personid','gender','birthmonth', ... 'birthyear','sampletype','category', ... 'satverbal','satmath','gpaoverall', ... 'gpaeng','gpamath','gpascience','govjobs', ... 'govprices','govhealth','goveld','govind', ... 'govunemp','govinc','govcollege', ... 'govhousing','govenvironment','bacredits', ... 'coltype1','coltype2','coltype3','coltype4', ... 'coltype5','coltype6','highestgrade', ... 'maritalstatus','childnumhome','childnumaway', ... 'degreecol1','degreecol2','degreecol3', ... 'degreecol4','wageincome','weeklyhrscomputer', ... 'weeklyhrstv','nightlyhrssleep', ... 'weeksworkedlastyear'] 1. Set value labels and missing values, and change selected columns to the category data type. Use the setvalues dictionary to replace existing values with value labels. Replace all values from –9 to –1 with NaN: nls97r.replace(setvalues, inplace=True) nls97r.head() R0000100 R0536300 ... U2963000 Z9063900 0 1 Female ... -5 52 1 2 Male ... 6 0 2 3 Female ... 6 0 3 4 Female ... 6 4 4 5 Male ... 5 12 [5 rows x 42 columns] nls97r.replace(list(range(-9,0)), np.nan, inplace=True) for col in nls97r[[k for k in setvalues]].columns: ... nls97r[col] = nls97r[col].astype('category') ... nls97r.dtypes R0000100 int64 R0536300 category R0536401 int64 R0536402 int64 R1235800 category ... U2857300 category U2962800 category U2962900 category U2963000 float64 Z9063900 float64 Length: 42, dtype: object 1. Set meaningful column headings: nls97r.columns = newcols nls97r.dtypes personid int64 gender category birthmonth int64 birthyear int64 sampletype category ... wageincome category weeklyhrscomputer category weeklyhrstv category nightlyhrssleep float64 weeksworkedlastyear float64 Length: 42, dtype: object This shows how R data files can be imported into pandas and value labels assigned. HOW IT WORKS… Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None. In Step 3, we loaded a dictionary that maps our variables to value labels, and a list for our preferred column headings. In Step 4 we applied the value labels. We also changed the data type to category for the columns where we applied the values. We did this by generating a list of the keys in our setvalues dictionary with [k for k in setvalues] and then iterating over those columns. We change the column headings in Step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings. The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we work. THERE’S MORE… Pyreadr is able to return multiple DataFrames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call. Pprint is a handy tool for improving the display of Python dictionaries. We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set them to pandas DataFrame values. See the following code: import rpy2.robjects as robjects from rpy2.robjects import pandas2ri pandas2ri.activate() readRDS = robjects.r['readRDS'] nls97withvalues = readRDS('data/nls97withvalues.rds') nls97withvalues R0000100 R0536300 ... U2963000 Z9063900 1 1 Female ... -2147483648 52 2 2 Male ... 6 0 3 3 Female ... 6 0 4 4 Female ... 6 4 5 5 Male ... 5 12 ... ... ... ... ... ... 8980 9018 Female ... 4 49 8981 9019 Male ... 6 0 8982 9020 Male ... -2147483648 15 8983 9021 Male ... 7 50 8984 9022 Female ... 7 20 [8984 rows x 42 columns] This generates unusual –2147483648 values. This is what happened when readRDS interpreted missing data in numeric columns. A global replacement of that number with NaN, after confirming that that is not a valid value, would be a good next step. SEE ALSO Clear instructions and examples for pyreadr are available at https://github.com/ofajardo/pyreadr. Feather files, a relatively new format, can be read by both R and Python. I discuss those files in the next recipe. PERSISTING TABULAR DATA We persist data, copy it from memory to local or remote storage, for several reasons: to be able to access the data without having to repeat the steps we used to generate it, to share the data with others, or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas DataFrame as different file types (CSV, Excel, Pickle, and Feather). Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium- to large-sized organizations, this process is part of the daily data-cleaning workflow. In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in terms of our data analysis projects, the hardware and software resources of the machine(s) saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl + S in their word-processing application. Once we persist data, it is stored separately from the logic that we used to create it. I find this to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? A month ago? A year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data-cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating the data. It is a good idea to persist data only once we have reached milestones in our work. Beyond the question of when to persist data, there is the question of how. If we are persisting it for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that is good news in a way. We have lots of choices, from CSV and Excel to Pickle and Feather. We save as all these file types in this recipe. Note Pickle and Feather are binary file formats that can be used to store pandas DataFrames. GETTING READY You will need to install Feather if you do not have it on your system. You can do that by entering pip install pyarrow in a Terminal window or powershell (in Windows). If you do not already have a subfolder named Views in your chapter 1 folder, you will need to create it in order to run the code for this recipe. Data note This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository. HOW TO DO IT… We will load a CSV file into pandas and then save it as a Pickle and a Feather file. We will also save subsets of the data to the CSV and Excel formats: 1. Import pandas and pyarrow. pyarrow needs to be imported in order to save pandas to Feather: import pandas as pd import pyarrow 1. Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index: landtemps = \ ... pd.read_csv('data/landtempssample.csv', ... names=['stationid','year','month','avgtemp', ... 'latitude','longitude','elevation', ... 'station','countryid','country'], ... skiprows=1, ... parse_dates=[['month','year']], ... low_memory=False) landtemps.rename(columns={'month_year':'measuredate'}, inplace=True) landtemps.dropna(subset=['avgtemp'], inplace=True) landtemps.dtypes measuredate datetime64[ns] stationid object avgtemp float64 latitude float64 longitude float64 elevation float64 station object countryid object country object dtype: object landtemps.set_index(['measuredate','stationid'], inplace=True) 2. Write extreme values for temperature to CSV and Excel files. Use the quantile method to select outlier rows, which are those at the 1 in 1,000 level at each end of the distribution: extremevals = landtemps[(landtemps.avgtemp < landtemps.avgtemp.quantile(.001)) | (landtemps.avgtemp > landtemps.avgtemp.quantile(.999))] extremevals.shape (171, 7) extremevals.sample(7) avgtemp ... country measuredate stationid ... 2013-08-01 QAM00041170 35.30 ... Qatar 2005-01-01 RSM00024966 -40.09 ... Russia 1973-03-01 CA002401200 -40.26 ... Canada 2007-06-01 KU000405820 37.35 ... Kuwait 1987-07-01 SUM00062700 35.50 ... Sudan 1998-02-01 RSM00025325 -35.71 ... Russia 1968-12-01 RSM00024329 -43.20 ... Russia [7 rows x 7 columns] extremevals.to_excel('views/tempext.xlsx') extremevals.to_csv('views/tempext.csv') 1. Save to Pickle and Feather files. The index needs to be reset in order to save a Feather file: landtemps.to_pickle('data/landtemps.pkl') landtemps.reset_index(inplace=True) landtemps.to_feather("data/landtemps.ftr") 1. Load the Pickle and Feather files we just saved. Note that our index was preserved when saving and loading the Pickle file: landtemps = pd.read_pickle('data/landtemps.pkl') landtemps.head(2).T measuredate 2000-04-01 1940-05-01 stationid USS0010K01S CI000085406 avgtemp 5.27 18.04 latitude 39.90 -18.35 longitude -110.75 -70.33 elevation 2,773.70 58.00 station INDIAN_CANYON ARICA countryid US CI country United States Chile landtemps = pd.read_feather("data/landtemps.ftr") landtemps.head(2).T 0 1 measuredate 2000-04-01 00:00:00 1940-05-01 00:00:00 stationid USS0010K01S CI000085406 avgtemp 5.27 18.04 latitude 39.90 -18.35 longitude -110.75 -70.33 elevation 2,773.70 58.00 station INDIAN_CANYON ARICA countryid US CI country United States Chile The previous steps demonstrated how to serialize pandas DataFrames using two different formats, Pickle and Feather. HOW IT WORKS... Persisting pandas data is quite straightforward. DataFrames have the to_csv, to_excel, to_pickle, and to_feather methods. Pickling preserves our index. THERE’S MORE... The advantage of storing data in CSV files is that saving it uses up very little additional memory. The disadvantage is that writing CSV files is quite slow, and we lose important metadata, such as data types. (read_csv can often figure out the data type when we reload the file, but not always.) Pickle files keep that data but can burden a system that is low on resources when serializing. Feather is easier on resources and can be easily loaded in R as well as Python, but we have to sacrifice our index in order to serialize. Also, the authors of Feather make no promises regarding long-term support. You may have noticed that I do not make a global recommendation about what to use for data serialization – other than to limit your persistence of full datasets to project milestones. This is definitely one of those “right tools for the right job” kind of situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use Feather for ongoing Python projects, particularly when I am using a machine with sub-par RAM and an outdated chip and also using R. When I am wrapping up a project, I pickle the DataFrames. SUMMARY Our Python data projects typically start with raw data stored in a range of formats and exported from a variety of software tools. Among the most popular tabular formats and tools are CSV and Excel files, SQL tables, and SPSS, Stata, SAS, and R datasets. We converted data from all of these sources into a pandas DataFrame in this chapter, and addressed the most common challenges. We also explored approaches to persisting tabular data. We will work with data in other formats in the next chapter. JOIN OUR COMMUNITY ON DISCORD Join our community’s Discord space for discussions with the author and other readers: https://discord.gg/p8uSgEAETX Left arrow icon PAGE 1 OF 9 Right arrow icon Download code icon Download Code KEY BENEFITS * Get to grips with new techniques for data preprocessing and cleaning for machine learning and NLP models * Use new and updated AI tools and techniques for data cleaning tasks * Clean, monitor, and validate large data volumes to diagnose problems using cutting-edge methodologies including Machine learning and AI DESCRIPTION Jumping into data analysis without proper data cleaning will certainly lead to incorrect results. The Python Data Cleaning Cookbook - Second Edition will show you tools and techniques for cleaning and handling data with Python for better outcomes. Fully updated to the latest version of Python and all relevant tools, this book will teach you how to manipulate and clean data to get it into a useful form. he current edition focuses on advanced techniques like machine learning and AI-specific approaches and tools for data cleaning along with the conventional ones. The book also delves into tips and techniques to process and clean data for ML, AI, and NLP models. You will learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Next, you’ll cover recipes for using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors and generate visualizations for exploratory data analysis (EDA) to identify unexpected values. Finally, you’ll build functions and classes that you can reuse without modification when you have new data. By the end of this Data Cleaning book, you'll know how to clean data and diagnose problems within it. WHAT YOU WILL LEARN * Using OpenAI tools for various data cleaning tasks * Producing summaries of the attributes of datasets, columns, and rows * Anticipating data-cleaning issues when importing tabular data into pandas * Applying validation techniques for imported tabular data * Improving your productivity in pandas by using method chaining * Recognizing and resolving common issues like dates and IDs * Setting up indexes to streamline data issue identification * Using data cleaning to prepare your data for ML and AI models PRODUCT DETAILS Country selected Publication date, Length, Edition, Language, ISBN-13 Publication date : May 31, 2024 Length 486 pages Edition : 2nd Edition Language : English ISBN-13 : 9781803239873 Category : Data Languages : Python Concepts : Data Analysis WHAT DO YOU GET WITH EBOOK? Product feature icon Instant access to your Digital eBook purchase Product feature icon Download this book in EPUB and PDF formats Product feature icon Access this title in our online reader with advanced features Product feature icon DRM FREE - Read whenever, wherever and however you want Buy Now ADD TO CART PRODUCT DETAILS Publication date : May 31, 2024 Length 486 pages Edition : 2nd Edition Language : English ISBN-13 : 9781803239873 Category : Data Languages : Python Concepts : Data Analysis PACKT SUBSCRIPTIONS See our plans and pricing Modal Close icon $19.99 billed monthly Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos Feature tick icon Constantly refreshed with 50+ new titles a month Feature tick icon Exclusive Early access to books as they're written Feature tick icon Solve problems while you work with advanced search and reference features Feature tick icon Offline reading on the mobile app Feature tick icon Simple pricing, no contract START FREE TRIAL BUY NOW $199.99 billed annually Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos Feature tick icon Constantly refreshed with 50+ new titles a month Feature tick icon Exclusive Early access to books as they're written Feature tick icon Solve problems while you work with advanced search and reference features Feature tick icon Offline reading on the mobile app Feature tick icon Choose a DRM-free eBook or Video every month to keep Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each Feature tick icon Exclusive print discounts START FREE TRIAL BUY NOW $279.99 billed in 18 months Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos Feature tick icon Constantly refreshed with 50+ new titles a month Feature tick icon Exclusive Early access to books as they're written Feature tick icon Solve problems while you work with advanced search and reference features Feature tick icon Offline reading on the mobile app Feature tick icon Choose a DRM-free eBook or Video every month to keep Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each Feature tick icon Exclusive print discounts START FREE TRIAL BUY NOW TABLE OF CONTENTS 14 Chapters Preface Chevron down icon Chevron up icon Preface New in the Second Edition Who this book is for What this book covers Get in touch 1. Anticipating Data Cleaning Issues When Importing Tabular Data with pandas Chevron down icon Chevron up icon Anticipating Data Cleaning Issues When Importing Tabular Data with pandas Technical requirements Importing CSV files Importing Excel files Importing data from SQL databases Importing SPSS, Stata, and SAS data Importing R data Persisting tabular data Summary 2. Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data Chevron down icon Chevron up icon Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data Technical requirements Importing simple JSON data Importing more complicated JSON data from an API Importing data from web pages Working with Spark data Persisting JSON data Versioning data Summary 3. Taking the Measure of Your Data Chevron down icon Chevron up icon Taking the Measure of Your Data Technical requirements Getting a first look at your data Selecting and organizing columns Selecting rows Generating frequencies for categorical variables Generating summary statistics for continuous variables Using generative AI to display descriptive statistics Summary 4. Identifying Outliers in Subsets of Data Chevron down icon Chevron up icon Identifying Outliers in Subsets of Data Technical requirements Identifying outliers with one variable Identifying outliers and unexpected values in bivariate relationships Using subsetting to examine logical inconsistencies in variable relationships Using linear regression to identify data points with significant influence Using k-nearest neighbors to find outliers Using Isolation Forest to find anomalies Using PandasAI to identify outliers Summary 5. Using Visualizations for the Identification of Unexpected Values Chevron down icon Chevron up icon Using Visualizations for the Identification of Unexpected Values Technical requirements Using histograms to examine the distribution of continuous variables Using boxplots to identify outliers for continuous variables Using grouped boxplots to uncover unexpected values in a particular group Examining both distribution shape and outliers with violin plots Using scatter plots to view bivariate relationships Using line plots to examine trends in continuous variables Generating a heat map based on a correlation matrix Summary 6. Cleaning and Exploring Data with Series Operations Chevron down icon Chevron up icon Cleaning and Exploring Data with Series Operations Technical requirements Getting values from a pandas Series Showing summary statistics for a pandas Series Changing Series values Changing Series values conditionally Evaluating and cleaning string Series data Working with dates Using OpenAI for Series operations Summary 7. Identifying and Fixing Missing Values Chevron down icon Chevron up icon Identifying and Fixing Missing Values Technical requirements Identifying missing values Cleaning missing values Imputing values with regression Using k-nearest neighbors for imputation Using random forest for imputation Using PandasAI for imputation Summary 8. Encoding, Transforming, and Scaling Features Chevron down icon Chevron up icon Encoding, Transforming, and Scaling Features Technical requirements Creating training datasets and avoiding data leakage Removing redundant or unhelpful features Encoding categorical features: one-hot encoding Encoding categorical features: ordinal encoding Encoding categorical features with medium or high cardinality Using mathematical transformations Feature binning: equal width and equal frequency k-means binning Feature scaling Summary 9. Fixing Messy Data When Aggregating Chevron down icon Chevron up icon Fixing Messy Data When Aggregating Technical requirements Looping through data with itertuples (an anti-pattern) Calculating summaries by group with NumPy arrays Using groupby to organize data by groups Using more complicated aggregation functions with groupby Using user-defined functions and apply with groupby Using groupby to change the unit of analysis of a DataFrame Using pivot_table to change the unit of analysis of a DataFrame Summary 10. Addressing Data Issues When Combining DataFrames Chevron down icon Chevron up icon Addressing Data Issues When Combining DataFrames Technical requirements Combining DataFrames vertically Doing one-to-one merges Doing one-to-one merges by multiple columns Doing one-to-many merges Doing many-to-many merges Developing a merge routine Summary 11. Tidying and Reshaping Data Chevron down icon Chevron up icon Tidying and Reshaping Data Technical requirements Removing duplicated rows Fixing many-to-many relationships Using stack and melt to reshape data from wide to long format Melting multiple groups of columns Using unstack and pivot to reshape data from long to wide format Summary 12. Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines Chevron down icon Chevron up icon Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines Technical requirements Functions for getting a first look at our data Functions for displaying summary statistics and frequencies Functions for identifying outliers and unexpected values Functions for aggregating or combining data Classes that contain the logic for updating Series values Classes that handle non-tabular data structures Functions for checking overall data quality Pre-processing data with pipelines: a simple example Pre-processing data with pipelines: a more complicated example Summary 13. Index Chevron down icon Chevron up icon Index RECOMMENDATIONS FOR YOU 1 of 10 Left arrow icon Mastering NLP from Foundations to LLMs Read more Apr 2024 340 pages ebook eBook * ebook eBook $29.99 * print Print $46.99 $29.99 $42.99 $46.99 $52.99 ADD TO CART Machine Learning with PyTorch and Scikit-Learn Read more Feb 2022 774 pages Full star icon 4.9 ebook eBook * ebook eBook $29.99 * print Print $54.99 $29.99 $43.99 $54.99 ADD TO CART LLM Prompt Engineering for Developers Read more May 2024 251 pages ebook eBook * ebook eBook $13.98 $13.98 $19.99 ADD TO CART Principles of Data Science Read more Jan 2024 326 pages ebook eBook * ebook eBook $27.98 * print Print $49.99 $27.98 $39.99 $49.99 ADD TO CART Microsoft Fabric Complete Guide – The Future of Data with Fabric Read more Dec 2023 9h 2m video Video $89.99 ADD TO CART Microsoft Power BI - The Complete Masterclass [2023 EDITION] Read more Jan 2023 14h 31m video Video $109.99 ADD TO CART The Complete SQL Bootcamp for Aspiring Data Scientists Read more Aug 2023 8h 10m video Video $69.99 ADD TO CART Python for Algorithmic Trading Cookbook Read more Aug 2024 412 pages ebook eBook * ebook eBook $27.98 * print Print $36.99 $27.98 $39.99 $36.99 $49.99 ADD TO CART Python Natural Language Processing Cookbook Read more Sep 2024 312 pages ebook eBook * ebook eBook $24.99 * print Print $44.99 $24.99 $35.99 $44.99 ADD TO CART Machine Learning and Generative AI for Marketing Read more Aug 2024 482 pages ebook eBook * ebook eBook $27.98 * print Print $49.99 $27.98 $39.99 $49.99 ADD TO CART Right arrow icon CUSTOMER REVIEWS Rating distribution Full star icon Full star icon Full star icon Full star icon Full star icon 5 (2 Ratings) 5 star 100% 4 star 0% 3 star 0% 2 star 0% 1 star 0% Bryan Edwards Jul 29, 2024 Full star icon Full star icon Full star icon Full star icon Full star icon 5 Great book - the author does a great job explaining the various concepts, and the examples are very helpful Feefo Verified review N/A Jul 27, 2024 Full star icon Full star icon Full star icon Full star icon Full star icon 5 Feefo Verified review ABOUT THE AUTHOR Profile icon Michael Walker Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He is currently the CIO at College Unbound in Providence, Rhode Island, in the United States. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. Read more See other products by Michael Walker Get free access to Packt library with over 7500+ books and video courses for 7 days! Start Free Trial FAQS How do I buy and download an eBook? Chevron down icon Chevron up icon Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time. If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it. Please Note: Packt eBooks are non-returnable and non-refundable. Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says: * You may make copies of your eBook for your own use onto any machine * You may not pass copies of the eBook on to anyone else How can I make a purchase on your website? Chevron down icon Chevron up icon If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps: 1. Register on our website using your email address and the password. 2. Search for the title by name or ISBN using the search option. 3. Select the title you want to purchase. 4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal) Where can I access support around an eBook? Chevron down icon Chevron up icon * If you experience a problem with using or installing Adobe Reader, the contact Adobe directly. * To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have. * To view your account details or to download a new copy of the book go to www.packtpub.com/account * To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us What eBook formats do Packt support? Chevron down icon Chevron up icon Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security. You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks. What are the benefits of eBooks? Chevron down icon Chevron up icon * You can get the information you need immediately * You can easily take them with you on a laptop * You can download them an unlimited number of times * You can print them out * They are copy-paste enabled * They are searchable * There is no password protection * They are lower price than print * They save resources and space What is an eBook? Chevron down icon Chevron up icon Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions. When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it. For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9. Arrow left icon Machine Learning Data Analysis Data Science Business Intelligence Data Visualization Artificial Intelligence Deep Learning Database Administration Data Processing Databases Arrow right icon Legal Terms and Conditions Privacy Policy Cookie Policy Shipping Policy Cancellation Policy Return Policy Support Help Contact Us Business Partnerships Sponsored eBooks Custom eBooks Careers Become an author Packt+ Membership Subscription DataPro SecPro TechLeaders United States Company Address: Packt Publishing Ltd, Grosvenor House, 11 St Paul's Square, Birmingham, B3 1RB © 2024 Packt Publishing Limited All Rights Reserved We use some essential cookies to make this service work. We’d also like to use analytics cookies so we can understand how you use the service and make improvements. ACCEPT ALL COOKIES ACCEPT ESSENTIAL COOKIES REJECT ALL COOKIES -------------------------------------------------------------------------------- Close icon Signed in users are eligible for personalised offers and content recommendations. Country selected Sign in with Packt Gmail Sign in with Google Github Sign in with Github -------------------------------------------------------------------------------- You are browsing a version of our website which may not be the most relevant option for you. We suggest changing to the following version. Country selected OK Germany