PYTHON DATA CLEANING COOKBOOK: PREPARE YOUR DATA FOR ANALYSIS WITH PANDAS,
NUMPY, MATPLOTLIB, SCIKIT-LEARN, AND OPENAI, SECOND EDITION

By Michael Walker


PYTHON DATA CLEANING COOKBOOK - SECOND EDITION


ANTICIPATING DATA CLEANING ISSUES WHEN IMPORTING TABULAR DATA WITH PANDAS

Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on)
provide analysts with an impressive range of data manipulation, exploration, and
visualization tools. One important tool is pandas. Developed by Wes McKinney in
2008, but really gaining in popularity after 2012, pandas is now an essential
library for data analysis in Python. The recipes in this book demonstrate how
many common data preparation tasks can be done more easily with pandas than with
other tools. While we work with pandas extensively in this book, we also use
other popular packages such as NumPy, Matplotlib, and SciPy.

A key pandas object is the DataFrame, which represents data as a tabular
structure, with rows and columns. In this way, it is similar to the other data
stores we discuss in this chapter. However, a pandas DataFrame also has indexing
functionality that makes selecting, combining, and transforming data relatively
straightforward, as the recipes in this book will demonstrate.
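The recipes that follow demonstrate this with real files, but a minimal sketch
with made-up station values gives a flavor of that index-based selection:

import pandas as pd

# a tiny DataFrame with illustrative station values, indexed on stationid
temps = pd.DataFrame({'stationid':['USS0010K01S','CI000085406','USC00036376'],
    'avgtemp':[5.3,18.0,9.7],
    'country':['United States','Chile','United States']})
temps = temps.set_index('stationid')

# index-based selection of a single row
temps.loc['CI000085406']

# Boolean selection of rows
temps[temps.avgtemp>8]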

Before we can make use of this great functionality, we have to import our data
into pandas. Data comes to us in a wide variety of formats: as CSV or Excel
files, as tables from SQL databases, from statistical analysis packages such as
SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web
pages.

We examine tools to import tabular data in this recipe. Specifically, we cover
the following topics:

 * Importing CSV files
 * Importing Excel files
 * Importing data from SQL databases
 * Importing SPSS, Stata, and SAS data
 * Importing R data
 * Persisting tabular data


TECHNICAL REQUIREMENTS

The code and notebooks for this chapter are available on GitHub at
https://github.com/michaelbwalker/Python-Data-Cleaning-Cookbook-Second-Edition.
You can use any IDE (Integrated Development Environment) of your choice – IDLE,
Visual Studio, Sublime, Spyder, and so on – or Jupyter Notebook to work with any
of the code in this chapter, or any chapter in this book. A good guide to get
started with Jupyter Notebook can be found here:
https://www.dataquest.io/blog/jupyter-notebook-tutorial/. I used the Spyder IDE
to write the code in this chapter.

I used pandas 2.2.1 and NumPy version 1.24.3 for all of the code in this chapter
and subsequent chapters. I have also tested all code with pandas 1.5.3.


IMPORTING CSV FILES

The read_csv method of the pandas library can be used to read a file with comma
separated values (CSV) and load it into memory as a pandas DataFrame. In this
recipe, we import a CSV file and address some common issues: creating column
names that make sense to us, parsing dates, and dropping rows with critical
missing data.

Raw data is often stored as CSV files. These files have a carriage return at the
end of each line of data to demarcate a row, and a comma between each data value
to delineate columns. Something other than a comma can be used as the delimiter,
such as a tab. Quotation marks may be placed around values, which can be helpful
when the delimiter occurs naturally within certain values, which sometimes
happens with commas.

All data in a CSV file are characters, regardless of the logical data type. This
is why it is easy to view a CSV file, presuming it is not too large, in a text
editor. The pandas read_csv method will make an educated guess about the data
type of each column, but you will need to help it along to ensure that these
guesses are on the mark.
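Here is a small, self-contained sketch (it uses an inline string rather than a
file, so the column names are illustrative only) of what those guesses look
like and how an explicit dtype argument overrides them:

import io
import pandas as pd

csvtext = "stationid,avgtemp,elevation\nUSS0010K01S,5.3,2773\nCI000085406,18.0,1850"

# let read_csv guess the types
inferred = pd.read_csv(io.StringIO(csvtext))
inferred.dtypes

# override the guesses where we know better
explicit = pd.read_csv(io.StringIO(csvtext),
    dtype={'stationid':'object', 'elevation':'float64'})
explicit.dtypes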


GETTING READY

Create a folder for this chapter, and then create a new Python script or Jupyter
Notebook file in that folder. Create a data subfolder, and then place the
landtempssample.csv file in that subfolder. Alternatively, you could retrieve
all of the files from the GitHub repository, including the data files. Here is a
screenshot of the beginning of the CSV file:

Figure 1.1: Land Temperatures Data

Data note

This dataset, taken from the Global Historical Climatology Network integrated
database, is made available for public use by the United States National Oceanic
and Atmospheric Administration at
https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly.
I used the data from version 4. The data in this recipe uses a 100,000-row
sample of the full dataset, which is also available in the repository.


HOW TO DO IT…

We will import a CSV file into pandas, taking advantage of some very useful
read_csv options:

 1. Import the pandas library, and set up the environment to make viewing the
    output easier:
    
    import pandas as pd
    pd.options.display.float_format = '{:,.2f}'.format
    pd.set_option('display.width', 85)
    pd.set_option('display.max_columns', 8)
    

 2. Read the data file, set new names for the headings, and parse the date
    column.

Pass an argument of 1 to the skiprows parameter to skip the first row, pass a
list of columns to parse_dates to create a pandas datetime column from those
columns, and set low_memory to False. This will cause pandas to load all of the
data into memory at once, rather than in chunks. We do this so that pandas can
identify the data type of each column automatically. In the There’s more…
section, we see how to set the data type for each column manually:

landtemps = pd.read_csv('data/landtempssample.csv',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
type(landtemps)


<class 'pandas.core.frame.DataFrame'>


Note

We have to use skiprows because we are passing a list of column names to
read_csv. If we use the column names in the CSV file, we do not need to specify
values for either names or skiprows.

 3. Get a quick glimpse of the data.

View the first few rows. Show the data type for all columns, as well as the
number of rows and columns:

landtemps.head(7)


  month_year    stationid  ...  countryid              country
0 2000-04-01  USS0010K01S  ...         US        United States
1 1940-05-01  CI000085406  ...         CI                Chile
2 2013-12-01  USC00036376  ...         US        United States
3 1963-02-01  ASN00024002  ...         AS            Australia
4 2001-11-01  ASN00028007  ...         AS            Australia
5 1991-04-01  USW00024151  ...         US        United States
6 1993-12-01  RSM00022641  ...         RS               Russia
[7 rows x 9 columns]


landtemps.dtypes


month_year       datetime64[ns]
stationid        object
avgtemp          float64
latitude         float64
longitude        float64
elevation        float64
station          object
countryid        object
country          object
dtype: object


landtemps.shape


(100000, 9)


 4. Give the date column a more appropriate name and view the summary statistics
    for average monthly temperature:
    
    landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
    landtemps.dtypes
    
    
    measuredate      datetime64[ns]
    stationid        object
    avgtemp          float64
    latitude         float64
    longitude        float64
    elevation        float64
    station          object
    countryid        object
    country          object
    dtype:           object
    
    
    landtemps.avgtemp.describe()
    
    
    count   85,554.00
    mean    10.92
    std     11.52
    min     -70.70
    25%     3.46
    50%     12.22
    75%     19.57
    max     39.95
    Name: avgtemp, dtype: float64
    

 5. Look for missing values for each column.

Use isnull, which returns True for each value that is missing for each column,
and False when not missing. Chain this with sum to count the missing values for
each column. (When working with Boolean values, sum treats True as 1 and False
as 0. I will discuss method chaining in the There’s more... section of this
recipe):

landtemps.isnull().sum()


measuredate    0
stationid      0
avgtemp        14446
latitude       0
longitude      0
elevation      0
station        0
countryid      0
country        5
dtype: int64


 6. Remove rows with missing data for avgtemp.

Use the subset parameter to tell dropna to drop rows when avgtemp is missing.
Set inplace to True. Leaving inplace at its default value of False would display
the DataFrame, but the changes we have made would not be retained. Use the shape
attribute of the DataFrame to get the number of rows and columns:

landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.shape


(85554, 9)


That’s it! Importing CSV files into pandas is as simple as that.


HOW IT WORKS...

Almost all of the recipes in this book use the pandas library. We refer to it as
pd to make it easier to reference later. This is customary. We also use
float_format to display float values in a readable way and set_option to make
the Terminal output wide enough to accommodate the number of variables.

Much of the work is done by the first line in Step 2. We use read_csv to load a
pandas DataFrame in memory and call it landtemps. In addition to passing a
filename, we set the names parameter to a list of our preferred column headings.
We also tell read_csv to skip the first row, by setting skiprows to 1, since the
original column headings are in the first row of the CSV file. If we do not tell
it to skip the first row, read_csv will treat the header row in the file as
actual data.

read_csv also solves a date conversion issue for us. We use the parse_dates
parameter to ask it to convert the month and year columns to a date value.

Step 3 runs through a few standard data checks. We use head(7) to print out all
columns for the first seven rows. We use the dtypes attribute of the DataFrame
to show the data type of all columns. Each column has the expected data type. In
pandas, character data has the object data type, a data type that allows for
mixed values. shape returns a tuple, whose first element is the number of rows
in the DataFrame (100,000 in this case) and whose second element is the number
of columns (9).

When we used read_csv to parse the month and year columns, it gave the resulting
column the name month_year. We used the rename method in Step 4 to give that
column a more appropriate name. We need to specify inplace=True to replace the
old column name with the new column name in memory. The describe method provides
summary statistics on the avgtemp column.

Notice that the count for avgtemp indicates that there are 85,554 rows that have
valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame,
as provided by the shape attribute. The listing of missing values for each
column in Step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 =
14,446.

Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the
pandas representation of missing values.) subset is used to indicate which
column to check for missing values. The shape attribute for landtemps now
indicates that there are 85,554 rows, which is what we would expect, given the
previous count from describe.


THERE’S MORE...

If the file you are reading uses a delimiter other than a comma, such as a tab,
this can be specified in the sep parameter of read_csv. When creating the pandas
DataFrame, an index was also created. The numbers to the far left of the output
when head was run are index values. Any number of rows can be specified for
head. The default value is 5.
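As a quick sketch, assuming a hypothetical tab-delimited copy of the sample
file called landtempssample.tsv:

# sep tells read_csv that the file is tab-delimited
landtemps_tab = pd.read_csv('data/landtempssample.tsv', sep='\t')

# head accepts any number of rows; the default is 5
landtemps_tab.head(3)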

Instead of setting low_memory to False, to get pandas to make good guesses
regarding data types, we could have set data types manually:

landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp','latitude',
      'longitude','elevation','station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    dtype={'stationid':'object', 'avgtemp':'float64',
     'latitude':'float64','longitude':'float64',
     'elevation':'float64','station':'object',
     'countryid':'object','country':'object'},
    )
landtemps.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype       
---  ------      --------------   -----       
 0   month_year  100000 non-null  datetime64[ns]
 1   stationid   100000 non-null  object      
 2   avgtemp     85554 non-null   float64     
 3   latitude    100000 non-null  float64     
 4   longitude   100000 non-null  float64     
 5   elevation   100000 non-null  float64     
 6   station     100000 non-null  object      
 7   countryid   100000 non-null  object      
 8   country     99995 non-null   object      
dtypes: datetime64[ns](1), float64(4), object(4)
memory usage: 6.9+ MB


The landtemps.isnull().sum() statement is an example of chaining methods. First,
isnull returns a DataFrame of True and False values, resulting from testing
whether each column value is null. The sum function takes that DataFrame and
sums the True values for each column, interpreting the True values as 1 and the
False values as 0. We would have obtained the same result if we had used the
following two steps:

checknull = landtemps.isnull()
checknull.sum()


There is no hard and fast rule for when to chain methods and when not to do so.
I find chaining helpful when the overall operation feels like a single step,
even if it’s two or more steps mechanically. Chaining also has the side benefit
of not creating extra objects that I might not need.

The dataset used in this recipe is just a sample from the full land temperatures
database, with almost 17 million records. You can run the larger file if your
machine can handle it, with the following code:

landtemps = pd.read_csv('data/landtemps.zip',
...   compression='zip', names=['stationid','year',
...     'month','avgtemp','latitude','longitude',
...     'elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)


read_csv can read a compressed ZIP file. We get it to do this by passing the
name of the ZIP file and the type of compression.


SEE ALSO

Subsequent recipes in this chapter, and in other chapters, set indexes to
improve navigation over rows and merging.

A significant amount of reshaping of the Global Historical Climatology Network
raw data was done before using it in this recipe. We demonstrate this in Chapter
11, Tidying and Reshaping Data.


IMPORTING EXCEL FILES

The read_excel method of the pandas library can be used to import data from an
Excel file and load it into memory as a pandas DataFrame. In this recipe, we
import an Excel file and handle some common issues when working with Excel
files: extraneous header and footer information, selecting specific columns,
removing rows with no data, and connecting to particular sheets.

Despite the tabular structure of Excel, which invites the organization of data
into rows and columns, spreadsheets are not datasets and do not require people
to store data in that way. Even when some data conforms with those expectations,
there is often additional information in rows or columns before or after the
data to be imported. Data types are not always as clear as they are to the
person who created the spreadsheet. This will be all too familiar to anyone who
has ever battled with importing leading zeros. Moreover, Excel does not insist
that all data in a column be of the same type, or that column headings be
appropriate for use with a programming language such as Python.

Fortunately, read_excel has a number of options for handling messiness in Excel
data. These options make it relatively easy to skip rows, select particular
columns, and pull data from a particular sheet or sheets.
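If you are not sure which sheet holds the data you need, one quick check before
calling read_excel is to list the sheet names. A brief sketch, using the file
from this recipe:

import pandas as pd

xlfile = pd.ExcelFile('data/GDPpercapita22b.xlsx')
xlfile.sheet_names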


GETTING READY

You can download the GDPpercapita22b.xlsx file, as well as the code for this
recipe, from the GitHub repository for this book. The code assumes that the
Excel file is in a data subfolder. Here is a view of the beginning of the file
(some columns were hidden for display purposes):

Figure 1.2: View of the dataset

And here is a view of the end of the file:

Figure 1.3: View of the dataset

Data note

This dataset, from the Organisation for Economic Co-operation and Development,
is available for public use at https://stats.oecd.org/.


HOW TO DO IT…

We import an Excel file into pandas and do some initial data cleaning:

 1. Import the pandas library:
    
    import pandas as pd
    

 2. Read the Excel per capita GDP data.

Select the sheet with the data we need, but skip the columns and rows that we do
not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4
and skipfooter to 1 to skip the first four rows (the first row is hidden) and
the last row. We provide values for usecols to get data from column A and
columns C through W (column B is blank). Use head to view the first few rows and
shape to get the number of rows and columns:

percapitaGDP = pd.read_excel("data/GDPpercapita22b.xlsx",
...    sheet_name="OECD.Stat export",
...    skiprows=4,
...    skipfooter=1,
...    usecols="A,C:W")
percapitaGDP.head()


                                  Year	2000	...	2019	2020
0	Metropolitan areas	       ...	NaN 	...	NaN	NaN
1	AUS: Australia	..	       ...	...  	...	...	...
2	AUS01: Greater Sydney	       ...	... 	... 	45576	45152
3	AUS02: Greater Melbourne     ...	... 	... 	42299	40848
4	AUS03: Greater Brisbane      ...	... 	... 	42145	40741
[5 rows x 22 columns]


percapitaGDP.shape


(731, 22)


Note

You may encounter a problem with read_excel if the Excel file does not use utf-8
encoding. One way to resolve this is to save the Excel file as a CSV file,
reopen it, and then save it with utf-8 encoding.

 3. Use the info method of the DataFrame to view data types and the non-null
    count. Notice that all columns have the object data type:
    
    percapitaGDP.info()
    
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 731 entries, 0 to 730
    Data columns (total 22 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   Year    731 non-null    object
     1   2000    730 non-null    object
     2   2001    730 non-null    object
     3   2002    730 non-null    object
     4   2003    730 non-null    object
     5   2004    730 non-null    object
     6   2005    730 non-null    object
     7   2006    730 non-null    object
     8   2007    730 non-null    object
     9   2008    730 non-null    object
     10  2009    730 non-null    object
     11  2010    730 non-null    object
     12  2011    730 non-null    object
     13  2012    730 non-null    object
     14  2013    730 non-null    object
     15  2014    730 non-null    object
     16  2015    730 non-null    object
     17  2016    730 non-null    object
     18  2017    730 non-null    object
     19  2018    730 non-null    object
     20  2019    730 non-null    object
     21  2020    730 non-null    object
    dtypes: object(22)
    memory usage: 125.8+ KB
    

 4. Rename the Year column to metro, and remove the leading spaces.

Give an appropriate name to the metropolitan area column. There are extra spaces
before the metro values in some cases. We can test for leading spaces with
startswith(' ') and then use any to establish whether there are one or more
occasions when the first character is blank. We can use endswith(' ') to examine
trailing spaces. We use strip to remove both leading and trailing spaces. When
we test for trailing spaces again, we see that there are none:

percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)
percapitaGDP.metro.str.startswith(' ').any()


True


percapitaGDP.metro.str.endswith(' ').any()


False


percapitaGDP.metro = percapitaGDP.metro.str.strip()
percapitaGDP.metro.str.startswith(' ').any()


False


 5. Convert the data columns to numeric.

Iterate over all of the GDP year columns (2000–2020) and convert the data type
from object to float. Coerce the conversion even when there is character data –
the .. in this example. We want character values in those columns to become
missing, which is what happens. Rename the year columns to better reflect the
data in those columns:

for col in percapitaGDP.columns[1:]:
...   percapitaGDP[col] = pd.to_numeric(percapitaGDP[col],
...     errors='coerce')
...   percapitaGDP.rename(columns={col:'pcGDP'+col},
...     inplace=True)
...
percapitaGDP.head()


                      metro  pcGDP2000  pcGDP2001  ...  \
0        Metropolitan areas        NaN        NaN  ... 
1            AUS: Australia        NaN        NaN  ... 
2     AUS01: Greater Sydney        NaN      41091  ... 
3  AUS02: Greater Melbourne        NaN      40488  ... 
4   AUS03: Greater Brisbane        NaN      35276  ... 
   pcGDP2018  pcGDP2019  pcGDP2020
0        NaN        NaN        NaN
1        NaN        NaN        NaN
2      47171      45576      45152
3      43237      42299      40848
4      44328      42145      40741
[5 rows x 22 columns]


percapitaGDP.dtypes


metro          object
pcGDP2000      float64
pcGDP2001      float64
abbreviated to save space
pcGDP2019      float64
pcGDP2020      float64
dtype: object


 6. Use the describe method to generate summary statistics for all numeric data
    in the DataFrame:
    
    percapitaGDP.describe()
    
    
           pcGDP2000  pcGDP2001  pcGDP2002  ...  pcGDP2018  \
    count        158        450        479  ...        692 
    mean       33961      38874      39621  ...      41667 
    std        15155      13194      13061  ...      17440 
    min         2686       7805       7065  ...       5530 
    25%        21523      30790      31064  ...      31322 
    50%        35836      38078      39246  ...      41428 
    75%        42804      46576      47874  ...      51130 
    max        95221      96941      98929  ...     147760 
           pcGDP2019  pcGDP2020
    count        596        425
    mean       42709      39792
    std        18893      19230
    min         5698       5508
    25%        29760      24142
    50%        43505      41047
    75%        53647      51130
    max       146094     131082
    [8 rows x 21 columns]
    

 7. Remove rows where all of the per capita GDP values are missing.

Use the subset parameter of dropna to inspect all columns, starting with the
second column (it is zero-based) and going through to the last column. Use how
to specify that we want to drop rows only if all of the columns specified in
subset are missing. Use shape to show the number of rows and columns in the
resulting DataFrame:

percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how="all", inplace=True)
percapitaGDP.shape


(692, 22)


 8. Set the index for the DataFrame using the metropolitan area column.

Confirm that there are 692 valid values for metro and that there are 692 unique
values, before setting the index:

percapitaGDP.metro.count()


692


percapitaGDP.metro.nunique()


692


percapitaGDP.set_index('metro', inplace=True)
percapitaGDP.head()


                          pcGDP2000  pcGDP2001  ...  \
metro                                           ... 
AUS01: Greater Sydney           NaN      41091  ... 
AUS02: Greater Melbourne        NaN      40488  ... 
AUS03: Greater Brisbane         NaN      35276  ... 
AUS04: Greater Perth            NaN      43355  ... 
AUS05: Greater Adelaide         NaN      36081  ... 
                          pcGDP2019  pcGDP2020
metro                                         
AUS01: Greater Sydney         45576      45152
AUS02: Greater Melbourne      42299      40848
AUS03: Greater Brisbane       42145      40741
AUS04: Greater Perth          70970      78489
AUS05: Greater Adelaide       38314      39181
[5 rows x 21 columns]
percapitaGDP.loc['AUS02: Greater Melbourne']
pcGDP2000     NaN
pcGDP2001   40488
...
pcGDP2019   42299
pcGDP2020   40848
Name: AUS02: Greater Melbourne, dtype: float64


We have now imported the Excel data into a pandas DataFrame and cleaned up some
of the messiness in the spreadsheet.


HOW IT WORKS…

We mostly manage to get the data we want in Step 2 by skipping rows and columns
we do not want, but there are still a number of issues – read_excel interprets
all of the GDP data as character data, many rows are loaded with no useful data,
and the column names do not represent the data well. In addition, the
metropolitan area column might be useful as an index, but there are leading and
trailing blanks, and there may be missing or duplicated values.

read_excel interprets Year as the column name for the metropolitan area data
because it looks for a header above the data for that Excel column and finds
Year there. We rename that column metro in Step 4. We also use strip to fix the
problem with leading and trailing blanks. We could have just used lstrip to
remove leading blanks, or rstrip if there had been trailing blanks. It is a good
idea to assume that there might be leading or trailing blanks in any character
data, cleaning that data shortly after the initial import.
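A quick sketch of the difference on a throwaway Series:

import pandas as pd

messy = pd.Series(['  AUS01: Greater Sydney  ',' AUS02: Greater Melbourne'])
messy.str.lstrip()   # removes leading blanks only
messy.str.rstrip()   # removes trailing blanks only
messy.str.strip()    # removes both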

The spreadsheet authors used .. to represent missing data. Since this is
actually valid character data, those columns get the object data type (that is
how pandas treats columns with character or mixed data). We coerce a conversion
to numeric type in Step 5. This also results in the original values of .. being
replaced with NaN (not a number), how pandas represents missing values for
numbers. This is what we want.

We can fix all of the per capita GDP columns with just a few lines because
pandas makes it easy to iterate over the columns of a DataFrame. By specifying
[1:], we iterate from the second column to the last column. We can then change
those columns to numeric and rename them to something more appropriate.

There are several reasons why it is a good idea to clean up the column headings
for the annual GDP columns – it helps us to remember what the data actually is;
if we merge it with other data by metropolitan area, we will not have to worry
about conflicting variable names; and we can use attribute access to work with
pandas Series based on those columns, which I will discuss in more detail in the
There’s more… section of this recipe.

describe in Step 6 shows us that fewer than 500 rows have valid data for per
capita GDP for some years. When we drop all rows that have missing values for
all per capita GDP columns in step 7, we end up with 692 rows in the DataFrame.


THERE’S MORE…

Once we have a pandas DataFrame, we have the ability to treat columns as more
than just columns. We can use attribute access (such as percapitaGDP.metro) or
bracket notation (percapitaGDP['metro']) to get the functionality of a pandas
Series. Either method makes it possible to use Series string inspecting methods,
such as str.startswith, and counting methods, such as nunique. Note that the
original column names of 20## did not allow attribute access because they
started with a number, so percapitaGDP.pcGDP2001.count() works, but
percapitaGDP.2001.count() returns a syntax error because 2001 is not a valid
Python identifier (since it starts with a number).
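A brief illustration of the two access styles on one of the renamed columns:

percapitaGDP.pcGDP2001.count()      # attribute access
percapitaGDP['pcGDP2001'].count()   # bracket notation returns the same Series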

pandas is rich with features for string manipulation and for Series operations.
We will try many of them out in subsequent recipes. This recipe showed those
that I find most useful when importing Excel data.


SEE ALSO

There are good reasons to consider reshaping this data. Instead of 21 columns of
GDP per capita data for each metropolitan area, we should have 21 rows of data
for each metropolitan area, with columns for year and GDP per capita. Recipes
for reshaping data can be found in Chapter 11, Tidying and Reshaping Data.


IMPORTING DATA FROM SQL DATABASES

In this recipe, we will use the pymssql and mysql.connector APIs to read data
from Microsoft SQL Server and MySQL (now owned by Oracle) databases,
respectively. Data from
sources such as these tends to be well structured, since it is designed to
facilitate simultaneous transactions by members of organizations and those who
interact with them. Each transaction is also likely related to some other
organizational transaction.

This means that although data tables from enterprise systems such as these are
more reliably structured than data from CSV files and Excel files, their logic
is less likely to be self-contained. You need to know how the data from one
table relates to data from another table to understand its full meaning. These
relationships need to be preserved, including the integrity of primary and
foreign keys, when pulling data. Moreover, well-structured data tables are not
necessarily uncomplicated data tables. There are often sophisticated coding
schemes that determine data values, and these coding schemes can change over
time. For example, codes for merchandise at a retail store chain might be
different in 1998 than they are in 2024. Similarly, frequently there are codes
for missing values, such as 99,999, that pandas will understand as valid values.

Since much of this logic is business logic, and implemented in stored procedures
or other applications, it is lost when pulled out of this larger system. Some of
what is lost will eventually have to be reconstructed when preparing data for
analysis. This almost always involves combining data from multiple tables, so it
is important to preserve the ability to do that. However, it also may involve
adding some of the coding logic back after loading the SQL table into a pandas
DataFrame. We explore how to do that in this recipe.


GETTING READY

This recipe assumes you have the pymssql and mysql-connector-python packages
installed. If you do not, it is relatively straightforward to install them with
pip. From the Terminal, or PowerShell (in Windows), enter pip install pymssql
and pip install mysql-connector-python. We will work with data on educational
attainment in this recipe.

Data note

The dataset used in this recipe is available for public use at
https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip.


HOW TO DO IT...

We import SQL Server and MySQL data tables into a pandas DataFrame, as follows:

 1. Import pandas, numpy, pymssql, and mysql.connector.

This step assumes that you have installed the pymssql and mysql-connector-python packages:

import pandas as pd
import numpy as np
import pymssql
import mysql.connector


 2. Use the pymssql API and read_sql to retrieve and load data from a SQL
    Server instance.

Select the columns we want from the SQL Server data, and use SQL aliases to
improve column names (for example, fedu AS fathereducation). Create a connection
to the SQL Server data by passing database credentials to the pymssql connect
function. Create a pandas DataFrame by passing the SELECT statement and
connection object to read_sql. Use close to return the connection to the pool on
the server:

sqlselect = "SELECT studentid, school, sex, age, famsize,\
...   medu AS mothereducation, fedu AS fathereducation,\
...   traveltime, studytime, failures, famrel, freetime,\
...   goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
...   g3 AS gradeperiod3 From studentmath"
server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdcctest"
conn = pymssql.connect(server=server,
...   user=user, password=password, database=database)
studentmath = pd.read_sql(sqlselect,conn)
conn.close()


Note

Although tools such as pymssql make connecting to a SQL Server instance
relatively straightforward, the syntax still might take a little time to get
used to if it is unfamiliar. The previous step shows the parameter values you
will typically need to pass to a connection object – the name of the server, the
name of a user with credentials on the server, the password for that user, and
the name of a SQL database on the server.

 3. Check the data types and the first few rows:
    
    studentmath.dtypes
    
    
    studentid          object
    school             object
    sex                object
    age                int64
    famsize            object
    mothereducation    int64
    fathereducation    int64
    traveltime         int64
    studytime          int64
    failures           int64
    famrel             int64
    freetime           int64
goout            int64
    gradeperiod1       int64
    gradeperiod2       int64
    gradeperiod3       int64
    dtype: object
    
    
    studentmath.head()
    
    
        studentid    school  ...      gradeperiod2    gradeperiod3
    0	001	    GP      ...	6	        6
    1	002	    GP      ...	5	        6
    2	003	    GP      ...	8	        10
    3	004	    GP      ...	14	        15
    4	005	    GP      ...	10	        10
    [5 rows x 16 columns]
    

 4. Connecting to a MySQL server is not very different from connecting to a SQL
    Server instance. We can use the connect method of mysql.connector to do
    that and then use read_sql to load the data.

Create a connection to the mysql data, pass that connection to read_sql to
retrieve the data, and load it into a pandas DataFrame (the same data file on
student math scores was uploaded to SQL Server and MySQL, so we can use the same
SQL SELECT statement we used in the previous step):

host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdccschema"
connmysql = mysql.connector.connect(host=host, \
...   database=database,user=user,password=password)
studentmath = pd.read_sql(sqlselect,connmysql)
connmysql.close()


 5. Rearrange the columns, set an index, and check for missing values.

Move the grade data to the left of the DataFrame, just after studentid. Also,
move the freetime column to the right after traveltime and studytime. Confirm
that each row has an ID and that the IDs are unique, and set studentid as the
index:

newcolorder = ['studentid', 'gradeperiod1',
...   'gradeperiod2','gradeperiod3', 'school',
...   'sex', 'age', 'famsize','mothereducation',
...   'fathereducation', 'traveltime',
...   'studytime', 'freetime', 'failures',
...   'famrel','goout']
studentmath = studentmath[newcolorder]
studentmath.studentid.count()


395


studentmath.studentid.nunique()


395


studentmath.set_index('studentid', inplace=True)


 6. Use the DataFrame’s count function to check for missing values:
    
    studentmath.count()
    
    
    gradeperiod1		395
    gradeperiod2		395
    gradeperiod3		395
    school		395
    sex			395
    age			395
    famsize		395
    mothereducation	395
    fathereducation	395
    traveltime		395
    studytime		395
    freetime		395
    failures		395
    famrel		395
    goout			395
    dtype: int64
    

 7. Replace coded data values with more informative values.

Create a dictionary with the replacement values for the columns, and then use
replace to set those values:

setvalues= \
...   {"famrel":{1:"1:very bad",2:"2:bad",
...     3:"3:neutral",4:"4:good",5:"5:excellent"},
...   "freetime":{1:"1:very low",2:"2:low",
...     3:"3:neutral",4:"4:high",5:"5:very high"},
...   "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
...     4:"4:high",5:"5:very high"},
...   "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"},
...   "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"}}
studentmath.replace(setvalues, inplace=True)


 8. Change the type for columns with the changed data to category.

Check any changes in memory usage:

setvalueskeys = [k for k in setvalues]
studentmath[setvalueskeys].memory_usage(index=False)


famrel		3160
freetime		3160
goout			3160
mothereducation	3160
fathereducation	3160
dtype: int64


for col in studentmath[setvalueskeys].columns:
...   studentmath[col] = studentmath[col]. \
...     astype('category')
...
studentmath[setvalueskeys].memory_usage(index=False)


famrel		607
freetime		607
goout			607
mothereducation	599
fathereducation	599
dtype: int64


 9. Calculate percentages for values in the famrel column.

Run value_counts, and set normalize to True to generate percentages:

studentmath['famrel'].value_counts(sort=False, normalize=True)


1:very bad	0.02
2:bad		0.05
3:neutral	0.17
4:good	0.49
5:excellent	0.27
Name: famrel, dtype: float64


 10. Use apply to calculate percentages for multiple columns:
    
    studentmath[['freetime','goout']].\
    ...   apply(pd.Series.value_counts, sort=False,
    ...   normalize=True)
    
    
                 freetime   goout
    1:very low	0.05	    0.06
    2:low		0.16	    0.26
    3:neutral	0.40	    0.33
    4:high	0.29	    0.22
    5:very high	0.10	    0.13
    
    
    studentmath[['mothereducation','fathereducation']].\
    ...   apply(pd.Series.value_counts, sort=False,
    ...   normalize=True)
    
    
    			mothereducation	fathereducation
    1:k-4			0.15			0.21
    2:5-9			0.26			0.29
    3:secondary ed	0.25			0.25
    4:higher ed		0.33			0.24
    

The preceding steps retrieved a data table from a SQL database, loaded that data
into pandas, and did some initial data checking and cleaning.


HOW IT WORKS…

Since data from enterprise systems is typically better structured than CSV or
Excel files, we do not need to do things such as skip rows or deal with
different logical data types in a column. However, some massaging is still
usually required before we can begin exploratory analysis. There are often more
columns than we need, and some column names are not intuitive or not ordered in
the best way for analysis. The meaningfulness of many data values is not stored
in the data table to avoid entry errors and save on storage space. For example,
3 is stored for mother’s education rather than secondary education. It is a good
idea to reconstruct that coding as early in the cleaning process as possible.

To pull data from a SQL database server, we need a connection object to
authenticate us on the server, as well as a SQL select string. These can be
passed to read_sql to retrieve the data and load it into a pandas DataFrame. I
usually use the SQL SELECT statement to do a bit of cleanup of column names at
this point. I sometimes also reorder columns, but I did that later in this
recipe.

We set the index in Step 5, first confirming that every row has a value for
studentid and that it is unique. This is often more important when working with
enterprise data because we will almost always need to merge the retrieved data
with other data files on the system. Although an index is not required for this
merging, the discipline of setting one prepares us for the tricky business of
merging data further down the road. It will also likely improve the speed of the
merge.
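For instance, once studentid is the index on both sides, combining the data can
be as simple as a join. A sketch, using a made-up second DataFrame:

import pandas as pd

# a made-up second DataFrame keyed by the same studentid values
studentextra = pd.DataFrame({'studentid':['001','002','003'],
    'absences':[4,2,10]}).set_index('studentid')

# join aligns the rows by index on both sides
studentmath.join(studentextra, how='left').head()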

We use the DataFrame’s count function to check for missing values and that there
are no missing values – for non-missing values, the count is 395 (the number of
rows) for every column. This is almost too good to be true. There may be values
that are logically missing – that is, valid numbers that nonetheless connote
missing values, such as -1, 0, 9, or 99. We address this possibility in the next
step.

Step 7 demonstrates a useful technique for replacing data values for multiple
columns. We create a dictionary to map original values to new values for each
column and then run it using replace. To reduce the amount of storage space
taken up by the new verbose values, we convert the data type of those columns to
category. We do this by generating a list of the keys of our setvalues
dictionary – setvalueskeys = [k for k in setvalues] generates [famrel, freetime,
goout, mothereducation, and fathereducation]. We then iterate over those five
columns and use the astype method to change the data type to category. Notice
that the memory usage for those columns is reduced substantially.

Finally, we check the assignment of new values by using value_counts to view
relative frequencies. We use apply because we want to run value_counts on
multiple columns. To prevent value_counts sorting by frequency, we set sort to
False.

The DataFrame replace method is also a handy tool for dealing with logical
missing values that will not be recognized as missing when retrieved by
read_sql. The 0 values for mothereducation and fathereducation seem to fall into
that category. We fix this problem in the setvalues dictionary by indicating
that the 0 values for mothereducation and fathereducation should be replaced
with NaN. It is important to address these kinds of missing values shortly after
the initial import because they are not always obvious and can significantly
impact all subsequent work.
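One quick way to surface such codes, run on the data as originally retrieved
(that is, before the replace in Step 7), is to scan the minimums and
frequencies of the coded columns. A sketch:

# a minimum of 0 here hints that 0 is being used as a missing-value code
studentmath[['mothereducation','fathereducation']].min()

# the full set of codes, in code order rather than frequency order
studentmath['mothereducation'].value_counts(sort=False)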

Users of packages such as SPSS, SAS, and R will notice the difference between
this approach and value labels in SPSS and R, as well as the proc format in SAS.
In pandas, we need to change the actual data to get more informative values.
However, we reduce how much data is actually stored by giving the column a
category data type. This is similar to factors in R.


THERE’S MORE…

I moved the grade data to near the beginning of the DataFrame. I find it helpful
to have potential target or dependent variables in the leftmost columns, keeping
them at the forefront of your mind. It is also helpful to keep similar columns
together. In this example, personal demographic variables (sex and age) are next
to one another, as are family variables (mothereducation and fathereducation),
and how students spend their time (traveltime, studytime, and freetime).

You could have used map instead of replace in Step 7. Prior to version 0.19.2
of pandas, map was significantly more efficient. Since then, the difference in
efficiency has been much smaller. If you are working with a very large dataset,
the difference may still be enough to consider using map.
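A sketch of what Step 7 would have looked like with map, applied column by
column to the originally coded data:

# apply each column's dictionary with map instead of a single replace call
for col, mapping in setvalues.items():
    studentmath[col] = studentmath[col].map(mapping)

Unlike replace, map returns NaN for any value not covered by the mapping, so
the dictionary needs to account for every code.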


SEE ALSO

The recipes in Chapter 10, Addressing Data Issues When Combining DataFrames, go
into detail on merging data. We will take a closer look at bivariate and
multivariate relationships between variables in Chapter 4, Identifying Outliers
in Subsets of Data. We will demonstrate how to use some of these same approaches
in packages such as SPSS, SAS, and R in subsequent recipes in this chapter.


IMPORTING SPSS, STATA, AND SAS DATA

We will use pyreadstat to read data from three popular statistical packages into
pandas. The key advantage of pyreadstat is that it allows data analysts to
import data from these packages without losing metadata, such as variable and
value labels.

The SPSS, Stata, and SAS data files we receive often come to us with the data
issues of CSV and Excel files and SQL databases having been resolved. We do not
typically have the invalid column names, changes in data types, and unclear
missing values that we can get with CSV or Excel files, nor do we usually get
the detachment of data from business logic, such as the meaning of data codes,
that we often get with SQL data. When someone or some organization shares a data
file from one of these packages with us, they have often added variable labels
and value labels for categorical data. For example, a hypothetical data column
called presentsat has the overall satisfaction with presentation variable label
and 1–5 value labels, with 1 being not at all satisfied and 5 being highly
satisfied.

The challenge is retaining that metadata when importing data from those systems
into pandas. There is no precise equivalent to variable and value labels in
pandas, and built-in tools for importing SPSS, Stata, and SAS data lose the
metadata. In this recipe, we will use pyreadstat to load variable and value
label information and use a couple of techniques to represent that information
in pandas.


GETTING READY

This recipe assumes you have installed the pyreadstat package. If it is not
installed, you can install it with pip. From the Terminal, or PowerShell (in
Windows), enter pip install pyreadstat. You will need the SPSS, Stata, and SAS
data files for this recipe to run the code.

We will work with data from the United States National Longitudinal Surveys
(NLS) of Youth.

Data note

The NLS of Youth is conducted by the United States Bureau of Labor Statistics.
This survey started with a cohort of individuals in 1997. Each survey respondent
was high school age when they first completed the survey, having been born
between 1980 and 1985. Annual follow-up surveys continued through 2023. For
this recipe, I pulled 42 variables on grades, employment, income, and
attitudes toward government, from the hundreds of data items on the survey.
Separate files for SPSS, Stata, and SAS can be downloaded from the repository.

The original NLS data can be downloaded from
https://www.nlsinfo.org/investigator/pages/search, along with code for creating
SPSS, Stata, or SAS files from the ASCII data files included in the download.


HOW TO DO IT...

We will import data from SPSS, Stata, and SAS, retaining metadata such as value
labels:

 1. Import pandas, numpy, and pyreadstat.

This step assumes that you have installed pyreadstat:

import pandas as pd
import numpy as np
import pyreadstat


 2. Retrieve the SPSS data.

Pass a path and filename to the read_sav method of pyreadstat. Display the first
few rows and a frequency distribution. Note that the column names and value
labels are non-descriptive, and that read_sav returns both a pandas DataFrame
and a meta object:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
nls97spss.dtypes


R0000100	float64
R0536300	float64
R0536401	float64
...
U2962900	float64
U2963000	float64
Z9063900	float64
dtype: object


nls97spss.head()


   R0000100  R0536300  ...  U2963000  Z9063900
0	1	2         ...  nan       52
1	2	1         ...  6         0
2	3	2         ...  6         0
3	4	2         ...  6         4
4	5	1         ...  5         12
[5 rows x 42 columns]


nls97spss['R0536300'].value_counts(normalize=True)


1.00	0.51
2.00	0.49
Name: R0536300, dtype: float64


 3. Grab the metadata to improve column labels and value labels.

The metaspss object created when we called read_sav has the column labels and
the value labels from the SPSS file. Use the variable_value_labels dictionary to
map values to value labels for one column (R0536300). (This does not change the
data. It only improves our display when we run value_counts.) Use the
set_value_labels method to actually apply the value labels to the DataFrame:

metaspss.variable_value_labels['R0536300']


{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}


nls97spss['R0536300'].\
...   map(metaspss.variable_value_labels['R0536300']).\
...   value_counts(normalize=True)


Male		0.51
Female	0.49
Name: R0536300, dtype: float64


nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)


 4. Use column labels in the metadata to rename the columns.

To use the column labels from metaspss in our DataFrame, we can simply assign
the column labels in metaspss to our DataFrame’s column names. Clean up the
column names a bit by changing them to lowercase, changing spaces to
underscores, and removing all remaining non-alphanumeric characters:

nls97spss.columns = metaspss.column_labels
nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)


Male		0.51
Female	0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64


nls97spss.dtypes


PUBID - YTH ID CODE 1997            float64
KEY!SEX (SYMBOL) 1997               category
KEY!BDATE M/Y (SYMBOL) 1997         float64
KEY!BDATE M/Y (SYMBOL) 1997         float64
CV_SAMPLE_TYPE 1997                 category
KEY!RACE_ETHNICITY (SYMBOL) 1997    category
... abbreviated to save space
HRS/WK R WATCHES TELEVISION 2017    category
HRS/NIGHT R SLEEPS 2017             float64
CVC_WKSWK_YR_ALL L99                float64
dtype: object


nls97spss.columns = nls97spss.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)


 5. Simplify the process by applying the value labels from the beginning.

The value labels can actually be applied in the initial call to read_sav by
setting apply_value_formats to True. This eliminates the need to call the
set_value_labels function later:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True)
nls97spss.columns = metaspss.column_labels
nls97spss.columns = nls97spss.columns.\
...   str.lower().\
...   str.replace(' ','_').\
...   str.replace('[^a-z0-9_]', '', regex=True)


 6. Show the columns and a few rows:
    
    nls97spss.dtypes
    
    
    pubid__yth_id_code_1997	float64
    keysex_symbol_1997		category
    keybdate_my_symbol_1997	float64
    keybdate_my_symbol_1997	float64
    hrsnight_r_sleeps_2017	float64
    cvc_wkswk_yr_all_l99	float64
    dtype: object
    
    
    nls97spss.head()
    
    
       pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
    0	1	Female  ... 
    1	2	Male  ... 
    2	3	Female  ... 
    3	4	Female  ... 
    4	5	Male  ... 
       hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
    0	nan	52
    1	6	0
    2	6	0
    3	6	4
    4	5	12
    [5 rows x 42 columns]
    

 7. Run frequencies on one of the columns, and set the index:
    
    nls97spss.govt_responsibility__provide_jobs_2006.\
    ...   value_counts(sort=False)
    
    
    Definitely should be	454
    Definitely should not be	300
    Probably should be		617
    Probably should not be	462
    Name: govt_responsibility__provide_jobs_2006, dtype: int64
    
    
    nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
    

 8. That demonstrated how to convert data from SPSS. Let’s try that with Stata
    data.
 9. Import the Stata data, apply value labels, and improve the column headings.

Use the same methods for the Stata data that we used for the SPSS data:

nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
nls97stata.columns = metastata.column_labels
nls97stata.columns = nls97stata.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97stata.dtypes


pubid__yth_id_code_1997	float64
keysex_symbol_1997		category
keybdate_my_symbol_1997	float64
keybdate_my_symbol_1997	float64
hrsnight_r_sleeps_2017	float64
cvc_wkswk_yr_all_l99	float64
dtype: object


 10. View a few rows of the data and run frequencies:
    
    nls97stata.head()
    
    
       pubid__yth_id_code_1997    keysex_symbol_1997  ...  \
    0                        1                Female  ... 
    1                        2                  Male  ... 
    2                        3                Female  ... 
    3                        4                Female  ... 
    4                        5                  Male  ... 
       hrsnight_r_sleeps_2017    cvc_wkswk_yr_all_l99
    0                      -5                      52
    1                       6                       0
    2                       6                       0
    3                       6                       4
    4                       5                      12
    [5 rows x 42 columns]
    
    
    nls97stata.govt_responsibility__provide_jobs_2006.\
    ...   value_counts(sort=False)
    
    
    -5.0	1425
    -4.0	5665
    -2.0	56
    -1.0	5
    Definitely should be	454
    Definitely should not be	300
    Probably should be		617
    Probably should not be	462
    Name: govt_responsibility__provide_jobs_2006, dtype: int64
    

 10. Fix the logical missing values that show up with the Stata data and set an
     index. We can use the replace method to set any value that is between –9
     and –1 in any column to missing:
    
    nls97stata.min(numeric_only=True)
    
    
    pubid__yth_id_code_1997          1
    keybdate_my_symbol_1997          1
    keybdate_my_symbol_1997      1,980
    trans_sat_verbal_hstr           -4
    trans_sat_math_hstr             -4
    trans_crd_gpa_overall_hstr      -9
    trans_crd_gpa_eng_hstr          -9
    trans_crd_gpa_math_hstr         -9
    trans_crd_gpa_lp_sci_hstr       -9
    cv_ba_credits_l1_2011           -5
    cv_bio_child_hh_2017            -5
    cv_bio_child_nr_2017            -5
    hrsnight_r_sleeps_2017          -5
    cvc_wkswk_yr_all_l99            -4
    dtype: float64
    
    
    nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
    nls97stata.min(numeric_only=True)
    
    
    pubid__yth_id_code_1997          1
    keybdate_my_symbol_1997          1
    keybdate_my_symbol_1997      1,980
    trans_sat_verbal_hstr           14
    trans_sat_math_hstr              7
    trans_crd_gpa_overall_hstr      10
    trans_crd_gpa_eng_hstr           0
    trans_crd_gpa_math_hstr          0
    trans_crd_gpa_lp_sci_hstr        0
    cv_ba_credits_l1_2011            0
    cv_bio_child_hh_2017             0
    cv_bio_child_nr_2017             0
    hrsnight_r_sleeps_2017           0
    cvc_wkswk_yr_all_l99             0
    dtype: float64
    
    
    nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)
    

The process is fairly similar when working with SAS data files, as the next
step illustrates.

 11. Retrieve the SAS data, using the SAS catalog file for value labels:

The value labels for SAS are stored in a separate catalog file. Setting the
catalog file path and filename retrieves the value labels and applies them:

nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
nls97sas.columns = metasas.column_labels
nls97sas.columns = nls97sas.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97sas.head()


   pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
0                        1             Female  ...
1                        2               Male  ...
2                        3             Female  ...
3                        4             Female  ...
4                        5               Male  ...
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0                     nan                    52
1                       6                     0
2                       6                     0
3                       6                     4
4                       5                    12
[5 rows x 42 columns]


nls97sas.keysex_symbol_1997.value_counts()


Male		4599
Female	4385
Name: keysex_symbol_1997, dtype: int64


nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)


This demonstrates how to import SPSS, SAS, and Stata data without losing
important metadata.


HOW IT WORKS...

The read_sav, read_dta, and read_sas7bdat methods of Pyreadstat, for SPSS,
Stata, and SAS data files, respectively, work in a similar manner. Value labels
can be applied when reading in the data by setting apply_value_formats to True
for SPSS and Stata files (Steps 5 and 8), or by providing a catalog file path
and filename for SAS (Step 11).

We can set formats_as_category to True to change the data type to category for
those columns where the data values will change. The meta object has the column
names and the column labels from the statistical package, so metadata column
labels can be assigned to pandas DataFrame column names at any point
(nls97spss.columns = metaspss.column_labels). We can even revert to the original
column headings after assigning meta column labels to them by setting pandas
column names to the metadata column names (nls97spss.columns =
metaspss.column_names).
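
A minimal sketch of that back-and-forth, assuming the same data/nls97.sav file used above:

import pyreadstat

# Read the SPSS file; the meta object carries both the original variable
# names and the longer descriptive labels.
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')

# Use the descriptive labels as column headings...
nls97spss.columns = metaspss.column_labels

# ...and switch back to the original SPSS variable names whenever needed.
nls97spss.columns = metaspss.column_names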

In Step 3, we looked at some of the SPSS data before applying value labels. We
looked at the dictionary for one variable
(metaspss.variable_value_labels['R0536300']), but we could have viewed it for
all variables (metaspss.variable_value_labels). When we are satisfied that the
labels make sense, we can set them by calling the set_value_labels function.
This is a good approach when you do not know the data well and want to inspect
the labels before applying them.
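
A compact sketch of that inspect-then-apply workflow, again assuming the data/nls97.sav file:

import pyreadstat

# Read the data without applying value labels yet.
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')

# Inspect the value labels for one variable, or for all of them.
print(metaspss.variable_value_labels['R0536300'])
# print(metaspss.variable_value_labels)

# Once the labels look right, apply them and convert the affected
# columns to the category data type.
nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss,
                                        formats_as_category=True)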

The column labels from the meta object are often a better choice than the
original column headings. Column headings can be quite cryptic, particularly
when the SPSS, Stata, or SAS file is based on a large survey, as in this
example. However, the labels are not usually ideal for column headings either.
They sometimes have spaces, capitalization that is not helpful, and
non-alphanumeric characters. We chain some string operations to switch to
lowercase, replace spaces with underscores, and remove non-alphanumeric
characters.
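
Because the same chain of string operations is repeated for the SPSS, Stata, and SAS DataFrames above, you might factor it into a small helper. This is just a sketch, not code from the recipe:

import pandas as pd

def clean_column_labels(df: pd.DataFrame) -> pd.DataFrame:
    # Lowercase the column labels, replace spaces with underscores,
    # and strip any remaining non-alphanumeric characters.
    df.columns = df.columns.\
        str.lower().\
        str.replace(' ', '_').\
        str.replace('[^a-z0-9_]', '', regex=True)
    return df

You could then write nls97spss = clean_column_labels(nls97spss) and reuse the same helper for the Stata and SAS DataFrames.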

Handling missing values is not always straightforward with these data files,
since there are often many reasons why data is missing. If the file is from a
survey, the missing value may be because of a survey skip pattern, or a
respondent failed to respond, or the response was invalid, and so on. The NLS
has nine possible values for missing, from –1 to –9. The SPSS import
automatically set those values to NaN, while the Stata import retained the
original values. (We could have gotten the SPSS import to retain those values by
setting user_missing to True.) For the Stata data, we need to tell it to replace
all values from –1 to –9 with NaN. We do this by using the DataFrame’s replace
function and passing it a list of integers from –9 to –1 (list(range(-9,0))).
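
As a sketch of the two approaches to the NLS missing-value codes described above (assuming the nls97 files used in this recipe):

import numpy as np
import pyreadstat

# SPSS: keep the -1 to -9 missing-value codes instead of letting
# pyreadstat convert them to NaN on import.
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav',
                                          user_missing=True)

# Stata: the codes come through as ordinary values, so replace
# everything from -9 to -1 with NaN ourselves.
nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta')
nls97stata.replace(list(range(-9, 0)), np.nan, inplace=True)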


THERE’S MORE…

You may have noticed similarities between this recipe and the previous one in
terms of how value labels are set. The set_value_labels function is like the
DataFrame replace operation we used to set value labels in that recipe. We
passed a dictionary to replace that mapped columns to value labels. The
set_value_labels function in this recipe essentially does the same thing, using
the variable_value_labels property of the meta object as the dictionary.
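
To make the parallel concrete, here is a sketch of the roughly equivalent replace call, again using the nls97.sav file:

import pyreadstat

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')

# variable_value_labels maps each column name to a {value: label}
# dictionary, which is the nested-dictionary form that
# DataFrame.replace accepts.
nls97labeled = nls97spss.replace(metaspss.variable_value_labels)

Unlike set_value_labels, this does not convert the affected columns to the category data type, so a separate astype('category') step would still be needed.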

Data from statistical packages is often not as well structured as SQL databases
tend to be in one significant way. Since they are designed to facilitate
analysis, they often violate database normalization rules. There is often an
implied relational structure that might have to be unflattened at some point.
For example, the data may combine individual and event-level data – a person and
hospital visits, a brown bear and the date it emerged from hibernation. Often,
this data will need to be reshaped for some aspects of the analysis.


SEE ALSO

The pyreadstat package is nicely documented at
https://github.com/Roche/pyreadstat. The package has many useful options for
selecting columns and handling missing data that space did not permit me to
demonstrate in this recipe. In Chapter 11, Tidying and Reshaping Data, we will
examine how to normalize data that may have been flattened for analytical
purposes.


IMPORTING R DATA

We will use pyreadr to read an R data file into pandas. Since pyreadr cannot
capture the metadata, we will write code to reconstruct value labels (analogous
to R factors) and column headings. This is similar to what we did in the
Importing data from SQL databases recipe.

The R statistical package is, in many ways, similar to the combination of Python
and pandas, at least in its scope. Both have strong tools across a range of data
preparation and data analysis tasks. Some data scientists work with both R and
Python, perhaps doing data manipulation in Python and statistical analysis in R,
or vice versa, depending on their preferred packages. However, there is
currently a scarcity of tools for reading data saved in R, as rds or rdata
files, into Python. The analyst often saves the data as a CSV file first and
then loads it into Python. We will use pyreadr, from the same author as
pyreadstat, because it does not require an installation of R.

When we receive an R file, or work with one we have created ourselves, we can
count on it being fairly well structured, at least compared to CSV or Excel
files. Each column will have only one data type, column headings will have
appropriate names for Python variables, and all rows will have the same
structure. However, we may need to restore some of the coding logic, as we did
when working with SQL data.


GETTING READY

This recipe assumes you have installed the pyreadr package. If it is not
installed, you can install it with pip. From the Terminal, or PowerShell (in
Windows), enter pip install pyreadr.

We will again work with the NLS in this recipe. You will need to download the
rds file used in this recipe from the GitHub repository in order to run the
code.


HOW TO DO IT…

We will import data from R without losing important metadata:

 1. Load pandas, numpy, pprint, and the pyreadr package:
    
    import pandas as pd
    import numpy as np
    import pyreadr
    import pprint
    

 2. Get the R data.

Pass the path and filename to the read_r method to retrieve the R data, and load
it into memory as a pandas DataFrame. read_r can return one or more objects.
When reading an rds file (as opposed to an rdata file), it will return one
object, having the key None. We indicate None to get the pandas DataFrame:

nls97r = pyreadr.read_r('data/nls97.rds')[None]
nls97r.dtypes


R0000100	int32
R0536300	int32
...
U2962800	int32
U2962900	int32
U2963000	int32
Z9063900	int32
dtype: object


nls97r.head(10)


     R0000100  R0536300  ...      U2963000    Z9063900
0	1	  2         ...      -5          52
1	2	  1         ...       6          0
2	3	  2         ...       6          0
3	4	  2         ...       6          4
4	5	  1         ...       5          12
5	6	  2         ...       6          6
6	7	  1         ...      -5          0
7	8	  2         ...      -5          39
8	9	  1         ...       4          0
9	10	  1         ...       6          0
[10 rows x 42 columns]


 3. Set up dictionaries for value labels and column headings.

Load a dictionary that maps columns to the value labels and create a list of
preferred column names as follows:

with open('data/nlscodes.txt', 'r') as reader:
...     setvalues = eval(reader.read())
...
pprint.pprint(setvalues)


{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
 'S8646900': {1.0: '1. Definitely',
              2.0: '2. Probably ',
              3.0: '3. Probably not',
              4.0: '4. Definitely not'}}
...abbreviated to save space
newcols = ['personid','gender','birthmonth',
...   'birthyear','sampletype','category',
...   'satverbal','satmath','gpaoverall',
...   'gpaeng','gpamath','gpascience','govjobs',
...   'govprices','govhealth','goveld','govind',
...   'govunemp','govinc','govcollege',
...   'govhousing','govenvironment','bacredits',
...   'coltype1','coltype2','coltype3','coltype4',
...   'coltype5','coltype6','highestgrade',
...   'maritalstatus','childnumhome','childnumaway',
...   'degreecol1','degreecol2','degreecol3',
...   'degreecol4','wageincome','weeklyhrscomputer',
...   'weeklyhrstv','nightlyhrssleep',
...   'weeksworkedlastyear']


 4. Set value labels and missing values, and change selected columns to the
    category data type.

Use the setvalues dictionary to replace existing values with value labels.
Replace all values from –9 to –1 with NaN:

nls97r.replace(setvalues, inplace=True)
nls97r.head()


     R0000100   R0536300  ...     U2963000  Z9063900
0    1          Female    ...     -5        52
1    2          Male      ...     6         0
2    3          Female    ...     6         0
3    4          Female    ...     6         4
4    5          Male      ...     5         12
[5 rows x 42 columns]


nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
for col in nls97r[[k for k in setvalues]].columns:
...     nls97r[col] = nls97r[col].astype('category')
...
nls97r.dtypes


R0000100	int64
R0536300	category
R0536401	int64
R0536402	int64
R1235800	category
              ... 
U2857300	category
U2962800	category
U2962900	category
U2963000	float64
Z9063900	float64
Length: 42, dtype: object


 5. Set meaningful column headings:
    
    nls97r.columns = newcols
    nls97r.dtypes
    
    
    personid	int64
    gender	category
    birthmonth	int64
    birthyear	int64
    sampletype	category
                             ... 
    wageincome	category
    weeklyhrscomputer	category
    weeklyhrstv	category
    nightlyhrssleep	float64
    weeksworkedlastyear	float64
    Length: 42, dtype: object
    

This shows how R data files can be imported into pandas and value labels
assigned.


HOW IT WORKS…

Reading R data into pandas with pyreadr is fairly straightforward. Passing a
filename to the read_r function is all that is required. Since read_r can return
multiple objects with one call, we need to specify which object. When reading an
rds file (as opposed to an rdata file), only one object is returned. It has the
key None.

In Step 3, we loaded a dictionary that maps our variables to value labels, and a
list for our preferred column headings. In Step 4 we applied the value labels.
We also changed the data type to category for the columns where we applied the
values. We did this by generating a list of the keys in our setvalues dictionary
with [k for k in setvalues] and then iterating over those columns.
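
As an aside, the same conversion can be written without an explicit loop. This sketch assumes the nls97r DataFrame and setvalues dictionary from Steps 2 and 3:

# Convert every column that appears in the setvalues dictionary to the
# category data type in a single assignment; list(setvalues) produces
# the same column names as [k for k in setvalues].
labeled_cols = list(setvalues)
nls97r[labeled_cols] = nls97r[labeled_cols].astype('category')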

We change the column headings in Step 5 to ones that are more intuitive. Note
that the order matters here. We need to set the value labels before changing the
column names, since the setvalues dictionary is based on the original column
headings.

The main advantage of using pyreadr to read R files directly into pandas is that
we do not have to convert the R data into a CSV file first. Once we have written
our Python code to read the file, we can just rerun it whenever the R data
changes. This is particularly helpful when we do not have R on the machine where
we work.


THERE’S MORE…

Pyreadr is able to return multiple DataFrames. This is useful when we save
several data objects in R as an rdata file. We can return all of them with one
call.
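
For instance, if several data frames were saved together in an rdata file, something like the following sketch would retrieve them all; the filename here is hypothetical:

import pyreadr

# read_r returns an ordered dictionary keyed by the R object names
# (or by None when reading a single-object rds file).
results = pyreadr.read_r('data/nlsobjects.rdata')  # hypothetical rdata file

for name, df in results.items():
    print(name, df.shape)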

Pprint is a handy tool for improving the display of Python dictionaries.

We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that
R also be installed, but it is more powerful than pyreadr. It will read R
factors and automatically set them to pandas DataFrame values. See the following
code:

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
nls97withvalues = readRDS('data/nls97withvalues.rds')
nls97withvalues


          R0000100      R0536300     ...    U2963000         Z9063900
1         1             Female       ...    -2147483648      52
2         2             Male         ...    6                0
3         3             Female       ...    6                0
4         4             Female       ...    6                4
5         5             Male         ...    5                12
...       ...           ...          ...    ...             ...
8980     9018           Female       ...    4                49
8981     9019           Male         ...    6                0
8982     9020           Male         ...    -2147483648      15
8983     9021           Male         ...    7                50
8984     9022           Female       ...    7                20
[8984 rows x 42 columns]


This generates unusual –2147483648 values, which is how readRDS represents
missing data in numeric columns. After confirming that –2147483648 is not a
valid value in the data, a global replacement of that number with NaN would be
a good next step.
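
That cleanup might look like this sketch, applied to the nls97withvalues DataFrame returned by the rpy2 code above:

import numpy as np

# Treat the sentinel that readRDS produced for missing numeric values
# as NaN; this upcasts the affected integer columns to float.
nls97withvalues.replace(-2147483648, np.nan, inplace=True)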


SEE ALSO

Clear instructions and examples for pyreadr are available at
https://github.com/ofajardo/pyreadr.

Feather files, a relatively new format, can be read by both R and Python. I
discuss those files in the next recipe.


PERSISTING TABULAR DATA

We persist data, copy it from memory to local or remote storage, for several
reasons: to be able to access the data without having to repeat the steps we
used to generate it, to share the data with others, or to make it available for
use with different software. In this recipe, we save data that we have loaded
into a pandas DataFrame as different file types (CSV, Excel, Pickle, and
Feather).

Another important, but sometimes overlooked, reason to persist data is to
preserve some segment of our data that needs to be examined more closely;
perhaps it needs to be scrutinized by others before our analysis can be
completed. For analysts who work with operational data in medium- to large-sized
organizations, this process is part of the daily data-cleaning workflow.

In addition to these reasons for persisting data, our decisions about when and
how to serialize data are shaped by several other factors: where we are in terms
of our data analysis projects, the hardware and software resources of the
machine(s) saving and reloading the data, and the size of our dataset. Analysts
end up having to be much more intentional when saving data than they are when
pressing Ctrl + S in their word-processing application.

Once we persist data, it is stored separately from the logic that we used to
create it. I find this to be one of the most important threats to the integrity
of our analysis. Often, we end up loading data that we saved some time in the
past (a week ago? A month ago? A year ago?) and forget how a variable was
defined and how it relates to other variables. If we are in the middle of a
data-cleaning task, it is best not to persist our data, so long as our
workstation and network can easily handle the burden of regenerating the data.
It is a good idea to persist data only once we have reached milestones in our
work.

Beyond the question of when to persist data, there is the question of how. If we
are persisting it for our own reuse with the same software, it is best to save
it in a binary format native to that software. That is pretty straightforward
for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that
is good news in a way. We have lots of choices, from CSV and Excel to Pickle and
Feather. We save as all these file types in this recipe.

Note

Pickle and Feather are binary file formats that can be used to store pandas
DataFrames.


GETTING READY

You will need to install Feather if you do not have it on your system. You can
do that by entering pip install pyarrow in a Terminal window or PowerShell (in
Windows). If you do not already have a subfolder named Views in your chapter 1
folder, you will need to create it in order to run the code for this recipe.

Data note

This dataset, taken from the Global Historical Climatology Network integrated
database, is made available for public use by the United States National Oceanic
and Atmospheric Administration at
https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly.
I used the data from version 4. The data in this recipe uses a 100,000-row
sample of the full dataset, which is also available in the repository.


HOW TO DO IT…

We will load a CSV file into pandas and then save it as a Pickle and a Feather
file. We will also save subsets of the data to the CSV and Excel formats:

 1. Import pandas and pyarrow.

pyarrow needs to be imported in order to save pandas to Feather:

import pandas as pd
import pyarrow


 2. Load the land temperatures CSV file into pandas, drop rows with missing
    data, and set an index:
    
    landtemps = \
    ...   pd.read_csv('data/landtempssample.csv',
    ...     names=['stationid','year','month','avgtemp',
    ...       'latitude','longitude','elevation',
    ...       'station','countryid','country'],
    ...     skiprows=1,
    ...     parse_dates=[['month','year']],
    ...     low_memory=False)
    landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
    landtemps.dropna(subset=['avgtemp'], inplace=True)
    landtemps.dtypes
    
    
    measuredate	datetime64[ns]
    stationid	object
    avgtemp	float64
    latitude	float64
    longitude	float64
    elevation	float64
    station	object
    countryid	object
    country	object
    dtype: object
    
    
    landtemps.set_index(['measuredate','stationid'], inplace=True)
    

 3. Write extreme values for temperature to CSV and Excel files.

Use the quantile method to select outlier rows, which are those at the 1 in
1,000 level at each end of the distribution:

extremevals = landtemps[(landtemps.avgtemp < landtemps.avgtemp.quantile(.001)) | (landtemps.avgtemp > landtemps.avgtemp.quantile(.999))]
extremevals.shape


(171, 7)


extremevals.sample(7)


                           avgtemp  ...   country
measuredate  stationid              ...       
2013-08-01	QAM00041170	35.30    ...	Qatar
2005-01-01	RSM00024966	-40.09   ...	Russia
1973-03-01	CA002401200	-40.26   ...	Canada
2007-06-01	KU000405820	37.35    ...	Kuwait
1987-07-01	SUM00062700	35.50    ...	Sudan
1998-02-01	RSM00025325	-35.71   ...	Russia
1968-12-01	RSM00024329	-43.20   ...	Russia
[7 rows x 7 columns]


extremevals.to_excel('views/tempext.xlsx')
extremevals.to_csv('views/tempext.csv')


 4. Save to Pickle and Feather files.

The index needs to be reset in order to save a Feather file:

landtemps.to_pickle('data/landtemps.pkl')
landtemps.reset_index(inplace=True)
landtemps.to_feather("data/landtemps.ftr")


 5. Load the Pickle and Feather files we just saved.

Note that our index was preserved when saving and loading the Pickle file:

landtemps = pd.read_pickle('data/landtemps.pkl')
landtemps.head(2).T


measuredate	2000-04-01	1940-05-01
stationid	USS0010K01S	CI000085406
avgtemp	5.27		18.04
latitude	39.90		-18.35
longitude	-110.75		-70.33
elevation	2,773.70	58.00
station	INDIAN_CANYON	ARICA
countryid	US		CI
country	United States	Chile


landtemps = pd.read_feather("data/landtemps.ftr")
landtemps.head(2).T


                               0                    1
measuredate	2000-04-01 00:00:00	1940-05-01 00:00:00
stationid	USS0010K01S		CI000085406
avgtemp	5.27			18.04
latitude	39.90			-18.35
longitude	-110.75			-70.33
elevation	2,773.70		58.00
station	INDIAN_CANYON		ARICA
countryid	US			CI
country	United States		Chile


The previous steps demonstrated how to serialize pandas DataFrames using two
different formats, Pickle and Feather.


HOW IT WORKS...

Persisting pandas data is quite straightforward. DataFrames have the to_csv,
to_excel, to_pickle, and to_feather methods. Pickling preserves our index.
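
A minimal round-trip sketch with a toy DataFrame (rather than the land temperatures data) shows the index behavior just described; it assumes pyarrow is installed and a views subfolder exists, as in the Getting ready section:

import pandas as pd

# A toy DataFrame with a non-default index.
df = pd.DataFrame({'station': ['A', 'B'], 'avgtemp': [5.3, 18.0]})
df = df.set_index('station')

# Pickle round-trips the index unchanged.
df.to_pickle('views/tiny.pkl')
print(pd.read_pickle('views/tiny.pkl').index.name)   # station

# Feather needs a default RangeIndex, so reset it before saving.
df.reset_index().to_feather('views/tiny.ftr')
print(pd.read_feather('views/tiny.ftr').head())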


THERE’S MORE...

The advantage of storing data in CSV files is that saving it uses up very little
additional memory. The disadvantage is that writing CSV files is quite slow, and
we lose important metadata, such as data types. (read_csv can often figure out
the data type when we reload the file, but not always.) Pickle files keep that
data but can burden a system that is low on resources when serializing. Feather
is easier on resources and can be easily loaded in R as well as Python, but we
have to sacrifice our index in order to serialize. Also, the authors of Feather
make no promises regarding long-term support.
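
One quick way to see the metadata loss described above is to round-trip a small DataFrame through CSV and Pickle and compare the dtypes; a sketch with hypothetical file paths:

import pandas as pd

df = pd.DataFrame({
    'measuredate': pd.to_datetime(['2000-04-01', '1940-05-01']),
    'country': pd.Categorical(['United States', 'Chile']),
    'avgtemp': [5.27, 18.04]})

df.to_csv('views/check.csv', index=False)
df.to_pickle('views/check.pkl')

# CSV loses the datetime and category data types (both come back as object),
# while Pickle preserves them.
print(pd.read_csv('views/check.csv').dtypes)
print(pd.read_pickle('views/check.pkl').dtypes)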

You may have noticed that I do not make a global recommendation about what to
use for data serialization – other than to limit your persistence of full
datasets to project milestones. This is definitely one of those “right tools for
the right job” kind of situations. I use CSV or Excel files when I want to share
a segment of a file with colleagues for discussion. I use Feather for ongoing
Python projects, particularly when I am using a machine with sub-par RAM and an
outdated chip and also using R. When I am wrapping up a project, I pickle the
DataFrames.


SUMMARY

Our Python data projects typically start with raw data stored in a range of
formats and exported from a variety of software tools. Among the most popular
tabular formats and tools are CSV and Excel files, SQL tables, and SPSS, Stata,
SAS, and R datasets. We converted data from all of these sources into a pandas
DataFrame in this chapter, and addressed the most common challenges. We also
explored approaches to persisting tabular data. We will work with data in other
formats in the next chapter.


JOIN OUR COMMUNITY ON DISCORD

Join our community’s Discord space for discussions with the author and other
readers:

https://discord.gg/p8uSgEAETX





KEY BENEFITS

 * Get to grips with new techniques for data preprocessing and cleaning for
   machine learning and NLP models
 * Use new and updated AI tools and techniques for data cleaning tasks
 * Clean, monitor, and validate large data volumes to diagnose problems using
   cutting-edge methodologies, including machine learning and AI


DESCRIPTION

Jumping into data analysis without proper data cleaning will certainly lead to
incorrect results. The Python Data Cleaning Cookbook - Second Edition will show
you tools and techniques for cleaning and handling data with Python for better
outcomes. Fully updated to the latest version of Python and all relevant tools,
this book will teach you how to manipulate and clean data to get it into a
useful form. The current edition focuses on advanced techniques like machine
learning and AI-specific approaches and tools for data cleaning along with the
conventional ones. The book also delves into tips and techniques to process and
clean data for ML, AI, and NLP models. You will learn how to filter and
summarize data to gain insights and better understand what makes sense and what
does not, along with discovering how to operate on data to address the issues
you've identified. Next, you’ll cover recipes for using supervised learning and
Naive Bayes analysis to identify unexpected values and classification errors and
generate visualizations for exploratory data analysis (EDA) to identify
unexpected values. Finally, you’ll build functions and classes that you can
reuse without modification when you have new data. By the end of this Data
Cleaning book, you'll know how to clean data and diagnose problems within it.


WHAT YOU WILL LEARN

 * Using OpenAI tools for various data cleaning tasks
 * Producing summaries of the attributes of datasets, columns, and rows
 * Anticipating data-cleaning issues when importing tabular data into pandas
 * Applying validation techniques for imported tabular data
 * Improving your productivity in pandas by using method chaining
 * Recognizing and resolving common issues like dates and IDs
 * Setting up indexes to streamline data issue identification
 * Using data cleaning to prepare your data for ML and AI models


PRODUCT DETAILS

Publication date : May 31, 2024
Length : 486 pages
Edition : 2nd Edition
Language : English
ISBN-13 : 9781803239873
Category : Data
Languages : Python
Concepts : Data Analysis




TABLE OF CONTENTS

14 Chapters

Preface
  New in the Second Edition
  Who this book is for
  What this book covers
  Get in touch
1. Anticipating Data Cleaning Issues When Importing Tabular Data with pandas
  Technical requirements
  Importing CSV files
  Importing Excel files
  Importing data from SQL databases
  Importing SPSS, Stata, and SAS data
  Importing R data
  Persisting tabular data
  Summary
2. Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data
  Technical requirements
  Importing simple JSON data
  Importing more complicated JSON data from an API
  Importing data from web pages
  Working with Spark data
  Persisting JSON data
  Versioning data
  Summary
3. Taking the Measure of Your Data
  Technical requirements
  Getting a first look at your data
  Selecting and organizing columns
  Selecting rows
  Generating frequencies for categorical variables
  Generating summary statistics for continuous variables
  Using generative AI to display descriptive statistics
  Summary
4. Identifying Outliers in Subsets of Data
  Technical requirements
  Identifying outliers with one variable
  Identifying outliers and unexpected values in bivariate relationships
  Using subsetting to examine logical inconsistencies in variable relationships
  Using linear regression to identify data points with significant influence
  Using k-nearest neighbors to find outliers
  Using Isolation Forest to find anomalies
  Using PandasAI to identify outliers
  Summary
5. Using Visualizations for the Identification of Unexpected Values
  Technical requirements
  Using histograms to examine the distribution of continuous variables
  Using boxplots to identify outliers for continuous variables
  Using grouped boxplots to uncover unexpected values in a particular group
  Examining both distribution shape and outliers with violin plots
  Using scatter plots to view bivariate relationships
  Using line plots to examine trends in continuous variables
  Generating a heat map based on a correlation matrix
  Summary
6. Cleaning and Exploring Data with Series Operations
  Technical requirements
  Getting values from a pandas Series
  Showing summary statistics for a pandas Series
  Changing Series values
  Changing Series values conditionally
  Evaluating and cleaning string Series data
  Working with dates
  Using OpenAI for Series operations
  Summary
7. Identifying and Fixing Missing Values
  Technical requirements
  Identifying missing values
  Cleaning missing values
  Imputing values with regression
  Using k-nearest neighbors for imputation
  Using random forest for imputation
  Using PandasAI for imputation
  Summary
8. Encoding, Transforming, and Scaling Features
  Technical requirements
  Creating training datasets and avoiding data leakage
  Removing redundant or unhelpful features
  Encoding categorical features: one-hot encoding
  Encoding categorical features: ordinal encoding
  Encoding categorical features with medium or high cardinality
  Using mathematical transformations
  Feature binning: equal width and equal frequency
  k-means binning
  Feature scaling
  Summary
9. Fixing Messy Data When Aggregating
  Technical requirements
  Looping through data with itertuples (an anti-pattern)
  Calculating summaries by group with NumPy arrays
  Using groupby to organize data by groups
  Using more complicated aggregation functions with groupby
  Using user-defined functions and apply with groupby
  Using groupby to change the unit of analysis of a DataFrame
  Using pivot_table to change the unit of analysis of a DataFrame
  Summary
10. Addressing Data Issues When Combining DataFrames
  Technical requirements
  Combining DataFrames vertically
  Doing one-to-one merges
  Doing one-to-one merges by multiple columns
  Doing one-to-many merges
  Doing many-to-many merges
  Developing a merge routine
  Summary
11. Tidying and Reshaping Data
  Technical requirements
  Removing duplicated rows
  Fixing many-to-many relationships
  Using stack and melt to reshape data from wide to long format
  Melting multiple groups of columns
  Using unstack and pivot to reshape data from long to wide format
  Summary
12. Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines
  Technical requirements
  Functions for getting a first look at our data
  Functions for displaying summary statistics and frequencies
  Functions for identifying outliers and unexpected values
  Functions for aggregating or combining data
  Classes that contain the logic for updating Series values
  Classes that handle non-tabular data structures
  Functions for checking overall data quality
  Pre-processing data with pipelines: a simple example
  Pre-processing data with pipelines: a more complicated example
  Summary
13. Index




CUSTOMER REVIEWS


Rating: 5.0 out of 5 (2 ratings)
5 star: 100%, 4 star: 0%, 3 star: 0%, 2 star: 0%, 1 star: 0%

Bryan Edwards, Jul 29, 2024: 5 stars
Great book - the author does a great job explaining the various concepts, and
the examples are very helpful
Feefo Verified review

N/A, Jul 27, 2024: 5 stars
Feefo Verified review


ABOUT THE AUTHOR

Michael Walker

Michael Walker has worked as a data analyst for over 30 years at a variety of
educational institutions. He is currently the CIO at College Unbound in
Providence, Rhode Island, in the United States. He has also taught data science,
research methods, statistics, and computer programming to undergraduates since
2006.


