Transforming DataFrame

Base on DataCamp

Play this article

Introduction to DataFrame

Pandas is a popular data science in the Python data science community. Pandas is built on top of NumPy which provides a multidimensional array of objects for easy data manipulation that pandas use to store data and Matplotlib to visualize our data.

Rectangular data, also known as tabular data, is the most common form for data analysis and is represented as a DataFrame object in pandas or a table in SQL.

Example Rectangular Data:

UsernameGenderEmailCreated At

Inspecting a DataFrame

  • .head(n=5): Returns the first n rows (the “head” of the DataFrame) the default value of n is 5

  • .info(): Print a concise summary of a DataFrame which shows the information on each of the columns, such as the data type and the number of missing values.

  • .shape: Returns the number of rows and columns of the DataFrame in Tuple.

  • .describe(): Generate descriptive statistics. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

# Print the head of the homelessness data

# Print information about homelessness

# Print the shape of homelessness

# Print a description of homelessness

Parts of a DataFrame

  • .values: Return a Numpy representation of the DataFrame. this is deprecated and according to documentation we have been suggested to use .to_numpy() method instead

  • .columns: Return the column labels of the DataFrame.

  • .index: Return The index (row labels) of the DataFrame.

# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness

# Print the column index of homelessness

# Print the row index of homelessness

Sorting and subsetting

The simplest ways to find interesting parts of the DataFrame

  1. Sorting, by sorting DataFrame we can find the most interesting data is at the top of DataFrame.

  2. Subsetting, by extracting a subset of data from a larger dataset we can select specific columns, and filter the dataset with the logical condition.

Sorting rows

We can use sort_values() method to sorting rows in pandas DataFrame.

Sort on …Syntax
one columndf.sort_values("breed")
multiple columnsdf.sort_values(["breed", "weight_kg"])
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values(['individuals'])

# Print the top few rows

# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values(['family_members'], ascending = False)

# Print the top few rows

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])

# Print the top few rows

Subsetting columns

We can use square brackets ([]) to select specific columns that make sense to us.

# Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result

# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]

# Print the head of the result

# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]

# Print the head of the result

Subsetting rows

This is sometimes known as filtering rows or selecting rows.

# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# See the result

# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == 'Mountain']

# See the result

# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

# See the result

Subsetting rows by categorical variables

When we are filtering rows we often use "or" (|) to select rows from multiple categories. This can get tedious when we want to filter more than two categories

# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness['region'].isin(['South Atlantic', 'Mid-Atlantic'])]

# See the result

# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result

New columns

Adding a column to a DataFrame in pandas involves creating a new column and assigning values to it. This process is commonly referred to as "mutating," "transforming," or "feature engineering."

Adding new columns

# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of total that are individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']

# See the result


# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness['state_pop']

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values(['indiv_per_10k'], ascending=[False])

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state', 'indiv_per_10k']]

# See the result