Introduction to DataFrame
Pandas is a popular library in the Python data science community. It is built on top of two other packages: NumPy, which provides multidimensional array objects for easy data manipulation and which pandas uses to store data, and Matplotlib, which pandas uses to visualize data.
Rectangular data, also known as tabular data, is the most common form for data analysis and is represented as a DataFrame object in pandas or a table in SQL.
Example Rectangular Data:
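As a minimal sketch of what rectangular data looks like, the block below builds a tiny DataFrame by hand. The column names mirror those used in the homelessness examples in this article, but the variable name `example`, the state names, and all numbers are invented purely for illustration.

```python
import pandas as pd

# Made-up values for illustration only; these are not real homelessness figures.
example = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific"],
        "state": ["State A", "State B", "State C"],
        "individuals": [12000, 3000, 800],
        "family_members": [900, 1500, 400],
        "state_pop": [5000000, 2000000, 700000],
    }
)

# Each row is one observation (a state); each column is one variable.
print(example)
```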
Inspecting a DataFrame
.head(n=5): Returns the first n rows (the “head” of the DataFrame); the default value of n is 5.
.info(): Prints a concise summary of a DataFrame, showing information on each column such as the data type and the number of non-missing values.
.shape: Returns the number of rows and columns of the DataFrame as a tuple.
.describe(): Generates descriptive statistics. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
# (info() prints the summary itself and returns None, so no print() is needed)
homelessness.info()

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())
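To see what these inspection methods report, here is a small self-contained sketch. The DataFrame `toy` and its numbers are hypothetical and include one missing value on purpose, so that the effect on `.describe()` is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (invented numbers) with one missing value.
toy = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C", "State D"],
        "individuals": [100.0, 200.0, np.nan, 400.0],
    }
)

print(toy.head(2))      # first 2 rows instead of the default 5
print(toy.shape)        # (4, 2): 4 rows, 2 columns
toy.info()              # prints its own summary and returns None
stats = toy.describe()  # by default only numeric columns are summarized
print(stats)            # the count for individuals is 3, since NaN is excluded
```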
Parts of a DataFrame
.values: Returns the data as a two-dimensional NumPy array.
.columns: Returns the column labels of the DataFrame.
.index: Returns the index (row labels) of the DataFrame.
# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness
print(homelessness.values)

# Print the column index of homelessness
print(homelessness.columns)

# Print the row index of homelessness
print(homelessness.index)
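The three components can be pulled apart on a small hand-built DataFrame. This is only a sketch; `toy` and its contents are invented for illustration.

```python
import pandas as pd

# Hypothetical two-row DataFrame with invented values.
toy = pd.DataFrame(
    {"state": ["State A", "State B"], "individuals": [100, 200]}
)

values = toy.values          # 2-D NumPy array holding the data
columns = list(toy.columns)  # column labels
index = list(toy.index)      # row labels; 0, 1, ... when none are given

print(type(values).__name__, values.shape)
print(columns)
print(index)
```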
Sorting and subsetting
The simplest ways to find interesting parts of a DataFrame are sorting and subsetting:
Sorting: by sorting a DataFrame, we can bring the most interesting rows to the top.
Subsetting: by extracting a subset of a larger dataset, we can select specific columns and filter rows with a logical condition.
We can use the sort_values() method to sort the rows of a pandas DataFrame.
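| Sort on … | Syntax |
| one column | df.sort_values("col_a") |
| multiple columns | df.sort_values(["col_a", "col_b"]) |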
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values(['individuals'])

# Print the top few rows
print(homelessness_ind.head())

# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values(['family_members'], ascending=False)

# Print the top few rows
print(homelessness_fam.head())

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())
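The following sketch shows how a multi-column sort behaves on a small invented DataFrame (`toy` is hypothetical). It also illustrates that sort_values() returns a new DataFrame by default and leaves the original row order untouched.

```python
import pandas as pd

# Hypothetical data with invented values.
toy = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "family_members": [900, 1500, 400, 300],
    }
)

# Region ascending (A to Z), then family_members descending within each region.
sorted_toy = toy.sort_values(["region", "family_members"], ascending=[True, False])
print(sorted_toy)

# The original DataFrame keeps its row order.
print(toy)
```

The original row labels travel with the rows, so the index of `sorted_toy` is no longer 0, 1, 2, 3.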
We can use square brackets ([]) to select specific columns that make sense to us.
# Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result
print(individuals.head())

# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]

# Print the head of the result
print(state_fam.head())

# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]

# Print the head of the result
print(ind_state.head())
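One detail worth checking for yourself is the difference between single and double square brackets. In the sketch below (using a hypothetical `toy` DataFrame with invented values), a single column name returns a Series, while a list of column names returns a DataFrame whose columns follow the order of the list.

```python
import pandas as pd

# Hypothetical data with invented values.
toy = pd.DataFrame(
    {
        "state": ["State A", "State B"],
        "individuals": [100, 200],
        "family_members": [10, 20],
    }
)

one_col = toy["individuals"]              # single brackets: a Series
one_col_df = toy[["individuals"]]         # list of one name: a one-column DataFrame
two_cols = toy[["individuals", "state"]]  # column order follows the list

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(list(two_cols.columns))
```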
Subsetting rows with a logical condition is sometimes known as filtering rows or selecting rows.
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# See the result
print(ind_gt_10k)

# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == 'Mountain']

# See the result
print(mountain_reg)

# Filter for rows where family_members is less than 1000
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

# See the result
print(fam_lt_1k_pac)
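It can help to look at the condition on its own before using it as a filter. In this sketch (the `toy` DataFrame and its values are invented), each comparison produces a boolean Series, and `&` combines two of them element by element. The parentheses around each comparison are needed because `&` binds more tightly than `<` and `==` in Python.

```python
import pandas as pd

# Hypothetical data with invented values.
toy = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "family_members": [900, 1500, 400, 300],
    }
)

# A comparison yields one True/False value per row.
mask = toy["family_members"] < 1000
print(mask.tolist())

# Combine two conditions; a row is kept only where both are True.
combined = (toy["family_members"] < 1000) & (toy["region"] == "Pacific")
result = toy[combined]
print(result)
```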
Subsetting rows by categorical variables
When we are filtering rows we often use "or" (|) to select rows from multiple categories. This can get tedious when we want to filter on more than two categories; the .isin() method lets us pass a list of categories instead.
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness['region'].isin(['South Atlantic', 'Mid-Atlantic'])]

# See the result
print(south_mid_atlantic)

# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result
print(mojave_homelessness)
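As a quick check that the two approaches select the same rows, the sketch below filters a small invented DataFrame (`toy` is hypothetical) once with chained "or" conditions and once with .isin().

```python
import pandas as pd

# Hypothetical data with invented values.
toy = pd.DataFrame(
    {
        "region": ["South Atlantic", "Mid-Atlantic", "Pacific", "Mountain"],
        "individuals": [100, 200, 300, 400],
    }
)

# One comparison per category, joined with |
with_or = toy[(toy["region"] == "South Atlantic") | (toy["region"] == "Mid-Atlantic")]

# The same filter expressed with a list of categories
with_isin = toy[toy["region"].isin(["South Atlantic", "Mid-Atlantic"])]

print(with_or.equals(with_isin))
```

With .isin(), adding a third or fourth category only means extending the list.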
Adding new columns
Adding a column to a DataFrame in pandas involves creating a new column and assigning values to it. This process is commonly referred to as "mutating," "transforming," or "feature engineering."
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of total that are individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']

# See the result
print(homelessness)

# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness['state_pop']

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values(['indiv_per_10k'], ascending=[False])

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state', 'indiv_per_10k']]

# See the result
print(result)
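To make the arithmetic concrete, here is the same add, filter, sort, and select sequence on a three-row DataFrame with invented numbers (`toy` and its values are hypothetical), small enough to check by hand.

```python
import pandas as pd

# Hypothetical data with invented values.
toy = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "individuals": [50, 300, 250],
        "state_pop": [100000, 100000, 50000],
    }
)

# New column: individuals per 10,000 residents (5.0, 30.0 and 50.0 here).
toy["indiv_per_10k"] = 10000 * toy["individuals"] / toy["state_pop"]

# Keep rows above 20, sort descending, then select two columns.
high = toy[toy["indiv_per_10k"] > 20]
high_sorted = high.sort_values("indiv_per_10k", ascending=False)
result = high_sorted[["state", "indiv_per_10k"]]
print(result)
```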