Getting to Know a Dataset

Based on DataCamp


Initial exploration

Exploratory Data Analysis (EDA)

The process of cleaning and reviewing data to...

  • derive insights, such as descriptive statistics and correlation

  • generate hypotheses for experiments


  • inform the next steps for the dataset

Pandas method for initial exploration

  • head

    We can use the head method to look at the top of a DataFrame and see which columns it contains and what the first few rows look like.

  • info

    The info method is a quick way to summarize the number of non-missing values in each column, the data type of each column, and memory usage.
  • value_counts

    A common question about categorical data is how many data points fall into each category. We can use value_counts to answer this question.

  • describe

    We can use describe to get summary statistics about our dataset.
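As a quick sketch, here is how these four methods behave on a small, made-up DataFrame (the column names and values are illustrative, not from the course dataset):

```python
import pandas as pd

# Small, made-up dataset for illustration
df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'continent': ['Asia', 'Asia', 'Europe', 'Europe', 'Europe', 'Africa'],
    'rate': [3.1, 4.2, 5.0, 6.3, 4.8, 7.1],
})

print(df.head())                       # first five rows
df.info()                              # non-null counts, dtypes, memory usage
print(df['continent'].value_counts())  # number of rows in each category
print(df.describe())                   # summary statistics for numeric columns
```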


Functions for initial exploration

# Print the first five rows of unemployment
print(unemployment.head())

# Print a summary of non-missing values and data types in the unemployment DataFrame
unemployment.info()

# Print summary statistics for numerical columns in unemployment
print(unemployment.describe())

Counting categorical values

# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())

Global unemployment in 2021

# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x='2021', binwidth=1)
plt.show()

Data validation

Data validation is an important early step in EDA: we need to understand whether data types and value ranges are as expected.

Validating data types

We can check the data type of each column using info or the dtypes attribute.

# Data types only
df.dtypes

Updating data types

df['year'] = df['year'].astype(int)
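A minimal, self-contained sketch of astype (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical 'year' column read in as strings
df = pd.DataFrame({'year': ['2019', '2020', '2021']})
print(df['year'].dtype)   # object (strings)

# Convert the strings to integers
df['year'] = df['year'].astype(int)
print(df['year'].dtype)
```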

Validating categorical data

We can validate categorical data by comparing the values in a column to a list of expected values using the isin method.

df['gender'].isin(['Male', 'Female']) # is in ['Male', 'Female']

~df['gender'].isin(['Male', 'Female']) # is not in ['Male', 'Female']
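For instance, on a hypothetical DataFrame containing one unexpected value, isin lets us split rows into valid and invalid categories:

```python
import pandas as pd

# Hypothetical data with one value outside the expected set
df = pd.DataFrame({'gender': ['Male', 'Female', 'Unknown', 'Female']})

valid = df['gender'].isin(['Male', 'Female'])
print(df[valid])    # rows matching the expected values
print(df[~valid])   # rows failing validation ('Unknown')
```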

Validating numeric data

df.select_dtypes('number') # keep numeric columns only

df['year'].min() # Min year
df['year'].max() # Max year
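Putting these together on a made-up DataFrame: select the numeric columns, then check that values fall in the expected range:

```python
import pandas as pd

# Made-up dataset mixing text and numeric columns
df = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'year': [2019, 2020, 2021],
    'rate': [4.5, 5.0, 6.1],
})

numeric = df.select_dtypes('number')   # numeric columns only
print(numeric.columns.tolist())        # ['year', 'rate']

# Range validation: min and max of the 'year' column
print(df['year'].min(), df['year'].max())
```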

Detecting data types

# Update the data type of the 2019 column to a float
unemployment["2019"] = unemployment['2019'].astype(float)
# Print the dtypes to check your work
print(unemployment.dtypes)

Validating continents

# Define a Series describing whether each continent is outside of Oceania
not_oceania = unemployment['continent'] != 'Oceania'

# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])

Validating range

# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(data=unemployment, x='continent', y='2021')
plt.show()

Data summarization

Exploring groups of data

  • .groupby() groups data by category

  • An aggregating function indicates how to summarize grouped data

Aggregating functions

  • .sum()

  • .count()

  • .min()

  • .max()

  • .var()

  • .std()

Aggregating ungrouped data

books.agg(['mean', 'std'])

Specifying aggregations for columns

books.agg({'rating': ['mean', 'std'], 'year': ['median']})
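Both forms can be sketched on a tiny, invented books DataFrame (the column names follow the examples above; the values are made up):

```python
import pandas as pd

# Invented data with the same columns as the examples above
books = pd.DataFrame({
    'rating': [4.0, 3.5, 4.5, 5.0],
    'year': [1999, 2005, 2010, 2018],
})

# Apply the same statistics to every column
print(books.agg(['mean', 'std']))

# Apply different statistics to different columns
print(books.agg({'rating': ['mean', 'std'], 'year': ['median']}))
```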

Named summary columns

books.groupby('genre').agg(
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std'),
    median_year=('year', 'median')
)
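A runnable sketch with invented data, assuming a books DataFrame with a genre column:

```python
import pandas as pd

# Invented books data with a 'genre' grouping column
books = pd.DataFrame({
    'genre': ['Fiction', 'Fiction', 'Nonfiction'],
    'rating': [4.0, 5.0, 3.0],
    'year': [2000, 2010, 2020],
})

# Named aggregations: each keyword becomes a column in the result
summary = books.groupby('genre').agg(
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std'),
    median_year=('year', 'median'),
)
print(summary)
```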

Visualizing categorical summaries

sns.barplot(x='genre', y='rating', data=books)

Summaries with .groupby() and .agg()

# Print the mean and standard deviation of rates by year
print(unemployment.agg(['mean', 'std']))

# Print yearly mean and standard deviation grouped by continent
print(unemployment.groupby('continent').agg(['mean', 'std']))

Named aggregations

continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=('2021', 'mean'),
    # Create the std_rate_2021 column
    std_rate_2021=('2021', 'std'),
)
print(continent_summary)

Visualizing categorical summaries

# Create a bar plot of continents and their average unemployment
sns.barplot(x='continent', y='2021', data=unemployment)
plt.show()