# amalhanaja  # Getting to Know a Dataset

## Initial exploration

### Exploratory Data Analysis (EDA)

The process of cleaning and reviewing data to:

• derive insights, such as descriptive statistics and correlations

• generate hypotheses for experiments

The results inform the next step for the dataset.
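As a small illustration of the "descriptive statistics and correlation" mentioned above, here is a minimal sketch. The `hours` and `score` columns and their values are hypothetical, invented only for this example.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 70],
})

# Descriptive statistics: count, mean, std, min, quartiles, max
stats = df.describe()

# Pairwise Pearson correlation between the numeric columns
corr = df.corr()

print(stats)
print(corr)
```

A strong correlation like the one in this toy data is the kind of observation that can be turned into a hypothesis to test later.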

### Pandas methods for initial exploration

• head

We can use the `head` method to look at the first rows of the DataFrame and see which columns the data contains.

``````
df.head()
``````
• info

The `info` method is a quick way to summarize the number of non-missing values in each column, the data type of each column, and memory usage.

``````
df.info()
``````
• value_counts

A common question about categorical data is how many data points we have in each category. We can use `value_counts` to answer it.

``````
df.value_counts('category')
``````
• describe

We can use `describe` to get summary statistics for the numeric columns of our dataset.

``````
df.describe()
``````
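To see the four methods side by side, here is a minimal sketch on a small DataFrame. The `books` data, its column names, and its values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
books = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "genre": ["Fiction", "Fiction", "Non Fiction",
              "Childrens", "Fiction", "Non Fiction"],
    "rating": [4.5, 3.9, 4.1, None, 4.8, 4.0],  # one missing rating
    "year": [2010, 2012, 2015, 2018, 2019, 2011],
})

# First rows of the DataFrame (five by default)
first_rows = books.head()

# Data types, non-null counts and memory usage.
# info() prints its report directly and returns None.
books.info()

# Number of rows per category, most frequent first
genre_counts = books["genre"].value_counts()

# count, mean, std, min, quartiles and max for the numeric columns
summary = books.describe()

print(first_rows)
print(genre_counts)
print(summary)
```

In the `describe` output, the `count` row for `rating` is one lower than for `year`, which is another way to spot the missing value that `info` reports.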

### Functions for initial exploration

``````
# 1. Print the first five rows of unemployment
print(unemployment.head())

# 2. Print a summary of non-missing values and data types in the unemployment DataFrame
# (info() prints its report itself and returns None, so it is not wrapped in print)
unemployment.info()

# 3. Print summary statistics for numerical columns in unemployment
print(unemployment.describe())
``````

### Counting categorical values

``````
# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())
``````

### Global unemployment in 2021

``````
# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
# (binwidth takes a number, so 1 means each bin spans one percentage point)
sns.histplot(data=unemployment, x='2021', binwidth=1)
plt.show()
``````

## Data validation

Data validation is an important early step in EDA: we need to confirm that data types and ranges are as expected.

### Validating data types

We can check the data type of each column with `info` or `dtypes`.

``````
df.info()

# Data types only
df.dtypes
``````

### Updating data types

``````
df['year'] = df['year'].astype(int)
``````
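A typical reason to update a data type is a numeric column that was read in as text. The sketch below assumes a hypothetical DataFrame whose `year` column arrived as strings; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical data: 'year' arrived as strings, e.g. from a file with quoted values
df = pd.DataFrame({"title": ["A", "B", "C"],
                   "year": ["2010", "2012", "2015"]})

# Before conversion pandas treats the values as text, not numbers
dtype_before = df["year"].dtype

# Convert the column to integers
df["year"] = df["year"].astype(int)
dtype_after = df["year"].dtype

print(dtype_before, "->", dtype_after)

# Numeric operations now behave as expected
year_span = df["year"].max() - df["year"].min()
print(year_span)
```

Note that `astype(int)` raises an error if any value cannot be parsed as an integer, which is itself a useful validation signal.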

### Validating categorical data

We can validate categorical data by comparing the values in a column to a list of expected values using the `isin` method.

``````
df['gender'].isin(['Male', 'Female'])  # True where the value is in ['Male', 'Female']

~df['gender'].isin(['Male', 'Female'])  # True where the value is not in ['Male', 'Female']
``````
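Because `isin` returns a boolean Series, it can be used directly to filter rows. The following is a minimal sketch on hypothetical data, where the unexpected values `'female'` and `'Unknown'` are planted on purpose.

```python
import pandas as pd

# Hypothetical data with two deliberately unexpected values
df = pd.DataFrame({"gender": ["Male", "Female", "female", "Male", "Unknown"]})
expected = ["Male", "Female"]

# Boolean Series: True where the value is one of the expected categories
is_valid = df["gender"].isin(expected)

# Rows that pass validation, and rows that need attention
valid_rows = df[is_valid]
unexpected_rows = df[~is_valid]

print(unexpected_rows)
```

The comparison is case-sensitive, so `'female'` is flagged alongside `'Unknown'`.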

### Validating numeric data

``````
df.select_dtypes('number')  # Keep only the numeric columns

df['year'].min()  # Min year
df['year'].max()  # Max year
``````
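Put together, a range check might look like the sketch below. The data and the accepted range of 2000 to 2025 are assumptions made up for this example, and `between` is just one convenient way to express the check.

```python
import pandas as pd

# Hypothetical data; 1015 is a deliberate data-entry error
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2010, 2012, 1015, 2019],
    "rating": [4.5, 3.9, 4.1, 4.8],
})

# Keep only the numeric columns, then inspect their ranges
numeric = df.select_dtypes("number")
ranges = numeric.agg(["min", "max"])
print(ranges)

# Flag rows whose year falls outside the assumed valid range (inclusive bounds)
out_of_range = df[~df["year"].between(2000, 2025)]
print(out_of_range)
```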

### Detecting data types

``````
# Update the data type of the 2019 column to a float
unemployment['2019'] = unemployment['2019'].astype(float)
# Print the dtypes to check your work
print(unemployment.dtypes)
``````

### Validating continents

``````
# 1. Define a Series describing whether each continent is outside of Oceania
not_oceania = unemployment['continent'] != 'Oceania'

# 2. Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])
``````

### Validating range

``````
# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(x='2021', y='continent', data=unemployment)
plt.show()
``````

## Data summarization

### Exploring groups of data

• `.groupby()` groups data by category

• An aggregating function indicates how to summarize the grouped data

### Aggregating functions

• `.sum()`

• `.count()`

• `.min()`

• `.max()`

• `.var()`

• `.std()`
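The two ideas combine in a single expression: group first, then aggregate. Here is a minimal sketch on hypothetical `books` data (names and values invented for illustration), using `.count()` from the list above and `.mean()`, which appears later in these notes.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Childrens"],
    "rating": [4.0, 5.0, 3.0, 4.0, 4.5],
})

# Group rows by genre, then summarize the rating column within each group
mean_by_genre = books.groupby("genre")["rating"].mean()
count_by_genre = books.groupby("genre")["rating"].count()

print(mean_by_genre)
print(count_by_genre)
```

The result is a Series indexed by the group labels, with one summary value per genre.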

### Aggregating ungrouped data

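``````
# Restrict to numeric columns first: recent pandas versions raise a TypeError
# when asked for the mean of a text column
books.select_dtypes('number').agg(['mean', 'std'])
``````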

### Specifying aggregations for columns

``````
books.agg({'rating': ['mean', 'std'], 'year': ['median']})
``````
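When a dictionary is passed to `.agg()`, each column gets only the aggregations listed for it, and the remaining cells of the result are filled with `NaN`. The sketch below uses hypothetical `books` data to show the shape of the output.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2014, 2020],
})

# Different aggregations per column; 'genre' is simply left out
summary = books.agg({"rating": ["mean", "std"], "year": ["median"]})
print(summary)

# Combinations that were not requested (e.g. the median rating) are NaN
median_rating_missing = pd.isna(summary.loc["median", "rating"])
```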

### Named summary columns

``````
books.groupby('genre').agg(
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std'),
    median_year=('year', 'median')
)
``````
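Each keyword argument in a named aggregation becomes a column in the result, and its tuple gives the source column and the function to apply. A minimal runnable sketch, again on hypothetical `books` data:

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2014, 2020],
})

# new_column_name=(source_column, aggregation)
summary = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median"),
)
print(summary)
```

Compared with passing a list or a dictionary, this form yields flat, descriptive column names rather than a two-level column index.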

### Visualizing categorical summaries

``````
sns.barplot(x='genre', y='rating', data=books)
plt.show()
``````

### Summaries with .groupby() and .agg()

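``````
# 1. Print the mean and standard deviation of rates by year
# (numeric columns only: recent pandas versions raise a TypeError on text columns)
print(unemployment.select_dtypes('number').agg(['mean', 'std']))

# 2. Print yearly mean and standard deviation grouped by continent
numeric_cols = unemployment.select_dtypes('number')
print(numeric_cols.groupby(unemployment['continent']).agg(['mean', 'std']))
``````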

### Named aggregations

``````
continent_summary = unemployment.groupby('continent').agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=('2021', 'mean'),
    # Create the std_rate_2021 column
    std_rate_2021=('2021', 'std'),
)
print(continent_summary)
``````

### Visualizing categorical summaries

``````
# Create a bar plot of continents and their average unemployment
sns.barplot(x='continent', y='2021', data=unemployment)
plt.show()
``````