# amalhanaja

Plots are a powerful way to share the insights we've gained from our data.

• Histograms show the distribution of a numeric variable.

• Bar plots show relationships between a categorical variable and a numeric variable, such as gender and height.

• Line plots are great for visualizing changes in numeric variables over time.

• Scatter plots are great for visualizing relationships between two numeric variables.
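As a rough illustration of the four plot types above, the sketch below draws each one from a small made-up DataFrame. The data, the column names (`day`, `kind`, `price`, `sold`), and the 2×2 layout are assumptions chosen for the demonstration, not part of the avocado data set used in the rest of these notes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for illustration only
df = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=6, freq="D"),
    "kind": ["a", "b", "a", "b", "a", "b"],
    "price": [1.0, 1.5, 1.1, 1.6, 0.9, 1.4],
    "sold": [120, 80, 110, 70, 130, 90],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: distribution of one numeric variable
df["price"].plot(kind="hist", ax=axes[0, 0], title="Histogram of price")

# Bar plot: a numeric summary for each category
df.groupby("kind")["sold"].sum().plot(kind="bar", ax=axes[0, 1], title="Total sold by kind")

# Line plot: a numeric variable over time
df.groupby("day")["sold"].sum().plot(kind="line", ax=axes[1, 0], title="Sold over time")

# Scatter plot: relationship between two numeric variables
df.plot(x="sold", y="price", kind="scatter", ax=axes[1, 1], title="Price vs. sold")

fig.tight_layout()
```

Calling `plt.show()` afterwards displays the figure, as in the exercises that follow.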

```python
# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Look at the first few rows of data
# (assumes a DataFrame named `avocados` with `size` and `nb_sold` columns)
print(avocados.head())

# Get the total number of avocados sold of each size
nb_sold_by_size = avocados.groupby("size")["nb_sold"].sum()
# print(nb_sold_by_size)

# Create a bar plot of the number of avocados sold by size
nb_sold_by_size.plot(kind="bar")

# Show the plot
plt.show()
```

#### Changes in sales over time

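```python
# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Get the total number of avocados sold on each date
# (assumes a DataFrame named `avocados` with `date` and `nb_sold` columns)
nb_sold_by_date = avocados.groupby("date")["nb_sold"].sum()

# Create a line plot of the number of avocados sold by date
nb_sold_by_date.plot(kind="line")

# Show the plot
plt.show()
```

```python
# Scatter plot of avg_price vs. nb_sold with title
# (assumes `avocados` has `nb_sold` and `avg_price` columns; the title text is illustrative)
avocados.plot(x="nb_sold", y="avg_price", kind="scatter",
              title="Number of avocados sold vs. average price")

# Show the plot
plt.show()
```

#### Price of conventional vs. organic avocados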

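```python
# Assumes a DataFrame named `avocados` whose `type` column holds
# "conventional" or "organic", plus a numeric `avg_price` column

# Histogram of conventional avg_price
avocados[avocados["type"] == "conventional"]["avg_price"].hist()

# Histogram of organic avg_price
avocados[avocados["type"] == "organic"]["avg_price"].hist()

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()
```

```python
# Modify histogram transparency to 0.5
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5)

# Modify histogram transparency to 0.5
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5)

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()
```

```python
# Modify bins to 20
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)

# Modify bins to 20
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()
```

### Missing values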

A DataFrame may contain missing values. When we first get a DataFrame, it's a good idea to check whether it contains any missing values and, if so, how many. The `isna()` method returns a Boolean for every value indicating whether it is missing, but that output isn't very helpful when we're working with a lot of data. Because summing Booleans is the same as counting the number of `True` values, we can chain `isna()` with `sum()` to count the NaNs in each column, and we can show those counts in a bar plot to see them at a glance. To handle missing values, we can either remove the affected rows from the data set or fill the gaps with other values.
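The steps described above can be sketched on a tiny DataFrame. The values are made up for illustration, and the variable names (`df`, `missing_counts`, `dropped`, `filled`) are hypothetical; only the pandas methods themselves come from the text.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps
df = pd.DataFrame({
    "small_sold": [10.0, np.nan, 30.0, 40.0],
    "large_sold": [1.0, 2.0, np.nan, np.nan],
    "xl_sold": [5.0, 6.0, 7.0, 8.0],
})

# One Boolean per value: True where the value is missing
is_missing = df.isna()

# Whether each column has at least one missing value
any_missing = df.isna().any()

# Summing Booleans counts the True values, giving the NaN count per column
missing_counts = df.isna().sum()
print(missing_counts)

# Two ways to handle the gaps
dropped = df.dropna()   # keep only the rows with no missing values
filled = df.fillna(0)   # replace every NaN with 0
```

Here only the first row is complete, so `dropped` keeps a single row, while `filled` keeps all four rows with zeros in place of the gaps. Which option is appropriate depends on the data: filling with 0 changes the distribution of the column, which is why the exercises below plot histograms before and after.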

#### Finding missing values

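```python
# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Check individual values for missing values
# (the DataFrame name `avocados_2016` is assumed)
print(avocados_2016.isna())

# Check each column for missing values
print(avocados_2016.isna().any())

# Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind="bar")

# Show plot
plt.show()
```

#### Removing missing values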

```python
# Remove rows with missing values
# (the names `avocados_2016` and `avocados_complete` are assumed)
avocados_complete = avocados_2016.dropna()

# Check if any columns contain missing values
print(avocados_complete.isna().any())
```

```
date               False
avg_price          False
total_sold         False
small_sold         False
large_sold         False
xl_sold            False
total_bags_sold    False
small_bags_sold    False
large_bags_sold    False
xl_bags_sold       False
dtype: bool
```

#### Replacing missing values

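```python
# List the columns with missing values
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]

# Create histograms showing the distributions of cols_with_missing
# (the DataFrame name `avocados_2016` is assumed)
avocados_2016[cols_with_missing].hist()

# Show the plot
plt.show()
```

```python
# From previous step
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]
avocados_2016[cols_with_missing].hist()
plt.show()

# Fill in missing values with 0
# (the names `avocados_2016` and `avocados_filled` are assumed)
avocados_filled = avocados_2016.fillna(0)

# Create histograms of the filled columns
avocados_filled[cols_with_missing].hist()

# Show the plot
plt.show()
```

### Creating DataFrames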

There are many ways to create DataFrames from scratch, but we'll discuss two ways: from a list of dictionaries and from a dictionary of lists. In the first method, the DataFrame is built up row by row, while in the second method, the DataFrame is built up column by column.

#### List of dictionaries

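```python
import pandas as pd

# Create a list of dictionaries with new data
# (the names `avocados_list` and `avocados_2019` are assumed)
avocados_list = [
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]

# Convert list into DataFrame
avocados_2019 = pd.DataFrame(avocados_list)

# Print the new DataFrame
print(avocados_2019)
```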

#### Dictionary of lists

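```python
import pandas as pd

# Create a dictionary of lists with new data
# (the names `avocados_dict` and `avocados_2019` are assumed)
avocados_dict = {
    "date": ["2019-11-17", "2019-12-01"],
    "small_sold": [10859987, 9291631],
    "large_sold": [7674135, 6238096],
}

# Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)

# Print the new DataFrame
print(avocados_2019)
```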
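To make the row-by-row versus column-by-column distinction concrete, here is a small sketch that builds the same table both ways and compares the results. The variable names (`rows`, `columns`, `from_rows`, `from_columns`) are hypothetical, and the figures are reused from the list-of-dictionaries example.

```python
import pandas as pd

# Row by row: each dictionary is one row
rows = [
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]
from_rows = pd.DataFrame(rows)

# Column by column: each key is a column name, each list holds that column's values
columns = {
    "date": ["2019-11-03", "2019-11-10"],
    "small_sold": [10376832, 10717154],
    "large_sold": [7835071, 8561348],
}
from_columns = pd.DataFrame(columns)

# Both routes should describe the same table
print(from_rows.equals(from_columns))
```

The list of dictionaries tends to be convenient when records arrive one at a time, while the dictionary of lists is more compact when whole columns are already available.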

CSV, or comma-separated values, is a common data storage file type. It's designed to store tabular data, just like a pandas DataFrame.

#### CSV to DataFrame

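```python
# Read CSV as DataFrame called airline_bumping (file name assumed)
import pandas as pd
airline_bumping = pd.read_csv("airline_bumping.csv")

# Take a look at the DataFrame
print(airline_bumping.head())
```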
```python
# From previous step (file name assumed)
airline_bumping = pd.read_csv("airline_bumping.csv")

# For each airline, select nb_bumped and total_passengers and sum
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()
```
```python
# From previous steps
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

# Create new col, bumps_per_10k: no. of bumps per 10k passengers for each airline
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000
```
```python
# From previous steps
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

# Print airline_totals
print(airline_totals)
```
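Since the real CSV file is not included here, the calculation can be checked by hand on a tiny stand-in. The sketch below uses made-up figures for two hypothetical airlines, "A" and "B"; only the column names and the formula come from the steps above.

```python
import pandas as pd

# Made-up figures: two rows per hypothetical airline
airline_bumping = pd.DataFrame({
    "airline": ["A", "A", "B", "B"],
    "nb_bumped": [10, 20, 5, 5],
    "total_passengers": [100000, 200000, 50000, 150000],
})

# Sum the counts per airline
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

# Bumps per 10,000 passengers
# A: 30 / 300,000 * 10,000, about 1.0
# B: 10 / 200,000 * 10,000, about 0.5
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

print(airline_totals)
```

Scaling to a rate per 10,000 passengers makes airlines of different sizes comparable, which raw bump counts would not.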

#### DataFrame to CSV

```python
# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values("bumps_per_10k", ascending=False)

# Print airline_totals_sorted
print(airline_totals_sorted)

# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv("airline_totals_sorted.csv")
```
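One detail worth knowing when saving a grouped result: `to_csv` writes the index (here, the airline names) as the first column by default. The following sketch, using made-up totals and hypothetical variable names, produces the CSV text in memory instead of a file and reads it back with `index_col=0` to restore the index.

```python
import io
import pandas as pd

# Made-up totals indexed by a hypothetical airline label
totals = pd.DataFrame(
    {"nb_bumped": [30, 10], "total_passengers": [300000, 200000]},
    index=pd.Index(["A", "B"], name="airline"),
)

# With no path given, to_csv returns the CSV text as a string;
# the index is written as the first column
csv_text = totals.to_csv()
print(csv_text)

# Read the text back, using the first column as the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

Without `index_col=0`, the airline names would come back as an ordinary column and the DataFrame would get a fresh integer index.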