# Correlation and Experimental Design

amalhanaja

## Correlation

### Relationship between two variables

x = explanatory/independent variable

y = response/dependent variable

### Correlation coefficient

• Quantifies the linear relationship between two variables

• Number between -1 and 1

• Magnitude corresponds to strength of relationship

• Sign (+ or -) corresponds to direction of relationship

• The closer the value is to zero, the weaker the linear relationship
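To make the properties above concrete, here is a minimal sketch that computes the Pearson correlation coefficient from its definition (covariance divided by the product of the standard deviations). The helper name `pearson_r` and the small `hours_studied`/`exam_score` lists are made up for illustration and are not part of the course data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical values chosen only to illustrate sign and magnitude
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 70, 72]

r_pos = pearson_r(hours_studied, exam_score)        # y rises with x: positive sign
r_neg = pearson_r(hours_studied, exam_score[::-1])  # y falls with x: negative sign
print(r_pos, r_neg)
```

Reversing the order of `exam_score` flips the direction of the relationship, so the sign changes while the magnitude stays the same.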

### Visualize relationships

``````
import matplotlib.pyplot as plt
import seaborn as sns

# msleep is assumed to be a DataFrame that is already loaded
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()
``````

Adding a trendline makes it easier to see the relationship between two variables.

``````
import matplotlib.pyplot as plt
import seaborn as sns

# ci=None hides the confidence interval band around the trendline
sns.lmplot(x='sleep_total', y='sleep_rem', data=msleep, ci=None)
plt.show()
``````

### Calculating correlation

``````
msleep['sleep_total'].corr(msleep['sleep_rem']) # Out: 0.751755

msleep['sleep_rem'].corr(msleep['sleep_total']) # Out: 0.751755
``````

The correlation between x and y equals the correlation between y and x; the order of the variables doesn't matter.

### Relationship between variables

``````
#1.
# Create a scatterplot of happiness_score vs. life_exp
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)

# Show plot
plt.show()

#2.
# Create scatterplot of happiness_score vs. life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)

# Show plot
plt.show()

#3.
# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])

print(cor)
``````

## Correlation caveats

• Correlations only account for a linear relationship

• Always visualize data when possible

• Correlation does not imply causation: `x` being correlated with `y` doesn't mean `x` causes `y`

• When data is highly skewed, a log transformation (`np.log`) can make the relationship more linear

• Common transformations:

  ◦ Log transformation (`log(x)`)

  ◦ Square root transformation (`sqrt(x)`)

  ◦ Reciprocal transformation (`1/x`)

  ◦ Combinations of these, e.g. `log(x)` and `sqrt(y)`, `sqrt(x)` and `1/y`, or `1/x` and `log(y)`
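As a rough sketch of how one might compare the transformations listed above, the snippet below builds a synthetic DataFrame in which `y` depends on `log(x)` exactly, then checks which transformation of `x` gives the strongest linear correlation with `y`. The data and the names `df`, `candidates`, and `correlations` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y is an exact function of log(x)
df = pd.DataFrame({"x": np.arange(1, 201, dtype=float)})
df["y"] = 3.0 * np.log(df["x"]) + 2.0

# Candidate versions of x: untransformed, log, square root, reciprocal
candidates = {
    "x": df["x"],
    "log(x)": np.log(df["x"]),
    "sqrt(x)": np.sqrt(df["x"]),
    "1/x": 1.0 / df["x"],
}

# Correlation of each candidate with y
correlations = {name: col.corr(df["y"]) for name, col in candidates.items()}
for name, r in correlations.items():
    print(f"{name:8s} r = {r:+.3f}")

# The transformation with the largest |r| gives the most linear relationship
best = max(correlations, key=lambda name: abs(correlations[name]))
print("Most linear:", best)
```

On real data no transformation will give a perfect correlation, so it is still worth plotting each transformed variable before settling on one.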

### Why use transformation?

Certain statistical methods rely on variables having a linear relationship:

• Correlation coefficient

• Linear regression

### What can't correlation measure?

``````
#1
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness)

# Show plot
plt.show()

#2
# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])

print(cor)
``````
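A self-contained way to see the limitation is a relationship that is perfectly predictable but not linear. In the sketch below (synthetic values, not from `world_happiness`), `y` is fully determined by `x`, yet the correlation coefficient is essentially zero because the relationship is a symmetric curve rather than a line.

```python
import numpy as np
import pandas as pd

# Synthetic example: x is symmetric around zero and y = x squared
x = pd.Series(np.arange(-10, 11, dtype=float))
y = x ** 2

# y is completely determined by x, but the relationship is not linear
r = x.corr(y)
print(r)
```

A scatterplot of these values would show a clear U shape, which is why visualizing the data matters even when the coefficient is near zero.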

### Transforming variables

``````
import numpy as np

#1
# Scatterplot of happiness_score vs. gdp_per_cap
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['happiness_score'].corr(world_happiness['gdp_per_cap'])
print(cor) # Out: 0.727973301222298

#2
# Create log_gdp_per_cap column
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])

# Scatterplot of happiness_score vs. log_gdp_per_cap
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['log_gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor) # Out: 0.8043146004918288
``````

### Does sugar improve happiness?

``````
#1
# Scatterplot of grams_sugar_per_day and happiness_score
sns.scatterplot(x='grams_sugar_per_day', y='happiness_score', data=world_happiness)
plt.show()

# Correlation between grams_sugar_per_day and happiness_score
cor = world_happiness['grams_sugar_per_day'].corr(world_happiness['happiness_score'])
print(cor)
``````

## Design of experiments

### Controlled experiments

• Participants are assigned by researchers to either a treatment group or a control group, e.g. A/B testing

• The treatment group receives the treatment; the control group doesn't

• Groups should be comparable so that causation can be inferred; if they are not, this could lead to confounding (bias)

### Best practices

The best experiments eliminate as much bias as possible.

• The less bias, the more reliable the conclusions

#### Tools

• Randomized controlled trials

  ◦ Participants are assigned to treatment/control randomly, not based on any other characteristics

  ◦ Choosing randomly helps ensure that groups are comparable

• Placebo

  ◦ Resembles the treatment, but has no effect

  ◦ Participants will not know which group they're in

• Double-blind trials

  ◦ The person administering the treatment/running the study doesn't know whether the treatment is real or a placebo

  ◦ Prevents bias in the response and/or the analysis of results
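Random assignment can be sketched in a few lines. The example below is only an illustration: the `participants` DataFrame, its `participant_id` column, and the group size of 20 are hypothetical. It shuffles a balanced list of labels so that group membership depends on chance alone, not on any participant characteristic.

```python
import numpy as np
import pandas as pd

# Fixed seed so the illustration is reproducible
rng = np.random.default_rng(seed=42)

# Hypothetical participants, identified only by an ID
participants = pd.DataFrame({"participant_id": range(1, 21)})

# Balanced labels: half treatment, half control
labels = np.array(["treatment", "control"] * (len(participants) // 2))

# Shuffling the labels assigns each participant to a group at random
participants["group"] = rng.permutation(labels)

group_sizes = participants["group"].value_counts()
print(group_sizes)
```

Shuffling a pre-balanced label list guarantees equal group sizes; drawing each label independently would also be random but could leave the groups unequal.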

### Observational study

• Participants are not assigned randomly to groups

  ◦ Participants assign themselves, usually based on pre-existing characteristics

• Many research questions are not conducive to a controlled experiment

• Establishes association, not causation
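To illustrate why an observational study can only establish association, the simulation below uses made-up data in which a hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. The variable names and the helper `residuals` are assumptions for this sketch. The two variables are still clearly correlated, and the correlation largely disappears once the linear effect of `z` is removed from each.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

z = rng.normal(size=n)      # hypothetical confounder (e.g. a pre-existing characteristic)
x = z + rng.normal(size=n)  # x depends on z, not on y
y = z + rng.normal(size=n)  # y depends on z, not on x

# x and y are associated even though neither causes the other
r_xy = np.corrcoef(x, y)[0, 1]

def residuals(values, confounder):
    """Remove the part of `values` that a straight-line fit on `confounder` explains."""
    slope, intercept = np.polyfit(confounder, values, deg=1)
    return values - (slope * confounder + intercept)

# Correlation between x and y after accounting for z
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(round(r_xy, 2), round(r_partial, 2))
```

In a real observational study the confounder is often unmeasured, so this adjustment is not always possible; random assignment avoids the problem by design.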

### Longitudinal vs cross-sectional studies

#### Longitudinal studies

• Participants are followed over a certain period to examine the effect of the treatment on the response

• Effect of age on height is not confounded by generation

• More expensive and take longer

#### Cross-sectional studies

• Data on participants is collected from a single snapshot in time

• Effect of age on height is confounded by generation

• More affordable and take less time