## Chap 7 Relationships between variables

### Visualization

In most cases, the simplest way to check the relationship between two variables is a scatter plot.
To get a better visual effect, we can use jittering to reverse the effect of rounding off. To handle the problem of too many overlapping points, we can set the alpha parameter to less than 1.0.
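
For instance, a minimal sketch of a jittered, semi-transparent scatter plot (`xs` and `ys` are hypothetical arrays, not from the text):

```py
import numpy as np
import matplotlib.pyplot as plt

xs = np.random.randint(150, 190, 500)        # hypothetical rounded values
ys = xs * 0.5 + np.random.normal(0, 5, 500)  # hypothetical second variable

# Jitter undoes the columns created by rounding; alpha < 1 shows overlap density
jittered = xs + np.random.uniform(-0.5, 0.5, len(xs))
plt.scatter(jittered, ys, alpha=0.2, s=10)
plt.show()
```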

However, when the dataset is larger, a hexbin plot may be a better choice to show the relationship.
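
For comparison, a hexbin sketch of the same hypothetical `xs` and `ys` from above:

```py
import matplotlib.pyplot as plt

# Each hexagonal cell is colored by the number of points that fall into it
plt.hexbin(xs, ys, gridsize=50, cmap='Blues')
plt.colorbar()
plt.show()
```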

In addition to scatter plots, binning one variable and plotting percentiles of the other is also a good option.
Here is the code for binning data:

```py
import numpy as np

bins = np.arange(10, 100, 5)
# data is assumed to be a DataFrame with a column x; np.digitize maps each value of data.x to its bin
indices = np.digitize(data.x, bins)
groups = data.groupby(indices)
```
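
For example, to plot percentiles of a second variable per bin, here is a sketch assuming `data` also has a column `y`:

```py
import matplotlib.pyplot as plt

# For each bin, plot the 25th, 50th, and 75th percentiles of y against the mean x
for p in [25, 50, 75]:
    xs = groups.x.mean()
    ys = groups.y.quantile(p / 100.0)
    plt.plot(xs, ys, label='%dth percentile' % p)
plt.legend()
plt.show()
```
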
### Correlation
A **correlation** is a statistic that quantifies the strength of the relationship between two variables.
The challenges are:
1. the variables may not be in the same units
2. the variables may come from different distributions

Two common solutions:
1. **Standard score**: use $z_i = (x_i-\mu)/\sigma$, which leads to the Pearson product-moment correlation coefficient.
2. Transform values to **ranks**, which leads to the Spearman rank correlation coefficient.

### Covariance
**Covariance** is a measure of the tendency of two variables to vary together.
$$Cov(X,Y) = E[(X-E(X))(Y-E(Y))] = E[XY]-E[X]E[Y]$$
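
Computed directly from this definition (a sketch, assuming equal-length arrays `xs` and `ys`):

```py
import numpy as np

def cov(xs, ys):
    """Covariance as the mean product of deviations from the means (divides by n)."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    return np.mean((xs - xs.mean()) * (ys - ys.mean()))
```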

### Pearson's correlation
$$\rho_{X,Y}={\frac {\operatorname {cov} (X,Y)}{\sigma_{X}\sigma _{Y}}} $$
Pearson's correlation lies between -1 and +1. Dividing by the standard deviations makes the coefficient dimensionless.
The sign of the coefficient matches the direction of the relationship, and its magnitude indicates the strength of the correlation.
If Pearson's correlation is near 0, the variables have little linear relationship; keep in mind that **Pearson's correlation only measures linear relationships**.
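
A sketch of the computation, building on the covariance above:

```py
import numpy as np

def pearson_corr(xs, ys):
    """Pearson's correlation: covariance divided by the product of standard deviations."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    c = np.mean((xs - xs.mean()) * (ys - ys.mean()))
    return c / (xs.std() * ys.std())
```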

### Spearman's rank correlation
The computation of Spearman's rank correlation is similar to Pearson's correlation, except that each value is replaced by its rank (see the sketch after the list).
Compared with Pearson's correlation, Spearman's rank correlation has several advantages:
1. Pearson's correlation tends to underestimate the strength of the relationship when the relationship is **nonlinear**.
2. Spearman's rank correlation is more robust; Pearson's can be affected if the distribution is **skewed** or contains **outliers**.
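
A sketch of the computation: rank both variables (scipy's `rankdata` is one option), then take Pearson's correlation of the ranks:

```py
import numpy as np
from scipy.stats import rankdata

def spearman_corr(xs, ys):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return np.corrcoef(rankdata(xs), rankdata(ys))[0, 1]
```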

### Correlation and causation
Always remember: [**Correlation does not imply causation**](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation).
Two methods we can try to provide evidence of causation:
1. Use time: if A comes before B, then A can cause B but not the other way around.
2. Use randomness: if a large sample is divided into two groups at random, confounding variables should be roughly the same in both groups.

## Chap 8 Estimation

### Guess the variance
Consider a normal distribution and a sample $[x_1, x_2, \dots, x_n]$. To estimate $\sigma^2$, here is an estimator:
$$S^2 = \frac{1}{n}\sum(x_i-\bar{x})^2$$
When the sample is large, this estimator is adequate, but for small samples it tends to be too low.
Since the estimator above is **biased**, there is an **unbiased** estimator of $\sigma^2$:
$$S_{n-1}^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2$$
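
A quick simulation sketch (with assumed parameters) showing the bias of the first estimator:

```py
import numpy as np

# Compare the biased (1/n) and unbiased (1/(n-1)) variance estimators on small samples
mu, sigma, n, iters = 0, 1, 7, 100000
biased = [np.var(np.random.normal(mu, sigma, n)) for _ in range(iters)]
unbiased = [np.var(np.random.normal(mu, sigma, n), ddof=1) for _ in range(iters)]
print(np.mean(biased), np.mean(unbiased))  # the first falls below sigma**2, the second stays close to it
```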

### Sampling distributions
Variation in the estimate caused by random selection (instead of using the full population) is called **sampling error**.
If we run m simulations and draw n values each time to compute the estimate, we get a **sampling distribution**.
There are two common ways to summarize the sampling distribution:
1. Standard error (SE). *Note the difference between standard error and standard deviation.*
2. Confidence interval (CI): a range that includes a given fraction of the sampling distribution, e.g. a 90% CI of (86, 94).

SE and CI only quantify sampling error; the sampling distribution does not account for **sampling bias and measurement error**.
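
A simulation sketch (assumed parameters) that estimates the mean of a normal distribution and summarizes the sampling distribution:

```py
import numpy as np

mu, sigma, n, iters = 90, 7.5, 9, 1000
means = [np.random.normal(mu, sigma, n).mean() for _ in range(iters)]
stderr = np.std(means)              # standard error of the estimate
ci = np.percentile(means, [5, 95])  # 90% confidence interval
```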

### Sampling bias
+ **Sampling bias** is bias caused by a sampling process that is not uniform.
+ **Measurement error** is error caused by inaccurate measurement.

### Exponential distributions
To estimate the rate parameter $\lambda$ of an exponential distribution, we could use the sample mean:
$$L=\frac{1}{\bar{x}}$$
or use the median to ensure robustness ($m$ is the sample median):
$$L_m=\ln(2)/m$$
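
A simulation sketch with an assumed true parameter:

```py
import numpy as np

# Estimate lambda of an exponential distribution (assumed true lambda = 2)
xs = np.random.exponential(scale=1/2, size=1000)
L = 1 / np.mean(xs)              # mean-based estimator
Lm = np.log(2) / np.median(xs)   # median-based estimator, more robust to outliers
```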

## Hypothesis testing

### Classical hypothesis testing
To answer the question:
> Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?

we follow these steps:
1. Quantify the size of the apparent effect by choosing a **test statistic**.
2. Define a **null hypothesis**.
3. Compute a p-value: the probability of seeing the apparent effect if the null hypothesis is true.
4. Interpret the result.

The logic of this process is similar to a proof by contradiction.
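
As one concrete example (a sketch, not taken from the text), a permutation test of the difference in means between two groups follows exactly these steps:

```py
import numpy as np

def permutation_pvalue(group1, group2, iters=1000):
    """P-value of the observed difference in means under the null hypothesis
    that both groups come from the same distribution."""
    observed = abs(np.mean(group1) - np.mean(group2))
    pooled = np.concatenate([group1, group2])
    n = len(group1)
    count = 0
    for _ in range(iters):
        np.random.shuffle(pooled)
        diff = abs(np.mean(pooled[:n]) - np.mean(pooled[n:]))
        if diff >= observed:
            count += 1
    return count / iters
```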

### Chi-squared tests
To test proportions, it is more common to use the **chi-squared statistic**:
$$\chi^2 = \sum_{i}\frac{(O_i-E_i)^2}{E_i}$$
where $O_i$ is the observed value and $E_i$ is the expected value.
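
A direct computation of the statistic (sketch with made-up counts, e.g. testing whether a die is fair):

```py
import numpy as np

observed = np.array([8, 9, 19, 5, 8, 11])             # made-up counts of each face
expected = np.full(6, observed.sum() / 6)             # expected counts for a fair die
chi2 = np.sum((observed - expected) ** 2 / expected)
```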

## Linear least squares

### Why use squares?

+ Treats positive and negative residuals the same.
+ Gives more weight to larger residuals, but not so much that the largest always dominates.
+ If the residuals are uncorrelated and normally distributed with mean 0 and constant variance, then the least squares fit is also the maximum likelihood estimator of the intercept and slope (see the sketch below).
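
A sketch of the closed-form least-squares fit for the intercept and slope (`xs` and `ys` are assumed arrays):

```py
import numpy as np

def least_squares(xs, ys):
    """Intercept and slope that minimize the sum of squared residuals."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    slope = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    inter = ys.mean() - slope * xs.mean()
    return inter, slope
```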


## Time series analysis
### Moving averages
Three most common components in time series modeling:
+ Trend: A smooth function that captures persistent changes
+ Seasonality: Periodic variation
+ Noise: Random variation around the long-term trend

Moving averages show the trend.
**Simplest moving average**: rolling mean
```py
import numpy as np
import pandas as pd

# Rolling mean: average over a window of k consecutive values
rolling_mean = [np.mean(data[i:i+k]) for i in range(len(data) - k)]
# or with pandas (rolling_mean was removed from newer versions; use .rolling instead)
rolling_mean = pd.Series(data).rolling(k).mean()
```

**EWMA (exponentially-weighted moving average)**: the most recent value has the highest weight and the weights for previous values drop off exponentially.

```py
# pandas.ewma is likewise gone from newer versions; the equivalent is .ewm
ewma = pd.Series(data).ewm(span=k).mean()
```