Bayesian Analysis with Python, Chapter 4: Understanding and Predicting Data with Linear Regression Models

Simple Linear Regression

Continuous variable - a variable using real numbers or floats (dependent, predicted, outcome)
Independent variable - can be continuous or categorical (predictor, input)
We can model the relationship between these variables with linear regression.  With multiple independent variables, we use multiple regression models.

The machine learning connection

Machine learning (ML) is the umbrella term for a collection of methods to automatically learn patterns in data.  Regression is a supervised learning problem because we know both the x and y values.  The question is how to generalize from these observations to any future observation.

The core of linear regression models



y = alpha + beta * x

Beta is the slope of the line: the change in y per unit change in x.
Alpha is the intercept: the value of y when x = 0.
When we try to solve this problem, we typically use the least squares method.  We can also use a Bayesian framework.
This has several advantages:
  • we can obtain plausible values of alpha and beta, not just single point estimates
  • we capture the uncertainty in these parameter estimates.
Probabilistic linear regression:

y ~ Normal(mu = alpha + beta * x, sd = epsilon)

The data vector y is assumed to be distributed as a Gaussian with mean alpha + beta * x and standard deviation epsilon.

Linear models and high autocorrelation

By the definition of the model, our two parameters are going to be correlated: the posterior takes the shape of a very narrow, diagonal space.  See also the curse of dimensionality.  The fact that the fitted line is constrained to pass through the mean of the data is only strictly true for the least squares method; using Bayesian methods, this constraint is relaxed.  We will look at two approaches to fix this.

Modifying the data before running

The simple solution is to center the x data by subtracting the mean of the x variable:

x' = x - mean(x)

x' will be centered at 0.

Centering the data can also help with interpreting the parameters.  In some interesting cases the value of alpha in the original scale is meaningless, for example when x = 0 lies outside the range of the observed data.
To report parameters back in the original scale:

alpha = alpha' - beta' * mean(x)
beta = beta'
You can also standardize the data.  To do this, you subtract the mean and divide by the standard deviation:

x' = (x - mean(x)) / sd(x)
Standardizing the data allows us to talk in terms of z-scores.  A z-score of 1.3 means 1.3 standard deviations above (or, if negative, below) the mean.
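A small numpy sketch of both transformations and of converting centered-model parameters back to the original scale (the numbers are made up for illustration):

```python
import numpy as np

x = np.array([4.0, 6.0, 8.0, 10.0, 12.0])

x_centered = x - x.mean()                  # x' = x - mean(x), now centered at 0
x_standardized = (x - x.mean()) / x.std()  # z-scores: mean 0, sd 1

# Suppose a model fit on the centered data returned these parameters:
alpha_prime, beta_prime = 10.0, 2.0
# Back to the original scale: alpha = alpha' - beta' * mean(x); beta is unchanged
alpha = alpha_prime - beta_prime * x.mean()
beta = beta_prime
```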

Changing the sampling method

By changing the sampling method we can alleviate the autocorrelation problem.  NUTS can be slower per step, but it usually needs fewer steps than Metropolis-Hastings.

Interpreting and visualizing the posterior

This section is mostly code and figures.  The author gives good ideas on how to analyze the code and the data it produces.

Pearson correlation coefficient

The Pearson correlation coefficient is a measure of the degree of linear dependence between two variables, often denoted as r:
r = +1: perfect positive linear correlation.  When one variable goes up, the other goes up.
r = -1: perfect negative linear correlation.  When one variable goes up, the other goes down.
r = 0: no linear correlation.
The Pearson correlation coefficient is equal to the slope of the regression line when the standard deviation of x equals the standard deviation of y.
The coefficient of determination (R²) is the Pearson coefficient squared.
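The relationships above can be checked with numpy on synthetic data (the data here is made up; note how the slope on standardized variables equals r):

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(0, 1, 500)
y = 2.0 * x + np.random.normal(0, 0.5, 500)  # strong positive linear relation

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
r_squared = r ** 2            # coefficient of determination (R²)

# After standardizing both variables (equal sds), the fitted slope equals r
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
slope = np.polyfit(zx, zy, 1)[0]
```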

Pearson coefficient from a multivariate Gaussian

The multivariate Gaussian is the generalization of the Gaussian distribution to more than one dimension.  For two variables, we need a 2x2 covariance matrix:

Sigma = [[sd1^2, rho*sd1*sd2], [rho*sd1*sd2, sd2^2]]

The main diagonal holds the variances (the squares of the standard deviations), and the off-diagonal elements depend on rho, the Pearson correlation coefficient.  Since we don't know the values of the covariance matrix, we have to put priors on it.  We could also use other methodologies.
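To see the connection between the covariance matrix and r, we can build the matrix from standard deviations and rho, sample from the corresponding multivariate Gaussian, and recover the correlation (values chosen arbitrarily for illustration):

```python
import numpy as np

sigma1, sigma2, rho = 1.0, 2.0, 0.8
cov = np.array([[sigma1 ** 2,            rho * sigma1 * sigma2],
                [rho * sigma1 * sigma2,  sigma2 ** 2]])

rng = np.random.default_rng(1)
samples = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
r_est = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]  # should be close to 0.8
```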

Robust linear regression

Assuming our data is Gaussian is a reasonable assumption in most cases, but outliers can make the Gaussian assumption fail.  Using the Student's t-distribution as the likelihood gives reasonably robust inference, and these concepts apply to linear regression as well.  You may need to use a shifted exponential prior on the normality parameter, as an unshifted exponential puts too much weight on extremely heavy tails, which over-emphasizes extreme outliers relative to data with few bulk points.
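A quick scipy illustration of why the t-distribution is robust, using a simple location estimate rather than a full regression (synthetic data with one outlier):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Bulk of the data near 0, plus one extreme outlier at 25
data = np.concatenate([rng.normal(0, 1, 50), [25.0]])

gaussian_loc = data.mean()              # Gaussian MLE of location: pulled toward the outlier
nu, t_loc, t_scale = stats.t.fit(data)  # Student's t MLE: heavy tails absorb the outlier
```

The t-based location estimate stays near the bulk of the data, while the Gaussian estimate (the mean) is dragged toward the outlier.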

Hierarchical linear regression

This section contains a bunch of code that shows how to do hierarchical linear regression.

Correlation, causation, and the messiness of life

Correlation does not imply causation

When we establish a linear relationship between two variables, the variables can be interchanged; this does not mean that x causes y or that y causes x.  To establish that a correlation can be interpreted as causation, we need to add a physical mechanism to the problem.

Polynomial regression

This section discusses the code to fit a curve to data using polynomial regression.
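As a minimal non-Bayesian sketch of the same fit, numpy's polyfit recovers polynomial coefficients directly (noise-free synthetic data for clarity):

```python
import numpy as np

# Hypothetical quadratic data: y = 1 + 2x + 3x^2
x = np.linspace(-2, 2, 50)
y = 1 + 2 * x + 3 * x ** 2

coeffs = np.polyfit(x, y, deg=2)  # coefficients, highest degree first
y_pred = np.polyval(coeffs, x)    # evaluate the fitted polynomial
```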

Interpreting the parameters of a polynomial regression

The beta coefficients are no longer simple slopes.  While these models can be good at prediction, they aren't good for understanding the underlying processes.  For polynomials of order two or higher, other models are often a better choice.

Polynomial regression - the ultimate model

It is possible to fit a polynomial to your data perfectly, but a model that fits your data perfectly will in general be a poor description of unobserved data.  This is known as overfitting, and it is a problem for both statistics and machine learning.  Lines are easier to interpret even if a cubic model fits the data better; a perfect fit does not mean a good model.

Multiple linear regression

We have been working with one dependent and one independent variable.  We can also have multiple independent variables:

y = alpha + beta_1 * x_1 + beta_2 * x_2 + ... + beta_m * x_m

Beta is a vector of coefficients.

In simple linear regression, we hope to find a straight line that fits our data.  In multiple linear regression, we find a hyperplane of dimension m.
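A least-squares sketch of this with two predictors, using a design matrix whose first column of ones carries the intercept (noise-free synthetic data so the coefficients are exactly recoverable):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 3.0 * x2  # alpha = 1, beta = (2, -3)

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```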

Confounding variables and redundant variables

We can sometimes predict y from x, but the variable actually driving y may be a third variable z.  When we omit the variable that is really driving the relationship, it is called a confounding variable.  It can be left out for many reasons: it wasn't measured, or it was left out of our dataset.

Multicollinearity, or when the correlation is too high

To prove the point, we set two variables to be almost exactly equal.  The model still works and predicts the data very well, but it may be simpler to leave one variable out.  Correlated and even highly correlated variables are always possible in any dataset.
How to deal with them is the question:
  • remove one variable; it doesn't really matter which one
  • create a new variable by averaging the redundant variables
  • use stronger priors
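A quick numpy sketch of the point above: with two nearly identical predictors, predictions stay accurate even though the individual coefficients are poorly determined, and only their sum is pinned down (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 1e-6, n)   # x2 is almost an exact copy of x1
y = 3.0 * x1 + rng.normal(0, 0.1, n)

X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

# Predictions are fine; individually coef[0] and coef[1] can wander,
# but coef[0] + coef[1] stays close to the true combined effect of 3.
```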

Masking effect variables

When one variable is positively correlated with the outcome and another is negatively correlated with it, the variables taken individually may look like poor predictors, each masking the effect of the other.

Adding Interactions

In our examples so far, the independent variables have contributed additively to the predicted variable.  Interaction terms are non-additive terms that affect our predicted variable.
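An interaction can be modeled by adding the product of two predictors as an extra column in the design matrix (noise-free synthetic data so the coefficients are exact):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
# The effect of x1 on y depends on x2: a non-additive (interaction) effect
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])  # interaction as a product column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```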

The GLM module

The glm (generalized linear model) module included in PyMC3 simplifies writing linear models.

