Bayesian Analysis with Python, Chapter 2: Programming Probabilistically


Probabilistic Programming

Probabilistic programming allows a clear separation between models and inference.  PP hides the details of how inference is performed, allowing you to focus on model specification and analysis of the results.

Inference Engines

Non-Markovian Methods
  • Grid computing
  • Quadratic approximation
  • Variational methods
Markovian Methods
  • Monte Carlo
  • Metropolis-Hastings
  • Markov Chain
  • Hamiltonian Monte Carlo/NUTS
Currently, most Bayesian analysis is done using MCMC (Markov Chain Monte Carlo) methods, while variational methods are favored for bigger datasets.

Non-Markovian Methods

These can be faster than Markovian methods and provide a good starting point for them.

Grid Computing

May be able to compute the posterior even when we cannot derive it analytically.  Using an infinite number of grid points would give us the exact posterior.  This does not work well for many parameters: we end up spending most of the time computing values with practically null contribution.
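
To make this concrete, here is a minimal sketch of grid approximation for the chapter's coin-flipping problem (the function name, grid size, and data are my own choices, not the book's code):

```python
import numpy as np

def grid_posterior(heads, tosses, n_points=200):
    """Grid approximation of the posterior for a coin's bias.

    Uses a uniform prior over theta and a binomial-style likelihood;
    n_points controls the quality of the approximation.
    """
    grid = np.linspace(0, 1, n_points)
    prior = np.ones(n_points)                      # uniform prior
    likelihood = grid**heads * (1 - grid)**(tosses - heads)
    unnorm = prior * likelihood
    posterior = unnorm / unnorm.sum()              # normalize over the grid
    return grid, posterior

grid, post = grid_posterior(heads=3, tosses=10)
```

Note that the normalized posterior sums to 1 over the grid, and the grid point with the highest probability lands near the sample proportion 3/10.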

Quadratic Method

Also known as the Laplace method.  This method approximates the posterior with a Gaussian (normal) distribution centered at the posterior's mode.
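
A rough sketch of the idea for the same coin-flip posterior (this numerical recipe is my own illustration, not the book's code): find the mode of the log-posterior, then match a Gaussian to the curvature there.

```python
import numpy as np
from scipy import optimize

def laplace_approx(heads, tosses):
    """Quadratic (Laplace) approximation of the coin-flip posterior.

    Finds the mode of the log-posterior, then fits a Gaussian whose
    variance comes from the curvature (second derivative) at the mode.
    """
    def neg_log_post(theta):
        # negative log-likelihood under a uniform prior; constants dropped
        return -(heads * np.log(theta) + (tosses - heads) * np.log(1 - theta))

    res = optimize.minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6),
                                   method='bounded')
    mode = res.x
    # numerical second derivative of the negative log-posterior at the mode
    h = 1e-5
    curvature = (neg_log_post(mode + h) - 2 * neg_log_post(mode)
                 + neg_log_post(mode - h)) / h**2
    sd = 1 / np.sqrt(curvature)
    return mode, sd

mode, sd = laplace_approx(heads=3, tosses=10)
```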

Variational Methods 

The general idea is to approximate the posterior with a simpler distribution.  The main drawback is that each model traditionally requires its own custom-built algorithm; ADVI (automatic differentiation variational inference) automates this.

Markovian Methods

Monte Carlo

This method uses random sampling to compute or simulate a given process.
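
The classic illustration is estimating pi by random sampling (my own example of the general technique):

```python
import numpy as np

rng = np.random.default_rng(42)

# Sample points uniformly in the unit square and count the fraction
# that fall inside the quarter circle of radius 1.
n = 100_000
x, y = rng.random(n), rng.random(n)
inside = (x**2 + y**2) < 1
pi_estimate = 4 * inside.mean()
```

With 100,000 samples the estimate lands close to pi, and more samples tighten it further.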

Markov Chain

A Markov chain is a mathematical object that consists of a sequence of states and the probabilities of transitioning between those states.
Detailed balance condition: if we move in a reversible way, the probability of moving from state i to state j must equal the probability of moving from j to i.
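
A tiny two-state chain makes both ideas checkable (the transition matrix here is a made-up example):

```python
import numpy as np

# A 2-state Markov chain: rows are the current state, columns are the
# probabilities of transitioning to each next state.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# The stationary distribution pi satisfies pi @ P = pi.
pi = np.array([0.75, 0.25])
assert np.allclose(pi @ P, pi)

# Detailed balance: pi_i * P[i, j] == pi_j * P[j, i]
balanced = np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])
```

Here 0.75 * 0.1 and 0.25 * 0.3 both equal 0.075, so the chain satisfies detailed balance with respect to its stationary distribution.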

Metropolis-Hastings

There is a ton in this section.  Take your time and look up all the terms included.  The concept is easy.
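
Since the concept is simple, a minimal sampler helps anchor the terms (this is my own sketch, not the book's implementation; with a symmetric Gaussian proposal the Hastings correction cancels, so the acceptance rule uses only the posterior ratio):

```python
import numpy as np

def metropolis(log_post, start, n_samples, step=0.1, seed=0):
    """A minimal Metropolis sampler with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = start
    current_lp = log_post(current)
    for i in range(n_samples):
        proposal = current + rng.normal(0, step)
        proposal_lp = log_post(proposal)
        # accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples

def log_post(theta):
    # log-posterior for 3 heads in 10 tosses under a uniform prior
    if not 0 < theta < 1:
        return -np.inf
    return 3 * np.log(theta) + 7 * np.log(1 - theta)

samples = metropolis(log_post, start=0.5, n_samples=20_000)
```

After discarding the first few thousand draws as burn-in, the sample mean approaches the analytic posterior mean of 4/12 ≈ 0.33 (a uniform prior plus this likelihood gives a Beta(4, 8) posterior).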

Hamiltonian Monte Carlo/NUTS

The Hamiltonian is a description of the total energy in a system.  This method is faster because proposals follow the gradient rather than being purely random moves.  Think again of the lake: when we take a sample, we move a number of steps guided by the slope of the terrain, which carries us toward the lowest parts of the lake.
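
A toy 1-D HMC sampler shows the mechanics (my own sketch under simple assumptions: a standard-normal target and fixed leapfrog settings; real implementations like NUTS tune these automatically):

```python
import numpy as np

def hmc(log_p, grad, n_samples, step=0.1, n_leapfrog=20, seed=0):
    """Minimal Hamiltonian Monte Carlo for a 1-D target distribution.

    Each proposal simulates Hamiltonian dynamics with the leapfrog
    integrator instead of moving at random.
    """
    rng = np.random.default_rng(seed)
    q = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        p = rng.normal()                      # resample momentum
        q_new, p_new = q, p
        # leapfrog: half step in momentum, alternating full steps, half step
        p_new += 0.5 * step * grad(q_new)
        for _ in range(n_leapfrog - 1):
            q_new += step * p_new
            p_new += step * grad(q_new)
        q_new += step * p_new
        p_new += 0.5 * step * grad(q_new)
        # accept/reject on the change in total energy H = -log_p + p^2 / 2
        h_old = -log_p(q) + 0.5 * p**2
        h_new = -log_p(q_new) + 0.5 * p_new**2
        if np.log(rng.random()) < h_old - h_new:
            q = q_new
        samples[i] = q
    return samples

# target: standard normal, with log p(q) = -q^2/2 up to a constant
samples = hmc(lambda q: -0.5 * q**2, lambda q: -q, n_samples=5000)
```

The recovered samples have mean near 0 and standard deviation near 1, matching the target.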

Other MCMC methods

There are a lot of other methods we could explore.  The book mentions two, but leaves them as an exercise for the coder.
Replica Exchange method (aka parallel tempering, or Metropolis-coupled MCMC)

PyMC3 Introduction

We are starting to get into the guts of the new module we will be using.  PyMC3 is written using Theano and NumPy.  We don't need to know Theano, but it might not be a bad idea to work through a few quick tutorials.

Coin-flipping, the computational approach

Model Specification

We are told how closely PyMC3's syntax follows the mathematical notation of the model.
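
A sketch of the coin-flipping model in PyMC3, closely mirroring the book's example (variable names and the simulated data are my approximation of the text, not a verbatim copy):

```python
import numpy as np
import pymc3 as pm

np.random.seed(123)
data = np.random.binomial(n=1, p=0.35, size=4)  # simulated coin tosses

with pm.Model() as our_first_model:
    # prior: theta ~ Beta(1, 1), i.e. uniform on [0, 1]
    theta = pm.Beta('theta', alpha=1., beta=1.)
    # likelihood: each toss is a Bernoulli trial with probability theta
    y = pm.Bernoulli('y', p=theta, observed=data)
    trace = pm.sample(1000, random_seed=123)
```

The prior and likelihood lines read almost exactly like the mathematical model theta ~ Beta(1, 1), y ~ Bernoulli(theta), which is the point the author is making.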

Pushing the inference button

The author explains quite a bit of the science behind the code.  Luckily, we only need the code to make it work.

Diagnosing the sampling process

Once we have a posterior, we need to determine whether it makes sense.  We have a couple of options:
  • Increase samples
  • Remove samples from the beginning (burn-in)
  • Reparameterize the model
  • Transform the data
We will revisit all of these later in the book when they are appropriate.

Convergence

One test we can run is a visual one.  We build a KDE (kernel density estimation) plot, which looks like a smoothed histogram, alongside a trace plot.  A good trace will look like a bunch of noise.  If the early part looks different from the rest, it could indicate the need for burn-in.  If there is a lack of similarity between chains, or we see a pattern, we may need more steps, a different sampler, or a different parameterization.  We can also use the Gelman-Rubin test as a quantitative way to test our model.  This should work out to about 1; values > 1.1 signal a lack of convergence.
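
A minimal version of the Gelman-Rubin statistic can be computed by hand (this is my own sketch of the classic formula; PyMC3 provides it built in):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for a (n_chains, n_samples) array.

    Compares between-chain and within-chain variance; values near 1
    indicate the chains are sampling the same distribution.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
# four well-behaved chains drawn from the same distribution
good_chains = rng.normal(size=(4, 1000))
rhat = gelman_rubin(good_chains)
```

For these well-mixed chains R-hat comes out very close to 1; chains stuck in different regions would push it well above 1.1.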

Auto correlation

Ideally our samples will lack autocorrelation.  In practice, samples generated with MCMC methods can be autocorrelated, but we expect the autocorrelation to drop to a low value as the lag increases.
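
Autocorrelation at a given lag can be computed directly (a sketch with synthetic chains; the random-walk chain stands in for a badly mixing sampler):

```python
import numpy as np

def autocorr(x, max_lag=20):
    """Sample autocorrelation of a chain at lags 0..max_lag."""
    x = x - x.mean()
    acf = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])
    return acf / acf[0]        # normalize so lag 0 equals 1

rng = np.random.default_rng(1)
iid = rng.normal(size=5000)                  # independent draws
walk = np.cumsum(rng.normal(size=5000))      # strongly correlated chain

acf_iid = autocorr(iid)
acf_walk = autocorr(walk)
```

The independent draws show near-zero autocorrelation beyond lag 0, while the random-walk chain stays close to 1 for many lags.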

Effective Size

The idea to stress here is that, given two samples of the same size, the one without autocorrelation carries more information.  If we have a sample with high autocorrelation, we can estimate the size of an equivalent sample without autocorrelation.  That is the effective size of our sample.  One suggested remedy is to thin our sample, but we only want to do this if we need to reduce storage.
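
A rough effective-size estimate divides the chain length by a factor built from the autocorrelations (my own sketch using a common simple truncation rule; libraries use more careful estimators):

```python
import numpy as np

def effective_size(x, max_lag=100):
    """Rough effective sample size: n / (1 + 2 * sum of autocorrelations).

    Autocorrelations are summed until they first drop below zero.
    """
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.dot(x[:n - k], x[k:]) / denom
        if rho < 0:
            break
        tau += 2 * rho
    return n / tau

rng = np.random.default_rng(2)
iid = rng.normal(size=4000)
# an AR(1) chain: each value carries over 90% of the previous one
ar = np.empty(4000)
ar[0] = 0.0
for t in range(1, 4000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

ess_iid = effective_size(iid)
ess_ar = effective_size(ar)
```

The independent chain keeps nearly all of its nominal size, while the highly autocorrelated chain is worth only a few hundred effective samples.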

Summarizing the posterior

In Python, we can use the plot_posterior function.  It accepts a PyMC3 trace or a NumPy array and returns a histogram with the mean and the 95% HPD.

Posterior-based decisions

Even once we have this data, our decision on what to do with it is subjective.  We can use it to make the most informed decision we can.

ROPE

ROPE (Region of Practical Equivalence) is a range we set based on our knowledge of the subject at hand.  There are three scenarios, depending on how the ROPE overlaps the HPD.
plot_posterior takes many options.  There are two interesting ones: rope, which plots the ROPE on top of the posterior alongside the HPD, and ref_val, which draws a vertical reference line and reports the proportion of the posterior above and below it.
Link:
https://docs.pymc.io/api/plots.html
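
The numeric core of a ROPE check is just a fraction of posterior samples (a sketch; the Beta posterior and the ROPE bounds here are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(3)
# pretend these are posterior samples for a coin's bias
posterior = rng.beta(4, 8, size=10_000)

# ROPE around a "fair coin": values practically equivalent to 0.5
rope = (0.45, 0.55)
inside = ((posterior > rope[0]) & (posterior < rope[1])).mean()
```

If most of the posterior mass falls inside the ROPE we treat the parameter as practically equivalent to the reference value; here only a small fraction does, so a fair coin is not well supported.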

Loss Functions

Loss functions (aka cost functions) mathematically formalize how good or bad an estimated value is, taking into account the cost of making a mistake.  In many problems the cost of a decision is asymmetric: when deciding to create a new vaccine, the cost of getting it wrong is very high.
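
An asymmetric loss pulls the best point estimate away from the posterior mean (a sketch with made-up numbers: overestimating is three times as costly as underestimating):

```python
import numpy as np

rng = np.random.default_rng(4)
posterior = rng.beta(4, 8, size=10_000)   # posterior samples for theta

def expected_loss(estimate, samples, under_cost=1.0, over_cost=3.0):
    """Average loss over the posterior for an asymmetric linear loss."""
    err = estimate - samples
    # overestimates cost over_cost per unit, underestimates under_cost
    return np.mean(np.where(err > 0, over_cost * err, -under_cost * err))

# evaluate the expected loss over a grid of candidate estimates
grid = np.linspace(0, 1, 201)
losses = [expected_loss(g, posterior) for g in grid]
best = grid[np.argmin(losses)]
```

Because overestimating is penalized more heavily, the loss-minimizing estimate lands below the posterior mean (for linear losses like this one it is the under_cost/(under_cost + over_cost) quantile).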
