Solutions to the exercises in the book “Probabilistic Perspectives on Machine Learning”.


## Introduction

This document contains solutions to the exercises for the book Probabilistic Perspectives on Machine Learning, by Kevin P. Murphy. The book is available for free online at http://www.cs.ubc.ca/~murphyk/MLbook/.

Each chapter has a corresponding set of exercise solutions. The exercises are divided into two types: theoretical questions and programming questions. Theoretical questions are meant to be solved without the aid of a computer, although you may find it helpful to use a calculator or computer algebra system to check your work. Programming questions require you to write code in either MATLAB or Python (or another language, if you prefer).

## Basic Probability Review

This appendix reviews some basic probability theory that will be used throughout the book. We begin with a review of random variables, which are functions that map outcomes of random experiments to real numbers. Let X be a random variable. We use upper-case letters for random variables and lower-case letters for their particular values; thus, x is a particular value that X may take on. The set of all possible values that X can take on is called the support of X and denoted by supp(X). For example, if X is the roll of a fair die, then supp(X) = {1,2,3,4,5,6}.

## The Law of Large Numbers

The law of large numbers is a fundamental result in probability and statistics. Informally, it states that the average of a sequence of random variables converges to the expected value as the number of variables in the sequence grows. More precisely, let X1,…,Xn be a sequence of independent and identically distributed (i.i.d.) random variables with common expected value μ. Then, as n→∞, the following holds:

X̄n = (1/n)(X1 + … + Xn) → μ

In other words, the sequence of sample means X̄n converges in probability to the expected value μ. This statement is known as the weak law of large numbers; the strong law of large numbers makes the stronger claim that the convergence holds almost surely.
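The convergence can be seen empirically. The sketch below (plain Python; the seed and sample sizes are arbitrary choices for illustration) draws fair-die rolls and prints the sample mean as n grows:

```python
import random

random.seed(0)

# The law of large numbers in action: as n grows, the sample mean of
# fair-die rolls approaches the true expected value mu = 3.5.
mu = 3.5
means = {}
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    means[n] = sum(rolls) / n
    print(f"n = {n:>7}: sample mean = {means[n]:.4f}")
```

The printed sample means drift toward 3.5, with the largest sample typically closest.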

## The Central Limit Theorem

One of the most important results in probability theory is the central limit theorem. The theorem says that if you have a sequence of independent, identically distributed random variables, then the distribution of their sum will tend to be normal, no matter what the underlying distribution is. This result is important in machine learning because it means that we can use powerful probabilistic tools even when our data isn’t necessarily Normally distributed.

Here are some examples of random variables that are not Normally distributed:

- Bernoulli random variables (i.e. coin flips)
- Binomial random variables (i.e. the number of heads in a sequence of coin flips)
- Categorical random variables (i.e. labels assigned to data points)

Despite the fact that these random variables are not Normally distributed, the central limit theorem tells us that their sums will tend to be Normally distributed. This means that we can still use all of the probabilistic tools that are available for Normally distributed data, even when our data is not Normally distributed.
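As an illustrative sketch, the simulation below sums Bernoulli variables and checks that the standardized sums behave like a standard normal; the parameter values (n, p, number of trials) are arbitrary choices:

```python
import math
import random

random.seed(1)

# Central limit theorem demo: each trial sums n = 1000 Bernoulli(p = 0.3)
# variables.  The standardized sums should look approximately standard normal.
n, p, trials = 1000, 0.3, 2000
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

z_scores = []
for _ in range(trials):
    s = sum(1 for _ in range(n) if random.random() < p)
    z_scores.append((s - mu) / sigma)

# For a standard normal, about 68% of the mass lies within one sigma.
within_one = sum(1 for z in z_scores if abs(z) <= 1) / trials
print(f"fraction within 1 sigma: {within_one:.3f}")
```

The observed fraction lands near 0.68, as the normal approximation predicts.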

## Sampling Distributions

As an example of a sampling-distribution calculation, consider the variance of a binomial distribution. A binomial random variable X with parameters n and p is the sum of n independent Bernoulli(p) variables, each of which has variance p(1 − p). Because the variance of a sum of independent variables is the sum of their variances, Var(X) = np(1 − p).
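A quick simulation (with hypothetical parameter choices) can confirm the formula Var(X) = np(1 − p):

```python
import random

random.seed(2)

# Empirically check Var(X) = n*p*(1-p) for X ~ Binomial(n, p).
n, p, trials = 20, 0.25, 50_000
theory = n * p * (1 - p)  # 20 * 0.25 * 0.75 = 3.75

samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
print(f"theoretical variance: {theory}, empirical variance: {var:.3f}")
```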

## Estimators and Bias

In machine learning, an estimator is a function of the data that is used to estimate the value of a parameter. The bias of an estimator is the difference between the expected value of the estimator and the true value of the parameter. For example, if we use a linear regression model to estimate the mean of a population, the bias of our estimator is the difference between the expected value of the estimated mean and the true mean of the population.

In general, we want our estimators to be as unbiased as possible. However, it is often impossible to find an estimator that is completely unbiased. In some cases, we may sacrifice some bias in order to reduce other types of error (such as variance).
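A classic illustration is the sample variance: dividing the sum of squared deviations by n gives a biased estimator, while dividing by n − 1 removes the bias. The sketch below (uniform data; the sample size and seed are arbitrary) averages both estimators over many repetitions:

```python
import random

random.seed(3)

# Average the biased (1/n) and unbiased (1/(n-1)) variance estimators
# over many small samples of Uniform(0, 1) data, whose true variance
# is 1/12.
true_var = 1 / 12
n, trials = 5, 100_000

biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased_sum += ss / n          # biased: expected value is (n-1)/n * true_var
    unbiased_sum += ss / (n - 1)  # unbiased: expected value is true_var

biased_avg = biased_sum / trials
unbiased_avg = unbiased_sum / trials
print(f"true: {true_var:.4f}, biased avg: {biased_avg:.4f}, unbiased avg: {unbiased_avg:.4f}")
```

The 1/n estimator systematically undershoots the true variance, while the 1/(n − 1) estimator averages out to it.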

There are many different types of estimators, each with its own advantages and disadvantages. Some common examples include Maximum Likelihood Estimators (MLEs), Least Squares Estimators (LSEs), and Bayesian Estimators. Each type of estimator has different properties that make it better or worse for certain applications.

Maximum Likelihood Estimators are a class of estimators that are often used in machine learning. They are called “maximum likelihood” because they choose estimates that maximize the likelihood function. The likelihood function is a mathematical function that describes how likely it is for a given data set to occur given certain parameters. For example, if we have a data set that consists of coin flips, we can use the likelihood function to calculate how likely it is for that data set to occur given different probabilities of heads (p). If we want to find the maximum likelihood estimate for p, we would simply choose the value of p that maximizes the likelihood function.
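The coin-flip example can be sketched directly. The data below are hypothetical, and the grid search stands in for the calculus-based derivation that yields the closed form h/n:

```python
# MLE for the probability of heads from coin-flip data.  The likelihood
# of observing h heads in n flips is proportional to p**h * (1-p)**(n-h);
# a grid search over p recovers the closed-form answer h/n.
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical data: 1 = heads
h, n = sum(flips), len(flips)

def likelihood(p):
    return p ** h * (1 - p) ** (n - h)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(f"grid-search MLE: {p_hat}, closed form h/n: {h / n}")
```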

Least Squares Estimators are another class of estimators that are commonly used in machine learning. They are called “least squares” because they minimize the sum of squared errors. In other words, they choose estimates such that the sum of squared differences between the estimated values and the true values is minimized.
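As a minimal sketch of the least squares idea, consider estimating a single constant c from observations y_i; the minimizer of the squared-error sum is the sample mean (the data here are made up):

```python
# For the simplest model y_i = c + error, the least squares estimate of c
# minimizes sum((y_i - c)**2).  Calculus (or this grid search) shows the
# minimizer is the sample mean.
ys = [2.0, 3.5, 1.0, 4.5, 3.0]  # hypothetical observations
grid = [i / 100 for i in range(0, 601)]
c_hat = min(grid, key=lambda c: sum((y - c) ** 2 for y in ys))
print(f"least squares estimate: {c_hat}, sample mean: {sum(ys) / len(ys)}")
```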

Bayesian Estimators are a class of estimators that use Bayesian inference. Bayesian inference is a way of reasoning about uncertainty using probabilities. In general, Bayesian estimators will output a distribution over possible values rather than a single estimate. This allows us to quantify our uncertainty about the estimate and makes Bayesian methods especially well-suited for problems where there is a lot of uncertainty or noise in the data.

## The Maximum Likelihood Estimator

In this section, we will derive the maximum likelihood estimator (MLE) for a simple linear regression model. Consider the following model:

Y = β0 + β1X + ε

where Y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope, and ε is the error term. We assume that ε is distributed according to a normal distribution with mean 0 and variance σ².
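Under this Gaussian noise assumption, maximizing the likelihood is equivalent to minimizing the sum of squared residuals, which yields the familiar closed-form estimates β1 = cov(x, y)/var(x) and β0 = ȳ − β1·x̄. The sketch below (hypothetical data near y = 2x) computes them:

```python
# MLE / least squares estimates for simple linear regression:
# beta1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),
# beta0 = ybar - beta1 * xbar.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical data near y = 2x

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
beta1 = sxy / sxx
beta0 = ybar - beta1 * xbar
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")
```

The fitted slope comes out close to 2 and the intercept close to 0, matching how the data were chosen.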

## The Method of Moments Estimator

In machine learning, the method of moments is a way of estimating parameters by equating theoretical moments of the assumed distribution with the corresponding statistics computed from the data. The estimator resulting from this method is called the method of moments estimator. In some models the method of moments estimator coincides with the maximum likelihood estimator, but in general the two differ.

There are a few different ways to apply the method of moments to a given data set. One way is to equate the population mean and variance with the sample mean and variance. Another way is to derive the needed moments from the moment generating function of the assumed distribution. The moment estimate of the population mean (the sample mean) is unbiased whenever the population mean exists; method of moments estimators of other parameters, however, are generally consistent but not necessarily unbiased.
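As a small sketch, consider an exponential distribution, whose first moment is 1/rate; equating it with the sample mean gives the method of moments estimate (the rate and seed below are arbitrary choices):

```python
import random

random.seed(4)

# Method of moments for an Exponential(rate) distribution: the first
# moment is E[X] = 1/rate, so equating it to the sample mean gives
# rate_hat = 1 / xbar.
true_rate = 2.0
xs = [random.expovariate(true_rate) for _ in range(100_000)]
xbar = sum(xs) / len(xs)
rate_hat = 1 / xbar
print(f"true rate: {true_rate}, method-of-moments estimate: {rate_hat:.3f}")
```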

## The Maximum a Posteriori Estimator

The Maximum a Posteriori (MAP) estimator is a popular technique for solving inference problems in machine learning. Given a set of data points, the MAP estimator finds the value of the parameter that is most likely to have generated the data. In other words, it finds the value of the parameter that maximizes the posterior probability.

There are many different ways to compute the MAP estimator, and the approach that you take will depend on the type of data that you have and the structure of your model. In this section, we will discuss two common methods for computing the MAP estimator: the first is based on solving an optimization problem, and the second is based on using Monte Carlo methods.
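For a concrete case where the MAP estimate is available in closed form, consider coin flips with a Beta prior; the data and prior below are hypothetical:

```python
# MAP estimate for a coin's heads probability with a Beta(a, b) prior.
# The posterior is Beta(a + h, b + n - h), whose mode (for a, b > 1) is
# (a + h - 1) / (a + b + n - 2).
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical data: 1 = heads
h, n = sum(flips), len(flips)
a, b = 2.0, 2.0  # a mild prior that pulls estimates toward 0.5

p_map = (a + h - 1) / (a + b + n - 2)
p_mle = h / n
print(f"MLE: {p_mle}, MAP: {p_map:.4f}")
```

Note how the prior shrinks the MAP estimate toward 0.5 relative to the MLE; with more data, the two converge.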

## The Maximum Likelihood Estimator


The Maximum Likelihood Estimator (MLE) is a popular technique for solving inference problems in machine learning. Given a set of data points, the MLE finds the value of the parameter that is most likely to have generated the data. In other words, it finds the value of the parameter that maximizes the likelihood function.

There are many different ways to compute the MLE, and the approach that you take will depend on the type of data that you have and the structure of your model. In this section, we will discuss two common methods for computing the MLE: the first is based on solving an optimization problem, and the second is based on using Monte Carlo methods.

## Bayesian Inference

Probabilistic machine learning is a powerful tool for understanding and making predictions from data. In this section, we will focus on one particular type of probabilistic approach, Bayesian inference.

We will start by introducing the concept of Bayesian inference, and then we will work through a number of examples to illustrate how it can be used in machine learning. We will conclude with some suggestions for further reading.

What is Bayesian inference?

In its simplest form, Bayesian inference is a method of statistical inference in which we use probabilities to make decisions about the world around us. For example, if you see a coin on the ground, you might use Bayesian inference to decide whether it is a fair coin or not.

To do this, you would compare the posterior probability that the coin is fair given the flips you observe, P(fair | data), with the posterior probability that it is unfair, P(unfair | data). If P(unfair | data) > P(fair | data), then you would conclude that the coin is unfair; if P(fair | data) > P(unfair | data), then you would conclude that the coin is fair.
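A minimal numerical sketch of this comparison, with hypothetical likelihoods for the two hypotheses and equal prior probabilities:

```python
from math import comb

# Bayes' rule for the fair-vs-unfair coin decision.  Hypothetical setup:
# the fair coin has P(heads) = 0.5, the unfair one 0.8, both hypotheses
# are equally likely a priori, and we observe 8 heads in 10 flips.
h, n = 8, 10
prior_fair, prior_unfair = 0.5, 0.5

like_fair = comb(n, h) * 0.5 ** h * 0.5 ** (n - h)
like_unfair = comb(n, h) * 0.8 ** h * 0.2 ** (n - h)

# Normalize: posterior = likelihood * prior / evidence.
evidence = like_fair * prior_fair + like_unfair * prior_unfair
post_fair = like_fair * prior_fair / evidence
post_unfair = like_unfair * prior_unfair / evidence
print(f"P(fair | data) = {post_fair:.3f}, P(unfair | data) = {post_unfair:.3f}")
```

With 8 heads in 10 flips, the unfair hypothesis ends up with the larger posterior, so we would conclude the coin is unfair.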

Bayesian inference can be used to make decisions about anything that can be represented probabilistically. In machine learning, we use it to make predictions about future events, based on past data. For example, we might use Bayesian inference to predict the likelihood of a person defaulting on a loan, based on their past credit history.

How does Bayesian inference work?

Bayesian inference works by using probabilities to update our beliefs about something in light of new evidence. To do this, we need to define two probabilities: the prior probability and the posterior probability.

The prior probability is our initial belief about something before we take into account any new evidence. For example, if we are trying to predict whether it will rain tomorrow, our prior belief might be that there is a 60% chance of rain. This prior belief could be based on our previous experience (e.g., it has rained on six out of ten days this week), or it could be completely arbitrary.
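Bayes' rule turns such a prior into a posterior once evidence arrives. The numbers below are hypothetical: the 60% prior from the text, plus an assumed forecast accuracy:

```python
# Updating a prior belief with evidence.  Hypothetical numbers: the prior
# is P(rain) = 0.6; a forecast says "rain", and we assume the forecast
# says "rain" 80% of the time when it rains and 30% of the time when it
# does not.
prior_rain = 0.6
p_forecast_given_rain = 0.8
p_forecast_given_dry = 0.3

evidence = (p_forecast_given_rain * prior_rain
            + p_forecast_given_dry * (1 - prior_rain))
posterior_rain = p_forecast_given_rain * prior_rain / evidence
print(f"posterior P(rain | forecast) = {posterior_rain:.3f}")
```

The forecast raises the belief in rain from 0.6 to 0.8, which is exactly the prior-to-posterior update the next section builds on.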
