Writing is thinking. To write well is to think clearly. That's why it's so hard.

~ David McCullough
 

Linear Regression

June 28, 2017

Regression analysis is a common technique for finding a best fit line through a set of data points. We'll illustrate how it works by showing a simple case involving one input feature.

Click in the box below to add or remove data points, and the best fit line will be displayed. Pressing 'c' will clear all data points.

Let's suppose that you start off by giving two data points: $(x_1, y_1)$ and $(x_2, y_2)$. The best fit line in this case is straightforward to work out with some high school algebra:

$$ (y - y_1) = \frac{y_2 - y_1}{x_2 - x_1} (x - x_1) $$
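
As a quick illustration (my own sketch, not part of the original post), here is how that two-point formula looks in Python, solving for the slope and intercept of $y = ax + b$:

```python
def line_through_two_points(x1, y1, x2, y2):
    """Return (slope, intercept) of the line through (x1, y1) and (x2, y2).

    Assumes x1 != x2, i.e. the line is not vertical.
    """
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1  # rearranged from the point-slope form above
    return slope, intercept

# Example: the line through (1, 2) and (3, 6) is y = 2x.
print(line_through_two_points(1, 2, 3, 6))  # (2.0, 0.0)
```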

If you happened to pick three points $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$ that all lie on a common line, the same formula gives the best fit line.

But this situation is not so likely. Even when data does follow a linear trend, noise, errors in measurement, and bad luck will prevent any three points from ever lining up perfectly.

Linear regression attempts to find a line which is, in some sense, as close to all of the points as possible.

Calculating the Error

Trying to find the best fit line from the start might be a tall order. A more manageable problem: given a candidate line, how can we describe how well it fits the given data?

Let's consider the line given by $y = f(x)$ where $f(x) = ax + b$ for some fixed values $a$ and $b$. Here, $a$ represents the slope and $b$ represents the vertical offset.

We can think of the line as a prediction for the $y$ value of a point given its $x$ value. When a data point does not lie perfectly on the line, this shows there is some error in our prediction. The error for a single data point is $f(x) - y$. This is represented in the applet by the vertical blue bars connecting the points to the line.

It seems natural that we would want to sum these values to find the total error across all of our data points. But we have to be a bit careful, since the errors could be negative numbers. So we instead sum their absolute values. We call this total absolute error $\varepsilon_{\text{abs}}$:

$$ \varepsilon_{\text{abs}} = \sum_{i=1}^n |f(x_i) - y_i| = \sum_{i=1}^n |a x_i + b - y_i| $$
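
For concreteness, here is a minimal Python sketch (my own illustration, not from the post) of this total absolute error for a candidate line $f(x) = ax + b$:

```python
def total_absolute_error(a, b, points):
    """Sum of |a*x + b - y| over all data points (x, y)."""
    return sum(abs(a * x + b - y) for x, y in points)

# Example: the line y = x passes through (0, 0) and (1, 1) exactly, but misses (2, 3) by 1.
data = [(0, 0), (1, 1), (2, 3)]
print(total_absolute_error(1, 0, data))  # 1
```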

Since our line is specified by $a$ and $b$, to find the best fit line, all we need to do is find the values of $a$ and $b$ which make this sum as small as possible.

Minimizing the Error

In general, whenever you want to minimize (or maximize) a quantity, you need calculus. We consider $\varepsilon$ as a quantity which depends on the values $a$ and $b$.

Fixing some initial values for $a$ and $b$, we might imagine increasing the value of $a$ and seeing how the value of $\varepsilon$ changes in response. The partial derivative of $\varepsilon$ with respect to $a$, denoted $\frac{\partial \varepsilon}{\partial a}$, tells us how much $\varepsilon$ increases or decreases as we add a tiny amount to $a$. If we find that adding a tiny amount to $a$ does not change $\varepsilon$ to first order - that is, if $\frac{\partial \varepsilon}{\partial a} = 0$ - then we have found a minimum value of $\varepsilon$ among all lines with the value we previously fixed for $b$.

(In general, when the derivative is $0$, we can only conclude that something "interesting" is going on. It might be a minimum... but it could also be a maximum, or a so-called inflection point. These are collectively referred to as critical points. But here, it should be clear if you give it some thought that our function $\varepsilon$ has no maximum points, and I assure you it has no inflection points.)

To find the best-fitting line, then, we need to apply this trick to both parameters. In other words, we need to solve the system of equations:

$$ {\large \begin{array}{lcl} \frac{\partial \varepsilon}{\partial a} &=& 0 \\\\ \frac{\partial \varepsilon}{\partial b} &=& 0 \end{array} } $$

But how exactly do we do that?

A small tweak: Using squares for the error

Rather than solve the problem exactly as we've set it up, we're first going to make use of a common trick from statistics: instead of the sum of the absolute errors, we're going to use the sum of the squares of the errors.

In other words, we're going to instead use this definition for the error, where we have replaced the absolute value bars with squaring:

$$ \varepsilon_{\text{sq}} = \sum_{i=1}^n (f(x_i) - y_i)^2 = \sum_{i=1}^n (a x_i + b - y_i)^2 $$
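
In code, the only change from the absolute error sketch above is replacing the absolute value with a square (again just an illustration):

```python
def total_squared_error(a, b, points):
    """Sum of (a*x + b - y)^2 over all data points (x, y)."""
    return sum((a * x + b - y) ** 2 for x, y in points)

# Same example data as before: the single miss of size 1 still contributes 1,
# but a miss of size 2 would now contribute 4, so large errors are penalized more.
data = [(0, 0), (1, 1), (2, 3)]
print(total_squared_error(1, 0, data))  # 1
```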

Let's talk about the rationale for making this move.

First, it will end up simplifying our analysis. We're taking derivatives, but the absolute value function introduces points where the derivative is ill-defined (it puts sharp corners in the graph of the function). By using squares instead, we still force each term to be non-negative, but our error function $\varepsilon$ is now a smooth function.

Second, sums of squares are particularly well-behaved in mathematics. The Pythagorean theorem is another example where sums of squares appear, and some reasoning by analogy might lead us to feel like the above definition of the error is, in an abstract sense, a kind of squared distance.

Sometimes you will see a factor of $\frac{1}{2}$ in front to cancel the $2$ that comes down from the exponent when we take the derivative. You might also see a factor of $\frac{1}{n}$ in front, in which case the quantity is called the mean squared error. The constant in front won't affect anything, since we will ultimately set the derivative equal to zero and divide it away.

Some calculus

With this change to the squared error, working out the partial derivatives becomes straightforward:

$$ \begin{array}{lcl} \frac{\partial \varepsilon}{\partial a} &=& \frac{\partial}{\partial a} \sum_{i=1}^n (ax_i + b - y_i)^2 \\ &=&\sum \frac{\partial}{\partial a} (ax_i + b - y_i)^2 \\ &=&\sum 2(ax_i + b - y_i) \cdot \frac{\partial}{\partial a}(ax_i + b - y_i) \\ &=&2\sum (ax_i + b - y_i)\cdot x_i \end{array} $$

And

$$ \begin{array}{lcl} \frac{\partial \varepsilon}{\partial b} &=& \frac{\partial}{\partial b} \sum_{i=1}^n (ax_i + b - y_i)^2 \\ &=&\sum \frac{\partial}{\partial b} (ax_i + b - y_i)^2 \\ &=&\sum 2(ax_i + b - y_i) \cdot \frac{\partial}{\partial b}(ax_i + b - y_i)\\ &=&2\sum (ax_i + b - y_i) \end{array} $$

We set these expressions both equal to $0$ and divide each by $2$ to get rid of the constant in front. The result is a system of equations:

$$ \begin{array}{lcl} \sum^n_{i=1} (ax_i + b - y_i)\cdot x_i &=&0\\ \sum^n_{i=1} (ax_i + b - y_i) &=&0 \end{array} $$

Keep in mind that in these formulas, the $x_i$ and $y_i$ are known values. They are our data points. It is $a$ and $b$ which are the unknowns, and they correspond to the slope and the vertical offset of the best fit line respectively. Once we solve this system, we have found our best fit line.
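
If you want to check a candidate $(a, b)$ numerically, here is a small sketch (my own, not from the post) that evaluates the two left-hand sides; at the best fit line both should come out zero (up to floating-point error):

```python
def normal_equation_residuals(a, b, points):
    """Evaluate the two sums that must vanish at the best fit line."""
    r1 = sum((a * x + b - y) * x for x, y in points)  # derivative w.r.t. a, up to a factor of 2
    r2 = sum(a * x + b - y for x, y in points)        # derivative w.r.t. b, up to a factor of 2
    return r1, r2

# The points (0, 0), (1, 1), (2, 2) lie exactly on y = x, so both sums vanish.
print(normal_equation_residuals(1, 0, [(0, 0), (1, 1), (2, 2)]))  # (0, 0)
```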

Solving the equations

For a moment, let's assume that we have just two points $(x_1, y_1)$ and $(x_2, y_2)$. This will let us see more explicitly the form of the problem:

$$ \begin{array}{lcl} \sum^n_{i=1} (ax_i + b - y_i)\cdot x_i &=&\\ (a x_1 + b - y_1) \cdot x_1 + (a x_2 + b - y_2) \cdot x_2 &=& \\ ax_1^2 + bx_1 - x_1 y_1 + ax_2^2 + bx_2 - x_2 y_2 &=&\\ (x_1^2 + x_2^2)a + (x_1 + x_2)b - (x_1 y_1 + x_2 y_2) &=& 0 \end{array} $$

And

$$ \begin{array}{lcl} \sum^n_{i=1} (ax_i + b - y_i) &=&\\ (a x_1 + b - y_1) + (a x_2 + b - y_2) &=& \\ (x_1 + x_2)a + 2b - (y_1 + y_2) &=& 0 \end{array} $$

Collecting the last line of each and shuffling the negative terms around, we get a system of linear equations in $a$ and $b$:

$$ \begin{array}{lcl} (x_1^2 + x_2^2)a + (x_1 + x_2)b &=& (x_1 y_1 + x_2 y_2) \\ (x_1 + x_2)a + 2b &=& (y_1 + y_2) \end{array} $$

We can express this as a matrix equation, which will make it easier to solve when we go to the general case:

$$ \begin{align} \begin{bmatrix} x_1^2 + x_2^2 & x_1 + x_2 \\\\ x_1 + x_2 & 2 \end{bmatrix} \begin{bmatrix} a \\\\ b \end{bmatrix} = \begin{bmatrix} x_1 y_1 + x_2 y_2 \\\\ y_1 + y_2 \end{bmatrix} \end{align} $$

We can see some interesting features. The matrix on the far left is symmetric. It contains the sum of squares of the $x$ values in the upper left, the total number of points $n$ (here $2$) in the lower right, and, in the off-diagonals, the simple sum of the $x$ values.

In general, with $n$ points named $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, the matrix will look like this:

$$ \begin{align} \begin{bmatrix} \sum x_i^2 & \sum x_i \\\\ \sum x_i & n \end{bmatrix} \begin{bmatrix} a \\\\ b \end{bmatrix} = \begin{bmatrix} \sum x_i y_i \\\\ \sum y_i \end{bmatrix} \end{align} $$
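
Once the sums are assembled, solving this $2 \times 2$ system is a single call to a linear solver. Here is a sketch using NumPy (an assumed dependency, not mentioned in the post):

```python
import numpy as np

def fit_line(xs, ys):
    """Solve the 2x2 system above for the best fit line y = a*x + b."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    A = np.array([[np.sum(xs**2), np.sum(xs)],
                  [np.sum(xs),    len(xs)]])
    rhs = np.array([np.sum(xs * ys), np.sum(ys)])
    a, b = np.linalg.solve(A, rhs)
    return a, b

# Noisy points lying roughly along y = 2x + 1.
print(fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8]))  # approximately (1.94, 1.09)
```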

To get the values of $a$ and $b$, we calculate the inverse of the matrix on the left and multiply both sides by it:

$$ \begin{align} \begin{bmatrix} a \\\\ b \end{bmatrix} = \frac{1}{n(\sum x_i^2) - (\sum x_i)^2} \begin{bmatrix} n & -\sum x_i \\\\ -\sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} \sum x_i y_i \\\\ \sum y_i \end{bmatrix} \\\\ \end{align} $$

And this gives us an explicit formula for calculating $a$ and $b$:

$$ {\Large \begin{array}{lcl} a &=& \frac{n (\sum x_i y_i) - (\sum x_i)(\sum y_i)}{n (\sum x_i^2) - (\sum x_i)^2}\\\\ b &=& \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i y_i)(\sum x_i)}{n (\sum x_i^2) - (\sum x_i)^2} \end{array} } $$
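
The same result can be computed directly from these formulas, with no matrix solver needed. A pure-Python sketch:

```python
def fit_line_closed_form(points):
    """Best fit line (a, b) from the explicit formulas above."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    denom = n * sum_xx - sum_x ** 2  # zero only when all x values coincide
    a = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_xx * sum_y - sum_xy * sum_x) / denom
    return a, b

# Same data as the matrix version above, same answer.
print(fit_line_closed_form([(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]))  # approximately (1.94, 1.09)
```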

And so the best fit line is specified explicitly in terms of the given data.

qed