Skip to main content

Section7.5The Method of Least Squares

Objectives
  1. Learn examples of best-fit problems.
  2. Learn to turn a best-fit problem into a least-squares problem.
  3. Recipe: find a least-squares solution (two ways).
  4. Picture: geometry of a least-squares solution.
  5. Vocabulary: least-squares solution.

In this section, we answer the following important question:

Suppose that Ax = b does not have a solution. What is the best approximate solution?

For our purposes, the best approximate solution is called the least-squares solution. We will present two methods for finding least-squares solutions, and we will give several applications to best-fit problems.

Subsection7.5.1Least-Squares Solutions

We begin by clarifying exactly what we will mean by a “best approximate solution” to an inconsistent matrix equation Ax = b .

Definition

Let A be an m × n matrix and let b be a vector in R m . A least-squares solution of the matrix equation Ax = b is a vector K x in R n such that

dist ( b , A K x ) dist ( b , Ax )

for all other vectors x in R n .

Recall that dist ( v , w )= A v w A is the distance between the vectors v and w . The term “least squares” comes from the fact that dist ( b , Ax )= A b A K x A is the square root of the sum of the squares of the entries of the vector b A K x . So a least-squares solution minimizes the sum of the squares of the differences between the entries of A K x and b . In other words, a least-squares solution solves the equation Ax = b as closely as possible, in the sense that the sum of the squares of the difference b Ax is minimized.

Least Squares: Picture

Suppose that the equation Ax = b is inconsistent. Recall from this note in Section 2.4 that the column space of A is the set of all other vectors c such that Ax = c is consistent. In other words, Col ( A ) is the set of all vectors of the form Ax . Hence, the closest vector of the form Ax to b is the orthogonal projection of b onto Col ( A ) . This is denoted b Col ( A ) , following this notation in Section 7.3.

Col A Ax Ax Ax A K x = b Col ( A ) b b A K x = b Col ( A ) 0

A least-squares solution of Ax = b is a solution K x of the consistent equation Ax = b Col ( A )

Note

If Ax = b is consistent, then b Col ( A ) = b , so that a least-squares solution is the same as a usual solution.

Where is K x in this picture? If v 1 , v 2 ,..., v n are the columns of A , then

A K x = A EIIGK x 1 K x 2 ... K x n FJJH = K x 1 v 1 + K x 2 v 2 + ··· + K x n v n .

Hence the entries of K x are the “coordinates” of b Col ( A ) with respect to the spanning set { v 1 , v 2 ,..., v m } of Col ( A ) . (They are honest B -coordinates if the columns of A are linearly independent.)

Col A v 1 v 2 K x 1 v 1 K x 2 v 2 A K x = b Col ( A ) b b A K x = b Col ( A ) 0
Figure4The violet plane is Col ( A ) . The closest that Ax can get to b is the closest vector on Col ( A ) to b , which is the orthogonal projection b Col ( A ) (in blue). The vectors v 1 , v 2 are the columns of A , and the coefficients of K x are the lengths of the green lines. Click and drag b to move it.
Note

If Ax = b is consistent, then b Col ( A ) = b , so that a least-squares solution is the same as a usual solution.

We learned to solve this kind of orthogonal projection problem in Section 7.3.

Proof

In particular, finding a least-squares solution means solving a consistent system of linear equations. We can translate the above theorem into a recipe:

Recipe 1: Compute a least-squares solution

Let A be an m × n matrix and let b be a vector in R m . Here is a method for computing a least-squares solution of Ax = b :

  1. Compute the matrix A T A and the vector A T b .
  2. Form the augmented matrix for the matrix equation A T Ax = A T b , and row reduce.
  3. This equation is always consistent, and any solution K x is a least-squares solution.

To reiterate: once you have found a least-squares solution K x of Ax = b , then b Col ( A ) is equal to A K x .

The reader may have noticed that we have been careful to say “the least-squares solutions” in the plural, and “a least-squares solution” using the indefinite article. This is because a least-squares solution need not be unique: indeed, if the columns of A are linearly dependent, then Ax = b Col ( A ) has infinitely many solutions. The following theorem, which gives equivalent criteria for uniqueness, is an analogue of this corollary in Section 7.3.

Proof

As usual, calculations involving projections become easier in the presence of an orthogonal set. Indeed, if A is an m × n matrix with orthogonal columns u 1 , u 2 ,..., u m , then we can use the projection formula in Section 7.4 to write

b Col ( A ) = b · u 1 u 1 · u 1 u 1 + b · u 2 u 2 · u 2 u 2 + ··· + b · u m u m · u m u m = A EIIG ( b · u 1 ) / ( u 1 · u 1 )( b · u 2 ) / ( u 2 · u 2 ) ... ( b · u m ) / ( u m · u m ) FJJH .

Note that the least-squares solution is unique in this case, since an orthogonal set is linearly independent.

Recipe 2: Compute a least-squares solution

Let A be an m × n matrix with orthogonal columns u 1 , u 2 ,..., u m , and let b be a vector in R n . Then the least-squares solution of Ax = b is the vector

K x = L b · u 1 u 1 · u 1 , b · u 2 u 2 · u 2 ,..., b · u m u m · u m M .

This formula is particularly useful in the sciences, as matrices with orthogonal columns often arise in nature.

Subsection7.5.2Best-Fit Problems

In this subsection we give an application of the method of least squares to data modeling. We begin with a basic example.

Example(Best-fit line)

Suppose that we have measured three data points

( 0,6 ) , ( 1,0 ) , ( 2,0 ) ,

and that our model for these data asserts that the points should lie on a line. Of course, these three points do not actually lie on a single line, but this could be due to errors in our measurement. How do we predict which line they are supposed to lie on?

( 0,6 ) ( 1,0 ) ( 2,0 )

The general equation for a (non-vertical) line is

y = Mx + B .

If our three data points were to lie on this line, then the following equations would be satisfied:

6 = M · 0 + B 0 = M · 1 + B 0 = M · 2 + B . (7.5.1)

In order to find the best-fit line, we try to solve the above equations in the unknowns M and B . As the three points do not actually lie on a line, there is no actual solution, so instead we compute a least-squares solution.

Putting our linear equations into matrix form, we are trying to solve Ax = b for

A = C 011121 D x = L MB M b = C 600 D .

We solved this least-squares problem in this example: the only least-squares solution to Ax = b is K x = A MB B = A 35 B , so the best-fit line is

y = 3 x + 5.
( 0,6 ) ( 1,0 ) ( 2,0 ) y = 3 x + 5

What exactly is the line y = f ( x )= 3 x + 5 minimizing? The least-squares solution K x minimizes the sum of the squares of the entries of the vector b A K x . The vector b is the left-hand side of (7.5.1), and

A L 35 M = C 3 ( 0 )+ 5 3 ( 1 )+ 5 3 ( 2 )+ 5 D = C f ( 0 ) f ( 1 ) f ( 2 ) D .

In other words, A K x is the vector whose entries are the y -coordinates of the graph of the line at the values of x we specified in our data points, and b is the vector whose entries are the y -coordinates of those data points. The difference b A K x is the vertical distance of the graph from the data points:

( 0,6 ) ( 1,0 ) ( 2,0 ) 1 2 1 y = 3 x + 5 b A K x = C 600 D A L 35 M = C 12 1 D

The best-fit line minimizes the sum of the squares of these vertical distances.

All of the above examples have the following form: some number of data points ( x , y ) are specified, and we want to find a function

y = B 1 g 1 ( x )+ B 2 g 2 ( x )+ ··· + B m g m ( x )

that best approximates these points, where g 1 , g 2 ,..., g m are fixed functions of x . Indeed, in the best-fit line example we had g 1 ( x )= x and g 2 ( x )= 1; in the best-fit parabola example we had g 1 ( x )= x 2 , g 2 ( x )= x , and g 3 ( x )= 1; and in the best-fit linear function example we had g 1 ( x 1 , x 2 )= x 1 , g 2 ( x 1 , x 2 )= x 2 , and g 3 ( x 1 , x 2 )= 1 (in this example we take x to be a vector with two entries). We evaluate the above equation on the given data points to obtain a system of linear equations in the unknowns B 1 , B 2 ,..., B m —once we evaluate the g i , they just become numbers, so it does not matter what they are—and we find the least-squares solution. The resulting best-fit function minimizes the sum of the squares of the vertical distances from the graph of y = f ( x ) to our original data points.

To emphasize that the nature of the functions g i really is irrelevant, consider the following example.

The next example has a somewhat different flavour from the previous ones.

Note

Gauss invented the method of least squares to find a best-fit ellipse: he correctly predicted the (elliptical) orbit of the asteroid Ceres as it passed behind the sun in 1801.