Topics on Linear Algebra for Data Analysis - Part 2
Sugandha Roy, Hasnain Hossain
December 20, 2021
Projection of a vector onto another vector has many important
applications
Orthogonal Projection and Least Squares Approximation
Orthogonal Projection: The Linear Algebra Way
-
Projection on a vector: The orthogonal
decomposition of $\boldsymbol{x}$ along $\boldsymbol{v}$ means
$\boldsymbol{x}=\boldsymbol{p+z}$ such that
$\boldsymbol{p}=t\boldsymbol{v}$ ($t$ is a scalar) and
$\boldsymbol{z}\perp \boldsymbol{v}$. $\boldsymbol{p} = proj_\boldsymbol{v}\boldsymbol{x}$
is called the orthogonal projection of $\boldsymbol{x}$ on $\boldsymbol{v}$. Show that
$$\boldsymbol{p}=\frac{\boldsymbol{x^T v}}{\boldsymbol{v^T
v}}\boldsymbol{v}$$
and, if $\boldsymbol{v}$ is a unit vector, then
$\boldsymbol{p=(x^T v)v}$. Note that here we can consider
$\boldsymbol{x^T v}$ as the coordinate of $\boldsymbol{p}$ in the
space spanned by $\boldsymbol{v}$.
Solution:
Let $\boldsymbol{p}=t\boldsymbol{v}$, where $t$ is a scalar. Then
$\boldsymbol{z}=\boldsymbol{x-p}=\boldsymbol{x}-t\boldsymbol{v}.$
$\boldsymbol{z}$ is orthogonal to $\boldsymbol{v}$ if and only if
$$0=(\boldsymbol{x}-t\boldsymbol{v})\cdot\boldsymbol{v}$$
$$\Rightarrow 0=\boldsymbol{x}\cdot\boldsymbol{v}-t(\boldsymbol{v}\cdot\boldsymbol{v})$$
$$\Rightarrow 0=\boldsymbol{x^T v}-t\,\boldsymbol{v^T v}$$
$$\therefore t= \frac{\boldsymbol{x^T v}}{\boldsymbol{v^T v}}$$
Since $\boldsymbol{p}=t\boldsymbol{v}$, we can write
$$\boldsymbol{p}=\frac{\boldsymbol{x^T v}}{\boldsymbol{v^T v}}\boldsymbol{v}$$
If $\boldsymbol{v}$ is a unit vector, then $\boldsymbol{v}\cdot\boldsymbol{v}=1$.
$$\therefore \boldsymbol{p=(x^T v)v}$$
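As a quick numerical sanity check, here is a minimal NumPy sketch of the formula above (the vectors `x` and `v` are arbitrary examples):

```python
import numpy as np

x = np.array([3.0, 4.0, 0.0])
v = np.array([1.0, 1.0, 1.0])

# p = (x^T v / v^T v) v
t = (x @ v) / (v @ v)
p = t * v

# The residual z = x - p is orthogonal to v
z = x - p
print(p)        # [2.333... 2.333... 2.333...]
print(z @ v)    # ~0

# If v is normalized to a unit vector u, p = (x^T u) u gives the same vector
u = v / np.linalg.norm(v)
print((x @ u) * u)
```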
-
Projection on a subspace: Projection of
$\boldsymbol{x}$ on a subspace $W$, such that
$\boldsymbol{x}\notin \boldsymbol{W}$, is given by
$$\boldsymbol{p}=proj_W \boldsymbol{x}=\hat{\boldsymbol{x}}=V
V^T\boldsymbol{x}$$
where $\boldsymbol{W}$ is spanned by the orthonormal basis $B=
\{v_1, v_2,\dots, v_k\}$ and
$V = [v_1, v_2, \dots, v_k]$ is the matrix with the $v_i$'s as
columns.
$VV^T$ is called the projection matrix.
Solution:
$\boldsymbol{x}=\boldsymbol{p+z}$ such that
$\boldsymbol{p}\in\boldsymbol{W}$, i.e.,
$\hat{\boldsymbol{x}}\in\boldsymbol{W}$.
Then $\hat{\boldsymbol{x}}=\boldsymbol{p}=\alpha_1
\boldsymbol{v_1}+\dots+\alpha_k
\boldsymbol{v_k}=\boldsymbol{V}\boldsymbol{\alpha}$
and $\boldsymbol{z} = \boldsymbol{x}-\boldsymbol{V\alpha}$. We also
have $\boldsymbol{z}\perp\boldsymbol{W}$, which means
$\boldsymbol{z}\perp\boldsymbol{w}$ for any
$\boldsymbol{w}\in\boldsymbol{W}$. In particular,
$\boldsymbol{z}\perp\boldsymbol{v_i}$ for each basis vector, because $\boldsymbol{z}$ is in
$\boldsymbol{W}^{\perp}$ and the subspace $W$ is spanned by the
orthonormal basis vectors $\boldsymbol{v_i}$.
So, $$\boldsymbol{z}\perp\boldsymbol{v_1} \Rightarrow
\boldsymbol{v_1^T z}=0\Rightarrow
\boldsymbol{v_1^T}(\boldsymbol{x}-\boldsymbol{V\alpha})=0 $$ $$
\vdots$$ $$\boldsymbol{z}\perp\boldsymbol{v_k} \Rightarrow
\boldsymbol{v_k^T z}=0\Rightarrow
\boldsymbol{v_k^T}(\boldsymbol{x}-\boldsymbol{V\alpha})=0$$
Therefore, $$\begin{pmatrix} \boldsymbol{v_1^T}\\
\boldsymbol{v_2^{T}}\\ \vdots\\ \boldsymbol{v_k^{T}}
\end{pmatrix}(\boldsymbol{x}-\boldsymbol{V\alpha})=\boldsymbol{0}$$
$$\Rightarrow \boldsymbol{V^T}(\boldsymbol{x}-\boldsymbol{V\alpha})=\boldsymbol{0}$$
$$\Rightarrow \boldsymbol{V^T x}-\boldsymbol{V^T
V\alpha}=\boldsymbol{0}$$ $$\Rightarrow \boldsymbol{V^T
x}-\boldsymbol{\alpha}=\boldsymbol{0}$$ since $\boldsymbol{V^T
V}=\boldsymbol{I}$ (the columns of $\boldsymbol{V}$ are orthonormal).
$$\therefore \boldsymbol{\alpha}=\boldsymbol{V^T x}$$ We have
$\hat{\boldsymbol{x}}=\boldsymbol{V \alpha}$. $$\therefore
\hat{\boldsymbol{x}}=\boldsymbol{V V^T x}$$
Note: Projection of $\boldsymbol{x}$ in the direction of
$\boldsymbol{v_i}$, for $i=1,2, \dots,k$:
$$proj_{\boldsymbol{v_i}}{\boldsymbol{x}}=(\boldsymbol{x^Tv_i)v_i}$$
Therefore, the projection onto the subspace can be expressed as $$
\begin{split} \hat{\boldsymbol{x}} & = proj_{W}{\boldsymbol{x}} =
\boldsymbol{VV^T}\boldsymbol{x} \\ &
=(\boldsymbol{x^Tv_1)v_1}+(\boldsymbol{x^Tv_2)v_2}+\dots+(\boldsymbol{x^Tv_k)v_k}\\
& = proj_{\boldsymbol{v_1}}\boldsymbol{x}+
proj_{\boldsymbol{v_2}}\boldsymbol{x}+\dots+proj_{\boldsymbol{v_k}}\boldsymbol{x}
\end{split} $$
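The identity above is easy to verify numerically. A minimal sketch, where the orthonormal basis `V` is obtained from a QR factorization of a random matrix (an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2

# Columns of V: an orthonormal basis v_1, ..., v_k of a random subspace W
V, _ = np.linalg.qr(rng.standard_normal((d, k)))
x = rng.standard_normal(d)

# Projection via the projection matrix V V^T
p = V @ V.T @ x

# The same projection as the sum of the one-dimensional projections (x^T v_i) v_i
p_sum = sum((x @ V[:, i]) * V[:, i] for i in range(k))
print(np.allclose(p, p_sum))            # True

# The residual x - p is orthogonal to every basis vector
print(np.allclose(V.T @ (x - p), 0))    # True
```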
-
Orthogonal Projection Gives the Best Approximation: Using the Pythagorean theorem, show that
$\left\|{\boldsymbol{x-p}}\right\|^2 <
{\left\|\boldsymbol{x-q}\right\|}^2$, where $\boldsymbol{p} = proj_W
\boldsymbol{x}$, $\boldsymbol{q}$ is in $W$, and $\boldsymbol{p}\neq
\boldsymbol{q}$. That is, $\boldsymbol{p}$ is the best approximation
of $\boldsymbol{x}$ in the subspace $W$.
Solution:
Both $\boldsymbol{p}$ and $\boldsymbol{q}$ are in $\boldsymbol{W}$
and distinct from each other, so $\boldsymbol{p-q}$ is in
$\boldsymbol{W}$. $\boldsymbol{z} = \boldsymbol{x-p}$ is orthogonal
to $\boldsymbol{W}$; in particular, $\boldsymbol{x-p}$ is orthogonal
to $\boldsymbol{p-q}$. Now write
$$\boldsymbol{x-q}=(\boldsymbol{x-p})+(\boldsymbol{p-q})$$
Since the two terms on the right are orthogonal, the Pythagorean theorem gives
$$\left\|{\boldsymbol{x-q}}\right\|^2=\left\|{\boldsymbol{x-p}}\right\|^2+\left\|{\boldsymbol{p-q}}\right\|^2
$$ Now $\left\|{\boldsymbol{p-q}}\right\|^2>0$ because
$\boldsymbol{p-q}\neq \boldsymbol{0}$, which implies
$\left\|{\boldsymbol{x-p}}\right\|^2<\left\|{\boldsymbol{x-q}}\right\|^2$
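A short numerical illustration of the inequality (a sketch with a random subspace and a random $\boldsymbol{q} \neq \boldsymbol{p}$ in $W$):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2
V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of W
x = rng.standard_normal(d)

p = V @ V.T @ x                  # orthogonal projection of x onto W
q = V @ rng.standard_normal(k)   # some other point of W (almost surely != p)

# ||x - p||^2 < ||x - q||^2
print(np.sum((x - p) ** 2), "<", np.sum((x - q) ** 2))
```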
-
Change of Coordinates:
$$ \begin{split} proj_W\boldsymbol{x} &= \boldsymbol{VV^Tx}\\ & =
\begin{bmatrix} |& &|\\ \boldsymbol{v_1}&...&\boldsymbol{v_k}\\ |&
&| \end{bmatrix} \begin{bmatrix} \boldsymbol{v_1^{T}}\\
\boldsymbol{v_2^{T}}\\ \vdots\\ \boldsymbol{v_k^{T}}\\
\end{bmatrix}\boldsymbol{x}\\ & = \begin{bmatrix} |& &|\\
\boldsymbol{v_1}& \dots &\boldsymbol{v_k}\\ |& &|
\end{bmatrix}\begin{bmatrix} \boldsymbol{x^{T}v_1}\\
\boldsymbol{x^{T}v_2}\\ \vdots\\ \boldsymbol{x^{T}v_k}
\end{bmatrix}\\ & = \begin{bmatrix} |& &|\\ \boldsymbol{v_1}& \dots
&\boldsymbol{v_k}\\ |& &| \end{bmatrix}\begin{bmatrix} \alpha_{1}\\
\alpha_{2}\\ \vdots\\ \alpha_{k} \end{bmatrix}\\
&\therefore\big[\hat{\boldsymbol{x}}\big]_{S}=V
\big[\hat{\boldsymbol{x}}\big]_{B} \end{split} $$
Observe that
-
$V^T \boldsymbol{x} = \big[\hat{\boldsymbol{x}}\big]_{B}$ gives
the coordinates of the projection (using basis $B$) and
-
then the coordinates of the projection are changed from the basis
$B$ to $S$ (the standard basis).
-
If $\boldsymbol{x}$ is from a $d$-dimensional vector space, then
$\big[\hat{\boldsymbol{x}}\big]_{S}$ is a $d{\times}1$ vector and
$\big[\hat{\boldsymbol{x}}\big]_{B}$ is a $k{\times}1$ vector.
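A minimal sketch of this change of coordinates (the dimensions $d=4$, $k=2$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 2
V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis B of W
x = rng.standard_normal(d)

coords_B = V.T @ x        # [x_hat]_B: the k coordinates in the basis B
x_hat_S = V @ coords_B    # [x_hat]_S: the d coordinates in the standard basis

print(coords_B.shape, x_hat_S.shape)        # (2,) (4,)
print(np.allclose(x_hat_S, V @ V.T @ x))    # True
```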
Orthogonal Projection: The Calculus Way
-
Projection on a vector: Using calculus, find the
vector closest (in the least squares sense) to $\boldsymbol{x}$ in the
direction of $\boldsymbol{v}$. In other words, find the least squares
approximation of $\boldsymbol{x}$ in the space spanned by
$\boldsymbol{v}$.
Solution:
A vector $\boldsymbol{p}$ in the direction of $\boldsymbol{v}$ is
given by $\boldsymbol{p}=t\boldsymbol{v}$, where $t$ is a scalar.
Therefore, we need to find $t$ that minimizes
$J(t)=\left\|{\boldsymbol{x}-t\boldsymbol{v}}\right\|^2$
$$J(t)=\left\|{\boldsymbol{x}-t\boldsymbol{v}}\right\|^2$$
$$\Rightarrow J(t)= (\boldsymbol{x}-t\boldsymbol{v})^T
(\boldsymbol{x}-t\boldsymbol{v})=(\boldsymbol{x}^T-t\boldsymbol{v}^T)(\boldsymbol{x}-t\boldsymbol{v})$$
$$\Rightarrow
J(t)=\boldsymbol{x}^T\boldsymbol{x}-t\boldsymbol{v}^T\boldsymbol{x}
-
t\boldsymbol{x}^T\boldsymbol{v}+t^2(\boldsymbol{v}^T\boldsymbol{v})$$
$$\begin{equation} \label{firsteq} \therefore
J(t)=\left\|{\boldsymbol{x}}\right\|^2-2t\boldsymbol{x}^T\boldsymbol{v}+t^2\left\|{\boldsymbol{v}}\right\|^2
\end{equation}$$ Differentiating equation (\ref{firsteq}) with
respect to $t$, we get,
$$J'(t)=-2\boldsymbol{x}^T\boldsymbol{v}+2t\left\|{\boldsymbol{v}}\right\|^2$$
Setting $J'(t)=0$ and solving for $t$ we get,
$$-2\boldsymbol{x}^T\boldsymbol{v}+2t\left\|{\boldsymbol{v}}\right\|^2=0$$
$$\therefore
t=\frac{\boldsymbol{x}^T\boldsymbol{v}}{\left\|{\boldsymbol{v}}\right\|^2}$$
Therefore, projection of $\boldsymbol{x}$ on $\boldsymbol{v}$
defined as $\boldsymbol{p}$ can be written as,
$$\boldsymbol{p}=\frac{\boldsymbol{x}^T
\boldsymbol{v}}{\left\|{\boldsymbol{v}}\right\|^2}\boldsymbol{v}$$
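As a sketch, minimizing $J(t)$ over a fine grid of $t$ values recovers the closed-form minimizer (the vectors are arbitrary examples):

```python
import numpy as np

x = np.array([3.0, 4.0, 0.0])
v = np.array([1.0, 1.0, 1.0])

# Evaluate J(t) = ||x - t v||^2 on a grid of t values
ts = np.linspace(-5, 5, 100001)
J = np.sum((x[None, :] - ts[:, None] * v[None, :]) ** 2, axis=1)

t_grid = ts[np.argmin(J)]            # numerical minimizer
t_closed = (x @ v) / (v @ v)         # closed form: x^T v / ||v||^2
print(t_grid, t_closed)              # both ~ 2.3333
```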
-
Projection on a subspace: We seek the closest
approximation of the vector $\boldsymbol{x}$ in the subspace
$\boldsymbol{W}$, which has dimension $k$. Assume the
$\boldsymbol{v_i}$'s for $i=1,2,...,k$ form an orthonormal basis for
$\boldsymbol{W}$. Find the $\alpha_{i}$'s for $i=1,2,...,k$ such that
the error $J$, given by \[ J =
\big\|{\boldsymbol{x-\hat{\boldsymbol{x}}}}\big\|^{2} =
\big\|{\boldsymbol{x}-\sum_{i=1}^{k}\alpha_{i}\boldsymbol{v}_i}\big\|
^{2}\] is minimized.
Solution:
Any vector in $\boldsymbol{W}$ can be written as
$\sum_{i=1}^{k}\alpha_i\boldsymbol{v}_i$, so we approximate
$\boldsymbol{x}$ by a vector in $\boldsymbol{W}$ of this form.
To minimize the error $J$ we
need to take partial derivatives. $$ \begin{equation} \begin{split}
J(\alpha_{1},...,\alpha_{k})&=
\big\|{\boldsymbol{x}-\sum_{i=1}^{k}\alpha_{i}\boldsymbol{v}_i}\big\|^{2}
\\ &= (\boldsymbol{x}-\sum_{i=1}^{k} \alpha_i\boldsymbol{v}_i)^T
(\boldsymbol{x} - \sum_{i=1}^{k}\alpha_i\boldsymbol{v}_i)\\ &=
(\boldsymbol{x}^T-\sum_{i=1}^{k} \alpha_{i}\boldsymbol{v}_i^T)
(\boldsymbol{x}-\sum_{i=1}^{k}\alpha_i\boldsymbol{v}_i)\\ &=
\boldsymbol{x}^T \boldsymbol{x}-\sum_{i=1}^{k}
\alpha_{i}\boldsymbol{v}_i^T\boldsymbol{x}
-\boldsymbol{x}^T\sum_{i=1}^{k}\alpha_{i}
\boldsymbol{v}_i+(\sum_{i=1}^{k}\alpha_i\boldsymbol{v}_i^T)
(\sum_{i=1}^{k}\alpha_i \boldsymbol{v}_i)\\
&=
&=\big\|\boldsymbol{x}\big\|^2-2\sum_{i=1}^{k}\alpha_i
\boldsymbol{x}^T\boldsymbol{v}_i +\sum_{i=1}^{k}\alpha_i ^2
\Bigg[\because \big\|\boldsymbol{v}_i\big\|^2=1 \text{ and }
\boldsymbol{v}_i\text{'s are orthogonal}\Bigg] \end{split}
\end{equation} $$
Then we take the partial derivative with respect to each $\alpha_{i}$ and set
it to 0 for the optimal value. We get,
$$-2\boldsymbol{x}^T\boldsymbol{v}_{i}+2\alpha_{i}= 0$$ $$\therefore
\alpha_i =\boldsymbol{x}^T\boldsymbol{v}_i$$
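Equivalently, a general least squares solver applied to $\boldsymbol{V}$ recovers the same coefficients; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 5, 3
V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of W
x = rng.standard_normal(d)

# alpha minimizing ||x - V alpha||^2, from a generic least squares routine ...
alpha_lstsq, *_ = np.linalg.lstsq(V, x, rcond=None)

# ... agrees with alpha_i = x^T v_i, since the columns of V are orthonormal
print(np.allclose(alpha_lstsq, V.T @ x))   # True
```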
Ordinary Least Squares Regression
Suppose we have $N$ datapoints $(\boldsymbol{x}^{(1)}, y^{(1)}),
\ldots , (\boldsymbol{x}^{(N)}, y^{(N)})$, where
$\boldsymbol{x}^{(i)}$'s are $d$-vectors $[\boldsymbol{x}^{(i)}_1,
\boldsymbol{x}^{(i)}_2, \dots, \boldsymbol{x}^{(i)}_d]$ (\textit{i.e.,
$x$ contains the values of $d$ features or (independent) variables})
and $y$ is a real number (\textit{called the dependent variable}).
We assume there is a function $f(\boldsymbol{x})$ such that $y =
f(\boldsymbol{x})$. In linear regression, based on the data, we want
to find a linear (affine) function $\hat{f}$ that approximates $f$ (in
the least squares sense).
Let,
\begin{align*} \hat{y} = \hat{f}(\mathbf{x}) &= \beta_0 + \beta_1 x_1
+ \ldots + \beta_d x_d \\ &= \beta_0 x_0 + \beta_1 x_1 + \ldots +
\beta_d x_d \quad [\text{letting } x_0 = 1]\\ \end{align*}
That is, for each of the $N$ datapoints $\boldsymbol{x}^{(i)}$:
\begin{align*} \beta_0 x_0^{(1)} + \beta_1 x_1^{(1)} + \ldots +
\beta_d x_d^{(1)} &= \hat{y}^{(1)} \\ \beta_0 x_0^{(2)} + \beta_1
x_1^{(2)} + \ldots + \beta_d x_d^{(2)} &= \hat{y}^{(2)} \\ \vdots \\
\beta_0 x_0^{(N)} + \beta_1 x_1^{(N)} + \ldots + \beta_d x_d^{(N)} &=
\hat{y}^{(N)} \\ \end{align*}
\begin{align*} \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \ldots &
x_d^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \ldots & x_d^{(2)} \\ \vdots &
\vdots & \ddots & \vdots \\ x_0^{(N)} & x_1^{(N)} & \ldots & x_d^{(N)}
\end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d
\end{bmatrix} &= \begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \\
\vdots \\ \hat{y}^{(N)} \end{bmatrix} \\ \begin{bmatrix} 1 & x_1^{(1)}
& \ldots & x_d^{(1)} \\ 1 & x_1^{(2)} & \ldots & x_d^{(2)} \\ \vdots &
\vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \ldots & x_d^{(N)}
\end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d
\end{bmatrix} &= \begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \\
\vdots \\ \hat{y}^{(N)} \end{bmatrix} \\ \implies \boldsymbol{A}
\boldsymbol{\beta} = \mathbf{\hat{y}} \end{align*}
That means, we need to find the coefficients $\beta_0, ..., \beta_d$
that satisfy the above equations. We can view this problem as finding
the solution to the system of linear equations ($\boldsymbol{A}
\boldsymbol{\beta} = \mathbf{\hat{y}}$). However, this is an
overdetermined system (more equations (or rows) than variables (or
columns)). Therefore, we can only find the best $\boldsymbol{\beta}$
that approximately solves the system of linear equations.
Or, we can also view this as an optimization problem and find the
$\boldsymbol{\beta}$ that minimizes the Mean Squared Error (MSE):
\begin{equation*} \frac{1}{N} \sum^{N}_{i=1} (y^{(i)} - \hat{y}^{(i)})^2
= \frac{1}{N} \lvert \lvert \mathbf{y} - \mathbf{\hat{y}} \rvert
\rvert^{2} = \frac{1}{N} \lvert \lvert \mathbf{y} - \boldsymbol{A}
\boldsymbol{\beta} \rvert \rvert^{2} \end{equation*}
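A minimal sketch of building the design matrix $\boldsymbol{A}$ (with the $x_0 = 1$ column) and evaluating the MSE for a candidate $\boldsymbol{\beta}$; the data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 50, 2
X = rng.standard_normal((N, d))                        # N datapoints, d features
y = 1.0 + X @ np.array([2.0, -3.0]) + 0.1 * rng.standard_normal(N)

A = np.column_stack([np.ones(N), X])   # prepend the x_0 = 1 column for beta_0
beta = np.array([1.0, 2.0, -3.0])      # a candidate coefficient vector

mse = np.mean((y - A @ beta) ** 2)     # (1/N) ||y - A beta||^2
print(mse)
```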
Note: Usually in linear regression the features or
independent variables are transformed to create a new set of
variables. This can be done through basis functions
$\phi_j(\mathbf{x})$ that transform the data and create a datapoint
in the transformed feature space, $z_j = \phi_j(\mathbf{x}), \; \; j =
1, \ldots ,p$. We then do linear regression using the transformed
datapoints. Note that the basis functions can be non-linear.
\begin{align*} \hat{f}(\mathbf{x}) &= \beta_0 + \beta_1 z_1 + \ldots +
\beta_p z_p \\ &= \beta_0 \phi_0 (\mathbf{x}) + \beta_1 \phi_1
(\mathbf{x}) + \ldots + \beta_p \phi_p (\mathbf{x}) \quad
[\phi_0(\mathbf{x}) = 1] \end{align*}
Again, the MSE can be written as
\begin{align*} \frac{1}{N} \lvert \lvert \mathbf{y} - \boldsymbol{A}
\boldsymbol{\beta} \rvert \rvert^2 \end{align*}
where,
\begin{gather*} A = \begin{bmatrix} \phi_0 (x^{(1)}) & \phi_1
(x^{(1)}) & \ldots & \phi_p (x^{(1)}) \\ \phi_0 (x^{(2)}) & \phi_1
(x^{(2)}) & \ldots & \phi_p (x^{(2)}) \\ \vdots & \vdots & \ddots &
\vdots \\ \phi_0 (x^{(N)}) & \phi_1 (x^{(N)}) & \ldots & \phi_p
(x^{(N)}) \end{bmatrix}, \; \boldsymbol{\beta} = \begin{bmatrix}
\beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \end{gather*}
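For instance, with the polynomial basis $\phi_j(x) = x^j$ this becomes polynomial regression; a minimal sketch on synthetic one-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)   # synthetic data

# Basis functions phi_j(x) = x^j, j = 0, ..., p (phi_0 = 1)
p = 3
A = np.vander(x, p + 1, increasing=True)    # columns: 1, x, x^2, x^3

beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes ||y - A beta||^2
print(beta)
```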
The Linear Algebra Way
Suppose we have a system of linear equations
$A\boldsymbol{\beta}=\boldsymbol{y}$, where $A$ is an $N \times k$
matrix.
Here, $A= \begin{bmatrix} |& &|\\
\boldsymbol{a_1}&...&\boldsymbol{a_k}\\ |& &| \end{bmatrix} $,
$\boldsymbol{\beta}=\begin{bmatrix} \beta_1\\ \beta_2\\ \vdots\\
\beta_k \end{bmatrix}$, and $A\boldsymbol{\beta}=
\beta_1\boldsymbol{a_1}+
\beta_2\boldsymbol{a_2}+\dots+\beta_k\boldsymbol{a_k}.$
If the system of linear equations does not have a solution,
$\boldsymbol{y} \neq \beta_1\boldsymbol{a_1}+
\beta_2\boldsymbol{a_2}+\dots+\beta_k\boldsymbol{a_k}$, i.e.,
$\boldsymbol{y} \notin Col(A)=span\{\boldsymbol{a_1},
\boldsymbol{a_2}, \dots, \boldsymbol{a_k}\}$. Then the least squares
solution is $\hat{\boldsymbol{\beta}}$ such that
$\left\|{\boldsymbol{y}-A\hat{\boldsymbol{\beta}}}\right\|^2$
is minimized.
Observe that what we are asking for are the coordinates of
$\hat{\boldsymbol{y}}=proj_{Col(A)}\boldsymbol{y}$. Now, we may not have an orthonormal basis of
$Col(A)$; that is, the columns of $A$ might not be orthonormal. Rather, we
have a basis
$B'=\{\boldsymbol{a_1},\boldsymbol{a_2},...,\boldsymbol{a_k}\}$ of
$Col(A)$, assuming the columns of $A$ are linearly independent. Show,
$$\hat{\boldsymbol{\beta}}=(A^T A)^{-1}A^T \boldsymbol{y}$$
$$\hat{\boldsymbol{y}}=proj_{Col(A)}\boldsymbol{y}=A\hat{\boldsymbol{\beta}}=A(A^T
A)^{-1}A^T \boldsymbol{y}$$ Note, $A(A^T A)^{-1}A^T$ is called the
projection matrix and $A^\dagger=(A^T A)^{-1}A^T$ is called the
pseudo-inverse of $A$.
Solution: When a solution is demanded and none
exists, the best we can do is find a
$\boldsymbol{\beta}$ that makes $A\boldsymbol{\beta}$ as close as
possible to $\boldsymbol{y}$. Here, $A$ is $N \times k$ and
$\boldsymbol{y}$ is in $\mathbb{R}^N$.
Let $\hat{\boldsymbol{y}}=$ proj$_{Col(A)}\boldsymbol{y}$. Because
$\hat{\boldsymbol{y}}$ is in the column space of $A$, the equation
$A\boldsymbol{\beta}=\hat{\boldsymbol{y}}$ is consistent and there is
a $\hat{\boldsymbol{\beta}}$ in $\mathbb{R}^k$ such that
$$A\hat{\boldsymbol{\beta}}=\hat{\boldsymbol{y}}$$ The projection
$\hat{\boldsymbol{y}}$ has the property that
$\boldsymbol{y}-\hat{\boldsymbol{y}}$ is orthogonal to $Col(A)$, so
$(\boldsymbol{y}-A\hat{\boldsymbol{\beta}})$ is orthogonal to each
column of $A$. If $\boldsymbol{a_{j}}$ is any column of $A$, then
$\boldsymbol{a_{j}}\cdot(\boldsymbol{y}-A\hat{\boldsymbol{\beta}})=0$, i.e.,
$\boldsymbol{a_{j}^T}(\boldsymbol{y}-A\hat{\boldsymbol{\beta}})=0$.
Since each $\boldsymbol{a_{j}^T}$ is a row of $A^T$,
$A^T(\boldsymbol{y}-A\hat{\boldsymbol{\beta}})=\boldsymbol{0}$
$\Rightarrow
A^T\boldsymbol{y}-A^TA\hat{\boldsymbol{\beta}}=\boldsymbol{0}$
$\Rightarrow A^TA\hat{\boldsymbol{\beta}}=A^T\boldsymbol{y}$ [These
are called the Normal Equations] $\therefore
\hat{\boldsymbol{\beta}}=(A^T A)^{-1}A^T \boldsymbol{y}$ Finally we
have, $\hat{\boldsymbol{y}}=$
proj$_{Col(A)}\boldsymbol{y}=A\hat{\boldsymbol{\beta}}=A(A^T
A)^{-1}A^T \boldsymbol{y}$
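A minimal NumPy sketch of the normal equations and the pseudo-inverse on random data:

```python
import numpy as np

rng = np.random.default_rng(6)
N, k = 100, 3
A = rng.standard_normal((N, k))   # full column rank with probability 1
y = rng.standard_normal(N)

# Normal equations: solve A^T A beta = A^T y
beta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Same answer via the pseudo-inverse and via a generic least squares routine
print(np.allclose(beta_hat, np.linalg.pinv(A) @ y))                  # True
print(np.allclose(beta_hat, np.linalg.lstsq(A, y, rcond=None)[0]))   # True

# The residual y - A beta_hat is orthogonal to Col(A)
print(np.allclose(A.T @ (y - A @ beta_hat), 0))                      # True
```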
Using QR decomposition:
Alternatively, $\hat{\boldsymbol{y}}=\boldsymbol{QQ^T}\boldsymbol{y}$,
where $A=\boldsymbol{QR}$ is the $QR$ decomposition of $A$.
Here, the columns of $\boldsymbol{Q}$ form an orthonormal basis for
$Col(A)$ and $\boldsymbol{R}$ is an upper triangular invertible
matrix.
$$A\hat{\boldsymbol{\beta}}=\hat{\boldsymbol{y}} \\ \Rightarrow
\boldsymbol{QR}\hat{\boldsymbol{\beta}}=\boldsymbol{QQ^T}\boldsymbol{y}
\\ \Rightarrow \boldsymbol{Q^TQR}\hat{\boldsymbol{\beta}}=\boldsymbol{Q^TQQ^T}\boldsymbol{y}
\\ \Rightarrow \boldsymbol{R}\hat{\boldsymbol{\beta}}=\boldsymbol{Q^T}\boldsymbol{y}
\quad [\because \boldsymbol{Q^TQ}=\boldsymbol{I}]
\\\therefore \hat{\boldsymbol{\beta}}=\boldsymbol{R^{-1}Q^T y}$$ Note,
the pseudo-inverse of $A$ is $A^\dagger = R^{-1}Q^T$.
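A sketch of the QR route (using NumPy's reduced QR and a triangular solve from SciPy):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(7)
N, k = 100, 3
A = rng.standard_normal((N, k))
y = rng.standard_normal(N)

Q, R = np.linalg.qr(A)                   # reduced QR: Q is N x k, R is k x k
beta_qr = solve_triangular(R, Q.T @ y)   # solves R beta = Q^T y (R upper triangular)

# Agrees with the normal equations solution
beta_ne = np.linalg.solve(A.T @ A, A.T @ y)
print(np.allclose(beta_qr, beta_ne))     # True
```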
The Calculus Way
Using matrix derivative rules, we can find the optimal
$\boldsymbol{\beta}$ with calculus:
\begin{align*} \lvert \lvert \mathbf{y} - \boldsymbol{A}
\boldsymbol{\beta} \rvert \rvert^2 &= (\mathbf{y} - \boldsymbol{A}
\boldsymbol{\beta})^T (\mathbf{y} - \boldsymbol{A} \boldsymbol{\beta})
\\ &= \lvert \lvert \mathbf{y} \rvert \rvert^2 - 2 \mathbf{y}^T
\boldsymbol{A} \boldsymbol{\beta} + \boldsymbol{\beta}^T
\boldsymbol{A}^T \boldsymbol{A} \boldsymbol{\beta} \\
\frac{\partial}{\partial \boldsymbol{\beta}} \lvert \lvert \mathbf{y}
- \boldsymbol{A} \boldsymbol{\beta} \rvert \rvert^2 &= 0 - 2
\boldsymbol{A}^T \mathbf{y} + 2 \boldsymbol{A}^T
\boldsymbol{A}\boldsymbol{\beta} \\ \therefore
\frac{\partial}{\partial \boldsymbol{\beta}} \lvert \lvert \mathbf{y}
- \boldsymbol{A} \boldsymbol{\beta} \rvert \rvert^2 &= 0 \\ \implies
\boldsymbol{A}^T \boldsymbol{A} \boldsymbol{\beta} = \boldsymbol{A}^T
\mathbf{y} \\ \therefore \hat{\boldsymbol{\beta}} = (\boldsymbol{A}^T
\boldsymbol{A})^{-1} \boldsymbol{A}^T \mathbf{y} \end{align*}
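A finite-difference check of the gradient used above, as a minimal sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(8)
N, k = 30, 3
A = rng.standard_normal((N, k))
y = rng.standard_normal(N)
beta = rng.standard_normal(k)

f = lambda b: np.sum((y - A @ b) ** 2)   # ||y - A beta||^2

# Analytic gradient from the derivation above: -2 A^T y + 2 A^T A beta
grad = -2 * A.T @ y + 2 * A.T @ A @ beta

# Central finite-difference approximation, coordinate by coordinate
eps = 1e-6
grad_fd = np.array([(f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
                    for e in np.eye(k)])
print(np.allclose(grad, grad_fd, atol=1e-4))   # True
```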