
Matrix Decompositions

Breaking down the essentials of matrix decompositions, explaining how they are used in machine learning and why determinants, traces, eigenvalues, and eigenvectors enable us to draw better decision boundaries when designing our machine learning models.

Matrix essentials

There are various tools one can use to derive the properties of matrices and make sense of the linear transformations or linear combinations they represent.

Scalars, vectors, matrices

  • Basis vectors → The smallest set of vectors that spans the entire vector space.
  • Linear dependence
  • Linear independence

Analytic geometry

  • How long is the vector?
    • Norm → Measures length of vector.
  • What is the distance between vectors?
    • Metric → The measure of the distance between two vectors.
      • d(\textbf{u}, \textbf{v})=\|\textbf{u}-\textbf{v}\|
  • How are the vectors related to each other?
    • Inner products → Tell us whether they are orthogonal.
    • Orthogonality → Tells us whether vectors are perpendicular.
  • What geometry of space is formed?
    • Inner Product → Captures the geometry of a vector space by giving the angle between two vectors. The angle tells us how similar the vectors’ orientations are.
    • Dot Product → Results in a scalar.
      • \textbf{a}\cdot \textbf{b} = a_{1}b_{1}+a_{2}b_{2} + a_{3}b_{3}
      • \textbf{a}\cdot \textbf{b} = |\textbf{a}|\,|\textbf{b}|\cos\theta
      • Cosine similarity is the dot product divided by the product of the vectors’ norms
      • \textbf{a} \cdot \textbf{b} = 0 means the vectors are orthogonal
    • Cross Product → Results in a vector, specifically one perpendicular to the other vectors.
      • \textbf{a} \times \textbf{b}= \langle a_{2}b_{3} - a_{3}b_{2},a_{3}b_{1} - a_{1}b_{3},a_{1}b_{2} - a_{2}b_{1} \rangle
      • The vector \textbf{a} \times \textbf{b} is orthogonal to both \textbf{a} and \textbf{b}
      • \textbf{a} \times \textbf{b} = \textbf{0} means the vectors are parallel

The dot product lines up two vectors of the same dimension, pairs up their coordinates, multiplies those pairs together, and adds up the results.
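
To make these operations concrete, here is a minimal NumPy sketch of the norm, dot product, cosine similarity, and cross product; the vectors a and b are arbitrary examples.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Norm: the length of a vector
norm_a = np.linalg.norm(a)

# Dot product: pair up coordinates, multiply, and sum -> a scalar
dot = np.dot(a, b)                      # same as (a * b).sum()

# Cosine similarity: dot product divided by the product of the norms
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Cross product: a vector orthogonal to both a and b
c = np.cross(a, b)
print(np.dot(c, a), np.dot(c, b))       # both ~0, confirming orthogonality
```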

Matrix decomposition

  • What action does it take?
    • Inverse → Applying a transformation and then its inverse is equivalent to doing nothing (returning to the original state, i.e. the Identity).
    • Matrix multiplication → Multiplying matrices composes actions; the order of multiplication determines the particular sequence of actions taken.
  • How has the space been affected by the change?
    • Determinant → Tells us how much the space was “scaled” by.
      • If \text{det}=0, then the transformation squishes space onto a single line, which tells us the columns are linearly dependent; it also means there is no inverse.
      • This reduces the span of the matrix’s columns
    • Trace → The sum of the diagonal elements.
      • The trace of a product of matrices is invariant under cyclic permutations.
  • What is its orientation?
  • Does it stretch?
    • Eigenvector → When a linear transformation occurs, this vector remains on the same span.
      • In the case of a rotation transformation, this property implies it is the axis of rotation.
    • Eigenvalue → The scalar factor by which the eigenvector is stretched or squished along its span.
  • How does it span?
    • Span → All possible vectors that can be created by linear combinations of those vectors.
    • Rank → Tells you how many dimensions the output of a transformation has. When it contains all the dimensions it originally started with, it has Full Rank.
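
As a quick sanity check of the quantities in the list above, here is a small NumPy sketch; the matrix A is just an illustrative example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

det_A = np.linalg.det(A)             # how much the transformation scales area
tr_A = np.trace(A)                   # sum of the diagonal elements
rank_A = np.linalg.matrix_rank(A)    # dimensionality of the output space

# Eigenvectors stay on their own span; eigenvalues are the stretch factors
eigvals, eigvecs = np.linalg.eig(A)

# Applying A and then its inverse returns to the identity (det != 0, so A is invertible)
identity_check = np.linalg.inv(A) @ A    # ~ np.eye(2)
```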

Matrix factorization

Also known as matrix decomposition, this is the process of describing a matrix with just a few numbers or factors that characterize its overall properties.

Linear transformations

A transformation is a fancy way of saying “function”: you take an input and produce an output. The particular use of this terminology should imply that the vector has moved because of the transformation. More specifically, it’s best to think of an input vector mapping to an output vector; this framing will help in our understanding of linear transformations.

All lines must remain lines and the origin must remain fixed in place as these transformations happen. In the case of vectors, you input a vector and output another vector. This enables us to see a matrix a little differently — as a certain transformation of space.

The determinant of a transformation tells us how much the transformation scales the area (or volume) of the space it affects.

  • If the determinant is 0, that means the space has been squished onto a line or a point (or onto a plane, if it’s three dimensions)
  • If this is the case, then it tells us that the vectors are linearly dependent
  • If the determinant is negative, then it is flipping or inverting the space
  • Taking the inverse of a matrix is applying the reverse of the transformation. So if A applies a transformation, then A^{-1} is the reverse of that transformation, taking us back to where A started. Composing the two results in the identity matrix, as if nothing was done to affect i and j.
    • If the determinant ≠ 0, then the inverse transformation does exist. If it equals 0, then space gets squished down a dimension and no inverse transformation can undo that.
  • When the output of a transformation is a line, we say it has rank 1. When it is a plane, it’s rank 2. So the rank is the number of dimensions the output falls within. When it transforms to the same number of dimensions it started with, we say it has full rank.
  • Because linear transformations must always keep the origin in place, the 0 vector is always in the column space. For full rank transformations, the only vector that lands at the origin is the 0 vector. But for matrices that aren’t full rank, squished to a smaller dimension, you can have a whole bunch of vectors that land on 0. This set of vectors that land on the origin is known as the null space, or the kernel, of the matrix. It’s the space of all vectors that become null because they land on the 0 vector.
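
Below is a hedged NumPy sketch of these ideas, using the SVD to recover the null space of a deliberately rank-deficient example matrix B.

```python
import numpy as np

# Second column is a multiple of the first, so the columns are linearly dependent
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.det(B))           # ~0: space is squished onto a line
print(np.linalg.matrix_rank(B))   # 1: the output is one-dimensional

# Null space (kernel): right-singular vectors whose singular values are ~0
U, s, Vh = np.linalg.svd(B)
null_space = Vh[s < 1e-10].T      # each column is sent to the zero vector
print(B @ null_space)             # ~[[0], [0]]
```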

Determinants

  • What is it?
    • A determinant maps a (square) matrix to a real number.
    • For a (square) matrix \textbf{A} \in \mathbb{R}^{n \times n}, the determinant would be written as \text{det}(\textbf{A}).
    • In an example where we have \textbf{A} \in \mathbb{R}^{2 \times 2}, we would express this as follows:
      • \text{det}(\textbf{A}) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}
  • What does it tell us?
    • For any square matrix \textbf{A} \in \mathbb{R}^{n \times n}, it holds that \textbf{A} is invertible if and only if \text{det}(\textbf{A}) \neq 0.
    • It tells us about the changes in volume of the space the matrix inhabits.
      • Because the determinant tells us about the area/volume, if the vectors are linearly dependent, then the area equals zero. Therefore, if the vectors are linearly independent, their determinant is non-zero.
      • When you have your matrix, if you swap the columns, you are essentially reversing the orientation. This will have an impact on our machine learning models and the decision boundary.
        • Example: If you have the matrix \begin{bmatrix} b & 0 \\ 0 & g \end{bmatrix}, you would compute the determinant as follows: bg - 0 = bg. However, if you swapped the columns of the matrix as follows: \begin{bmatrix} 0 & b \\ g & 0 \end{bmatrix}, then computing the determinant would be as follows: 0 - gb = -gb. This clearly shows that the orientation is flipped and reversed.
      • The sign of the determinant indicates the orientation of the spanning vectors with respect to the standard basis.
      • The determinant acts as a function that measures the signed volume formed by column vectors composed in a matrix.
    • Determinant as Scaling Factor:
      • If the determinant of a matrix is positive, it indicates that the transformation scales volumes by a positive factor. In other words, the transformation expands or contracts space without flipping it.
      • If the determinant is negative, it means that the transformation includes a reflection, and the volume of the parallelepiped is scaled by a negative factor. This implies a flip or an orientation reversal.
  • How to compute determinants
    • 2×2 matrix
    • 3×3 matrix
      • Laplace’s expansion
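
To make the 2×2 rule and Laplace’s expansion concrete, here is a small sketch of a recursive cofactor expansion along the first row, cross-checked against np.linalg.det; the 3×3 matrix is an arbitrary example.

```python
import numpy as np

def det_laplace(A):
    """Determinant by Laplace (cofactor) expansion along the first row."""
    n = len(A)
    if n == 1:
        return A[0][0]
    if n == 2:  # a11*a22 - a12*a21
        return A[0][0] * A[1][1] - A[0][1] * A[1][0]
    total = 0.0
    for j in range(n):
        # Minor: delete row 0 and column j
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det_laplace(minor)
    return total

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
print(det_laplace(A), np.linalg.det(np.array(A)))  # both ~8.0
```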

Trace

  • What is a trace?
    • The trace of a square matrix is the sum of its diagonal elements. An important property is that the trace of a product is invariant under cyclic permutations, i.e. \text{tr}(\textbf{AKL})=\text{tr}(\textbf{KLA}).
    • It is defined as the following, where essentially the trace is the sum of the diagonal elements of \textbf{A}:
    • \text{tr}(\textbf{A}):=\sum^{n}_{i=1}a_{ii}
    • The trace of a linear mapping is independent of the basis (while matrix representations of linear mappings usually do depend on the basis).
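
A quick NumPy check of the cyclic-permutation property, using three arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, K, L = (rng.standard_normal((3, 3)) for _ in range(3))

print(np.trace(A))           # sum of the diagonal elements of A
print(np.trace(A @ K @ L))   # tr(AKL)
print(np.trace(K @ L @ A))   # tr(KLA): equal under a cyclic permutation
print(np.trace(L @ A @ K))   # tr(LAK): also equal
```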

Characteristic Polynomial

  • The characteristic polynomial of a square matrix is a polynomial equation obtained by subtracting a scalar multiple of the identity matrix from the matrix itself, and then finding the determinant of the resulting expression. The roots of this polynomial are the eigenvalues of the matrix, which have significant applications in linear algebra and various mathematical fields.
  • It is represented as follows:
    1. p_{\textbf{A}}(\lambda):=\text{det}(\textbf{A}-\lambda\textbf{I}) = c_{0}+c_{1}\lambda+c_{2}\lambda^{2}+\dots+c_{n-1}\lambda^{n-1}+(-1)^{n}\lambda^{n}
    2. The coefficients of the characteristic polynomial are c_{0},\dots,c_{n-1}\in \mathbb{R}. Specifically,
    3. c_{0}=\text{det}(\textbf{A}), \quad c_{n-1}=(-1)^{n-1}\text{tr}(\textbf{A})
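
As a sketch of this connection in NumPy: np.poly returns the coefficients of the monic polynomial \text{det}(\lambda\textbf{I}-\textbf{A}), which differs from the convention above by a factor of (-1)^{n} but has the same roots, namely the eigenvalues. The matrix A is an illustrative example.

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Coefficients of det(lambda*I - A), highest degree first: [1, -tr(A), det(A)] for 2x2
coeffs = np.poly(A)
print(coeffs)               # [ 1. -7. 10.]

# The roots of the characteristic polynomial are the eigenvalues
print(np.roots(coeffs))         # eigenvalues 5 and 2 (order may vary)
print(np.linalg.eigvals(A))     # same values
```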

Eigenvalues & Eigenvectors

  • What are eigenvalues & eigenvectors?
    • An eigenvalue of a linear mapping will tell us how a special set of vectors, the eigenvectors, are transformed by the linear mapping. This is similar to a scalar, so you can think of this as a scaling factor for how much the space is stretched or compressed in specific directions.
      • If the value is negative, then the direction of the scaling is flipped.
      • Eigenvalues are independent of the choice of basis, which makes them useful when analyzing linear transformations.
      • Symmetric, positive definite matrices always have positive, real eigenvalues.
      • One way to think about eigenvalues is that:
        • The determinant of a square matrix is the product of its eigenvalues
        • For a matrix \textbf{A}\in \mathbb{R}^{n\times{n}}, we get \text{det}(\mathbf{A})=\prod^{n}_{i=1}\lambda_{i}
        • The trace of a square matrix is the sum of its eigenvalues
        • For a matrix \textbf{A}\in \mathbb{R}^{n\times{n}}, we get \text{tr}(\mathbf{A})=\sum^{n}_{i=1}\lambda_{i}
    • An eigenvector is the companion of an eigenvalue: it specifies a direction that remains unchanged by the transformation, being only scaled by its eigenvalue.
      • Eigenvectors associated with distinct eigenvalues are linearly independent.
    • The eigenvalue equation is as follows: Let \textbf{A}\in\mathbb{R}^{n\times{n}} be a square matrix. Then \lambda\in\mathbb{R} is an eigenvalue of \textbf{A} and x\in\mathbb{R}^{n}\setminus\{0\} is the corresponding eigenvector of \textbf{A} if \textbf{A}x=\lambda{x}.
      • When a linear transformation occurs and a vector remains on its own span, that vector is what we call an “eigen” vector.
    • Two vectors that point in the same direction are called codirected.
    • Two vectors that point in the same or the opposite direction are collinear.
    • Example: If x is an eigenvector of \textbf{A} associated with eigenvalue \lambda, then for any c\in\mathbb{R}\setminus\{0\} it holds that cx is an eigenvector of \textbf{A} with the same eigenvalue. This is so because:
    • \textbf{A}(cx)=c\textbf{A}x=c\lambda{x}=\lambda(cx)
    • All vectors that are collinear to x are also eigenvectors of \textbf{A}.
    • A scalar \lambda is an eigenvalue of a square matrix if and only if it is a root of the characteristic polynomial.
      • The algebraic multiplicity is the number of times the root appears in the characteristic polynomial.
      • The geometric multiplicity is the number of linearly independent eigenvectors associated with the eigenvalue of a square matrix.
      • The geometric multiplicity cannot exceed the algebraic multiplicity, but it may be lower.
    • The set of all eigenvectors associated with an eigenvalue spans a subspace known as an eigenspace.
    • An eigenspectrum is the set of all eigenvalues.
  • Properties of eigenvalues and eigenvectors:
    • A matrix A\textbf{A} and its transpose A\textbf{A}^{\top} possess the same eigenvalues, but not necessarily the same eigenvectors.
  • What are matrix decompositions?
    • Just as you can break a number down into its prime factors, a matrix decomposition breaks a matrix down into characteristic pieces. The analogues of those “prime factors” are the eigenvalues, eigenvectors, and so on.
    • A Cholesky decomposition is this same process but most akin to a square root: the matrix is factored into a term multiplied by its own transpose, and it applies specifically to symmetric, positive definite matrices.
      • So in this case, the matrix \textbf{A}=\textbf{LL}^{\top} would be represented as shown below, with \textbf{L} being the lower-triangular Cholesky factor (a NumPy sketch of this and the other decompositions appears after this list).
      • \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{bmatrix} = \begin{bmatrix} l_{11} & \dots & 0 \\ \vdots & \ddots & \vdots \\ l_{n1} & \dots & l_{nn} \end{bmatrix} \begin{bmatrix} l_{11} & \dots & l_{n1} \\ \vdots & \ddots & \vdots \\ 0 & \dots & l_{nn} \end{bmatrix}
  • What does diagonalization mean?
    • This is a preferred property when found because it makes calculations MUCH easier, especially when dealing with matrices of very high powers.
      • To determine if a matrix is diagonalizable, it must have n linearly independent eigenvectors.
    • A matrix \textbf{A}\in \mathbb{R}^{n\times{n}} is diagonalizable if it is similar to a diagonal matrix, i.e. if there exists an invertible matrix \textbf{P}\in\mathbb{R}^{n\times{n}} such that \textbf{D}=\textbf{P}^{-1}\textbf{AP}.
    • A matrix \textbf{B} is similar to matrix \textbf{A} if there exists an invertible matrix \textbf{P} such that \textbf{P}^{-1}\textbf{AP}=\textbf{B}.
    • When a matrix is similar to a diagonal matrix, it is diagonalizable.
  • How do you apply decompositions to non-square matrices?
    • A Singular Value Decomposition is a way to decompose all types of matrices (not just square ones), quantifying the change between the underlying geometry of the vector spaces involved in linear mappings.
      • It decomposes the matrix into three simpler matrices: \textbf{U}, \Sigma, \textbf{V}^{\top}
        • \textbf{U} = orthogonal matrix
        • \Sigma = diagonal matrix with singular values
        • \textbf{V}^{\top} = orthogonal matrix
  • What benefit does all of this give us?
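
Below is a consolidated NumPy sketch of the decompositions discussed in this list: eigendecomposition and diagonalization, the Cholesky factorization, and the SVD. The matrices are illustrative examples, and the symmetric, positive definite matrix S is constructed as M Mᵀ + I so that the Cholesky factor is guaranteed to exist.

```python
import numpy as np

# Eigendecomposition and diagonalization: D = P^{-1} A P
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigvals, P = np.linalg.eig(A)              # columns of P are eigenvectors
D = np.linalg.inv(P) @ A @ P               # ~ diag(eigvals)

print(np.prod(eigvals), np.linalg.det(A))  # product of eigenvalues = determinant
print(np.sum(eigvals), np.trace(A))        # sum of eigenvalues = trace

# Cholesky: A = L L^T for a symmetric, positive definite matrix
M = np.array([[1.0, 2.0],
              [0.0, 1.0]])
S = M @ M.T + np.eye(2)                    # symmetric positive definite by construction
L = np.linalg.cholesky(S)                  # lower-triangular Cholesky factor
print(np.allclose(L @ L.T, S))             # True

# SVD works for any matrix, square or not
B = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])
U, sigma, Vt = np.linalg.svd(B)            # U, V^T orthogonal; sigma holds singular values
print(np.allclose(U @ np.diag(sigma) @ Vt[:2], B))   # reconstructs B
```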

Jacobian Matrix

  • A Jacobian matrix is a way of listing out all the partial derivatives with respect to each variable in a matrix format. The first column gives the partial derivatives with respect to the first variable, the second column with respect to the second variable, and so on.
    • The Jacobian matrix of a function \textbf{f}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is the m \times n matrix composed of the partial derivatives of \textbf{f} evaluated at \textbf{a}:
    • [\textbf{Jf}(\textbf{a})]= \begin{bmatrix} D_{1}f_{1}(\textbf{a}) & \dots & D_{n}f_{1}(\textbf{a}) \\ \vdots & & \vdots \\ D_{1}f_{m}(\textbf{a}) & \dots & D_{n}f_{m}(\textbf{a}) \end{bmatrix}
  • As a rule of thumb, on a small enough scale, everything behaves linearly; on a larger scale, nonlinearities often appear.
  • The object of differential calculus is to study nonlinear mappings by replacing them with linear transformations: we replace nonlinear equations with linear equations, curved surfaces by their tangent planes, and so on.
  • The partial derivative D_{i}f answers the question: how fast does the function change when you vary the i-th variable, keeping the other variables constant?
  • Now, when a system allows all the components of its argument to change (so the argument can move in any direction, as a vector), and the function is differentiable, its derivative consists of a matrix, called the Jacobian matrix, whose entries are the partial derivatives of the function.
  • An important concept is that the limit defining the derivative must hold as the increment approaches 0 from any direction.
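
As a minimal sketch, here is a finite-difference approximation of a Jacobian in NumPy for an example map f: ℝ² → ℝ²; both the function f and the step size h are illustrative assumptions rather than anything from the text.

```python
import numpy as np

def f(x):
    """Example nonlinear map f: R^2 -> R^2."""
    return np.array([x[0] ** 2 * x[1], 5.0 * x[0] + np.sin(x[1])])

def jacobian(f, a, h=1e-6):
    """Approximate the m x n Jacobian of f at a by central differences.
    Column j holds the partial derivatives with respect to the j-th variable."""
    a = np.asarray(a, dtype=float)
    cols = []
    for j in range(a.size):
        e = np.zeros_like(a)
        e[j] = h
        cols.append((f(a + e) - f(a - e)) / (2 * h))  # D_j f(a)
    return np.column_stack(cols)

a = np.array([1.0, 2.0])
print(jacobian(f, a))
# Analytic Jacobian at (1, 2): [[2*x0*x1, x0^2], [5, cos(x1)]] = [[4, 1], [5, cos(2)]]
```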