Vector Spaces & Linear Mappings

The use of vectors is commonplace in machine learning, with the use of functions and linear transformations that result in vectors and/or probability distributions. Below we will dive into the mechanics of vector spaces and subspaces that encompass these linear transformations. We won’t cover specifically how they are applied in machine learning here, but will in our other posts pertaining to matrix decompositions and analytic geometry.

While Cartesian Coordinates work well in the context of flat surfaces and spaces, the use of scalars, vectors, and matrices open us up to a new paradigm. With these tools, one is able to treat traversals across space in a more abstract and nuanced way, one that has proven to be effective in advancing physics forward — as well as the field of machine learning. Much of machine learning is based on the computations of matrix decompositions and transformations, which warrants us diving into the understanding of these three mathematical objects.

Scalars, vectors, & matrices

When advancing along the real number line to count, instead think of it more as scaling a single number along the line. The motion that occurs in scaling from one other to the other is in essence what a scalar mathematically presents, expressed as a single numerical output. If it was a -5, it scaled to the left and if it was a +5 it scaled to the right. With a vector, the scaling needs to be accompanied with direction, thereby giving us the magnitude. A matrix, or matrices in plural, is essentially a set of linear equations but comes in the form of an expanded vector.

Scalars, vectors and matrices are useful in that they are like our coordinate systems we used on the Cartesian plane, but are useful in higher mathematics and higher dimensions when we need a universal way of describing our operations. What makes these powerful is that whether we need to stitch together various coordinate systems across various types of geometries, these three concepts are our consistent pathfinders.

We’ll introduce a concept called a transposing of a vector, which is an efficient way of presenting a vector. Normally a vector would appear as follows:

a =\begin{bmatrix} 5 \\ 3 \\ \end{bmatrix}

But the vertical height can take up unnecessary space if we consider higher-dimensional vectors. So instead you can present this as follows:

a^{\top}=\begin{bmatrix} 5 & 3 \\ \end{bmatrix}

Our three perpendicular axes $(x, y, z)$ are also known as basis vectors. When we begin to solve any set of linear functions, these basis vectors have unit length and lay the foundation for us to apply our operations on. This will make sense later, but you’ll come to find that transformations of matrices, using different coordinate systems, all these things would be confusing unless we have a grounding concept of basis vectors on hand. The basis vectors are attributed with $i, j, k$ coordinates and a carat, often called a "hat" on top: $\hat{i}, \; \hat{j}, \; \hat{k}$ . A vector itself is often presented with a right arrow on top and a vector along these basis vectors could be written as:

\vec{V} \; = \; V_{x}\hat{i} + V_{y}\hat{j} + V_{z}\hat{k}

Those coefficients are also known as the components of the vectors and influence the magnitude of the vectors (very similar how the slope $m$ influences the angle of a linear function like $y = mx+b$ ).

Back to matrixes, now matrices are also considered a matrix of coefficients because your standard set of functions, like the following will translate to the matrix below it.

4x \; \; + \; \; 4y \; \; = \; \; 6 \\ 4x \; \; + \; \; 2y \; \; = \; \; 5 \\ \downarrow \\ \begin{pmatrix} 4 & 4\\ 4 & 2 \end{pmatrix}

This is possible because the placement of the basis vectors gives us an easy and consistent way to solve this problem by obfuscating the variables attributed to their axes. With matrices, we use their positional notation to introduce efficiency in solving these system of functions. The structure of the matrix unveils algebraic applications and symmetries. The structure can also aid in determining whether these system of functions have no solution, a unique solution or an infinite number of solutions.

To fully capture the linear functions, we use a bar to denote the equal signs and the values they equal placed on the right hand side, just as they appear in the linear function. Known as the augmented matrix, the coefficients of the unknown and the constant values they equal are present. These are considered to be mathematical tools that help simplify our problems as much as possible before solving them.

\left[ \begin{matrix} 4 & 4 \; \\ 4 & 2 \; \\ \end{matrix} \left| \, \begin{matrix} \; 6 \\ \; 5 \\ \end{matrix} \right. \right]

These sets of linear functions can be solved in a number of ways, which we will explore in the next section.

Linear algebra

For a system of linear equations, we can have a combination of no solution, one solution, or many solutions. Our lines can become planes if taken to a higher dimension — and we need to assume that we can extend this logic across any number of dimensions in our solution set. They can take the form of a point, or even be empty.

Let’s start with a vector. A vector can be represented as an arrow where (usually) you extend from the origin to another point on a coordinate system — except instead of using the coordinates to chart the end location of the vector, an arrow is used to show its direction and magnitude of travel across the coordinate system.

There are also a number of ways you can solve systems of linear equations, one of them being the elementary row operation where you (1) multiply an equation by a non-zero constant (2) add or subtract a multiple of one equation to another (3) interchange equations.

Another method for solving these equations is the Gauss elimination where elementary row operations are used to convert the matrix into a triangular set of zeroes that emanate from the bottom left corner on up to the upper right. By isolating a single variable on the bottom row, two variables in the row above and so on, we replicate the principles we used in our algebraic functions before, using the single known variable to derive the next unknown variable until you derive the answer. The basic principles of algebra apply to these newly created mathematical objects, delivering the same effects. Adhering to these basic mathematical laws, a system of functions across many dimensions can be just as easily solved. What seems like a daunting new set of mathematics is merely the application of the familiar concepts in a new light.

Earlier, scalars were presented as measuring the size of a particular quantity. If you have $\vec{a}$ and you double it to $2\vec{a}$ , then you are effectively doubling the length of that vector, or multiplying the vector by the scalar equal to $2$ , or scaling the vector by a magnitude of $2$ . Because a vector is conveying direction and magnitude, multiplying it by a scalar value is "scaling" that magnitude, hence the name "scalar". Remember, a scalar can be any real number, so a negative scalar is scaling the vector in the opposite direction. Below we have an example of scaling a vector by $3$ .

3v \; \; = \; \; 3\begin{pmatrix} 3 & 1 \\ \end{pmatrix} \; \; = \; \; \begin{pmatrix} 6 & 2 \\ \end{pmatrix}

Arithmetic operations can be applied to vectors like addition and subtraction. Rather than using coordinates to indicate their position, vectors are depicted as arrows that point from its original location to the coordinate after the operation is applied. In this way, both depth and direction are expressed. If these coordinates are on a 2 dimensional plane, we say it is in $\mathbb{R}^{2}$ , or "2-space", implying the vector has two rows and contains real numbers.

u + v = \begin{pmatrix} a & b \\ \end{pmatrix} + \begin{pmatrix} c & d \\ \end{pmatrix} = \begin{pmatrix} a + c \\ b + d \\ \end{pmatrix}

Just as we easily added more dimensions to our Cartesian coordinates, a 2-space can be scaled out in the same manner. This now "3-space" would have three numbers each in their own row in the vector and we can extend this even further to something called N-space, where $N$ can be any numerical value. Known as Hilbert spaces, the $N$ values in this vector are called n-tuples, meaning they must be ordered in a particular sequence. Those $N$ values are analogous to "dimensions" and for a space to be $\mathbb{R}^n$ we call it an $N$ -dimensional Euclidean space.

Vectors can also be multiplied together, a process known as dot product, or the inner product. The dot product of two vectors is an ordinary number, a scalar and is really the product of the magnitudes of the two vectors. If the dot product is zero, then the two vectors are perpendicular, or orthogonal.

\text{Let }u = \begin{pmatrix} u_1 & u_2 & ... & u_n \\ \end{pmatrix} \text{and let } v = \begin{pmatrix} v_1 & v_2 & ... & v_n \\ \end{pmatrix}

You can multiply these two vectors in the following manner:

u \cdot v = u_{1}v_{1} + u_{2}v_{2} + u_{3}v_{3} + ... + u_{n}v_{n}

In higher dimensions we cannot visually identify relationships as we had in $\mathbb{R}^2$ or $\mathbb{R}^3$ . Linear algebra forces us to take our sets of equations, use scalars, vectors, and matrices and look for new patterns in this system.

Gone are the graphs, vectors and matrices are the new domain with which we see this higher-dimensional world. Matrices come in various sizes and when referring to items in a matrix we use a new syntax akin to the coordinates we used in the past: the row number is followed by the column number. For example to identify the value in the first row and third column, we write $a_{13}$ . Each row and column are their own vectors and appropriately called row vectors and column vectors. Matrices also abide by their own set of arithmetic and multiplication. A matrix can also be multiplied by a vector with the end result being a new matrix. The Russian doll set of abstractions shines bright here.

Linear equations are often written in the form of $Ax = b$ where A is the coefficient that is known, $b$ is the value that is known and $x$ is the variable that is unknown. In the case of matrices, $A$ would be the matrix of coefficients of the unknowns, $x$ is a vector of unknowns and $b$ is a vector containing constant values. If we consider $x$ to be a pure line, then $A$ is clearly affecting $x$ . So we can read our function as transforms the vector $x$ to another vector $b$ (just as in $f(x) = y$ , the function $f$ acts on the argument $x$ to output a value of $y$ ). This would appear as follows:

Ax = (v_1 v_2 .... v_n) \begin{pmatrix} x_1 \\ . \\ . \\ . \\ x_n \end{pmatrix} = x_{1}v_{1} + x_{2}v_{2} + ... + x_{n}v_{n}

The number of columns in matrix $A$ must equal the number of rows in the vector $\vec{x}$ . Two matrices can also be multiplied together in the same manner.

\begin{pmatrix} a & b \\ c & d \\ \end{pmatrix} * \begin{pmatrix} e & f \\ g & h \\ \end{pmatrix} = \begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \\ \end{pmatrix}

Matrices can also have exponents. For example, a matrix taken to the k power has k copies of the square matrix. A square matrix to the $0^{\text{th}}$ power, like $A^{0} = I$ is known as the identity matrix. This is a matrix where the values $1$ are listed diagonally from left to right in descending order while the rest of the values are $0$ . Multiplying a matrix by the identity matrix results in the original matrix, like multiplying any number by $1$ . All identity matrices are square matrices.

\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix}

The one arithmetic operation that matrices do not adhere to is the division operation. The closest analog to this would be multiplying by the inverse of a matrix (just as dividing by $5$ is akin to multiplying by $\frac{1}{5}$ , except in the case of inverses we cannot assume it is the number one over the value). We use the identity matrix to uncover the inverse matrix and we do this with the following property $AB = I$ . What this statement is saying, is that the inverse of $A$ is $B$ , so if you multiply the two together, you should get one. If in $BA$ , matrix $B$ is the inverse of $A$ , we call matrix $B$ the left inverse. In $AC$ if matrix $C$ is the inverse of $A$ , we call matrix $C$ the right inverse of $A$ .

Linear combinations

The essence of linear combinations involves the concept of adding together two scaled vectors. $x \; \& \; y$ are translated to $\hat{i}$ and $\hat{j}$ in the new vector coordinate system. So think of a vector as really just a basis vector that is then scaled with the scalar value in its place.

(x, y) \rightarrow \begin{bmatrix}i \\j \end{bmatrix}

The use of basis vectors allows us to have a consistent mathematics of vectors within the chosen coordinate system — with any new set of basis vectors chosen, the math will still apply. The basis of a vector space is a set of linearly independent vectors that span the full space. Any time you scale and add vectors it is called a linear combination of vectors. The span of vectors is all possible vectors that can be created by linear combinations of those vectors. So if $A$ is the span of $V$ then we would say:

V = \textnormal{span}[A]

Now, because the span gives us the range of movement among the vectors, if one of the vectors lies along the span of one of the other vectors, or two vectors happen to line up, we call this linear dependence implying that it’s redundant because it doesn’t add anything to the span. The concept of linear dependence and independence becomes important throughout linear algebra so keep this in mind!

A generating set is basically a way to create any vector in the space by adjusting the coefficients in the linear combinations. This is because a generating set is a set of vectors that span vector subspaces, which implies that every vector can be represented as a linear combination of all possible vectors in a given vector space. Mathematically this is accounted for as follows:

Within the vector space $V=(V, +, \cdot)$ and for the set of vectors $A = \{x_{1}, ..., x_{k} \} \subseteq V.$ If every vector $v\in V$ can be expressed as a linear combination of $x_{1}, ... , x_{k}$ , then $A$ is considered a generating set of $V$ .

A generating set that produces no smaller subset of the generating set is called a minimal. The basis is the smallest set of vectors that spans the entire vector space. Said differently, every linearly independent generating set of a vector space is known as the basis. If you remove any vector from the basis, then you lose the ability to span the space.

Since vectors are elements within the vector subspace, we want to understand what it is we can do with these vectors in this subspace. If you think of each dimension of a vector, it is a new direction it is pointing towards. So if we have an arrow that is extending to one direction, we have a scalar. A two-dimensional vectors would be scaling two separate-pointing arrows which gives you the extend of its space.

As we undergo these linear combinations of vectors, the question boils down to whether the use of each vector contributes to the unique aspect of its direction, or if it is redundant. If it’s redundant, then we say that is is linearly dependent. If not and each is unique, then they are linearly independent.

🤔

Linear independence breeds diversity while linear dependence breeds redundancy.

The redundancy that comes from linear dependence is akin to whether one of those arrows is aligned with one of the existing arrows or not — the best way we can mathematically derive this is if one of the scalars of the set of vectors $\lambda_{i} \neq 0$ . Let’s dive into the math.

For vector space V with $k \in \mathbb{N}$ and $x_{1}, ... x_{k} \in \textit{V}.$ If there is a non-trivial combination where:

0 = \sum_{i = 1}^{k}\lambda_{i}x_{i}

and where at least one $\lambda_{i}\neq 0$ , then the vectors $x_{1}, ... x_{k}$ are linearly dependent.

However, if there are only trivial zeroes and there does not exist at least one instance where $\lambda_{i} \neq 0$ then we would say that these are linearly independent. What we mean by this, is when you take the set of vectors and apply them to an augmented matrix, after applying Gaussian elimination methods

So for example, if we take the following vectors:

\vec{v}_{1} = \begin{bmatrix} 1 \\ 4 \\ 0 \end{bmatrix}, \; \; \vec{v}_{2} = \begin{bmatrix} 10 \\ 2 \\ 1 \end{bmatrix}, \; \; \vec{v}_{3} = \begin{bmatrix} -5 \\ 0 \\ 6 \end{bmatrix}

We would apply scalars to each vector and equate these to zero

c_{1} \begin{bmatrix} 1 \\ 4 \\ 0 \end{bmatrix} + \; c_{2} \begin{bmatrix} 10 \\ 2 \\ 1 \end{bmatrix} + \; c_{3} \begin{bmatrix} -5 \\ 0 \\ 6 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

Then we will create an augmented matrix.

\left[\begin{array}{ccc|c} 1 & 10 & -5 & 0 \\ 4 & 2 & 0 & 0 \\ 0 & 1 & 6 & 0 \\ \end{array}\right]

From here we will use Gaussian elimination and reduced row echelon form to derive the right-most vector which is shown below From here we will inspect if there are any non-trivial zeros.

\left[\begin{array}{ccc|c} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \end{array}\right]

The net result here is that we have linearly independent vectors because there are no scalar multiples of one that can be written as the other. Now, if the result ended up looking like something below, then we would say it is linearly dependent.

\left[\begin{array}{ccc|c} 1 & 10 & -5 & 0 \\ 6 & 22 & -5 & 0 \\ 0 & 1 & 6 & 0 \\ \end{array}\right]

This is because the second row is now 6 times the first row minus the third row. The fact that you can multiply one of the rows by 6 implies that you can scale an existing vector by 6, adding no diversity to the vector subspace.

There’s more we can do with matrices, which we will cover in my other post on Matrix Decompositions.