🐡

Deep Learning

Insights into the gist of deep learning.
  • What is deep learning?
    • Deep learning is an architectural choice in a machine learning model, and it can be applied to most types of models, including supervised, unsupervised, and reinforcement learning.
    • Deep learning uses the concept of neurons as a way to map inputs to outputs.
  • How does it work?
    • You start with the input (the features of your data), while the output is the prediction.
    • When you have neurons, you can have N layers of neurons, with each layer containing however many neurons you’ve designed it to have.
    • You start with a value from the input, and this gets mapped to the neurons within the first layer. The connection between two nodes carries a weight that is applied to the calculation. At each layer, each neuron takes the weighted sum of its inputs: the output of each incoming neuron multiplied by the weight on that connection, summed across all incoming connections.
    • You then feed this output to the next layer which continues in this manner.
    • Neurons use activation functions to take the calculated inputs and produce an output. Activation functions can be linear, but choosing non-linear functions is what really allows the neural network to capture the true complexity behind the problem you are looking to solve.
    • Basically, a neuron is part of a two-step process. First, it calculates the weighted sum of the inputs that map to it, and then second it runs that through the activation function to produce its output, which is known as the activation value.
    • The activation function basically takes the weighted sum and converts it into a value that makes sense for the next layer. For example, the function can be Rectified Linear (ReLU), TanH, or Logistic; applying one of these places the output within a boundary that helps the model “learn”. Doing this over and over with more layers theoretically gives the network more opportunities to learn. It also lets us pass the value along in a non-linear fashion, which is a more accurate representation of how the world actually works.
      • Technically, any non-linear function can act as an activation function that enables a neural network to learn a nonlinear mapping from inputs to outputs.
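The two-step process above (weighted sum, then activation) can be sketched in plain Python. This is a minimal illustration with made-up weights, not a real framework:

```python
import math

def relu(z):
    # Rectified Linear: zero for negative inputs, identity otherwise
    return max(0.0, z)

def logistic(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, activation=relu):
    # Step 1: weighted sum of the inputs that map to this neuron
    z = sum(w * x for w, x in zip(weights, inputs))
    # Step 2: run that sum through the activation function
    return activation(z)  # the neuron's activation value

def layer(inputs, weight_matrix, activation=relu):
    # One list of weights per neuron in the layer
    return [neuron(inputs, ws, activation) for ws in weight_matrix]

# Feed an input through two layers with made-up weights;
# each layer's output becomes the next layer's input
hidden = layer([1.0, 2.0], [[0.5, -0.25], [0.3, 0.8]])
output = layer(hidden, [[1.0, -1.0]], activation=logistic)
```

Generally all the neurons in a layer share one activation function, which is why `layer` takes a single `activation` argument.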
    • The benefit to having multiple nodes is that theoretically, each node is tasked with solving different aspects of the problem, and the weighted sum function then combines those efforts together in a divide and conquer strategy.
      • An important aspect of the power of neural networks is that during training, as the weights on the connections within the network are set, the network is in effect learning a decomposition of the larger problem, and the individual neurons are learning how to solve and combine solutions to the components within this problem decomposition.
      • Generally all the neurons within a given layer of a network will be using the same activation function.
    • Hyperparameters are the elements of a neural network that are set manually by the data scientist prior to the training process, while the parameters are those that are set automatically through the training process.
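A rough illustration of the split (the particular names and values here are made up): hyperparameters are fixed before training, while the parameters (weights) start out randomized and are then updated by training:

```python
import random

# Hyperparameters: chosen by the data scientist before training begins
hyperparameters = {
    "num_hidden_layers": 2,        # how many hidden layers
    "neurons_per_layer": [16, 8],  # how many neurons in each hidden layer
    "activation": "relu",          # activation function used by each layer
    "learning_rate": 0.01,         # step size for the weight updates
}

# Parameters: the weights the training process sets automatically;
# training typically begins by randomly initializing them
weights = [random.gauss(0.0, 0.1) for _ in range(4)]
```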
  • Tuning models
    • When you adjust the weights of a model (either learned during training or set in advance by a data scientist), the weights are represented as a vector in weight space. This weight vector is orthogonal (perpendicular) to the boundary you are looking to create, and because of this, the angle between your weight vector and an input vector tells you where the input falls relative to the boundary. If the angle is less than 90 degrees, the dot product is positive and the input falls on one side of the boundary; if greater than 90 degrees, the dot product is negative and it falls on the other side. This vector composition is what allows us to determine where a point falls, and adjusting the weight vector rotates the boundary.
    • The bias that is introduced to a deep learning model essentially adds a constant term to your scoring function. By doing so, you are moving the origin of the activation function up or down the y-axis, which is translating it. Translations of vectors occur when you add them, as shown by the parallelogram law. By shifting the bias, you give more or less slack to the decision boundary of choice.
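A small numeric sketch of this geometry (the weight vector and points here are arbitrary): the sign of the dot product between the weight vector and an input tracks whether the angle between them is under or over 90 degrees, and adding a bias term translates the boundary:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    # Angle between two vectors, from the dot-product identity
    cos = dot(u, v) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(cos))

w = [1.0, 1.0]          # weight vector, orthogonal to the boundary x + y = 0
inside = [2.0, 1.0]     # angle with w under 90 degrees -> positive dot product
outside = [-2.0, -1.0]  # angle with w over 90 degrees -> negative dot product

# Adding a bias b translates the boundary to x + y + b = 0,
# giving more or less slack before a point crosses it
b = -1.0
score = dot(w, inside) + b  # 3.0 - 1.0 = 2.0, still on the positive side
```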
  • Geometry of models
    • Even though functions are used in machine learning, their value is derived from the geometry of the spaces used, specifically: input space, weight space, and activation space.
      • Input space → All initial values that we would feed into our machine learning algorithm would use some flavor of this input space.
      • Weight space → The weight space contains plotted points of all possible weights that an ML model can contain.
      • Activation space → Comparable to an “output” or “decision” space: a shift in any particular direction is the result of a change in the weighted sum from tuning a particular weight’s coefficient. This output gets us a result, and it is in this space that we can define our “decision boundary”, the logic by which the system predicts one value or decision over another.
      💬
      Determining whether vectors are orthogonal as a means of defining the decision boundary is blowing my damn mind.
    • The power of neural networks to model complex relationships is not the result of complex mathematical models, but rather emerges from the interactions between a large set of simple neurons.
      • When viewing the deep learning architecture, we have an input layer, an output layer, and many hidden layers.
      • The number of neurons we have in the hidden layers, and the number of those layers, are all design decisions made by the data scientist.
  • How learning occurs
    • The process used to train a deep neural network can be characterized as: randomly initializing the weights of the network, then iteratively updating those weights in response to the errors the network makes on a dataset, until the network is working as expected. Within this training framework, the backpropagation algorithm solves the credit (or blame) assignment problem, and the gradient descent algorithm defines the learning rule that actually updates the weights in the network.
    • The goal of learning is to iteratively adjust the function of the model, really by adjusting the weights (coefficients) of the function, so that the outputs of the model result in the least error (or divergence) from the training data set. (Even this should be taken with a grain of salt, so that you don’t fall into overfitting.)
    • Types of learning
      • Gradient Descent
        • Defines a rule to update the weights used in a function based on the error of the function.
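A minimal sketch of that update rule on a single weight: a toy model `y = w * x` with one made-up training example, where each step moves the weight against the gradient of the squared error (not a full network, just the rule itself):

```python
# One training example: input x = 2.0 should map to target 6.0
x, target = 2.0, 6.0
w = 0.0              # initial weight, before any learning
learning_rate = 0.1

for _ in range(100):
    prediction = w * x
    error = prediction - target
    gradient = 2 * error * x       # derivative of (w*x - target)**2 w.r.t. w
    w -= learning_rate * gradient  # the gradient descent update rule

# w converges toward 3.0, the value that minimizes the error
```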