Generative AI Transformers

  • Transformers let AI models track relationships between chunks of data and derive meaning.
  • A Transformer is a neural network architecture that takes a text sequence as input and produces another text sequence as output.
    • The input is a sequence of tokens, which can be words or subwords, extracted from the text provided. In our example, that’s “Good Morning.” Tokens are just chunks of text that hold meaning. In this case, “Good” and “Morning” are both tokens, and if you added an “!”, that would be a token too.
    • Once the input is received, the sequence is converted into numerical vectors, known as embeddings, which capture the context of each token. These embeddings allow models to process textual data mathematically and understand the intricate details and relationships of language. Similar words or tokens will have similar embeddings.
      • For example, the word “Good” might be represented by a set of numbers that capture its positive sentiment and common use as an adjective. That means it would be positioned closely to other positive or similar-meaning words like “great” or “pleasant”, allowing the model to understand how these words are related.
      • Positional embeddings are also included to help the model understand the position of a token within a sequence, ensuring the order and relative positions of tokens are understood and considered during processing. After all, “hot dog” means something entirely different from “dog hot” - position matters!
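
To make the embedding and positional-encoding steps concrete, here is a minimal NumPy sketch. The tiny vocabulary, dimensions, and random initialization are illustrative assumptions; a real model learns its embedding table during training, which is what eventually places "good" near "great".

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table.
# NOTE: these vectors are random stand-ins; in a real model they are
# learned, and only then do similar words get similar embeddings.
vocab = {"good": 0, "morning": 1, "great": 2, "!": 3}
d_model = 8                                # tiny embedding size for illustration
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

tokens = ["good", "morning", "!"]
ids = np.array([vocab[t] for t in tokens])
embeddings = embedding_table[ids]          # one vector per token
inputs = embeddings + sinusoidal_positions(len(ids), d_model)  # inject order

# Cosine similarity is how "closeness" between word vectors is measured.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embedding_table[vocab["good"]], embedding_table[vocab["great"]]))
```
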
    • Now that our tokens have been appropriately marked, they pass through the encoder. The encoder processes and prepares the input data — words, in our case — by understanding its structure and nuances. The encoder contains two mechanisms: self-attention and feed-forward (a short code sketch of both layer types follows this list).
      • The self-attention mechanism relates every word in the input sequence to every other word, allowing the process to focus on the most important words. It's like giving each word a score that represents how much attention it should pay to every other word in the sentence.
      • The feed-forward mechanism is like your fine-tuner. It takes the scores from the self-attention process and further refines the understanding of each word, ensuring the subtle nuances are captured accurately. This helps optimize the learning process.
    • At the culmination of every epic Transformers battle, there's usually a transformation, a change that turns the tide. The Transformer architecture is no different! After the encoder has done its part, the decoder takes the stage. It uses its own previous outputs — the output embeddings from the previous time step of the decoder — and the processed input from the encoder.
      • This dual input strategy ensures that the decoder takes into account both the original data and what it has produced thus far. The goal is to create a coherent and contextually appropriate final output sequence.
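
As a rough sketch of this encoder-to-decoder handoff, PyTorch ships both layer types directly. The sizes and random tensors below are placeholders, assuming a reasonably recent PyTorch (the `batch_first` flag requires version 1.9 or later):

```python
import torch
import torch.nn as nn

d_model, nhead = 64, 4  # illustrative sizes

# One encoder layer = self-attention + feed-forward (plus residuals/norms).
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
# One decoder layer adds a second, encoder-decoder attention over the memory.
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)

src = torch.randn(1, 3, d_model)   # stand-in embeddings for "Good Morning !"
tgt = torch.randn(1, 2, d_model)   # the decoder's own outputs so far

memory = encoder_layer(src)        # encoded representation of the input
out = decoder_layer(tgt, memory)   # decoder uses BOTH its own input and memory
print(out.shape)                   # torch.Size([1, 2, 64])
```
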
  • Attention
    • The key to the Transformer’s ground-breaking performance is its use of Attention. While processing a word, Attention enables the model to focus on other words in the input that are closely related to that word.
    • To capture more nuances about the intent and semantics of the sentence, Transformers compute multiple attention scores for each word; this is known as multi-head attention.
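
The standard formulation behind these scores is scaled dot-product attention from the original paper; multi-head attention simply runs several copies of it in parallel. In this minimal NumPy sketch the query/key/value matrices are random stand-ins (in a real model they are learned linear projections of the token embeddings):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Rows of the output are weighted mixes of V; the weights say how much
    each word attends to every other word."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # word-to-word relevance
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 3, 4                          # 3 words, 4-dimensional toy vectors
Q = K = V = rng.normal(size=(seq_len, d_k))  # self-attention: all from one input
out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # row i = how much attention word i pays to each word
```
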
  • Training the Transformer via ML
    • The Transformer works slightly differently during Training and while doing Inference. Let’s first look at the flow of data during Training. Training data consists of two parts:
      • The source or input sequence (e.g., “You are welcome” in English, for a translation problem)
      • The destination or target sequence (e.g., “De nada” in Spanish)
    • The Transformer’s goal is to learn how to output the target sequence, by using both the input and target sequence.
    • The Transformer processes the data like this:
      • The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.
      • The stack of Encoders processes this and produces an encoded representation of the input sequence.
      • The target sequence is prepended with a start-of-sentence token, converted into Embeddings (with Position Encoding), and fed to the Decoder.
      • The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.
      • The Output layer converts it into word probabilities and the final output sequence.
      • The Transformer’s Loss function compares this output sequence with the target sequence from the training data. This loss is used to generate gradients to train the Transformer during back-propagation.
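
A minimal sketch of one such training step with `torch.nn.Transformer`, using random token ids as stand-ins for a real parallel corpus (positional encodings, padding masks, and the optimizer step are omitted for brevity):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 64                    # illustrative sizes
model = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
embed = nn.Embedding(vocab_size, d_model)        # positional encoding omitted
to_logits = nn.Linear(d_model, vocab_size)       # the Output layer
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, vocab_size, (1, 5))       # e.g. "You are welcome" ids
tgt = torch.randint(0, vocab_size, (1, 4))       # e.g. "<sos> De nada <eos>" ids

tgt_in = tgt[:, :-1]    # decoder input: target shifted right, starts at <sos>
tgt_out = tgt[:, 1:]    # training label: target shifted left, ends at <eos>

# The mask stops the decoder from peeking at future target words.
mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
out = model(embed(src), embed(tgt_in), tgt_mask=mask)  # encoder + decoder pass
logits = to_logits(out)                                # word probabilities
loss = loss_fn(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()                                        # gradients for back-prop
```
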
  • Inference
    • During Inference, we have only the input sequence and don’t have the target sequence to pass as input to the Decoder. The goal of the Transformer is to produce the target sequence from the input sequence alone.
    • The flow of data during Inference is:
      • The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.
      • The stack of Encoders processes this and produces an encoded representation of the input sequence.
      • Instead of the target sequence, we use an empty sequence with only a start-of-sentence token. This is converted into Embeddings (with Position Encoding) and fed to the Decoder.
      • The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.
      • The Output layer converts it into word probabilities and produces an output sequence.
      • We take the last word of the output sequence as the predicted word. That word is now filled into the second position of our Decoder input sequence, which now contains a start-of-sentence token and the first word.
      • Go back to step #3. As before, feed the new Decoder sequence into the model. Then take the second word of the output and append it to the Decoder sequence. Repeat this until it predicts an end-of-sentence token. Note that since the Encoder sequence does not change for each iteration, we do not have to repeat steps #1 and #2 each time (Thanks to Michal Kučírka for pointing this out).
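
Put together, the loop can be sketched as below, reusing the illustrative `model`, `embed`, and `to_logits` modules from the training sketch above. Note that the encoder runs only once, exactly as the last step points out:

```python
import torch
import torch.nn as nn

def greedy_decode(model, embed, to_logits, src_ids, sos_id, eos_id, max_len=20):
    memory = model.encoder(embed(src_ids))   # steps 1-2 happen ONCE
    out_ids = [sos_id]                       # step 3: just a start-of-sentence token
    for _ in range(max_len):
        tgt = torch.tensor([out_ids])
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = model.decoder(embed(tgt), memory, tgt_mask=mask)
        next_id = to_logits(dec[:, -1]).argmax(-1).item()  # most probable word
        out_ids.append(next_id)              # feed it back in and repeat
        if next_id == eos_id:                # stop at end-of-sentence
            break
    return out_ids
```
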
  • The Encoder contains the all-important Self-attention layer that computes the relationship between different words in the sequence, as well as a Feed-forward layer.
  • The Decoder contains the Self-attention layer and the Feed-forward layer, as well as a second Encoder-Decoder attention layer.
  • Each Encoder and Decoder has its own set of weights.
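
A quick way to see the “own set of weights” point: stacking layers means cloning them, not sharing one. As a sketch (this mirrors how `torch.nn.TransformerEncoder` clones its layer internally via deep copies):

```python
import copy
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
# Each deep copy gets its own independent parameters.
stack = nn.ModuleList(copy.deepcopy(layer) for _ in range(6))
# The feed-forward weights of two layers are distinct tensors, not shared:
print(stack[0].linear1.weight is stack[1].linear1.weight)  # False
```
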
  • Generative AI (GenAI) analyzes vast amounts of data, looking for patterns and relationships, then uses these insights to create fresh, new content that mimics the original dataset.
  • A generative AI transformer is a neural network architecture that uses self-attention mechanisms to process input text. Transformer models are a type of deep learning architecture that are effective in tasks like text summarization, machine translation, and question-answering.
  • The paper “Attention Is All You Need” laid the foundation for modern generative AI. The Transformer model greatly shortens training time through parallelism.
  • Generative pre-trained transformers (GPT) are a type of large language model (LLM) and a prominent framework for generative artificial intelligence. They are artificial neural networks that are used in natural language processing tasks.
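
To get a feel for a GPT in practice, here is a short sketch using the Hugging Face `transformers` library; `gpt2` is only an example checkpoint, and the first call downloads its weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Good morning", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)  # autoregressive decoding
print(tokenizer.decode(output_ids[0]))
```
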

Three types of Generative AI approaches:

  • Generative Adversarial Networks (GANs)
    • Generative Adversarial Networks (GANs) are a type of generative model that has two main components: a generator and a discriminator. The generator tries to produce data while the discriminator evaluates it.
    • The goal of the generator is to produce new data that the discriminator cannot distinguish from real data, like a Turing test.
    • GANs have many limitations and challenges. For instance, they can be difficult to train because of problems such as mode collapse, where the generator produces only a limited variety of samples, or even the same sample, regardless of the input. For example, it might repeatedly generate the same type of image rather than a diverse range of outputs.
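
A minimal sketch of the adversarial setup in PyTorch; the tiny networks and random “real” data are placeholders, and in practice each loss drives its own optimizer:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
bce = nn.BCELoss()

real = torch.randn(8, data_dim)              # stand-in for real training data
fake = G(torch.randn(8, latent_dim))         # generator produces candidates

# Discriminator: learn to label real as 1 and fake as 0.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
# Generator: learn to make the discriminator call fakes real.
g_loss = bce(D(fake), torch.ones(8, 1))
```
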
  • Variational Autoencoders (VAEs)
    • Variational Autoencoders (VAEs) are a generative model used mainly in unsupervised machine learning. They can produce new data that looks like your input data. The main components of VAEs are the encoder, the decoder, and a loss function.
    • First, the encoder acts like a detailed scanner, compressing a Transformer's essence into latent variables. Then, the decoder aims to rebuild that form, often creating subtle variations. This reconstruction, governed by a loss function, ensures the result mirrors the original while allowing unique differences. Think of it as reconstructing Optimus Prime's truck form but with occasional custom modifications.
    • VAEs have many limitations and challenges. For instance, the loss function in VAEs can be complex, where striking the right balance between making generated content look real (reconstruction) and ensuring it's structured correctly (regularization) can be challenging.
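
The three components and the two competing loss terms can be sketched in a few lines of PyTorch; the single linear encoder/decoder and the input sizes are illustrative simplifications:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, latent_dim = 784, 8                 # e.g. flattened 28x28 images
encoder = nn.Linear(data_dim, 2 * latent_dim) # outputs a mean and a log-variance
decoder = nn.Linear(latent_dim, data_dim)

x = torch.rand(4, data_dim)                   # stand-in batch of inputs
mu, logvar = encoder(x).chunk(2, dim=-1)      # latent variables (the "scan")
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
x_hat = torch.sigmoid(decoder(z))             # the "rebuild", with variation

reconstruction = F.binary_cross_entropy(x_hat, x, reduction="sum")        # look real
regularization = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # stay structured
loss = reconstruction + regularization        # the balance the text describes
```
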
  • Transformers
    • The Transformer architecture introduced several groundbreaking innovations that set it apart from Generative AI techniques like GANs and VAEs. Transformer models understand the interplay of words in a sentence, capturing context. Unlike traditional models that handle sequences step by step, Transformers process all parts simultaneously, making them efficient and GPU-friendly.
    • Additionally, the Transformer architecture's versatility extends beyond text, showing promise in areas like vision. Transformers' ability to learn from vast data sources and then be fine-tuned for specific tasks like chat has ushered in a new era of NLP that includes ground-breaking tools like ChatGPT. In short, with Transformers, there’s more than meets the eye!

A generative AI transformer is a type of artificial intelligence (AI) model designed to generate new content, such as text or images, based on patterns it has learned from existing data. The term "transformer" refers to the specific architecture used in these models.

Here's a simplified breakdown:

  1. Generative AI: This type of AI is capable of creating new content rather than just recognizing or classifying existing patterns. It can generate novel text, images, or other types of data.
  2. Transformer Architecture: The transformer architecture is a type of neural network architecture commonly used in natural language processing tasks. It's known for its effectiveness in handling sequential data and capturing long-range dependencies.

In summary, a generative AI transformer is an AI model that can create new content, and it employs the transformer architecture to understand and generate patterns in the data it has been trained on. These models have been widely used in various applications, including text generation, image synthesis, and more.