Georgia Tech - Spring 2022
Instructor : Zsolt Kira
Textbook : https://www.deeplearningbook.org/
https://github.com/pytorch/workshops/tree/master/CS7643
https://sebastianraschka.com/blog/2021/dl-course.html
http://cs231n.stanford.edu/
https://d2l.ai/index.html
AI Summer : https://theaisummer.com/start-here/
NLP
This module consists of an overview of simple machine learning algorithms, with a focus on linear classifiers. Then we will build up to neural networks.
As a refresher, recall the three main types/areas of Machine Learning.
As the name suggests, a linear classifier is one based on the idea of linear regression. These appear in the form $f(x,W)=Wx+b$, which is simply the equation of a line in compact matrix notation.
To develop a general algorithm for such models we will look at how we can decompose a model into a series of blocks, and how we can assemble those blocks into a directed acyclic graph (DAG).
We then look at a very generic optimization algorithm called backpropagation, which allows us to optimize the parameters.
Other considerations?
Reference Material:
"A computer program is said to learn from experience E with respect to come class of tarks T and performance measure P, if it's performance at tasks in T, as measured by P, improves with experience E."
This differs from basic programming in that ML creates an algorithm that can build a model, which in turn can receive data and make decisions. In basic programming the program itself receives the data and makes the decision directly, using hand-written rules. The distinction here is that ML programs a learner to perform a task rather than performing the task itself. Linear regression is the classic example: we take some data and produce an equation, which in turn can be used to make predictions. Programming a learner that can create that equation is the machine learning algorithm's job.
Machine learning shines precisely where it is hard to hand-craft an algorithm. Natural Language Processing (NLP) is a perfect example of such a situation.
Supervised learning is the creation of a model from training pairs {X,Y}, where each element pair is denoted $(x_i,y_i)$ and is generally represented as vectors. The Y's are generally called our labels (or ground truth). In unsupervised learning we are not given the labels Y but only the data X itself. In reinforcement learning we have an agent, an environment, actions, and a reward. It is the reward that best approximates the notion of a label; there is no direct supervision in reinforcement learning, so it might be better to speak of observed experiences.
Supervised learning can use non-parametric models, which have no explicit mathematical formula; examples include decision trees and k-NN classifiers. It can also use parametric models, which have an explicit formula; examples include logistic regression and neural networks. Linear classifiers fall under the parametric umbrella. These can face challenges, though, as the number of dimensions increases.
Let's now dive deeper into parametric learning algorithms.
Components
Functional form : $f(x,w)=y$
One of the simplest examples of this is the equation of a line $y=mx+b$. If the output is continuous then we can apply a secondary function in order to turn it into a binary classifier.
$y =\begin{cases}
1, & \text{if } f(x,w)>0 \\
0, & \text{ Otherwise}
\end{cases}$
For a multi-class classifier we take the class with the highest (max) score : $f(x,W) = Wx+b$ where W is a matrix and each row represents a class.
Linear classifiers are models that try to find a line, or hyperplane, that can separate the data elements into distinct groups. While simple, they are highly versatile and have many applications. However, in order to compute them we will often require a higher dimension, and this presents some difficulties.
For more complex functions like XOR and circles it becomes difficult, if not impossible, to discover a linear separation. For this we will need non-linear functions.
Topic: performance measure, i.e. the loss or score function. For a binary classifier we could output 1 when the score is greater than zero, and for a multi-class classifier we could take the class with the maximum score. However, raw scores suffer from some issues: they are difficult to interpret, have no bounds, and are hard to compare across classifiers. To remedy these issues we will use the softmax function to turn scores into probabilities.
For a score function of the form $s=f(x,W)$ we would take $P(Y=k|X=x)=\frac{e^{s_k}}{\sum_j e^{s_j}}$ for each class k.
In order to optimize this we need a function to optimize. This is often called our loss or objective function. It should have the following properties:
In ML we will often use empirical risk minimization: minimize the average loss over the training dataset.
Given $\{(x_i,y_i)\}_{i=1}^N$ Define the loss L as $L=\frac{1}{N}\sum L_i(f(x_i,W),y_i)$ where $x_i$ is an image and $y_i$ is a label (usually an integer).
Consider the example of Support Vector Machines (SVMs). For an SVM we have a loss function of the form:
$L_i = \sum_{j \ne y_i} max(0,s_j - s_{y_i} + 1)$
Note that the loss is 0 if the score for $y_i$, the ground truth label, is greater than or equal to the scores of all the other (incorrect) classes plus one. This maintains a margin between the score for the ground truth label and all the other possible categories: we want the ground truth score to be higher by some margin. When this is not the case we penalize by how far we are from the margin. To do this we sum, over all classes that are not the ground truth, a hinge term that is non-zero whenever the ground truth score is not bigger by the margin. This is called a hinge loss.
Simple example, suppose we are trying to classify a picture into 1 of three categories: cat, frog, or car.
Picture 1 : Ground truth : CAT
Model Scores
CAT CAR FROG
3.2 5.1 -1.7
Using our loss function $L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)$ from above:
Li = max(0,5.1-3.2+1) + max(0,-1.7-3.2+1)
= max(0,2.9)+max(0,-3.9)
= 2.9 + 0
= 2.9
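As a quick check, here is a minimal NumPy sketch of the multi-class hinge loss on the example scores (the class ordering and array layout are just for illustration):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])               # CAT, CAR, FROG
y = 0                                             # ground-truth index (CAT)

margins = np.maximum(0, scores - scores[y] + 1)   # hinge term for every class
margins[y] = 0                                    # do not count the ground-truth class
L_i = margins.sum()
print(L_i)                                        # 2.9, matching the hand calculation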
While the above loss function can work, the best loss to pair with a softmax output is cross-entropy. It can be derived by looking at the distance between two probability distributions (the output of the model and the ground truth), which is the KL divergence. It can also be derived from a maximum likelihood estimation perspective.
$L_i = -\log P(Y=y_i | X=x_i)$ Maximum likelihood estimation: choose the probabilities to maximize the likelihood of the observed data.
Let's re-examine our example above using the cross entropy
Recall the ground truth is CAT
CAT CAR FROG
3.2 5.1 -1.70 <- These are our un-normalized log probabilities (logits)
24.5 164.0 0.18 <- exponentiated np.exp(sk)
0.13 0.87 0.00 <- divided by the sum of the exponentiated (row 2)
at this point the model thinks it's a car
In a regression type of problem we can directly optimize to match the ground truth value.
Example House price prediction
For probabilities
In some cases we add a regularization term to the loss function to encourage smaller weights and penalize larger weights, which would otherwise over-emphasize a feature.
The illustration above shows the characteristics of different loss functions.
import numpy as np

scores = np.array([3.2, 5.1, -1.7])     # CAT, CAR, FROG logits
exp_scores = np.exp(scores)             # exponentiate
probs = exp_scores / exp_scores.sum()   # softmax probabilities
neglog = -np.log(probs)                 # cross-entropy loss if that class is the ground truth

cat, car, frog = probs
cat_neglog, car_neglog, frog_neglog = neglog
print(cat, cat_neglog)
print(car, car_neglog)
print(frog, frog_neglog)
In general we will be working with matrices W and X, where x is a vector.
Define:
For our discussion
Assume s is a scalar $s \in \mathbb{R^1}$, v is a vector $v \in \mathbb{R^m}$, and M is a matrix $M \in \mathbb{R^{k \times l}}$,
then $\frac{\partial v}{\partial s}$ is of size $\mathbb{R^{m \times 1}}$
Similarly $\frac{\partial s}{\partial v}$ is of size $\mathbb{R^{1 \times m}}$
What is the size of $\frac{\partial v^1}{\partial v^2}$? (2 distinct vectors of the same size).
It turns out to be $\mathbb{R^{m \times m}}$, because the result of this is a matrix. In fact this is a jacobian matrix which is the matrix of partial derivatives.
What is the size of $\frac{\partial s}{\partial M}$? (scalar and a matrix)
The result is a matrix composed of the partial of the scalar w.r.t. each element in M
What about $\frac{\partial L}{\partial W}$? (Recall that L=loss is a scalar)
then by the previous question this should result in a matrix of partials (Jacobian).
Oftentimes our algorithms (like gradient descent) are implemented in batches, which can also be thought of as matrices and tensors.
You're probably beginning to see a problem here: these computations can get complicated pretty quickly. Instead we can flatten our inputs to get a vector of derivatives.
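A small PyTorch sketch (shapes chosen arbitrarily) that confirms these sizes using torch.autograd.functional.jacobian:

import torch
from torch.autograd.functional import jacobian

v = torch.randn(4)                  # a vector in R^m with m = 4
W = torch.randn(3, 4)               # a matrix in R^{k x l} with k = 3, l = 4

J = jacobian(lambda x: W @ x, v)    # vector-valued output wrt vector input
print(J.shape)                      # torch.Size([3, 4]) -- an (output x input) Jacobian

s = lambda M: (M * M).sum()         # scalar function of a matrix
G = jacobian(s, W)                  # ds/dM has the same shape as M
print(G.shape)                      # torch.Size([3, 4])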
Deep learning employs a few key concepts: it generally expects raw data (think images and audio signals) and learns a feature representation. Neural networks are the most popular form, but there are others, such as probabilistic models.
How do we find the best weights for our model? Gradient descent! It's the optimization algorithm for finding the best parameters. Given a model and a loss function, finding the best set of weights is a search problem: a search for the weights that minimize the loss function.
There are several classes of methods:
As weights change, the loss changes as well. This is often somewhat smooth locally, so small changes in weights produce small changes in the loss. We can therefore think about iterative algorithms that take current values of weights and modify them a bit.
The strategy GD follows is pretty straightforward: follow the slope to its lowest point. To do this we use the gradient, i.e. the derivative.
$f'(a)= \lim\limits_{h \to 0} \frac{f(a+h)-f(a)}{h}$
As an algo:
It should be noted that in practice we only compute the gradient across a small subset of data (aka Batch)
Gradient descent is guaranteed to converge under some conditions.
Computing the partials can be done in multiple ways.
Update rule Derivation
Let
Then $ \begin{align} \frac{\partial L}{\partial w_j} & =\sum_{i=1}^{n} \frac{\partial}{\partial w_j} (y_i - w^T x_i)^2 &\\ & =\sum_{i=1}^{n} 2 (y_i - w^T x_i) \frac{\partial}{\partial w_j} (y_i - w^T x_i) &\\ & =-2 \sum_{i=1}^{n} \delta_i \frac{\partial}{\partial w_j} (w^T x_i) \quad (\text{ where } \delta_i = y_i - w^T x_i) &\\ & =-2 \sum_{i=1}^{n} \delta_i \frac{\partial}{\partial w_j} \sum_{k} w_k x_{ik} &\\ & =-2 \sum_{i=1}^{n} \delta_i x_{ij} \end{align} $
Finally we have a workable form of our update: $$w_j \leftarrow w_j + 2\eta \sum_{i=1}^N \delta_i x_{ij}$$ which is our gradient descent update rule (with learning rate $\eta$) for N examples indexed by i.
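A minimal NumPy sketch of this update rule for least-squares linear regression (the data, learning rate, and iteration count are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # N examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.002                                    # learning rate
for _ in range(1000):
    delta = y - X @ w                          # residuals delta_i = y_i - w^T x_i
    w = w + 2 * eta * (X.T @ delta)            # w_j <- w_j + 2*eta*sum_i delta_i * x_ij
print(w)                                       # approaches w_true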
As an exercise feel free to try this out with a non-linear function like sigmoid. Don't forget to use the chain rule.
To perform this take
You should get the following result: $w_j \leftarrow w_j + 2\eta \sum_{i=1}^N \delta_i \sigma_i (1-\sigma_i)x_{ij}$
where
So far we have kept things simple, but what happens when we begin using a composition of simple functions?
Finding the derivative and the update rule can be tricky. By decomposing we can develop a generic algorithm. We will do this at a later point, using acyclic graphs.
Recall that
Local representation: a concept is represented by a single node.
Distributed representation: a concept is represented by the pattern of activation across many nodes.
A large advantage of distributed representations is that they are much better when combined, whereas local representations make no sense when combined. Basically, local representations are limited in complexity, while distributed representations are much broader.
Reference Links:
A linear classifier can be broken down into
A simple neural network has similar structure as our linear classifier:
Of course this is an oversimplified model of the neurons in our brain.
Of course we can also have many neurons connected to the same input. This is what happens in a multi-class classifier: each output node outputs the score for a class. These are often called "fully connected layers", or linear projection layers.
$f(x,W) = \sigma(Wx+b), \quad W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & \dots & w_{1n} \\ w_{21} & w_{22} & w_{23} & \dots & w_{2n} \\ \vdots \\ w_{m1} & w_{m2} & w_{m3} & \dots & w_{mn} \end{bmatrix}$
Terminology:
Functions can be made arbitrarily complex, and subject only to computational limits:
eg $ f(x,W)=\sigma(W_5 \sigma(W_4 \sigma(W_3 \sigma(W_2 \sigma(W_1 x))))) $
We can use any type of differentiable function (layer) we want! All we need to do is add the loss function at the end. The world is compositional, and we want our model to reflect this. Empirical and theoretical evidence suggests that it makes learning complex functions easier. Prior state of the art feature engineering often had a compositional nature as well.
eg: Pixels --> edges --> object parts --> objects
This begs the question how do we compute the partials for each component?
To this end we will develop a general algorithm that treats a function as a computation graph. The graph can be any directed acyclic graph. Our training algorithm will then process this graph one module at a time.
Consider:
pretty cool eh?
Backprop algo consists of two main parts:
Forward Pass
We simply compute the output of each component and save it. These outputs will be needed to compute the gradients later on.
Backwards Pass
The backward pass is a more complex and involved, recursive algorithm. Here we seek to calculate the gradients of the loss with respect to the module's parameters.
There are three main gradients needed for each module:
We assume that
So our problem can be formalized as follows
Let's work through an example:
Step 1: Computing the local gradients $\{ \frac{\partial h^l}{\partial h^{l-1}} ; \frac{\partial h^l}{\partial W} \} $
This is simply the derivative of the function with respect to its parameters and inputs.
For example, for a single linear layer $h^l = W h^{l-1}$
therefore $\frac{\partial h^l}{\partial h^{l-1}} = W$
and $\frac{\partial h^l}{\partial W} =h^{l-1,T}$
Step 3: we compute $\{ \frac{\partial L}{\partial h^{l-1}} ; \frac{\partial L}{\partial W} \} $
So we simply apply our old friend the chain rule: $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$
Summary
Backprop only tells you what you need to compute; it doesn't specify how to do it. The backprop idea, though, can be applied to any directed acyclic graph (DAG). The graph represents an ordering constraint: which nodes must be calculated first. Given an ordering, we can then iterate from the last module backwards, using the chain rule. As we do so we store, for each node, its gradient outputs for efficient computation. This is called reverse-mode automatic differentiation.
The idea here is to create a framework in which we can just define the computation graph, putting together a set of functions built from simple primitives like addition and multiplication. Doing so means we never have to derive the backward gradients by hand, nor write code that computes the gradients of these functions, because they are all simple primitives whose derivatives are known.
Let's start by defining our terminology.
Example :
Since this is not terribly intuitive let's review/refresh some points
$\frac{\partial (a_1+a_2)}{\partial a_1} = 1$ because we're only differentiating wrt a1
$\frac{d}{dx} sin(x) = cos(x)$
It is interesting to note what certain operations do, and what they tell us about gradient flow
Here are a few more observations not illustrated by our example
If the gradients do not flow backwards properly, learning will slow down and even come to a stop!
Key idea: store the computation graph in memory along with the corresponding gradient functions.
Nodes broken down to basic primitive computations (addition, multiplication, log, etc.) for which corresponding derivative is known.
As a small aside there is another method that is used on occasion: forward mode automatic differentiation.
Here's a small example demonstrating graph building using PyTorch. (The lecture used an older version of PyTorch; it's easier today.)
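A minimal modern sketch (the shapes and loss are made up for illustration):

import torch

x = torch.randn(1, 3)                       # input
w = torch.randn(3, 1, requires_grad=True)   # parameters tracked by autograd
b = torch.zeros(1, requires_grad=True)

y = torch.sigmoid(x @ w + b)                # each operation adds a node to the graph
loss = (1.0 - y).pow(2).sum()               # scalar loss at the end of the graph
loss.backward()                             # walks the graph backwards, filling w.grad and b.grad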
The last line computes the backwards gradients. All in one line of code.
Computation graphs are not limited to mathematical functions.
They can have control flow (ifs, loops, etc.) and can propagate through algorithms.
They can be done dynamically so that gradients are computed, nodes are added, and repeat
All this falls under the umbrella of differentiable programming.
In this section we look at an example that more closely resembles a NN use case.
Recall from our earlier sections the logistic regression loss, the negative log of the sigmoid: $$\large -\log\left(\frac{1}{1+e^{-w^T x}}\right)$$
Define
Let's decompose as follows ( note that this also serves as our computation graph )
Don't get confused in the above image!! u,p,L are the outputs
Let's now work through the back prop using automatic differentiation:
$\bar{p}=-\frac{1}{p}$ : the partial of $-\log(p)$ with respect to p is just the derivative of the log (n.b. log here is the natural log).
$\bar{u}=\frac{\partial L}{\partial p} \frac{\partial p}{\partial u}$
Let's group it all together $\begin{align} \bar{w} & = \frac{\partial L}{\partial p} \frac{\partial p}{\partial u} \frac{\partial u}{\partial w} \\ & = - \frac{1}{\sigma(w^T x)} \sigma(w^T x) (1 - \sigma(w^T x)) x^T \\ & = - (1 - \sigma(w^T x)) x^T \end{align}$
This effectively shows the gradient flow along the path from L to w.
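A small NumPy sketch of this forward/backward computation (the input and weights are made up), checking the module-by-module chain rule against the closed-form gradient $-(1-\sigma(w^T x))x^T$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])

# forward pass: u = w^T x, p = sigma(u), L = -log(p)
u = w @ x
p = sigmoid(u)
L = -np.log(p)

# backward pass, one module at a time
p_bar = -1.0 / p                   # dL/dp
u_bar = p_bar * p * (1 - p)        # dL/du = dL/dp * dp/du
w_bar = u_bar * x                  # dL/dw = dL/du * du/dw

print(w_bar)
print(-(1 - sigmoid(w @ x)) * x)   # matches the closed-form gradient above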
Let's reconsider our computation graph using matrix notation
For a simple linear layer $h^l = W h^{l-1}$
We have
Now let's dive into the partials (Jacobians)
Notice the sparse structure of the last matrix, the partial of the loss wrt the weights.
Up until now we've focused mostly on the sigmoid function, but we can use any differentiable function! This includes piecewise differentiable functions as well. A popular choice in many applications is the ReLU (Rectified Linear Unit), $h^l=\max(0,h^{l-1})$. It provides non-linearity but better gradient flow than the sigmoid, and is applied element-wise.
The full Jacobian of the ReLU Layer is LARGE, (Output dimensions x Input dims).
Recall that the max() function funnels gradients through the selected path:
\begin{equation}
\frac{\partial L}{\partial h^{l-1}} =
\begin{cases}
\frac{\partial L}{\partial h^{l}} &\text{ if } h^{l-1} > 0 \\
0 &\text{ otherwise}
\end{cases}
\end{equation}
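A minimal NumPy sketch of the ReLU forward pass and this gradient routing (the input and upstream gradient values are illustrative):

import numpy as np

h_prev = np.array([-1.5, 0.3, 2.0, -0.2])   # input to the ReLU layer
h = np.maximum(0, h_prev)                   # forward pass: element-wise max(0, .)

dL_dh = np.array([0.4, -1.0, 0.7, 2.0])     # upstream gradient (made up)
dL_dh_prev = dL_dh * (h_prev > 0)           # gradient flows only where the input was positive
print(h, dL_dh_prev)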
A network with two or more hidden layers is often considered a deep model. Depth is important because it is needed to structure a model to represent an inherently compositional world: in computer vision, for example, we have object parts, shapes, and scenes. Theoretical evidence also suggests depth leads to parameter efficiency, e.g. gentle dimensionality reduction.
Commonly encountered issues
Fully connected NN: take an input, convert it to a vector, then feed it into a series of linear and nonlinear transformations. There are hidden layers in the middle that are expected to extract more and more abstract features from the high-dimensional raw input data. For deeper networks we will want to reduce the feature size. In the end we have a layer that represents our class scores: each node outputs a score, and these are combined to produce a probability. This is not very good for, say, images, as the number of pixels is generally very high, and it ignores the spatial structure of the images. So we turn to CNNs.
CNN - Convolutional Neural Networks: rather than tie each node to each pixel, these act as a feature extractor for small windows in the image; each local window has features extracted from it, such as shapes, corners, eyes and wheels. In the end we will have features that represent where object parts or entire objects are located in the image, and finally we will pass these features into a fully connected layer, albeit a much smaller one than in the previous approach.
RNN - Recurrent Neural Networks: yet another approach, better suited for problems that have a sequential structure, like NLP and sentences.
Similar to traditional ML we will face the questions
We need a good optimization algorithm to find our weights. Gradient descent is popular, but there are others that still use gradients; different optimizers may make more sense.
How do we initialize our weights? A bad initialization can lead to difficult learning and require a different optimizer.
Regularization? How can we prevent overfitting?
Loss function: which one do we use? do we design our own?
The practice of ML is complex. For any application you must look at the trade-offs between the considerations above. Another trade-off is model capacity versus the amount of data. Low-capacity models can perform poorly when certain functions, such as the sigmoid, are used.
Unfortunately, all this is learned via experience ... there is no good textbook on all of it.
What modules to use, and how to connect them? This is guided by the type of data being used and its characteristics. Many data types (modalities) already have good architectures. The flow of gradients is the topmost consideration when analyzing layers. It is quite possible to have modules that cause a bottleneck.
Networks are combinations of linear and non-linear layers. A combination of linear layers only has the same representational power as one linear layer, so non-linear layers are crucial. Compositions of non-linear layers enable complex transformations. Gradient flow depends heavily on the shape of the non-linear modules.
Points to look at
Sigmoid:
tanh:
ReLU:
There are a lot of variations in nonLinearity. For example
You may have noticed these are not differentiable everywhere. It turns out this is OK because there are few problematic points (only 0 is not differentiable). ReLU is often the best starting point and is very popular; it converges very quickly. Sometimes a leaky ReLU can make a difference. The sigmoid is generally avoided, except in cases where we need the values to fit the 0-1 range.
Intelligent initialization of neural network weights. Initialization determines how the statistics of our outputs will behave, and it affects both the forward activations and the backward gradient flow. If you initialize the weights to values that are in some way degenerate (e.g. close to a bad local minimum), this will lead to poor gradient flow. If the weights are initialized to statistically large values and these large activations are fed into our non-linearities (such as tanh), then the algorithm begins in the saturated range of the function. In a similar fashion, if you start in a regime where the inputs are small, then you will be in (or close to) the linear regime of the non-linearity and will have a strong gradient to learn from. Of course, starting near a good local minimum is beneficial as well.
Let's consider an example: what happens when we use constant weights, i.e. $w_i = c \; \forall i$? This leads to a degenerate solution, as all weights will be updated with the same rule; they will move in the same direction and with the same step size. There are cases where this may be acceptable, so it depends.
A common approach is to use small normally distributed random numbers, e.g. $N(\mu = 0,\sigma = 0.01)$.
This is a very safe and reasonable approach that is still widely used:
Deeper networks (with many layers) are more sensitive to initialization. In a deeper network, activations (the outputs of the nodes) get smaller: the standard deviation shrinks significantly. This leads to smaller values multiplied by upstream gradients. Larger values, on the other hand, lead to saturation. We want a balance between the layers, but this proves more difficult as complexity increases.
Ideally, we'd like to maintain the variance at the output to be similar to that of the input. This condition leads to a simple initialization rule, we sample from the uniform distribution:
$Uniform(-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}},+\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}})$
Where
Notice how the distribution is relatively equal across all the layers.
In practice there is an even simpler form, $N(0,1) \times \sqrt{\frac{1}{n_j}}$, This analysis holds for tanh and similar activations.
For ReLU activations a similar analysis yields $N(0,1) \times \sqrt{\frac{1}{n_j/2}}$
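A minimal NumPy sketch of these initialization rules (layer sizes are arbitrary; $n_j$ is the fan-in, $n_{j+1}$ the fan-out):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256

# Xavier/Glorot uniform form (tanh-like activations)
limit = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-limit, limit, size=(n_out, n_in))

# simpler normal form: N(0,1) * sqrt(1/n_j)
W_simple = rng.normal(size=(n_out, n_in)) * np.sqrt(1.0 / n_in)

# ReLU variant: N(0,1) * sqrt(1/(n_j/2)) = N(0,1) * sqrt(2/n_j)
W_relu = rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)

print(W_xavier.std(), W_simple.std(), W_relu.std())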
Summary
In ML and DL, data drives the learning of features and classifiers. Always seek to understand your data before transforming it. The relationships between output statistics, layers such as non-linearities, and gradients are important.
Normalization can improve gradient flow and learning. Typical methods include
Sometimes we will use a layer that normalizes the data within the neural network itself.
For example
This allows the network to determine its own scaling (normalizing) factors, giving it greater flexibility. This popular method is called Batch Normalization. During inference, stored means and variances calculated on the training set are used.
Batch Normalization presents some interesting challenges:
Sufficient batch sizes must be used to get stable per-batch estimates during training
Normalization is especially important before non-linearities. We want the input statistics to be well behaved so that they do not saturate the non-linearities: values that are too low, too high, or unnormalized and unbalanced cause saturation issues.
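A rough NumPy sketch of the batch-normalization computation at training time (γ and β are the learnable scale/shift; ε and the running statistics used at inference are simplified away):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the batch, then rescale.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta                  # the network learns its own scaling/shifting

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 8))   # a sufficiently large batch
out = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 mean, ~1 std per feature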
So far we have only talked about Steepest gradient descent, this section introduces other approaches
Deep learning often involves complex, compositional, nonlinear functions, so the loss landscape is often complex as a result. There is little direct theory, and a lot of intuition is needed to optimize these loss surfaces.
It used to be thought that the existence of local minima is the main sticking point in optimization, but it turns out this is not always true. In many cases we can find good local minima, yet other issues arise that hinder our ability to optimize.
Other issues include
We generally use a subset of the data in each iteration to calculate the loss and the gradient. This is an unbiased estimator but can have high variance.
Several loss surface geometries can present difficulties
Steepest gradient descent always searches for the steepest direction and can become stuck at saddle points. One way to overcome this is with momentum: imagine a ball rolling down the loss surface, using momentum to carry it across flat regions.
Recall our update rule from earlier: $w_i = w_{i-1} - \alpha \frac{\partial L}{\partial w_{i-1}}$
Consider: $v_i = \beta v_{i-1} + \frac{\partial L}{\partial w_{i-1}}$ Update velocity (starts at 0, typically $\beta=0.99$)
Our new update rule: $w_i = w_{i-1} - \alpha v_i$ (In some textbooks alpha is pushed inside the velocity term)
(Note that when $\beta=0$ this is just Stochastic Gradient Descent)
This is actually used quite often in practice and can help move you off areas with low gradients. Observe that the velocity term is an exponential moving average of the gradient:
$
\begin{split}
v_i & = \beta v_{i-1} + \frac{\partial L }{\partial w_{i-1}} &\\
& = \beta (\beta v_{i-2}+\frac{\partial L }{\partial w_{i-2}}) + \frac{\partial L }{\partial w_{i-1}} &\\
& = \beta^2 v_{i-2} + \beta \frac{\partial L }{\partial w_{i-2}} + \frac{\partial L }{\partial w_{i-1}}
\end{split}
$
This is actually part of a general class of accelerated gradient methods with theoretical analysis under some assumptions.
Nesterov Momentum: Rather than combining velocity with the current gradient, go along velocity first and then calculate the gradient at a new point.
We know that velocity is probably a reasonable direction, so
$\begin{split} \hat{w}_{i-1} & = w_{i-1} + \beta v_{i-1} \\ v_i & = \beta v_{i-1} + \frac{\partial L}{\partial \hat{w}_{i-1}} \\ w_i & = w_{i-1} - \alpha v_i \\ \end{split}$
Of course there are various equivalent implementations; should you choose to Google this you'll find a few.
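A compact NumPy sketch of the classical momentum update above on a toy quadratic loss $L(w) = \tfrac{1}{2}\|w\|^2$ (whose gradient is just w); the hyperparameter values are illustrative:

import numpy as np

def sgd_momentum(w, v, grad, alpha=0.01, beta=0.9):
    """One update of SGD with momentum, following the equations above."""
    v = beta * v + grad          # velocity: exponential moving average of gradients
    w = w - alpha * v
    return w, v

w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(300):
    w, v = sgd_momentum(w, v, grad=w)   # gradient of 0.5*||w||^2 is w
print(w)                                # close to the minimum at 0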
Hessian and Loss Curvature
There are various mathematical ways to characterize the loss surface. Similar to Jacobians, Hessians use second-order derivatives that provide further information about the loss surface; however, they are computationally intensive. The ratio between the largest and smallest eigenvalues of the Hessian is called the condition number. It tells us how different the curvature is along different dimensions. If it is high, then SGD (Stochastic Gradient Descent) will make big steps in some dimensions and small steps in others. This causes a lot of jumping around, and learning becomes sporadic and unpredictable.
There are other second order optimization methods that divide steps by curvature, but are expensive to compute.
Per-Parameter Learning Rate
Idea here is to have a dynamic learning rate for each weight.
Several flavors of optimization Algorithms
There is no one method that is the best in all cases. While SGD can achieve similar results it'll require much more tuning.
Adagrad ("Adaptive Subgradient Methods for Online Learning and Stochastic Optimization")
Use gradient statistics to reduce learning rate across iterations.
This method uses a gradient accumulator ($G_i$)
$G_i = G_{i-1}+(\frac{\partial L}{\partial w_{i-1}})^2$ - This is our accumulator
and then our weight update, which reduces the learning rate along directions that have accumulated large gradients:
$w_i = w_{i-1} - \frac{\alpha}{\sqrt{G_i + \epsilon}} \cdot \frac{\partial L}{\partial w_{i-1}}$
Directions with a high curvature will have higher gradients, and their learning rate will be reduced.
One shortcoming is that the accumulator continues to grow, so the denominator grows large, pushing the learning rate towards 0. So what do we do? We can apply a weighted/moving average rather than a simple additive accumulator; see the next set of equations.
RMSProp
$G_i = \beta G_{i-1}+(1-\beta)(\frac{\partial L}{\partial w_{i-1}})^2$ - This is our accumulator
and then our weight update, (which hasn't changed)
$w_i = w_{i-1} - \frac{\alpha}{\sqrt{G_i + \epsilon}} \cdot \frac{\partial L}{\partial w_{i-1}}$
Another very popular approach is Adam, which combines aspects of both of the above.
ADAM Optimizer
This was written around 2015! Not all that long ago.
$
\begin{split}
v_i & = \beta_1 v_{i-1} + (1 - \beta_1) \frac{\partial L}{\partial w_{i-1}} \\
G_i & = \beta_2 G_{i-1}+(1-\beta_2)(\frac{\partial L}{\partial w_{i-1}})^2 \\
w_i & = w_{i-1} - \frac{\alpha v_i}{\sqrt{G_i + \epsilon}}
\end{split}
$
One drawback is that the moment estimates are biased toward zero early in training (when the accumulators hold small values), which can make the first updates unstable.
So we apply a time-varying bias correction to get the version that is used most often in practice.
ADAM w Time Varying Smoothing
$
\begin{split}
v_i & = \beta_1 v_{i-1} + (1 - \beta_1) \frac{\partial L}{\partial w_{i-1}} \\
G_i & = \beta_2 G_{i-1}+(1-\beta_2)(\frac{\partial L}{\partial w_{i-1}})^2 \\
\hat{v_i} & = \frac{v_i}{1-\beta_1^t} \\
\hat{G_i} & = \frac{G_i}{1-\beta_2^t} \\
w_i & = w_{i-1} - \frac{\alpha \hat{v_i}}{\sqrt{\hat{G_i} + \epsilon}}
\end{split}
$
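A minimal NumPy sketch of the Adam update with the time-varying bias correction above (the hyperparameter values are the commonly used defaults, not from the lecture):

import numpy as np

def adam_step(w, v, G, grad, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad            # first moment (velocity)
    G = beta2 * G + (1 - beta2) * grad**2         # second moment (accumulator)
    v_hat = v / (1 - beta1**t)                    # bias correction
    G_hat = G / (1 - beta2**t)
    w = w - alpha * v_hat / np.sqrt(G_hat + eps)
    return w, v, G

# toy usage on L(w) = 0.5*||w||^2 (gradient is w); alpha is deliberately large here
w, v, G = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w, v, G = adam_step(w, v, G, grad=w, t=t)
print(w)                                          # approaches the minimum at 0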
It's important to note that all these optimizers behave differently depending on the loss landscape/surface; they exhibit different behaviours such as overshooting, stagnating, etc. Plain SGD+Momentum can generalize better than adaptive methods but requires more tuning.
First-order optimization methods use a learning rate. Theoretical results rely on an annealed learning rate.
Several Typical Schedules:
This is a crucial aspect needed in DL as well as ML.
Some examples are
L1 norm: encourages sparsity, driving many weights to (or near) zero so that only a few features are kept.
L2 norm: penalizes large weights, encouraging small weights spread across features; similar in spirit to L1 but it does not force weights to exactly zero.
Elastic L1/L2: a weighted combination of the two penalties.
A commonly encountered problem is that a network will learn to rely heavily on a few strong features that work very well on the training set. This often results in overfitting, as the model is not representative of the data.
To prevent this we employ dropout regularization: for each node, keep its output with probability p. Activations of deactivated nodes are essentially zero. This masks out a random subset of nodes in each iteration; in practice it is implemented with a mask recalculated at each iteration. During testing you wouldn't drop any nodes.
The dropping of nodes presents a challenge though: during training each node has, in expectation, p × fan-in active incoming nodes, while during testing all nodes are activated. This violates a basic principle in model building, namely that the training and testing data should have similar input/output distributions.
We solve this by scaling the outputs (or, equivalently, the weights) by p at test time.
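A small NumPy sketch of dropout as described above: a random mask keeps each node with probability p during training, and the outputs are scaled by p at test time. (The commonly used "inverted dropout" variant instead scales by 1/p during training.)

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p, train=True):
    if train:
        mask = rng.random(h.shape) < p   # keep each node with probability p
        return h * mask
    return h * p                         # test time: all nodes active, outputs scaled by p

h = rng.normal(size=(4, 8))              # activations of some layer
print(dropout_forward(h, p=0.5, train=True))
print(dropout_forward(h, p=0.5, train=False))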
Why does this work?
In this section we will look at data augmentation techniques to prevent overfitting. The idea is simple: we apply a series of transformations to the data. This is essentially free and increases the amount of data. Of course we must not change the data's meaning: flipping an image is fine. We want a range of transformations that mirror what happens in the real world. What about a random crop of an image? This is also fine, as it mirrors the real world; we've reduced the data but haven't really changed it, and using this technique can also increase the robustness of your model. Another related method is CutMix, where portions of an image are cut out (or replaced with a patch from another image).
A more sophisticated approach is color jitter, performed by adding/subtracting values in the red, green, or blue channels. Other transforms include translation, rotation, scale, and shearing. Of course you can also mix and combine these different techniques. These transforms serve to increase your dataset using manipulations of the original.
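A rough NumPy sketch of a few of these augmentations on an H x W x 3 image array (in practice you would typically use a library such as torchvision.transforms):

import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float32)   # stand-in image

flipped = img[:, ::-1, :]                      # horizontal flip

top, left = rng.integers(0, 16, size=2)        # random 48x48 crop
cropped = img[top:top + 48, left:left + 48, :]

jitter = rng.uniform(-20, 20, size=3)          # color jitter: shift each RGB channel
jittered = np.clip(img + jitter, 0, 255)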
Another (oddly named) approach is the CowMix variation. This is when an image is masked with a cow hide pattern and then some noise is added. The noise is optional as you can also use the mask to blend two images together in a non linear way.
Let's now turn our attention to the training and monitoring of our Neural Network.
Analyzing what is happening always begins with good methodology.
Optimization and Sanity Checks
The change in loss is indicative of the speed of learning. Always plot and monitor learning curves (iterations vs. loss); this reveals issues:
HINT
PyTorch has a neat little context manager to help diagnose issues in the loss:
from torch import autograd

with autograd.detect_anomaly():
    output = model(input)
    loss = criterion(output, labels)
    loss.backward()
This is handy for debugging
Of course classic machine learning signs of under/over fitting still apply
There are lots of hyperparameters to tune (n.b. the weights are NOT hyperparameters). Hyperparameters generally refer to the design decisions that go into the construction of the network.
Even a good idea will fail if not tuned properly!
Typically you should start with a coarse search
There are automated methods that are decent, but intuition (and even randomness) can do well given enough of a tuning budget.
Interdependence of Hyperparameters can be troublesome
Remember that in DL we are optimizing a loss function that is differentiable. However, what we actually care about are the metrics surrounding the model, which we cannot optimize directly (they lack derivatives).
The relationship between these and the loss curve can be complex.
Here's another example that looks at True Positive Rate (TPR) & False Positive Rate (FPR).
Finally we can obtain a curve by varying the probability threshold. The area under the curve (AUC) is a common single number metric used to summarize.
Mapping between these and the loss however is not so simple or straight forward.
Guest Lecturers:
Overview of common wrangling techniques, along with specific considerations for deep learning. At a high level, wrangling encompasses all the steps used to transform your data for modeling purposes: setting up cross-validation pipelines, evaluating results in light of the transformations, and reproducibility of results. We will also look at matters such as handling missing data and addressing class imbalances.
We will target several areas/steps in the DL model building and evaluation process.
Step 1: What is the population of interest?
Two simple probability sampling methods are
There are many others. These are just the simple ones
Best Practices
We will now focus on the training and testing steps in our illustration above.
Step 2 How do we cross-validate to evaluate our model? How do we avoid overfitting and data mining?
Cross-validation is loosely defined as a method for estimating prediction error. One cross-validation strategy is K-fold cross-validation: split the data into k groups, build the model on k-1 groups, and use the remaining group as your test set. Repeat this process for each group and take the average error over all the groups to compute the cross-validation error.
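A minimal NumPy sketch of K-fold cross-validation (the model-fitting and error functions are placeholders you would supply):

import numpy as np

def kfold_cv_error(X, y, fit_fn, error_fn, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_fn(X[train_idx], y[train_idx])                     # build on k-1 groups
        errors.append(error_fn(model, X[test_idx], y[test_idx]))       # evaluate on the held-out group
    return np.mean(errors)                                             # average error = CV estimate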
Random vs. grid search (Bergstra & Bengio, 2012)
Class imbalance occurs when one class has a significantly larger sample size than another class (the minority class). This is problematic in classification-type problems.
Methods for handling imbalances
Consider the problem of object detection: Region CNN and Single Shot Detector (SSD) are models that can localize and classify many objects in an image. They work by densely sampling many boxes of different sizes at different anchor points in an image.
In object detection our goal is to classify these boxes into foreground and background. We do this by measuring the IoU (Intersection over Union). (IoU = Area of Overlap / Area of Union ). A Proposal box is assigned a ground truth label of
Cross-entropy is a popular loss for classifying these boxes:
Cross-entropy: easy examples incur a non-negligible loss which, in aggregate, masks out the harder, rare examples.
$
\large
\begin{equation}
CE(p,y) =
\begin{cases}
-log(p) &\text{ if } y = 1 \\
-log(1-p) &\text{otherwise}
\end{cases}
\end{equation}
$
An Alternative loss is Focal loss: down-weights easy examples, to give more attention to difficult examples
$FL(p_t) = -(1-p_t)^{\gamma} log(p_t) $
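A small NumPy sketch comparing cross-entropy with focal loss for a binary problem (γ = 2 is a commonly used value, not taken from the lecture):

import numpy as np

def cross_entropy(p, y):
    return np.where(y == 1, -np.log(p), -np.log(1 - p))

def focal_loss(p, y, gamma=2.0):
    p_t = np.where(y == 1, p, 1 - p)           # probability assigned to the true class
    return -(1 - p_t) ** gamma * np.log(p_t)   # easy examples (p_t near 1) are down-weighted

p = np.array([0.95, 0.6, 0.1])                 # predicted P(y=1): easy, medium, hard positives
y = np.ones(3)
print(cross_entropy(p, y))   # [0.051 0.511 2.303]
print(focal_loss(p, y))      # [0.000 0.082 1.865] -- the easy example nearly vanishes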
(Figure: points A through E, in red, denote the minority class; the left panel illustrates the undesirable case.)
We now focus on Model prediction and evaluation.
Step 3 What prediction task (classification vs regression) do we care about? What is the meaningful evaluation criteria?
Consider a binary classification model whose aim is to predict the likelihood of a test observation being in class one. We consider class one as positive, and class zero as negative. A typical approach to classification prediction is to compute a confusion matrix.
If this were a Regression problem we might instead look at
Taking classification one step further may involve computing the associated probability of the class label. If we care about the score, then calibration can also be beneficial. Here you might look at calibration curves, which plot the mean predicted probability on the x-axis and the fraction of positives on the y-axis.
The black dotted line along the diagonal is the ideally calibrated model, where mean predictions are exactly equal to the fraction of positives; this represents the perfect model. Anything that deviates from the diagonal is miscalibrated. In the accompanying figure, the logistic curve is closest to the diagonal.
Importance of Baselines: What are you comparing your model against?
Best Practices
3 Areas of Data cleaning for Deep Learning
STEP 1 Cleaning
How do we handle missing data? It depends on the nature of its absence.
Dropping a row of data due to a missing feature is easy but reduces the expressiveness of the data. Imputing (computing a best-guess value for the missing feature) is another approach.
STEP 2 Transforming
This depends highly on the input data format. For example, an RGB image may be converted into greyscale. In an NLP problem, text will generally need to be converted into a numerical representation. A basic approach might be to assign each word a number or other unique value. You could also use a bag-of-words representation such as TF-IDF (Term Frequency times Inverse Document Frequency), or some other embedding.
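A brief sketch of turning raw text into a TF-IDF representation with scikit-learn (the toy sentences are made up; assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "deep learning needs numerical inputs"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: documents x vocabulary
print(X.shape)
print(vectorizer.get_feature_names_out())   # the learned vocabulary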
STEP 3 Pre-Processing
This helps to make our models converge faster. The most commonly applied technique is scaling our data, with the choice based on the type of model: zero-centered and normalized data are the two most popular.
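A two-line NumPy sketch of the zero-center and normalize preprocessing mentioned above (in practice the statistics should be computed on the training split only):

import numpy as np

X = np.random.default_rng(0).normal(loc=10.0, scale=5.0, size=(100, 4))
X_scaled = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)   # zero-centered, unit variance per feature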
Example Depth Estimation:
For each pixel in an image we are trying to predict the distance from the object to the camera that took the picture. Consider the following image.
Notice the holes in the image? Depth sensors are often highly noisy and filled with small gaps and holes that need to be filled, i.e. we need to clean the image and handle the holes.
Another more complex transform is to take the 1 channel map and transform to 3 layers/channels.
Instead of a single channel disparity map, we feed all three into our model to improve the results of the same task.
For depth estimation, inverse depth helps to
It should be stressed again that 2 & 3 are determined in the absence of protected attributes.
MODULE COMPLETED!!!!!!
Until now we've focused exclusively on linear and non-linear layers. We've also discussed at length fully connected (FC) layers, where every output node is connected to every input node. Of course, these are not the only types of layers, and in this section we will begin exploring another type: convolutional neural networks.
Layers don't need to be fully connected!! For example, when building image based models it tends to make more sense to look at areas of an image rather than look at each pixel individually. So we might define nodes to focus on small patches of inputs, or windows. To approach this we consider the idea of convolution operations as a layer in the neural network.
Recall the idea of convolution. A convolution is a process whereby a kernel (a small matrix) is multiplied against a window of the input, and this process is repeated for ALL possible windows in the target data (usually an image). The reason this is so beneficial is that kernels can be created to extract features from an image.
Mathematically we can define the convolution of a signal x(t) with a weighting w(a), where a is the age of a measurement: $$ s(t) = (x \ast w)(t) = \int x(a)w(t-a)da = \sum_a x(a)w(t-a) $$
For a ML problem the input x(t) is usually a multidimensional array such as an image. And the weighting is reflected in the kernel. We also will often use the two dimensional version as follows. $$ S(i,j) = (I \ast K)(i,j) = \sum_m \sum_n I(m,n) K(i-m,j-n) \text{ or } \sum_m \sum_n I(i-m,j-n) K(m,n)$$
Properties of convolutions:
Sparse interactions (aka sparse weights, sparse connectivity). This happens when the kernel K is smaller than the input. If there are M inputs and N outputs, then full matrix multiplication requires M×N parameters and thus has complexity O(M×N). If we limit each output to K connections, where K < M, we reduce the parameters and complexity to O(K×N).
Parameter Sharing This refers to the use of the same parameter for more than one function in a model.
Equivariance To say a function is equivariant means that if the input changes, then the output changes in the same way. ie f(g(t))=g(f(t))
Here are some examples of convolutions.
Example: To focus on the edges, in an image use
This new convolution layer can take any input 3D tensor (say, RGB) and output another similarly shaped output. In fact what we will be looking to do is to use/apply multiple kernels for various features we want to extract. We will also need to take the output of these convolutions and organize them into feature maps so our neural network may perform the needed analysis.
Convolution layers can be combined with non-linear and pooling layers which reduce the dimensionality of the data. For example we can take a 3x3 patch from an image and take the max. This effectively reduces 9 numbers down to 1.
The following image gives a high-level view of convolutional neural networks. The idea is to extract more and more abstract features from the image/data. Finally, at the end we add our fully connected layer(s) to produce the results. Hopefully by the time we get to the last stage our tensors have been reduced to a manageable size.
Here is a more advanced example of the type of architecture that is used in practice. These have existed since the 1980's. One application of these has been to read scanned cheques cashed in a bank. (Note that a Gaussian Connection is just a fully connected layer)
Nowadays these have become much more complex.
Backprop and autodiff allow us to optimize any function composed of differentiable blocks
However, full connectivity in linear layers doesn't always make sense! Consider for example a 1024x1024 image; assuming it's greyscale, that's 1,048,576 pixels, and therefore we need roughly a million × N weights plus another N bias terms just to feed it into a fully connected layer. This begs the question: is this really necessary?
Image features are spatially localized: useful features such as edges, colors, and motifs (corners) are small and repeated across the image, and the same feature can appear anywhere rather than at one fixed location.
How can we induce a bias in the design of a neural network layer to reflect this?
Idea 1: Receptive fields
What about learning location-specific features?
Idea 2: Shared Weights
What about learning many features? without sharing?
Idea 3: Learning Many Features
In mathematics a convolution is an operation on two functions f and g producing a third function that is typically viewed as a modified version of one of the original functions, giving the area of overlap between the two functions as a function of the amount that one of the original functions is translated.
There is also a highly similar and important related operation called cross correlation.
Formally the mathematical definition:
Convolution of two function f and x over a range t is defined as
$$\large y(t) = f \otimes x = \int_{-\infty}^{\infty} f(k) \cdot x(t-k) dk$$
Calculating convolution by sliding image patches over the entire image. One image patch (yellow) of the original image (green) is multiplied by the kernel (red numbers in the yellow patch) and the sum is written (output) to the feature map pixel in the convolved feature map (right side).
Summary: Convolutions are just simple linear operations
So why bother? Why not just call it a linear layer with a small receptive field?
CNN Layer Hyperparameters (PyTorch):
Output size of a vanilla convolution operation is $(H - k_1 + 1) \times (W - k_2 + 1)$
When you zero-pad the image borders to produce an output the same size as the raw input, you have a "same" convolution
Stride refers to the movement of the convolution. The default of one means we move the filter one pixel at a time, but we can take larger strides. For example, if stride = 2 then the convolution is applied at every other pixel: we take the convolution, then move two pixels before applying the filter again.
A "full" convolution is one where enough zeros are added to the borders such that, every pixel is visited k times in each direction. This results in an image of size m+k-1.
Summary
While we have generally spoken of images and kernels as 2-dimensional, this isn't a hard and fast rule. In fact, many of the images we work with have colour and therefore have a third dimension, called the channel dimension (3 for RGB). It would not exist in a greyscale, or black and white, image.
To perform convolutions on such images we simply add a 3rd channel to our convolution kernel, and multiply/copy our kernel across all 3 dimensions. To perform the convolution we again perform elementwise multiplication between the kernel and the image patch, then sum them up like a dot product.
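A naive NumPy sketch of this multi-channel operation (cross-correlation style, as commonly implemented); real frameworks use much faster implementations:

import numpy as np

def conv2d_multichannel(img, kernel):
    """img: (H, W, C); kernel: (k1, k2, C). Returns the (H-k1+1, W-k2+1) feature map."""
    H, W, _ = img.shape
    k1, k2, _ = kernel.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = img[r:r + k1, c:c + k2, :]       # image window across all channels
            out[r, c] = np.sum(patch * kernel)       # element-wise multiply, then sum (a dot product)
    return out

img = np.random.default_rng(0).random((8, 8, 3))
kernel = np.random.default_rng(1).random((3, 3, 3))
print(conv2d_multichannel(img, kernel).shape)        # (6, 6)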
Of course we can also vectorize these operations by flattening the image patch as row vectors, then flatten the kernels and take their transpose to produce column vectors.
Recall that dimension reduction is an important aspect of Machine learning. How can we make a layer to explicitly down sample an image or feature map? Well the answer is pooling operations!
Parameters needed to Pool:
Example 1 - Max Pooling:
Stride a window across an image and perform per-patch max operation
$\normalsize
max(X[0:3,0:3])
= max \left( \begin{bmatrix} 200 & 150 & 150 \\ 100 & 50 & 100 \\ 25 & 25 & 10 \end{bmatrix} \right )
= 200$
The great part about this is that there's no parameters to learn!
We are not restricted to max either. In fact you can use any differentiable function (average, min, etc etc)
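A short NumPy sketch of max pooling with a 3x3 window (the stride equals the window size here for simplicity):

import numpy as np

def max_pool(x, k=3):
    H, W = x.shape
    out = np.zeros((H // k, W // k))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = x[r * k:(r + 1) * k, c * k:(c + 1) * k].max()   # per-patch max
    return out

X = np.array([[200, 150, 150],
              [100,  50, 100],
              [ 25,  25,  10]])
print(max_pool(X, k=3))   # [[200.]] -- the 9 numbers reduce to 1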
Since the output of convolution and pooling layers are multi-channel images we can sequence them just as any other layer.
This combination adds some invariance to translation of features. Remember that convolution itself is equivariant to translation: the output values are preserved, but they shift by the same amount as the translation of the input.
Excerpts from (https://www.deeplearningbook.org/contents/convnets.html)
CNNs typically consist of three stages: First stage is the application of several convolutions in parallel to produce a set of linear activations. In the second stage a nonlinear activation is applied to each linear activation. Lastly we apply a pooling layer. A pooling function is one that replaces the output of the network at a node/point with a summary statistic of the nearby outputs. Example max pooling takes the max value in a neighborhood. Other examples include minimum, average, L2 norm, and weighted average.
The benefit of pooling is that it makes the representation invariant to small changes in the input: we care more about the existence of a face in an image than about its exact location. Another way of thinking about pooling is that it provides an infinitely strong prior that the function the layer learns must be invariant to small translations.
It is instructive to calculate the backward pass of a convolution layer. Similar to the fully connected layer, it will be a simple vectorized linear algebra operation! We will also see a duality between cross-correlation and convolution.
As a reminder, Cross-Correlation is defined as
$y(r,c)=(x \ast k)(r,c) = \sum_{a=0}^{k_1 - 1 } \sum_{b=0}^{k_2 - 1} x(r+a,c+b)k(a,b)$
Where $a,k_1$ represents the summation over the rows, and $b,k_2$ over the columns.
so $|y|=H \times W$
$\frac{\partial L}{\partial y}$ Assuming size $H \times W$ (add padding for convenience)
$\frac{\partial L}{\partial y(r,c)}$ to access the element
Similar to before we can now compute the gradients using our friend the chain rule
Important Note: The right side gradients are for the weight update, which in this case is the kernel values
From the previous picture we have that the gradient for the weight update is given by $\frac{\partial L}{\partial k(a,b)}$
We could calculate this one pixel at a time using $\frac{\partial L}{\partial k(a,b)}$. It may not be obvious here, but each kernel value affects every pixel in the output, because the kernel is strided across the image. This means the gradient flow touches every output pixel, so we need to incorporate all the upstream gradients.
So for all upstream gradients we have $\large \{ \frac{\partial L}{\partial y(0,0)}, \cdots , \frac{\partial L}{\partial y(H,W)} \}$
Applying the chain rule yields
$\large \frac{\partial L}{\partial k(a',b')}
= \sum_{r=0}^{H - 1} \sum_{c=0}^{W - 1} \frac{\partial L}{\partial y(r,c)} \frac{\partial y(r,c)}{\partial k(a',b')}$
Where the summation is over all the pixels, $\frac{\partial L}{\partial y(r,c)}$ is the upstream gradient, and $\frac{\partial y(r,c)}{\partial k(a',b')}$ is what we need to compute
Now we observe that $\frac{\partial y(r,c)}{\partial k(a',b')} = x(r+a',c+b')$ where a',b' are the pixels from the kernel, and r,c are from the input image
Therefore we have $\frac{\partial L}{\partial k(a',b')} = \sum_{r=0}^{H - 1} \sum_{c=0}^{W - 1} \frac{\partial L}{\partial y(r,c)} x(r+a',c+b')$ Looking at this we can see that it is the cross correlation between the upstream gradient and the input.
Derived analytically
$\begin{split}
\frac{\partial L}{\partial x(r',c')}
& = \sum_{pixels} \frac{\partial L}{\partial y(p)} \frac{\partial y(p)}{\partial x(r',c')} \\
& = \sum_{a=0}^{k_1 - 1 } \sum_{b=0}^{k_2 - 1} \frac{\partial L}{\partial y(r'-a,c'-b)} \frac{\partial y(r'-a,c'-b)}{\partial x(r',c')}
\end{split}$
Using the definition of cross correlation (use $a',b'$ to distinguish from prior variables):
$y(r',c')=(x \ast k)(r',c') = \sum_{a'=0}^{k_1 - 1 } \sum_{b'=0}^{k_2 - 1} x(r'+ a',c' + b') k(a',b')$
Now plug in what we want
$y(r'-a,c'-b)=(x \ast k)(r'-a,c'-b) = \sum_{a'=0}^{k_1 - 1 } \sum_{b'=0}^{k_2 - 1} x(r'- a + a',c' - b + b') k(a',b')$
What is $\frac{\partial y(r'-a,c'-b)}{\partial x(r',c')} = ?$ Well it turns out this is just $k(a,b)$
We want the term with $x(r',c')$ in it; this happens when $a'=a$ and $b'=b$.
Which we can plug back into our equation from earlier
$\begin{split}
\frac{\partial L}{\partial x(r',c')}
& = \sum_{a=0}^{k_1 - 1 } \sum_{b=0}^{k_2 - 1} \frac{\partial L}{\partial y(r'-a,c'-b)} \frac{\partial y(r'-a,c'-b)}{\partial x(r',c')} \\
& = \sum_{a=0}^{k_1 - 1 } \sum_{b=0}^{k_2 - 1} \frac{\partial L}{\partial y(r'-a,c'-b)} k(a,b)
\end{split}$
Hopefully this looks familiar? It's the convolution between the upstream gradient and the kernel. Which we can implement by just flipping the kernel and applying cross correlation. And again, all operations can be implemented via matrix multiplications, just as we do in a fully connected layer.
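A numerical NumPy/SciPy sketch of these two results on a small made-up example: the kernel gradient is the cross-correlation of the input with the upstream gradient, and the input gradient is the "full" convolution of the upstream gradient with the (flipped) kernel. scipy.signal is used only to keep the code short.

import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
x = rng.random((5, 5))               # input image
k = rng.random((3, 3))               # kernel
y = correlate2d(x, k, mode="valid")  # forward pass: a 3x3 output

dL_dy = rng.random(y.shape)          # pretend upstream gradient

dL_dk = correlate2d(x, dL_dy, mode="valid")   # cross-correlate input with upstream gradient
dL_dx = convolve2d(dL_dy, k, mode="full")     # convolve (flip + correlate) upstream gradient with kernel

# spot check dL/dk(0,0) against the explicit sum from the derivation
print(np.isclose(dL_dk[0, 0], np.sum(dL_dy * x[0:3, 0:3])))   # True
print(dL_dk.shape, dL_dx.shape)                               # (3, 3) (5, 5)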
Simple CNNs
Since the output of a convolution and a pooling layers are (multi-channel) images, we can sequence them just as any other layer.
ie Image --> Convolution (+NonLinear Layer) --> Pooling --> ??
In the last position we can perform a (convolution --> pooling) a second time or create a fully connected layer
Here's a depiction of a simple CNN. The end goal is to have a more useful but lower dimensional set of features to feed into our fully connected layer. Which should simply result in some classifier.
Now we have an end-to-end CNN that starts with the raw input, extracts more and more abstract/meaningful features, and finally feeds them into an FC layer to obtain a prediction. Through the loss we can trace back the gradient flow and update the weights/kernels, pushing gradients backwards until we get to the raw image.
Effectively we are optimizing our kernels to minimize the loss at all stages.
One side effect that should be noted is the increasing receptive field of a pixel as you trace it backwards through the network. This should not come as a surprise, as pooling and convolution reduce the size of the input.
Advanced CNNs
Here is the cutting edge CNN architecture ( Back in 2012 ):
Key aspects were
Other aspects include:
These can be pretty memory intensive though.
How can we generalize our learning (from training to testing dataset). We will also discuss transfer learning which can significantly reduce the amount of labelled data that we need to fit the large number of parameters that we will have in these very complex networks.
In any model construction process there will be several sources of error that prevent you from fully modelling the real world.
What if we don't have a lot, or enough, data? We can reuse our data.
This works surprisingly well! Features learned for 1000 object categories will work well for the 1001st! It also generalizes across tasks, ie from classification to object detection.
Learning with fewer labels doesn't always work out well, though.
Given a model, we would like to understand what it has learned. Starting from random initialization, what values has it converged (trained) to? After optimization it should have found a set of optimal values. For linear classifiers we can take the learned weights, reshape them into images, and plot them in a normalized way. This gives us a picture of the weight template used to classify objects.
For convolutional layers the kernels are the weights. Recall that the kernel is convolved with the image in a repeated process with step size = stride, so high response values from the convolution indicate similarity.
We can also plot the activations after running a kernel across an image. Each activation layer yields an output map (aka feature map, activation map). These maps highlight the spatial locations that correlate strongly with the kernel; think of an edge map, for example.
Using gradients and gradient statistics lets us understand what the model is learning.
Finally we can look at the robustness of the Network to determine any weak points.
For FC layers, if the nodes are connected to the image itself, then you can simply reshape the weights, normalize(0-255) and view as an image.
What's interesting about these is that they began life as purely random values; at initialization they would look like a speckled image. By now they have begun to resemble patterns, or templates, with colour and texture. Of course, the smaller the kernel the smaller the image, so they can be difficult to interpret at times.
Visualizing activation/filter maps can be a lot trickier.
Problem: Small convolution outputs are hard to visualize/interpret.
We can take the activations of any layer (FC, conv, etc.) and perform dimensionality reduction to plot as follows
The better these groups are separated, the better the classifier will perform.
While these methods provide some visual interpretation, this is highly dependent on the interpreter and can often be misleading, or simply uninformative.
Assessing interpretability is difficult
Neural networks learn distributed representations, meaning they don't learn one feature per node; the nodes don't really represent any single feature either. A node that is activated for a face may also be activated for a bird (a silly example); the point being that a node that is good for one trait may also affect other traits. This adds to the difficulty of interpretation.
In a nutshell: Visualizations help provide some guidance and direction. They are not conclusive on their own.
Recall that the backwards pass gives us the gradients for all layers: how the loss changes as we change the different parts of the input. This is useful not only for optimization but also for understanding what was learned:
The first
In practice,
Applying traditional CV methods and algorithms to the gradients can get us object segmentation for free. This is surprising because segmentation is not really part of the supervision process.
This can be used to detect dataset bias. For example, the presence of snow might be used by a model to detect the presence of a wolf. So even a misclassification or incorrect prediction can be informative: looking at the gradients and using them to create a segmentation would reveal the presence of snow, indicating the bias.
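As a concrete illustration, here is a minimal saliency-map sketch; `model` and `image` are placeholders for a trained classifier and a (1, 3, H, W) input tensor:

    import torch

    image.requires_grad_(True)
    scores = model(image)
    scores[0, scores.argmax()].backward()        # gradient of the top class score w.r.t. the input
    saliency = image.grad.abs().max(dim=1)[0]    # max over channels -> a (1, H, W) saliency map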
Rather than loss or score we can pick a neuron somewhere deep in the network and compute the gradient of the activation w.r.t. the input.
Steps
It should be stressed that the way we've been using backprop up until now may not be good for visualizing. Until now we've used it for optimization, and for that it is the right tool, but now our goal is to visualize, not optimize. So we must alter our general method to account for this.
When visualizing we may get parts of an image that decrease the feature activation, i.e. parts of an image with negative gradients. We want to focus on areas with positive influence/gradients.
Forward pass (ReLU): positive values remain, negatives get zeroed out

     1 -1  5        1  0  5
     2 -5 -7  -->   2  0  0
    -3  2  4        0  2  4

Backward pass, standard backprop: the upstream gradient is zeroed wherever the forward input was negative

    -2  0 -1       -2  3 -1
     6  0  0  <--   6 -3  1
     0 -1  3        2 -1  3

Backward pass, "deconvnet" (composed of deconvolution and unpooling layers): only positive upstream values are passed back; negative influences are removed

     0  3  0       -2  3 -1
     6  0  1  <--   6 -3  1
     2  0  3        2 -1  3

Backward pass, guided backprop: both the forward-pass and deconvnet rules are applied, so only positive influences remain and get propagated

     0  0  0       -2  3 -1
     6  0  0  <--   6 -3  1
     0  0  3        2 -1  3
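These three rules are just element-wise masks; a small sketch reproducing the numbers above:

    import torch

    forward_in = torch.tensor([[1., -1., 5.], [2., -5., -7.], [-3., 2., 4.]])
    upstream   = torch.tensor([[-2., 3., -1.], [6., -3., 1.], [2., -1., 3.]])

    backprop  = upstream * (forward_in > 0)                    # zero where the forward input was negative
    deconvnet = upstream * (upstream > 0)                      # zero where the upstream gradient is negative
    guided    = upstream * (forward_in > 0) * (upstream > 0)   # both conditions (guided backprop)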
Layer-by-layer visualization of a network such as VGG can produce similar insights.
VGG: Layer by Layer Visualization
VGG stands for Visual Geometry Group; it is a standard deep Convolutional Neural Network (CNN) architecture with multiple layers. The "deep" refers to the number of layers, with VGG-16 and VGG-19 consisting of 16 and 19 weight layers respectively.
The first layer of a neural network produces what could best be described as edge-orientation outputs. It may also learn things like colour or texture patterns. In general, however, it tends to find low-level features.
In the second layer we'll see that these low-level features, such as edges, are combined to find higher-level features such as motifs (corners, circles, shapes, etc.), as well as more interesting patterns and textures.
In layer 3 we get more abstract. We find things like wheels of a car or different parts of people.
By layer 4 and 5 we are nearing our templates. Here we may have the entire object being detected, or at the very least more than just one part as may have been located using layer 3.
Grad-CAM
As researchers began plotting more and more saliency maps, more and more sophisticated methods began to emerge to understand what neural networks consider to be important. One of these methods is the Grad-CAM.
Grad-CAM tries to determine where a CNN is looking in order to come to a decision. It can be used for
Grad-CAM is a form of post-hoc attention, meaning that it is a method for producing heatmaps that is applied to an already trained (pretrained/prebuilt) neural network, after training is complete and the parameters are fixed. (Whereas trainable attention refers to the production of a heat map in the building/training phase.)
Define $y_c$ as the score of the class c, ie the output for class c BEFORE the softmax
GradCam Process:
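A hedged sketch of that computation, assuming `activations` holds the chosen conv layer's feature maps (shape (1, K, H, W)) and `grads` holds the gradient of $y_c$ with respect to those maps, e.g. captured with hooks (not shown):

    import torch
    import torch.nn.functional as F

    weights = grads.mean(dim=(2, 3), keepdim=True)     # global-average-pool the gradients per channel
    cam = F.relu((weights * activations).sum(dim=1))   # weighted sum of feature maps, then ReLU
    cam = cam / (cam.max() + 1e-8)                     # normalize to [0, 1] before upsampling/overlaying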
Summary
The idea here is that since we have the gradient of the scores relative to the inputs, it begs the question: Can we optimize the input itself to maximize the score?
Why? Create images from scratch, Adversarial examples, searching for a particular object (ie looking for a class such as elephant that may not be in many images)
We do this by performing gradient ascent on an image. Rather than subtracting "learning rate * gradient" we add it instead, i.e. $I = I + \alpha \frac{\partial S_c}{\partial I}$. We can also add a regularization term which, for example, prefers smaller pixel values over larger ones.
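A minimal sketch of this loop; `model` is a trained, frozen network, `c` is the target class index, and `alpha`/`weight_decay` are hyperparameters to tune (all assumptions here):

    import torch

    I = torch.zeros(1, 3, 224, 224, requires_grad=True)
    for step in range(200):
        score = model(I)[0, c]              # class score S_c (before softmax)
        score.backward()
        with torch.no_grad():
            I += alpha * I.grad             # ascend the score
            I -= weight_decay * I           # simple regularizer preferring small pixel values
            I.grad.zero_()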
Procedure
Consider the following image. While less than optimal, you can see the formation of a recognizable image.
We can improve these results with
The methods up until now have shown us how to perform gradient ascent on a real image to maximize a score. So what if we want to start from scratch, i.e. create an image from nothing, using just our algorithm? Also, what if we want to optimize the score of an arbitrary class (the score of being incorrect)?
It turns out that only a very small change (per pixel) is needed to optimize for incorrectness. By adding a bit of noise or perturbation we can cause a network to misclassify.
This problem is not unique to Deep Learning/Machine Learning
There are various forms of these types of attacks (attacking the NN to confuse it)
Similar to other security related areas it's very much a cat and mouse game.
Some defenses exist and their performance/success depends on the situation
There are no universal methods, however, that are robust to all types of attacks.
Some other forms of robustness testing
Understanding the biases of CNNs
As it turns out, almost all neural networks are biased towards texture, and the most effective ones, such as ResNet-50, are even more so.
Summary
This refers to the generation of an image where the content of the target image remains the same but the style, or texture, is transferred from another image.
Images can be generated through backpropagation
But what if we want to preserve the content of a particular image? (let's denote this image C)
example: Suppose we have an image of a dog that we wish to replicate
We can take this another step forward as well! Recall that backward edges going into the same node are summed. So we can take the loss at each of the layers and sum them up: $L_{content}=\sum_l (F_C^l - F_P^l)^2$ where $l$ ranges over the indices of the layers.
Idea: We want the content of one image and the texture/style of another image!
The idea here is very similar to the process above. We start with two images; let's call them the content image and the style image. We compute their feature maps, or activations. We also create a third image from scratch, from nothing. Now what we need to do is minimize the loss between our blank image and the feature layers of both the style and content images. It should be clear these need to be minimized together, at the same time. But we can't use the same loss on both sides; otherwise you'll get converging content or converging style, not the mix of both that we want.
How do we represent similarity in terms of textures? There is a long history of this in classical computer vision.
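One widely used deep-learning representation of texture is the Gram matrix of a layer's feature maps (the basis of neural style transfer). A sketch, where `gen_feats` and `style_feats` are assumed (C, H, W) activations of the generated and style images at one layer:

    import torch

    def gram_matrix(features):
        # features: (C, H, W) activations of one layer
        C, H, W = features.shape
        F = features.view(C, H * W)
        return F @ F.t() / (C * H * W)   # (C, C) matrix of channel-wise correlations

    # Style loss: match the Gram matrices of the generated and style images at this layer.
    style_loss = ((gram_matrix(gen_feats) - gram_matrix(style_feats)) ** 2).sum()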
Deep Learning variants
NB These are also quite handy for data augmentation!!
Topics:
From experiment to production: How do we optimize our models?
Consider a small example:
def pairwise_dist(a, b):
    p = a.shape[0]
    q = b.shape[0]
    squares = torch.zeros((p, q))
    for i in range(p):
        for j in range(q):
            diff = a[i, :] - b[j, :]
            diff_squared = diff ** 2
            squares[i, j] = torch.sum(diff_squared)
    return squares

a = torch.randn(100, 2)
b = torch.randn(200, 2)
%timeit pairwise_dist(a, b)
# 438 ms +- 16.7 ms per loop
This can be sped up using pytorch batched operations
def pairwise_dist(a, b):
    diff = a[:, None, :] - b[None, :, :]  # Broadcasting
    diff_squared = diff ** 2
    return torch.sum(diff_squared, dim=2)

a = torch.randn(100, 2)
b = torch.randn(200, 2)
%timeit pairwise_dist(a, b)
# 322 us +- 5.64 us per loop - us is microseconds
There are many ways to diagnose a bottleneck with PyTorch
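One option is PyTorch's built-in profiler; a small sketch profiling the function above:

    import torch
    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        pairwise_dist(a, b)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))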
Pytorch has two modes
Of course there are other (jit) functionalities that are beneficial
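For example, a traced (TorchScript) version of the function above can be serialized and deployed without Python; a minimal sketch:

    import torch

    # Tracing records the graph for example inputs; torch.jit.script is an alternative.
    traced = torch.jit.trace(pairwise_dist, (torch.randn(100, 2), torch.randn(200, 2)))
    traced.save("pairwise_dist.pt")   # later: torch.jit.load("pairwise_dist.pt")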
In this section we outline a basic data pipeline flow
from torch.optim import SGD

loader = ...
model = Net()
criterion = torch.nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)  # parameters() must be called; SGD also requires a learning rate

for epoch in range(10):
    for batch, labels in loader:
        outputs = model(batch)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# There are two basic types of data representations in pytorch

class IterableStyleDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        # Support for streams of data, read in order
        ...

class MapStyleDataset(torch.utils.data.Dataset):
    def __getitem__(self, key):
        # Map from (non-int) keys, in any order
        ...

    def __len__(self):
        # Supports sampling
        ...
from torch.utils.data import DataLoader, RandomSampler

dataloader = DataLoader(
    dataset,                         # only required for map-style datasets
    batch_size=8,                    # balance speed and convergence
    num_workers=2,                   # non-blocking when > 0
    sampler=RandomSampler(dataset),  # random reads may saturate the drive
    pin_memory=True,                 # page-lock memory for faster host-to-GPU copies
)

# Notes
# keep workers low and increase slowly!!
# pin_memory page-locks the data in RAM, which can eat up your memory rather quickly
Key Concepts:
Multiple Approaches
Single machine data parallel
This approach splits each batch across the GPUs for the forward pass, then gathers the results back onto a single device to compute the loss. PyTorch can do this for you, as well as keep track of each replica's gradients separately.
model = Net().to("cuda:0")
model = torch.nn.DataParallel(model)
Single machine Model Parallel
In this scenario it's parts of the model that are computed on different GPUs. For example
This is much trickier than before. It is intended for large models that exceed the capacity of a single GPU.
class Net(torch.nn.Module):
    def __init__(self, *gpus):
        super().__init__()
        self.gpu0 = torch.device(gpus[0])
        self.gpu1 = torch.device(gpus[1])
        self.sub_net1 = torch.nn.Linear(10, 10).to(self.gpu0)
        self.sub_net2 = torch.nn.Linear(10, 5).to(self.gpu1)

    def forward(self, x):
        y = self.sub_net1(x.to(self.gpu0))
        z = self.sub_net2(y.to(self.gpu1))  # move the intermediate result to the second GPU
        return z
model = Net("cuda:0","cuda:1")
Distributed data parallel
In this situation the data batches are split across multiple machines, each with possibly multiple GPUs. Similar to before, each batch chunk is sent to a machine+GPU for computation. In this case, though, each process holds its own replica of the model and computes its own loss, all in parallel.
(By contrast, single-machine DataParallel replicates the model in every forward pass, and its single-process, multi-thread parallelism naturally suffers from Global Interpreter Lock limitations; this is part of why DistributedDataParallel is preferred.)
def demo_model_parallel(rank, world_size):
    # setup(), ToyMpModel, cleanup() and run_demo() are helpers from the PyTorch DDP tutorial
    print(f"Running DDP with model parallel example on rank {rank}.")
    setup(rank, world_size)

    # setup mp_model and devices for this process
    dev0 = (rank * 2) % world_size
    dev1 = (rank * 2 + 1) % world_size
    mp_model = ToyMpModel(dev0, dev1)
    ddp_mp_model = DDP(mp_model)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    # outputs will be on dev1
    outputs = ddp_mp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(dev1)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()

n_gpus = torch.cuda.device_count()
assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
run_demo(demo_model_parallel, world_size)
Distributed data parallel with Model parallel
Similar to before - but this
def one_machine(machine_rank, world_size, backend):
    torch.distributed.init_process_group(
        backend, rank=machine_rank, world_size=world_size
    )
    gpus = {
        0: [0, 1],
        1: [2, 3],
    }[machine_rank]
    model = Net(gpus)
    model = torch.nn.parallel.DistributedDataParallel(model)
    # Training loop ...

for machine_rank in range(world_size):
    torch.multiprocessing.spawn(
        one_machine,
        args=(world_size, backend),
        nprocs=world_size,
        join=True,
    )
This is the most complicated of the bunch, and because of this pytorch provides a special set of distributed functions to handle the situation. torch.distributed.rpc (https://pytorch.org/docs/stable/rpc.html)
Here is a quick overview:
Hogwild is an approach that implements SGD in a distributed, asynchronous manner (and can be built on top of RPC or shared memory). Each worker process is given the ability to update the shared model parameters directly, which means workers can even undo or overwrite updates made by other workers, hence the "Hogwild" name. Interestingly, as long as updates are sparse, Hogwild has been shown to converge in a manner similar to standard SGD.
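A hedged sketch of Hogwild-style training using shared memory, one common way to set it up in PyTorch; `Net` and `train` (an ordinary training loop) are assumed:

    import torch.multiprocessing as mp

    model = Net()
    model.share_memory()            # parameters live in shared memory, visible to all workers
    processes = []
    for rank in range(4):
        p = mp.Process(target=train, args=(model,))   # each worker updates the shared model directly
        p.start()
        processes.append(p)
    for p in processes:
        p.join()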
Here is a typical development trajectory for training:
PyTorch distributed Data Parallel module attains a near linear scalability using 256 GPUs!!
Let us now move beyond classification to other tasks such as object detection and segmentation. So far our classification tasks involve feeding images into a neural network, which extracts features and uses them to produce class scores that are fed into a softmax function to give us output probabilities.
What may not be apparent is that so far our images are dominated by an object of a single class; in fact, the CNNs we've looked at so far depend on this characteristic. So how do we perform tasks when this doesn't hold true? For example, semantic segmentation tasks require us to parse multiple objects from a single image (road, buildings, cars, sky). In this situation we might want a probability distribution over classes for each pixel in the image. This would then give us the ability to distinguish between pixels belonging to a car versus, say, the road. In this type of task we care little about multiple instances; that situation is covered later. What about object detection? There we want to identify an area (aka bounding box) together with a class distribution for the box.
Given an image we want to output another image, where the output contains a class distribution per pixel. This is what's generally considered an image-to-image problem. There is an input tensor (C,H,W) which should output (D,H,W) where D represents the class, but the height and width remain the same.
This is a considerable leap from our previous work with CNNs. Generally a CNN downsamples the image, using the convolutional layers and pooling functions, to create a feature map. This reduces the dimensions to the necessary features, which it then uses to classify the image. So how do we upsample an image after it's been reduced to extract its features? Another problem we need to tackle is that fully connected layers, like those at the end of a CNN, ignore spatial information and throw it away. We need to keep this type of information for a segmentation task.
One idea is to simply disregard the linear layers. In fact we can simply replace them with even more convolutions! This is true because, as you might recall, convolutions are linear operators! If you need the math just remember that a fully connected layer is given by Wx+b, so we can represent the Wx by a convolution that simply outputs the dot product scalar value. In this type of situation each kernel can be thought of as outputting a single node (the scalar output). We can of course do this for other layers as well, not just the last.
Mathematically, a fully connected layer and a convolutional layer whose kernel is the same size as the input are equivalent. So it begs the question: why might we want to perform this trick? As it turns out, this allows us to no longer be dependent on a particular input size, which is useful for segmentation.
For example, suppose we have an original input of size 5x5 and a learned 3x3 kernel; this results in an output of size 3x3. If we place a fully connected layer after this and convert it into a convolution, we'll have a kernel that's exactly the size of that 3x3 output. So we'll have a convolutional kernel that is dot-producted with the input to produce one output node of the former fully connected layer, and we can use multiple such kernels if we want multiple output nodes. If we then feed the network a larger input, the output map at the last convolutional layer will simply be bigger, because the convolution operation is defined by striding the kernel over whatever input it receives; the input size does not matter. Hence we can retain the same convolutional layers as before and obtain a spatial map of class scores.
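A minimal sketch of that conversion, with the channel and class counts chosen arbitrarily for illustration:

    import torch.nn as nn

    # A fully connected layer over a 3x3x512 feature map is equivalent to a convolution
    # whose kernel is the same size as its input (here producing 10 "class score" channels).
    fc_as_conv = nn.Conv2d(512, 10, kernel_size=3)   # replaces nn.Linear(512 * 3 * 3, 10)
    # On a 3x3 input this yields a 1x1x10 output; on a larger input it simply yields a
    # larger spatial map of class scores, which is what segmentation needs.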
Facebook Lecturers:
Topics: FAT*
Overview
Why should we care?
Overview of some headlines
Goodhart's Law:
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes
In other words, if you incentivize optimization, you'll get overfitting.
Overview
Definition
A classifier is well calibrated if, among the observations assigned a given probability score for a label, the proportion that actually have that label equals that score.
Now let's take a look at the miscalibrated cases
Calibration for fairness implies calibration across groups, like group A and group B. Ideally you want the scenario on the right, where either the model is calibrated for both, or possibly even equally miscalibrated for both. The key point is that it is treating both the same way.
How do we measure divergence? Basically any of the standard statistical methods can be applied here. Some models (e.g. logistic regression) tend to have well-calibrated predictions. Some DL models (e.g. ResNet) tend to be overconfident. A popular method for dealing with such overconfident models is Platt scaling (aka temperature or logistic scaling).
Platt Scaling, Binary classifier
Learn parameters a,b such that the calibrated probability is $\hat{q_i}=\sigma(az_i + b)$ where $z_i$ is the network's logit output.
Temperature scaling, multiclass classification
Learn a temperature T, and produce calibrated probabilities $\hat{q}_i=\max_k \sigma_{\text{softmax}}(z_i/T)_k$
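A hedged sketch of fitting T on a held-out set, where `val_logits` and `val_labels` are assumed validation tensors:

    import torch

    T = torch.ones(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([T], lr=0.01, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(val_logits / T, val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)   # calibrated probabilities: softmax(logits / T)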
It's important to remember that calibration isn't the whole story. In fact it is just one element, and it has its own limitations, such as being group-based. On a philosophical note, the very use of groups may raise issues, for example which features you use to define them. One of the inherent trade-offs in calibration is that it is impossible for a classifier to achieve both equal calibration and equal error rates between groups (if there is a difference in prevalence between the groups and the classifier is not perfect).
Now we move past fully connected layers and convolutions, and begin looking at richer structures like sequences and graphs. We'll see that the notion of attention will be important: looking at sequences of things (or graphs), where you have many nodes and elements and need to determine their importance. Eventually we will look at combining computer vision and sequential learning to jointly process both modalities, using them as building blocks in a greater computational graph.
So far we have focused on fully connected layers and convolutional layers. In both cases the goal is to map high-dimensional structures into lower-dimensional hidden representations, in the form of vectors, from which we make our prediction. The reason we did this is to add architectural biases so that we can learn better and faster with less data. At the end of the day, any network with sufficient depth and mild assumptions can learn any function, so it really comes down to how much data is needed and how fast it can be done.
It should come as no surprise, then, that for other types of structures, like sequential data, there are other architectural biases that will perform much better. For example
These are just a few examples that have been written about. RNNs were the first. They contain a state vector that encodes all previous encounters; so to process a sentence the network keeps a vector summarizing all the words it has come across, i.e. read, before. Each new word triggers an update of the state vector. This addresses the limitation of a fully connected layer, which requires the data to be of fixed dimension: RNNs keep their state vector size fixed.
For all their merits, RNNs also have some disadvantages, such as optimizing over a long sequence (thousands of elements). As we backpropagate through a long sequence the gradients may get smaller and smaller, or larger and larger, leading to vanishing or exploding gradients. Another issue is that squeezing everything through a single vector is challenging: when dealing with a document of thousands of words, at any point in time we would need a vector that can represent all the words we have seen so far, even from the very beginning, while at the same time processing the last few paragraphs of the document. You have to propagate all the information necessary at any point in time across the entire sequential structure.
Attention-based systems are designed to remedy the memory problem. Rather than trying to remember everything in a document up to the current point, we keep a vector that represents each element (for example each word) in the document, as well as a weighting mechanism that updates with each new occurrence. Think of a softmax function which gives a probability distribution over all the elements encountered so far.
Transformer architectures are multi-layer attention models that are currently state of the art in most language tasks. They have superior performance to previous attention based architectures by using
While many of these were developed for sequential data, they need not be limited to it; they can be generalized. Softmax, for example, doesn't require the elements to be ordered. These notions can be applied to a graph composed of nodes and edges, where each node has a vector representing some features. We can then propagate across the graph, refining the features of each node using not just its local features but also the features of its neighborhood. So for any node we can compute refined features combining the local features with a weighted sum of the neighborhood features, and use that to update the node.
In many of the applications we will need to consider the architecture of the problem
In particular we will be looking at Natural Language Processing (NLP) tasks, so let's begin building some intuition. Consider how we might develop a mathematical model for NLP.
For a sentence s, composed of words $w_1,\cdots,w_n$
$\begin{split}
P(s) & = P(w_1,w_2,\cdots,w_n) \\
& = P(w_1)P(w_2|w_1)P(w_3|w_2,w_1) \cdots P(w_n|w_{n-1}, \cdots w_1) \\
& = \prod_i P(w_i|w_{i-1}, \cdots w_1)
\end{split}$
Another approach might be the skip-gram model used in Word2vec. Here they use words to predict their context words. Where context might be defined on a fixed window of size 2m.
Another approach, hinted at here, is embeddings. Think of this as a graph, so we may be able to associate words with preceding words. This gives us another important feature: the ability to predict based on the strength of the association, such as learned similarity. (The red arrows signify strength.)
Now we can put it all together in an MLP to get a fuller view and understanding.
This can even perform translation.
References
University of Toronto
Plain CNNs are not built to handle varying input/output sizes, but the earliest attempts at using sequential data did use them. In a standard feed-forward neural network a single data point (an image or observation) is given to the network in order to get a prediction. In an RNN the data is fed one element at a time; for example, each word in a sentence is given one at a time, and after each input the RNN formulates a response that takes into account the previous elements. In other words, the output of an RNN at time t+1 depends on the state of the network at time t, the step before.
There are multiple configurations of an RNN
There are three main types of RNNs: Vanilla, LSTMs, and GRUs.
RNNs refer to a family of neural networks intended for use with sequential data, in much the same way that CNNs excel for grid-style data such as images. Just as a CNN can scale itself to deal with large heights and widths, so too can an RNN adjust for longer and longer sequences.
In moving on from multilayer perceptrons and feed-forward networks we take advantage of the idea of weight sharing mentioned earlier. Parameter sharing makes it possible for the sequence length to increase without the model needing to adjust itself, once training is completed.
A recurrent neural network (RNN) is a special type of network adapted to working with sequential data, e.g. words in a sentence or a time series. The key to this type of problem, and to RNNs in general, is to understand that each data point depends in some fashion on the data before it.
NB: The weights ($w_x, w_h, w_y$) in each box are unique to that layer
Training an RNN requires some modification to the standard backprop algorithm.
In this section we introduce the concept of structured representations. Unlike CNNs, with their very specific connectivity biases, we will try to develop architectures that can learn some of these structured representations explicitly.
We will revisit the principles of building and optimizing feed-forward networks and generalize them to arbitrary computation graphs. As it turns out, the backprop algorithm simply needs some tweaking for sequential data. Recall that when we began we spoke at length about acyclic graphs and their ability to be modularized into any shape or number of components needed by a function.
Recall that one of the central drivers of convolutional networks was the need to develop architectural biases to represent the things we know to be true, such as stationarity: features can appear anywhere in an image. Using this approach we also showed that the network was able to learn a compositional feature space: low-level features such as edges, textures, or colours; mid-level features such as corners; and higher-level features such as object parts and objects.
So now we want to follow the same pattern of adding architectural biases, so we can represent arbitrary items or elements and not just pixels and windows. Compositional structures are not limited to images; here are some more examples from other domains.
Hierarchical Composition
We may go even further and ask whether we can create a more general mechanism that automatically learns these relationships. The structures that we want to model may represent relationships between elements (e.g. "man on top of horse in front of a mountain"). These represent yet another dimension, and we would need to model the spatial relationships between them. This can also take the form of relationships between words in a sentence, or elements in a scene.
The structures may go even higher yet. For example, we may want to represent all concepts in the world, such as a graph of all pages in Wikipedia. Current architectures simply cannot begin to represent such higher-level structures. Graphs are simply a more general way of looking at them. We can have embeddings, or vectors, for each node. (NB: embeddings are a learned map from entities to vectors of numbers that encode similarity.) Once we create an embedding map we can represent it as a graph where the nodes are feature vectors and the edges typically represent some sort of connectivity or similarity. This allows us to find all the inputs for a particular node. There can be multiple types of edges to represent different relationships (such as on top of, or below).
Graph embedding: optimize, via gradient descent, the objective that connected nodes have more similar embeddings than unconnected nodes.
When representing structured information, several things are important:
Similarly, in language problems we will encounter the same concepts. Here we will use an attention mechanism over a flat, linear structure. Attention is an extremely powerful concept: it allows us to represent an arbitrary set of information of any size. Softmax can also be applied to a set of any size, and you can then use the results to weight and combine information. This process can be applied to any sequential type of structure.
Let's take a moment now to formalize these concepts. We begin by looking at a simple example that ties back nicely to the ideas from the CNNs we have already discussed. Recall that we developed convolution layers because it did not make sense to connect an output pixel to all input pixels across a high-dimensional image or feature map. This was because we know we will want to extract low-level local features, which does not require connecting to every single pixel, but rather to a local area.
However, we could generalize this and allow the network to learn its own connectivity pattern. Here is an example of a Non-Local Neural Network. Unlike a convolution layer you don't have a very specific local receptive field. Instead we have the following equations:
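(The equations from the slide are not reproduced in these notes; for reference, the non-local operation is usually written as a weighted sum over all positions:)
$$ y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) $$
where $f$ scores the pairwise similarity between positions $i$ and $j$, $g$ computes a representation of the input at position $j$, and $C(x)$ is a normalization factor.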
Guest Lecturer: Jean Maillard - Facebook AI research scientist
Topics : Sequence Modelling
At a high level, a language model is simply trying to determine the probabilities of a sequence.
Language models allow us to estimate probabilities of sequences such as
This now allows us to model language as a sequence. So for a sentence S, composed of words $w_1,\cdots,w_n$, we can write
$\begin{split}
P(s) & = P(w_1,w_2,\cdots,w_n) \\
& = P(w_1)P(w_2|w_1)P(w_3|w_2,w_1) \cdots P(w_n|w_{n-1}, \cdots w_1) \\
& = \prod_i P(w_i|w_{i-1}, \cdots w_1)
\end{split}$
where $w_i$ is the next word and $w_{i-1},\cdots,w_1$ is the preceding context.
This sequential construction also helps us see how this can be used in a generative sense. In fact, all language models can be thought of as generative models.
There are many applications of this
Conditional Language Models extend this idea from
Some example of using extra content include
These can be used to model recurring sequential patterns such as Audio signals, Optical Character Recognition (OCR) and Sentiment Analysis.
The simplest approach to this is to use a one-hot vector representation where all the entries of a vector are zero, except for the entry corresponding to the word's index, which is one.
Pain Points
further readings
We will now examine RNN implementation.
RNNs are a family of neural architectures designed to model sequences. At every time step they receive an input and use it to update their state $h_t$, i.e. $h_t = F(\theta, x_t, h_{t-1})$ where $x_t$ is the input at time t and $\theta$ represents the trainable weights of the network. You may notice this is recursive in nature. Unrolling this gives us
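A minimal sketch of one step of this recurrence for a vanilla (tanh) RNN cell, where $W_{xh}$, $W_{hh}$, and $b$ stand in for $\theta$:

    import torch

    def rnn_step(x_t, h_prev, W_xh, W_hh, b):
        # h_t = F(theta, x_t, h_{t-1})
        return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b)

    # Unrolling over a sequence:
    # h = h0
    # for x_t in sequence:
    #     h = rnn_step(x_t, h, W_xh, W_hh, b)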
Observations
If the sequence is long then the computation graph gets long as well, because you have as many effective layers as you have time steps; this makes updates expensive. So what we do is use a form of "truncated backpropagation through time", where the states are carried forward forever, but we only backpropagate for a fixed number of time steps.
Vanilla RNN
So far our examples are what are commonly understood to be RNNs. More precisely, or formally, the term is used to indicate any neural network whose information flow does not follow a directed acyclic graph.
Issues
The commonly used solution to this problem is the LSTM architecture, which alleviates vanishing gradients
$$\begin{bmatrix} i_t \\ f_t \\ u_t \\ o_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \\ \sigma \end{bmatrix} \left( W_{\theta} \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_{\theta} \right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot u_t$$
$$h_t = o_t \odot \tanh(c_t)$$
(Note that $\odot$ denotes element-wise multiplication.)
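A sketch of these equations as code; `W_theta` maps the concatenated $[x_t; h_{t-1}]$ to the stacked pre-activations of the input (i), forget (f), update (u), and output (o) gates:

    import torch

    def lstm_step(x_t, h_prev, c_prev, W_theta, b_theta):
        gates = torch.cat([x_t, h_prev], dim=-1) @ W_theta.t() + b_theta
        i, f, u, o = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        u = torch.tanh(u)
        c_t = f * c_prev + i * u          # element-wise (Hadamard) products
        h_t = o * torch.tanh(c_t)
        return h_t, c_t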
Explained
Here's a high level view of how it works
In the previous section we looked at the standard language model $$P(s)=P(w_1,\cdots,w_n)=\prod_i P(w_i | w_{i-1},\cdots,w_1)$$ as well as Recurrent Neural Networks (RNNs) for modelling sequential data $$h_1=f(\theta,x_1,h_0) \Rightarrow h_2=f(\theta,x_2,h_1) \Rightarrow \cdots \Rightarrow h_n=f(\theta,x_n,h_{n-1})$$
The way a language model will work is as follows:
Forward mode
Before we get to training, it's useful to be able to measure our model's performance. In an ideal scenario where the model is performing optimally, we would expect it to assign the highest probabilities to real sentences. To measure performance we will use our trusty cross entropy: $$ H(p^*,p) = - \sum_{x \in X} p^*(x) \log p(x)$$ where $p^*$ is a reference distribution.
In language modelling we implement this as a per-word cross-entropy defined as
$$ H = -\frac{1}{N} \sum_i^N log p(w_i | w_{i-1}, \cdots) $$
Another commonly used metric is Perplexity which is the geometric mean of the inverse probability of a sequence of words. It is defined as
$$ Perplexity(s) = \prod_i^N \sqrt[N]{\frac{1}{P(w_i|w_{i-1},\cdots)}} $$
This can be rewritten using the laws of logarithms into a slightly simpler form: $$ b^{-\frac{1}{N} \sum_i^N \log_b p(w_i | w_{i-1}, \cdots)} $$ where b can be any base you'd like (as long as the same base is used for the logarithm). Notice that the exponent is just our per-word cross-entropy.
The lower the perplexity the better our model is
It can be helpful to remember that the perplexity of a discrete uniform distribution over k events is k.
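In code, perplexity is just the exponentiated per-word cross-entropy; `log_probs` is an assumed tensor of $\log p(w_i \mid w_{i-1},\cdots)$ values:

    import torch

    per_word_ce = -log_probs.mean()          # per-word cross-entropy (natural log here)
    perplexity = torch.exp(per_word_ce)      # lower is better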
To train a model we feed in the words one by one, again using the special start symbol to denote the start of a new sequence. After each time step we project to the output space, as we did during inference, turn that into a probability distribution, and calculate the loss using cross entropy.
Note that what we input at the following time step is not the word predicted at the previous time step, but rather the actual word in the training data. This is known as teacher forcing, and it allows the model to keep training effectively even if it would have made a mistake in the previous time step. Once the entire sentence has been fed into the model, the overall loss is computed as the average of the losses for each word. Now backpropagation can be used to calculate the values needed to perform the update step.
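A hedged sketch of one teacher-forced training step; `model` is assumed to return per-step logits over the vocabulary, and `tokens` is a sentence that starts with the start symbol:

    import torch

    criterion = torch.nn.CrossEntropyLoss()
    inputs, targets = tokens[:-1], tokens[1:]    # feed the true previous word, predict the next
    logits, _ = model(inputs)                    # (T-1, vocab_size)
    loss = criterion(logits, targets)            # average loss over all words in the sentence
    loss.backward()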
This module introduces the concept of masked language modelling, a concept that has been used very effectively in NLP to improve the performance in a variety of tasks.
Recall the language model probability of a sentence we've used in previous sections $$P(s)=P(w_1,\cdots,w_n)=\prod_i P(w_i | w_{i-1},\cdots,w_1)$$
Masked Language Modeling is a related pre-training task: an auxiliary task, different from the final task we're targeting, but which can help us achieve better performance by finding good initial parameters for the model. In a manner of speaking it's similar to a training signal, or even a feature, that is helpful in the final task. By pretraining on masked language modelling before training on our final task it is possible to obtain higher performance than by simply training on the final task alone.
Masked language models take a sequence of words as input, to which they add special symbols to mark the start and end of the sequence, just as in any language model. They also cover up, or mask, some of the original words with special tokens. We then take the word embeddings corresponding to these tokens and feed them into a transformer encoder.
Since transformer encoders have no inherent notion of position of their inputs, and this is useful for language, we will also add to the input some special positional embeddings. This may strike you as a bit of hack, but the fact is in practice it tends to work rather well.
The final task here is to predict what the masked words are. Although this is a strange and artificial task, it is quite difficult. A model that can learn to do this well has learned a great deal about language, as well as some of what we call common sense knowledge. Now when this model is trained to perform some task which we do care about, the model will not only learn to perform the new task, but it will also retain part of the knowledge to perform masked language modelling. This gives the model a significant boost in performance.
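A simplified sketch of the masking step (BERT-style); `token_ids` (a 1-D tensor of word indices) and `MASK_ID` (the id of the special mask token) are assumptions here:

    import torch

    mask = torch.rand(token_ids.shape) < 0.15        # mask ~15% of positions
    labels = torch.where(mask, token_ids, torch.full_like(token_ids, -100))        # -100 = ignored by the loss
    masked_inputs = torch.where(mask, torch.full_like(token_ids, MASK_ID), token_ids)
    # The model is trained to predict `labels` (the original words) at the masked positions.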
Consider a token-level task, such as token classification. Named entity recognition involves identifying which input tokens refer to entities such as persons, dates, locations, and so on. Here we might input a sentence without any masked tokens, and for the outputs we would train the network to perform the correct classification, identifying the entities we're looking for.
Example
Consider a sentence level task where we are interested in learning the global meaning of a sentence, such as sentence classification. Conventionally here we might take the first output of the transformer encoder at the top layer and use that to classify the sentence.
This can also be done to translate sentences
The model will learn to look at the two translations simultaneously in order to better predict what the masked words are. A model pretrained with cross-lingual masked language modelling can be used in the same way as a model with normal masked language modelling, but the real strength of these models comes from cross-lingual transfer. Imagine we took such a model and trained it to perform classification using English training data. Because of the cross-lingual nature of its pre-training task, this model should now be able to perform classification in other languages as well.
Another popular task is cross-lingual natural language inference (NLI). Given two sentences a and b, the task is to decide whether the first implies the second, whether it contradicts it, or whether the two sentences are unrelated.
Interestingly enough this will also work on French data, despite not being trained directly on it.
This class of models, which are all based on pre-trained transformers, have proven revolutionary in the field of NLP.
However, as you can see in the illustration above, as the transformer based models have led to an incredible boost in performance, often even beating the human baseline, they are also getting incredibly large!
A common approach to dealing with incredibly large models is knowledge distillation, which can reduce the size without significantly impacting performance. In this scenario we use two models: a pretrained teacher model and a smaller student model. The teacher model will be much larger and more complex. We feed the input data into both models and obtain two predictions. We then compute two loss terms: a distillation loss, the difference between the student's prediction and the teacher's prediction, and a student loss, the difference between the student's prediction and the target or ground truth.
The trick here is to encourage the student to align its predictions to those of the teacher as well as to the target. Why this works is still a topic of research; it's not fully understood. One idea is that by using "soft predictions" the student model will try to align its probability distribution to that of the teacher.
You can even take this a step further. You can take any unlabelled text and use the teacher to make predictions on it. This can then be used to augment the training for the student.
Let's examine how we might implement such a distillation loss.
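A hedged sketch of one common formulation: soften both distributions with a temperature T, then combine the KL term with the usual cross-entropy on the labels.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        soft_student = F.log_softmax(student_logits / T, dim=-1)
        distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
        student = F.cross_entropy(student_logits, labels)
        return alpha * distill + (1 - alpha) * student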
Reference Links
Guest Lecturer: Ledell Wu - FaceBook Research engineer
Topics:
Embeddings can be seen as a mapping of objects to vectors through a trainable function. Recall that a CNN maps an image into a vector in real space; similarly we can map other things, like text, into a vector. The trainable function is typically a neural net, e.g. an RNN or a transformer for text, or simply a shallow one such as word2vec.
Typically the distance between vectors in the same space can also be used as a similarity function between objects.
Representing words by their context has long been the goal of NLP
Some works that laid the foundation of using word vectors
Positive v Negative training sample (example)
What the illustration below suggests is using a margin loss, similar to those in SVMs. In this case we want to maximize the distance between the positive and negative examples.
Word2vec: SkipGram model
Consider the following example:
Our next step would be to move onto the word "banking" and repeat the process.
More formally the SkipGram can be described as follows
So How do we define $P(w_{t+j}|w_t;\theta)$?
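In word2vec this is defined as a softmax over dot products between the center-word vector $v_{w_t}$ and the context (output) vectors $u_w$:
$$P(w_{t+j} \mid w_t; \theta) = \frac{\exp(u_{w_{t+j}}^\top v_{w_t})}{\sum_{w \in V} \exp(u_{w}^\top v_{w_t})}$$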
Problem: the size of the vocabulary can get very large! This makes the softmax expensive to compute, so we resort to
There are other forms of word embeddings.
In practice many people use embeddings as features in downstream ML models, or as initialisation of neural net models for NLP such as RNNs and LSTMs. More recent embeddings such as ELMo and BERT focus on context.
There are two ways to evaluate a word embedding
Consider the task of word analogy, which falls under the intrinsic umbrella. In this task we have a pair of words, man and woman, and also the word king. We might want to know: what word is to king as man is to woman?
More formally
We observe the above property. We evaluate word vectors by how well the cosine distance after addition captures intuitive semantic and syntactic analogy questions. If you find the vector that satisfies the formula above, that vector will be queen. So we can find a word analogy by simply doing a bit of vector addition and subtraction.
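A small sketch of this in code, assuming `emb` is a (vocab_size, d) embedding matrix and `idx` maps words to rows:

    import torch
    import torch.nn.functional as F

    query = emb[idx["king"]] - emb[idx["man"]] + emb[idx["woman"]]
    sims = F.cosine_similarity(query.unsqueeze(0), emb)   # cosine similarity to every word
    best = sims.argsort(descending=True)                  # top hit should be "queen" (excluding the query words)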
Graphs are networks of entities and connections between them. They're used to represent anything from protein structures or map data to networks of human interactions. Multi-relation graphs are typically studied in the context of knowledge graphs.
In fact, embeddings such as word2vec, from the previous section, are also learning the embedding of a graph node, where the node is the word.
You can also think of this as similar to a matrix completion
 | Item1 | Item2 | ... | Item4 |
---|---|---|---|---|
Person1 | - | + | | + |
Person2 | + | ? | | - |
... | | | | |
PersonN | - | | | |

The matrix encodes relations between people and items, where "items" could be people, movies, pages, articles, etc. The task is to predict the missing entries, e.g. whether someone will like an item, or whether a word will follow a word.
Embedding: A Learned map from entities to vectors of numbers that encode similarity
There is also Path and Structure embeddings that we will not cover.
Graph Embeddings:
Given
Margin Loss b/w the score for an edge $f(e)$ and a negative sampled edge $f(e')$
The score for an edge is a similarity (e.g. dot product) between the source embedding and a transformed version of the destination embedding
Negative samples are constructed by taking a real edge and replacing the source or destination with a random node.
In PyTorch-BigGraph there are two main types of multi-relation graphs
Let's now talk about applications of NLP language models
TagSpace
PageSpace
Example:
Here's a high level look of a Word2vec graph
As you can see, different behaviours can be used to inform different target tasks. So why not embed everything into a vector space, i.e. a "world2vec" instead of just word2vec? This would lead to an incredibly large and powerful model.
The power of universal Behaviour features
Once you've created the embedded graph you can perform many tasks
A tSNE plot is a great way to visualize high dimensional data, which in fact uses a form of embedding.
As these types of graphs get larger and larger it becomes more imperative that we can scale the algorithm to handle them. The key idea here is to partition the graph.
Guest Lecturer: Arthur Szlam, Facebook research Scientist
Attention is All You Need
BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding
The Illustrated Transformer
Idea: Softmax function as a mechanism for Neural Attention
Recall Our friend the softmax function
In the following images we see an illustration. On the left are our input numbers (the $x_i$'s); on the right is the output of the softmax. Notice that the scaling has changed on the right: it has a max of 1.0, which is not true for the left.
Softmax and Index Selection
Selecting a Vector from a Set (Non-Differentiably)
Differentiably Selecting a Vector from a Set
When Softmax is applied at the final layer of a MLP:
In Softmax attention:
Here we talk about attention as a general neural module. Recall that Attention is the distribution over inputs that depends on computational state and the inputs themselves.
Suppose we gave our model an image with a tent on the left side and a table on the right.
We would expect our model to be human-like in its "computational state"
Two Forms of Attention
The most standard form of attention in neural networks is done using softmax.
Alignment in machine translation: For each word in target, get a distribution over words in the source.
Above is a simple example of a translation from English to German. One might suspect a loose, or soft, type of correspondence; this is the attention for a translation model.
Attention as a neural network layer is a surprisingly recent development.
Let's take a look at an example
An important property of softmax attention as a layer, or attention as a layer in general, is that its representational power grows with the size of the input. Contrast this with, say, a fully connected layer, which can only take a fixed-size vector as input, or a classical RNN, in which the representational size remains fixed, independent of the length of the sequence you feed into it. The key difference is that the softmax attention module has a hidden state that grows with the input, and that is its secret.
Multi-Layer Soft Attention Model
Observe
The key difference here is that in each time step, we are computing the next controller state, and we recursively make use of the current state to do this. ( This is implemented in the equation at the bottom right )
If inputs have an underlying geometry you can include geometric information in the weighted bags.
Important example: For sequential data, use position encoding, giving the "location" in the sequence of that input
The transformer is a multi-layer attention model that is currently state of the art (SOTA) in almost all NLP tasks, as well as many other domains (such as audio).
It has a superior performance compared to previous attention architectures due to
Recall how the softmax attention layer works:
Self Attention uses a controller state for every input (Rather than share it like above)
Multi-Head Attention
Multi-headed attention combines multiple attention 'heads' trained in the same way on the same data, but with different weight matrices, thus yielding different values.
Each of the L attention heads yields values for each token; these values are then multiplied by trained parameters, added, and recombined using a fully connected layer.
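A minimal sketch of one scaled dot-product self-attention head; the weight matrices here are placeholders for the trained parameters:

    import torch
    import torch.nn.functional as F

    def attention_head(x, Wq, Wk, Wv):
        # x: (seq_len, d); Wq, Wk, Wv: this head's weight matrices
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        weights = F.softmax(Q @ K.t() / K.shape[-1] ** 0.5, dim=-1)  # softmax over the inputs
        return weights @ V

    # Multi-head attention runs L such heads (with different weights), concatenates their
    # outputs per token, and mixes them with a final fully connected layer.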
While transformers operate on sets of vectors (and even graphs), they began in the setting of text processing.
Some standard specialization of transformers for text include:
Guest Lecturer : James Cross : Facebook, Applied Research - Machine Translation
Why is this such a hard problem?
Recall that translation is often modeled as a conditional language task:
The probability of each output token is estimated separately (left-to-right), based on
$\arg\max_t p(t|s)$ is intractable (i.e. infeasible to compute exactly, since the space of possible outputs grows exponentially with length)
Beam Search
3 High Level architectures for translations
RNN (includes LSTMs)
Convolutional
Transformers
seq2seq models for SOTA translation can be very large. While training can be expensive, requiring GPUs, the general intention is to deploy them for inference and use them many, many times. This is still an expensive task because, as you may recall from the previous section, the sequential nature requires a token prediction at each time step.
Whats expensive?
Strategies
Method 1 to overcome these difficulties: Byte-Pair Encoding
Simple example of Byte-Pair Encoding:

    abcdeababce    input string
    XcdeXXce       S1: replace the most frequent pair of letters with X (= ab)
    YdeXYe         S2: replace the next most frequent pair of letters with Y (= Xc)
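A toy sketch of those two merge steps in code:

    from collections import Counter

    def most_frequent_pair(s):
        return Counter(zip(s, s[1:])).most_common(1)[0][0]

    s = "abcdeababce"
    pair = most_frequent_pair(s)        # ('a', 'b')
    s = s.replace("".join(pair), "X")   # -> "XcdeXXce"
    pair = most_frequent_pair(s)        # ('X', 'c')
    s = s.replace("".join(pair), "Y")   # -> "YdeXYe"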
Inference is faster on specialized hardware (ie GPUs TPUs etc)
Batched vs OnDemand: Average efficiency can be increased by translating multiple inputs at once.
Guest Lecturer
Now let me talk a little bit about how we empower the experience of translation of user content.
Traditional machine translation models are based on classical statistical methods and work well on standardized languages like English, French, and Spanish. Since 2016, however, there has been a shift to Neural Machine Translation (NMT) models, and these require even more data than the traditional statistical models. They perform better, but also consume far more resources in order to surpass the statistical benchmarks. Again, they do well on high-resource languages; for low-resource languages things become much more difficult, due in part to the lack of language data. Consider how much more English data there is on the internet versus, say, Njerep. (Njerep is spoken by approximately 4 people in Nigeria and faces extinction in the very near future.)
In order to evaluate properly we needed very strong benchmark datasets that allow us to advance the state of the art. To do so, we worked on having datasets that are representative of the real low-resource problem. We did this for Nepali-English and Sinhala-English, both very different languages from English. We called this the FLORES dataset, and we'll now explain what it consisted of.
One technique that was employed to great success was multilingual training. This is when you use a language with high similarity to the target language; e.g. Belarusian and Ukrainian are highly similar in alphabet, vocabulary, and grammar. Ideally one of these languages should be resource-rich. This supports transfer learning.
Another technique that has been used quite extensively is back-translation. This semi-supervised technique allows us to leverage large quantities of monolingual data, which we then use to train a larger model. This works very well for languages like German and English, and has helped to bring quality to a level that has won several translation competitions.
Suppose you have some small amount of training data, say from German to English. You train a small German-to-English model, which you then use to annotate large quantities of German data into fake (translated) English data. Now you take these pairs, reverse the order, and use them in conjunction with your previous data, so you can train a better English-to-German system.
Things can go horribly wrong when translations are bad!!
One idea proposed by the facebook team is to use a quality scale
In an ideal world a model would aim for the top 1-2 levels of perfection; however, the world is not ideal. Another way to think of this is in terms of a negative scale, where the worst choices are labelled "catastrophic".
Catastrophic Examples
An active area of research is Catastrophe prevention.
Let's look at different ways of building Automated Speech Recognition Systems
Overview:
Inputs and Outputs
Input : Audio Signal
Output : Transcript
We begin by taking the audio signal and breaking it up into contiguous, overlapping frames, which we will call $x_1,x_2,...,x_T$. Each of these frames should be exactly the same length, which needs to be defined at processing time. Next we use a Fourier transform to convert each frame into a frequency-domain representation; we can think of any audio as the summation of many different frequencies. At this point our signal has been transformed from the time domain into the frequency domain.
Now we apply what are called mel filters. Mel filters are an approximation, or rather a simulation, of the way a human perceives sound: humans are better at detecting differences in lower frequencies than in higher frequencies. They're applied in a convolutional way: a bank of filters is applied to the signal to produce a collection of mel features, where each represents a different mel frequency, to which we then apply a log function.
The amount of overlap between frames is an architectural choice.
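A hedged sketch of this front end using torchaudio; the file name and settings are placeholders (e.g. 400/160 samples correspond to 25 ms frames with a 10 ms hop at 16 kHz):

    import torchaudio

    waveform, sample_rate = torchaudio.load("utterance.wav")   # hypothetical file
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
    )(waveform)
    log_mel = mel.clamp(min=1e-10).log()    # log of the mel-filter energies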
There are two main paradigms for ASR Systems.
Suppose we have an audio signal
(NB the choice of LSTM is purely for illustration, you can pick anything you'd like)
Other modules/models that can be found in ASR systems are
We would now pass all three model outputs (Acoustic,Language,Pronunciation) into a decoder to produce the final result which goes into the transcript.
#### 3.16.06 Non Modularized (End2End) ASR Systems
#### 3.16.07 RNN-T Training
#### 3.16.08 RNN-T Decoding: How to utilize RNN-T Model at Test Time
#### 3.16.09 RNN-T Greedy Decoding
#### 3.16.10 RNN-T Beam Search Decoding
#### 3.16.11 RNN-T Personalization
#### 3.16.12 ASR Different Approaches
In this module, we'll talk about advanced topics, that is we'll show the power of deep learning and moving beyond just the standard supervised machine learning setting that we've talked about thus far. For example, we'll try to address the problem of how to deal with unlabeled data, which is much easier to collect. And where we don't have a lot of labeled data, that is annotations that humans have made or given for each piece of data that we have. We'll also try to cover decision making tasks, which moves beyond just making predictions on pieces of data. But actually allows the neural network to make the decisions in the form of actions that actually affect the world.
Thus far, we've only dealt with supervised learning, where for each piece of data that the neural network processes we have a ground truth annotation or label, that is, what the answer should be. This allows the neural network to update its parameters. This can be quite limiting in some situations. For example, it's much easier to collect unlabeled data, that is, collections of data where we don't know what's in it. Maybe we even have a combination: a small amount of labeled data, but a very large amount of unlabeled data, for example scraped across the internet. Similarly, we may also try a decision-making task, where our predictions are actually actions in the world.
For example, playing a game such as chess. In a game we may make many such predictions over the course of the entire game, and we don't really have labels for each point in the game, of what we should have done. All we know at the end is whether we won or lost. This is what's called sparse supervision, where you don't really have an answer for every point in time, or for every piece of data that the neural network processes. It's also a slightly different type of supervision: rather than the answer of what we should have done, what predictions we should have made, all we have is a reward, whether we've done well or not.
Turns out that really, we can use a lot of the fundamental building blocks that we've used thus far. Except change some aspects of this, such as incorporating different types of loss functions for example. Let's look at these in a bit more detail.
One thing that these classical methods don't really deal with, is the feature learning part. As we've learned, the key thing that makes deep learning successful is that we're able to automatically optimize the feature space given the inputs. And specifically in supervised learning, we're optimizing the features so that we can make the best predictions on the labels. However, you can try to think about ways to learn features using just unlabeled data.
Reinforcement learning on the other hand, deals with evaluative feedback in the form of reward. So here the supervision is no longer for each piece of data, what should the predictions have been? Really, you have this cycle where you have an environment such as a chessboard, you make some action that changes the environment. And then you could go through the cycle many, many times. And only at the end you get some sort of reward that tells you whether you've won or lost. In some cases you can get some reward in the middle too. For example, playing an Atari game you get some score throughout the game.
In the real world it's difficult to get labeled data, because it's both costly and time consuming to pay people to annotate them. However, it's much easier to get unlabeled data. For example, we can just take videos across the world or crawl the internet for images and pieces of text and other data that already exists. And so there's a range of machine learning settings that people have developed and looked at that goes beyond supervised learning.
Different Low-Labeled Settings in DL Research
Setting | Source | Target | Shift Type |
---|---|---|---|
Semi-Supervised | Single labeled | Single unlabeled | None |
Domain Adaptation | Single labeled | Single unlabeled | Non-semantic |
Domain Generalization | Multiple labeled | Unknown | Non-semantic |
Cross-Task Transfer | Single labeled | Single unlabeled | Semantic |
Few-Shot Learning | Single labeled | Single few-labeled | Semantic |
Un/Self-Supervised | Single unlabeled | Many labeled | Both/Task |
In few-shot learning, we have a base set of classes with many categories and lots of data for each of those categories. We're trying to understand how to learn really good feature representations on that larger set of data. But then, during test time, we want to classify a whole new set of categories that don't exist in the base set, and for these new categories we only have, say, one to five examples per category.
In this section, we will introduce reinforcement learning where the task is to make a sequence of decisions or predictions similar to what you're seeing in sequence models. However, we will not receive supervision in the form of the correct decision or a prediction which should have been made. Instead, we will only receive evaluative feedback in the form of reward for the decision or our prediction which was made.
Definition: Reinforcement learning can be defined in one sentence as sequential decision making in an environment with evaluative feedback.
Let's break this definition into its component parts.
Environment
Let's break this definition down by first talking about what we mean by an environment. Imagine we have an agent who can take actions in an environment, and the environment produces a state which the agent can perceive, in addition to giving the agent a reward for the actions it did take. This environment may be unknown and very complex.
Think of the worst case as a real robot deployed in the world, in which case the environment is the real world. The task of this agent is to learn a policy mapping states of the environment to actions it can take to affect the environment. The objective of the agent is to maximize the reward it gets from the environment in the long run.
Evaluative feedback
Evaluative feedback means that the agent picks actions and receives rewards, which can be positive or negative, only for the actions that it did take and not for the actions that it did not take. This means that the agent will never know, or have supervision for, the correct or best action that it should have taken in a particular state.
Sequential decisions
Unlike supervised learning, sequential decision making means that the agent needs to make a sequence of actions, passing through a sequence of states, in order to complete the task. As a result, the reward may be delayed, possibly arriving only at the end of the task, which means the agent needs to optimize for rewards it may only receive at a much later stage.
Challenges
With these definitions in place, let us look at the signature challenges of reinforcement learning.
Here's the protocol for how the agent interacts with the environment.
At every time step t
In the environment at time step t
RL has also been applied to games like Go, where the objective is to defeat the opponent. The reward is plus one for winning the game and zero otherwise, making it an exceptionally challenging long-term planning task.
Markov decision processes, MDPs, form the underlying framework of Reinforcement learning. They can be defined as a tuple of five values:
We can now define an Interaction Trajectory as a sequence of state/action/rewards
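For reference, a standard way to write the five-tuple and a trajectory, using the notation that appears later in this section (this is the textbook formulation, not copied from the lecture slide):

$\large \mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{T}, \gamma)$, that is, the state space, the action space, the reward distribution $r \sim \mathcal{R}(\cdot|s,a)$, the transition distribution $s' \sim p(\cdot|s,a)$, and the discount factor.

$\large \tau = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, \cdots)$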
Markov Property
Markov processes are an important class of stochastic processes. The Markov property means that the evolution of the process in the future depends only on the present state and not on past history; the process does not remember the past if the present state is given. Hence, a Markov process is said to be memoryless.
MDPs are characterized by the Markov property, which, as seen in the definition of the transition probability, implies that the distribution of possible next states given state s and action a does not depend on any of the previous states or actions in the trajectory.
In reinforcement learning, we assume an underlying MDP whose transition probability and reward distributions are unknown. Instead, only samples from these distributions are observed by the agent interacting with the environment. Evaluative feedback comes into play because of this lack of knowledge of the reward distribution, which requires a trial-and-error approach to discover rewards.
For our discussion, however, we will assume known reward and transition functions. This will allow us to look at algorithms for solving MDPs, which comes down to finding the best policy, loosely defined as the best action for each possible state s in the set of all possible states.
Let's take a look at a simple environment (see below), where we can intuitively figure out the best policy. The 2D environment below is a grid world with an agent indicated by a blue triangle. The state of the environment is the 2D coordinates of the agent, which is currently at (2,3). The actions the agent can take are to move one cell north, east, south, or west. The rewards are either +1 or -1 at the absorbing states shown in the figure on the right.
With this setup, it is easy to see that the optimal policy is to head east twice, in order to reach the +1 absorbing state. However, let's assume there is some noise in how the agent transitions to the next state, in the form of a 20% chance of drifting one cell to the left or right of the direction of motion. Now things get a bit more complicated: the policy has to prescribe an action for every state the agent could land in, since it is not guaranteed to always move east. One way to describe such a policy would be a 2D matrix of the same shape as the grid, where each matrix element prescribes an action to take if the agent is at that cell. This leads us to the formal definition of a policy that we will use for solving MDPs.
Formally a policy is a mapping from states to actions
What defines a good policy?
We can now define the optimal policy as
$\large \pi^* = argmax_{\pi} \mathbb{E} \left[ \sum_{t \geq 0} \gamma^t r_t | \pi \right] $
Expected Value here is taken over
Note the discount term raised to the power of t. This implies that immediate rewards are preferred over future rewards.
We now need to introduce one last component: the value function. The value function is a prediction of the discounted sum of future rewards.
For a policy that produces a trajectory sample $(s_0,a_0,s_1,a_1, \cdots )$
The V-function of the policy at state s, is the expected discounted sum of rewards from state s:
$\large V^{\pi}(s) = \mathbb{E} \left[ \sum_{t \geq 0} \gamma^t r_t | \pi, s_0 = s \right] $
The Q-Function of the policy at state s and action a is the expected cumulative reward upon taking action a in state s (and following policy thereafter):
$\large Q^{\pi}(s,a) = \mathbb{E} \left[ \sum_{t \geq 0} \gamma^t r_t | \pi, s_0 = s, a_0 = a \right] $
In the previous section (see above), we defined the V and Q functions for an arbitrary policy $\pi$. With a simple sleight of hand, we can define the Q and V functions for the optimal policy by simply using the * notation to denote the optimal policy.
ie
$\large V^{*}(s) = \mathbb{E} \left[ \sum_{t \geq 0} \gamma^t r_t | \pi^*, s_0 = s \right] $
Similarly
$\large Q^{*}(s,a) = \mathbb{E} \left[ \sum_{t \geq 0} \gamma^t r_t | \pi^*, s_0 = s, a_0 = a \right] $
Now these two equations are highly interconnected via the optimal policy, and satisfy the following identities, with subtle yet important differences in interpretation.
Taking a closer look at the definition of the optimal Q function, we will now try to rewrite it recursively in terms of the optimal value function at future states. Let's define the term return as the sum of discounted future rewards.
At time step t=0 our expected return is
$\large Q^{*}(s,a) = \mathbb{E} \left[ \sum_{t \geq 0} \gamma^t r(s_t,a_t) | s_0 = s, a_0 = a \right] $
Where the expected value is taken over $a_t \sim \pi^*(\cdot|s_t) \text{ and } s_{t+1}\sim p(\cdot|s_t,a_t)$
If we look ahead to time step t=1, we get a reward (from t=0) + the discounted return from (t=1)
$\large = \gamma^0 r(s,a) + \mathbb{E}_{s' \sim p(\cdot | s,a)} \left[ \gamma\, \mathbb{E}_{a_t \sim \pi^*(\cdot|s_t),\; s_{t+1}\sim p(\cdot|s_t,a_t)} \left[ \sum_{t \geq 1} \gamma^{t-1} r(s_t,a_t) \,|\, s_1 = s' \right] \right] $
( Now imagine the next few steps because typing these equations in Latex is rather tedious )
We can now put these together to derive the Bellman optimality equations.
$\large
\begin{split}
Q^*(s,a) & = \mathbb{E}_{s' \sim p(s'|s,a)} [r(s,a) + \gamma \color{red}{V^*(s')}] \\
& = \sum_{s'} p(s'|s,a)[r(s,a) + \gamma \color{red}{V^*(s')}] \\
& = \sum_{s'} p(s'|s,a)[r(s,a) + \gamma \color{red}{max_{a'} Q^*(s',a')}]
\end{split}
$
Similarly we can also define $V^*$ as follows
$\large V^*(s) = max_a \sum_{s'} p(s'|s,a) [r(s,a) + \gamma \color{red}{V^*(s')} ]$
And we can use it to define an algorithm as follows
Intuitively, each application of this recursive equation can be thought of as propagating the information of the return at a state s to the return values of its neighboring states under the transition probability. This update will produce a sequence of vectors V0, V1, and so on, until we get V star upon convergence.
Given the nature of the update step, each iteration of this algorithm has time complexity $O(S^2 A)$, where S is the cardinality of the state space and A is the cardinality of the action space.
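A minimal sketch of iterating the Bellman backup above, assuming the transition probabilities and rewards are known and stored as NumPy arrays `p[s, a, s']` and `r[s, a]` (the array names are illustrative, not from the lecture):

```python
import numpy as np

def bellman_backup_iteration(p, r, gamma=0.9, tol=1e-6, max_iters=1000):
    """Iterate V_{i+1}(s) = max_a sum_s' p(s'|s,a) [r(s,a) + gamma * V_i(s')].

    p : (S, A, S) array of transition probabilities p(s'|s,a)
    r : (S, A) array of rewards r(s,a)
    """
    S, A, _ = p.shape
    V = np.zeros(S)
    for _ in range(max_iters):
        # Q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        Q = r + gamma * np.einsum("sat,t->sa", p, V)
        V_new = Q.max(axis=1)                 # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:   # stop once the values stop changing
            V = V_new
            break
        V = V_new
    return V

# Each iteration touches every (s, a, s') triple, which is where the O(S^2 A) cost comes from.
```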
In a similar fashion we can derive the Q function recursively as follows
$\large Q^{i+1}(s,a) = \sum_{s'} p(s'|s,a) [r(s,a) + \gamma \color{red}{max_{a'} Q^i(s',a')} ]$
This is essentially the same update as before, except that the max over actions now appears inside, over the next action $a'$ (and the update itself loops over actions a as well as states).
So all this begs the question of how to determine the optimal policy??
Algo 1: Policy Iteration
Begin with $\pi_0$ and iteratively refine it
$\large \pi_0 \rightarrow \pi_1 \rightarrow \cdots \rightarrow \pi_i \rightarrow \cdots \rightarrow \pi^*$
2 Step Process
$\pi_i$ may seem a bit out of place but it turns out that it converges to $\pi^*$ much faster than $V^{\pi_i}$ does to $V^{\pi^*}$
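A minimal sketch of the two-step loop (policy evaluation, then greedy policy improvement), again assuming known `p` and `r` arrays as in the earlier sketch:

```python
import numpy as np

def policy_iteration(p, r, gamma=0.9, eval_iters=100):
    """Alternate policy evaluation and greedy improvement until the policy stops changing."""
    S, A, _ = p.shape
    pi = np.zeros(S, dtype=int)             # start from an arbitrary (all-zeros) policy
    while True:
        # Step 1, policy evaluation: iterate V(s) <- r(s,pi(s)) + gamma * sum_s' p(s'|s,pi(s)) V(s')
        V = np.zeros(S)
        for _ in range(eval_iters):
            V = r[np.arange(S), pi] + gamma * np.einsum("st,t->s", p[np.arange(S), pi], V)
        # Step 2, policy improvement: act greedily with respect to the evaluated V
        Q = r + gamma * np.einsum("sat,t->sa", p, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):      # converged: the policy is stable
            return pi, V
        pi = pi_new
```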
Another possible algorithm is Value Iteration, which we leave to the reader to research.
Deep Q-learning assumes a parametrized Q-function that is optimized to match the optimal Q-function, using a dataset of (state, action, next state, reward) tuples as shown below. The simplest example of such a parametrized network is a linear Q-network with one weight vector and a bias. Alternatively, the Q-network can also be a deep neural network.
Intuition: to learn a parametrized Q-Function from data $\{ (s,a,s',r)_i \}_{i=1}^N$
Examples
CASE Study
Updating the Q-function is an application of the Bellman equation seen before.
During training, we can use a single Q-network to predict the Q-values for the current state and action (shown on the left), and for the next state and next actions (shown in blue on the right). Instead of having a for loop over all states to update the Q-network, as was done in Q-iteration, we introduce a regression objective that minimizes this mean squared error.
Loss for a single point
Intuitively, this will attempt to make the predicted Q-values in red, match the target Q-values on the right. However, in practice, it has been observed that using a single Q-network, makes the loss minimization unstable.
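As a rough PyTorch-style sketch of this regression objective for a batch of transitions (s, a, s', r), using a separate target network for the stability reasons discussed just below (all names here are illustrative, not from the lecture):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """MSE between Q(s,a) and the Bellman target r + gamma * max_a' Q_target(s', a')."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q-value of the action actually taken
    with torch.no_grad():                                  # the target is treated as a constant
        q_next = target_net(s_next).max(dim=1).values      # max over next actions
        target = r + gamma * (1.0 - done) * q_next         # no bootstrapping at terminal states
    return F.mse_loss(q_sa, target)
```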
For stability purposes you can
In Practice
Assuming a fixed dataset, the Avg MSE Loss can be optimized using the Fitted Q-Iteration Algorithm.
High Level look at gathering data
What should $\pi_{gather}$ be?
It should be noted though that there are potentially large pitfalls in this approach. If samples are correlated, then this leads to high variance gradients, which leads to inefficient learning.
This can be remedied by experience replay.
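A minimal sketch of an experience replay buffer: transitions are stored as they are gathered and sampled uniformly at random later, which breaks the correlation between consecutive samples (the class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions with uniform random sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform sampling decorrelates the batch
        return list(zip(*batch))                         # tuple of lists: states, actions, ...

    def __len__(self):
        return len(self.buffer)
```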
Now that we have all the needed pieces, we can fully understand and implement the original Deep Q-Learning Algorithm.
In this lecture, we will study a family of policy-based RL algorithms, which are based on gradient updates of parameterized policies. Among the types of methods used in RL are (1) value-based methods that learn Q-functions, and (2) model-based methods that learn transition and reward functions. Both methods ultimately provide a policy that is used to perform the task with high reward.
Policy based methods on the other hand directly parameterize a policy and optimize it to maximize returns.
Our class of Policies is parametrized by the family of parameters given by $\theta$, which can be the parameters of a linear transformation, or deep network, or other possibilities.
Formally
Going forward we will adjust our notation to properly convey our goal
To be clear this is not a change in the definition. It's simply making explicit that the optimal parameters fully describe the optimal policy
Let's now do a walkthrough of a policy based algorithm
Step 1: Data Gathering
We begin by defining a finite trajectory tau as
The probability of tau under a policy $\pi_{\theta}$, denoted $p_{\theta}(\tau)$, is given by
$\begin{split}
\pi_{\theta}(\tau) & = p_{\theta}(\tau) \\
& = p_{\theta}(s_0,a_0,\cdots,s_T,a_T) \\
& = p(s_0) \prod_{t=0}^T p_{\theta}(a_t|s_t) \cdot p(s_{t+1}|s_t,a_t)
\end{split}$
Further, we will write our objective as finding the parameters theta that maximizes the expected sum of rewards over trajectories tau.
$\large \text{argmax}_{\theta}\, \mathbb{E}_{\tau \sim p_{\theta}(\tau)} [\mathcal{R}(\tau)]$
Now we can gather data in a much simpler form than before
In the next few slides, when we formally derive our policy gradient expression, we will see a similar expectation that will again use the same batch of trajectories to estimate the gradient update.
$ \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T r(s_t^i, a_t^i) $
Let's now derive the policy gradient
$\large
\begin{split}
\nabla_{\theta} J(\theta) & = \nabla_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)} [\mathcal{R}(\tau)] \\
& = \nabla_{\theta} \int \pi_{\theta}(\tau) \mathcal{R}(\tau) d\tau \\
& = \int \nabla_{\theta} \pi_{\theta}(\tau) \mathcal{R}(\tau) d\tau \\
& = \int \nabla_{\theta} \pi_{\theta}(\tau) (\frac{\pi_{\theta}(\tau)}{\pi_{\theta}(\tau)}) \mathcal{R}(\tau) d\tau \\
& = \int \pi_{\theta}(\tau) \nabla_{\theta} \text{log}\pi_{\theta}(\tau) \mathcal{R}(\tau) d\tau \\
& = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} [\nabla_{\theta} \text{log}\pi_{\theta}(\tau) \mathcal{R}(\tau)]
\end{split}
$
Note that we used the identity
$\large \nabla_{\theta} \text{log}\pi_{\theta}(\tau) = (\frac{\nabla_{\theta} \pi_{\theta}(\tau)}{\pi_{\theta}(\tau)})$
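Putting the derivation together, here is a minimal PyTorch-style sketch of the resulting REINFORCE-style update. It uses the standard fact that $\nabla_{\theta} \text{log}\,\pi_{\theta}(\tau) = \sum_t \nabla_{\theta} \text{log}\,\pi_{\theta}(a_t|s_t)$, since the dynamics terms do not depend on $\theta$; the function names are illustrative:

```python
import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    """Monte-Carlo policy gradient surrogate loss for one sampled trajectory.

    log_probs : list of log pi_theta(a_t | s_t) tensors collected while acting
    rewards   : list of scalar rewards r_t
    """
    # Discounted return of the whole trajectory, R(tau) = sum_t gamma^t r_t
    R = sum((gamma ** t) * r for t, r in enumerate(rewards))
    # Maximizing E[log pi_theta(tau) * R(tau)] is the same as minimizing its negative
    return -torch.stack(log_probs).sum() * R
```

Averaging this loss over N sampled trajectories gives the sample estimate of the gradient shown earlier.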
The key idea behind these variations is that subtracting some baseline b (shown in red) that does not depend on the actions preserves the mean of the gradient expectation while possibly reducing its variance for specific choices of b. The proof of why the mean is not affected by such a baseline, and what the best choice of baseline is, is left as an exercise. Different choices of this baseline have resulted in two important variants of the policy gradient algorithm. The first, known as the actor-critic algorithm, replaces the rewards with the Q-function of the policy, learned from data. The second, known as advantage actor-critic, substitutes the reward with the advantage of the policy, defined as the Q-function minus the V-function.
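For reference, the baseline-subtracted form of the estimator can be written (in standard notation) as

$\large \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t \geq 0} \nabla_{\theta} \text{log}\, \pi_{\theta}(a_t^i|s_t^i) \left( \mathcal{R}(\tau_i) - \color{red}{b} \right)$

with the actor-critic variant replacing $\mathcal{R}$ with $Q^{\pi}(s_t,a_t)$ and the advantage actor-critic variant using $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$.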
In this lesson, we'll go beyond supervised learning and talk about various other machine learning settings. For example, we'll look at low-labeled settings where we have only a few labeled examples, possibly with a larger amount of unlabeled examples. We'll also talk about pure unsupervised or self-supervised learning, where we'll try to learn effective feature representations using unlabeled data only.
Thus far, we've really looked at supervised learning, where we have a large labeled dataset X and associated labels Y, typically coming from human annotation. In deep learning, we want to learn a function that maps from X to Y, typically as a conditional probability: the probability of the label given the data. However, annotation of data is costly, both in time and money, because you have to pay people to manually annotate each example, and this really adds up if you have hundreds of thousands, or millions, of examples. So we may want to consider machine learning settings where we assume we have much less labeled data.
We will be looking at three main types, but there are many more
Note that in traditional machine learning, and to some degree in deep learning, there's also purely unsupervised learning where we can perform tasks, such as clustering or density estimation. We'll talk about the distinction between self-supervised and unsupervised learning later.
For modeling the P(x), there are deep generative models. For comparing and grouping, there are metric learning, or clustering approaches. And obviously, representation learning is pretty much what deep learning is.
Some Common ideas are shown below.
References
It is often much cheaper, in cost and time, to get large-scale unlabeled datasets. This begs the question: can we overcome having only a small set of labeled data by using unlabeled data, which is often much cheaper to get at large scale (for example, from the internet)? Further, can this bring performance close to that of a fully labeled scenario?
A very simple idea is to train a model on the labeled data, make predictions on the unlabeled data, and then add those predictions back into the training data. This can be repeated as often as necessary. Another approach is to combine unlabeled and labeled data and perform co-training, making predictions across multiple views; in this method we generally use data augmentation techniques.
A recent algorithm called FixMatch (https://arxiv.org/abs/2001.07685) has proven extremely effective in the semi-supervised learning context. There are several ways to instantiate this idea. One is two-stage: train the supervised model on the labeled data, apply that model to the unlabeled data, obtain pseudo-labels for the confident predictions, feed those back into the training set, and retrain the supervised model.
The beauty of deep learning is that we don't actually have to do this in multiple stages; the entire pipeline can be trained end to end. Specifically, each mini-batch consists of both labeled and unlabeled examples. The labeled part goes through the traditional pipeline: extract features, make predictions, and use a cross-entropy loss against the ground truth labels annotated by humans. For the unlabeled examples, we apply two types of augmentation, weak and strong, and again do feature extraction and prediction. The predictions obtained on the weakly augmented unlabeled data are not fed into their own loss function, but are used as ground truth (pseudo-labels) for the loss on the strongly augmented data.
At the bottom right we can see what the loss function looks like for the unlabeled part. We have a batch size B for the labeled part, for example 64 examples, and a multiplication factor, for example 7, so the unlabeled mini-batch is actually much larger than the labeled part. Finally, we sum up the loss terms and average. The unlabeled loss is the cross-entropy H between the pseudo-label $\hat{q}_b$ and the prediction $q_b$ from the strongly augmented data, multiplied by an indicator function (which looks like a 1) that only applies the loss to predictions whose confidence exceeds a threshold.
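A rough sketch of this unlabeled-part loss in PyTorch (a simplification of the FixMatch objective; `model`, `weak_aug`, and `strong_aug` are assumed callables, not code from the paper):

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_aug, strong_aug, unlabeled_x, threshold=0.95):
    """Cross-entropy between confident pseudo-labels (from weak augmentation)
    and predictions on the strongly augmented version of the same images."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(unlabeled_x)), dim=1)   # predictions on weak augmentation
        conf, pseudo_labels = probs.max(dim=1)                   # q_hat_b: confidence and hard label
        mask = (conf >= threshold).float()                       # the indicator 1(confidence >= tau)
    logits_strong = model(strong_aug(unlabeled_x))               # q_b: predictions on strong augmentation
    loss = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (mask * loss).mean()                                  # averaged over the unlabeled mini-batch
```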
Some details to keep in mind
While FixMatch is one of the most recent methods, there are others.
Reference Links
In this lesson we discuss few-shot learning, where you may only have a few examples per category. Furthermore, you may (or should) have a larger model from which you can transfer some learning, even if it was built for some other purpose. The goal of few-shot learning is to learn a new capability from a very small number of examples.
Most of the literature on few-shot learning studies the problem as N-way K-shot classification, where the goal is to discriminate between N classes with K examples of each. For example, 10-way 5-shot implies 10 classes with 5 samples apiece.
Some approaches include:
Interestingly a cosine layer (rather than a simple fully connected layer) has been shown to be beneficial in many situations but not all.
There are some drawbacks
One positive note is that it simulates what is seen during test time. This leads us to the idea of meta-learning. In the meta-learning framework, we learn how to learn to classify, given a set of training tasks, and evaluate using a set of test tasks. In other words, we use one set of classification problems to help solve other, unrelated sets.
The key factor here is that the classes shown are different than anything seen before.
This begs the question: What can we learn from these meta learning tasks, and what models can be constructed and conditioned on a support set $M(\cdot | S)$.
We can think of this as parametrizing a learning algorithm.
There are two main approaches to defining a meta-learner:
For example if we wanted a model to learn gradient descent.
Meta-Learner LSTM
Contrast between MAML and RNN(LSTM) based meta learning methods
Great writeup can be found here
https://www.borealisai.com/en/blog/tutorial-2-few-shot-learning-and-meta-learning-i/
Reference Links:
In this lesson, we'll talk about unsupervised and self supervised learning, where we have only unlabeled data, and we'd like to learn effective feature representations, so that we can fine tune them for downstream tasks.
In supervised learning, we explored a number of architectures and loss functions in order to optimize the task.
In unsupervised learning, there's a set of traditional tasks that we may want to perform.
For example:
One such task that we can use to learn effective feature representations is an autoencoder. Here, we use the same encoder-decoder architectures we've talked about throughout, first converting the image into a small, low-dimensional compressed vector, that is, features. These features are low-dimensional embeddings that hopefully represent the most important aspects of the image. Given these bottleneck features, the decoder performs a reconstruction task: it simply tries to reconstruct the input. This may seem strange, since we already have the input and there's no actual need to reconstruct it, but it forces the neural network to learn really effective low-dimensional embeddings that capture the important aspects of the input. Once we train the system, we can discard the decoder and just use the encoder's low-dimensional embedding, for example to fine-tune it on a labeled classification task. Again, the key point is that we can do all of this without any labels.
We know what the input is and we want to reconstruct it, so no labels are needed. This also gives us a very concrete loss function: we can minimize the difference between the input and its reconstruction, for example with a mean squared error.
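A minimal PyTorch sketch of such an autoencoder with a mean-squared-error reconstruction loss (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder compresses the input to a low-dimensional embedding; decoder reconstructs it."""
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck_dim))
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)            # low-dimensional "bottleneck" features
        return self.decoder(z), z

model = AutoEncoder()
x = torch.rand(16, 784)                # a mini-batch of unlabeled inputs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction loss: no labels needed
```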
Another thing we can leverage is the clustering assumption, common in unsupervised learning: we assume that high-density regions form clusters, that is, items or instances that are similar form groups or clumps in some feature space, while low-density regions, where there are few examples, separate clusters that hold semantic meaning.
The key idea is that whatever feature space we learn, we'd like it to have these characteristics: similar items should have similar features, and items that are different should have features that are far apart according to some distance metric. So whatever raw input space we feed in, once it is transformed by a deep neural network into an output feature space, we'd like that feature space to have this property. One way to test this property is to apply K-Means.
A lot of these tasks already existed before deep learning; for example, reconstruction is the optimization criterion for principal component analysis, and clustering long predates deep learning as well. We can generalize this idea into the concept of surrogate tasks: tasks for which we can get the answer or label for free, in order to drive optimization and avoid the need for human annotation. These tasks must also force the neural network to learn effective feature representations, and we'd like to engineer them so there aren't trivial solutions or ways the network can cheat and avoid learning effective features. It turns out that over the years a large number of such tasks have been devised, and they all have different characteristics in terms of how effective the learned features are. This can differ both in terms of the level at which the neural network learns features.
Caron et al., Deep Clustering for Unsupervised Learning of Visual Features
For example, low-level edge features versus high-level semantic features, as well as how well they generalize to other tasks. One example is colorization. Given a red, green, blue three-channel image, we can convert it to grayscale using a hard-coded formula. The input to the neural network is then this grayscale image, and we'd like it to re-colorize the image. Clearly, this forces the network to understand something about what is in the image; otherwise it's really hard to colorize. For example, if you don't know that this is a fish, or what type of fish it is, or that other objects such as tennis balls are present, it's going to be hard to apply the right colors. What's nice is that we know the answer, since we already have the original RGB image, so we can have the network predict the colorized image and use, for example, a mean squared error loss to drive feature learning.
There's a specific way we evaluate the results of these surrogate tasks. What we'd like to answer is: how good are the feature representations learned in an unsupervised manner, and how well do they generalize to new labeled tasks? Typically, we take just the encoder part (for example, in rotation prediction we don't care about the layer that actually predicts the rotation amount) and transfer it to the actual task. This is essentially transfer learning, where we use the unsupervised model to initialize a supervised learning model and extract features, adding another classifier on top, typically a neural network or an earlier machine learning method such as a support vector machine. Often we limit the classifier to a simple linear layer, because we're interested in how good and how generalizable the learned feature representations are. If we're comparing many different surrogate tasks with each other, we don't want the confounding effect of additional complex transformations or nonlinearities; that wouldn't tell us how good the feature representations are, but would add the extra element of training those additional layers. So typically we just take the features from the unsupervised surrogate task and add a linear layer for whatever supervised classification task we'd like.
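A sketch of this linear-evaluation protocol: freeze the pretrained encoder and train only a linear classifier on top of its features (the encoder and data loader are assumed to exist; names are illustrative):

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feature_dim, num_classes, labeled_loader, epochs=10, lr=1e-3):
    """Train a single linear layer on frozen features from a pretrained encoder."""
    for p in encoder.parameters():
        p.requires_grad = False                      # the encoder stays fixed
    classifier = nn.Linear(feature_dim, num_classes)
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in labeled_loader:
            with torch.no_grad():
                feats = encoder(x)                   # features learned by the surrogate task
            loss = nn.functional.cross_entropy(classifier(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```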
In summary, there's a large number of surrogate tasks and variations that we can use to learn really good feature representations using unlabeled data alone. This includes contrastive losses, which work across image patches, context, or instance discrimination, and many different types of loss functions or training regimes, some more efficient than others. The two that have become dominant and extremely effective are contrastive losses for unsupervised learning and pseudo-labeling losses for semi-supervised learning, where the learned model is applied to the unlabeled data and its predictions drive a cross-entropy loss. You can also use soft knowledge distillation, where the pseudo-label is not a one-hot prediction. We haven't covered all of these methods; a lot of them work, and it's not clear which one is currently the best. What we do know is that data augmentation is now key: it is used across almost all of these unsupervised and semi-supervised learning methods, and, perhaps unfortunately, the methods tend to be sensitive to the choice of augmentation. So there's a lot of recent work exploring how to automatically learn data augmentation, or at least figure out which augmentations to use. Overall, these advances have been extremely exciting and have really only occurred recently.
In this lesson, we'll talk about a different class of unsupervised learning methods called generative modeling. Here we take a probabilistic view of unsupervised learning and try to estimate the probability distribution over the input space.
In unsupervised learning, there's a whole host of different tasks we may want to do. In this lesson, we'll focus on Density Estimation, that is producing a model of the probability distribution over the input space. In some cases, we may want to just have the ability to generate samples from this distribution, that is actually generate artificial examples that come from this distribution.
Recall the distinction between Discriminative and Generative models:
Discriminative models, model the conditional distribution probability of the label given the input. Examples of this include neural networks, support vector machines, random forests and so on. You may not have known that this is actually what we're doing, because really we just approximate this probability distribution using a blackbox neural network. We haven't done any additional probabilistic reasoning. However, if you've taken a machine learning course, you may have learned that there's an entire probabilistic and statistical view of machine learning. Where we can actually write out these probability distributions and reason about them. For example, to derive what loss functions we may want to use.
Generative models, on the other hand, model the distribution over the input space. This is a very complex distribution, so there's a question of how we can actually model it. Just like discriminative models, we can have a parametric approximation of this distribution, that is, a parameterized model p(x; theta), and use the principle of maximum likelihood to optimize the parameters given the unlabeled dataset. Here p(x; theta) is our likelihood, and we want to maximize it over theta, that is, we want our model to output high probabilities for the real data. The optimal set of parameters, theta star, is the argmax over theta of the product of the likelihoods of each example. This is because the examples are drawn independently and identically, that is, we're just sampling independently from the distribution, so the likelihood of all of the data decomposes as the product of the likelihoods of each piece of data. We can then take the log, since we're maximizing, and this turns into a sum of log likelihoods.
These are called generative models because they can often generate samples: once we have some parameters theta, we can generate examples from the distribution we're approximating. One example is a multivariate Gaussian model, where we have mu and sigma, the mean and covariance; we estimate these values from the data and, given the estimates, sample from the multivariate Gaussian to produce examples. Of course, this is very difficult, especially for high-dimensional data. Because generative modeling is so difficult, a whole host of different methods have been developed. These can be categorized in various ways, for example by whether they use the maximum likelihood principle to optimize their parameters. There are methods that perform explicit tractable density estimation: they simplify the joint distribution into a factorized model consisting of simpler components and then learn parameters for those. There are methods that perform approximations of various kinds, learning distributions that approximate the true joint distribution. And then there are implicit density models, where we don't model the density itself; rather, we're just able to perform tasks such as sampling from it. That is, we can generate samples from the probability distribution, but we don't have an explicit learned model for it. We'll cover the three most popular methods across this spectrum, to give you a flavor of the different methods that exist.
References and Links:
In this lesson, we'll talk about generative models that perform explicit tractable density estimation, that is, they reduce or factorize the joint distribution into something much more manageable that we can then approximate or learn using a recurrent neural network or CNN.
The first set of models we'll talk about represents the joint distribution as an explicit tractable density. This is similar to methods such as language models, where we model the distribution over sentences as the product of the likelihoods of the words. We'll do something similar, for example, for pixels in images. To simplify matters, we can use the chain rule to decompose the joint distribution, i.e. factorize it into a product of conditional distributions.
Mathematically: $\large p(x) = \prod_{i=1}^{n^2} p(x_i|x_1,\cdots,x_{i-1})$
In the context of language modeling, this should be familiar. Rather than modeling the joint distribution over all words in a sentence, we can just factorize it, and here we have a natural sequence or ordering because the words come in a word at a time. And so we can model the probability of the sentence as the product of the conditionals, the probability of the first word, the probability of the second word given the first word and so on. And we can just take the product of these likelihoods in order to model the probability of the sentence.
Mathematically:
We can do something similar with images; here, rather than words, the basic units are pixels. We can factorize the joint distribution as a product of conditionals by defining some ordering over the pixels. For example, we could start with the upper-left pixel and place some prior probability distribution over its value, then move through the subsequent pixels from left to right and top to bottom. We model the probability of x as the probability of x1 (the upper-left pixel), times the probability of x2 given x1, times the probability of x3 given x1 and x2, and so on through the rest of the product.
We can also use teacher forcing, where the model predicts a distribution over the current pixel's value, but the ground-truth value is used as the conditioning input for the next pixels. Over time we train this across the entire image, updating the model to get better and better at modeling the conditional distribution p(x_i) given the prior pixels. Once the model is trained, we can generate new images: sample from p(x1), the model's prediction for the first pixel, then use that as input to predict and sample the next pixels.
Another idea is to represent the conditional distribution as a convolution: rather than conditioning on just the adjacent pixels, we have a receptive field or window and a learnable convolution filter that outputs this distribution. The downside is that there are some future pixels inside the convolution window that we do not yet know.
In this lesson we'll talk about a different class of generative models called generative adversarial networks. Generative adversarial networks, or GANs, do not learn an explicit density function p(x); rather, they fall under the implicit density category. That means that rather than learning a parameterized density function, we learn a parameterized generation process that can give us samples from the joint distribution.
We won't have an explicit density to perform things like classification. But we'll still be able to use the unlabeled data to learn how to generate new samples that fit the distribution of the unlabeled data, or even to learn features that might be good for downstream tasks.
First we have to learn how to sample from a neural network output; it's not clear how to produce random outputs given some fixed input, since thus far we've only dealt with a particular input generating a particular output. We will also use the notion of adversarial training: we'll use one network's predictions to train another network's loss function. This can essentially be seen as a dynamic loss function, with two neural networks learning at the same time and pitted against each other.
There's also lots of tricks to make this more stable. Because you have a game theoretic optimization pitting two neural networks against each other, the complex dynamics of learning will actually turn out to be quite difficult to train. And so there's a lot of different tricks that researchers have developed in order to make this more stable and effective.
Suppose we would like to sample from a very complex distribution p(x) using a neural network. We'll use a simple idea: first sample from a simple distribution, say a Gaussian, that we know how to generate samples from, and then learn a transformation function that maps samples from that simple distribution to samples from the complex distribution. Here's an illustration.
Now this transformation function can be pretty complex, but that's okay. We have deep learning where we can learn really complex functions as a result.
Here's a concrete instantiation of this idea:
The key idea of generative adversarial networks is to have another network that distinguishes between real and generated or fake images. And the reason we wanna do this is that how well the discriminator performs, can be used as a signal to determine how well the generator is doing.
Here, the discriminator is supposed to output a probability of one for a real image and zero for a fake image. Since D(G(z)) should be zero for a fake image, we use one minus D(G(z)) and take the log of that. This term is essentially measuring how well the discriminator is doing, and the generator wants to minimize it because the generator wants to fool the discriminator.
So again, the right side of the objective function measures how well the discriminator is doing on fake data, and the generator wants to minimize that because it wants the discriminator to do poorly. Note that the generator's objective is only affected by the right side: the gradient of the first term of the objective is zero with respect to the generator, because it only contains log D(x) and has no term that depends on G, so no amount of changing G's parameters affects that part of the objective.
The discriminator, on the other hand, does the opposite. We sample from the fake data and the discriminator wants to maximize how well it does, that is, it wants to output zero for the fake data. At the same time we sample a mini-batch from the real data, and again the discriminator is maximizing. Note once more that for the generator only one part of this objective is relevant: the left side, log D(x), is the discriminator's output on real data, and nothing we change about the generator's parameters affects it, so its gradient there is zero.
So the gradient for the generator comes only from the right term, whereas the gradients for the discriminator come from both; it wants to do well on both the real and the fake data in discriminating them.
Recall, however, the quantity that the generator wants to minimize.
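For reference, the standard minimax objective has the form below (standard notation, as in Goodfellow et al., 2014); the generator's gradient comes only from the second term:

$\large \min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\text{log}\, D(x)] + \mathbb{E}_{z \sim p(z)}[\text{log}(1 - D(G(z)))]$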
So again, we'll update the generator using the gradients from this objective. So we're tuning both the generation process and the feature extraction in order to essentially output more and more realistic images. Here is a visual depiction of the same thing.
Again, we have a mini-batch of real and fake data: the generator produces the fake data, starting from a vector of random numbers each sampled from a normal distribution, and the mini-batch also contains real images. We feed both to the discriminator, and each part has its own objective. The generator's objective only touches the 1 - D(G(z)) term, because that's the only part of the objective that contains G, and we average this over the fake data in the mini-batch.
The discriminator's loss, on the other hand, includes both terms: it wants to output a high probability for the real data and a low probability for the fake data. We perform backprop and average the gradients over the entire mini-batch. It turns out that the generator's part of the objective is actually not very good in terms of its gradient properties.
Here's the full algorithm.
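A rough sketch of the alternating updates (not the exact algorithm from the lecture slide): for each mini-batch, the discriminator is updated on real and fake data, then the generator is updated on fresh fake data. The generator step below uses the common non-saturating variant of the loss, and D is assumed to output sigmoid probabilities of shape (batch, 1):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real_x, opt_g, opt_d, z_dim=100):
    """One alternating update of discriminator and generator using the BCE form of the objective."""
    batch = real_x.size(0)
    # Discriminator step: push D(real) -> 1 and D(fake) -> 0
    z = torch.randn(batch, z_dim)
    fake_x = G(z).detach()                                   # do not backprop into G here
    d_loss = F.binary_cross_entropy(D(real_x), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(D(fake_x), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make D output 1 on freshly generated fake data
    z = torch.randn(batch, z_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))   # non-saturating generator loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```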
One of the reasons that GANs weren't as good in the early days is that training these minimax objective functions is intrinsically hard: all sorts of strange dynamics can occur between the generator and discriminator. A number of advancements have been developed, including more stable architectures, regularization methods to improve the optimization, and techniques like progressive growing, that is, learning a generative model for small images with smaller networks and then slowly scaling them up, reusing what the smaller networks learned, or simply scaling these up to very large models.
One of the early innovations in GANs was simply tuning the architectures; it turns out that some architectures are more stable than others for training this objective. This is from a paper called DCGAN, where several findings were made: for example, replacing pooling layers with strided convolutions, using batchnorm in both the generator and discriminator, removing fully connected hidden layers for deeper architectures, using ReLUs except in certain places such as the output, and using LeakyReLU activations in the discriminator. Again, it's not entirely clear why these particular choices help, but through empirical investigation researchers found them to be a good recipe for stable GAN training.
Again, training GANs is difficult due to the minimax objective, and you can get all sorts of strange training dynamics. For example, what if the generator just learns to memorize and output samples from your training set? This will actually do really well, it will fool the discriminator because it can't tell them apart from real samples, but the problem is that you don't have much diversity or variety: you're not truly sampling from the distribution, you're just outputting the samples you already have.
Another problem is generating only parts of the distribution: you can do really well, for example, only on male faces or only on female faces, but not both. This is called mode collapse, where you can't capture all the modes of the distribution. Several regularization methods have been developed for this, and these are typically theoretically motivated.
I'm not going to go through the entire library of regularization methods, since there are dozens at this point, but some are really simple; for example, just add noise to the real images. It turns out this can actually be theoretically motivated and derived. As researchers improved all of these aspects, that is, architectures, objectives, optimization and regularization methods, as well as scaling, they began to be able to generate extremely realistic images. In fact, the images you see here are not real, they're fake, and it's not that easy to tell them apart from real images, though some details are often wrong. For example, eyes are hard, especially making both eyes consistent in gaze or appearance. Overall, though, these actually have pretty good detail, which is quite surprising. Just like language models, however, they're not without failures; you can have, for example, different types of class confusion, where the generator mixes different classes together, such as a tennis ball and some animal.
These are somewhat comical failures, but what's interesting is that they're not that bad; we can understand that this is a reasonable failure mode. It turns out we can also generate videos that are consistent across time; this is a really interesting use case, and you should check out the YouTube video linked here. It shows images from videos that are completely generated: the researchers took videos of professional dancers and extracted their body pose.
You can then take as input a 10-second clip of yourself moving around randomly, and the GAN can automatically generate an entire video of you dancing like the professional dancer. This is quite cool and has some implications for things like bias and misinformation, but check out the video, I think you'll like it. In summary, generative adversarial networks can produce amazing images and videos, and other adaptations have been developed to produce audio waveforms and so on. They still have some drawbacks: high-fidelity generation is heavy to train and requires a lot of compute, and training can be unstable, so a lot of experience and research is needed to make it stable. In the end, we don't have an explicit model for the distribution P(x), so if you want to do things like classification it's not clear how to directly use what's learned; really, we just take the features and fine-tune them. There is a large number of extensions, ranging from GANs conditioned on labeled data or other available information, to the broader notion of adversarial losses, for example a discriminator that discriminates between real and fake, or between one dataset and another, which has proven useful in a number of different domains.
In this lesson we'll talk about variational autoencoders, another type of generative model that maintains an explicit density model but uses approximations that allow us to train it tractably.
These are explicit density models with approximate densities, and in this case the optimization itself is also approximate.
Recall that autoencoders are architectures where the input is fed through an encoder, in order to generate a low-dimensional embedding. And this low dimensional embedding or bottleneck is then used in the decoder in order to reconstruct the image. A loss function (such as mean squared error) can be used in order to minimize the difference between the input and the reconstruction. The actual bottleneck features are hidden or latent variables. What we'd like them to be is disentangled factors of variation that produce an image.
For example, in the case of handwritten digits, it can be what digit it is. The orientation, the scale and so on. Again, the key idea is that there exists some low dimensional set of factors that determine how the image is generated.
We can actually write down an equation for the likelihood that involves these latent variables which we call z(orange middle bar in image), specifically, we can marginalize out the Z. $$ P(X)=\int P(X|Z;\theta)P(Z)dZ $$ So P of x can be just the integral of p of x given z here, also including theta, since it's the parametric model that we'll use times P(z), that is the prior over z times dz. Now, if we could directly maximize this, then we're essentially maximizing the likelihood and optimizing the parameters. But we can't really do this. Instead we maximize a variational lower bound (VLB) that we can compute.
The notion of variational autoencoders combines a set of ideas, including
Just as before, we'll start with a Z vector, where we sample from a random z in this case and then feed it through a decoder. And again this decoder models are P(X|Z) with some parameters.
Here, rather than outputting just an image, the decoder outputs Gaussian distribution parameters, specifically mu and sigma, the mean and covariance of the distribution. We show them as vectors here, but they can just as well be images. For multivariate Gaussian outputs, we're not actually going to output a full covariance matrix; we'll just output the diagonal covariance, assuming the different dimensions are independent. So this is essentially nothing but a decoder that, rather than outputting a deterministic thing, outputs the parameters of a simpler distribution that we know how to sample from, for example a Gaussian. Given a Z vector, we feed it through the decoder, it gives us some mu and sigma, and if we want to generate samples we just sample from this Gaussian.
The problem still remains. How can we maximize the likelihood for this model? And it turns out that if we also have an encoder, then actually we'll be able to derive some gradients that actually do quite well. In this case, we'll have an encoder, where given an image will have Q of z given x with again, some new parameters $\phi$. So this is a different model than the decoder. And again given x, it'll output not a particular z but it will output the parameters of a Gaussian distribution mu and sigma. And if we wanna generate an actual z, we again just sample from this simpler Gaussian. So now we have both an encoder and a decoder. We can put the encoder and decoder together that is given real data. Again, the whole idea of all of these generative models is that we have samples from the distribution and we want a model P(X).
So given a piece of data X or a mini batch of data, we'll estimate mu and sigma and then sample from that normal distribution to generate Z. Given Z's, we can then feed it through the decoder. And then we can again estimate mu and sigma for x and then sample from that to generate samples. Again, if we want to use a standard straightforward autoencoder loss, we can just compare the reconstructed x with the original x, but in this case, we'll have a more sophisticated derivation of a loss function.
And so the question remains: how can we optimize the parameters of the two networks, the encoder and the decoder? Again, we're going to use the principle of maximum likelihood and write out the term for the log probability of $x_i$, where $x_i$ is a sample. What's interesting here is that, since we have an encoder that can take a particular $x_i$, produce $\mu_z$ and $\sigma_z$, and sample from that Gaussian to produce z samples, we can write this out as an expectation over q(z|x) of the log probability of $x_i$. One thing to note is that the probability of $x_i$ doesn't actually depend on z.
(NB: It may be helpful to remember the laws of logarithms to understand the last step)
What's interesting to note is that the expectations that we get on the right side, the two terms on the right are actually equivalent to a similarity measure called KL divergence. As an aside, KL divergence is a distance measure between probability distributions.
And we saw this being used, for example, to derive why we use the cross-entropy loss. One thing to note is that it's always greater than or equal to 0: if two probability distributions are exactly the same, the KL divergence is 0, whereas if they're different then it can be unbounded. The equation is as follows, where H denotes the entropy
$KL(p||q)=H_c(p,q)-H(p)=\sum p(x)log p(x) - \sum p(x) log q(x)$
Recall that $E[f]=E_{x\sim q}[f(x)]=\sum q(x)f(x)$
So we can rewrite KL as $KL( q(z) || p(z|x) ) = E[log q(z)] - E[log p(z|x)]$
Now let's use this in our derivation from earlier
Explanation:
We can't compute term 3, so what we're going to do is basically just ignore it. We know that the KL divergence is always greater than or equal to zero, so if we just maximize the first two terms, we know that we'll still be making progress.
This is something called the variational lower bound or "ELBO". And it's actually a powerful technique that's used in many different places. The key idea is that if you are able to compute parts of your objective, that is a lower bound where the terms that you're ignoring are greater or equal to zero. Then you can just maximize those terms and still make progress towards this objective.
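Written out, the variational lower bound (ELBO) that we maximize takes the standard form

$\large \text{log}\, p(x) \geq \mathbb{E}_{z \sim q(z|x;\phi)}\left[\text{log}\, p(x|z;\theta)\right] - KL\left( q(z|x;\phi)\, ||\, p(z) \right)$

where the term we drop is $KL\left( q(z|x;\phi)\, ||\, p(z|x) \right) \geq 0$.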
Let's summarize how we can train our variational autoencoder. We have a mini-batch of unlabeled data x, and we run it through our encoder $Q(Z|X;\phi)$, which outputs the parameters of a Gaussian distribution, $\mu_z$ and $\sigma_z$ (the Gaussian form is an assumption we make). We then push the approximate posterior towards the prior: we take the KL divergence between $Q(Z|X)$, which is what we're computing here, and $P(Z)$, which can be assumed to be a normal distribution, say with mean zero and standard deviation one. This is a KL divergence between two Gaussian distributions, which can be computed in closed form. We can then sample from this to generate a bunch of Z's, feed them through our decoder $P(X|Z;\theta)$, which generates $\mu_x$ and $\sigma_x$, and then maximize the likelihood of the original data being reconstructed, that is, maximize the log p(x|z) that our decoder is computing.
Finally we can put it all together in a single picture:
We want to maximize the likelihood and we compute the likelihood to be these terms again. There's an extra term that we ignored because we're performing the lower bound optimization.
We showed the forward propagation, but one problem is that we are trying to back propagate through a sampling function.
When we take z given X and try to sample from $\mu_z$ and $\sigma_z$, we're performing a sampling operation; we then feed the result through the decoder to generate x's, or parameters for x's, and compute a loss function from there. But then we want to backpropagate through this entire thing, all the way back to the original encoder. The problem is that you can't backpropagate through sampling: it's non-differentiable. So there's something called the reparameterization trick, which allows you to do the sampling.
Specifically, we need to differentiate through the sampling process with respect to $\phi$, our encoder parameters, and that process is probabilistic. The way we solve this is by moving the sampling out of the part of the computation graph that has to go all the way back to the encoder. Specifically, we make the randomness independent of the encoder output: we sample from a normal distribution with mean zero and identity covariance, call that output epsilon, multiply it by the sigma that the encoder output, and add it to mu, the mean. So rather than sampling from $N(\mu, \sigma)$, we sample from $N(0, I)$ and then scale and shift.
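A one-line sketch of the reparameterization trick in PyTorch: the randomness comes from epsilon, which does not depend on the encoder's outputs, so gradients flow through mu and sigma. (Many implementations have the encoder output the log-variance rather than sigma directly; that is an implementation choice, not something from the lecture.)

```python
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and log_var."""
    std = torch.exp(0.5 * log_var)        # encoder outputs log-variance for numerical stability
    eps = torch.randn_like(std)           # the sampling happens here, outside the gradient path
    return mu + std * eps
```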
The key idea is that this trick gives us a computation graph where backpropagating to the encoder does not go through the sampling process; the sampling sits off to the side of the path we need. With this in place, we can use variational autoencoders to generate samples: at the end, take the decoder and feed it different random $z$'s, sampled for example from a normal distribution with zero mean and unit variance, and inspect how the resulting samples look.
You can also do what's called walking the latent space: take $z_1$, one dimension of your latent space, and $z_2$, another dimension, and interpolate between them. Rather than sampling the $z$'s, you sweep $z_1$ through a range of values (say zero to one) via some interpolation, do the same for $z_2$, and generate a sample for each combination. What you see is smoothness along each dimension of $z$. For example, $z_1$ seems to correspond to something like smiling: at high values of $z_1$ you get a frowning face, while at low values you get a smiling face. $z_2$ appears to encode something different, such as the orientation of the face: the left side of the $z_2$ axis tends to show one orientation and the right side another, with a smooth interpolation in between. You can see similar behavior when training on a handwritten digit dataset. So the model is learning a somewhat disentangled space, where each dimension of the latent space represents some attribute independently, and we can generate samples to visualize what each dimension means.
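A small sketch of that kind of sampling and latent-space walk, using a stand-in decoder (in practice you would use the trained decoder from the sketch above; the dimensions and step counts here are illustrative):

```python
import torch
import torch.nn as nn

z_dim = 20
# Stand-in decoder (in practice, the trained decoder P(x|z; theta)).
decoder = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, 784))

# Generating samples: draw z from the N(0, I) prior and decode.
with torch.no_grad():
    z = torch.randn(16, z_dim)
    samples = decoder(z)

# Walking the latent space: interpolate between two latent codes and decode
# each intermediate point; nearby z's should decode to smoothly varying outputs.
with torch.no_grad():
    z_a, z_b = torch.randn(1, z_dim), torch.randn(1, z_dim)
    steps = torch.linspace(0.0, 1.0, 10).unsqueeze(1)   # 10 interpolation steps
    z_path = (1 - steps) * z_a + steps * z_b            # shape: (10, z_dim)
    walk = decoder(z_path)                              # one generated sample per step
```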
In summary, variational autoencoders provide a principled way to perform approximate maximum likelihood optimization. We make a number of assumptions, for example that the prior over $Z$ is normally distributed with some hand-chosen mean and standard deviation, and that the encoder and decoder both output the parameters of Gaussian distributions. Given those assumptions, we can derive the log-likelihood terms and maximize them, albeit approximately, since some terms cannot be computed and are ignored.
Samples from VAEs are not really competitive with GANs; they tend to be blurry. However, the latent space learned in this unsupervised way is often quite effective for downstream tasks. One example is the "world models" line of work, which learns latent features from game images and then learns the dynamics of that latent space, i.e., how it evolves over time as you play. This is essentially a learned simulator, also known as a world model, and you can perform reinforcement learning inside it rather than gathering real-world experience in the game. This is just one example.
There are lots of other examples of downstream tasks that use these latent features, which are nicely disentangled in many cases and quite effective.
<img src="CS7643_images/MOD3_27.png" width="400" />
$\large P(w_{t+j}|w_t) = \frac{\exp(u_{w_t} \cdot v_{w_{t+j}})}{\sum_{k \in V} \exp(u_{w_t} \cdot v_k)}$
$ \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & & & & \vdots \\ x_{d1} & x_{d2} & x_{d3} & \dots & x_{dn} \end{bmatrix} $
$\large \frac{\partial L}{\partial h^{l-1}} = \frac{\partial L}{\partial h^l} \frac{\partial h^l}{\partial h^{l-1}}$
$ \large \frac{\partial h^l}{\partial h^{l-1}} = \begin{cases} 1 &\text{if } h^{l-1} > 0 \\ 0 &\text{otherwise} \end{cases} $
$ \begin{align} \frac{\partial L}{\partial w_j} & =\sum_{i=1}^{n} \frac{\partial}{\partial w_j} (y_i - w^T x_i)^2 \\ & =\sum_{i=1}^{n} 2(y_i - w^T x_i) \frac{\partial}{\partial w_j} (y_i - w^T x_i) \\ & =-2 \sum_{i=1}^{n} \delta_i \frac{\partial}{\partial w_j} (w^T x_i) \quad (\text{where } \delta_i = y_i - w^T x_i) \\ & =-2 \sum_{i=1}^{n} \delta_i \frac{\partial}{\partial w_j} \sum_{k=1}^{d} w_k x_{ik} \\ & =-2 \sum_{i=1}^{n} \delta_i x_{ij} \end{align} $