
Neural Networks


Definition

  • A highly parametrized model, promoted as a universal approximator, a machine that with enough data could learn any smooth predictive relationship.
  • It contains hidden layers that derive transformations of the inputs (nonlinear transformations of linear combinations), which are then used to model the output.
  • In a simple three-layer model with one input layer, one hidden layer, and one output layer, it contains a set of predictors or inputs $x_j$, a set of hidden units $a_l = g\left(w^{(1)}_{l0} + \sum_{j=1}^{n_j} w^{(1)}_{lj} x_j\right)$, and a set, possibly one, of output units $o = h\left(w^{(2)}_{0} + \sum_{l=1}^{n_l} w^{(2)}_{l} a_l\right)$.
  • The intercept terms $w^{(1)}_{l0}$ are called biases, and the function $g$ is a nonlinearity, such as the sigmoid function $g(t) = 1/(1+e^{-t})$ (a minimal sketch of this model follows this list).
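A minimal NumPy sketch of this three-layer model, with arbitrary sizes and random weights standing in for fitted parameters, and the identity function playing the role of the output unit $h$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)

n_j, n_l = 4, 3                   # number of inputs and hidden units (arbitrary)
x = rng.normal(size=n_j)          # one input vector x_j

W1 = rng.normal(size=(n_l, n_j))  # first-layer weights  w_{lj}^{(1)}
b1 = rng.normal(size=n_l)         # first-layer biases   w_{l0}^{(1)}
W2 = rng.normal(size=n_l)         # second-layer weights w_{l}^{(2)}
b2 = rng.normal()                 # second-layer bias    w_{0}^{(2)}

a = sigmoid(b1 + W1 @ x)          # hidden units a_l = g(w_{l0} + sum_j w_{lj} x_j)
o = b2 + W2 @ a                   # output o = h(w_0 + sum_l w_l a_l), h = identity
```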

Formula

Transition from layer $k-1$ to layer $k$:

$$z^{(k)} = \bm{W}^{(k-1)} a^{(k-1)}, \qquad a^{(k)} = g^{(k)}\left(z^{(k)}\right)$$

where $\bm{W}^{(k-1)}$ represents the matrix of weights that go from layer $L_{k-1}$ to layer $L_k$, $a^{(k)}$ is the entire vector of activations at layer $L_k$, and our notation assumes that $g^{(k)}$ operates elementwise on its vector argument.
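As a sketch, this transition is just a loop over layers; the weight matrices, activation functions, and input below are placeholders (biases can be folded into each $\bm{W}^{(k-1)}$ by appending a constant input):

```python
import numpy as np

def forward(a0, weights, activations):
    """Propagate a^{(0)} = x through the layers:
    z^{(k)} = W^{(k-1)} a^{(k-1)},  a^{(k)} = g^{(k)}(z^{(k)})."""
    a = a0
    for W, g in zip(weights, activations):
        z = W @ a   # linear combination from layer k-1 to layer k
        a = g(z)    # elementwise nonlinearity g^{(k)}
    return a
```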

M-class classification transformation $g^{(K)}$

$$g^{(K)}\left(z^{(K)}_m; z^{(K)}\right) = \frac{e^{z^{(K)}_m}}{\sum_{l=1}^{M} e^{z^{(K)}_l}}$$
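A numerically stable version of this softmax transformation (subtracting the maximum leaves the result unchanged but avoids overflow):

```python
import numpy as np

def softmax(z):
    """g^{(K)} for M-class classification, applied to the final-layer vector z."""
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()          # probabilities over the M classes
```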

Given a training set $\{x_i, y_i\}_1^n$ and a loss function $L[y, f(x)]$, along familiar lines we might seek to solve

$$\underset{\mathcal{W}}{\text{minimize}} \;\; \frac{1}{n}\sum_{i=1}^{n} L[y_i, f(x_i; \mathcal{W})] + \lambda J(\mathcal{W})$$

where $J(\mathcal{W})$ is a nonnegative regularization term on the elements of $\mathcal{W}$, and $\lambda \geq 0$ is a tuning parameter. (In practice there may be multiple regularization terms, each with their own $\lambda$.)
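A sketch of this penalized objective, assuming placeholder `predict` and `loss` functions and using the squared $\ell_2$ norm of the weights as one common choice of $J(\mathcal{W})$:

```python
import numpy as np

def objective(W_list, X, y, loss, predict, lam):
    """(1/n) * sum_i L[y_i, f(x_i; W)] + lam * J(W).

    `predict(x, W_list)` and `loss(y, pred)` are placeholders for the
    network's forward pass and the chosen loss function."""
    n = len(y)
    fit = sum(loss(y[i], predict(X[i], W_list)) for i in range(n)) / n
    penalty = sum(np.sum(W ** 2) for W in W_list)  # squared L2 penalty J(W)
    return fit + lam * penalty
```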

Building the neural network

Tuning Parameters

  • Number of hidden layers, and their sizes
  • Choice of Nonlinearities (activation functions)
    • sigmoid, tanh, ReLU, ELU, softplus (see the sketch after this list)
    • the full range of the inputs $x$ may use only part of the range of the activation function
  • Choice of Regularization
  • Early stopping
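For reference, the activation functions listed above, written in NumPy (the ELU `alpha` defaults to 1 here; this is only a sketch and ignores overflow for very large inputs):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    return np.log1p(np.exp(x))   # smooth approximation of ReLU
```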

Algorithms

Backpropagation (P. 357)

A neural network starts out with unknown weights and biases (parameter values) that are estimated when we fit the model to the training data using Backpropagation.

Main ideas

  1. Uses gradient descent to minimize the loss function
  2. Uses the chain rule to compute the gradient of the loss function with respect to the weights and biases
  3. Plug the computed gradients into the gradient descent algorithm to optimize (update) the weights and biases (see the sketch after this list)
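A sketch of these steps for the single-hidden-layer network from the Definition section, assuming a sigmoid hidden layer, an identity output unit, and squared-error loss (all choices made here purely for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop(x, y, W1, b1, W2, b2):
    # forward pass (same model as in the Definition section)
    a = sigmoid(b1 + W1 @ x)        # hidden activations
    o = b2 + W2 @ a                 # scalar output, identity output unit

    # backward pass: chain rule, starting from L = 0.5 * (o - y)^2
    dL_do = o - y                   # dL/do
    grad_b2 = dL_do                 # dL/db2
    grad_W2 = dL_do * a             # dL/dW2 = dL/do * do/dW2
    dL_da = dL_do * W2              # propagate back through the output layer
    dL_dz = dL_da * a * (1.0 - a)   # sigmoid'(z) = a * (1 - a)
    grad_b1 = dL_dz                 # dL/db1
    grad_W1 = np.outer(dL_dz, x)    # dL/dW1 = dL/dz1 * dz1/dW1
    return grad_W1, grad_b1, grad_W2, grad_b2
```

The returned gradients are what gradient descent (next section) plugs into its update step.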


Gradient Descent (P. 358)

A method to find the optimal solution to a wide range of optimization problems (gradient ascent for maximization). It is a first-order optimization algorithm that finds the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. It takes big steps when it is far from the minimum and small steps when it is close, since the gradient shrinks near the minimum.

With the quadratic form for the penalty, a gradient-descent update is

$$\bm{W}^{(k)} \leftarrow \bm{W}^{(k)} - \alpha \left(\Delta \bm{W}^{(k)} + \lambda \bm{W}^{(k)}\right), \qquad k = 1, \ldots, K-1$$
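A direct translation of this update into Python, assuming `grad_list` holds the gradients $\Delta \bm{W}^{(k)}$ computed by backpropagation:

```python
def gd_step(W_list, grad_list, alpha, lam):
    """W^{(k)} <- W^{(k)} - alpha * (grad + lam * W^{(k)}) for every layer k."""
    return [W - alpha * (g + lam * W) for W, g in zip(W_list, grad_list)]
```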

Stochastic Gradient Descent

Rather than process all the observations before making a gradient step, it can be more efficient to process smaller batches at a time — even batches of size one. These batches can be sampled at random, or systematically processed. For large data sets distributed on multiple computer cores, this can be essential for reasons of efficiency. An epoch of training means that all $n$ training samples have been used in gradient steps, irrespective of how they have been grouped.
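A sketch of such a mini-batch loop; `grad_fn` is a placeholder that returns the average gradient over a batch:

```python
import numpy as np

def sgd(X, y, W, grad_fn, alpha=0.1, batch_size=32, epochs=10, seed=0):
    """One epoch = every one of the n samples has been used in a gradient step."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # random order each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            W = W - alpha * grad_fn(W, X[batch], y[batch])
    return W
```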

Side projects

  • vs. Newton’s method

Architectures

CNN

Autoencoder

A special neural network for computing a type of nonlinear principal-component decomposition. The linear principal component decomposition is a popular and effective linear method for reducing a large set of correlated variables to a typically smaller number of linear combinations that capture most of the variance in the original set.
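A minimal autoencoder sketch in PyTorch, assuming 50 input features compressed to a 2-dimensional code (all sizes and data here are placeholders):

```python
import torch
from torch import nn

# Encoder maps inputs to a low-dimensional nonlinear "principal component" code;
# the decoder reconstructs the inputs from that code.
autoencoder = nn.Sequential(
    nn.Linear(50, 2), nn.Tanh(),   # encoder
    nn.Linear(2, 50),              # decoder
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, 50)           # placeholder data
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X), X)   # reconstruction error: compare output to input
    loss.backward()
    optimizer.step()
```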

GAN

Transformer

Diffusion

CLIP

Mamba

Concepts

  • CNN tackles the problems of spatial invariance, locality, sparsity, and computational cost, allowing for more efficient learning.
  • Resnet tackles the vanishing gradient problem.
  • Dropout tackles overfitting.
  • DenseNet, LSTM, Transformer allow for capturing complex interactions in the data.
  • Backpropagation is a method to compute the gradient of the loss function with respect to the weights and biases of the network via the chain rule. Since the coefficients of the first layer are distant from the output layer, where the loss is calculated, the gradient needs to be propagated back through the whole network.
  • Adversarial examples are inputs to neural networks that are designed to fool the network. They are often created by applying small perturbations to regular inputs to cause the network to misclassify the input, while the perturbations are imperceptible to humans (see the sketch after this list).
  • Hallucination is a problem of LLMs in which the model generates fluent but incorrect or meaningless output and presents it confidently, instead of admitting that it cannot produce a correct answer.
  • Diffusion models, also known as denoising models, use the idea of diffusion, the process of information spreading out due to entropy, to generate results. They rely on the idea that the forward diffusion process follows a Gaussian distribution, and that reversing the process has a tendency back toward the data; the possible reverse paths can be determined by score functions. They also benefit from the manifold hypothesis.
  • Long-range dependencies are fundamental to video generation; without them the output videos would be nonsense. However, temporal continuity does not imply that the content is logical or realistic, which is a problem that still needs to be solved in SORA.
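One common way to construct such adversarial perturbations, not described in the notes above but widely used, is the fast gradient sign method (FGSM); a minimal PyTorch sketch assuming a differentiable classifier `model`:

```python
import torch
from torch import nn

def fgsm(model, x, y, eps=0.01):
    """Fast gradient sign method: a small, worst-case perturbation of the input."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                              # gradient of the loss w.r.t. the input
    return (x + eps * x.grad.sign()).detach()    # perturbed input, imperceptible for small eps
```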

When to use neural networks

  • Data are easy to collect and plentiful
  • Invariant patterns
  • Mechanistic workflow


References