Neural Networks
Definition
- A highly parametrized model, promoted as a universal approximator, a machine that with enough data could learn any smooth predictive relationship.
- It contains hidden layers that derive transformations of the inputs (nonlinear transformations of linear combinations), which are then used to model the output.
- In a simple three-layer model with one input layer, one hidden layer, and one output layer, there is a set of predictors or inputs $x_1, \ldots, x_p$, a set of hidden units $a_1, \ldots, a_k$, and a set (possibly just one) of output units $y_1, \ldots, y_M$.
- The intercept terms are called biases, and the function $g$ is a nonlinearity, such as the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$.
Formula
Transition from layer $L_{k-1}$ to layer $L_k$:

$$a^{(k)} = g^{(k)}\left(W^{(k-1)} a^{(k-1)}\right)$$

where $W^{(k-1)}$ represents the matrix of weights that go from layer $L_{k-1}$ to layer $L_k$ (with the bias folded into $W^{(k-1)}$ as an extra column), $a^{(k)}$ is the entire vector of activations at layer $L_k$, and our notation assumes that $g^{(k)}$ operates elementwise on its vector argument.
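A minimal numpy sketch of this forward pass, assuming sigmoid activations and made-up layer sizes (2 inputs, 3 hidden units, 1 output); here the bias is kept as a separate vector `b` rather than folded into the weight matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate one input vector through all layers.

    weights[k] maps the activations of layer k to layer k+1;
    biases[k] is the corresponding intercept (bias) vector.
    """
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # nonlinearity applied elementwise
    return a

# Toy 2 -> 3 -> 1 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
print(forward(np.array([0.5, -1.0]), weights, biases))
```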
M-class classification transformation
For $M$-class classification, the final transformation is typically the softmax function

$$f_m(x) = \frac{e^{z_m}}{\sum_{\ell=1}^{M} e^{z_\ell}}, \qquad m = 1, \ldots, M,$$

where $z_m$ is the $m$-th unit of the final layer; the $f_m(x)$ are positive and sum to one, so they can be treated as class probabilities.
Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$ and a loss function $L[y, f(x)]$, along familiar lines we might seek to solve

$$\underset{\theta}{\operatorname{minimize}} \; \left\{ \frac{1}{n} \sum_{i=1}^{n} L[y_i, f(x_i; \theta)] + \lambda J(\theta) \right\},$$

where $\theta$ collects all the weights and biases, $J(\theta)$ is a nonnegative regularization term on the elements of $\theta$, and $\lambda \ge 0$ is a tuning parameter. (In practice there may be multiple regularization terms, each with their own $\lambda$.)
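A hedged sketch of these two pieces, using a plain linear-softmax classifier (no hidden layers) so the code stays short; cross-entropy loss and a quadratic (ridge) penalty are assumed, and the data are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def objective(W, X, y, lam):
    """Average cross-entropy loss of a linear softmax classifier
    plus a quadratic (ridge) penalty lam * J(W)."""
    loss = 0.0
    for xi, yi in zip(X, y):
        p = softmax(W @ xi)              # class probabilities f_m(x_i)
        loss += -np.log(p[yi])           # cross-entropy for the true class y_i
    return loss / len(y) + lam * (W ** 2).sum()

# Toy data: 4 examples, 2 features, 3 classes
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
y = np.array([0, 1, 2, 2])
print(objective(np.zeros((3, 2)), X, y, lam=0.1))   # ~log(3) at W = 0
```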
Building the neural network
Tuning Parameters
- Number of hidden layers, and their sizes
- Choice of Nonlinearities (activation functions)
- sigmoid, tanh, ReLU, ELU, softplus (see the sketch after this list)
- the full range of the inputs $x$ may map to only part of the activation function's range (sigmoid and tanh in particular saturate at their extremes)
- Choice of Regularization
- Early stopping
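A quick numpy sketch of the activation functions listed above; evaluating them on a grid makes the saturation point visible: sigmoid and tanh flatten out at the ends of the range, while the ReLU family keeps growing for positive inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def softplus(z):
    return np.log1p(np.exp(z))           # smooth approximation of ReLU

z = np.linspace(-5, 5, 11)
for name, g in [("sigmoid", sigmoid), ("tanh", np.tanh),
                ("relu", relu), ("elu", elu), ("softplus", softplus)]:
    print(f"{name:9s}", np.round(g(z), 3))
```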
Algorithms
Backpropagation (P. 357)
A neural network starts out with unknown weights and biases (parameter values) that are estimated when we fit the model to the training data using Backpropagation.
Main ideas
- Uses gradient descent to minimize the loss function
- Uses the chain rule to compute the gradient of the loss function with respect to the weights and biases
- Plug the computed gradients into the gradient descent algorithm to optimize(update) the weights and biases
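A minimal numpy illustration of these ideas, assuming a one-hidden-layer regression network with sigmoid hidden units, a linear output, and squared-error loss (all choices made up for the example); the backward pass is just the chain rule applied layer by layer, and the resulting gradients feed a plain gradient-descent update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One forward + backward pass and a gradient-descent update
    for a 1-hidden-layer network with squared-error loss."""
    # Forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    yhat = W2 @ a1 + b2                   # linear output unit
    # Backward pass (chain rule)
    d_yhat = yhat - y                     # dL/dyhat for L = 0.5 * (yhat - y)^2
    dW2 = np.outer(d_yhat, a1)
    db2 = d_yhat
    d_a1 = W2.T @ d_yhat                  # propagate the gradient back
    d_z1 = d_a1 * a1 * (1 - a1)           # sigmoid'(z1) = a1 * (1 - a1)
    dW1 = np.outer(d_z1, x)
    db1 = d_z1
    # Gradient-descent update
    return (W1 - lr * dW1, b1 - lr * db1,
            W2 - lr * dW2, b2 - lr * db2)

rng = np.random.default_rng(1)
params = (rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=(1, 3)), np.zeros(1))
params = backprop_step(np.array([0.2, -0.4]), np.array([1.0]), *params)
```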
Gradient Descent (P. 358)
A first-order optimization method for a wide range of problems (gradient ascent for the corresponding maximization problems). To find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function at the current point; the steps are large where the gradient is large (typically far from a flat minimum) and shrink as the gradient vanishes near the minimum.
With the quadratic form for the penalty, a gradient-descent update is

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \left( \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L\big[y_i, f(x_i; \theta^{(t)})\big] + \lambda\, \theta^{(t)} \right),$$

where $\alpha$ is the learning rate (step size).
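A small sketch of that update for a generic parameter vector, assuming the caller supplies the average-loss gradient; the name `grad_loss` and the toy quadratic loss below are made up for illustration.

```python
import numpy as np

def gd_step(theta, grad_loss, lam, lr):
    """theta <- theta - lr * (gradient of average loss + lam * theta)."""
    return theta - lr * (grad_loss(theta) + lam * theta)

def grad_loss(theta):
    return theta - 1.0            # gradient of the toy loss 0.5 * ||theta - 1||^2

theta = np.zeros(3)
for _ in range(100):
    theta = gd_step(theta, grad_loss, lam=0.1, lr=0.1)
print(theta)                      # converges toward 1 / (1 + lam) ≈ 0.909
```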
Stochastic Gradient Descent
Rather than process all the observations before making a gradient step, it can be more efficient to process smaller batches at a time — even batches of size one. These batches can be sampled at random, or systematically processed. For large data sets distributed on multiple computer cores, this can be essential for reasons of efficiency. An epoch of training means that all training samples have been used in gradient steps, irrespective of how they have been grouped.
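A sketch of the mini-batch and epoch bookkeeping described above, assuming the caller supplies a per-batch gradient function (the name `grad_batch` is hypothetical); one epoch means every training sample has been used in exactly one gradient step.

```python
import numpy as np

def sgd(theta, X, y, grad_batch, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of the current mini-batch
            theta = theta - lr * grad_batch(theta, X[idx], y[idx])
    return theta
```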
Side projects
- vs. Newton's method
Architectures
CNN
Autoencoder
A special neural network for computing a type of nonlinear principal-component decomposition. (Linear principal-component decomposition is a popular and effective method for reducing a large set of correlated variables to a typically smaller number of linear combinations that capture most of the variance in the original set.)
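A minimal numpy sketch of the idea, with made-up sizes and data: an encoder compresses the input to a lower-dimensional code and a decoder reconstructs the input from it, trained to minimize reconstruction error. With linear layers and squared error this recovers a PCA-like subspace; adding nonlinear activations gives the nonlinear generalization described above.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                   # hidden 2-D structure
A = rng.normal(size=(2, 5))
X = Z @ A + 0.1 * rng.normal(size=(200, 5))     # toy data living near a 2-D subspace

k, lr = 2, 0.02                                 # code dimension and step size
W_enc = rng.normal(scale=0.1, size=(k, 5))      # encoder weights
W_dec = rng.normal(scale=0.1, size=(5, k))      # decoder weights

def mse():
    return ((X @ W_enc.T @ W_dec.T - X) ** 2).mean()

print("before:", round(float(mse()), 3))
for _ in range(2000):
    code = X @ W_enc.T                          # encode
    err = code @ W_dec.T - X                    # reconstruction error
    g_dec = err.T @ code / len(X)               # gradient w.r.t. decoder weights
    g_enc = (err @ W_dec).T @ X / len(X)        # gradient w.r.t. encoder weights
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
print("after:", round(float(mse()), 3))
```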
GAN
Transformer
- AI Language Models & Transformers - Computerphile
- [Machine Learning 2021] Transformer (Part 1)
- 09 Transformer: What Is the Attention Mechanism (Attention)
Diffusion
- Why Does Diffusion Work Better than Auto-Regression?
- Stable Diffusion in Code (AI Image Generation) - Computerphile
- How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile
- Diffusion models from scratch in PyTorch
- [Generative AI] A Brief Look at the Principles of Diffusion Models for Image Generation — Hung-yi Lee
- [Generative AI] The Common Recipe behind Stable Diffusion, DALL-E, and Imagen — Hung-yi Lee
CLIP
Mamba
Concepts
- CNNs address spatial invariance, locality, sparsity, and computational cost, allowing for more efficient learning.
- ResNet tackles the vanishing-gradient problem.
- Dropout tackles overfitting.
- DenseNet, LSTM, Transformer allow for capturing complex interactions in the data.
- Backpropagation is a method to compute the gradient of the loss function with respect to the weights and biases of the network via the chain rule. Since the coefficients of the first layer are distant from the output layer, where the loss is computed, the gradient has to be propagated back through the whole network.
- Adversarial examples are inputs to neural networks that are designed to fool the network. They are often created by applying small perturbations to regular inputs to cause the network to misclassify the input, while the perturbations are imperceptible to humans.
- Hallucination is a failure mode of LLMs in which the model generates fluent but incorrect or unsupported content instead of acknowledging that it cannot produce a correct answer.
- Diffusion models, also called denoising models, generate results by reversing a diffusion process: the gradual spreading-out of information driven by entropy. The forward process adds Gaussian noise step by step; the reverse process has a tendency back toward the center of the data distribution, with the possible reverse paths determined by score functions. The approach also benefits from the manifold hypothesis. (A toy sketch of the forward noising process follows this list.)
- Long-range dependencies are fundamental to video generation; without them the output videos would be nonsense. However, temporal continuity alone does not guarantee logical or realistic content, a problem that models such as Sora still need to solve.
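A toy numpy sketch of the forward (noising) half of that picture, assuming a standard DDPM-style Gaussian noise schedule (the specific schedule values are made up); it only shows how a data point is gradually diffused into noise, not the learned reverse (denoising) process.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # per-step noise variances (made-up schedule)
alpha_bar = np.cumprod(1.0 - betas)       # cumulative fraction of signal retained

def noise_to_step(x0, t):
    """Sample x_t ~ q(x_t | x_0): scaled data plus Gaussian noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.array([2.0, -1.0])                # a made-up "data point"
for t in [0, 250, 500, 999]:
    print(t, np.round(noise_to_step(x0, t), 3))   # signal fades, noise takes over
```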
When to use neural networks
- Easy to collect data, and there are lots of data
- Invariant patterns
- Mechanistic workflow
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
- Efron, B., & Hastie, T. (2021). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.
- Deep learning series — 3Blue1Brown
- Neural Networks / Deep Learning series — StatQuest
- Building a neural network FROM SCRATCH using numpy
- A Neural Network Playground
- Why Neural Networks can learn (almost) anything
- Neural Networks From Scratch