Overview

  • Deep learning models, particularly convolutional neural networks (CNNs) and large-scale transformers, require massive computational power. The complexity of these models—often involving billions of parameters—demands specialized hardware for training and inference. This section explores the trajectory from traditional general-purpose CPUs to GPUs and TPUs, as well as the frameworks that make deep learning accessible to researchers and engineers.

  • As we will see, the story of deep learning hardware is a story of specialization: from CPUs designed for general-purpose sequential computing, to GPUs designed for highly parallel tasks, and finally to TPUs—application-specific chips optimized for tensor algebra and large-scale matrix multiplications. These developments have been central to enabling the rapid progress of the field.

CPU, GPU, and TPU

  • At the heart of every computing system lies the Central Processing Unit (CPU). CPUs are optimized for sequential tasks: each core is powerful, capable of handling complex operations with high clock speeds. However, they generally have fewer cores, which makes them less effective at highly parallelizable operations like those in deep learning. A typical Intel Core i7-7700k, for example, has 4 cores (8 threads with hyperthreading) running at 4.2 GHz, delivering roughly 540 GFLOPs in single precision (FP32).

  • By contrast, Graphics Processing Units (GPUs) excel at massively parallel tasks. Originally designed for rendering images, GPUs have thousands of smaller, simpler cores. A GPU like the NVIDIA RTX 2080 Ti, with 4352 CUDA cores, runs at 1.6 GHz with 11 GB of GDDR6 memory, offering up to 13.4 TFLOPs FP32 performance. This architecture makes GPUs exceptionally well-suited for the matrix multiplications and tensor operations central to neural networks. For instance, multiplying two matrices involves calculating many dot products between rows and columns, each of which can be computed independently—a task GPUs are tailored for.

  • The following figure shows a comparison between a state-of-the-art CPU and GPU, including cores, clock speed, memory, price, and computational throughput.

Figure 8.1.1

  • The following figure illustrates how matrix multiplication can be decomposed into dot products of rows and columns, a process that is naturally parallelizable and therefore highly efficient on GPUs.

Figure 8.1.2
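
  • To make the decomposition concrete, the following minimal NumPy sketch (illustrative only) computes each output entry of a matrix product as an independent dot product, which is exactly the structure a GPU parallelizes:

import numpy as np

# Each output entry C[i, j] is an independent dot product of row i of A with
# column j of B, so all M*N entries can in principle be computed in parallel.
M, K, N = 4, 3, 5
A = np.random.randn(M, K)
B = np.random.randn(K, N)

C_loop = np.empty((M, N))
for i in range(M):          # no dependencies between iterations of these loops,
    for j in range(N):      # which is the structure GPUs exploit
        C_loop[i, j] = np.dot(A[i, :], B[:, j])

C_vec = A @ B               # the same computation, dispatched to an optimized BLAS kernel
assert np.allclose(C_loop, C_vec)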

  • With the rise of deep learning, GPUs became the hardware of choice. However, as models scaled further, Tensor Processing Units (TPUs) were introduced by Google. TPUs are Application-Specific Integrated Circuits (ASICs) designed explicitly for tensor operations and large-scale deep learning workloads. For comparison, the NVIDIA Titan V GPU, with 5120 CUDA cores and 640 Tensor Cores, provides ~14 TFLOPs FP32 and ~112 TFLOPs FP16, while Google Cloud TPUs (available on demand at $4.50/hour) offer up to ~180 TFLOPs of performance.

  • The following figure compares CPU, GPU, and TPU architectures, highlighting their respective trade-offs: CPUs excel at sequential logic, GPUs at parallel operations, and TPUs at specialized tensor operations.

Figure 8.1.3

  • Another critical metric is cost-efficiency. The number of floating-point operations per dollar (FLOPs/$) has improved dramatically, lowering the economic barrier to training large-scale models.

  • The following figure shows how GigaFLOPs per dollar have scaled with the explosion of deep learning.

Figure 8.1.4

  • To take full advantage of GPUs, low-level programming frameworks like CUDA (developed by NVIDIA) allow writing C-like code that runs directly on the GPU. However, hand-optimizing CUDA kernels is notoriously difficult. Higher-level APIs such as cuBLAS (for linear algebra), cuFFT (for fast Fourier transforms), and cuDNN (for deep neural network primitives) provide optimized routines. These significantly improve performance compared to non-optimized implementations. OpenCL offers a vendor-neutral alternative, but on NVIDIA hardware it is generally slower than CUDA.

  • The following figure presents CNN benchmark comparisons between optimized CUDA code versus a CPU, demonstrating orders-of-magnitude speedups.

Figure 8.1.5

  • The following figure presents CNN benchmark comparisons between optimized and non-optimized CUDA code, showing the importance of low-level optimizations.

Figure 8.1.5

  • Beyond raw computation, practical concerns arise: transferring data between system memory and GPU memory can become a bottleneck. Strategies to mitigate this include preloading data into RAM, using SSDs instead of HDDs, and prefetching data with multithreaded CPU workers.

  • These developments highlight a central trend: specialized hardware has been indispensable to deep learning’s success. From CPUs to GPUs to TPUs, each hardware generation has unlocked new possibilities in model scale and complexity.

Deep Learning Frameworks

  • The progress of deep learning has been enabled not only by advances in hardware but also by the development of powerful software frameworks. These frameworks abstract away the complexities of writing low-level code for GPUs or TPUs, making it possible for researchers to focus on architecture design, experimentation, and deployment. Instead of manually implementing backpropagation and gradient updates, modern frameworks handle these details automatically through computational graphs and automatic differentiation.

  • The primary goals of frameworks such as PyTorch and TensorFlow are to allow researchers to:

    1. Rapidly prototype and test new ideas.
    2. Automatically update network parameters through efficient gradient propagation.
    3. Run seamlessly on GPUs and, increasingly, on TPUs.
  • Frameworks represent a critical bridge between high-level model specification and low-level hardware execution. Libraries like CUDA, cuBLAS, and cuDNN remain central, but most practitioners now interact primarily with PyTorch, TensorFlow, or similar frameworks like MXNet (Chen et al., 2015).

PyTorch

  • PyTorch, developed by Facebook AI Research (Paszke et al., 2019), has become a favorite among academics and researchers due to its dynamic computation graph and Pythonic design. Unlike static graph frameworks, PyTorch builds computation graphs on the fly, which allows for more intuitive debugging and flexible experimentation. This flexibility has made PyTorch particularly well-suited for tasks in natural language processing and computer vision.

  • Originally, PyTorch was created as a GPU-enabled extension of NumPy. Indeed, the syntax between the two is nearly identical. Consider the computational graph for

\[c = \sum_{i,j} \left( x_{i,j}\, y_{i,j} + z_{i,j} \right)\]
  • In NumPy, we must explicitly compute gradients, whereas in PyTorch the autograd engine handles it automatically by setting requires_grad=True.

  • The following figure illustrates forward and backward propagation in NumPy versus PyTorch, demonstrating how PyTorch abstracts away much of the manual gradient computation.

Figure 8.2.1
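
  • As a minimal illustration (using small random matrices purely for demonstration), the sketch below computes the gradients of c by hand in NumPy and then lets PyTorch's autograd derive the same gradients from the forward pass alone:

import numpy as np
import torch

# NumPy: forward pass and gradients written by hand.
x_np, y_np, z_np = np.random.randn(3, 4), np.random.randn(3, 4), np.random.randn(3, 4)
c_np = np.sum(x_np * y_np + z_np)
grad_x_np = y_np                      # dc/dx_{ij} = y_{ij}
grad_y_np = x_np                      # dc/dy_{ij} = x_{ij}
grad_z_np = np.ones_like(z_np)        # dc/dz_{ij} = 1

# PyTorch: only the forward pass is written; autograd fills in the backward pass.
x = torch.tensor(x_np, requires_grad=True)
y = torch.tensor(y_np, requires_grad=True)
z = torch.tensor(z_np, requires_grad=True)
c = (x * y + z).sum()
c.backward()                          # populates x.grad, y.grad, z.grad

assert torch.allclose(x.grad, torch.tensor(grad_x_np))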

  • PyTorch also enables seamless GPU acceleration. By simply moving tensors to a CUDA device, computations run on the GPU without further modification. This ease of transition makes PyTorch highly practical for both research and production experiments.
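
  • A minimal sketch of this device handoff, assuming a CUDA-capable GPU is visible to PyTorch (it falls back to the CPU otherwise):

import torch

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(4096, 4096, device=device)   # allocated directly on the chosen device
w = torch.randn(4096, 4096).to(device)       # or moved there after creation
y = x @ w                                     # this matmul runs on the GPU when one is present
print(y.device)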

  • A major strength of PyTorch lies in its modular design for neural networks. Using torch.nn.Sequential, researchers can define models in just a few lines of code, while custom architectures can be built by subclassing torch.nn.Module. Optimizers such as SGD and Adam are available through torch.optim, reducing the complexity of training loops.
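
  • A small sketch of this modular style, using arbitrary layer sizes and random data purely for illustration:

import torch
import torch.nn as nn

# A small two-layer network defined with torch.nn.Sequential.
model = nn.Sequential(
    nn.Linear(784, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
)

# Optimizers from torch.optim manage the parameter updates.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random data.
inputs, targets = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss = loss_fn(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()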

  • PyTorch further provides convenient utilities:

  • DataLoader and Dataset APIs for batching, shuffling, and multiprocessing.
  • torchvision package for pretrained models such as ResNet, VGG, and Inception.
  • Visualization tools such as Visdom (for interactive dashboards) and tensorboardX (a TensorBoard wrapper).

  • The following figure shows example interactive visualizations created in PyTorch using the Visdom package.

Figure 8.2.2

  • Overall, PyTorch’s philosophy is clear: provide a Pythonic, dynamic, and flexible environment for research, while maintaining efficiency at scale.

TensorFlow

  • TensorFlow, introduced by Google Brain (Abadi et al., 2016), was designed with scalability and production deployment in mind. Early versions relied on static computation graphs, which enabled compiler-level optimizations such as graph pruning and operator fusion, but made experimentation less intuitive compared to PyTorch. With TensorFlow 2.0, dynamic graph execution (via Eager Execution) became the default, narrowing the usability gap with PyTorch.

  • TensorFlow integrates tightly with Google’s cloud ecosystem and TPUs, making it particularly strong for large-scale industrial applications. Its high-level API, Keras, provides a user-friendly, modular interface for constructing networks. A simple model can be created using Sequential, while more complex architectures can be defined with the Model class.

  • TensorFlow/Keras also provide built-in datasets such as MNIST, CIFAR-10, and IMDB, making it easy to experiment without additional data preprocessing.

  • The following figure shows a visualization of the MNIST dataset using TensorFlow/Keras utilities and Matplotlib. Here, the first 450 training images are displayed in a grid.

Figure 8.2.3

  • Once data is prepared, TensorFlow models can be compiled with optimizers (e.g., Adam), loss functions (e.g., cross-entropy), and metrics (e.g., accuracy). Training then proceeds with model.fit(), while evaluation uses model.evaluate(). This makes TensorFlow/Keras particularly appealing for practitioners entering the field.

The Software Stack and Optimization

  • While CPUs, GPUs, and TPUs provide the raw computational horsepower for deep learning, it is the software stack that enables this hardware to be used efficiently. Modern deep learning frameworks such as PyTorch and TensorFlow sit at the top of this stack, abstracting away hardware-specific complexities and providing researchers with accessible APIs. Underneath these frameworks lies a carefully engineered set of libraries and tools that maximize performance.
  • The optimization story of deep learning is not just about raw FLOPs but also about software efficiency and memory management. CUDA and its ecosystem libraries provide the foundation for modern deep learning frameworks, while higher-level APIs democratize access. By abstracting complexity without sacrificing speed, the software stack has enabled rapid progress in AI research and deployment.

CUDA: The Foundation of GPU Programming

  • At the lowest level of the stack is CUDA (Compute Unified Device Architecture), NVIDIA’s proprietary parallel computing platform. CUDA allows developers to write C-like code that runs directly on NVIDIA GPUs, unlocking the ability to exploit thousands of cores for parallel workloads.

  • However, writing perfectly optimized CUDA kernels is challenging. Even small inefficiencies in memory access patterns or thread synchronization can cause massive slowdowns. To address this, NVIDIA and the broader ecosystem have built higher-level CUDA libraries optimized for common operations:

  • cuBLAS – GPU-accelerated implementation of BLAS (Basic Linear Algebra Subprograms).
  • cuFFT – Optimized Fast Fourier Transform (FFT) library.
  • cuDNN – The CUDA Deep Neural Network library, optimized for primitives such as convolution, pooling, and normalization.

  • Deep learning frameworks rely heavily on these libraries. For example, when PyTorch calls a convolutional layer, it delegates the operation to cuDNN, ensuring highly optimized performance without requiring the researcher to manually write CUDA code.
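
  • The sketch below illustrates this delegation, assuming an NVIDIA GPU with cuDNN is available; the user writes an ordinary nn.Conv2d call and never touches CUDA directly:

import torch
import torch.nn as nn

# Assuming an NVIDIA GPU with cuDNN, the convolution below is dispatched to a
# cuDNN kernel without any CUDA code being written by the user.
torch.backends.cudnn.benchmark = True   # let cuDNN autotune the fastest algorithm for this shape

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1).cuda()
images = torch.randn(32, 3, 224, 224, device='cuda')
features = conv(images)                 # executed by a cuDNN convolution primitive
print(torch.backends.cudnn.is_available(), features.shape)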

Benchmarking CUDA Performance

  • The impact of these optimizations is striking. CNN benchmarks consistently show orders-of-magnitude improvements when running with optimized CUDA kernels compared to CPU-bound implementations. Even within GPU runs, the difference between optimized versus non-optimized CUDA code is significant.

  • The following figure shows CNN benchmark comparisons between optimized CUDA kernels and CPU execution, demonstrating the superiority of GPUs for parallelized deep learning tasks.

Figure 8.1.5

  • The following figure shows CNN benchmark comparisons between optimized and non-optimized CUDA kernels on GPUs, highlighting the necessity of library-level optimizations for practical deep learning workloads.

Figure 8.1.5

OpenCL: A Cross-Platform Alternative

  • While CUDA is dominant in the NVIDIA ecosystem, OpenCL (Open Computing Language) provides a vendor-neutral programming framework that runs across GPUs from different manufacturers. However, on NVIDIA GPUs, OpenCL implementations typically run slower than CUDA because NVIDIA’s tooling and optimizations are designed first and foremost for CUDA. In practice, deep learning researchers almost exclusively use CUDA when working with NVIDIA hardware.

Data Transfer Bottlenecks

  • Even with optimized libraries, deep learning performance is not determined solely by raw compute power. Data transfer between system memory and GPU memory can become a bottleneck if not managed carefully. For instance:

    • Moving large datasets from RAM to GPU VRAM repeatedly can stall training.
    • Disk read speed also matters: using an SSD (rather than an HDD) can significantly reduce input pipeline delays.
    • Prefetching and caching data with multiple CPU threads helps keep GPUs saturated with data, avoiding idle cycles.
  • Frameworks like PyTorch and TensorFlow mitigate these issues with DataLoader and tf.data utilities, which handle prefetching, parallel I/O, and on-the-fly data augmentation. This ensures that GPUs spend their time on compute rather than waiting for inputs.
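
  • A minimal PyTorch DataLoader configuration along these lines, assuming a CUDA device (the dataset here is a toy in-memory stand-in for a real on-disk corpus):

import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy in-memory dataset; in practice this would stream images or text from disk.
dataset = TensorDataset(torch.randn(10_000, 3, 32, 32), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,      # worker processes prefetch batches in the background
    pin_memory=True,    # page-locked host memory speeds up host-to-GPU copies
)

for images, labels in loader:
    images = images.to('cuda', non_blocking=True)  # overlaps the copy with compute
    labels = labels.to('cuda', non_blocking=True)
    break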

The Layered Ecosystem

  • Putting this together, the modern deep learning software stack can be seen as a layered abstraction:

    1. Hardware – CPUs, GPUs, TPUs.
    2. Low-level programming – CUDA (NVIDIA) or OpenCL.
    3. Optimized libraries – cuBLAS, cuFFT, cuDNN (specialized primitives).
    4. Framework backends – PyTorch, TensorFlow, MXNet, JAX.
    5. High-level APIs – Keras, torch.nn, HuggingFace Transformers.
  • This layered design has been crucial to deep learning’s success. Researchers can experiment with billions of parameters without needing to optimize GPU kernels themselves, while still benefiting from hardware-level performance gains.

PyTorch in Depth: Autograd and Model Building

  • While the previous section introduced PyTorch conceptually, it is important to understand why PyTorch has become the dominant research framework. Its success stems from the way it handles automatic differentiation (autograd), flexible model definition, and GPU-first design.

Autograd and Computational Graphs

  • In traditional NumPy code, one must manually calculate gradients during backpropagation. PyTorch eliminates this burden through its autograd engine, which dynamically constructs a computational graph as operations are performed. Each tensor can be flagged with requires_grad=True, and PyTorch automatically tracks dependencies so that calling .backward() computes gradients.

  • Consider the equation:

\[c = \sum_{i,j} \left( x_{i,j}\, y_{i,j} + z_{i,j} \right)\]
  • In NumPy, both the forward pass and the gradient calculations must be explicitly implemented. In PyTorch, however, only the forward computation is specified, while gradients are inferred automatically.

  • The following figure illustrates the computational graph for forward and backward propagation in NumPy versus PyTorch, demonstrating how PyTorch simplifies gradient handling.

Figure 8.2.1

Defining Models: From Sequential to Custom Modules

  • PyTorch offers multiple levels of abstraction for defining models:

    1. Low-level (manual gradients) – One can manually implement forward and backward passes for didactic purposes.
    2. Autograd functions – Users can create custom functions by subclassing torch.autograd.Function and specifying forward/backward computations. This is useful for operations not covered by built-in PyTorch layers (see the sketch after this list).
    3. Sequential API – Simple models can be expressed as a sequence of layers (torch.nn.Sequential).
    4. Custom Modules – For complex architectures, subclassing torch.nn.Module allows users to define flexible forward passes while automatically managing learnable parameters.
  • This layered approach balances teaching clarity with research flexibility.
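
  • As an example of the second level above, the following sketch implements a hand-written ReLU as a custom torch.autograd.Function (purely didactic, since nn.ReLU already exists):

import torch

class MyReLU(torch.autograd.Function):
    """A hand-written ReLU, illustrating custom autograd functions."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # stash what the backward pass will need
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0           # gradient flows only where the input was positive
        return grad_input

x = torch.randn(5, requires_grad=True)
y = MyReLU.apply(x).sum()
y.backward()
print(x.grad)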

Training Workflow

  • Training a neural network in PyTorch typically follows a clear loop:

    1. Data Preparation

      • Load datasets using torch.utils.data.Dataset and DataLoader.
      • Supports batching, shuffling, and multiprocessing.
    2. Model Definition

      • Construct using nn.Sequential or a custom nn.Module.
    3. Forward Pass

      • Compute predictions by passing inputs through the model.
    4. Loss Calculation

      • Compute error (e.g., mean squared error, cross-entropy).
    5. Backward Pass

      • Call .backward() on the loss to compute gradients.
    6. Parameter Update

      • Use optimizers from torch.optim (e.g., SGD, Adam) to update weights.
      • Call .zero_grad() after each update to clear accumulated gradients.
  • This explicit workflow is one of PyTorch’s pedagogical strengths. It exposes students to the full training loop while still offering abstractions for production use.
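
  • A compact sketch of this loop, using a toy regression problem and arbitrary hyperparameters purely for illustration:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Data: a toy regression problem standing in for a real Dataset.
X, y = torch.randn(1024, 20), torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# 2. Model, loss, and optimizer.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for epoch in range(5):
    for xb, yb in loader:
        pred = model(xb)            # 3. forward pass
        loss = loss_fn(pred, yb)    # 4. loss calculation
        optimizer.zero_grad()       # clear gradients from the previous step
        loss.backward()             # 5. backward pass
        optimizer.step()            # 6. parameter update
    print(f"epoch {epoch}: loss {loss.item():.4f}")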

Visualization and Monitoring

  • PyTorch offers strong ecosystem support for visualization:

    • Visdom: An interactive dashboard for plotting losses, images, and text streams.
    • tensorboardX: A TensorBoard wrapper for PyTorch that enables dynamic graph visualization and training logs.
  • The following figure shows example interactive visuals generated using Visdom with PyTorch tensors and NumPy arrays.

Figure 8.2.2

  • These tools are particularly valuable for monitoring experiments in real time, debugging issues, and communicating results.
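
  • A minimal tensorboardX logging sketch, assuming the tensorboardX package is installed (the log directory name is an arbitrary choice):

from tensorboardX import SummaryWriter  # assumes tensorboardX is installed

writer = SummaryWriter(log_dir='runs/example')

# Inside a training loop, scalars (and images, histograms, etc.) can be logged per step.
for step in range(100):
    fake_loss = 1.0 / (step + 1)
    writer.add_scalar('train/loss', fake_loss, global_step=step)

writer.close()
# Then launch TensorBoard pointed at the log directory:  tensorboard --logdir runs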

Pretrained Models and Transfer Learning

  • Through the torchvision package, PyTorch provides direct access to pretrained models such as ResNet, VGG, SqueezeNet, DenseNet, and Inception. These can be used for fine-tuning on custom datasets, greatly accelerating research. Transfer learning has become a central paradigm in modern AI, and PyTorch’s simple APIs (e.g., torchvision.models.resnet18(pretrained=True)) make it accessible to beginners and experts alike.
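
  • A short transfer-learning sketch along these lines, with a hypothetical 5-class target task assumed for illustration:

import torch
import torch.nn as nn
import torchvision

# Load a ResNet-18 pretrained on ImageNet (newer torchvision versions prefer the
# weights= argument, but pretrained=True matches the usage described above).
model = torchvision.models.resnet18(pretrained=True)

# Freeze the backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 5-class problem.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)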

Why PyTorch?

  • In summary, PyTorch stands out because it:

    • Offers dynamic computation graphs, enabling flexible and intuitive experimentation.
    • Provides seamless GPU integration, requiring minimal code changes.
    • Supports both low-level control and high-level abstraction for diverse use cases.
    • Integrates with a broad ecosystem of visualization, pretrained models, and optimization tools.
  • This combination explains why PyTorch dominates in academic research and prototype development, while still increasingly being adopted in production systems.

TensorFlow in Depth: From Static Graphs to Keras

  • TensorFlow, originally introduced by Google Brain (Abadi et al., 2016), was designed as a production-first deep learning framework. Its initial strength lay in static computation graphs, which allowed compiler-level optimizations such as operator fusion and deployment portability across CPUs, GPUs, and TPUs. However, static graphs also made the framework less intuitive compared to PyTorch’s dynamic execution.

  • With TensorFlow 2.0, this gap narrowed significantly. The framework now defaults to Eager Execution, enabling dynamic computation graphs similar to PyTorch, while still retaining the option for graph optimizations when needed. This transition has made TensorFlow more user-friendly, aligning it with the needs of both researchers and industry practitioners.

  • TensorFlow’s evolution can be summarized as a move from rigid but highly optimized static graphs to a dynamic, user-friendly ecosystem that emphasizes both scalability and production deployment. Keras simplifies the research workflow, while integration with TPUs and deployment tools makes TensorFlow an attractive choice for enterprise-scale machine learning systems.

Keras: The High-Level API

  • TensorFlow’s high-level interface is Keras, which abstracts model building into a modular, Pythonic API. With Keras, defining a neural network often involves just a few lines of code using either:

    • Sequential API: A linear stack of layers.
    • Functional API: A more flexible way to build complex architectures, such as multi-input/multi-output models.
  • For example, a two-layer neural network with a ReLU hidden layer and softmax output can be defined as:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(100, activation='relu'),
    layers.Dense(10, activation='softmax')
])
  • Keras integrates smoothly with NumPy, allowing direct use of NumPy arrays for training. This lowers the barrier for researchers transitioning from classical machine learning or scientific computing.

Built-In Datasets and Preprocessing

  • TensorFlow/Keras provides built-in datasets such as MNIST, CIFAR-10, CIFAR-100, IMDB, and Fashion-MNIST. These can be loaded with a single call (datasets.mnist.load_data()), returning NumPy arrays. Standard preprocessing steps—such as normalizing pixel values to the range \([0, 1]\)—can then be applied to ease optimization.
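
  • A minimal loading-and-normalization sketch using these built-in datasets:

from tensorflow.keras import datasets

# Load MNIST as NumPy arrays and scale pixel values into [0, 1].
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)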

  • The following figure shows the first 450 images of the MNIST dataset displayed in a grid using Matplotlib, illustrating how TensorFlow and Keras make dataset visualization straightforward.

Figure 8.2.3

  • Visual inspection of training data is a best practice: it helps confirm correct preprocessing and offers intuition about the dataset’s structure.

Model Compilation, Training, and Evaluation

  • Once a model is defined, TensorFlow requires it to be compiled with:

    • An optimizer (e.g., Adam, SGD).
    • A loss function (e.g., categorical cross-entropy).
    • Evaluation metrics (e.g., accuracy).
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
  • Training then proceeds with model.fit(training_data, training_labels, epochs=N), while evaluation is done using model.evaluate(test_data, test_labels). For the MNIST dataset, even a simple feedforward network achieves test accuracy exceeding 97%.
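
  • Putting the pieces above together, a minimal end-to-end sketch (the epoch count is arbitrary; the architecture mirrors the snippet above) loads the data, builds and compiles the model, trains, and evaluates:

from tensorflow.keras import datasets, layers, models

# Load and normalize MNIST.
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Same architecture as above: flatten, ReLU hidden layer, softmax output.
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(100, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"test accuracy: {test_acc:.3f}")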

Inspecting Models

  • TensorFlow provides introspection utilities such as model.summary(), which prints each layer, its output shape, and the number of parameters. For example, a simple MNIST classifier might report:

    • Flatten layer: 0 parameters.
    • Dense (100 units): 78,500 parameters (784 × 100 weights + 100 biases).
    • Dense (10 units): 1,010 parameters (100 × 10 weights + 10 biases).
    • Total: 79,510 trainable parameters.
  • This transparency makes TensorFlow suitable not only for rapid prototyping but also for educational settings, where understanding layer dimensions and parameter counts is critical.

Integration with TPUs and Deployment

  • One of TensorFlow’s strongest differentiators is its tight integration with Google Cloud TPUs, making it possible to scale from local GPU experiments to large distributed TPU clusters with minimal code changes. Additionally, TensorFlow offers deployment pathways to mobile (TensorFlow Lite), web (TensorFlow.js), and production pipelines (TF-Serving).

  • This end-to-end ecosystem has made TensorFlow the framework of choice in many industrial settings, even as PyTorch has dominated academic research.

Performance and Cost Efficiency

  • The trajectory of deep learning has been driven not only by innovations in algorithms and architectures, but also by improvements in hardware performance and cost efficiency. Scaling laws in deep learning suggest that increases in available compute directly translate into improvements in model quality (Kaplan et al., 2020). Thus, tracking the evolution of performance and cost metrics provides insight into why deep learning has advanced so rapidly in the past decade.

  • Performance and cost-efficiency improvements have been as transformative to deep learning as algorithmic breakthroughs. FLOPs per dollar have improved by orders of magnitude, GPUs and TPUs have eclipsed CPUs for parallel workloads, and optimized software stacks have squeezed further gains from hardware. This virtuous cycle of cheaper compute enabling larger models has been a defining factor in the modern AI era.

FLOPs per Dollar

  • A key measure of hardware efficiency is the number of floating-point operations per dollar (FLOPs/$). Over the last decade, FLOPs/$ has improved dramatically, reducing the economic barrier to training increasingly large models. For example, GPUs that once delivered gigaflops (GFLOPs) at high costs now deliver teraflops (TFLOPs) at affordable prices, and TPUs extend this trend even further.

  • The following figure shows how GigaFLOPs per dollar have scaled with the explosion of deep learning. Notice the steep acceleration beginning in the 2010s, coinciding with the widespread adoption of GPUs for neural networks.

Figure 8.1.4

  • This trend parallels the economics of Moore’s Law, but with a sharper trajectory driven by the demand for deep learning. It has enabled researchers at both large labs and smaller institutions to train increasingly sophisticated models.

Hardware Comparisons

  • To contextualize cost and performance, let us compare representative hardware across CPU, GPU, and TPU classes:

    • CPU (Intel Core i7-7700k): 4 cores (8 threads with hyperthreading), 4.2 GHz, ~$385, ~540 GFLOPs FP32.
    • GPU (NVIDIA RTX 2080 Ti): 4352 CUDA cores, 1.6 GHz, 11 GB GDDR6, ~$1199, ~13.4 TFLOPs FP32.
    • GPU (NVIDIA Titan V): 5120 CUDA cores, 640 Tensor cores, 1.5 GHz, 12 GB HBM2, ~$2999, ~14 TFLOPs FP32, ~112 TFLOPs FP16.
    • TPU (Google Cloud TPU v2): 64 GB HBM, cloud rental ~$4.50/hr, ~180 TFLOPs.
  • The differences are striking: GPUs provide over an order of magnitude more compute than CPUs at similar price scales, while TPUs push the frontier further with specialization for tensor workloads.

Benchmarks and Optimizations

  • Raw FLOPs are not the only factor. Optimized software stacks can yield massive performance improvements:

    • Optimized CUDA kernels vs CPU – On CNN benchmarks, GPUs run tens of times faster than CPU-only implementations.
    • Optimized vs non-optimized CUDA kernels – Even within GPU usage, fine-tuned libraries such as cuDNN deliver 2–3x speedups compared to naïve CUDA code.
  • The following figure shows CNN benchmark comparisons between optimized CUDA code and CPU execution, highlighting the vast hardware advantage of GPUs.

Figure 8.1.5

  • The following figure shows CNN benchmark comparisons between optimized and non-optimized CUDA code on GPUs, emphasizing the necessity of software-level optimization.

Figure 8.1.5

Economic and Research Implications

  • The reduction in compute cost has had several critical implications:

    1. Accessibility – Research groups without access to large-scale supercomputers can now train models of significant scale using commodity GPUs or cloud-based TPUs.
    2. Scaling Laws – Studies have shown that model performance improves predictably with more parameters, data, and compute (Kaplan et al., 2020). Cheaper FLOPs have allowed these scaling laws to be exploited in practice.
    3. Industry Competition – Cloud providers (Google, Amazon, Microsoft) compete on cost-effective training infrastructure, driving further optimization.
    4. Algorithmic Innovation – As hardware enabled bigger models, innovations like transformers (Vaswani et al., 2017) flourished, since their training became computationally feasible.

Data Pipelines and Efficiency Strategies

  • While raw compute power is critical to deep learning, real-world training often becomes bottlenecked by data movement rather than arithmetic. Even the most powerful GPUs or TPUs can sit idle if they are not fed data at sufficient throughput. Designing efficient data pipelines has therefore become a key consideration in both research and production systems.
  • Efficient data pipelines are just as important as raw compute when training deep learning models. By leveraging SSDs, prefetching, parallel loading, and framework utilities like PyTorch DataLoader and TensorFlow tf.data, practitioners can maximize throughput and minimize idle GPU time. As datasets grow larger and models more complex, pipeline optimization will remain a critical determinant of training efficiency.

The Memory Transfer Bottleneck

  • Training workloads typically involve multiple layers of memory hierarchy:

    1. Disk Storage (HDD/SSD) – Persistent storage where raw datasets reside.
    2. System RAM – Staging area for data once loaded from disk.
    3. GPU/TPU VRAM – High-bandwidth memory directly accessible by accelerators.
  • Moving data from disk to GPU memory is costly relative to compute time. For instance:

    • Reading from an HDD is orders of magnitude slower than from an SSD.
    • Copying large batches repeatedly between CPU RAM and GPU VRAM can dominate runtime if not managed carefully.

Mitigating Data Transfer Delays

  • Several strategies are commonly employed to reduce stalls caused by data movement:

    • Preloading into RAM – Keeping frequently used datasets in system memory avoids repeated disk I/O.
    • Using SSDs instead of HDDs – SSDs drastically reduce latency in accessing large image or text corpora.
    • Prefetching with multiple CPU threads – Overlapping data loading with GPU computation ensures that new batches are ready before the accelerator finishes the current step.
    • Pinned Memory – Allocating page-locked memory allows faster host-to-device transfers.
  • In modern frameworks, many of these strategies are abstracted away into utilities, making best practices accessible to practitioners.

Framework-Level Solutions

  • Both PyTorch and TensorFlow provide high-level APIs to handle efficient data loading:

    • PyTorch DataLoader – Supports batching, shuffling, multiprocessing, and pinned memory flags. Custom Dataset classes allow streaming from large on-disk corpora without exhausting RAM.
    • TensorFlow tf.data – A pipeline abstraction that enables prefetching, shuffling, parallel reading, caching, and on-the-fly transformations. TensorFlow automatically overlaps these with GPU execution when possible.
  • These utilities are crucial when training on large datasets such as ImageNet, COCO, or large-scale text corpora used in transformer models.
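
  • A minimal tf.data pipeline sketch, assuming TensorFlow 2.x and the built-in MNIST arrays used earlier:

import tensorflow as tf
from tensorflow.keras import datasets

(x_train, y_train), _ = datasets.mnist.load_data()

# Build an input pipeline that shuffles, batches, and prefetches in the background.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
    .shuffle(buffer_size=10_000)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)   # overlap data preparation with model execution
)

for images, labels in train_ds.take(1):
    print(images.shape, labels.shape)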

Batch Size and Throughput

  • Another central factor is batch size, which determines how much data is processed per training step:

    • Larger batch sizes improve throughput by better utilizing GPU cores, but require more memory.
    • Smaller batch sizes may improve generalization in some cases (Keskar et al., 2017), but risk under-utilizing hardware.
    • Gradient accumulation strategies can emulate large batch sizes without requiring massive VRAM (see the sketch after this list).
  • Optimizing batch size is therefore a trade-off between statistical efficiency and hardware utilization.
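
  • A minimal gradient-accumulation sketch in PyTorch (toy model and random data; the accumulation factor is arbitrary):

import torch
import torch.nn as nn

model = nn.Linear(128, 10)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

accum_steps = 4   # emulate a 4x larger batch without 4x the memory

optimizer.zero_grad()
for step in range(100):
    xb, yb = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = loss_fn(model(xb), yb) / accum_steps   # scale so the summed gradient matches the large batch
    loss.backward()                               # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()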

Data Augmentation and On-the-Fly Processing

  • For domains like computer vision and speech, data augmentation is essential to prevent overfitting. However, augmentations (rotations, cropping, noise addition, spectrogram transforms) can be computationally heavy. Best practices include:

    • Performing augmentations on CPU threads in parallel with GPU training.
    • Using libraries like Albumentations (CV) or torchaudio for efficient transformations.
    • Leveraging GPU-accelerated augmentations (e.g., NVIDIA DALI) for extreme throughput.
  • This ensures that augmentations do not create new bottlenecks in the pipeline.
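
  • A typical CPU-side augmentation pipeline sketch using torchvision.transforms (the normalization statistics below are placeholders, not dataset-specific values):

from torchvision import transforms

# Applied on the fly by DataLoader workers, so the GPU is not stalled by preprocessing.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # placeholder statistics
])

# Typically passed to a dataset, e.g. torchvision.datasets.CIFAR10(..., transform=train_transform)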

The End-to-End Pipeline

  • Modern deep learning pipelines can be conceptualized as:

    1. Storage – Datasets on disk or cloud.
    2. Preprocessing – Normalization, augmentation, tokenization.
    3. Staging – Prefetching and caching into RAM.
    4. Batching – Packing data into GPU-sized batches.
    5. Accelerator Feed – Transferring data into VRAM with pinned memory.
    6. Compute – Model forward pass, backward pass, optimizer updates.
  • Each stage must be tuned to ensure accelerators remain saturated with work. In practice, profiling tools (e.g., NVIDIA Nsight, TensorBoard profiler) help identify bottlenecks across these stages.

Scaling Laws and the Historical Trajectory of Deep Learning

  • The remarkable progress of deep learning over the past decade cannot be explained by algorithmic innovations alone. Instead, much of the trajectory can be understood through scaling laws: empirical regularities showing that model performance improves predictably as a function of model size, dataset size, and compute.
  • Scaling laws provide a unifying perspective on the rapid rise of deep learning. They show that progress has been as much about hardware, cost efficiency, and pipeline optimization as about new algorithms. By making larger models feasible, improvements in compute have enabled the transformer revolution, foundation models, and the current wave of generative AI. Looking forward, scaling laws suggest that unless fundamental bottlenecks in compute, data, or energy are reached, deep learning progress will continue to be driven by scale.

The Emergence of Scaling Laws

  • Early studies (Hestness et al., 2017) observed that test error decreases as a power law with increasing training data. More recently, Kaplan et al. (2020) demonstrated that for language models, performance follows smooth power-law scaling in three key variables:

    1. Model size (parameters)
    2. Dataset size (tokens)
    3. Compute (FLOPs)
  • These results suggest that deep learning systems do not “saturate” abruptly, but instead continue improving as resources scale, at least until constrained by optimization stability or data availability.

  • Mathematically, the scaling relation for loss can be expressed as:

    \[L(N) = L_\infty + k \cdot N^{-\alpha}\]
    • where:

      • \(L(N)\) is the loss at scale \(N\) (parameters, tokens, or FLOPs),
      • \(L_\infty\) is the irreducible loss (Bayes error),
      • \(k\) is a constant,
      • \(\alpha\) is the scaling exponent.
  • This empirical law has been validated across multiple domains including vision, language, and reinforcement learning.
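
  • To make the formula concrete, the short sketch below evaluates the power law for a few model sizes; the constants are invented for illustration and are not fitted values from any paper:

# Illustrative only: L_inf, k, and alpha below are made-up constants.
L_inf, k, alpha = 1.7, 7.0, 0.076   # irreducible loss, scale constant, scaling exponent

def predicted_loss(n_params):
    """Power-law scaling L(N) = L_inf + k * N^(-alpha)."""
    return L_inf + k * n_params ** (-alpha)

for n in [1e6, 1e8, 1e10, 1e12]:
    print(f"N = {n:.0e} parameters -> predicted loss {predicted_loss(n):.3f}")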

Hardware and Cost as Enablers

  • Scaling laws can only be exploited when compute and memory resources are available at scale. Improvements in FLOPs per dollar, the advent of GPU/TPU accelerators, and pipeline optimizations have been decisive in making it possible to train models with:

    • Billions of parameters (e.g., GPT-3, with 175B parameters; Brown et al., 2020).
    • Datasets with hundreds of billions of tokens (e.g., The Pile, curated for large language models).
    • Training runs consuming thousands of petaflop/s-days of compute.
  • Without the exponential reduction in cost per FLOP (as shown earlier in Figure 8.1.4), such scaling would have been economically prohibitive.

Historical Progression

  • The history of deep learning hardware and model growth reflects a co-evolutionary cycle:

    1. CPU era (pre-2010) – Early neural networks trained on CPUs, limited to millions of parameters.
    2. GPU era (2012 onward) – The success of AlexNet on ImageNet (Krizhevsky et al., 2012) demonstrated GPUs’ power, enabling deeper CNNs.
    3. TPU era (2016 onward) – Google’s TPUs accelerated tensor workloads and reduced training costs for industrial-scale models.
    4. Foundation model era (2018 onward) – Transformers, combined with GPU/TPU clusters, enabled massive autoregressive models such as BERT (Devlin et al., 2019) and GPT-3.
    5. Current trajectory – Scaling continues, with models like GPT-4 and beyond trained on unprecedented compute budgets, exploiting the predictable gains suggested by scaling laws.

Implications of Scaling Laws

  • Scaling laws have several key implications:

    • Predictability – Performance can be forecast in advance, guiding investment in compute resources.
    • Inevitability – As long as compute cost declines, larger models will continue to outperform smaller ones.
    • Data hunger – Training ever-larger models requires not just compute but vast, high-quality datasets.
    • Efficiency research – Work on techniques like mixture-of-experts, sparsity, and retrieval-augmented models aims to “bend” the scaling curve by achieving similar performance at lower compute.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DeepLearningHWSW,
  title   = {Deep Learning Hardware and Software},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}