CS231n • Deep Learning Hardware and Software
- Hardware for Deep Learning
- Algorithms for Efficient Inference
- Hardware for Efficient Inference
- Algorithms for Efficient Training
- Hardware for Efficient Training
- Deep Learning Software
- Deep learning Frameworks
- Static vs dynamic graphs
- Citation
Hardware for Deep Learning
- Deep ConvNets, recurrent nets, and deep reinforcement learning are shaping many applications and changing our lives.
- Examples: self-driving cars, machine translation, AlphaGo, and so on.
- But the current trend says that if we want higher accuracy, we need larger (deeper) models.
- The winning model size in the ImageNet competition increased 16x from 2012 to 2015 in pursuit of higher accuracy.
- Deep Speech 2 (at Baidu) needs 10x the training operations of Deep Speech 1, and that's in only one year!
- This trend brings three challenges:
- Model Size
- It's hard to deploy larger models on our PCs, mobile phones, or cars.
- Speed
- ResNet-152 took 1.5 weeks to train to reach its 6.16% top-5 error!
- Long training times limit ML researchers' productivity.
- Energy Efficiency
- AlphaGo: 1920 CPUs and 280 GPUs, with a $3000 electric bill per game.
- Running such models on a mobile phone would drain the battery.
- Google mentioned in their blog that if every user used Google speech recognition for 3 minutes a day, they would have to double their data centers!
- Where is the Energy Consumed?
- Larger model => more memory references => more energy.
- We can improve the efficiency of deep learning through algorithm-hardware co-design, working from both the hardware and the algorithm perspectives.
- Hardware 101: the Family
- General Purpose
# Usable for any application
- CPU
# Latency-oriented: a single strong thread, like one elephant
- GPU
# Throughput-oriented: many small threads, like an army of ants
- GPGPU
- Specialized HW
# Tuned for a domain of applications
- FPGA
# Programmable logic; cheaper but less efficient
- ASIC
# Fixed logic, designed for a specific application (e.g., it can be designed for deep learning applications)
- Hardware 101: Number Representation
- Numbers in a computer are represented with a discrete number of bits.
- Going from 32-bit to 16-bit floating-point operations is very good for hardware: it is much more energy efficient.
Algorithms for Efficient Inference
- Pruning neural networks
- The idea: can we remove some of the weights/neurons while the NN still behaves the same?
- In 2015, Han et al. pruned AlexNet from 60 million parameters down to 6 million using this pruning idea!
- Pruning can be applied to both CNNs and RNNs; applied iteratively, the network recovers the same accuracy as the original.
- Pruning actually happens in humans:
- Newborn(50 Trillion Synapses) ==> 1 year old(1000 Trillion Synapses) ==> Adolescent(500 Trillion Synapses)
- Algorithm (a minimal sketch follows this list):
- Get a trained network.
- Evaluate the importance of the neurons.
- Remove the least important neuron.
- Fine-tune the network.
- If we want to continue pruning, go back to step 2; otherwise stop.
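As a rough illustration of steps 2-3 (magnitude as the importance measure is one common heuristic, not necessarily the paper's exact criterion), a minimal NumPy sketch:

```python
import numpy as np

def prune_smallest(weights, frac=0.5):
    """Zero out the `frac` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), frac)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

W = np.random.randn(4, 4).astype(np.float32)  # a toy trained weight matrix
W_pruned, mask = prune_smallest(W, frac=0.5)
# During fine-tuning, keep pruned positions at zero by masking the update:
# W = (W - lr * grad) * mask
```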
- Weight Sharing
- The idea is to reduce the number of distinct values our model's weights can take.
- Trained Quantization:
- Example: the weight values 2.09, 2.12, 1.92, and 1.87 are all replaced by 2.
- To do that, we can run k-means clustering on (for example) a filter's weights to reduce the number of distinct values in it. This also reduces the number of operations used when computing the gradients.
- After trained quantization, the weights are discrete.
- Trained quantization can significantly reduce the number of bits we need per number in each layer.
- Pruning and trained quantization can work together to reduce the model size (a clustering sketch follows).
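A minimal sketch of the clustering step, assuming a simple 1-D Lloyd's k-means (the actual method additionally fine-tunes the shared centroids with gradients):

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means: cluster weights around k shared centroids."""
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):                 # avoid empty clusters
                centroids[j] = values[assign == j].mean()
    return centroids, assign

W = np.array([2.09, 2.12, 1.92, 1.87, -1.03, -0.98, 0.01, 0.05])
centroids, assign = kmeans_1d(W, k=3)
W_shared = centroids[assign]   # 2.09, 2.12, 1.92, 1.87 all map to ~2.0
# Storage: a log2(k)-bit cluster index per weight plus a tiny codebook.
```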
- Huffman Coding
- We can use Huffman coding to compress the bit representation of the weights:
- Infrequent weights: use more bits to represent.
- Frequent weights: use fewer bits to represent.
- Pruning + trained quantization + Huffman coding together are called deep compression (a toy Huffman sketch follows the figures).
![](assets/deeplearning-HW-SW/37.png)
![](assets/deeplearning-HW-SW/38.png)
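As a toy illustration of the Huffman step (a standard textbook construction, not the paper's exact implementation), building a prefix code over quantized cluster indices:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)   # two least frequent subtrees
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]

indices = [0, 0, 0, 0, 1, 1, 2]               # quantized weight indices
print(huffman_codes(indices))                 # {0: '1', 1: '01', 2: '00'}
```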
- SqueezeNet
- All the approaches discussed so far compress a pretrained model. Can we instead design a new architecture that saves memory and computation?
- SqueezeNet reaches AlexNet accuracy with 50x fewer parameters and, after compression, a model size under 0.5 MB.
- SqueezeNet can be compressed even further by applying deep compression to it.
- These models are far more energy efficient and much faster.
- Deep compression has been applied in industry by Facebook and Baidu.
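SqueezeNet's building block is the Fire module: a 1x1 "squeeze" convolution feeding parallel 1x1 and 3x3 "expand" convolutions. A minimal PyTorch sketch (channel counts follow the paper's first Fire module, but are otherwise illustrative):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze channels with cheap 1x1 convs, then expand with mixed 1x1/3x3."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))                    # shrink channels first
        return torch.cat([self.relu(self.expand1x1(x)),   # then expand cheaply
                          self.relu(self.expand3x3(x))], dim=1)

fire = Fire(in_ch=96, squeeze_ch=16, expand_ch=64)
out = fire(torch.randn(1, 96, 55, 55))                    # -> (1, 128, 55, 55)
```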
- Quantization
- Algorithm (quantizing the weights and activations):
- Train with floats.
- Quantize the weights and activations:
- Gather statistics for the weights and activations.
- Choose a proper radix point position (a sketch follows this list).
- Fine-tune in float format.
- Convert to fixed-point format.
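A toy illustration of the "choose the radix point" step (the 8-bit width, rounding, and clipping choices here are illustrative assumptions):

```python
import numpy as np

def to_fixed_point(x, total_bits=8):
    """Pick the radix point from the data's range, then round and clip."""
    int_bits = int(np.ceil(np.log2(np.abs(x).max())))  # bits for the integer part
    frac_bits = total_bits - 1 - int_bits              # 1 bit reserved for sign
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(x * scale),
                -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1)
    return q / scale                                   # dequantize for comparison

w = np.random.randn(1000).astype(np.float32)           # stand-in for layer stats
print("max error:", np.abs(w - to_fixed_point(w)).max())
```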
- Low Rank Approximation
- Another size-reduction technique used for CNNs.
- The idea: decompose a layer into two lower-rank layers and then fine-tune the decomposed layers.
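A minimal sketch for a fully connected layer via truncated SVD (conv layers use analogous tensor decompositions); the rank r = 32 is an arbitrary illustrative choice, and trained weight matrices tend to compress far better than this random stand-in:

```python
import numpy as np

m, n, r = 256, 512, 32
W = np.random.randn(m, n).astype(np.float32)   # stand-in for a trained layer

# Replace y = W @ x with y = A @ (B @ x): two thinner layers of rank r.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                           # shape (m, r)
B = Vt[:r, :]                                  # shape (r, n)

print("params:", m * n, "->", r * (m + n))     # 131072 -> 24576
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```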
- Binary / Ternary Net
- Can we use only three values to represent the weights of a NN?
- The model becomes much smaller with only -1, 0, and 1.
- This idea was published in 2017: Zhu, Han, Mao, Dally, "Trained Ternary Quantization," ICLR'17.
- The ternary weights are what get used after training, at inference time.
- Tried on AlexNet, it reaches almost the same error as the full-precision network.
- The number of operations per register increases: https://xnor.ai/
- Winograd Transformation
- Based on 3x3 Winograd convolutions, which need fewer multiplications than ordinary (direct) convolution.
- cuDNN 5 uses Winograd convolutions, which improved its speed.
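The flavor of the trick in 1-D (the 3x3 2-D case applies the same transforms along both axes): Winograd F(2,3) produces two outputs of a 3-tap filter with 4 multiplications instead of 6. A small self-checking sketch:

```python
import numpy as np

def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap filter using 4 multiplies, not 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d, g = np.random.randn(4), np.random.randn(3)   # input tile and filter
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)  # same result, fewer multiplies
```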
Hardware for Efficient Inference
- Many ASICs have been developed for deep learning, all of which share the same goal: minimizing memory access.
- Eyeriss (MIT)
- DaDianNao
- TPU (Google's Tensor Processing Unit)
- It fits in a disk-drive slot in the server.
- Up to 4 cards per server.
- It consumes far less power than a GPU, and the chip is smaller.
- EIE (Stanford)
- By Han et al., ISCA 2016.
- It doesn't store zero weights and performs quantization of the numbers in hardware.
- Han says EIE has better throughput and energy efficiency.
Algorithms for Efficient Training
- Parallelization
- Data parallel: run multiple inputs in parallel (a sketch follows this list).
- E.g., run two images at the same time!
- Run multiple training examples in parallel.
- Limited by batch size.
- Gradients have to be applied by a master node.
- Model parallel
- Split up the model, i.e., the network.
- Split the model over multiple processors, e.g., by layer.
- Hyper-Parameter Parallel
- Try many alternative networks in parallel.
- Easy to get 16-64 GPUs training one model in parallel.
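In PyTorch, for instance, single-machine data parallelism is nearly a one-liner with nn.DataParallel (a minimal sketch; newer code would typically prefer DistributedDataParallel):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10)              # a toy model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # replicate across GPUs, split the batch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(256, 1024, device=device)
out = model(x)                           # each GPU sees a slice of the 256 inputs
```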
- Mixed Precision with FP16 and FP32
- We discussed that using 16-bit numbers throughout the model cuts the energy cost by about 4x.
- Can we run a model entirely in 16-bit numbers? We can partially do this with mixed FP16 and FP32: we use 16 bits almost everywhere, but at some points we need FP32.
- For example, when multiplying FP16 by FP16, we accumulate the product in FP32.
- Trained this way, models come close to the accuracy of famous networks like AlexNet and ResNet.
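PyTorch later automated exactly this recipe with torch.cuda.amp; a minimal sketch (requires a CUDA GPU; the model and learning rate are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(512, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()        # rescales loss to avoid FP16 underflow

for _ in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():         # FP16 where safe, FP32 where needed
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)                        # unscales grads, then optimizer step
    scaler.update()
```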
- Model Distillation
- The question: can we use one or more senior (well-trained) neural networks to guide a student (new) neural network?
- For more information, see Hinton et al., Dark Knowledge / "Distilling the Knowledge in a Neural Network."
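A common formulation of the distillation loss (the temperature T and mixing weight alpha are illustrative hyperparameters):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Match the teacher's softened outputs, plus the usual hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)  # T^2 restores gradient scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10, requires_grad=True)  # student logits
t = torch.randn(8, 10)                      # teacher logits (frozen)
y = torch.randint(0, 10, (8,))
distillation_loss(s, t, y).backward()
```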
- DSD: Dense-Sparse-Dense Training
- Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017
- It provides better regularization.
- The idea: train the model (dense), apply pruning to it (sparse) and retrain, then restore the pruned connections and train the whole network again (dense again).
- DSD produces the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy.
- This improves performance considerably across many deep learning models.
Hardware for Efficient Training
- GPUs for training:
- Nvidia PASCAL GP100 (2016)
- Nvidia Volta GV100 (2017)
- It can perform mixed-precision operations!
- So powerful.
- The new nuclear bomb!
- Google announced "Google Cloud TPU" in May 2017!
- Cloud TPU delivers up to 180 teraflops to train and run machine learning models.
- From Google: "One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod."
- We have moved from PC Era ==> Mobile-First Era ==> AI-First Era
Deep Learning Software
- This section changes a lot every year in CS231n due to rapid changes in the deep learning software.
- CPU vs GPU
- The GPU (graphics card) was originally developed to render graphics for games, 3D media, etc.
- NVIDIA vs AMD
- Deep learning chose NVIDIA over AMD GPUs because NVIDIA pushes deep learning research forward and makes its architecture more suitable for deep learning.
- A CPU has fewer cores, but each core is much faster and more capable: great for sequential tasks. A GPU has many more cores, but each core is much slower and "dumber": great for parallel tasks.
- GPU cores need to work together, and the GPU has its own memory.
- Matrix multiplication is one of the operations best suited to GPUs: it consists of M×N independent dot products that can be computed in parallel.
- The convolution operation can also be parallelized, because it consists of independent operations.
- GPU programming frameworks:
- CUDA (NVIDIA only)
- Write C-like code that runs directly on the GPU.
- It's hard to write well-optimized code that runs on the GPU, which is why NVIDIA provides high-level APIs.
- Higher level APIs: cuBLAS, cuDNN, etc
- cuDNN implements backprop, convolution, recurrent layers, and a lot more for you!
- In practice you won't write parallel code yourself; you will use code implemented and optimized by others!
- OpenCL
- Similar to CUDA, but runs on any GPU.
- Usually slower.
- Doesn't have much support from deep learning software yet.
- There are a lot of courses for learning parallel programming.
- If you aren’t careful, training can bottleneck on reading data and transferring to GPU. So the solutions are:
- Read all the data into RAM. # If possible
- Use SSD instead of HDD
- Use multiple CPU threads to prefetch data!
- While the GPU is computing, a CPU thread fetches the data for you.
- Many frameworks implement this for you because it's a bit painful!
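In PyTorch, for example, this prefetching comes down to a couple of DataLoader arguments (a minimal sketch with synthetic data; the copy at the end assumes a CUDA device):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(10_000, 3, 32, 32),
                     torch.randint(0, 10, (10_000,)))

# num_workers spawns CPU workers that prepare upcoming batches while the GPU
# is busy; pin_memory makes the host-to-GPU copy faster and async-friendly.
loader = DataLoader(data, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True)

for images, labels in loader:
    images = images.to("cuda", non_blocking=True)  # overlap copy with compute
    labels = labels.to("cuda", non_blocking=True)
    break                                          # one batch is enough here
```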
Deep learning Frameworks
- It's a super fast-moving field!
- Currently available frameworks:
- TensorFlow (Google)
- Caffe (UC Berkeley)
- Caffe2 (Facebook)
- Torch (NYU / Facebook)
- PyTorch (Facebook)
- Theano (U Montreal)
- Paddle (Baidu)
- CNTK (Microsoft)
- MXNet (Amazon)
- The instructor thinks that you should focus on TensorFlow and PyTorch.
- The point of deep learning frameworks:
- Easily build big computational graphs.
- Easily compute gradients in computational graphs.
- Run it all efficiently on a GPU (cuDNN, cuBLAS).
- NumPy doesn’t run on GPU.
- Most frameworks try to look like NumPy in the forward pass; they then compute the gradients for you.
- TensorFlow (Google)
- The code has two parts:
- Define computational graph.
- Run the graph and reuse it many times.
- TensorFlow uses a static graph architecture.
- TensorFlow variables live in the graph, while placeholders are fed on each run.
- The global initializer function initializes the variables that live in the graph.
- Use predefined optimizers and losses.
- You can build a full layer with the tf.layers.dense function (a sketch putting these pieces together follows).
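Putting those pieces together, a minimal TF 1.x-style two-part program (shapes, learning rate, and data are illustrative):

```python
import numpy as np
import tensorflow as tf  # assumes the TensorFlow 1.x graph-mode API

# Part 1: define the computational graph (nothing runs yet).
x = tf.placeholder(tf.float32, shape=(None, 64))
y = tf.placeholder(tf.float32, shape=(None, 1))
h = tf.layers.dense(x, units=32, activation=tf.nn.relu)
pred = tf.layers.dense(h, units=1)
loss = tf.losses.mean_squared_error(y, pred)
train_op = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)

# Part 2: run the same graph many times, feeding the placeholders.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        xb, yb = np.random.randn(16, 64), np.random.randn(16, 1)
        _, l = sess.run([train_op, loss], feed_dict={x: xb, y: yb})
```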
- Keras (High level wrapper):
- Keras is a layer on top of TensorFlow that makes common things easy to do.
- So popular!
- It trains a full deep NN in a few lines of code, as the sketch below shows.
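For instance, a tiny Keras classifier (layer sizes and synthetic data are purely illustrative):

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x = np.random.randn(512, 32).astype("float32")
y = np.random.randint(0, 10, size=(512,))
model.fit(x, y, batch_size=32, epochs=2)   # Keras runs the whole training loop
```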
- There are a lot of high-level wrappers:
- Keras
- TFLearn
- TensorLayer
- tf.layers
# Ships with TensorFlow
- TF-Slim
# Ships with TensorFlow
- tf.contrib.learn
# Ships with TensorFlow
- Sonnet
# New, from DeepMind
- TensorFlow provides pretrained models that you can use for transfer learning.
- TensorBoard adds logging to record losses and stats. Run the server and get pretty graphs!
- TensorFlow supports distributed execution if you want to split your graph across several nodes.
- TensorFlow was actually inspired by Theano; it shares many of the same ideas and the same structure.
- PyTorch (Facebook)
- It has three layers of abstraction:
- Tensor: an imperative ndarray, but it runs on the GPU
# TensorFlow equivalent: a Numpy array
- Variable: a node in a computational graph; stores data and gradients
# TensorFlow equivalents: Tensor, Variable, Placeholder
- Module: a NN layer; may store state or learnable weights
# TensorFlow equivalent: tf.layers
- In PyTorch the graph is built and run in the same loop you are executing, which makes debugging easier. This is called a dynamic graph.
- In PyTorch you can define your own autograd functions by writing forward and backward for tensors, though most of the time what you need is already implemented for you (a sketch follows).
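A minimal sketch of a custom autograd function, reimplementing ReLU (the canonical example):

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)          # stash the input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * (x > 0).to(grad_output.dtype)  # gate the gradient

x = torch.randn(5, requires_grad=True)
MyReLU.apply(x).sum().backward()
print(x.grad)                             # 1 where x > 0, else 0
```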
- torch.nn is a high-level API, like Keras for TensorFlow: you can assemble models from predefined layers.
- You can also define your own nn module!
- PyTorch also provides optimizers, like TensorFlow (see the combined sketch below).
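A minimal torch.nn + torch.optim training sketch (the architecture, Adam, and the synthetic data are illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(64, 100), nn.ReLU(), nn.Linear(100, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
for _ in range(10):                 # dynamic graph: rebuilt on every iteration
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()                 # autograd walks the freshly built graph
    opt.step()
```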
- It provides a DataLoader that wraps a Dataset and handles minibatching, shuffling, and multithreading.
- PyTorch ships excellent, super-easy-to-use pretrained models.
- PyTorch works with Visdom, which is similar to TensorBoard, but TensorBoard seems to be more powerful.
- PyTorch is new and still evolving compared to Torch; it's still in a beta state.
- PyTorch is best for research.
- TensorFlow builds the graph once, then runs it many times (called a static graph).
- In each PyTorch iteration, we build a new graph (called a dynamic graph).
Static vs dynamic graphs
- Optimization:
- With static graphs, framework can optimize the graph for you before it runs.
- Serialization:
- Static: once the graph is built, you can serialize it and run it without the code that built it, e.g., run the graph from C++.
- Dynamic: you always need to keep the code around.
- Conditionals:
- Easier in dynamic graphs, more complicated in static graphs (see the sketch below).
- Loops:
- Easier in dynamic graphs, more complicated in static graphs.
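For example, a data-dependent branch is just ordinary Python in a dynamic-graph framework (a minimal PyTorch sketch; the equivalent static-graph TF 1.x code would need a special op like tf.cond):

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Linear(16, 16)
        self.w2 = nn.Linear(16, 16)

    def forward(self, x):
        if x.sum() > 0:        # plain Python control flow on a data value
            return self.w1(x)
        return self.w2(x)      # the graph built this iteration follows the data

net = DynamicNet()
out = net(torch.randn(4, 16))  # which branch ran depends on this input
```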
- TensorFlow Fold makes dynamic graphs easier in TensorFlow through dynamic batching.
- Dynamic graph applications include recurrent networks and recursive networks.
- Caffe2 uses static graphs; it can train models in Python and also runs on iOS and Android.
- TensorFlow and Caffe2 are used a lot in production, especially on mobile.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DeepLearningHardwareAndSoftware,
title = {Deep Learning Hardware and Software},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
year = {2020},
note = {\url{https://aman.ai}}
}