CS231n • Deep Learning Hardware and Software
 Hardware for Deep Learning
 Algorithms for Efficient Inference
 Hardware for Efficient Inference
 Algorithms for Efficient Training
 Deep Learning Software
 Deep Learning Frameworks
 Static vs dynamic graphs
 Citation
Hardware for Deep Learning
 Deep ConvNets, recurrent nets, and deep reinforcement learning are shaping many applications and changing many of our lives: self-driving cars, machine translation, AlphaGo, and so on.
 The current trend is that achieving higher accuracy requires larger (deeper) models.
 The winning model size in the ImageNet competition increased 16x from 2012 to 2015 in pursuit of higher accuracy.
 Deep Speech 2 requires 10x the training operations of Deep Speech 1, and that happened in only one year!
# At Baidu
 This trend raises three challenges:
 Model Size
 It's hard to deploy large models on our PCs, mobile devices, or cars.
 Speed
 ResNet-152 took 1.5 weeks to train to reach its 6.16% error rate!
 Long training times limit ML researchers' productivity.
 Energy Efficiency
 AlphaGo: 1,920 CPUs and 280 GPUs, with a $3,000 electric bill per game.
 Running such models on a mobile device would drain its battery.
 Google mentioned in a blog post that if every user used Google voice search for 3 minutes a day, they would have to double their datacenters!
 Where is the Energy Consumed?
 larger model => more memory references => more energy
 We can improve the efficiency of deep learning through algorithm-hardware co-design, i.e., from both the hardware and the algorithm perspectives.
 Hardware 101: the Family
 General Purpose
# Can run any application
 CPU
# Latency-oriented: a single strong thread, like one elephant
 GPU
# Throughput-oriented: many small, slower threads, like an army of ants
 GPGPU
# A GPU used for general-purpose (non-graphics) computation
 Specialized HW
# Tuned for a domain of applications
 FPGA
# Programmable logic; cheaper to design for but less efficient
 ASIC
# Fixed logic, designed for a specific application (can be designed for deep learning applications)
 Hardware 101: Number Representation
 Numbers in a computer are represented in discrete memory with a finite number of bits.
 Going from 32-bit to 16-bit floating-point operations is very good for hardware: it is far more energy efficient.
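 As a quick illustration (a minimal NumPy sketch, not from the lecture), halving the bit width halves the memory that has to be moved for every weight:

```python
import numpy as np

# One million weights stored as 32-bit vs. 16-bit floats.
w32 = np.random.randn(1_000_000).astype(np.float32)
w16 = w32.astype(np.float16)

print(w32.nbytes)  # 4000000 bytes
print(w16.nbytes)  # 2000000 bytes: half the memory traffic per weight
# The trade-off: float16 keeps ~3 decimal digits of precision vs. ~7 for float32.
print(np.abs(w32 - w16.astype(np.float32)).max())  # small rounding error
```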
Algorithms for Efficient Inference
 Pruning Neural Networks
 The idea: can we remove some of the weights/neurons and have the NN still behave the same?
 In 2015, Han et al. took AlexNet from 60 million parameters down to 6 million by pruning!
 Pruning can be applied to both CNNs and RNNs; done iteratively, the pruned network reaches the same accuracy as the original.
 Pruning actually happens in humans:
 Newborn (50 trillion synapses) ==> 1 year old (1,000 trillion synapses) ==> Adolescent (500 trillion synapses)
 Algorithm:
 Get a trained network.
 Evaluate the importance of the neurons.
 Remove the least important neurons.
 Fine-tune the network.
 If further pruning is needed, go to step 2; otherwise stop.
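 A minimal sketch of one pruning step, using weight magnitude as the (assumed) importance measure; `prune_smallest` is an illustrative helper, not from the lecture:

```python
import numpy as np

def prune_smallest(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), fraction)
    mask = np.abs(weights) >= threshold   # survivors
    return weights * mask, mask           # keep the mask fixed while fine-tuning

w = np.random.randn(256, 256)             # stand-in for a trained layer
w_pruned, mask = prune_smallest(w, fraction=0.9)
print(1 - mask.mean())                     # ~0.9 sparsity; fine-tune, then repeat
```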
 Weight Sharing
 The idea is to reduce the number of distinct values the weights in our models can take.
 Trained Quantization:
 Example: weight values 2.09, 2.12, 1.92, and 1.87 are all replaced by 2.
 To do that, we can run k-means clustering on a layer's weights (a filter, for example) and replace each weight with its cluster centroid (see the sketch below). During fine-tuning, gradients are grouped the same way, which also reduces the work spent on gradient calculations.
 After trained quantization, the weights are discrete.
 Trained quantization can significantly reduce the number of bits we need per weight in each layer.
 Pruning and trained quantization can work together to reduce the size of the model even further.
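 To make the k-means step concrete, here is a minimal weight-sharing sketch (using scikit-learn; the 16-cluster choice is an illustrative assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

w = np.random.randn(64, 64)              # stand-in for one layer's weights
k = 16                                   # 16 shared values -> 4-bit indices
km = KMeans(n_clusters=k, n_init=10).fit(w.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()   # k shared float centroids
indices = km.labels_.reshape(w.shape)    # one small index per weight
w_quantized = codebook[indices]          # decode by codebook lookup
# Storage: 64*64 4-bit indices + 16 floats, instead of 64*64 32-bit floats.
print(np.abs(w - w_quantized).max())
```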
 Huffman Coding
 We can use Huffman coding to compress the bit representation of the weights:
 Infrequent weights: use more bits to represent them.
 Frequent weights: use fewer bits to represent them.
 Pruning + trained quantization + Huffman coding together are called deep compression.
![](assets/deeplearningHWSW/37.png)
![](assets/deeplearningHWSW/38.png)
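 To make the Huffman step concrete, here is a toy sketch over quantized weight indices (standard-library only; not the paper's implementation):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Frequent symbols get short bit strings, rare symbols get long ones."""
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    tie = len(heap)  # tiebreaker so tuples never compare the dicts
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized indices where index 0 dominates (as after pruning + quantization).
indices = [0] * 90 + [1] * 5 + [2] * 3 + [3] * 2
print(huffman_code(indices))  # e.g. 0 -> '1' (1 bit); rare indices get 2-3 bits
```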
 SqueezeNet
 All of the techniques discussed so far start from a pretrained model. Can we instead design a new architecture that saves memory and computation from the start?
 SqueezeNet matches AlexNet's accuracy with 50x fewer parameters.
 SqueezeNet can be compressed even further by applying deep compression to it, bringing the model size below 0.5 MB.
 These models are now far more energy efficient and have sped up a lot.
 Deep compression has been applied in industry, e.g., by Facebook and Baidu.
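 A minimal sketch of SqueezeNet's Fire module in PyTorch (the channel sizes are illustrative; this is not a full SqueezeNet):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with cheap 1x1 convs, then expand with a mix of 1x1 and 3x3
    convs: far fewer parameters than a plain all-3x3 layer."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

out = Fire(96, 16, 64)(torch.randn(1, 96, 55, 55))
print(out.shape)  # torch.Size([1, 128, 55, 55])
```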
 Quantization
 Algorithm (quantizing the weights and activations):
 Train with floating point.
 Quantize the weights and activations:
 Gather the statistics of the weights and activations.
 Choose a proper radix point position.
 Fine-tune in floating-point format.
 Convert to fixed-point format.
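 A minimal sketch of the radix-point step (the 8-bit width and the helper are illustrative assumptions):

```python
import numpy as np

def to_fixed_point(x, total_bits=8):
    """Choose the radix point from the data's range, then round."""
    # Integer bits needed to cover the largest magnitude (plus a sign bit).
    int_bits = max(0, int(np.ceil(np.log2(np.abs(x).max() + 1e-12))) + 1)
    frac_bits = total_bits - int_bits          # radix point position
    scale = 2.0 ** frac_bits
    lo, hi = -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale, frac_bits

acts = np.random.randn(1000) * 3               # stand-in for layer activations
q, frac_bits = to_fixed_point(acts)
print(frac_bits, np.abs(acts - q).max())       # error roughly <= 2**-(frac_bits+1)
```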
 Low Rank Approximation
 Another size-reduction technique used for CNNs.
 The idea is to decompose a conv (or fully connected) layer into two smaller layers and use those in its place.
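 A minimal sketch of the idea using SVD on a fully connected layer (the rank of 32 is an illustrative assumption):

```python
import numpy as np

# Decompose a dense layer W (m x n) into two thinner layers of rank r:
# y = W @ x  ~=  U @ (V @ x), costing r*(m+n) multiplies instead of m*n.
m, n, r = 512, 512, 32
W = np.random.randn(m, n)          # stand-in for a trained weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)
layer1 = Vt[:r, :]                 # (r x n), applied first
layer2 = U[:, :r] * S[:r]          # (m x r), applied second

x = np.random.randn(n)
approx = layer2 @ (layer1 @ x)
print(m * n, r * (m + n))          # 262144 vs 32768 multiply-adds (8x fewer)
rel_err = np.linalg.norm(W @ x - approx) / np.linalg.norm(W @ x)
print(rel_err)  # large for random W; trained weights are far closer to low rank
```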
 Binary / Ternary Net
 Can we represent the weights of a NN using only three numbers?
 The model becomes much smaller when every weight is only -1, 0, or 1.
 This idea was published in 2017: Zhu, Han, Mao, Dally, "Trained Ternary Quantization", ICLR'17.
 It works on an already-trained network.
 Tried on AlexNet, it reaches almost the same error as the full-precision AlexNet.
 With such low-precision weights, more operations fit into each register, increasing throughput: https://xnor.ai/
 Winograd Transformation
 Based on 3x3 Winograd convolutions, which need fewer operations than the ordinary (direct) convolution.
 cuDNN 5 uses Winograd convolutions, which has improved the speed.
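 For intuition, the 1D building block F(2,3) produces two outputs of a 3-tap filter with 4 multiplications instead of the direct method's 6 (the 2D F(2x2,3x3) version analogously needs 16 instead of 36, a 2.25x saving):

```python
import numpy as np

def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap convolution from four inputs,
    using 4 multiplications instead of the direct method's 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d, g = np.random.randn(4), np.random.randn(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(np.allclose(winograd_f23(d, g), direct))  # True
```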
Hardware for Efficient Inference
 Many ASICs have been developed for deep learning, all sharing the same goal: minimize memory access.
 Eyeriss (MIT)
 DaDianNao
 TPU (Google's Tensor Processing Unit)
 It fits into a disk drive slot in the server.
 Up to 4 cards per server.
 It consumes far less power than a GPU, and the chip itself is smaller.
 EIE (Stanford)
 Han et al., ISCA'16.
 It skips zero weights entirely and performs the quantization in hardware.
 EIE achieves better throughput and energy efficiency.
Algorithms for Efficient Training
 Parallelization
 Data Parallel – run multiple inputs in parallel
 E.g., run two images through the model at the same time!
 Run multiple training examples in parallel.
 Limited by the batch size.
 Gradients have to be aggregated and applied by a master node.
 Model Parallel
 Split up the model, i.e., the network itself.
 Split the model over multiple processors, e.g., by layer.
 Hyper-Parameter Parallel
 Try many alternative networks in parallel.
 It's easy to get 16-64 GPUs training one model in parallel.
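 A minimal data-parallel sketch using PyTorch's built-in wrapper (a modern convenience API, not from the lecture):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10)
if torch.cuda.device_count() > 1:
    # Each forward pass splits the batch across GPUs; gradients are
    # summed back on the master device before the optimizer step.
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(256, 1024, device=device)
print(model(x).shape)  # torch.Size([256, 10]); with N GPUs each sees 256/N examples
```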
 Mixed Precision with FP16 and FP32
 As discussed, using 16-bit numbers throughout the model cuts the energy cost by roughly 4x.
 Can we run a model entirely in 16-bit? Partially: with mixed FP16 and FP32, we use 16-bit almost everywhere, but at some points we need FP32.
 For example, when multiplying FP16 values, the product should be accumulated in FP32.
 Models trained this way reach nearly the same accuracy as well-known full-precision baselines like AlexNet and ResNet.
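 A minimal mixed-precision training sketch using PyTorch's automatic mixed precision (a later API than these notes; requires an NVIDIA GPU):

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss so FP16 grads don't underflow

for _ in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # matmuls run in FP16, accumulations in FP32
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)             # master weights stay in FP32
    scaler.update()
```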
 Model Distillation
 The question: can we use a well-trained senior (teacher) network, or an ensemble of them, to guide a new student network?
 For more information, see Hinton et al., "Dark Knowledge" / "Distilling the Knowledge in a Neural Network".
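 A minimal sketch of the distillation loss from Hinton et al.'s formulation (the temperature T=4 is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Match the student's softened distribution to the teacher's;
    the 'dark knowledge' lives in the small probabilities."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # T*T keeps gradient magnitudes comparable to the hard-label loss.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T

s = torch.randn(8, 10)          # student outputs for a batch
t = torch.randn(8, 10)          # frozen teacher outputs for the same batch
print(distillation_loss(s, t))  # usually combined with the usual hard-label loss
```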
 DSD: Dense-Sparse-Dense Training
 Han et al., "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017.
 Provides better regularization.
 The idea: train the model (call this dense), then apply pruning to it (call this sparse).
 After these two steps, we restore the pruned connections and train them again (dense again).
 DSD produces the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy.
 This improves performance considerably in many deep learning models.
 Hardware for Efficient Training
 GPUs for training:
 NVIDIA Pascal GP100 (2016)
 NVIDIA Volta GV100 (2017)
 Can make mixed precision operations!
 So powerful.
 The new nuclear bomb!
 Google announced the "Google Cloud TPU" in May 2017!
 A Cloud TPU delivers up to 180 teraflops to train and run machine learning models.
 Google reported: "One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs; now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod."
 We have moved from the PC Era ==> Mobile-First Era ==> AI-First Era.
Deep Learning Software
 This section changes a lot every year in CS231n due to rapid changes in deep learning software.
 CPU vs GPU
 The GPU (graphics card) was originally developed to render graphics for playing games, making 3D media, etc.
 NVIDIA vs AMD
 Deep learning chose NVIDIA over AMD GPUs because NVIDIA pushes deep learning research forward and also makes its architectures more suitable for deep learning.
 A CPU has fewer cores, but each core is much faster and more capable: great for sequential tasks. A GPU has many more cores, but each core is much slower and "dumber": great for parallel tasks.
 GPU cores need to work together, and the GPU has its own memory.
 Matrix multiplication is one of the operations best suited to GPUs: the MxN output elements are independent and can be computed in parallel, as the sketch below shows.
 The convolution operation can also be parallelized because it consists of independent operations.
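 A small NumPy illustration of that independence (each output element is its own dot product):

```python
import numpy as np

M, K, N = 4, 5, 3
A, B = np.random.randn(M, K), np.random.randn(K, N)
C = np.empty((M, N))
# Each output element below depends only on one row of A and one column of B,
# so a GPU can assign one thread per element and compute all M*N in parallel.
for i in range(M):
    for j in range(N):
        C[i, j] = A[i, :] @ B[:, j]
print(np.allclose(C, A @ B))  # True
```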
 Frameworks for programming GPUs:
 CUDA (NVIDIA only)
 Write C-like code that runs directly on the GPU.
 It's hard to write well-optimized code that runs on the GPU; that's why NVIDIA provides high-level APIs.
 Higher-level APIs: cuBLAS, cuDNN, etc.
 cuDNN implements backprop, convolutions, recurrent layers, and a lot more for you!
 In practice you won't write parallel code yourself; you will use code implemented and optimized by others!
 OpenCL
 Similar to CUDA, but runs on any GPU.
 Usually slower.
 Not yet well supported by deep learning software.
 There are many courses for learning parallel programming.
 If you aren't careful, training can bottleneck on reading data from disk and transferring it to the GPU. The solutions are:
 Read all the data into RAM. # If possible
 Use an SSD instead of an HDD.
 Use multiple CPU threads to prefetch data!
 While the GPU is computing, a CPU thread fetches the next batch of data for you.
 A lot of frameworks implement this for you because it's a bit painful! See the sketch below.
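 A minimal prefetching sketch with PyTorch's DataLoader (one of the frameworks that implements this for you; the dataset here is random stand-in data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":  # guard needed because workers are separate processes
    dataset = TensorDataset(torch.randn(10_000, 3, 32, 32),
                            torch.randint(0, 10, (10_000,)))
    loader = DataLoader(dataset, batch_size=128, shuffle=True,
                        num_workers=4,    # background workers prefetch batches
                        pin_memory=True)  # page-locked memory: faster GPU copies
    for images, labels in loader:
        pass  # while the GPU computes on this batch, workers prepare the next
```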
Deep Learning Frameworks
 It's super fast moving!
 Currently available frameworks:
 TensorFlow (Google)
 Caffe (UC Berkeley)
 Caffe2 (Facebook)
 Torch (NYU / Facebook)
 PyTorch (Facebook)
 Theano (U Montreal)
 Paddle (Baidu)
 CNTK (Microsoft)
 MXNet (Amazon)
 The instructor thinks that you should focus on TensorFlow and PyTorch.
 The point of deep learning frameworks:
 Easily build big computational graphs.
 Easily compute gradients in computational graphs.
 Run it efficiently on GPU (cuDNN, cuBLAS).
 NumPy doesn’t run on GPU.
 Most of the frameworks try to mimic NumPy in the forward pass and then compute the gradients for you.
 TensorFlow (Google)
 The code has two parts:
 Define the computational graph.
 Run the graph, reusing it many times.
 TensorFlow uses a static graph architecture.
 TensorFlow variables live in the graph, while placeholders are fed on each run.
 The global initializer function initializes the variables that live in the graph.
 Use the predefined optimizers and losses.
 You can create a full layer with the tf.layers.dense function.
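 A minimal sketch of the two-part pattern in the TF 1.x-style API described here (in TF 2.x these calls live under tf.compat.v1):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x-style API

# Part 1: define the computational graph (nothing is computed yet).
x = tf.placeholder(tf.float32, shape=(None, 3))     # fed on each run
y = tf.placeholder(tf.float32, shape=(None, 1))
w = tf.Variable(tf.random_normal((3, 1)))           # lives in the graph
loss = tf.reduce_mean((tf.matmul(x, w) - y) ** 2)
train_op = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)

# Part 2: run the graph many times, feeding the placeholders.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())     # init variables in the graph
    for _ in range(100):
        xs, ys = np.random.randn(8, 3), np.random.randn(8, 1)
        sess.run(train_op, feed_dict={x: xs, y: ys})
```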
 Keras (high-level wrapper):
 Keras is a layer on top of TensorFlow that makes common things easy to do.
 So popular!
 Trains a full deep NN in a few lines of code.
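 For example, a minimal Keras model (the shapes and random data are illustrative):

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(np.random.randn(512, 32), np.random.randint(0, 10, 512), epochs=2)
```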
 There are a lot of high-level wrappers:
 Keras
 TFLearn
 TensorLayer
 tf.layers
# Ships with TensorFlow
 TF-Slim
# Ships with TensorFlow
 tf.contrib.learn
# Ships with TensorFlow
 Sonnet
# New, from DeepMind
 TensorFlow has pretrained models that you can use for transfer learning.
 TensorBoard adds logging to record losses and stats. Run the server and get pretty graphs!
 TensorFlow has distributed-execution support if you want to split your graph across nodes.
 TensorFlow was actually inspired by Theano: it has the same ideas and structure.

PyTorch (Facebook)
 Has three layers of abstraction:
 Tensor: an ndarray, but it runs on the GPU
# Like a numpy array, or a Tensor in TensorFlow
 Variable: a node in a computational graph; stores data and gradients
# Like TensorFlow's Tensor, Variable, and placeholder
 Module: a NN layer; may store state or learnable weights
# Like tf.layers in TensorFlow
 In PyTorch, the graph is built and run in the same loop you are executing, which makes debugging easier. This is called a dynamic graph.
 In PyTorch you can define your own autograd functions by writing forward and backward passes over tensors. Most of the time, though, they are already implemented for you.
 torch.nn is a high-level API, like Keras in TensorFlow. You can build models from its layers, and you can define your own nn module!
 PyTorch also contains optimizers, like TensorFlow (see the training sketch after this list).
 It contains a DataLoader that wraps a Dataset and provides mini-batching, shuffling, and multithreading.
 PyTorch contains the best and easiest-to-use pretrained models.
 PyTorch contains Visdom, which is similar to TensorBoard, but TensorBoard seems to be more powerful.
 PyTorch is new and still evolving compared to Torch; it's still in a beta state.
 PyTorch is best for research.
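 A minimal training sketch tying the pieces together (module, optimizer, dynamic graph); the shapes and random data are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):            # define your own nn module
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

model = TwoLayerNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    loss = F.cross_entropy(model(x), y)  # a fresh graph is built on every pass
    optimizer.zero_grad()
    loss.backward()                      # gradients flow through the dynamic graph
    optimizer.step()
```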

TensorFlow builds the graph once, then runs it many times (this is called a static graph).
 In each PyTorch iteration, we build a new graph (this is called a dynamic graph).
Static vs dynamic graphs
 Optimization:
 With static graphs, the framework can optimize the graph for you before it runs.
 Serialization:
 Static: once the graph is built, you can serialize it and run it without the code that built it, e.g., use the graph from C++.
 Dynamic: you always need to keep the code around.

Conditionals:
 Easier in dynamic graphs; more complicated in static graphs.

Loops:
 Also easier in dynamic graphs, and more complicated in static graphs (see the sketch below for the conditional case).
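 A small sketch of why data-dependent control flow is easier with dynamic graphs (PyTorch side shown; the static-graph equivalent is noted in a comment):

```python
import torch

# Dynamic graph: a data-dependent branch is just ordinary Python.
def f(x, w1, w2):
    if x.sum() > 0:          # decided at run time, per input
        return x @ w1
    return x @ w2
# In a static-graph framework the branch must be baked into the graph,
# e.g. TF1: tf.cond(pred, lambda: tf.matmul(x, w1), lambda: tf.matmul(x, w2)).

x = torch.randn(4, 8)
w1, w2 = torch.randn(8, 2), torch.randn(8, 2)
print(f(x, w1, w2).shape)  # torch.Size([4, 2])
```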

TensorFlow Fold makes dynamic graphs easier to express in TensorFlow through dynamic batching.

Dynamic graph applications include: recurrent networks and recursive networks.

Caffe2 uses static graphs and can train models in Python; it also works on iOS and Android.
 TensorFlow/Caffe2 are used a lot in production, especially on mobile.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DeepLearningHardwareAndSoftware,
title = {Deep Learning Hardware and Software},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
year = {2020},
note = {\url{https://aman.ai}}
}