## Hardware for Deep Learning

• Deep ConvNets, recurrent nets, and deep reinforcement learning are shaping many applications and changing our lives.
• Examples include self-driving cars, machine translation, AlphaGo, and so on.
• But the current trend is that achieving higher accuracy requires larger (deeper) models.
• The winning model size in the ImageNet competition increased 16x from 2012 to 2015 in pursuit of higher accuracy.
• Deep Speech 2 requires 10x the training operations of Deep Speech 1, and that's in only one year! # At Baidu
• This trend brings three challenges:
• Model Size
• It's hard to deploy large models on our PCs, mobiles, or cars.
• Speed
• ResNet-152 took 1.5 weeks to train to reach its 6.16% error rate!
• Long training times limit ML researchers' productivity.
• Energy Efficiency
• AlphaGo: 1920 CPUs and 280 GPUs. \$3000 electric bill per game
• If we ran such models on our mobile devices, they would drain the battery.
• Google mentioned in a blog post that if all users used Google voice search for just 3 minutes a day, they would have to double their data centers!
• Where is the Energy Consumed?
• Larger model => more memory references => more energy.
• We can improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design.
• From both the hardware and the algorithm perspectives.
• Hardware 101: the Family
• General Purpose # Can run any application
• CPU # Latency-oriented: a single strong thread, like one elephant
• GPU # Throughput-oriented: many small threads, like an army of ants
• GPGPU
• Specialized HW # Tuned for a domain of applications
• FPGA # Programmable logic; cheaper to design but less efficient
• ASIC # Fixed logic, designed for a specific application (can be designed for deep learning applications)
• Hardware 101: Number Representation
• Numbers in a computer are represented with a fixed, discrete number of bits.
• Going from 32-bit to 16-bit floating-point operations is very energy efficient for hardware.

## Algorithms for Efficient Inference

• Pruning neural networks
• The idea: can we remove some of the weights/neurons and have the NN still behave the same?
• In 2015, Han et al. used pruning to cut AlexNet from 60 million parameters down to 6 million!
• Pruning can be applied to both CNNs and RNNs; done iteratively (with retraining), it reaches the same accuracy as the original model.
• Pruning actually happens to humans:
• Newborn(50 Trillion Synapses) ==> 1 year old(1000 Trillion Synapses) ==> Adolescent(500 Trillion Synapses)
• Algorithm:
1. Get a trained network.
2. Evaluate the importance of the neurons.
3. Remove the least important neurons.
4. Fine-tune the network.
5. If we need to continue pruning, go back to step 2; otherwise stop.
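• A minimal sketch of this loop in PyTorch (the magnitude-based importance criterion and the pruning fraction here are illustrative assumptions, not the lecture's exact recipe):

```python
import torch

def magnitude_prune(model, fraction=0.1):
    """Steps 2-3: use |weight| as the importance score and zero out the smallest."""
    for p in model.parameters():
        if p.dim() < 2:                      # skip biases / norm parameters
            continue
        k = max(1, int(fraction * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        p.data[p.abs() < threshold] = 0.0    # remove least-important weights

# Steps 1-5: prune, fine-tune, repeat until the target sparsity is reached.
# A real implementation also keeps a mask so pruned weights stay zero during
# fine-tuning; `fine_tune` stands in for a few epochs of normal training.
# for _ in range(num_rounds):
#     magnitude_prune(model, fraction=0.1)
#     fine_tune(model, train_loader)
```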
• Weight Sharing
• The idea is to reduce the number of distinct values stored in our models.
• Trained Quantization:
• Example: weight values 2.09, 2.12, 1.92, and 1.87 are all replaced by 2.
• To do that, we can run k-means clustering on (for example) a filter's weights to reduce the number of distinct values in it (see the sketch below). During fine-tuning, the gradients of weights in the same cluster are grouped, which also reduces the work needed for the gradient updates.
• After trained quantization, the weights are discrete.
• Trained quantization can significantly reduce the number of bits we need per weight in each layer.
• Pruning + trained quantization can work together to reduce the model size further.
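• A minimal sketch of the k-means weight-sharing step using scikit-learn (the real Deep Compression pipeline also fine-tunes the centroids with the grouped gradients; 16 clusters, i.e., 4-bit indices, is just an example setting):

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(weights, n_clusters=16):
    """Cluster weights; each weight becomes a small index into a codebook."""
    w = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w)
    codebook = km.cluster_centers_.flatten()   # the shared weight values
    indices = km.labels_                       # e.g., 4-bit index per weight
    return codebook[indices].reshape(weights.shape), codebook, indices

# The example above: values near 2 collapse onto a single shared weight.
w = np.array([2.09, 2.12, 1.92, 1.87, -0.50, -0.45])
approx, codebook, idx = share_weights(w, n_clusters=2)
```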
• Huffman Coding
• We can use Huffman coding to compress the bits used to store the weights:
• Infrequent weights: use more bits to represent them.
• Frequent weights: use fewer bits to represent them.
• Using Pruning + Trained Quantization + Huffman Coding together is called Deep Compression. ![](assets/deeplearning-HW-SW/37.png) ![](assets/deeplearning-HW-SW/38.png)
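• A minimal sketch of building a Huffman code over quantized weight indices with Python's heapq (illustrative only, not the paper's implementation):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Frequent symbols get short bit strings; rare ones get longer strings."""
    counts = Counter(symbols)
    # Heap entries: (frequency, tie_breaker, {symbol: partial_code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized weight indices; the dominant index 3 gets the shortest code.
codes = huffman_code([3, 3, 3, 3, 3, 1, 1, 2, 0])
```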
• SqueezeNet
• All the methods we have talked about so far start from a pretrained model. Can we instead design a new architecture that saves memory and computation?
• SqueezeNet reaches AlexNet accuracy with 50x fewer parameters and (once compressed) a model size under 0.5 MB.
• SqueezeNet can be compressed even further by applying Deep Compression to it.
• These models are now much more energy efficient and much faster.
• Deep Compression has been applied in industry, e.g., at Facebook and Baidu.
• Quantization
• Algorithm (quantizing the weights and activations):
• Train with floats.
• Quantize the weights and activations:
• Gather statistics for the weights and activations.
• Choose a proper radix point position.
• Fine-tune in float format.
• Convert to fixed-point format (a sketch follows).
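• A minimal sketch of the statistics-then-radix-point step (the formulas here are illustrative assumptions; the lecture only names the steps):

```python
import numpy as np

def to_fixed_point(x, total_bits=8):
    """Gather range statistics, choose a radix point, round to fixed point."""
    max_abs = np.abs(x).max() + 1e-12
    int_bits = max(int(np.ceil(np.log2(max_abs))), 0) + 1   # +1 for the sign
    frac_bits = total_bits - int_bits                       # radix point position
    scale = 2.0 ** frac_bits
    lo, hi = -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)                # fixed-point integers
    return q / scale, frac_bits                             # dequantized + radix

weights = np.random.randn(1000) * 0.5
quantized, radix = to_fixed_point(weights, total_bits=8)
```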
• Low Rank Approximation
• Another size-reduction technique used for CNNs.
• The idea is to decompose a conv layer into two smaller layers and then fine-tune the decomposed layers (a sketch follows).
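• The idea is easiest to see for a fully connected layer: factor the weight matrix with a truncated SVD into two thinner layers (a minimal sketch; conv layers are decomposed analogously along their filter dimensions):

```python
import numpy as np

def low_rank_factor(W, rank):
    """Replace an m x n layer with an (m x r) layer followed by an (r x n) one."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # m x r: first, smaller layer
    B = Vt[:rank, :]               # r x n: second, smaller layer
    return A, B

W = np.random.randn(1024, 1024)          # 1024^2 ~ 1M parameters
A, B = low_rank_factor(W, rank=64)       # ~2 * 1024 * 64 ~ 131K parameters
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```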
• Binary / Ternary Net
• Can we use only three numbers to represent the weights in a NN?
• With only -1, 0, and 1, the model size becomes much smaller (see the sketch after this block).
• This idea was published in 2017: Zhu, Han, Mao, Dally, “Trained Ternary Quantization”, ICLR'17.
• It works on already-trained networks.
• Tried on AlexNet, it reaches almost the same error as the full-precision AlexNet.
• With binary/ternary weights, far more operations fit into a single register: https://xnor.ai/
• 3x3 Winograd convolutions use fewer operations than ordinary (direct) convolution.
• cuDNN 5 uses Winograd convolutions, which improved its speed.
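• A minimal sketch of threshold-based ternarization (simplified: Trained Ternary Quantization additionally learns per-layer positive and negative scaling factors, Wp and Wn, during training; the threshold here is an illustrative assumption):

```python
import torch

def ternarize(w, t=0.05):
    """Map full-precision weights to {-1, 0, +1} with a magnitude threshold."""
    q = torch.zeros_like(w)
    q[w > t] = 1.0
    q[w < -t] = -1.0
    return q

w = torch.randn(64, 64) * 0.1
q = ternarize(w)   # TTQ would then scale the +1s by Wp and the -1s by Wn
```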

## Hardware for Efficient Inference

• Many ASICs have been developed for deep learning, all with the same goal: minimizing memory access.
• Eyeriss MIT
• TPU Google (Tensor processing unit)
• It can be installed in a server's disk drive bay (in place of a disk).
• Up to 4 cards per server.
• It consumes far less power than a GPU, and the chip itself is smaller.
• EIE (Stanford)
• By Han et al., 2016 [ISCA'16].
• It skips zero weights entirely and handles the quantized numbers natively in hardware.
• EIE achieves better throughput and energy efficiency.

## Algorithms for Efficient Training

• Parallelization
• Data Parallel – Run multiple inputs in parallel
• E.g., run two images through the model at the same time!
• Run multiple training examples in parallel.
• Limited by batch size.
• Gradients have to be applied by a master node.
• Model Parallel
• Split up the model, i.e., the network itself.
• Split the model over multiple processors, e.g., by layer.
• Hyper-Parameter Parallel
• Try many alternative networks in parallel.
• Easy to get 16-64 GPUs training one model in parallel (a sketch of the data-parallel case follows).
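• A minimal PyTorch sketch of data parallelism (nn.DataParallel scatters each batch across the available GPUs and gathers results on the master device; model and hyper-parameter parallelism need different tooling):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # split each input batch across GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(64, 1024, device=device)
out = model(x)   # batch scattered, replicas run in parallel, outputs gathered
```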
• Mixed Precision with FP16 and FP32
• We discussed earlier that using 16-bit numbers throughout a model reduces the energy cost by about 4x.
• Can we build a model entirely with 16-bit numbers? We can partially do this by mixing FP16 and FP32: use 16 bits almost everywhere, but use FP32 at a few critical points.
• For example, when multiplying FP16 values, the products are accumulated in FP32.
• Trained this way, models reach nearly the accuracy of famous full-precision models like AlexNet and ResNet (see the sketch below).
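• These notes predate it, but modern PyTorch packages exactly this recipe as automatic mixed precision; a minimal sketch (requires a CUDA GPU; the model and data are toy placeholders):

```python
import torch
import torch.nn as nn

device = "cuda"                                    # AMP needs a CUDA device
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()               # avoids FP16 gradient underflow

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():                    # FP16 where safe, FP32 elsewhere
    loss = criterion(model(x), y)                  # matmuls accumulate in FP32
scaler.scale(loss).backward()
scaler.step(optimizer)                             # unscale gradients, then step
scaler.update()
```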
• Model Distillation
• The question: can we use senior (good), already-trained neural network(s) to guide a student (new) neural network?
• For more information, see Hinton et al., “Dark Knowledge” / “Distilling the Knowledge in a Neural Network”.
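• The standard recipe from that paper: the student matches the teacher's temperature-softened outputs in addition to the true labels. A minimal sketch (the T and alpha values are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend soft-target guidance from the teacher with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),   # student's softened predictions
        F.softmax(teacher_logits / T, dim=1),       # teacher's softened targets
        reduction="batchmean",
    ) * (T * T)                      # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```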
• DSD: Dense-Sparse-Dense Training
• Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017
• Provides better regularization.
• The idea: train the model (call it Dense), then apply pruning to it (call it Sparse).
• After those two steps, restore the pruned connections and train them again (back to Dense).
• DSD produces the same model architecture but finds a better optimization solution, arriving at a better local minimum and achieving higher prediction accuracy.
• This improves performance a lot across many deep learning models.

## Hardware for Efficient Training

• GPUs for training:
• Nvidia PASCAL GP100 (2016)
• Nvidia Volta GV100 (2017)
• Can perform mixed-precision operations!
• Extremely powerful – the new nuclear bomb of hardware!
• Cloud TPU delivers up to 180 teraflops to train and run machine learning models.
• From Google: “One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod.”
• We have moved from PC Era ==> Mobile-First Era ==> AI-First Era

## Deep Learning Software

• This section changes a lot every year in CS231n due to rapid changes in the deep learning software.
• CPU vs GPU
• GPUs (graphics cards) were originally developed to render graphics for games, 3D media, etc.
• NVIDIA vs AMD
• Deep learning favors NVIDIA over AMD GPUs because NVIDIA pushes deep learning research forward and tailors its architectures to deep learning.
• A CPU has fewer cores, but each core is much faster and more capable: great for sequential tasks. A GPU has many more cores, but each core is much slower and “dumber”: great for parallel tasks.
• GPU cores need to work together, and the GPU has its own memory.
• Matrix multiplication is a prime example of an operation suited to GPUs: it consists of MxN independent output elements that can be computed in parallel.
• The convolution operation can also be parallelized because it consists of independent operations.
• GPU programming frameworks:
• CUDA (NVIDIA only)
• Write C-like code that runs directly on the GPU.
• It's hard to write well-optimized code that runs on the GPU, which is why NVIDIA provides high-level APIs.
• Higher level APIs: cuBLAS, cuDNN, etc
• cuDNN implements backprop, convolutions, recurrent layers, and much more for you!
• In practice you won’t write a parallel code. You will use the code implemented and optimized by others!
• OpenCL
• Similar to CUDA, but runs on any GPU.
• Usually slower.
• Doesn't have much support yet in deep learning software.
• There are a lot of courses for learning parallel programming.
• If you aren’t careful, training can bottleneck on reading data and transferring to GPU. So the solutions are:
• Read all the data into RAM. # If possible
• Use SSD instead of HDD
• Use multiple CPU threads to prefetch data!
• While the GPU is computing, CPU threads fetch the data for you.
• Many frameworks implement this for you because it's a bit painful (see the sketch below)!
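• In PyTorch, for example, the DataLoader handles this prefetching; a minimal sketch (the worker count and the toy dataset are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # CPU worker processes prefetch upcoming batches
    pin_memory=True,   # page-locked host memory speeds up CPU -> GPU copies
)

for images, labels in loader:
    if torch.cuda.is_available():
        images = images.cuda(non_blocking=True)  # overlap copy with GPU compute
    # ... forward / backward pass ...
```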

## Deep Learning Frameworks

• It's a super fast-moving field!
• Currently available frameworks:
• Caffe (UC Berkeley)
• Theano (U Montreal)
• CNTK (Microsoft)
• MXNet (Amazon)
• The instructor thinks that you should focus on TensorFlow and PyTorch.
• The point of deep learning frameworks:
• Easily build big computational graphs.
• Easily compute gradients in computational graphs.
• Run it efficiently on GPU (cuDNN - cuBLAS)
• NumPy doesn’t run on GPU.
• Most frameworks try to mimic NumPy in the forward pass, then compute the gradients for you.
• The code has two parts:
1. Define computational graph.
2. Run the graph and reuse it many times.
• TensorFlow uses a static graph architecture.
• TensorFlow variables live in the graph, while placeholders are fed on each run.
• The global initializer function initializes the variables that live in the graph.
• Use predefined optimizers and losses.
• You can build a full layer with the layers.dense function (see the sketch below).
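• A minimal TensorFlow 1.x-style sketch tying these pieces together (placeholders fed per run, variables living in the graph, a predefined loss and optimizer, layers.dense); it needs the 1.x API (or tf.compat.v1):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x-style API

# Part 1: define the computational graph.
x = tf.placeholder(tf.float32, shape=(None, 64))     # fed on each run
y = tf.placeholder(tf.float32, shape=(None, 1))
h = tf.layers.dense(x, 32, activation=tf.nn.relu)    # a full layer in one call
pred = tf.layers.dense(h, 1)
loss = tf.losses.mean_squared_error(y, pred)         # predefined loss
train_op = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)

# Part 2: run the graph many times.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())      # init variables in the graph
    for _ in range(100):
        xs, ys = np.random.randn(32, 64), np.random.randn(32, 1)
        loss_val, _ = sess.run([loss, train_op], feed_dict={x: xs, y: ys})
```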
• Keras (High level wrapper):
• Keras is a layer on top of TensorFlow that makes common things easy to do.
• So popular!
• It trains a full deep NN in a few lines of code, as the sketch below shows.
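• For instance, a small classifier in a handful of lines (a minimal sketch; the layer sizes are arbitrary):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, batch_size=64)  # with your data
```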
• There are a lot of high-level wrappers:
• Keras
• TFLearn
• TensorLayer
• tf.layers # Ships with TensorFlow
• tf-Slim # Ships with TensorFlow
• tf.contrib.learn # Ships with TensorFlow
• Sonnet # New from DeepMind
• TensorFlow provides pretrained models that you can use for transfer learning.
• TensorBoard adds logging to record loss and other stats; run the server and get pretty graphs!
• TensorFlow supports distributed execution if you want to split your graph across several nodes.
• TensorFlow was actually inspired by Theano; it shares the same ideas and structure.

• PyTorch has three layers of abstraction:
• Tensor: an ndarray that runs on the GPU # like a NumPy array
• Variable: a node in a computational graph; stores data and gradient # like TensorFlow's Tensor, Variable, and placeholder
• Module: an NN layer; may store state or learnable weights # like tf.layers in TensorFlow
• In PyTorch the graph is built as your code runs, which makes debugging easier. This is called a dynamic graph.
• In PyTorch you can define your own autograd functions by writing forward and backward for tensors, though most of the time they are already implemented for you.
• torch.nn is a high-level API, like Keras for TensorFlow. You can compose models out of predefined layers.
• You can define your own nn module (see the sketch after this list)!
• PyTorch also provides optimizers, like TensorFlow.
• It has a DataLoader that wraps a Dataset and provides minibatching, shuffling, and multithreading.
• PyTorch has the best and easiest-to-use pretrained models.
• PyTorch can use Visdom, which is similar to TensorBoard, though TensorBoard seems more powerful.
• PyTorch is new and still evolving compared to Torch; it's still in a beta state.
• PyTorch is best for research.
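• A minimal sketch of a custom nn module, as mentioned above (the two-layer architecture is just an example):

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, d_in, h, d_out):
        super().__init__()
        self.linear1 = nn.Linear(d_in, h)    # Modules can contain other Modules
        self.linear2 = nn.Linear(h, d_out)

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

model = TwoLayerNet(64, 32, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # built-in optimizers
out = model(torch.randn(8, 64))    # the graph is rebuilt on the fly at each call
```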
• TensorFlow builds the graph once, then runs it many times (called a static graph).

• In each PyTorch iteration, we build a new graph (called a dynamic graph).

## Static vs dynamic graphs

• Optimization:
• With static graphs, the framework can optimize the graph for you before it runs (e.g., by fusing operations).
• Serialization:
• Static: once the graph is built, it can be serialized and run without the code that built it, e.g., deployed from C++.
• Dynamic: you always need to keep the code around.
• Conditionals:
• Easier in dynamic graphs; more complicated in static graphs.
• Loops:
• Easier in dynamic graphs; more complicated in static graphs (a sketch of both follows).
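• A minimal sketch of why this is easier in a dynamic framework: in PyTorch, plain Python if/for statements become part of the model, while static TensorFlow 1.x needs special graph ops like tf.cond and tf.while_loop for the same logic:

```python
import torch

def forward(x, w1, w2, steps):
    # Data-dependent conditional: the branch taken can change on every call.
    h = x @ (w1 if x.sum() > 0 else w2)
    # Data-dependent loop: the trip count is ordinary Python.
    for _ in range(steps):
        h = torch.relu(h @ w2)
    return h

x = torch.randn(4, 16)
w1, w2 = torch.randn(16, 16), torch.randn(16, 16)
out = forward(x, w1, w2, steps=3)
```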
• TensorFlow Fold makes dynamic-graph-style models easier to express in TensorFlow through dynamic batching.

• Dynamic graph applications include: recurrent networks and recursive networks.

• Caffe2 uses static graphs; it can train models in Python and also runs on iOS and Android.

• TensorFlow/Caffe2 are used heavily in production, especially on mobile.

## Citation

If you found our work useful, please cite it as:

```bibtex
@article{Chadha2020DeepLearningHardwareAndSoftware,
  title   = {Deep Learning Hardware and Software},
  author  = {Chadha, Aman},
  year    = {2020},
}
```