## Hardware for Deep Learning

• Deep ConvNets, recurrent nets, and deep reinforcement learning are shaping many applications and changing our lives.
• Examples include self-driving cars, machine translation, AlphaGo, and so on.
• But the current trend is that achieving higher accuracy requires larger (deeper) models.
• The winning model size in the ImageNet competition increased 16x from 2012 to 2015 in pursuit of higher accuracy.
• Deep Speech 2 requires 10x the training operations of Deep Speech 1, and that's in only one year! # At Baidu
• This trend brings three challenges:
• Model Size
• It's hard to deploy large models on our PCs, mobiles, or cars.
• Speed
• ResNet-152 took 1.5 weeks to train to reach its 6.16% error rate!
• Long training times limit ML researchers' productivity.
• Energy Efficiency
• AlphaGo: 1920 CPUs and 280 GPUs. \$3000 electric bill per game
• If we ran such models on our mobile devices, they would drain the battery.
• Google mentioned in a blog post that if all users used Google voice search for just 3 minutes a day, they would have to double their data centers!
• Where is the Energy Consumed?
• Larger model => more memory references => more energy.
• We can improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design.
• From both the hardware and the algorithm perspectives.
• Hardware 101: the Family
• General Purpose # Can run any application
• CPU # Latency-oriented: a single strong thread, like one elephant
• GPU # Throughput-oriented: many small threads, like an army of ants
• GPGPU
• Specialized HW # Tuned for a domain of applications
• FPGA # Programmable logic; cheaper to design but less efficient
• ASIC # Fixed logic, designed for a specific application (can be designed for deep learning applications)
• Hardware 101: Number Representation
• Numbers in a computer are represented with a fixed, discrete number of bits.
• Going from 32-bit to 16-bit floating-point operations is very energy efficient for hardware.

## Algorithms for Efficient Inference

• Pruning neural networks
• The idea: can we remove some of the weights/neurons and have the NN still behave the same?
• In 2015, Han et al. used pruning to cut AlexNet from 60 million parameters down to 6 million!
• Pruning can be applied to both CNNs and RNNs; done iteratively (with retraining), it reaches the same accuracy as the original model.
• Pruning actually happens to humans:
• Newborn(50 Trillion Synapses) ==> 1 year old(1000 Trillion Synapses) ==> Adolescent(500 Trillion Synapses)
• Algorithm:
1. Get a trained network.
2. Evaluate the importance of the neurons.
3. Remove the least important neurons.
4. Fine-tune the network.
5. If we need to continue pruning, go back to step 2; otherwise stop.
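• A minimal sketch of this loop in PyTorch (the magnitude-based importance criterion and the pruning fraction here are illustrative assumptions, not the lecture's exact recipe):

```python
import torch

def magnitude_prune(model, fraction=0.1):
    """Steps 2-3: use |weight| as the importance score and zero out the smallest."""
    for p in model.parameters():
        if p.dim() < 2:                      # skip biases / norm parameters
            continue
        k = max(1, int(fraction * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        p.data[p.abs() < threshold] = 0.0    # remove least-important weights

# Steps 1-5: prune, fine-tune, repeat until the target sparsity is reached.
# A real implementation also keeps a mask so pruned weights stay zero during
# fine-tuning; `fine_tune` stands in for a few epochs of normal training.
# for _ in range(num_rounds):
#     magnitude_prune(model, fraction=0.1)
#     fine_tune(model, train_loader)
```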
• Weight Sharing
• The idea is to reduce the number of distinct values stored in our models.
• Trained Quantization:
• Example: weight values 2.09, 2.12, 1.92, and 1.87 are all replaced by 2.
• To do that, we can run k-means clustering on (for example) a filter's weights to reduce the number of distinct values in it (see the sketch below). During fine-tuning, the gradients of weights in the same cluster are grouped, which also reduces the work needed for the gradient updates.
• After trained quantization, the weights are discrete.
• Trained quantization can significantly reduce the number of bits we need per weight in each layer.
• Pruning + trained quantization can work together to reduce the model size further.
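• A minimal sketch of the k-means weight-sharing step using scikit-learn (the real Deep Compression pipeline also fine-tunes the centroids with the grouped gradients; 16 clusters, i.e., 4-bit indices, is just an example setting):

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(weights, n_clusters=16):
    """Cluster weights; each weight becomes a small index into a codebook."""
    w = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w)
    codebook = km.cluster_centers_.flatten()   # the shared weight values
    indices = km.labels_                       # e.g., 4-bit index per weight
    return codebook[indices].reshape(weights.shape), codebook, indices

# The example above: values near 2 collapse onto a single shared weight.
w = np.array([2.09, 2.12, 1.92, 1.87, -0.50, -0.45])
approx, codebook, idx = share_weights(w, n_clusters=2)
```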
• Huffman Coding
• We can use Huffman coding to compress the bits used to store the weights:
• Infrequent weights: use more bits to represent them.
• Frequent weights: use fewer bits to represent them.
• Using Pruning + Trained Quantization + Huffman Coding together is called Deep Compression. ![](assets/deeplearning-HW-SW/37.png) ![](assets/deeplearning-HW-SW/38.png)
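• A minimal sketch of building a Huffman code over quantized weight indices with Python's heapq (illustrative only, not the paper's implementation):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Frequent symbols get short bit strings; rare ones get longer strings."""
    counts = Counter(symbols)
    # Heap entries: (frequency, tie_breaker, {symbol: partial_code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized weight indices; the dominant index 3 gets the shortest code.
codes = huffman_code([3, 3, 3, 3, 3, 1, 1, 2, 0])
```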
• SqueezeNet
• All the methods we have talked about so far start from a pretrained model. Can we instead design a new architecture that saves memory and computation?
• SqueezeNet reaches AlexNet accuracy with 50x fewer parameters and (once compressed) a model size under 0.5 MB.
• SqueezeNet can be compressed even further by applying Deep Compression to it.
• These models are now much more energy efficient and much faster.
• Deep Compression has been applied in industry, e.g., at Facebook and Baidu.
• Quantization
• Algorithm (quantizing the weights and activations):
• Train with floats.
• Quantize the weights and activations:
• Gather statistics for the weights and activations.
• Choose a proper radix point position.
• Fine-tune in float format.
• Convert to fixed-point format (a sketch follows).
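• A minimal sketch of the statistics-then-radix-point step (the formulas here are illustrative assumptions; the lecture only names the steps):

```python
import numpy as np

def to_fixed_point(x, total_bits=8):
    """Gather range statistics, choose a radix point, round to fixed point."""
    max_abs = np.abs(x).max() + 1e-12
    int_bits = max(int(np.ceil(np.log2(max_abs))), 0) + 1   # +1 for the sign
    frac_bits = total_bits - int_bits                       # radix point position
    scale = 2.0 ** frac_bits
    lo, hi = -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)                # fixed-point integers
    return q / scale, frac_bits                             # dequantized + radix

weights = np.random.randn(1000) * 0.5
quantized, radix = to_fixed_point(weights, total_bits=8)
```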
• Low Rank Approximation
• Another size-reduction technique used for CNNs.
• The idea is to decompose a conv layer into two smaller layers and then fine-tune the decomposed layers (a sketch follows).
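• The idea is easiest to see for a fully connected layer: factor the weight matrix with a truncated SVD into two thinner layers (a minimal sketch; conv layers are decomposed analogously along their filter dimensions):

```python
import numpy as np

def low_rank_factor(W, rank):
    """Replace an m x n layer with an (m x r) layer followed by an (r x n) one."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # m x r: first, smaller layer
    B = Vt[:rank, :]               # r x n: second, smaller layer
    return A, B

W = np.random.randn(1024, 1024)          # 1024^2 ~ 1M parameters
A, B = low_rank_factor(W, rank=64)       # ~2 * 1024 * 64 ~ 131K parameters
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```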
• Binary / Ternary Net
• Can we use only three numbers to represent the weights in a NN?
• With only -1, 0, and 1, the model size becomes much smaller (see the sketch after this block).
• This idea was published in 2017: Zhu, Han, Mao, Dally, “Trained Ternary Quantization”, ICLR'17.
• It works on already-trained networks.
• Tried on AlexNet, it reaches almost the same error as the full-precision AlexNet.
• With binary/ternary weights, far more operations fit into a single register: https://xnor.ai/
• 3x3 Winograd convolutions use fewer operations than ordinary (direct) convolution.
• cuDNN 5 uses Winograd convolutions, which improved its speed.
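• A minimal sketch of threshold-based ternarization (simplified: Trained Ternary Quantization additionally learns per-layer positive and negative scaling factors, Wp and Wn, during training; the threshold here is an illustrative assumption):

```python
import torch

def ternarize(w, t=0.05):
    """Map full-precision weights to {-1, 0, +1} with a magnitude threshold."""
    q = torch.zeros_like(w)
    q[w > t] = 1.0
    q[w < -t] = -1.0
    return q

w = torch.randn(64, 64) * 0.1
q = ternarize(w)   # TTQ would then scale the +1s by Wp and the -1s by Wn
```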

## Hardware for Efficient Inference

• Many ASICs have been developed for deep learning, all with the same goal: minimizing memory access.
• Eyeriss MIT
• TPU Google (Tensor processing unit)
• It can be installed in a server's disk drive bay (in place of a disk).
• Up to 4 cards per server.
• It consumes far less power than a GPU, and the chip itself is smaller.
• EIE (Stanford)
• By Han et al., 2016 [ISCA'16].
• It skips zero weights entirely and handles the quantized numbers natively in hardware.
• EIE achieves better throughput and energy efficiency.

## Algorithms for Efficient Training

• Parallelization
• Data Parallel – Run multiple inputs in parallel
• E.g., run two images through the model at the same time!
• Run multiple training examples in parallel.
• Limited by batch size.
• Gradients have to be applied by a master node.
• Model Parallel
• Split up the model, i.e., the network itself.
• Split the model over multiple processors, e.g., by layer.
• Hyper-Parameter Parallel
• Try many alternative networks in parallel.
• Easy to get 16-64 GPUs training one model in parallel (a sketch of the data-parallel case follows).
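• A minimal PyTorch sketch of data parallelism (nn.DataParallel scatters each batch across the available GPUs and gathers results on the master device; model and hyper-parameter parallelism need different tooling):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # split each input batch across GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(64, 1024, device=device)
out = model(x)   # batch scattered, replicas run in parallel, outputs gathered
```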
• Mixed Precision with FP16 and FP32
• We discussed earlier that using 16-bit numbers throughout a model reduces the energy cost by about 4x.
• Can we build a model entirely with 16-bit numbers? We can partially do this by mixing FP16 and FP32: use 16 bits almost everywhere, but use FP32 at a few critical points.
• For example, when multiplying FP16 values, the products are accumulated in FP32.
• Trained this way, models reach nearly the accuracy of famous full-precision models like AlexNet and ResNet (see the sketch below).
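• These notes predate it, but modern PyTorch packages exactly this recipe as automatic mixed precision; a minimal sketch (requires a CUDA GPU; the model and data are toy placeholders):

```python
import torch
import torch.nn as nn

device = "cuda"                                    # AMP needs a CUDA device
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()               # avoids FP16 gradient underflow

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():                    # FP16 where safe, FP32 elsewhere
    loss = criterion(model(x), y)                  # matmuls accumulate in FP32
scaler.scale(loss).backward()
scaler.step(optimizer)                             # unscale gradients, then step
scaler.update()
```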
• Model Distillation
• The question: can we use senior (good), already-trained neural network(s) to guide a student (new) neural network?
• For more information, see Hinton et al., “Dark Knowledge” / “Distilling the Knowledge in a Neural Network”.
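• The standard recipe from that paper: the student matches the teacher's temperature-softened outputs in addition to the true labels. A minimal sketch (the T and alpha values are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend soft-target guidance from the teacher with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),   # student's softened predictions
        F.softmax(teacher_logits / T, dim=1),       # teacher's softened targets
        reduction="batchmean",
    ) * (T * T)                      # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```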
• DSD: Dense-Sparse-Dense Training
• Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017
• Provides better regularization.
• The idea: train the model (call it Dense), then apply pruning to it (call it Sparse).
• After those two steps, restore the pruned connections and train them again (back to Dense).
• DSD produces the same model architecture but finds a better optimization solution, arriving at a better local minimum and achieving higher prediction accuracy.
• This improves performance a lot across many deep learning models.

## Hardware for Efficient Training

• GPUs for training:
• Nvidia PASCAL GP100 (2016)
• Nvidia Volta GV100 (2017)
• Can perform mixed-precision operations!
• Extremely powerful – the new nuclear bomb of hardware!
• Cloud TPU delivers up to 180 teraflops to train and run machine learning models.
• From Google: “One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod.”
• We have moved from PC Era ==> Mobile-First Era ==> AI-First Era

## Deep Learning Software

• This section changes a lot every year in CS231n due to rapid changes in the deep learning software.
• CPU vs GPU
• GPUs (graphics cards) were originally developed to render graphics for games, 3D media, etc.
• NVIDIA vs AMD
• Deep learning favors NVIDIA over AMD GPUs because NVIDIA pushes deep learning research forward and tailors its architectures to deep learning.
• A CPU has fewer cores, but each core is much faster and more capable: great for sequential tasks. A GPU has many more cores, but each core is much slower and “dumber”: great for parallel tasks.
• GPU cores need to work together, and the GPU has its own memory.
• Matrix multiplication is a prime example of an operation suited to GPUs: it consists of MxN independent output elements that can be computed in parallel.
• The convolution operation can also be parallelized because it consists of independent operations.
• GPU programming frameworks:
• CUDA (NVIDIA only)
• Write C-like code that runs directly on the GPU.
• It's hard to write well-optimized code that runs on the GPU, which is why NVIDIA provides high-level APIs.
• Higher level APIs: cuBLAS, cuDNN, etc
• cuDNN implements backprop, convolutions, recurrent layers, and much more for you!
• In practice you won’t write a parallel code. You will use the code implemented and optimized by others!
• OpenCL
• Similar to CUDA, but runs on any GPU.
• Usually slower.
• Doesn't have much support yet in deep learning software.
• There are a lot of courses for learning parallel programming.
• If you aren’t careful, training can bottleneck on reading data and transferring to GPU. So the solutions are:
• Read all the data into RAM. # If possible
• Use SSD instead of HDD
• Use multiple CPU threads to prefetch data!
• While the GPU is computing, CPU threads fetch the data for you.
• Many frameworks implement this for you because it's a bit painful (see the sketch below)!
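• In PyTorch, for example, the DataLoader handles this prefetching; a minimal sketch (the worker count and the toy dataset are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # CPU worker processes prefetch upcoming batches
    pin_memory=True,   # page-locked host memory speeds up CPU -> GPU copies
)

for images, labels in loader:
    if torch.cuda.is_available():
        images = images.cuda(non_blocking=True)  # overlap copy with GPU compute
    # ... forward / backward pass ...
```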

## Deep Learning Frameworks

• It's a super fast-moving field!
• Currently available frameworks:
• Caffe (UC Berkeley)
• Theano (U Montreal)
• CNTK (Microsoft)
• MXNet (Amazon)
• The instructor thinks that you should focus on TensorFlow and PyTorch.
• The point of deep learning frameworks:
• Easily build big computational graphs.
• Easily compute gradients in computational graphs.
• Run it efficiently on GPU (cuDNN - cuBLAS)
• NumPy doesn’t run on GPU.
• Most frameworks try to mimic NumPy in the forward pass, then compute the gradients for you.
• The code has two parts:
1. Define computational graph.
2. Run the graph and reuse it many times.
• TensorFlow uses a static graph architecture.
• TensorFlow variables live in the graph, while placeholders are fed on each run.
• The global initializer function initializes the variables that live in the graph.
• Use predefined optimizers and losses.
• You can build a full layer with the layers.dense function (see the sketch below).
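• A minimal TensorFlow 1.x-style sketch tying these pieces together (placeholders fed per run, variables living in the graph, a predefined loss and optimizer, layers.dense); it needs the 1.x API (or tf.compat.v1):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x-style API

# Part 1: define the computational graph.
x = tf.placeholder(tf.float32, shape=(None, 64))     # fed on each run
y = tf.placeholder(tf.float32, shape=(None, 1))
h = tf.layers.dense(x, 32, activation=tf.nn.relu)    # a full layer in one call
pred = tf.layers.dense(h, 1)
loss = tf.losses.mean_squared_error(y, pred)         # predefined loss
train_op = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)

# Part 2: run the graph many times.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())      # init variables in the graph
    for _ in range(100):
        xs, ys = np.random.randn(32, 64), np.random.randn(32, 1)
        loss_val, _ = sess.run([loss, train_op], feed_dict={x: xs, y: ys})
```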
• Keras (High level wrapper):
• Keras is a layer on top of TensorFlow that makes common things easy to do.
• So popular!
• It trains a full deep NN in a few lines of code, as the sketch below shows.
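• For instance, a small classifier in a handful of lines (a minimal sketch; the layer sizes are arbitrary):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, batch_size=64)  # with your data
```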
• There are a lot of high-level wrappers:
• Keras
• TFLearn
• TensorLayer
• tf.layers # Ships with TensorFlow
• tf-Slim # Ships with TensorFlow
• tf.contrib.learn # Ships with TensorFlow
• Sonnet # New from DeepMind
• TensorFlow provides pretrained models that you can use for transfer learning.
• TensorBoard adds logging to record loss and other stats; run the server and get pretty graphs!
• TensorFlow supports distributed execution if you want to split your graph across several nodes.
• TensorFlow was actually inspired by Theano; it shares the same ideas and structure.

• PyTorch has three layers of abstraction:
• Tensor: an ndarray that runs on the GPU # like a NumPy array
• Variable: a node in a computational graph; stores data and gradient # like TensorFlow's Tensor, Variable, and placeholder
• Module: an NN layer; may store state or learnable weights # like tf.layers in TensorFlow
• In PyTorch the graph is built as your code runs, which makes debugging easier. This is called a dynamic graph.
• In PyTorch you can define your own autograd functions by writing forward and backward for tensors, though most of the time they are already implemented for you.
• torch.nn is a high-level API, like Keras for TensorFlow. You can compose models out of predefined layers.
• You can define your own nn module (see the sketch after this list)!
• PyTorch also provides optimizers, like TensorFlow.
• It has a DataLoader that wraps a Dataset and provides minibatching, shuffling, and multithreading.
• PyTorch has the best and easiest-to-use pretrained models.
• PyTorch can use Visdom, which is similar to TensorBoard, though TensorBoard seems more powerful.
• PyTorch is new and still evolving compared to Torch; it's still in a beta state.
• PyTorch is best for research.
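• A minimal sketch of a custom nn module, as mentioned above (the two-layer architecture is just an example):

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, d_in, h, d_out):
        super().__init__()
        self.linear1 = nn.Linear(d_in, h)    # Modules can contain other Modules
        self.linear2 = nn.Linear(h, d_out)

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

model = TwoLayerNet(64, 32, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # built-in optimizers
out = model(torch.randn(8, 64))    # the graph is rebuilt on the fly at each call
```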
• TensorFlow builds the graph once, then runs it many times (called a static graph).

• In each PyTorch iteration, we build a new graph (called a dynamic graph).

## Static vs dynamic graphs

• Optimization:
• With static graphs, the framework can optimize the graph for you before it runs (e.g., by fusing operations).
• Serialization:
• Static: once the graph is built, it can be serialized and run without the code that built it, e.g., deployed from C++.
• Dynamic: you always need to keep the code around.
• Conditionals:
• Easier in dynamic graphs; more complicated in static graphs.
• Loops:
• Easier in dynamic graphs; more complicated in static graphs (a sketch of both follows).
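• A minimal sketch of why this is easier in a dynamic framework: in PyTorch, plain Python if/for statements become part of the model, while static TensorFlow 1.x needs special graph ops like tf.cond and tf.while_loop for the same logic:

```python
import torch

def forward(x, w1, w2, steps):
    # Data-dependent conditional: the branch taken can change on every call.
    h = x @ (w1 if x.sum() > 0 else w2)
    # Data-dependent loop: the trip count is ordinary Python.
    for _ in range(steps):
        h = torch.relu(h @ w2)
    return h

x = torch.randn(4, 16)
w1, w2 = torch.randn(16, 16), torch.randn(16, 16)
out = forward(x, w1, w2, steps=3)
```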
• TensorFlow Fold makes dynamic-graph-style models easier to express in TensorFlow through dynamic batching.

• Dynamic graph applications include: recurrent networks and recursive networks.

• Caffe2 uses static graphs; it can train models in Python and also runs on iOS and Android.

• TensorFlow/Caffe2 are used heavily in production, especially on mobile.

## Citation

If you found our work useful, please cite it as:

```bibtex
@article{Chadha2020DeepLearningHardwareAndSoftware,
  title   = {Deep Learning Hardware and Software},
  author  = {Chadha, Aman},
  year    = {2020},
}
```