Review of NeurIPS 2019 for Model Compression
Sessions
- Deep Learning -- Efficient Inference Methods
- Workshop on Energy Efficient Machine Learning and Cognitive Computing
Workshop keynotes
- Efficient Computing for AI and Robotics
- Putting the “Machine” Back in Machine Learning: The Case for Hardware-ML Model Co-design
- Cheap, Fast, and Low Power Deep Learning: I need it now!
- Advances and Prospects for In-memory Computing
- Algorithm-Accelerator Co-Design for Neural Network Specialization
- Adaptive Multi-Task Neural Networks for Efficient Inference
Pruning
- Global Sparse Momentum SGD for Pruning Very Deep Neural Networks: proposes combining momentum SGD with weight importance (determined by a Taylor expansion) to update and prune model weights automatically. Momentum is critical for accelerating weight decay and thus pruning efficiency.
- Channel Gating Neural Networks: proposes an effective way to create a dynamically pruned network for inference via channel gating. It gives clear instructions for designing the network architecture and for training such a network. The inference stage requires a carefully implemented gating function to reduce computation (and therefore improve efficiency).
- AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters: trains an additional set of parameters to prune the underlying network weights (similar to gating).
- Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks: provides a detailed solution for effective channel pruning. Using a trainable scaling factor to guide pruning was studied in Liu et al. 2017, and using a Taylor expansion to predict importance was explored in Molchanov et al. 2016; this paper combines the two techniques, along with tweaks for skip connections and batch normalization, into a concrete filter/channel pruning solution (a minimal sketch of the Taylor-based scoring appears after this list).
- Network Pruning via Transformable Architecture Search: searches for the best width and depth using NAS. It is not a typical NAS; instead, it represents network depth and width probabilistically, learns them gradually on a validation set, and forms the final pruned model with distillation.
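
To make the Taylor-expansion importance used above concrete, here is a minimal PyTorch sketch (my own illustration, not any paper's exact implementation): each channel is scored by $|\gamma_c \cdot \partial L/\partial \gamma_c|$, a first-order estimate of the loss change from zeroing that channel's batch-norm scale factor.

```python
import torch
import torch.nn as nn

# Toy model; the channel-scoring logic below is the point, not the model.
features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
head = nn.Linear(16 * 32 * 32, 10)

x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(head(features(x).flatten(1)), y)
loss.backward()

# First-order Taylor estimate of the loss change from pruning channel c:
# |delta L| ~= |gamma_c * dL/dgamma_c|  (cf. Molchanov et al. 2016).
bn = features[1]
importance = (bn.weight * bn.weight.grad).abs()

# Keep the top 75% of channels; the rest are pruning candidates.
keep = importance.argsort(descending=True)[: int(0.75 * importance.numel())]
print("channels to keep:", sorted(keep.tolist()))
```
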
Low rank
- Singleshot : a scalable Tucker tensor decomposition: proposes efficient algorithms for Tucker decomposition. The decomposition can improve model inference speed and reduce model size.
- [w] Trained Rank Pruning for Efficient Deep Neural Networks: periodically applies SGD to the decomposed matrices to obtain a low-rank network (a generic low-rank sketch appears after this list).
- [w] Pushing the limits of RNN Compression: uses Kronecker products to decompose matrices in RNN layers and obtains 15x or more compression.
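
The low-rank idea shared by these papers can be illustrated with a plain truncated SVD of a dense weight matrix. This is a generic sketch; Tucker and Kronecker decompositions generalize it to higher-order tensors and structured factors.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))  # stand-in for a trained weight matrix

# Truncated SVD: W ~= A @ B with A (512 x r) and B (r x 512).
r = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]
B = Vt[:r, :]

compression = W.size / (A.size + B.size)             # 4x fewer parameters
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)  # relative error
print(f"{compression:.0f}x compression, relative error {err:.2f}")
# A random matrix compresses poorly; trained weights usually have
# faster-decaying spectra, so the error at a given rank is much smaller.
```
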
Quantization
- Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks: proposes a new 8-bit floating-point representation to effectively quantize DNNs.
- The Synthesis of XNOR Recurrent Neural Networks with Stochastic Logic: fully adopts XNOR operations for the LSTM-based RNN model.
- MetaQuant: Learning to Quantize by Learning to Penetrate Non-differentiable Quantization: learns the gradient-update mapping for quantized models instead of using a pre-defined scheme such as the straight-through estimator (sketched after this list).
- A Mean Field Theory of Quantized Deep Networks: The Quantization-Depth Trade-Off: provides guidance on balancing the number of activation quantization levels $N$ against the depth $L$ of a model: $L \propto N^{1.87}$.
- [w] Progressive Stochastic Binarization of Deep Networks: stochastically represents float weights with binary weights in a statistically sound way.
- [w] Regularized Binary Network Training: improves the training procedure for binary networks.
- [w] Neural Networks Weights Quantization: Target None-retraining Ternary (TNT): uses cosine similarity to convert network weights into ternary values.
- [w] Instant Quantization of Neural Networks using Monte Carlo Methods: proposes to quantize model weights by Monte Carlo sampling to reduce retraining. The model size is controlled by the number of samples.
- [w] Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Inference: provides a method to estimate the optimal quantization mapping and to efficiently finetune the quantized model.
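
Since several of the papers above refine or replace the straight-through estimator (STE), here is a minimal PyTorch sketch of STE-based weight binarization, i.e. the kind of pre-defined scheme that MetaQuant learns to replace (an illustration, not MetaQuant itself).

```python
import torch

class SignSTE(torch.autograd.Function):
    """Binarize weights in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return w.sign()

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # STE: identity gradient, clipped to |w| <= 1 as in BinaryConnect.
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

w = torch.randn(4, requires_grad=True)
loss = SignSTE.apply(w).sum()
loss.backward()
print(w.grad)  # nonzero exactly where |w| <= 1
```
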
Architecture
- CondConv: Conditionally Parameterized Convolutions for Efficient Inference: learns dynamic (input-dependent) weights for combining different convolution kernels. It shows a better accuracy/speed trade-off on MobileNet models (see the sketch after this list).
- More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation: passes video sequences through a big-little network, with a depthwise temporal-aggregation module to handle the time-domain sequence. [corrected - don't confuse with another "more is less" paper]
- SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers: demonstrates (for the first time?) CNN architectures that fit on microcontrollers with only dozens of KB of memory and model size.
- Constrained deep neural network architecture search for IoT devices accounting for hardware calibration: applies genetic evolution to mutate the neural architecture (mainly the number of neurons). It also includes realistic measurements of inference time under the resource constraints.
- Einconv: Exploring Unexplored Tensor Network Decompositions for Convolutional Neural Networks: provides an interesting view of convolution as a tensor network and leverages it to search for efficient neural architectures.
- [w] AutoSlim: Towards One-Shot Architecture Search for Channel Numbers: searches for slim subnetworks during a greedy training process to find the best model.
- [w] Energy-Aware Neural Architecture Optimization With Splitting Steepest Descent: proposes splitting neurons along their steepest-descent direction to grow a neural network (NAS). It provides a theoretically sound procedure for assigning new neurons after splitting.
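
As promised under CondConv, here is a simplified sketch of its input-dependent kernel mixing. The paper fuses the per-example loop into a single grouped convolution; this version favors clarity over speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Mix `num_experts` kernels with input-dependent routing weights."""

    def __init__(self, cin, cout, k, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(0.1 * torch.randn(num_experts, cout, cin, k, k))
        self.route = nn.Linear(cin, num_experts)  # routing from pooled input
        self.pad = k // 2

    def forward(self, x):
        # Per-example routing weights, shape (batch, num_experts).
        r = torch.sigmoid(self.route(x.mean(dim=(2, 3))))
        outs = []
        for i in range(x.size(0)):
            # Collapse the experts into one kernel for this example.
            w = (r[i].view(-1, 1, 1, 1, 1) * self.experts).sum(dim=0)
            outs.append(F.conv2d(x[i : i + 1], w, padding=self.pad))
        return torch.cat(outs, dim=0)

layer = CondConv2d(3, 8, 3)
print(layer(torch.randn(2, 3, 16, 16)).shape)  # torch.Size([2, 8, 16, 16])
```
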
Hardware-related
- [w] Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment: proposes l1-norm regularization that encourages bit-slice sparsity for efficient computation on ReRAM-based devices.
- [w] Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance Optimization: reduces energy consumption by regularizing the datapath with the proposed operand Hamming-distance optimization (sketched below).
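
To make the operand Hamming-distance idea concrete, a small illustration (mine, not the paper's method): count the bit flips between consecutive 8-bit operands, a proxy for switching activity and therefore dynamic energy on the datapath.

```python
import numpy as np

def hamming_distance(a, b):
    """Elementwise number of differing bits between two uint8 arrays."""
    return np.unpackbits(np.bitwise_xor(a, b)[..., None], axis=-1).sum(axis=-1)

# Operands streamed consecutively into a MAC unit: fewer bit flips on
# the bus means less switching activity, hence lower dynamic energy.
ops = np.array([0b00001111, 0b00001110, 0b11110000], dtype=np.uint8)
print(hamming_distance(ops[:-1], ops[1:]))  # [1 7]
```
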
Others
- Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask: illustrates the importance of the masking strategy and provides a better way to learn the mask (a "supermask") for training pruned models; a sketch appears after this list.
- Model Compression with Adversarial Robustness: A Unified Optimization Framework: proposes adversarial attack-aware model compression.
- [w] On hardware-aware probabilistic frameworks for resource constrained embedded applications: targets compression for probabilistic models.
- [w] Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference: speeds up the search for the top-k classes in the softmax layer.
- [w] Algorithm-hardware Co-design for Deformable Convolution: improves deformable convolution via rounding, bounding, and square kernel shapes for efficient inference on FPGA devices.
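
A minimal sketch of the supermask idea referenced above: learn a binary mask over frozen random weights using a straight-through gradient (the objective here is a toy placeholder).

```python
import torch

class MaskSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores):
        # Sample a hard 0/1 mask from the learned mask probabilities.
        return torch.bernoulli(torch.sigmoid(scores))

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: treat sampling as identity

w = torch.randn(100)  # frozen random weights (never updated)
scores = torch.zeros(100, requires_grad=True)  # learned mask logits

opt = torch.optim.SGD([scores], lr=0.1)
for _ in range(100):
    effective_w = w * MaskSTE.apply(scores)  # masked subnetwork
    loss = (effective_w.sum() - 1.0) ** 2    # toy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
print("kept weights:", int(torch.sigmoid(scores).round().sum()))
```
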
Application
- Point-Voxel CNN for Efficient 3D Deep Learning: solves the efficiency challenges of applying DL to 3D point cloud applications.
- LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition: uses two LSTM networks (fine and coarse) for video recognition. The coarse network is always on, whereas the fine network is activated based on the output of the coarse network's hidden layers. The gate controlling this activation is a trainable one-layer dense network with a probabilistic nature, and hidden information is shared between the coarse and fine networks.
- [w] YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection: demonstrates a highly efficient YOLO implementation.
- [w] Q8BERT: Quantized 8Bit BERT: applies 8-bit quantization to deep language models such as BERT (see the sketch after this list).
- [w] Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models: distills a transformer model to train a compact model.
- [w] Fully Quantized Transformer for Improved Translation: applies quantization to the transformer network for language translation.
- [w] Spoken Language Understanding on the Edge: presents the system design of a language understanding device for edge applications.
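
To get a feel for Q8BERT-style 8-bit quantization, PyTorch's post-training dynamic quantization is an off-the-shelf analogue (Q8BERT itself uses quantization-aware training during finetuning, which preserves accuracy better).

```python
import torch
import torch.nn as nn

# Stand-in for a transformer feed-forward block; any nn.Linear-heavy
# model (e.g. BERT) works the same way.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Post-training dynamic quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(qmodel(x).shape)  # torch.Size([1, 768])
```
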