The Executor and ParallelExecutor interfaces are unified: users only need to convert a single-card model into a multi-card model through CompiledProgram, then use Executor for training or inference.
This version officially releases the AnalysisConfig inference interface, which supports optimizations such as computational graph analysis and operator fusion, and supports acceleration by third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
The model library releases the first version of the PaddlePaddle video model library, which provides five classic video classification models and generic infrastructure code suitable for video classification tasks. Users can configure and evaluate the models efficiently in one click.
We added support for the NLP semantic representation model BERT, including multi-machine multi-card training and mixed-precision training; it improves training speed by 50%+ over mainstream implementations, and a complete deployment example is available.
The large-scale sparse parameter server benchmark has been released. Asynchronous multi-machine training on CPU gains a built-in reader that significantly improves the IO throughput of click-through-rate estimation tasks. Performance of multi-machine multi-card training has been enhanced in various aspects.
We added support for Intel Deep Learning Boost (VNNI) on the next generation of Intel Xeon Scalable Processors. With it, INT8 inference performance can be improved by 200% over FP32 on some models.
Updates to the Basic Framework
Installation
A Chinese-language auxiliary installation script is available for Linux and MacOS, offering an interactive installation method that helps users quickly complete PaddlePaddle installation in complex environments.
Better support for Windows: CUDA 8 and cuDNN 7 GPU support, as well as the AVX instruction set, MKLDNN, and mnist dataset support, are incorporated. Fixed the problem that occurred when Windows loaded a training model saved by the same version of Paddle on the Linux/Mac platform.
New basic functions for Dynamic Computational Graphs
Dynamic Computational Graphs now provide a tracer, autograd, and a Python Layer/PyLayer. They can run MLP, GAN, PTB RNN, and ResNet models, can be trained through an Optimizer, and support GPU training. A minimal sketch follows.
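As an illustration only, the snippet below sketches eager execution under the experimental imperative mode. It assumes the fluid.imperative namespace of this era (later renamed dygraph); the method names to_variable, backward, and gradient follow the later stabilized API and are assumptions here.

```python
import numpy as np
import paddle.fluid as fluid

# Hedged sketch: ops run eagerly inside the imperative guard.
# Entry-point and method names are assumptions for this release
# (they match the later fluid.dygraph API).
with fluid.imperative.guard():
    x = fluid.imperative.to_variable(np.ones([2, 2], dtype='float32'))
    y = x * x + 3                        # executed immediately, no Program build
    loss = fluid.layers.reduce_sum(y)    # layers ops also trace eagerly here
    loss.backward()                      # autograd over the traced ops
    print(x.gradient())                  # gradient of loss w.r.t. x
```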
Reformed interfaces of Executor and ParallelExecutor
The Executor and ParallelExecutor interfaces are unified. Users only need to convert a single-card model into a multi-card model through CompiledProgram and use Executor for training or inference, as sketched below.
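A minimal sketch of the unified flow, assuming the fluid.compiler.CompiledProgram API introduced in this release; the tiny fc network and random feed data are illustrative placeholders.

```python
import numpy as np
import paddle.fluid as fluid

# Minimal single-card network: one fc layer and a mean loss.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.fc(input=x, size=1)
loss = fluid.layers.mean(y)
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(fluid.default_startup_program())

# Convert the single-card program into a multi-card (data-parallel) one.
compiled_prog = fluid.compiler.CompiledProgram(
    fluid.default_main_program()).with_data_parallel(loss_name=loss.name)

# The same Executor then runs the compiled multi-card program.
feed = {'x': np.random.rand(32, 13).astype('float32')}
loss_val, = exe.run(compiled_prog, feed=feed, fetch_list=[loss.name])
```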
Improved ParallelExecutor
MultiDevSSAGraphBuilder has been refactored to make it easier to extend.
Device locks have been removed from ParallelExecutor to improve the performance of multi-card scheduling.
Optimization of the intermediate representation (IR) and Passes
Improved the Python interfaces of the C++ IR graph and of C++ IR passes.
The IRGraph class is created in framework.py in preparation for writing IR passes in the Python layer.
A new pass is added that supports lock-free network updates.
QuantizationTransformPass is introduced; it performs the graph transformation phase that precedes quantization-aware training.
Optimization of memory and GPU memory
Jemalloc is integrated as a dynamic link library at compile time, improving memory-management performance and reducing the overhead of the underlying framework's memory management.
New GPU memory optimization strategies are adopted, such as memory optimization, the inplace pass, and early deletion of the memory pool; a sketch follows below.
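For illustration, the transform-based strategy can be applied from Python. This is a hedged sketch: fluid.memory_optimize is the legacy entry point, and treating FLAGS_eager_delete_tensor_gb as the toggle for early memory-pool deletion is an assumption; flags generally need to be set before the framework initializes.

```python
import os
import paddle.fluid as fluid

# Assumed flag for early deletion of the memory pool; in practice it must be
# set in the environment before the framework reads its flags.
os.environ['FLAGS_eager_delete_tensor_gb'] = '0.0'

# Apply the cross-op memory-reuse transform to the main program
# before compiling and executing it.
fluid.memory_optimize(fluid.default_main_program())
```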
Overall operator optimization
Each op performs only a single scope query before execution, reducing read-write lock operations (previously 1 to 5 scope queries were needed).
A temporary allocator is integrated to reduce synchronization inside ops.
The py_func operator is added to host Python ops, so users can quickly implement custom operations with its aid (see the sketch after this list).
DDim, Variable Type, and others are refactored to reduce the underlying framework's scheduling overhead.
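A minimal sketch of py_func usage, assuming the fluid.layers.py_func signature of this release; the tanh wrapper and the create_tmp_var helper below are illustrative.

```python
import numpy as np
import paddle.fluid as fluid

def create_tmp_var(name, dtype, shape):
    # py_func requires its output variable to be created in advance.
    return fluid.default_main_program().current_block().create_var(
        name=name, dtype=dtype, shape=shape)

def my_tanh(x):
    # The input arrives as a tensor; convert to NumPy, compute, and return.
    return np.tanh(np.array(x))

x = fluid.layers.data(name='x', shape=[32], dtype='float32')
out = create_tmp_var('out', 'float32', [-1, 32])

# Wrap the Python function as an operator in the program.
fluid.layers.py_func(func=my_tanh, x=x, out=out)
```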
Optimization of Intel FP32 computing
Optimized the density_prior_box operator; single-op speed is 3x faster on four threads.
Optimized the Stack operator; single-op speed is up to 16x faster than in the previous version.
Developed three MKLDNN-based kernels: Transpose, Concat, and Conv3d.
Fixed a precision bug in the MKLDNN kernel of the lrn operator; single-op speed is also 1.3x faster.
Fixed the problem that MKLDNN initialization took 5 GB of memory; initialization now takes 500 MB.
Reduced unnecessary reorders between MKLDNN OP kernels and non-MKLDNN OP kernels.
Improve CPU JitKernel
Improved the JitKernel of sequence pooling; pure-op efficiency is doubled.
Improved the softmax JitKernel; pure-op performance is doubled, and CPU inference performance for the BERT model increases by 26%.
Added common basic primitives: the square of each element (kVSquare), matrix multiplication (kMatMul), the vector maximum (kHMax), and the sum of all elements in a vector (kHSum); see the NumPy sketch below.
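For reference, the semantics of these primitives in NumPy terms; this is illustrative only, since the kernels themselves are JIT-generated C++ implementations.

```python
import numpy as np

x = np.random.rand(8).astype('float32')
a = np.random.rand(4, 8).astype('float32')
b = np.random.rand(8, 4).astype('float32')

sq = np.square(x)     # kVSquare: square of each element
mm = np.matmul(a, b)  # kMatMul: matrix multiplication
hm = np.max(x)        # kHMax: maximum over the vector
hs = np.sum(x)        # kHSum: sum of all elements in the vector
```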
Inference Engine
Server-side Inference
The inference engine AnalysisConfig is officially released, with support for computational graph analysis and optimization, operator fusion, and acceleration by third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
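A hedged sketch of driving AnalysisConfig from Python. The binding names shown (AnalysisConfig, PaddleTensor, create_paddle_predictor in paddle.fluid.core) follow the later stabilized inference API and are assumptions for this release, as is the './model_dir' layout.

```python
import numpy as np
from paddle.fluid.core import (AnalysisConfig, PaddleTensor,
                               create_paddle_predictor)  # assumed binding names

config = AnalysisConfig('./model_dir')  # assumed: directory holding __model__ + params
config.disable_gpu()
config.enable_mkldnn()                  # opt into the Intel MKLDNN acceleration path

predictor = create_paddle_predictor(config)
data = PaddleTensor(np.ones([1, 3, 224, 224], dtype='float32'))
outputs = predictor.run([data])
```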
Pre-released an INT8 inference offline quantization scheme on Intel Xeon Scalable Processors
* Four INT8 kernels based on Intel MKL-DNN have been developed: Conv2D, Pool2D, Quantize, and Dequantize.
* Pre-released the three core Python APIs for calibration (paddle.fluid.contrib.Calibrator).
* Developed the calibration tool, keeping the accuracy loss between FP32 and INT8 within 1% for ResNet-50 and MobileNet-V1 on the ImageNet validation dataset.
* Intel Xeon Scalable Processors with Intel Deep Learning Boost (VNNI) are supported; INT8 inference performance can be improved by 2x on some models.
Accelerated CPU inference
The fused sequence_pool + concat op supports fusing N (< 200) sequence_pool ops and their concatenation into one new op, improving CPU inference of the seqpool model by 56% overall.
Fused consecutively repeated fc ops into one large op, which speeds up CPU inference for the seqpool model by 15%.
Fused the op combination implementing the logic ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar, which accelerates seqpool model CPU inference by 8.2%.
Optimized the CPU kernel of compare_op for the case where an input tensor contains a single element.
New Paddle-TRT support for INT8 calibration and faster GPU inference
VGG and ResNet50 model inference runs twice as fast as with Paddle-TRT float32.
Accuracy of VGG and ResNet50 tested on the ImageNet dataset drops by less than 0.3%.
Operator fusion
Added fusion of fc and conv, applied to the conv_op CUDNN kernel.
Added a pass fusing Conv + Affine Channel; Faster RCNN performance increases by 26.8%.
Added a pass fusing Transpose + Flatten + Concat; MobilenetSSD model performance increases by 15%.
Implemented the CUDA kernel of the beam_search operator and fused the corresponding top-k, elementwise_add, reshape, and log computations into it.
Improved functionality and ease of use
New Python interfaces for the C++ IR graph.
New Python interfaces to the inference library.
Server-side inference supports loading models from memory (see the sketch below).
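Loading a model from memory could look like the following. This is a hedged sketch: set_model_buffer as a Python binding is an assumption mirroring the C++ SetModelBuffer interface, and the '__model__' / 'params' file names are illustrative.

```python
from paddle.fluid.core import AnalysisConfig  # assumed binding name

# Read the serialized program and parameters into memory buffers.
with open('./model_dir/__model__', 'rb') as f:
    prog_buf = f.read()
with open('./model_dir/params', 'rb') as f:   # illustrative combined-params file
    params_buf = f.read()

config = AnalysisConfig('')
# Assumed Python counterpart of the C++ SetModelBuffer interface.
config.set_model_buffer(prog_buf, len(prog_buf), params_buf, len(params_buf))
```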
Miscellaneous Updates
Removed the legacy V2 code. From version 1.3 on, functions of the legacy V1 and V2 versions are no longer supported.
Fixed a bug in the Paddle-TRT elementwise-mul model.
Fixed a bug where model output was abnormal when the Paddle-TRT trt_engine stream received multiple consecutive inputs.
Mobile inference
Enhanced efficiency and increased the inference speed of common models
INT8 inference supports automatic kernel fusion of dequantize with other ops (batch normalization / relu / elementwise add).
The transpose2 operator is optimized for shuffle-channel operations.
The gru operator is optimized with NEON instructions for batch size 1.
Optimized the pooling implementation to support arbitrary padding.
Optimized the implementations of batch normalization, softmax, and elementwise add.
Model inference now supports multiple inputs and multiple outputs.
Implemented the prelu6, cast, and top_k operators.
Fixed an issue where INT8 offline quantization overflowed and produced incorrect results.
Fixed a bug where Winograd might return 0 when the height and width of the feature map are not equal.
Models
PaddleCV Intelligent Vision
Released the PaddlePaddle video model library, including five video classification models: Attention Cluster, NeXtVLAD, LSTM, StNet, and TSN. It provides generic infrastructure code for video classification tasks, covering data reading and preprocessing, training and inference, network models, and metric calculation. Users can add their own network models as needed, directly reuse the code of the other modules, and quickly deploy models.
Added support for the Mask R-CNN object detection model; its accuracy is on par with mainstream implementations.
Added the DeepLabV3+ semantic segmentation model, with depthwise_conv op fusion and GPU memory optimization; compared with the previous version, memory consumption is reduced by 40%.
PaddleNLP Intelligent Text Processing
Integrated the BERT model for NLP semantic representation, with support for multi-machine multi-card training and mixed-precision training; training speed is 50%+ faster than mainstream implementations, and a complete deployment example is available.
The machine translation Transformer model optimizes the decoding computation: a cache of results derived from the encoder output is added to the decoder, doubling inference speed.
PaddleRec Intelligent Recommendation
Sequence Semantic Retrieval gains a single-node multi-threaded example and a single-node multi-card example, along with inference functionality and optimized data pre-processing; the provided complete deployment example is improved.
GRU4Rec adds a negative sampling function; results with bpr loss and with cross-entropy loss match the original.
Distributed Training
Released the large-scale sparse parameter server benchmark
In a real business scenario, on a click-through-rate prediction task with 10 billion features, 1k features per sample on average, and batch size 512, the speedup ratio with 100 workers is 90.5 and throughput is 1.36M/s.
Asynchronous multi-node training on CPU
Released a built-in reader for click-through-rate estimation tasks, which increases total IO throughput by 1300% on the Criteo dataset.
Enhanced multi-machine multi-card GPU horizontal-scaling performance
New parallel modes: PG (ParallelGraph) and MP (Multi-Process). Both perform computation on independent GPU cards, improving performance without affecting model accuracy.
For the ResNet50 model on a single node with 8 V100 cards, the PG and MP modes improve training performance by more than 30%; on 4 machines with 32 cards, PG mode is 46% faster and MP mode is 60% faster.
For the BERT model with 8 V100 cards, the PG and MP modes improve training performance by 26%.
Multi-Process mode is less sensitive to reader speed than ParallelGraph mode.
Enhanced multi-machine multi-card GPU vertical-scaling performance
New features: FP16 and mixed-precision training.
FP16 single-node single-card acceleration: ResNet50 is about 87% faster; BERT is about 70% faster.
With PG mode and mixed precision enabled together, BERT throughput per unit time increases by 120% on a single node with 8 cards.
With mixed-precision training and MP mode enabled together, ResNet50 throughput per unit time increases by 100% on a single V100 node with 8 cards and on 4 nodes with 32 cards.
Sped up convergence of classic models
New features: Dynamic Batch Size and Dynamic Image Resize methods.
ResNet50 on the ImageNet dataset: the number of training rounds before convergence drops to about 1/3 of that of the standard training method.
VisualDL
The VisualDL graph component supports visualization of models saved by Paddle Fluid.