PaddlePaddle 1.3.0

@panyx0718 panyx0718 released this 21 Feb 05:48
· 59 commits to release/1.3 since this release
4b3f9e5

Release Notes

Highlights

  • The Executor and ParallelExecutor interfaces are unified. Users convert a single-card model into a multi-card model through CompiledProgram, then use Executor for training or inference.
  • The AnalysisConfig inference interface is officially released. It supports optimizations such as computational graph analysis and operator fusion, and acceleration through third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
  • The model library adds the PaddlePaddle video model library, which provides 5 classic video classification models and generic skeleton code for video classification tasks. Users can configure a model and complete training and evaluation in one step.
  • Support is added for BERT, the NLP semantic representation model, including multi-machine multi-card training and mixed-precision training. Training is 50%+ faster than mainstream implementations, and a complete deployment example is provided.
  • The large-scale sparse parameter server Benchmark is released. For asynchronous multi-machine training on CPU, a built-in reader is released that significantly improves the IO throughput of click-through rate prediction tasks. Multi-machine multi-card training performance is improved in several respects.
  • Support is added for Intel Deep Learning Boost (VNNI instruction set). On the new generation of Intel Xeon Scalable Processors, INT8 inference performance can reach 2 times that of FP32 on some models.

Updates on Basic framework

  • Installation
    • A Chinese-language installation helper script is available for Linux and MacOS. It installs interactively and helps users complete PaddlePaddle installation quickly in complex environments.
    • Better support for Windows: GPU support for cuda8 and cudnn7 is added, along with support for the AVX instruction set, MKLDNN, and the mnist dataset. Fixed a problem with loading, on Windows, models trained with the same Paddle version on Linux/Mac.
  • New basic functions for Dynamic Computational Graphs
    • Dynamic Computational Graphs now provide tracer, autograd, and Python Layer/PyLayer. They can run the MLP, GAN, ptbRNN, and Resnet models, and they support Optimizer and GPU training.
  • Improved interfaces of Executor and ParallelExecutor
    • The Executor and ParallelExecutor interfaces are unified. Users only need to convert the single card model into a multi-card model through CompiledProgram and use Executor for training or inference.
    • Improved ParallelExecutor
      • MultiDevSSAGraphBuilder is refactored to make it easier to extend.
      • Device locks in ParallelExecutor are removed, which improves multi-card scheduling performance.
  • Optimization of the intermediate representation (IR) and Pass
    • Improve the Python interface of C++ IR graph and the Python interface of C++ IR pass.
    • IRGraph class is created in framework.py to prepare for writing IR Pass in Python layer.
    • A new Pass is added that supports lock-free network updates.
    • QuantizationTransformPass is added. It performs the graph modifications required before training in Quantization Aware Training mode.
  • Optimization of host memory and GPU memory
    • Jemalloc can now be included as a dynamic link library at compile time, which improves memory management performance and reduces the framework's memory management overhead.
    • New GPU memory optimization strategies are added, such as memory optimize, inplace pass, and memory pool early deletion.
  • Overall optimization for Operator
    • Each op only does a single scope query before execution, reducing the read-write lock operations (originally it needs to do 1~5 scope queries)
    • Temporary Allocator is integrated to reduce synchronization in op.
    • A py_func operator is added to support plugging in Python ops. Users can quickly implement custom operations with the py_func Operator.
  • DDim, Variable Type, and more are refactored to reduce the framework's scheduling overhead.
  • Optimizations for Intel FP32 computation
    • Optimized the density_prior_box operator: the single op is 3 times faster on four threads.
    • Optimized the Stack operator: the single op is 16 times faster.
    • Developed three MKLDNN-based kernels: Transpose, Concat, and Conv3d.
    • Fixed a precision bug in the MKLDNN kernel of the lrn operator; the single op is also 1.3 times faster.
    • Fixed the problem that MKLDNN initialization took up 5G of memory; initialization now takes up 500MB.
    • Reduced unnecessary reorders when going from an MKLDNN OP kernel to a non-MKLDNN OP kernel.
  • Improve CPU JitKernel
    • Improved the jitkernel of sequence pooling; the op alone is 2 times faster.
    • Improved the jitkernel of softmax; the op alone is 2 times faster, and CPU inference for the Bert model is 26% faster.
    • Common basic kernels: element-wise square of a vector (kVSquare), matrix multiplication (kMatMul), maximum of a vector (kHMax), and sum of all elements of a vector (kHSum).
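
For orientation, the four basic kernels named in the last item can be read as the plain-Python reference semantics below. This is only a sketch: the function names and signatures are hypothetical, and the real JitKernel implementations are optimized CPU code inside the framework, not Python.

```python
# Plain-Python reference semantics for the four basic kernels named in the
# release notes. Function names and signatures are hypothetical.


def v_square(x):
    """kVSquare: square each element of a vector."""
    return [v * v for v in x]


def mat_mul(a, b):
    """kMatMul: product of an m x k matrix and a k x n matrix (lists of rows)."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [
        [sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for row in a
    ]


def h_max(x):
    """kHMax: horizontal reduction that returns the largest element."""
    return max(x)


def h_sum(x):
    """kHSum: horizontal reduction that returns the sum of all elements."""
    return sum(x)
```

The "H" kernels reduce a whole vector to one value, while kVSquare keeps the vector length unchanged.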

Inference Engine

Server-side Inference

  • The AnalysisConfig inference interface is officially released, with support for optimizations such as computational graph analysis and operator fusion, and for acceleration through third-party libraries such as Intel MKLDNN and the Nvidia TensorRT sub-graph engine.
  • Pre-release of an INT8 offline quantization scheme for inference on Intel Xeon Scalable Processors
    • Four INT8 kernels based on Intel MKL-DNN are developed: Conv2D, Pool2D, Quantize, and Dequantize.
    • Pre-release of the 3 core Python APIs for Calibration (paddle.fluid.contrib.Calibrator).
    • A Calibration tool is developed that keeps the accuracy difference between FP32 and INT8 within 1% for ResNet-50 and MobileNet-V1 on the ImageNet validation dataset.
    • Intel Xeon Scalable Processors with Intel Deep Learning Boost (VNNI) are supported. INT8 inference performance can be improved by 2 times on some models.
  • Accelerated CPU inference
    • Fuse sequence pooling and concat: N (<200) sequence_pooling ops followed by a concat are fused into one new op, which improves CPU inference of the seqpool model by 56% overall.
    • Fuse consecutive repeated fc ops into one large op, which speeds up CPU inference of the seqpool model by 15%.
    • Fuse the op combination that computes ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar, which speeds up CPU inference of the seqpool model by 8.2%.
    • Optimize the CPU Kernel of compare_op for the case where the number of elements in the input tensor is 1.
  • New Paddle-TRT support for Calibration INT8, and faster GPU inference
    • Inference of the VGG and Resnet50 models runs twice as fast as with Paddle-TRT float32.
    • Tested on the imagenet dataset, the accuracy of VGG and Resnet50 drops by less than 0.3%.
  • Operator fusion
    • Two fuses related to fc and con are added, applied to the conv_op CUDNN kernel.
    • A fusion pass for Conv+Affine Channel is added, which improves Faster RCNN performance by 26.8%.
    • A fusion pass for Transpose+Flatten+Concat is added, which improves MobilenetSSD performance by 15%.
    • Implement the CUDA Kernel of the beam_search operator and fuse the corresponding top-k, elementwise_add, reshape, and log calculations into the beam_search operator.
  • Improved functionality and ease of use
    • New Python interfaces for C++ IR graph.
    • New Python interfaces for the inference library.
    • Server-side inference supports loading models from memory.
  • Miscellaneous Updates
    • Remove the legacy V2 code. From version 1.3, functions in the V1 and V2 legacy version are no longer supported.
    • Fixed a bug that caused problems when running models with elementwise-mul on Paddle-TRT.
    • Fixed a bug where the model output was abnormal when Paddle-TRT trt_engine stream accepts multiple consecutive inputs.
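
The fused expression ((X * Y).^2 - (X.^2 * Y.^2)) .* scalar listed under CPU inference can be illustrated with a small sketch. It assumes MATLAB-style notation, which the notes do not spell out: `*` is a matrix product, while `.^` and `.*` are element-wise. The function names are hypothetical, and this is not the framework's kernel. It only shows why one fused pass can replace several ops: each output element needs just the running sum of the products and the running sum of their squares.

```python
def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for row in a
    ]


def square(m):
    """Element-wise square of a matrix."""
    return [[v * v for v in row] for row in m]


def unfused(x, y, scalar):
    """Evaluate the expression op by op, materializing each intermediate."""
    xy_sq = square(matmul(x, y))            # (X * Y).^2
    sq_prod = matmul(square(x), square(y))  # X.^2 * Y.^2
    return [
        [(a - b) * scalar for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(xy_sq, sq_prod)
    ]


def fused(x, y, scalar):
    """Same result in one pass over the output, with no intermediate matrices."""
    inner, cols = len(y), len(y[0])
    out = []
    for row in x:
        out_row = []
        for j in range(cols):
            dot = 0.0     # accumulates sum_p x[i][p] * y[p][j]
            dot_sq = 0.0  # accumulates sum_p (x[i][p] * y[p][j]) ** 2
            for p in range(inner):
                t = row[p] * y[p][j]
                dot += t
                dot_sq += t * t
            out_row.append((dot * dot - dot_sq) * scalar)
        out.append(out_row)
    return out
```

Both functions return the same matrix; the fused version avoids allocating the three intermediate results.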

Mobile inference

  • Efficiency optimizations that speed up inference of common models
    • int8 inference supports automatic kernel fusion of dequantize with other ops (batch normalization/relu/elementwise add).
    • The transpose2 operator is optimized for shuffle channel operations.
    • The gru operator is optimized with neon instructions, and further optimized for a batch size of 1.
    • Optimize and implement pooling to support arbitrary padding.
    • Optimize and implement batch normalization, softmax, elementwise add.
  • Added support for inference of models with multiple inputs and multiple outputs.
  • Implemented the prelu6 operator, cast operator, and top_k operator.
  • Fixed an issue where overflow in int8 offline quantization produced incorrect results.
  • Fixed a bug where the winograd implementation might return 0 when the height and width of the input feature map are not equal.
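
The notes do not describe the cause of the int8 offline quantization overflow. As a general illustration only, and not a description of the actual fix, the sketch below shows how an offline scale derived from calibration data can overflow the int8 range at inference time, and how saturation keeps the result sensible. The symmetric max-abs scale and all names are assumptions made for this example.

```python
def quantize(values, scale, saturate=True):
    """Map floats to int8 codes with q = round(v / scale)."""
    codes = []
    for v in values:
        q = round(v / scale)
        if saturate:
            q = max(-128, min(127, q))   # clamp to the int8 range
        else:
            q = (q + 128) % 256 - 128    # emulate two's-complement wrap-around
        codes.append(q)
    return codes


def dequantize(codes, scale):
    """Map int8 codes back to floats."""
    return [q * scale for q in codes]


# Offline step: derive a symmetric scale from (hypothetical) calibration data.
calibration = [-0.5, 0.25, 1.0]
scale = max(abs(v) for v in calibration) / 127

# Inference step: 2.0 lies outside the calibrated range [-1.0, 1.0].
saturated = quantize([2.0], scale)                 # stays at the int8 maximum
wrapped = quantize([2.0], scale, saturate=False)   # wraps to a negative code
```

With saturation the out-of-range input maps to the largest int8 code; without it the code wraps around and dequantizes to a value with the wrong sign.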

Models

  • PaddleCV Intelligent Vision
    • The PaddlePaddle video model library is released, including five video classification models: Attention Cluster, NeXtVLAD, LSTM, stNet, and TSN. It provides generic skeleton code for video classification tasks, covering data reading and preprocessing, training and inference, network models, and metric calculation. Users can add their own network models as needed, reuse the code of the other modules directly, and deploy models quickly.
    • The Mask R-CNN object detection model is added, with results on par with mainstream implementations.
    • Semantic segmentation DeepLabV3+ model: with depthwise_conv op fusion and GPU memory optimization, GPU memory usage is reduced by 40% compared with the previous version.
  • PaddleNLP Intelligent Text Processing
    • The BERT model for NLP semantic representation is added, with support for multi-machine multi-card training and mixed-precision training. Training is 50%+ faster than mainstream implementations, and a complete deployment example is provided.
    • The machine translation Transformer model optimizes decoding: the decoder caches the results computed from the encoder output, which doubles inference speed.
  • PaddleRec Intelligent Recommendation
    • Sequence Semantic Retrieval adds single-node multi-threaded and single-node multi-card examples, adds inference, optimizes data preprocessing, and improves the deployment example.
    • GRU4Rec adds negative sampling; results with bpr loss and cross entropy loss are on par with the original work.
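
The Transformer decoding optimization relies on the fact that the encoder output does not change while a sentence is being decoded, so anything computed only from it can be computed once and reused at every step. The sketch below illustrates that idea with a toy loop. The class, the `project` function, and the decoding logic are hypothetical and are not taken from the model code.

```python
class EncoderOutputCache:
    """Caches a value derived from the encoder output across decoding steps."""

    def __init__(self, project):
        self._project = project  # function applied to the encoder output
        self._cached = None
        self.calls = 0           # how many times the projection actually ran

    def get(self, encoder_output):
        # Compute on the first step only; later steps reuse the stored result.
        if self._cached is None:
            self._cached = self._project(encoder_output)
            self.calls += 1
        return self._cached


def decode(encoder_output, steps, cache):
    """Toy decoding loop: every step reads the projected encoder output."""
    tokens = []
    for step in range(steps):
        projected = cache.get(encoder_output)
        tokens.append(projected[step % len(projected)])
    return tokens


cache = EncoderOutputCache(lambda enc: [2 * v for v in enc])
tokens = decode([1, 2, 3], steps=5, cache=cache)
```

Here the projection runs once for five decoding steps; without the cache it would run once per step.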

Distributed Training

  • Release Large-scale sparse parameter server Benchmark
    • In a real business scenario, on a click-through rate prediction task with a feature scale of 10 billion, an average of 1k features per sample, and batch=512, the speedup with 100 workers is 90.5 and the throughput is 1.36M/s.
  • Asynchronous multi-node training on CPU
    • A built-in reader for click-through rate prediction tasks is released, which increases total IO throughput by 1300% on the Criteo dataset.
  • Improved horizontal scaling performance of multi-machine multi-card GPU training
    • New parallel modes: PG (ParallelGraph) and MP (Multi-Process). Computation runs independently on each GPU card, which improves performance without affecting model accuracy.
    • For ResNet50 on a single node with 8 V100 cards, the PG and MP modes improve training performance by more than 30%. On 4 nodes with 32 cards, PG mode is 46% faster and MP mode is 60% faster.
    • For BERT on 8 V100 cards, the PG and MP modes improve training performance by 26%.
    • Multi-Process mode is less sensitive to speed of Reader than Parallel-Graph mode.
  • Improved vertical scaling performance of multi-machine multi-card GPU training
    • New features: fp16 and mixed-precision training.
    • Fp16 speedup on a single node with a single card: ResNet50 is about 87% faster and BERT is about 70% faster.
    • With PG and mixed precision both enabled, BERT throughput per unit time increases by 120% on a single node with 8 cards.
    • With mixed-precision training and MP mode both enabled, ResNet50 throughput per unit time increases by 100% on V100, both on a single node with 8 cards and on 4 nodes with 32 cards.
  • Faster convergence for typical models
    • New features: Dynamic Batch Size and Dynamic Image Resize.
    • Resnet50 on the Imagenet dataset: the number of training epochs needed to converge drops to about 1/3 of that of the standard training method.
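
The figures in this section mix speedup ratios and percentage gains. The small helpers below show how to compare them. They assume every percentage is a gain over the baseline, which the notes do not state explicitly, and the function names are made up for this example.

```python
def scaling_efficiency(speedup, workers):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / workers


def gain_to_factor(percent_gain):
    """Convert 'improved by P%' into a multiplier on the baseline."""
    return 1.0 + percent_gain / 100.0


# Parameter server benchmark: speedup of 90.5 with 100 workers.
ps_efficiency = scaling_efficiency(90.5, 100)

# Built-in reader: total IO throughput increased by 1300%.
reader_factor = gain_to_factor(1300)

# ResNet50 with mixed precision and MP mode: throughput increased by 100%.
resnet_factor = gain_to_factor(100)
```

Under that reading, the parameter server benchmark reaches about 90.5% of linear scaling, a 1300% increase corresponds to 14 times the baseline throughput, and a 100% increase corresponds to 2 times.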

VisualDL

  • VisualDL graph supports visualization of models saved by Paddle fluid.