Commit d0fa332c by Ting PAN

Select pybind11 to expose the C++ API

1 parent 1d03e8e2
Showing with 1363 additions and 1377 deletions
@@ -10,3 +10,6 @@
 [submodule "ThirdParty/cub"]
     path = ThirdParty/cub
     url = https://github.com/NVlabs/cub
+[submodule "ThirdParty/pybind11"]
+    path = ThirdParty/pybind11
+    url = https://github.com/pybind/pybind11
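With the submodule in place, the pybind11 headers become available to the build (see the CMake hunk below). For orientation, here is a minimal sketch of what exposing a C++ function through pybind11 looks like; the module name libdragon and the bound function are illustrative assumptions, not code from this commit:

#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Hypothetical C++ function standing in for the real Dragon bindings.
int CreateTensor(const std::string& name) { return 0; }

// PYBIND11_MODULE generates the Python module entry point at import time.
PYBIND11_MODULE(libdragon, m) {
    m.doc() = "Dragon C++ API (sketch)";
    m.def("CreateTensor", &CreateTensor, py::arg("name"),
          "Create a tensor in the default workspace (illustrative).");
}

Compared with a hand-written CPython wrapper, pybind11 derives the argument conversions from the C++ signature itself, which keeps the binding layer close to the C++ API it exposes.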
 ------------------------------------------------------------------------
 The list of most significant changes made over time in Dragon.
-Dragon 0.3.0.0 (20190110)
+Dragon 0.3.0.0 (20190309)
 DRAGON_VERSION == 3000
 Changes (w.r.t. Dragon 0.2.2.13):
@@ -24,6 +24,8 @@ Preview Features:
 - Use ``Eigen`` as the default cpu math library instead of ``OpenBLAS``.
+- Use ``PyBind11`` as the default python module exporter.
 - Integer data types support for common operators,
   see the documentation for more detailed information.
@@ -32,6 +34,8 @@ Preview Features:
   which unifies the naming of static and dynamic computation graphs.
+- The behavior of accumulating gradients has been canceled.
 Bugs fixed:
......
@@ -8,23 +8,22 @@
 Quick Reference
 ---------------
 ===============================  =============================================================================
 List                             Brief
 ===============================  =============================================================================
 `EnableCPU`_                     Enable CPU mode globally.
-`IsCUDADriverSufficient`_        Is the CUDA driver sufficient?
 `EnableCUDA`_                    Enable CUDA mode globally.
 `SetRandomSeed`_                 Set the global random seed.
 `GetRandomSeed`_                 Get the global random seed.
 `SetGPU`_                        Set the global id of GPU.
 `GetGPU`_                        Get the global id of GPU.
-`SetDebugMode`_                  Enable Debug mode globally.
+`SetGraphOptimizationLevel`_     Set the default level of graph optimization.
 `LogMetaGraph`_                  Enable to log the meta graph globally.
 `LogOptimizedGraph`_             Enable to log the optimized graph globally.
 `ExportMetaGraph`_               Enable to export all runnable meta graphs into text files.
 `SetLoggingLevel`_               Set the minimum level of Logging.
 `SetLoggingFile`_                Redirect the logging into the specific file.
 ===============================  =============================================================================

 API Reference
 -------------
@@ -33,13 +32,12 @@ API Reference
     :members:

 .. _EnableCPU: #dragon.config.EnableCPU
-.. _IsCUDADriverSufficient: #dragon.config.IsCUDADriverSufficient
 .. _EnableCUDA: #dragon.config.EnableCUDA
 .. _SetRandomSeed: #dragon.config.SetRandomSeed
 .. _GetRandomSeed: #dragon.config.GetRandomSeed
 .. _SetGPU: #dragon.config.SetGPU
 .. _GetGPU: #dragon.config.GetGPU
-.. _SetDebugMode: #dragon.config.SetDebugMode
+.. _SetGraphOptimizationLevel: #dragon.config.SetGraphOptimizationLevel
 .. _LogMetaGraph: #dragon.config.LogMetaGraph
 .. _LogOptimizedGraph: #dragon.config.LogOptimizedGraph
 .. _ExportMetaGraph: #dragon.config.ExportMetaGraph
......
@@ -27,6 +27,7 @@ C++ Binding Wrapper
    core/workspace
    core/tensor_utils
    core/mpi
+   core/cuda
    core/gradient_maker

 ==============================  =======================================================================
@@ -34,11 +35,13 @@ List Brief
 ==============================  =======================================================================
 `dragon.core.workspace`_        The interfaces of Workspace, mostly wrappers of C++.
 `dragon.core.gradient_maker`_   The generator of GradientOps.
-`dragon.core.tensor_utils`_     The Tensor utilities.
-`dragon.core.mpi`_              The MPI utilities.
+`dragon.core.tensor_utils`_     List some extended Tensor C++ API.
+`dragon.core.mpi`_              List some useful MPI C++ API.
+`dragon.core.cuda`_             List some useful CUDA C++ API.
 ==============================  =======================================================================

 .. _dragon.core.mpi: core/mpi.html
+.. _dragon.core.cuda: core/cuda.html
 .. _dragon.core.scope: core/scope.html
 .. _dragon.core.tensor: core/tensor.html
 .. _dragon.core.tensor_utils: core/tensor_utils.html
......
+===========
+:mod:`CUDA`
+===========
+
+.. toctree::
+   :hidden:
+
+Quick Reference
+---------------
+
+==============================  =============================================================================
+List                            Brief
+==============================  =============================================================================
+`IsCUDADriverSufficient`_       Is the CUDA driver sufficient?
+`GetDevice`_                    Get the current active cuda device.
+`SynchronizeStream`_            Synchronize the specified cuda stream.
+==============================  =============================================================================
+
+.. automodule:: dragon.core.cuda
+    :members:
+
+.. _IsCUDADriverSufficient: #dragon.core.cuda.IsCUDADriverSufficient
+.. _GetDevice: #dragon.core.cuda.GetDevice
+.. _SynchronizeStream: #dragon.core.cuda.SynchronizeStream
\ No newline at end of file
@@ -16,10 +16,9 @@ List Brief
 `FromPyArray`_                  Create a Tensor from an existing Array.
 `SetPyArray`_                   Set a Tensor from an existing Array.
 `ToPyArray`_                    Create an Array from an existing Tensor.
-`ToPyArrayEx`_                  Create a const Array from an existing Tensor.
+`GetStorage`_                   Get the storage of an existing Tensor.
 `ToCPUTensor`_                  Switch the storage of an existing Tensor to cpu memory.
 `ToCUDATensor`_                 Switch the storage of an existing Tensor to cuda memory.
-`GetTensorInfo`_                Get the info of an existing Tensor.
 ==============================  =============================================================================

 API Reference
@@ -33,7 +32,6 @@ API Reference
 .. _FromPyArray: #dragon.core.tensor_utils.FromPyArray
 .. _SetPyArray: #dragon.core.tensor_utils.SetPyArray
 .. _ToPyArray: #dragon.core.tensor_utils.ToPyArray
-.. _ToPyArrayEx: #dragon.core.tensor_utils.ToPyArrayEx
+.. _GetStorage: #dragon.core.tensor_utils.GetStorage
 .. _ToCPUTensor: #dragon.core.tensor_utils.ToCPUTensor
 .. _ToCUDATensor: #dragon.core.tensor_utils.ToCUDATensor
-.. _GetTensorInfo: #dragon.core.tensor_utils.GetTensorInfo
\ No newline at end of file
@@ -14,7 +14,7 @@ List Brief
 `HasTensor`_                    Query whether the tensor is registered in the current workspace.
 `CreateFiller`_                 Create the filler in the backend.
 `GetTensorName`_                Query the name represented in the current workspace.
-`RenameTensor`_                 Rename a tensor in the current workspace.
+`SetTensorAlias`_               Bind an alias to an existing tensor.
 `FeedTensor`_                   Feed the values to the given tensor.
 `FetchTensor`_                  Fetch the values of the given tensor.
 `ResetTensor`_                  Reset the memory of the given tensor.
@@ -27,7 +27,7 @@ Operator
 ==============================  =============================================================================
 List                            Brief
 ==============================  =============================================================================
-`RunOperator`_                  Create and Run the operator in the VM backend.
+`RunOperator`_                  Run the operator in the VM backend.
 ==============================  =============================================================================
@@ -39,7 +39,6 @@ List Brief
 ==============================  =============================================================================
 `CreateGraph`_                  Create the graph in the backend.
 `RunGraph`_                     Run the specific graph.
-`RunGraphEx`_                   Run the graph from the meta definition.
 ==============================  =============================================================================

 Misc
@@ -73,14 +72,13 @@ API Reference
 .. _CreateGraph: #dragon.core.workspace.CreateGraph
 .. _HasTensor: #dragon.core.workspace.HasTensor
 .. _GetTensorName: #dragon.core.workspace.GetTensorName
-.. _RenameTensor: #dragon.core.workspace.RenameTensor
+.. _SetTensorAlias: #dragon.core.workspace.SetTensorAlias
 .. _CreateFiller: #dragon.core.workspace.CreateFiller
 .. _FetchTensor: #dragon.core.workspace.FetchTensor
 .. _FeedTensor: #dragon.core.workspace.FeedTensor
 .. _ResetTensor: #dragon.core.workspace.ResetTensor
 .. _RunOperator: #dragon.core.workspace.RunOperator
 .. _RunGraph: #dragon.core.workspace.RunGraph
-.. _RunGraphEx: #dragon.core.workspace.RunGraphEx
 .. _Snapshot: #dragon.core.workspace.Snapshot
 .. _Restore: #dragon.core.workspace.Restore
 .. _LogMetaGraph: #dragon.core.workspace.LogMetaGraph
......
@@ -42,7 +42,6 @@ List Brief
 `NNResize`_          Resize the image with the *Nearest-Neighbor* method.
 `BilinearResize`_    Resize the image with the *Bi-Linear* method.
 `BiasAdd`_           Add the bias across channels to a *NCHW* or *NHWC* input.
-`DenseConcat`_       Memory-efficient concatenation for DenseNet. `[Huang et.al, 2017] <http://arxiv.org/abs/1608.06993>`_.
 `DropBlock2d`_       Randomly drop the outputs according to the spatial blocks. `[Ghiasi et.al, 2018] <https://arxiv.org/abs/1810.12890>`_.
 ===================  ======================================================================
@@ -113,7 +112,9 @@ List Brief
 `Eltwise`_           Element-wise Sum or Product of an arbitrary number of inputs.
 `Affine`_            Calculate *Y = Ax + b* along the given range of axes.
 `GramMatrix`_        Calculate the gram matrix. `[Gatys et.al, 2016] <https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf>`_.
-`Moments`_           Compute the mean and variance of inputs along the given axes.
+`Moments`_           Calculate the mean and variance of inputs along the given axes.
+`Accumulate`_        Calculate *y = alpha * x + beta * y*.
+`MovingAverage`_     Calculate *y = (1 - decay) * x + decay * y*.
 ==================   ======================================================================

 Normalization
@@ -174,12 +175,11 @@ Misc
 =================    ======================================================================
 List                 Brief
 =================    ======================================================================
-`AsType`_            Cast the data type of inputs to a specific one.
+`Cast`_              Cast the data type of inputs to a specific one.
 `Run`_               Run a custom operator. (Without GradientFlow)
 `Template`_          Run a custom operator. (With GradientFlow)
 `Accuracy`_          Calculate the Top-K accuracy.
 `StopGradient`_      Return the identity of input with truncated gradient flow.
-`MovingAverage`_     Calculate the moving average.
 =================    ======================================================================

 Contrib
@@ -268,6 +268,8 @@ List Brief
 .. _Affine: operators/arithmetic.html#dragon.operators.arithmetic.Affine
 .. _GramMatrix: operators/arithmetic.html#dragon.operators.arithmetic.GramMatrix
 .. _Moments: operators/arithmetic.html#dragon.operators.arithmetic.Moments
+.. _Accumulate: operators/arithmetic.html#dragon.operators.arithmetic.Accumulate
+.. _MovingAverage: operators/arithmetic.html#dragon.operators.arithmetic.MovingAverage
 .. _BatchNorm: operators/norm.html#dragon.operators.norm.BatchNorm
 .. _GroupNorm: operators/norm.html#dragon.operators.norm.GroupNorm
@@ -304,12 +306,11 @@ List Brief
 .. _Less: operators/control_flow.html#dragon.operators.control_flow.Less
 .. _Greater: operators/control_flow.html#dragon.operators.control_flow.Greater
-.. _AsType: operators/misc.html#dragon.operators.misc.AsType
+.. _Cast: operators/misc.html#dragon.operators.misc.Cast
 .. _Run: operators/misc.html#dragon.operators.misc.Run
 .. _Template: operators/misc.html#dragon.operators.misc.Template
 .. _Accuracy: operators/misc.html#dragon.operators.misc.Accuracy
 .. _StopGradient: operators/misc.html#dragon.operators.misc.StopGradient
-.. _MovingAverage: operators/misc.html#dragon.operators.misc.MovingAverage
 .. _Proposal: operators/contrib/rcnn.html#dragon.operators.contrib.rcnn.ops.Proposal
......
@@ -19,16 +19,12 @@ ToolBox
    :hidden:

    tools/db
-   tools/im2db
-   tools/summary_writer
    tools/tensorboard

 ====================  ====================================================================================
 List                  Brief
 ====================  ====================================================================================
 `LMDB`_               A wrapper of LMDB package.
-`IM2DB`_              Make the sequential database for images.
-`SummaryWriter`_      Write summaries for DragonBoard.
 `TensorBoard`_        Write summaries for TensorBoard.
 ====================  ====================================================================================
@@ -38,8 +34,5 @@ List Brief
     <p style="text-indent:1.5em; font-size: 18px; max-width: 830px;">

 .. _pip: https://pypi.python.org/pypi/pip
 .. _LMDB: tools/db.html
-.. _IM2DB: tools/im2db.html
-.. _SummaryWriter: tools/summary_writer.html
 .. _TensorBoard: tools/tensorboard.html
-====================
-:mod:`SummaryWriter`
-====================
-
-.. toctree::
-   :hidden:
-
-Quick Reference
----------------
-
-====================  =============================================================================
-List                  Brief
-====================  =============================================================================
-`ScalarSummary`_      Write scalar summary.
-====================  =============================================================================
-
-API Reference
--------------
-
-.. currentmodule:: dragon.tools.summary_writer
-.. autoclass:: ScalarSummary
-    :members:
-
-    .. automethod:: __init__
-
-.. _ScalarSummary: #dragon.tools.summary_writer.ScalarSummary
\ No newline at end of file
@@ -2,40 +2,30 @@
 :mod:`dragon.utils`
 ===================

-Wrapper
--------
+Vision
+------

 .. toctree::
    :hidden:

+   utils/vision/database
    utils/vision/data_batch
-
-===================================  =====================================================================
-List                                 Brief
-===================================  =====================================================================
-`dragon.utils.vision.data_batch`_    Efficient Batch data provider based on `LMDB`_.
-===================================  =====================================================================
-
-Component
----------
-
-.. toctree::
-   :hidden:
-
    utils/vision/data_reader
    utils/vision/data_transformer
    utils/vision/blob_fetcher

 =========================================  =====================================================================
 List                                       Brief
 =========================================  =====================================================================
+`dragon.utils.vision.im2db`_               Make the sequential database for images.
+`dragon.utils.vision.data_batch`_          Efficient Batch data provider based on `LMDB`_.
 `dragon.utils.vision.data_reader`_         Queue encoded string from `LMDB`_.
 `dragon.utils.vision.data_transformer`_    Queue transformed images from `DataReader`_.
 `dragon.utils.vision.blob_fetcher`_        Queue blobs from `DataTransformer`_.
 =========================================  =====================================================================

 .. _LMDB: http://lmdb.readthedocs.io/en/release
+.. _dragon.utils.vision.im2db: utils/vision/database.html
 .. _DataReader: utils/vision/data_reader.html#dragon.utils.vision.data_reader
 .. _DataTransformer: utils/vision/data_transformer.html#dragon.utils.vision.data_transformer
 .. _dragon.utils.vision.data_batch: utils/vision/data_batch.html
......
-============
-:mod:`IM2DB`
-============
+===============
+:mod:`Database`
+===============

 .. toctree::
    :hidden:
@@ -19,8 +19,8 @@ List Brief
 API Reference
 -------------

-.. automodule:: dragon.tools.im2db
+.. automodule:: dragon.utils.vision.im2db
     :members:

-.. _resize_image: #dragon.tools.im2db.resize_image
-.. _make_db: #dragon.tools.im2db.make_db
+.. _resize_image: #dragon.utils.vision.im2db.resize_image
+.. _make_db: #dragon.utils.vision.im2db.make_db
\ No newline at end of file
@@ -20,20 +20,23 @@ VirtualBox
    vm/caffe
    vm/theano
+   vm/torch

 ====================  ====================================================================================
 List                  Brief
 ====================  ====================================================================================
 `Theano`_             **Theano** is an inception of the modern deep learning frameworks.
 `Caffe`_              **Caffe** is one of the most famous deep learning frameworks for Computer Vision.
+`PyTorch`_            **PyTorch** provides straightforward operations for research prototyping.
 ====================  ====================================================================================

 .. |para| raw:: html

     <p style="text-indent:1.5em; font-size: 18px; max-width: 830px;">

 .. _TinyDragon: ../index.html#tinydragon
 .. _Theano: vm/theano.html
 .. _Caffe: vm/caffe.html
+.. _PyTorch: vm/torch.html
 .. _TensorFlow: ../index.html#tensorflow
@@ -66,7 +66,6 @@ List Brief
 `AddLayer`_           The extended implementation of ``EltwiseLayer``.
 `ConcatLayer`_        The implementation of ``ConcatLayer``.
 `SliceLayer`_         The implementation of ``SliceLayer``.
-`DenseConcatLayer`_   The implementation for `DenseNet`_.
 `CropLayer`_          The implementation of ``CropLayer``.
 `ReshapeLayer`_       The implementation of ``ReshapeLayer``.
 `PermuteLayer`_       The implementation of ``PermuteLayer``.
@@ -180,7 +179,6 @@ API Reference
 .. _AddLayer: #dragon.vm.caffe.layers.common.AddLayer
 .. _ConcatLayer: #dragon.vm.caffe.layers.common.ConcatLayer
 .. _SliceLayer: #dragon.vm.caffe.layers.common.SliceLayer
-.. _DenseConcatLayer: #dragon.vm.caffe.layers.common.DenseConcatLayer
 .. _CropLayer: #dragon.vm.caffe.layers.common.CropLayer
 .. _ReshapeLayer: #dragon.vm.caffe.layers.common.ReshapeLayer
 .. _PermuteLayer: #dragon.vm.caffe.layers.common.PermuteLayer
@@ -210,12 +208,10 @@ API Reference
 .. _MPIBroadcastLayer: #dragon.vm.caffe.layers.mpi.MPIBroadcastLayer
 .. _MPIGatherLayer: #dragon.vm.caffe.layers.mpi.MPIGatherLayer

 .. _Layer.Setup: #dragon.vm.caffe.layer.Layer.Setup
 .. _Layer.Fill: #dragon.vm.caffe.layer.Layer.Fill
 .. _LMDB: http://lmdb.readthedocs.io/en/release
-.. _DenseNet: http://arxiv.org/abs/1608.06993
 .. _LayerSetUp(layer.hpp, L91): https://github.com/BVLC/caffe/blob/effcdb0b62410b2a6a54f18f23cf90733a115673/include/caffe/layer.hpp#L91
 .. _DataParameter.source: https://github.com/BVLC/caffe/blob/effcdb0b62410b2a6a54f18f23cf90733a115673/src/caffe/proto/caffe.proto#L647
 .. _DataParameter.prefetch: https://github.com/BVLC/caffe/blob/effcdb0b62410b2a6a54f18f23cf90733a115673/src/caffe/proto/caffe.proto#L672
......
+============
+:mod:`Torch`
+============
+
+Abstraction
+-----------
+
+|para| `PyTorch`_ provides straightforward operations for research prototyping.
+
+|para| We are aware that **Dragon** is a graph-based framework with strict naming
+for tensors, operators, and workspaces, while `Torch`_ is not.
+A simple way to bridge their differences is **JIT**, which traces the anonymous expressions
+and dispatches a series of executions to the backend. In that case, **AutoGrad** is just
+a trick (remember the *Chain Rule*).
+
+|para| Rewriting the GC (*Garbage Collection*) is crucial in this role,
+as costly deconstruction of memories and operators must be avoided.
+We can either persist an Operator (i.e., a **Module**),
+or reuse several memories in turn (i.e., a **MemoryPool**), by naming them formally.
+
+|para| We are still working hard to cover the original PyTorch operators;
+however, a bunch of extended operators from many other frameworks can be used.
+Our **PyTorch** will be unique and more powerful than the official one.
+
+Related Work
+------------
+
+|paratitle| **Proto-based Intermediate Representation**
+
+|para| In recent years, several powerful frameworks have chosen ProtocolBuffer to
+describe operators with various arguments, including `Caffe`_, `Caffe2`_, `TensorFlow`_, and `ONNX`_.
+The most important reason is that these descriptors can be easily serialized and sent to the backend.
+With the help of the **Factory Pattern**, we have an elegant way to dispatch the executions
+instead of calling them imperatively. This approach is also known as **Declarative Programming**.
+
+|para| Attaching the IR (Intermediate Representation) brings the following advantages:
+
+* Traceable pipelines, much helpful for visualizing and debugging.
+* Deterministic executions, so detailed optimization can be applied.
+* Efficient deployments, as the data flows are well organized.
+
+|para| The good news is that we can reduce the overhead of the IR to below 5% of computation time,
+which means the dynamic graph can work as fast as the static graph while retaining the flexibility.
+
+|paratitle| **Caffe2**
+
+|para| We have noticed that some developers discouraged **Declarative Programming** in 2017 and
+early 2018, due to the counter-intuitive building of computation graphs. Actually, `Caffe2`_ has
+provided operator-wise execution (a.k.a. *workspace.RunOperator()*) since 2016. In other words,
+**Imperative Programming** is a subset of **Declarative Programming**, if we process the
+declaration implicitly. This mechanism is sometimes called **JIT**.
+
+Architectures
+-------------
+
+.. toctree::
+   :hidden:
+
+.. _Torch: http://torch.ch
+.. _PyTorch: https://pytorch.org
+.. _Caffe: http://caffe.berkeleyvision.org
+.. _Caffe2: http://caffe2.ai
+.. _TensorFlow: https://www.tensorflow.org
+.. _ONNX: https://onnx.ai
+
+.. |nbsp| raw:: html
+
+    &nbsp
+
+.. |br| raw:: html
+
+    <br />
+
+.. |paratitle| raw:: html
+
+    <p style="font-size: 20px">
+
+.. |sectitle| raw:: html
+
+    <p style="text-indent:1em; font-size: 18px">
+
+.. |para| raw:: html
+
+    <p style="text-indent:1.5em; font-size: 18px; max-width: 830px;">
+
+.. |context| raw:: html
+
+    <p style="font-size: 18px; max-width: 830px;">
@@ -97,6 +97,7 @@ include_directories(${PROJECT_SOURCE_DIR}/src)
 if (BUILD_PYTHON_API)
     include_directories(${PYTHON_INCLUDE_DIRS})
     include_directories(${NUMPY_INCLUDE_DIR})
+    include_directories(${THIRD_PARTY_DIR}/pybind11/include)
 endif()
 if (WITH_CUDA)
     include_directories(${CUDA_INCLUDE_DIRS})
......
@@ -38,7 +38,7 @@ class CPUContext {
     void SwitchToDevice() {}

     /*! \brief Switch to the device with the given stream */
-    void SwitchToDevice(int stream_id) {}
+    void SwitchToDevice(const int stream_id) {}

     /*! \brief Synchronize the dispatched operations */
     void FinishDeviceCompution() {}
@@ -106,6 +106,9 @@ class CPUContext {
     /*! \brief Return the device id */
     int device_id() const { return 0; }

+    /*! \brief Return the stream id */
+    int stream_id() const { return 0; }
+
     /*! \brief Set the stream id */
     void set_stream_id(int stream_id) {}
......
@@ -32,6 +32,7 @@ class CNRTObject;
 class CNMLContext {
  public:
+    /*! \brief Default Constructor */
     CNMLContext(const DeviceOption& option)
         : device_id_(option.device_id()),
           random_seed_(option.has_random_seed() ?
@@ -39,34 +40,43 @@ class CNMLContext {
         CHECK_EQ(option.device_type(), PROTO_CNML);
     }

+    /*! \brief Constructor with the specified device id */
     CNMLContext(const int device_id = 0)
         : device_id_(device_id),
           random_seed_(DEFAULT_RNG_SEED) {}

+    /*! \brief Switch to the device with the given stream */
     void SwitchToDevice(int stream_id);

-    inline void SwitchToDevice() { SwitchToDevice(1); }
+    /*! \brief Switch to the device of this context */
+    inline void SwitchToDevice() { SwitchToDevice(0); }

+    /*! \brief Synchronize the dispatched operations */
     void FinishDeviceCompution();

+    /*! \brief Malloc the memory */
     static void* New(size_t nbytes);

+    /*! \brief Zero-Reset the memory */
     static void Memset(
         size_t nbytes,
         void* ptr);

+    /*! \brief Zero-Reset the memory asynchronously */
     inline void MemsetAsync(
         size_t nbytes,
         void* ptr) {
         Memset(nbytes, ptr);
     }

+    /*! \brief Copy the memory */
     template<class DstContext, class SrcContext>
     static void Memcpy(
         size_t nbytes,
         void* dst,
         const void* src);

+    /*! \brief Copy the memory with given type asynchronously */
     template<class DstContext, class SrcContext>
     inline void MemcpyAsync(
         size_t nbytes,
@@ -75,23 +85,33 @@ class CNMLContext {
         Memcpy<DstContext, SrcContext>(dst, src, nbytes);
     }

+    /*! \brief Free the memory */
     static void Delete(void* data);

-    inline int device_id() const { return device_id_; }
+    /*! \brief Return the device id */
+    int device_id() const { return device_id_; }

-    inline void set_stream_id(int stream_id) { stream_id_ = stream_id; }
+    /*! \brief Return the stream id */
+    int stream_id() const { return stream_id_; }
+
+    /*! \brief Set the stream id */
+    void set_stream_id(int stream_id) { stream_id_ = stream_id; }

-    inline cnrtStream_t cnrt_stream() {
+    /*! \brief Return the internal cnrt stream */
+    cnrtStream_t cnrt_stream() {
         return cnrt_stream(device_id_, stream_id_);
     }

+    /*! \brief Return the specified cnrt stream */
     static cnrtStream_t cnrt_stream(
         int device_id,
         int stream_id);

+    /*! \brief Return the global context locker */
     static std::mutex& mutex() { static std::mutex m; return m; }

-    static CNRTObject* cuda_object();
+    /*! \brief Return the thread local cnrt object */
+    static CNRTObject* cnrt_object();

  private:
     int device_id_, stream_id_ = 1, random_seed_;
......
@@ -80,11 +80,16 @@ class CUDAObject {
         } return dev_streams[stream_id];
     }

-    /*! \brief Return the default cuda stream */
+    /*! \brief Return the default cuda stream of current device */
     cudaStream_t GetDefaultStream() {
         return GetStream(CUDA_GET_DEVICE(), 0);
     }

+    /*! \brief Return the default cuda stream of given device */
+    cudaStream_t GetDefaultStream(int device_id) {
+        return GetStream(device_id, 0);
+    }
+
     /*! \brief Return the specified cublas handle */
     cublasHandle_t GetCuBLASHandle(int device_id, int stream_id) {
         vector<cublasHandle_t>& dev_handles = cublas_handles[device_id];
@@ -141,13 +146,13 @@ class CUDAContext {
           random_seed_(DEFAULT_RNG_SEED) {}

     /*! \brief Switch to the device with the given stream */
-    void SwitchToDevice(int stream_id) {
+    void SwitchToDevice(const int stream_id) {
         CUDA_CHECK(cudaSetDevice(device_id_));
         stream_id_ = stream_id;
     }

     /*! \brief Switch to the device of this context */
-    void SwitchToDevice() { SwitchToDevice(1); }
+    void SwitchToDevice() { SwitchToDevice(0); }

     /*! \brief Synchronize the dispatched operations */
     void FinishDeviceCompution() {
@@ -191,8 +196,19 @@ class CUDAContext {
         size_t nbytes,
         void* dst,
         const void* src) {
+        MemcpyEx<DstContext, SrcContext>(
+            nbytes, dst, src, active_device_id());
+    }
+
+    /*! \brief Copy the memory [Extended] */
+    template<class DstContext, class SrcContext>
+    static void MemcpyEx(
+        size_t nbytes,
+        void* dst,
+        const void* src,
+        int device_id) {
         cudaStream_t stream = CUDAContext::
-            cuda_object()->GetDefaultStream();
+            cuda_object()->GetDefaultStream(device_id);
         CUDA_CHECK(cudaMemcpyAsync(dst, src, nbytes,
             cudaMemcpyDefault, stream));
         cudaError_t error = SynchronizeStream(stream);
@@ -230,9 +246,15 @@ class CUDAContext {
         return cudaGetLastError();
     }

-    /*! \brief Return the device id */
+    /*! \brief Return the device id of this context */
     int device_id() const { return device_id_; }

+    /*! \brief Return the active device id of current thread */
+    static int active_device_id() { return CUDA_GET_DEVICE(); }
+
+    /*! \brief Return the stream id */
+    int stream_id() const { return stream_id_; }
+
     /*! \brief Set the stream id */
     void set_stream_id(int stream_id) { stream_id_ = stream_id; }
@@ -292,85 +314,48 @@ class CUDAContext {
     }

  private:
-    int device_id_, stream_id_ = 1, random_seed_;
+    int device_id_, stream_id_ = 0, random_seed_;
     unique_ptr<std::mt19937> rand_generator_;
     curandGenerator_t curand_generator_ = nullptr;
 };

-template <class Context>
-class CUDAClosure {
- public:
-    /*! \brief Default Constructor */
-    CUDAClosure() {}
-
-    /*! \brief Constructor with the given context */
-    explicit CUDAClosure(Context* ctx): ctx_(ctx) {}
-
-    /*! \brief Synchronize the dispatched operations */
-    void Sync() {
-        for (auto stream_id : active_streams_) {
-            cudaStreamSynchronize(cuda_object_
-                .GetStream(ctx_->device_id(), stream_id));
-            cudaError_t error = cudaGetLastError();
-            CHECK_EQ(error, cudaSuccess)
-                << "\nCUDA Error: " << cudaGetErrorString(error);
-        }
-        active_streams_.clear();
-    }
-
-    /*! \brief Return the specified cuda stream */
-    cudaStream_t cuda_stream(int stream_id) {
-        active_streams_.push_back(stream_id);
-        return cuda_object_.GetStream(
-            ctx_->device_id(), stream_id);
-    }
-
-    /*! \brief Return the specified cublas handle */
-    cublasHandle_t cublas_handle(int stream_id) {
-        active_streams_.push_back(stream_id);
-        return cuda_object_.GetCuBLASHandle(
-            ctx_->device_id(), stream_id);
-    }
-
-    /*! \brief Return the specified cudnn handle */
-#ifdef WITH_CUDNN
-    cudnnHandle_t cudnn_handle(int stream_id) {
-        active_streams_.push_back(stream_id);
-        return cuda_object_.GetCuDNNHandle(
-            ctx_->device_id(), stream_id);
-    }
-#endif
-
- protected:
-    Context* ctx_;
-    CUDAObject cuda_object_;
-    vector<int> active_streams_;
-};
-
 #else  // WITH_CUDA

 class CUDAContext {
  public:
+    /*! \brief Default Constructor */
     CUDAContext(const DeviceOption& option) { CUDA_NOT_COMPILED; }
+
+    /*! \brief Constructor with the specified device id */
     CUDAContext(const int device_id = 0) { CUDA_NOT_COMPILED; }

-    void SwitchToDevice() { CUDA_NOT_COMPILED; }
+    /*! \brief Switch to the device with the given stream */
     void SwitchToDevice(int stream_id) { CUDA_NOT_COMPILED; }

+    /*! \brief Switch to the device of this context */
+    void SwitchToDevice() { CUDA_NOT_COMPILED; }
+
+    /*! \brief Synchronize the dispatched operations */
     void FinishDeviceCompution() { CUDA_NOT_COMPILED; }

+    /*! \brief Malloc the memory */
+    static void* New(size_t nbytes) { CUDA_NOT_COMPILED; }
+
+    /*! \brief Zero-Reset the memory */
     static void Memset(
         size_t nbytes,
         void* ptr) {
         CUDA_NOT_COMPILED;
     }

+    /*! \brief Zero-Reset the memory asynchronously */
     void MemsetAsync(
         size_t nbytes,
         void* ptr) {
         CUDA_NOT_COMPILED;
     }

+    /*! \brief Copy the memory */
     template<class DstContext, class SrcContext>
     static void Memcpy(
         size_t nbytes,
@@ -379,6 +364,17 @@ class CUDAContext {
         CUDA_NOT_COMPILED;
     }

+    /*! \brief Copy the memory [Extended] */
+    template<class DstContext, class SrcContext>
+    static void MemcpyEx(
+        size_t nbytes,
+        void* dst,
+        const void* src,
+        int device_id) {
+        CUDA_NOT_COMPILED;
+    }
+
+    /*! \brief Copy the memory asynchronously */
     template<class DstContext, class SrcContext>
     void MemcpyAsync(
         size_t nbytes,
@@ -387,7 +383,16 @@ class CUDAContext {
         CUDA_NOT_COMPILED;
     }

+    /*! \brief Return the device id */
     int device_id() const { return 0; }

+    /*! \brief Return the active device id of current thread */
+    static int active_device_id() { return 0; }
+
+    /*! \brief Return the stream id */
+    int stream_id() const { return 0; }
+
+    /*! \brief Set the stream id */
     void set_stream_id(int stream_id) {}
 };
......
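A short usage sketch of the extended copy introduced above; the buffers and device id are illustrative, and only the MemcpyEx signature comes from this header:

// Copy a host buffer to a buffer that lives on GPU 1 while the calling
// thread may be active on another device: the extra device_id argument
// selects GPU 1's default stream, which the old Memcpy could not do.
size_t nbytes = 1024 * sizeof(float);
void* dev_ptr = nullptr;         // assumed: device buffer on GPU 1
const void* host_ptr = nullptr;  // assumed: host buffer
dragon::CUDAContext::MemcpyEx<dragon::CUDAContext, dragon::CPUContext>(
    nbytes, dev_ptr, host_ptr, /* device_id = */ 1);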
@@ -20,80 +20,69 @@ namespace dragon {
 class GraphBase {
  public:
-    struct Node {
-        vector<string> parents;
-        vector<string> childs;
-        int op_idx = -1;
-        OperatorDef op_def;
-    };
-
+    /*! \brief Default constructor */
     GraphBase(
         const GraphDef& meta_graph,
         Workspace* ws);

+    /*! \brief Default deconstructor */
     virtual ~GraphBase() {}

-    GraphDef BuildUpdateOps(const GraphDef& input_def);
-
+    /*! \brief Create a graph from the optimized def */
     virtual bool Create(
         const GraphDef& optimized_graph,
         Workspace* ws) = 0;

+    /*! \brief Run the graph once synchronously */
     virtual bool Run(
         const string& include,
         const string& exclude,
-        const int stream_id = 1) = 0;
+        int stream_id = 0) = 0;

+    /*! \brief Return the name of this graph */
     string name() const { return name_; }

  protected:
+    /*! \brief Store the name and running phase */
     string name_, phase_;
+
+    /*! \brief Store the defined arguments */
     Map<string, Argument> args_;
+
+    /*! \brief Store the parent workspace */
     Workspace* ws_;
 };

 class Graph : public GraphBase {
  public:
+    /*! \brief Default constructor */
     Graph(const GraphDef& meta_graph, Workspace* ws);

+    /*! \brief Default deconstructor */
     virtual ~Graph() { for (auto* op : ops_) delete op; }

+    /*! \brief Create a graph from the optimized def */
     bool Create(
         const GraphDef& optimized_graph,
         Workspace* ws) override;

+    /*! \brief Run the graph once synchronously */
     bool Run(
         const string& include,
         const string& exclude,
-        const int stream_id = 1) override;
+        int stream_id = 0) override;

-    GraphDef Prune(const GraphDef& meta_graph);
-    GraphDef Share(const GraphDef& optimized_graph);
-    void ShareGrads(GraphDef& optimized_graph);
-    GraphDef BuildUpdateOps(const GraphDef& meta_graph);
-    void RecomputingAware(
-        const GraphDef& optimized_graph,
-        Workspace* ws);
-
+    /*! \brief Return the parent workspace */
     Workspace* ws() const { return ws_; }

  protected:
-    void ForwardShareDyeing(
-        const string& u,
-        const string& ancestor);
-    void ForwardPruneDyeing(
-        const string& u,
-        const string& leaf,
-        const vector<string>& path);
-    void BackwardPruneDyeing(string v);
-
+    /*! \brief Store the internal operators */
     vector<OperatorBase*> ops_;
-    Map<string, Node> dag_;
-    Map<string, bool> visited_, colored_;
-    Map<string, string> renamed_;
-    Set<string> targets_;
 };

+/*! \brief Create a graph from the raw def */
 GraphBase* NewGraph(
     const GraphDef& meta_graph,
     Workspace* ws);
......
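For orientation, the trimmed-down interface above is driven roughly like this; a sketch only, assuming a GraphDef and a Workspace built elsewhere:

// meta_graph is a dragon::GraphDef, ws a dragon::Workspace*.
dragon::GraphBase* graph = dragon::NewGraph(meta_graph, ws);
// Run all operators (empty include/exclude rules) on stream 0,
// the new default stream id (previously 1).
graph->Run("", "");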
@@ -19,14 +19,19 @@ namespace dragon {
 class GraphGradientMaker {
  public:
-    GraphGradientMaker(): cur_op_idx_(0) {}
+    GraphGradientMaker()
+        : cur_op_idx_(0) {}

     void Make(
-        const GraphDef& forward_def,
+        const vector<OperatorDef*>& forward_def,
         const vector<string>& targets,
         GraphDef& new_def);

-    void Share(const string& grads_prefix, GraphDef& graph);
+    void Make(
+        const GraphDef& forward_def,
+        GraphDef& backward_def);
+
+    void Share(GraphDef& graph);

     void SetTerms(const Map<string, string>& terms) { terms_ = terms; }
     void SetOperatorPrefix(const string& prefix) { op_prefix_ = prefix; }
......
+/*!
+ * Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
+ *
+ * Licensed under the BSD 2-Clause License.
+ * You should have received a copy of the BSD 2-Clause License
+ * along with the software. If not, See,
+ *
+ *    <https://opensource.org/licenses/BSD-2-Clause>
+ *
+ * ------------------------------------------------------------
+ */
+
+#ifndef DRAGON_CORE_GRAPH_OPTIMIZER_H_
+#define DRAGON_CORE_GRAPH_OPTIMIZER_H_
+
+#include "core/common.h"
+
+namespace dragon {
+
+class Workspace;
+
+class GraphOptimizer {
+ public:
+    /*! \brief The simple node structure */
+    struct Node {
+        vector<string> parents;
+        vector<string> childs;
+        int op_idx = -1;
+        OperatorDef op_def;
+    };
+
+    /*! \brief Default constructor */
+    GraphOptimizer(Workspace* ws) : ws_(ws) {}
+
+    /*! \brief Prune the redundant nodes (-O1) */
+    GraphDef PruneNodes(const GraphDef& input_def);
+
+    /*! \brief Add the inplace for outputs (-O2) */
+    GraphDef AddInplace(const GraphDef& input_def);
+
+    /*! \brief Plan the recomputing for inputs (-O3) */
+    GraphDef MirrorStage(
+        const GraphDef& input_def,
+        Map< string, vector<int> >& op_indices);
+
+    /*! \brief Allocate the buffer for outputs (-O3) */
+    GraphDef SimulateGC(const GraphDef& input_def);
+
+ protected:
+    /*! \brief Traverse from input gradients to dye the nodes */
+    void ForwardPruneTraversal(
+        const string& u,
+        const string& leaf,
+        const vector<string>& path);
+
+    /*! \brief Traverse from targets to dye the nodes */
+    void BackwardPruneTraversal(const string& v);
+
+    /*! \brief Traverse from inputs to find the available inplace chain */
+    void InplaceTraversal(
+        const string& u,
+        const string& ancestor);
+
+    /*! \brief Store the workspace of parent graph */
+    Workspace* ws_;
+
+    /*! \brief Store the DAG */
+    Map<string, Node> dag_;
+
+    /*! \brief Store the traversal flags */
+    Map<string, bool> visited_, colored_;
+
+    /*! \brief Store the inplace relations */
+    Map<string, string> renamed_;
+};
+
+}  // namespace dragon
+
+#endif  // DRAGON_CORE_GRAPH_OPTIMIZER_H_
\ No newline at end of file
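The comments above map each pass to an optimization level (compare SetGraphOptimizationLevel in the config docs). A plausible driver chaining them by level could look like the following; the Optimize helper is hypothetical, only the pass names come from this header, and MirrorStage is omitted since it also returns the recomputing plan:

#include "core/graph_optimizer.h"

// Hypothetical helper: apply the passes declared above up to a given level.
dragon::GraphDef Optimize(
    const dragon::GraphDef& def, dragon::Workspace* ws, int level) {
    dragon::GraphOptimizer optimizer(ws);
    dragon::GraphDef g = def;
    if (level >= 1) g = optimizer.PruneNodes(g);  // -O1: drop redundant nodes
    if (level >= 2) g = optimizer.AddInplace(g);  // -O2: in-place outputs
    if (level >= 3) g = optimizer.SimulateGC(g);  // -O3: buffer simulation
    return g;
}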
@@ -35,8 +35,6 @@ class MixedMemory {
         STATE_AT_CUDA,
         /*! \brief Memory could be modified by CNMLContext last time */
         STATE_AT_CNML,
-        /*! \brief Memory should be copied to another device next time */
-        SWITCHED,
         /*! \brief Host and Device now hold the same contents */
         SYNCED,
     } State;
@@ -46,7 +44,7 @@ class MixedMemory {
           cuda_ptr_(nullptr), cnml_ptr_(nullptr) {}

     /*! \brief Constructor with the known meta and size */
-    MixedMemory(const TypeMeta& meta, const size_t nbytes)
+    MixedMemory(const TypeMeta& meta, size_t nbytes)
         : meta_(meta), nbytes_(nbytes), cpu_ptr_(nullptr),
           cuda_ptr_(nullptr), cnml_ptr_(nullptr) {}
@@ -54,19 +52,19 @@ class MixedMemory {
     ~MixedMemory();

     /*! \brief Return the const data pointer on CPUContext */
-    const void* cpu_data();
+    const void* cpu_data(size_t nbytes = 0);

     /*! \brief Return the const data pointer on CUDAContext */
-    const void* cuda_data();
+    const void* cuda_data(size_t nbytes = 0);

     /*! \brief Return the const data pointer on CNMLContext */
     const void* cnml_data();

     /*! \brief Return the mutable data pointer on CPUContext */
-    void* mutable_cpu_data();
+    void* mutable_cpu_data(size_t nbytes = 0);

     /*! \brief Return the mutable data pointer on CUDAContext */
-    void* mutable_cuda_data();
+    void* mutable_cuda_data(size_t nbytes = 0);

     /*! \brief Return the mutable data pointer on CNMLContext */
     void* mutable_cnml_data();
@@ -85,11 +83,11 @@ class MixedMemory {
     /*! \brief Set the cpu data pointer from external context */
     void set_cpu_data(void* cpu_ptr, size_t nbytes);

-    /*! \brief Switch to the device set by Context before */
-    void SwitchToDevice();
-
     /*! \brief Switch to the specified device */
+    void SwitchToDevice(int device_id);
+
+    /*! \brief Switch to the specified cuda device */
     void SwitchToCUDADevice(int device_id);

     /*! \brief Return the total bytes of this memory */
@@ -110,14 +108,17 @@ class MixedMemory {
     /*! \brief Set the storage order */
     void set_order(StorageOrder order) { order_ = order; }

+    /*! \brief Return the device id of the memory on device */
+    int device_id() const { return ptr_device_; }
+
     /*! \brief Return a string to describe the internal structure */
     const Map<string, string> info() const;

     /*! \brief Control the state machine to CPUContext */
-    void ToCPU();
+    void ToCPU(size_t nbytes = 0);

     /*! \brief Control the state machine to CUDAContext */
-    void ToCUDA();
+    void ToCUDA(size_t nbytes = 0);

  private:
     /*! \brief The type meta to call the deconstructor */
@@ -137,7 +138,7 @@ class MixedMemory {
     /*! \brief Whether this memory owns the cpu data pointer */
     int own_cpu_ptr_ = 1;

     /*! \brief Store the device id for some data pointers */
     int ptr_device_ = 0;
......
...@@ -30,10 +30,10 @@ class Workspace; ...@@ -30,10 +30,10 @@ class Workspace;
class OperatorBase { class OperatorBase {
public: public:
/*! Default constructor */ /*! \brief Default constructor */
OperatorBase(const OperatorDef& def, Workspace* ws); OperatorBase(const OperatorDef& def, Workspace* ws);
/*! Default deconstructor */ /*! \brief Default deconstructor */
virtual ~OperatorBase() {} virtual ~OperatorBase() {}
/*! \brief Return the specified input tensor */ /*! \brief Return the specified input tensor */
...@@ -49,19 +49,13 @@ class OperatorBase { ...@@ -49,19 +49,13 @@ class OperatorBase {
int OutputSize() { return (int)outputs_.size(); } int OutputSize() { return (int)outputs_.size(); }
/*! \brief Modify this operator according to the given def */ /*! \brief Modify this operator according to the given def */
void MutableOp(const OperatorDef& def); void UpdateFrom(const OperatorDef& def);
/*! \brief Modify this operator according to the given properties */
void MutableOp(
const vector<string>& inputs,
const vector<string>& outputs,
const string& anchor);
/*! \brief Switch the internal running phase */ /*! \brief Switch the internal running phase */
void SwitchToPhase(const string& phase) { phase_ = phase; } void SwitchToPhase(const string& phase) { phase_ = phase; }
/*! \brief Run this operator on the specified stream */ /*! \brief Run this operator on the specified stream */
virtual void Run(int stream_id = 1) { NOT_IMPLEMENTED; } virtual void Run(int stream_id = 0) { NOT_IMPLEMENTED; }
/*! \brief Fuse this operator into the specified graph */ /*! \brief Fuse this operator into the specified graph */
virtual void Fusion(void* graph) { NOT_IMPLEMENTED; } virtual void Fusion(void* graph) { NOT_IMPLEMENTED; }
...@@ -100,14 +94,14 @@ class OperatorBase { ...@@ -100,14 +94,14 @@ class OperatorBase {
/*! \brief Return the specified argument */ /*! \brief Return the specified argument */
const Argument& arg(const string& name) { return *(args_[name]); } const Argument& arg(const string& name) { return *(args_[name]); }
typedef Map<string, vector<OperatorBase*> > RecomputeMap; typedef Map<string, vector<OperatorBase*> > SubGraph;
/*! \brief Return the recomputing map of this operator */ /*! \brief Return the recomputing subgraph of this operator */
RecomputeMap& recompute_map() { return recompute_map_; } SubGraph& subgraph() { return subgraph_; }
/*! \brief Set the given recomputing map */ /*! \brief Set the given recomputing subgraph */
void set_recompute_map(RecomputeMap recompute_map) { void set_subgraph(SubGraph subgraph) {
recompute_map_ = recompute_map; subgraph_ = subgraph;
} }
/*! \brief Return the stored operator def */ /*! \brief Return the stored operator def */
...@@ -129,7 +123,7 @@ class OperatorBase { ...@@ -129,7 +123,7 @@ class OperatorBase {
protected: protected:
string phase_, anchor_; string phase_, anchor_;
Map<std::string, const Argument*> args_; Map<std::string, const Argument*> args_;
Map<string, vector<OperatorBase*> > recompute_map_; SubGraph subgraph_;
vector<Tensor*> inputs_, outputs_; vector<Tensor*> inputs_, outputs_;
OperatorDef def_; OperatorDef def_;
Workspace* ws_; Workspace* ws_;
...@@ -138,50 +132,66 @@ class OperatorBase { ...@@ -138,50 +132,66 @@ class OperatorBase {
template <class Context> template <class Context>
class Operator : public OperatorBase { class Operator : public OperatorBase {
public: public:
/*! \brief Default constructor */
Operator(const OperatorDef& def, Workspace* ws) Operator(const OperatorDef& def, Workspace* ws)
: OperatorBase(def, ws), ctx_(def.device_option()), : OperatorBase(def, ws), ctx_(def.device_option()),
allow_recompute_(OperatorBase::Arg<bool>( allow_recomputing_(OperatorBase::Arg<bool>(
"recomputing_aware", false)), "allow_recomputing", false)),
do_sync_(OperatorBase::Arg<bool>( do_sync_(OperatorBase::Arg<bool>(
"do_sync", true)) { "do_sync", false)) {
allow_run_ = true; allow_run_ = true;
allow_run_ &= _MPICheck(); allow_run_ &= MPICheck();
allow_run_ &= (!(OutputSize() == 1 && allow_run_ &= (!(OutputSize() == 1 &&
Output(0)->name() == "ignore")); Output(0)->name() == "ignore"));
} }
void Run(int stream_id = 1) final { /*! \brief Run this operator on the specified stream */
void Run(int stream_id = 0) final {
if (!allow_run_) return; if (!allow_run_) return;
if (allow_recompute_) MakeResource(); if (allow_recomputing_) PrepareResource();
ctx()->SwitchToDevice(stream_id); ctx()->SwitchToDevice(stream_id);
MemorySwitch(); MemorySwitch();
RunOnDevice(); RunOnDevice();
if (do_sync_) ctx()->FinishDeviceCompution(); if (do_sync_ || stream_id > 0) {
if (allow_recompute_) CleanResource(); // Stream 0 is synchronized separately at scheduled points
ctx()->FinishDeviceCompution();
}
if (allow_recomputing_) ReleaseResource();
} }
virtual void ElimateCorruption(); /*! \brief Prepare the content of inputs */
virtual void MakeResource(); virtual void PrepareResource();
virtual void CleanResource();
/*! \brief Release the ownership of inputs */
virtual void ReleaseResource();
/*! \brief Coordinate the context of inputs and outputs */
virtual void MemorySwitch() { virtual void MemorySwitch() {
for (auto* I : inputs_) for (auto* e : inputs_)
if(I->name() != "ignore") I->SwitchToDevice(); if(e->name() != "ignore")
for (auto* O : outputs_) e->SwitchToDevice(ctx()->device_id());
if(O->name() != "ignore") O->SwitchToDevice(); for (auto* e : outputs_)
if(e->name() != "ignore")
e->SwitchToDevice(ctx()->device_id());
} }
/*! \brief Implement the detailed execution */
virtual void RunOnDevice() = 0; virtual void RunOnDevice() = 0;
/*! \brief Return the internal context */
Context* ctx() { return &ctx_; } Context* ctx() { return &ctx_; }
/*! \brief Whether this operator can be ignored */
bool AllowRun() { return allow_run_; } bool AllowRun() { return allow_run_; }
protected: protected:
/*! \brief Store the internal context */
Context ctx_; Context ctx_;
bool allow_run_, allow_recompute_, do_sync_; bool allow_run_, allow_recomputing_, do_sync_;
private: private:
bool _MPICheck() { /*! \brief Check the MPI conditions */
bool MPICheck() {
#ifndef WITH_MPI #ifndef WITH_MPI
return true; return true;
#else #else
...@@ -197,7 +207,13 @@ class Operator : public OperatorBase { ...@@ -197,7 +207,13 @@ class Operator : public OperatorBase {
} }
}; };
OperatorBase* CreateOperator(const OperatorDef& def, Workspace* ws); /*! \brief Create a new operator from the raw def */
OperatorBase* NewOperator(
const OperatorDef& def,
Workspace* ws);
/*! Macros */
#define USE_SIMPLE_CTOR_DTOR(name) \ #define USE_SIMPLE_CTOR_DTOR(name) \
name(const OperatorDef& def, Workspace* ws) \ name(const OperatorDef& def, Workspace* ws) \
...@@ -350,7 +366,9 @@ DECLARE_REGISTRY( ...@@ -350,7 +366,9 @@ DECLARE_REGISTRY(
<< "\nExcepted the size of " << #argument \ << "\nExcepted the size of " << #argument \
<< " > " << idx << ". (Got " \ << " > " << idx << ". (Got " \
<< argument##_desc.size() << ")."; \ << argument##_desc.size() << ")."; \
Tensor* argument##_tensor = ws()->GetTensor(argument##_desc[idx]); \ Tensor* argument##_tensor = ws()->GetTensor( \
str::replace_first(argument##_desc[idx], \
"${ANCHOR}", anchor())); \
CHECK(argument##_tensor->IsType<type>()) \ CHECK(argument##_tensor->IsType<type>()) \
<< "\nThe type of " << #argument << " should be " << #type << "."; \ << "\nThe type of " << #argument << " should be " << #type << "."; \
CHECK_EQ(argument##_tensor->count(), 1) \ CHECK_EQ(argument##_tensor->count(), 1) \
......
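The run pipeline that Operator<Context>::Run wires together above, summarized as a sketch; op would come from NewOperator, def and ws are assumed to exist, and the stream id follows the new default of 0:

    // Order of calls inside Run(stream_id), per the definition above:
    //   PrepareResource();                  // only if "allow_recomputing"
    //   ctx()->SwitchToDevice(stream_id);
    //   MemorySwitch();                     // inputs/outputs -> ctx()->device_id()
    //   RunOnDevice();
    //   ctx()->FinishDeviceCompution();     // if do_sync_ or stream_id > 0
    //   ReleaseResource();                  // only if "allow_recomputing"
    OperatorBase* op = NewOperator(def, ws); // def/ws assumed to exist
    op->Run(0);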
...@@ -46,10 +46,17 @@ class GradientMakerBase { ...@@ -46,10 +46,17 @@ class GradientMakerBase {
virtual Gradient Make() { virtual Gradient Make() {
vector<OperatorDef> new_defs = MakeDefs(); vector<OperatorDef> new_defs = MakeDefs();
Argument anchor; if (def.has_uid()) {
anchor.set_name("anchor"); anchor.set_s(def.name()); // Attach the anchor to the name if having UID
for (int i = 0; i < new_defs.size(); i++) for (int i = 0; i < new_defs.size(); i++)
new_defs[i].add_arg()->CopyFrom(anchor); new_defs[i].set_name(def.name());
} else {
// Otherwise, just put it into the arguments
Argument anchor;
anchor.set_name("anchor"); anchor.set_s(def.name());
for (int i = 0; i < new_defs.size(); i++)
new_defs[i].add_arg()->CopyFrom(anchor);
}
return Gradient(new_defs, g_inputs_, DefaultValues()); return Gradient(new_defs, g_inputs_, DefaultValues());
}; };
...@@ -117,10 +124,10 @@ class NoGradient : public GradientMakerBase { ...@@ -117,10 +124,10 @@ class NoGradient : public GradientMakerBase {
class SimpleGradientMaker final : public GradientMakerBase { class SimpleGradientMaker final : public GradientMakerBase {
public: public:
/*! /*!
* <SimpleMaker> * <SimpleMaker>
* *
* Inputs: X1, X2, ..., Xn, dY * Inputs: X1, X2, ..., Xn, dY
* Outputs: dX1, dX2, ..., dXn * Outputs: dX1, dX2, ..., dXn
* *
*/ */
GRADIENT_MAKER_CTOR(SimpleGradientMaker); GRADIENT_MAKER_CTOR(SimpleGradientMaker);
...@@ -141,12 +148,12 @@ class SimpleGradientMaker final : public GradientMakerBase { ...@@ -141,12 +148,12 @@ class SimpleGradientMaker final : public GradientMakerBase {
class InplaceGradientMaker final : public GradientMakerBase { class InplaceGradientMaker final : public GradientMakerBase {
public: public:
/*! /*!
* <InplaceMaker> * <InplaceMaker>
* *
* Inputs: Y, dY * Inputs: Y, dY
* Outputs: dX * Outputs: dX
* *
*/ */
GRADIENT_MAKER_CTOR(InplaceGradientMaker); GRADIENT_MAKER_CTOR(InplaceGradientMaker);
vector<OperatorDef> MakeDefs() override { vector<OperatorDef> MakeDefs() override {
return SingleDef( return SingleDef(
......
...@@ -80,7 +80,7 @@ class Tensor { ...@@ -80,7 +80,7 @@ class Tensor {
int ndim() const { return (int)dims_.size(); } int ndim() const { return (int)dims_.size(); }
/*! \brief Return the dimension of given axis */ /*! \brief Return the dimension of given axis */
int64_t dim(const int64_t i) const{ return dims_[axis(i)]; } int64_t dim(int64_t i) const{ return dims_[axis(i)]; }
/*! \brief Return all the dimensions */ /*! \brief Return all the dimensions */
const vector<int64_t>& dims() const { return dims_; } const vector<int64_t>& dims() const { return dims_; }
...@@ -95,7 +95,7 @@ class Tensor { ...@@ -95,7 +95,7 @@ class Tensor {
size_t capacity() const { return capacity_; } size_t capacity() const { return capacity_; }
/*! \brief Return the number of elements along the [start, end) axes */ /*! \brief Return the number of elements along the [start, end) axes */
int64_t count(const int64_t start, const int64_t end) const { int64_t count(int64_t start, int64_t end) const {
int64_t nelements = 1; int64_t nelements = 1;
for (int64_t i = start; i < end; i++) nelements *= dim(i); for (int64_t i = start; i < end; i++) nelements *= dim(i);
return nelements; return nelements;
...@@ -105,10 +105,10 @@ class Tensor { ...@@ -105,10 +105,10 @@ class Tensor {
int64_t count() const { return (int64_t)size_; } int64_t count() const { return (int64_t)size_; }
/*! \brief Return the number of elements from the start axis */ /*! \brief Return the number of elements from the start axis */
int64_t count(const int64_t start) const { return count(start, ndim()); } int64_t count(int64_t start) const { return count(start, ndim()); }
/*! \brief Return the stride of given axis */ /*! \brief Return the stride of given axis */
int64_t stride(const int64_t i) const { return strides_[axis(i)]; } int64_t stride(int64_t i) const { return strides_[axis(i)]; }
/*! \brief Return all the strides */ /*! \brief Return all the strides */
const vector<int64_t>& strides() const { return strides_; } const vector<int64_t>& strides() const { return strides_; }
...@@ -128,11 +128,11 @@ class Tensor { ...@@ -128,11 +128,11 @@ class Tensor {
/*! \brief Return a string to describe the dimensions of this tensor */ /*! \brief Return a string to describe the dimensions of this tensor */
string DimString() const { return DimString(dims_); } string DimString() const { return DimString(dims_); }
/*! \brief Whether the memory of this tensor is unstable */ /*! \brief Return the version of this tensor */
bool is_corrupted() const { return is_corrupted_; } int version() const { return version_; }
/*! \brief Mark the internal memory to be unstable */ /*! \brief Set the version of this tensor */
void Corrupt() { is_corrupted_ = true; } void set_version(int version) { version_ = version; }
/*! \brief Whether this tensor holds a valid memory */ /*! \brief Whether this tensor holds a valid memory */
bool has_memory() const { return memory_ || ex_memory_ != nullptr; } bool has_memory() const { return memory_ || ex_memory_ != nullptr; }
...@@ -152,10 +152,10 @@ class Tensor { ...@@ -152,10 +152,10 @@ class Tensor {
return memory()->state(); return memory()->state();
} }
/*! \brief Switch the memory to device set by Context before */ /*! \brief Switch the memory to the specific device */
void SwitchToDevice() { void SwitchToDevice(int device_id) {
MixedMemory* mem = memory(); MixedMemory* mem = memory();
if (mem) mem->SwitchToDevice(); if (mem) mem->SwitchToDevice(device_id);
} }
/*! \brief Return the type meta of this tensor */ /*! \brief Return the type meta of this tensor */
...@@ -177,10 +177,10 @@ class Tensor { ...@@ -177,10 +177,10 @@ class Tensor {
} else { } else {
if (TypeMeta::Id<Context>() == if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CPUContext>()) { TypeMeta::Id<CPUContext>()) {
*data_ptr = mem->mutable_cpu_data(); *data_ptr = mem->mutable_cpu_data(nbytes());
} else if (TypeMeta::Id<Context>() == } else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CUDAContext>()) { TypeMeta::Id<CUDAContext>()) {
*data_ptr = mem->mutable_cuda_data(); *data_ptr = mem->mutable_cuda_data(nbytes());
} else if (TypeMeta::Id<Context>() == } else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CNMLContext>()) { TypeMeta::Id<CNMLContext>()) {
*data_ptr = mem->mutable_cnml_data(); *data_ptr = mem->mutable_cnml_data();
...@@ -198,10 +198,10 @@ class Tensor { ...@@ -198,10 +198,10 @@ class Tensor {
CHECK(mem) << "\nMemory access before allowcating."; CHECK(mem) << "\nMemory access before allowcating.";
if (TypeMeta::Id<Context>() == if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CPUContext>()) { TypeMeta::Id<CPUContext>()) {
return mem->cpu_data(); return mem->cpu_data(nbytes());
} else if (TypeMeta::Id<Context>() == } else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CUDAContext>()) { TypeMeta::Id<CUDAContext>()) {
return mem->cuda_data(); return mem->cuda_data(nbytes());
} else if (TypeMeta::Id<Context>() == } else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CNMLContext>()) { TypeMeta::Id<CNMLContext>()) {
return mem->cnml_data(); return mem->cnml_data();
...@@ -258,10 +258,18 @@ class Tensor { ...@@ -258,10 +258,18 @@ class Tensor {
T* mutable_data() { T* mutable_data() {
void* data_ptr; void* data_ptr;
mutable_data_ptr<Context>(&data_ptr); mutable_data_ptr<Context>(&data_ptr);
if (data_ptr && meta_ == TypeMeta::Make<T>()) if (data_ptr) {
return static_cast<T*>(data_ptr); auto meta = TypeMeta::Make<T>();
return static_cast<T*>( if (meta_ == meta) {
raw_mutable_data<Context>(TypeMeta::Make<T>())); return static_cast<T*>(data_ptr);
} else if (capacity_ >=
size_ * meta.itemsize()) {
meta_ = meta;
return static_cast<T*>(data_ptr);
}
}
return static_cast<T*>(raw_mutable_data
<Context>(TypeMeta::Make<T>()));
} }
/*! \brief Get the typed const data pointer */ /*! \brief Get the typed const data pointer */
...@@ -325,6 +333,9 @@ class Tensor { ...@@ -325,6 +333,9 @@ class Tensor {
/*! \brief Store the size and capacity */ /*! \brief Store the size and capacity */
size_t size_ = 0, capacity_ = 0; size_t size_ = 0, capacity_ = 0;
/*! \brief Store the version for shared tensor */
int version_ = -1;
/*! \brief Store the dimensions and strides */ /*! \brief Store the dimensions and strides */
vector<int64_t> dims_, strides_; vector<int64_t> dims_, strides_;
...@@ -335,7 +346,7 @@ class Tensor { ...@@ -335,7 +346,7 @@ class Tensor {
MixedMemory* ex_memory_ = nullptr; MixedMemory* ex_memory_ = nullptr;
/*! \brief External memory indicators */ /*! \brief External memory indicators */
bool is_corrupted_ = false, is_shared_ = false, own_mem_ = true; bool is_shared_ = false, own_mem_ = true;
}; };
} // namespace dragon } // namespace dragon
......
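A shape-bookkeeping sketch for the accessors above, assuming a contiguous row-major tensor and that axis(i) normalizes negative indices (an assumption, since axis() is not shown here):

    // For a contiguous Tensor t with dims {2, 3, 4}:
    //   t.count()     == 24        t.count(1)   == 12   (3 * 4)
    //   t.count(1, 3) == 12        t.stride(0)  == 12
    //   t.dim(-1)     == 4         (wrapped by axis(i), assumed)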
...@@ -52,12 +52,12 @@ class TypeMeta { ...@@ -52,12 +52,12 @@ class TypeMeta {
return *this; return *this;
} }
bool operator == (const TypeMeta& other) const { bool operator == (const TypeMeta& other) const {
return (id_ == other.id_); return (id_ == other.id_);
} }
bool operator != (const TypeMeta& other) const { bool operator != (const TypeMeta& other) const {
return (id_ != other.id_); return (id_ != other.id_);
} }
const TypeId& id() const { return id_; } const TypeId& id() const { return id_; }
...@@ -69,8 +69,8 @@ class TypeMeta { ...@@ -69,8 +69,8 @@ class TypeMeta {
template <typename T> template <typename T>
static TypeId Id() { static TypeId Id() {
// return T's id // Return T's id
// using an intptr_t as hash key // Using an intptr_t as hash key
return TypeRegister<T>::id(); return TypeRegister<T>::id();
} }
...@@ -78,7 +78,7 @@ class TypeMeta { ...@@ -78,7 +78,7 @@ class TypeMeta {
static size_t Itemsize() { return sizeof(T); } static size_t Itemsize() { return sizeof(T); }
template <typename T> template <typename T>
bool Match() const { return (id_ == Id<T>()); } bool Match() const { return (id_ == Id<T>()); }
template <typename T> template <typename T>
static void Ctor(void* ptr, size_t n) { static void Ctor(void* ptr, size_t n) {
......
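A minimal sketch of the id-based checks TypeMeta supports, using only Make, Match, Itemsize, and the comparison operators shown above:

    auto meta = dragon::TypeMeta::Make<float>();
    bool is_float = meta.Match<float>();                      // true: ids match
    bool differs  = (meta != dragon::TypeMeta::Make<int>());  // true
    size_t bytes  = dragon::TypeMeta::Itemsize<float>();      // == sizeof(float)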
...@@ -19,14 +19,12 @@ ...@@ -19,14 +19,12 @@
namespace dragon { namespace dragon {
#define WORKSPACE_MAX_CORRUPTED_SIZE 2
class Workspace { class Workspace {
public: public:
typedef Map<string, Map<string, int64_t> > DummyNameMap; typedef Map<string, Map<string, int64_t> > DummyNameMap;
typedef Map<string, unique_ptr<Tensor> > TensorMap; typedef Map<string, unique_ptr<Tensor> > TensorMap;
typedef Map<string, string> TensorProxyMap; typedef Map<string, string> TensorAliasMap;
typedef Map<string, TensorFillerProto> TensorFillerMap; typedef Map<string, TensorFillerProto> TensorFillerMap;
typedef Map<string, unique_ptr<OperatorBase> > OperatorMap; typedef Map<string, unique_ptr<OperatorBase> > OperatorMap;
...@@ -73,7 +71,7 @@ class Workspace { ...@@ -73,7 +71,7 @@ class Workspace {
/* \brief Whether the specified filler is in this workspace */ /* \brief Whether the specified filler is in this workspace */
bool HasFiller(const string& name, bool use_remote = true) const; bool HasFiller(const string& name, bool use_remote = true) const;
/*! \brief Create the specified filler */ /*! \brief Create the specified filler */
void CreateFiller(const TensorFillerProto filler); void CreateFiller(const TensorFillerProto filler);
...@@ -107,19 +105,15 @@ class Workspace { ...@@ -107,19 +105,15 @@ class Workspace {
return Tcaches; return Tcaches;
} }
/*! \brief Create a persistent operator in this workspace */ /*! \brief Create an operator in this workspace */
void CreatePersistentOp(const OperatorDef& def); OperatorBase* CreateOperator(const OperatorDef& def);
/*! \brief Run the specified persistent operator */ /*! \brief Run the specified persistent operator */
void RunPersistentOp(
const string& key,
const string& anchor,
const vector<string>& inputs,
const vector<string>& outputs);
/*! \brief Try to run the operator in an adaptive mode */
void RunOperator(const OperatorDef& def); void RunOperator(const OperatorDef& def);
/*! \brief Try to run the operator in an adaptive mode */
void RunOperatorOnce(const OperatorDef& def);
/*! \brief Create a Graph in this workspace */ /*! \brief Create a Graph in this workspace */
GraphBase* CreateGraph(const GraphDef& def); GraphBase* CreateGraph(const GraphDef& def);
...@@ -128,13 +122,13 @@ class Workspace { ...@@ -128,13 +122,13 @@ class Workspace {
const string& graph_name, const string& graph_name,
const string& include, const string& include,
const string& exclude, const string& exclude,
const int stream_id = 1); int stream_id = 0);
/*! \brief Return all the stored graph names */ /*! \brief Return all the stored graph names */
vector<string> GetGraphs() const; vector<string> GetGraphs() const;
/* \brief Set a proxy name for the tensor */ /* \brief Set an alias for the tensor */
bool SetTensorProxy(const string& key, const string& proxy); bool SetTensorAlias(const string& name, const string& alias);
/* \brief Return a unique dummy name within this workspace */ /* \brief Return a unique dummy name within this workspace */
string GetDummyName( string GetDummyName(
...@@ -157,7 +151,7 @@ class Workspace { ...@@ -157,7 +151,7 @@ class Workspace {
TensorFillerMap tensor_filler_map_; TensorFillerMap tensor_filler_map_;
/*! \brief Store the proxy name of tensors */ /*! \brief Store the proxy name of tensors */
TensorProxyMap tensor_proxy_map_; TensorAliasMap tensor_alias_map_;
/*! \brief Store the registered operators for dynamic graph */ /*! \brief Store the registered operators for dynamic graph */
OperatorMap operator_map_; OperatorMap operator_map_;
......
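A sketch of the renamed alias mechanism, given a Workspace* ws; the tensor names are illustrative, while CreateTensor/GetTensor are the workspace calls used elsewhere in this commit:

    ws->CreateTensor("data");              // canonical tensor
    ws->SetTensorAlias("data", "images");  // second name for the same storage
    Tensor* t = ws->GetTensor("images");   // resolves through tensor_alias_map_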
...@@ -99,6 +99,6 @@ class CuDNNSoftmaxGradientOp final : public Operator<Context> { ...@@ -99,6 +99,6 @@ class CuDNNSoftmaxGradientOp final : public Operator<Context> {
#endif // WITH_CUDNN #endif // WITH_CUDNN
} } // namespace dragon
#endif // DRAGON_OPERATORS_ACTIVATION_SOFTMAX_OP_H_ #endif // DRAGON_OPERATORS_ACTIVATION_SOFTMAX_OP_H_
\ No newline at end of file
...@@ -10,29 +10,29 @@ ...@@ -10,29 +10,29 @@
* ------------------------------------------------------------ * ------------------------------------------------------------
*/ */
#ifndef DRAGON_OPERATORS_UPDATE_MOVING_AVERAGE_OP_H_ #ifndef DRAGON_OPERATORS_ARITHMETIC_ACCUMULATE_OP_H_
#define DRAGON_OPERATORS_UPDATE_MOVING_AVERAGE_OP_H_ #define DRAGON_OPERATORS_ARITHMETIC_ACCUMULATE_OP_H_
#include "core/operator.h" #include "core/operator.h"
namespace dragon { namespace dragon {
template <class Context> template <class Context>
class MovingAverageOp final : public Operator<Context> { class AccumulateOp final : public Operator<Context> {
public: public:
MovingAverageOp(const OperatorDef& def, Workspace* ws) AccumulateOp(const OperatorDef& def, Workspace* ws)
: Operator<Context>(def, ws), : Operator<Context>(def, ws),
decay(OperatorBase::Arg<float>("decay", 1.f)) {} alpha(OperatorBase::Arg<float>("alpha", 1.f)),
beta(OperatorBase::Arg<float>("beta", 1.f)) {}
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
template <typename T> void RunWithType(); template <typename T> void RunWithType(Tensor* X, Tensor* Y);
protected: protected:
float decay; float alpha, beta;
}; };
} // namespace dragon } // namespace dragon
#endif // DRAGON_OPERATORS_ARITHMETIC_ACCUMULATE_OP_H_
#endif // DRAGON_OPERATORS_UPDATE_MOVING_AVERAGE_OP_H_ \ No newline at end of file
\ No newline at end of file
...@@ -46,12 +46,12 @@ class AffineGradientOp final : public Operator<Context> { ...@@ -46,12 +46,12 @@ class AffineGradientOp final : public Operator<Context> {
void RunOnDevice() override; void RunOnDevice() override;
template <typename T> void BiasRunWithType(); template <typename T> void BiasRunWithType();
template <typename T> void ScaleRunWithType(); template <typename T> void ScaleRunWithType();
template <typename T> void ComputeScaleGradient(T* dYxX, T* dA);
template <typename T> void RunWithType(); template <typename T> void RunWithType();
protected: protected:
int64_t axis, num_axes; int64_t axis, num_axes;
int64_t outer_dim, inner_dim, scale_dim, sum_dim, dim; int64_t outer_dim, inner_dim, scale_dim, sum_dim, dim;
Tensor sum_result;
}; };
#ifdef WITH_CUDNN #ifdef WITH_CUDNN
...@@ -125,18 +125,12 @@ public: ...@@ -125,18 +125,12 @@ public:
template <typename DT, typename CT> template <typename DT, typename CT>
void ComputeScaleGradient(DT* dYxX, DT* dA); void ComputeScaleGradient(DT* dYxX, DT* dA);
template <typename DT, typename CT>
void ComputeBiasGradient(const DT* dY, DT* dB);
template <typename T> void ComputeScaleGradient_v2(T* dYxX, T* dA); template <typename T> void ComputeScaleGradient_v2(T* dYxX, T* dA);
template <typename T> void ComputeBiasGradient_v2(const T* dY, T* dB);
template <typename DT, typename CT> void RunWithType(); template <typename DT, typename CT> void RunWithType();
protected: protected:
USE_CUDNN_AFFINE_FUCNTIONS; USE_CUDNN_AFFINE_FUCNTIONS;
int64_t outer_dim, inner_dim, scale_dim, dim, sum_dim; int64_t outer_dim, inner_dim, scale_dim, dim, sum_dim;
Tensor sum_result;
}; };
#endif #endif
......
...@@ -10,36 +10,33 @@ ...@@ -10,36 +10,33 @@
* ------------------------------------------------------------ * ------------------------------------------------------------
*/ */
#ifndef DRAGON_OPERATORS_VISION_DENSE_CONCAT_OP_H_ #ifndef DRAGON_OPERATORS_ARITHMETIC_SQRT_OP_H_
#define DRAGON_OPERATORS_VISION_DENSE_CONCAT_OP_H_ #define DRAGON_OPERATORS_ARITHMETIC_SQRT_OP_H_
#include "operators/ndarray/concat_op.h" #include "core/operator.h"
namespace dragon { namespace dragon {
template <class Context> template <class Context>
class DenseConcatOp final : public ConcatOp<Context> { class SqrtOp final : public Operator<Context> {
public: public:
DenseConcatOp(const OperatorDef& def, Workspace* ws) USE_SIMPLE_CTOR_DTOR(SqrtOp);
: ConcatOp<Context>(def, ws) {}
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
template <typename T> void RunWithType();
}; };
template <class Context> template <class Context>
class DenseConcatGradientOp final : public ConcatGradientOp<Context> { class SqrtGradientOp final : public Operator<Context> {
public: public:
DenseConcatGradientOp(const OperatorDef& def, Workspace* ws) USE_SIMPLE_CTOR_DTOR(SqrtGradientOp);
: ConcatGradientOp<Context>(def, ws),
growth_rate(OperatorBase::Arg<int64_t>("growth_rate", 0)) {}
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void ElimateCorruption() override; void RunOnDevice() override;
template <typename T> void RestoreX1(); template <typename T> void RunWithType();
protected:
int64_t growth_rate;
}; };
} // namespace dragon } // namespace dragon
#endif // DRAGON_OPERATORS_VISION_DENSE_CONCAT_OP_H_ #endif // DRAGON_OPERATORS_ARITHMETIC_SQRT_OP_H_
\ No newline at end of file \ No newline at end of file
...@@ -19,7 +19,7 @@ namespace dragon { ...@@ -19,7 +19,7 @@ namespace dragon {
template <class Context> template <class Context>
class SquareOp final : public Operator<Context> { class SquareOp final : public Operator<Context> {
public: public:
USE_SIMPLE_CTOR_DTOR(SquareOp); USE_SIMPLE_CTOR_DTOR(SquareOp);
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
...@@ -29,7 +29,7 @@ public: ...@@ -29,7 +29,7 @@ public:
template <class Context> template <class Context>
class SquareGradientOp final : public Operator<Context> { class SquareGradientOp final : public Operator<Context> {
public: public:
USE_SIMPLE_CTOR_DTOR(SquareGradientOp); USE_SIMPLE_CTOR_DTOR(SquareGradientOp);
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
......
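Reference CPU forms of the two elementwise gradients above, written as standalone sketches rather than the library's actual kernels:

    // Sqrt:   y = sqrt(x)  =>  dx = 0.5 * dy / y
    void SqrtGrad(int n, const float* dy, const float* y, float* dx) {
        for (int i = 0; i < n; ++i) dx[i] = 0.5f * dy[i] / y[i];
    }
    // Square: y = x * x    =>  dx = 2 * x * dy
    void SquareGrad(int n, const float* dy, const float* x, float* dx) {
        for (int i = 0; i < n; ++i) dx[i] = 2.f * x[i] * dy[i];
    }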
...@@ -37,7 +37,7 @@ class SigmoidFocalLossOp ...@@ -37,7 +37,7 @@ class SigmoidFocalLossOp
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
template <typename T> void RunWithType(); template <typename Tx, typename Ty> void RunWithType();
protected: protected:
float alpha, gamma, pos_alpha, neg_alpha; float alpha, gamma, pos_alpha, neg_alpha;
...@@ -66,7 +66,7 @@ class SigmoidFocalLossGradientOp ...@@ -66,7 +66,7 @@ class SigmoidFocalLossGradientOp
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
template <typename T> void RunWithType(); template <typename Tx, typename Ty> void RunWithType();
protected: protected:
float alpha, gamma, pos_alpha, neg_alpha; float alpha, gamma, pos_alpha, neg_alpha;
......
...@@ -37,7 +37,7 @@ class SoftmaxFocalLossOp ...@@ -37,7 +37,7 @@ class SoftmaxFocalLossOp
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
template <typename T> void RunWithType(); template <typename Tx, typename Ty> void RunWithType();
protected: protected:
float alpha, gamma, pos_alpha, neg_alpha; float alpha, gamma, pos_alpha, neg_alpha;
...@@ -66,7 +66,7 @@ class SoftmaxFocalLossGradientOp ...@@ -66,7 +66,7 @@ class SoftmaxFocalLossGradientOp
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
template <typename T> void RunWithType(); template <typename Tx, typename Ty> void RunWithType();
protected: protected:
float alpha, gamma, pos_alpha, neg_alpha; float alpha, gamma, pos_alpha, neg_alpha;
......
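For reference, the per-element focal loss these classes compute has the standard form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). A standalone sketch of the sigmoid variant (not the library's kernel), with pos_alpha/neg_alpha selecting alpha_t by label sign:

    #include <algorithm>
    #include <cmath>

    float SigmoidFocalLoss1(float logit, float target, float pos_alpha,
                            float neg_alpha, float gamma) {
        float p  = 1.f / (1.f + std::exp(-logit));   // sigmoid probability
        float pt = target > 0.f ? p : 1.f - p;       // prob of the true class
        float at = target > 0.f ? pos_alpha : neg_alpha;
        return -at * std::pow(1.f - pt, gamma)
                   * std::log(std::max(pt, 1e-12f)); // clamp to avoid log(0)
    }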
...@@ -10,29 +10,41 @@ ...@@ -10,29 +10,41 @@
* ------------------------------------------------------------ * ------------------------------------------------------------
*/ */
#ifndef DRAGON_OPERATORS_MISC_ASTYPE_OP_H_ #ifndef DRAGON_OPERATORS_MISC_CAST_OP_H_
#define DRAGON_OPERATORS_MISC_ASTYPE_OP_H_ #define DRAGON_OPERATORS_MISC_CAST_OP_H_
#include "core/operator.h" #include "core/operator.h"
namespace dragon { namespace dragon {
template <class Context> template <class Context>
class AsTypeOp final : public Operator<Context> { class CastOp final : public Operator<Context> {
public: public:
AsTypeOp(const OperatorDef& def, Workspace* ws) CastOp(const OperatorDef& def, Workspace* ws)
: Operator<Context>(def, ws), : Operator<Context>(def, ws),
dtype(OperatorBase::Arg<string>("dtype", "float32")), dtype(OperatorBase::Arg<string>("dtype", "float32")),
inplace(OperatorBase::Arg<bool>("inplace", false)) {} inplace(OperatorBase::Arg<bool>("inplace", false)) {}
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
protected: protected:
string dtype; string dtype;
bool inplace; bool inplace;
}; };
template <class Context>
class CastGradientOp final : public Operator<Context> {
public:
USE_SIMPLE_CTOR_DTOR(CastGradientOp);
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
protected:
string dtype;
};
} // namespace dragon } // namespace dragon
#endif // DRAGON_OPERATORS_MISC_ASTYPE_OP_H_ #endif // DRAGON_OPERATORS_MISC_CAST_OP_H_
\ No newline at end of file \ No newline at end of file
...@@ -128,7 +128,7 @@ public: ...@@ -128,7 +128,7 @@ public:
template <class Context> template <class Context>
class TruncatedNormalOp final : public InitializeOp<Context> { class TruncatedNormalOp final : public InitializeOp<Context> {
public: public:
TruncatedNormalOp(const OperatorDef& def, Workspace* ws) TruncatedNormalOp(const OperatorDef& def, Workspace* ws)
: InitializeOp<Context>(def, ws) { : InitializeOp<Context>(def, ws) {
this->filler_proto.set_type("truncated_normal"); this->filler_proto.set_type("truncated_normal");
......
...@@ -25,8 +25,7 @@ class AdamUpdateOp final : public UpdateOpBase<Context> { ...@@ -25,8 +25,7 @@ class AdamUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context); USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override; void ComputeUpdates(Tensor* dX) override;
void ComputeRunWithFloat16() override;
protected: protected:
int t; float lr, beta1, beta2, eps; int t; float lr, beta1, beta2, eps;
......
...@@ -75,7 +75,6 @@ class CollectiveUpdateOp final : public Operator<Context> { ...@@ -75,7 +75,6 @@ class CollectiveUpdateOp final : public Operator<Context> {
#ifdef WITH_NCCL #ifdef WITH_NCCL
ncclComm_t nccl_comm; ncclComm_t nccl_comm;
CUDAClosure<Context> closure;
#endif #endif
}; };
......
...@@ -25,8 +25,7 @@ class NesterovUpdateOp final : public UpdateOpBase<Context> { ...@@ -25,8 +25,7 @@ class NesterovUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context); USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override; void ComputeUpdates(Tensor* dX) override;
void ComputeRunWithFloat16() override;
protected: protected:
float lr, momentum; float lr, momentum;
......
...@@ -25,8 +25,7 @@ class RMSPropUpdateOp final : public UpdateOpBase<Context> { ...@@ -25,8 +25,7 @@ class RMSPropUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context); USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override; void ComputeUpdates(Tensor* dX) override;
void ComputeRunWithFloat16() override;
protected: protected:
float lr, decay, eps; float lr, decay, eps;
......
...@@ -26,8 +26,7 @@ class SGDUpdateOp final : public UpdateOpBase<Context> { ...@@ -26,8 +26,7 @@ class SGDUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context); USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override; void ComputeUpdates(Tensor* dX) override;
void ComputeRunWithFloat16() override;
protected: protected:
float old_lr, lr, momentum, correction; float old_lr, lr, momentum, correction;
......
...@@ -24,29 +24,29 @@ class UpdateOpBase : public Operator<Context> { ...@@ -24,29 +24,29 @@ class UpdateOpBase : public Operator<Context> {
: Operator<Context>(def, ws), : Operator<Context>(def, ws),
lr_mult(OperatorBase::Arg<float>("lr_mult", 1.f)), lr_mult(OperatorBase::Arg<float>("lr_mult", 1.f)),
decay_mult(OperatorBase::Arg<float>("decay_mult", 1.f)), decay_mult(OperatorBase::Arg<float>("decay_mult", 1.f)),
slot(OperatorBase::Arg<string>("slot", "")), slot(OperatorBase::Arg<string>("slot", "")) {
zero_grad(OperatorBase::Arg<bool>("zero_grad", true)) {
CHECK(!slot.empty()) << "\nA non-empty slot is required"; CHECK(!slot.empty()) << "\nA non-empty slot is required";
} }
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
string Slot() { return slot + "/" + Output(0)->name(); }
float Param(const string& name) const; float Param(const string& name) const;
string Slot();
void RunOnDevice() override; template <typename T>
template <typename T> void PreprocessRunWithType(); void ProcessGradients(Tensor* dX, Tensor* X);
virtual void ComputeRunWithFloat32() = 0; virtual void ComputeUpdates(Tensor* dX) = 0;
virtual void ComputeRunWithFloat16() = 0;
void UpdateRunWithFloat32(); template <typename T>
void UpdateRunWithFloat16(); void ApplyUpdates(Tensor* dX, Tensor* X);
void RunOnDevice() override;
protected: protected:
float lr_mult, decay_mult; float lr_mult, decay_mult;
float l2_decay, clip_thresh, scale_factor; float l2_decay, clip_thresh, scale_factor;
string slot; string slot;
bool zero_grad;
}; };
#define USE_UPDATER_FUNCTIONS(context) \ #define USE_UPDATER_FUNCTIONS(context) \
......
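The refactor above funnels every updater through ProcessGradients -> ComputeUpdates -> ApplyUpdates. A reference SGD-with-momentum ComputeUpdates body as a sketch, where h is the slot buffer and u the pending update; this is not the library's kernel:

    void SGDComputeUpdates(int n, float lr, float momentum,
                           const float* g, float* h, float* u) {
        for (int i = 0; i < n; ++i) {
            h[i] = momentum * h[i] + lr * g[i];  // history/slot update
            u[i] = h[i];                         // value later applied to X
        }
    }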
...@@ -88,6 +88,7 @@ class CuDNNConv2dOp final : public Conv2dOp<Context> { ...@@ -88,6 +88,7 @@ class CuDNNConv2dOp final : public Conv2dOp<Context> {
} }
void RunOnDevice() override; void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc(); template <typename T> void ResetDesc();
template <typename T> void RunWithType(); template <typename T> void RunWithType();
...@@ -101,7 +102,7 @@ class CuDNNConv2dOp final : public Conv2dOp<Context> { ...@@ -101,7 +102,7 @@ class CuDNNConv2dOp final : public Conv2dOp<Context> {
cudnnFilterDescriptor_t filter_desc; cudnnFilterDescriptor_t filter_desc;
size_t fwd_data_size; size_t fwd_data_size;
int64_t cudnn_group; int64_t cudnn_group;
vector<int64_t> input_dims; vector<int64_t> input_dims, filter_dims;
bool enable_tensor_core; bool enable_tensor_core;
}; };
...@@ -142,6 +143,7 @@ class CuDNNConv2dGradientOp final : public Conv2dGradientOp<Context> { ...@@ -142,6 +143,7 @@ class CuDNNConv2dGradientOp final : public Conv2dGradientOp<Context> {
} }
void RunOnDevice() override; void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc(); template <typename T> void ResetDesc();
template <typename T> void RunWithType(); template <typename T> void RunWithType();
...@@ -156,7 +158,7 @@ class CuDNNConv2dGradientOp final : public Conv2dGradientOp<Context> { ...@@ -156,7 +158,7 @@ class CuDNNConv2dGradientOp final : public Conv2dGradientOp<Context> {
cudnnFilterDescriptor_t filter_desc; cudnnFilterDescriptor_t filter_desc;
size_t bwd_filter_size, bwd_data_size; size_t bwd_filter_size, bwd_data_size;
int64_t cudnn_group; int64_t cudnn_group;
vector<int64_t> input_dims; vector<int64_t> input_dims, filter_dims;
bool enable_tensor_core; bool enable_tensor_core;
}; };
......
...@@ -20,10 +20,10 @@ namespace dragon { ...@@ -20,10 +20,10 @@ namespace dragon {
template <class Context> template <class Context>
class ConvTranspose2dOp : public ConvOpBase<Context> { class ConvTranspose2dOp : public ConvOpBase<Context> {
public: public:
ConvTranspose2dOp(const OperatorDef& def, Workspace* ws) ConvTranspose2dOp(const OperatorDef& def, Workspace* ws)
: ConvOpBase<Context>(def, ws) { : ConvOpBase<Context>(def, ws) {
this->num_spatial_axes = 2; this->num_spatial_axes = 2;
Setup(); Setup();
} }
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
USE_CONVOLUTION_FUNCTIONS; USE_CONVOLUTION_FUNCTIONS;
...@@ -95,6 +95,7 @@ class CuDNNConvTranspose2dOp final ...@@ -95,6 +95,7 @@ class CuDNNConvTranspose2dOp final
} }
void RunOnDevice() override; void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc(); template <typename T> void ResetDesc();
template <typename T> void RunWithType(); template <typename T> void RunWithType();
...@@ -108,7 +109,7 @@ class CuDNNConvTranspose2dOp final ...@@ -108,7 +109,7 @@ class CuDNNConvTranspose2dOp final
cudnnFilterDescriptor_t filter_desc; cudnnFilterDescriptor_t filter_desc;
size_t fwd_data_size; size_t fwd_data_size;
int64_t cudnn_group; int64_t cudnn_group;
vector<int64_t> input_dims; vector<int64_t> output_dims, filter_dims;
bool enable_tensor_core; bool enable_tensor_core;
}; };
...@@ -152,6 +153,7 @@ public: ...@@ -152,6 +153,7 @@ public:
} }
void RunOnDevice() override; void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc(); template <typename T> void ResetDesc();
template <typename T> void RunWithType(); template <typename T> void RunWithType();
...@@ -166,7 +168,7 @@ public: ...@@ -166,7 +168,7 @@ public:
cudnnFilterDescriptor_t filter_desc; cudnnFilterDescriptor_t filter_desc;
size_t bwd_filter_size, bwd_data_size; size_t bwd_filter_size, bwd_data_size;
int64_t cudnn_group; int64_t cudnn_group;
vector<int64_t> input_dims; vector<int64_t> output_dims, filter_dims;
bool enable_tensor_core; bool enable_tensor_core;
}; };
......
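A sketch of the shape guard the new input_dims/filter_dims (or output_dims) members imply for the four cuDNN classes above: descriptors are rebuilt only when a cached shape changes. The exact guard is an assumption:

    // Inside RunWithType(), before dispatching (illustrative):
    if (input_dims != Input(0).dims() || filter_dims != Input(1).dims()) {
        input_dims  = Input(0).dims();
        filter_dims = Input(1).dims();
        ResetDesc<T>();  // re-create cudnn tensor/filter/conv descriptors
    }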
...@@ -55,6 +55,7 @@ class NNResizeGradientOp final : public Operator<Context> { ...@@ -55,6 +55,7 @@ class NNResizeGradientOp final : public Operator<Context> {
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
void RunWithFloat16();
template <typename T> void RunWithType(); template <typename T> void RunWithType();
protected: protected:
......
...@@ -26,7 +26,7 @@ class Pool2dOp : public Operator<Context> { ...@@ -26,7 +26,7 @@ class Pool2dOp : public Operator<Context> {
data_format(OperatorBase::Arg<string>("data_format", "NCHW")), data_format(OperatorBase::Arg<string>("data_format", "NCHW")),
padding(OperatorBase::Arg<string>("padding", "VALID")), padding(OperatorBase::Arg<string>("padding", "VALID")),
global_pooling(OperatorBase::Arg<bool>("global_pooling", false)), global_pooling(OperatorBase::Arg<bool>("global_pooling", false)),
ceil_mode(OperatorBase::Arg<bool>("ceil", true)) { ceil_mode(OperatorBase::Arg<bool>("ceil_mode", true)) {
auto ks = OperatorBase::Args<int64_t>("kernel_shape"); auto ks = OperatorBase::Args<int64_t>("kernel_shape");
auto s = OperatorBase::Args<int64_t>("strides"); auto s = OperatorBase::Args<int64_t>("strides");
auto p = OperatorBase::Args<int64_t>("pads"); auto p = OperatorBase::Args<int64_t>("pads");
...@@ -68,7 +68,7 @@ class Pool2dGradientOp : public Operator<Context> { ...@@ -68,7 +68,7 @@ class Pool2dGradientOp : public Operator<Context> {
data_format(OperatorBase::Arg<string>("data_format", "NCHW")), data_format(OperatorBase::Arg<string>("data_format", "NCHW")),
padding(OperatorBase::Arg<string>("padding", "VALID")), padding(OperatorBase::Arg<string>("padding", "VALID")),
global_pooling(OperatorBase::Arg<bool>("global_pooling", false)), global_pooling(OperatorBase::Arg<bool>("global_pooling", false)),
ceil_mode(OperatorBase::Arg<bool>("ceil", true)) { ceil_mode(OperatorBase::Arg<bool>("ceil_mode", true)) {
auto ks = OperatorBase::Args<int64_t>("kernel_shape"); auto ks = OperatorBase::Args<int64_t>("kernel_shape");
auto s = OperatorBase::Args<int64_t>("strides"); auto s = OperatorBase::Args<int64_t>("strides");
auto p = OperatorBase::Args<int64_t>("pads"); auto p = OperatorBase::Args<int64_t>("pads");
......
...@@ -54,6 +54,7 @@ class ROIAlignGradientOp final : public Operator<Context> { ...@@ -54,6 +54,7 @@ class ROIAlignGradientOp final : public Operator<Context> {
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
void RunWithFloat16();
template <typename T> void RunWithType(); template <typename T> void RunWithType();
protected: protected:
......
...@@ -49,6 +49,7 @@ class ROIPoolGradientOp final : public Operator<Context> { ...@@ -49,6 +49,7 @@ class ROIPoolGradientOp final : public Operator<Context> {
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
void RunWithFloat16();
template <typename T> void RunWithType(); template <typename T> void RunWithType();
protected: protected:
......
...@@ -12,7 +12,7 @@ namespace dragon { ...@@ -12,7 +12,7 @@ namespace dragon {
template <typename T> template <typename T>
using BlockReduce = cub::BlockReduce<T, CUDA_THREADS>; using BlockReduce = cub::BlockReduce<T, CUDA_THREADS>;
} } // namespace dragon
#endif // WITH_CUDA #endif // WITH_CUDA
......
...@@ -102,7 +102,7 @@ template <typename T, class Context> ...@@ -102,7 +102,7 @@ template <typename T, class Context>
void Set( void Set(
const int n, const int n,
const T alpha, const T alpha,
T* x, T* y,
Context* ctx); Context* ctx);
template <typename T, class Context> template <typename T, class Context>
...@@ -122,6 +122,15 @@ void Axpy( ...@@ -122,6 +122,15 @@ void Axpy(
Context* ctx); Context* ctx);
template<typename T, class Context> template<typename T, class Context>
void Axpby(
const int n,
const float alpha,
const T* x,
const float beta,
T* y,
Context* ctx);
template<typename T, class Context>
void AddScalar( void AddScalar(
const int n, const int n,
const float alpha, const float alpha,
...@@ -141,17 +150,8 @@ void AddScalar( ...@@ -141,17 +150,8 @@ void AddScalar(
template<typename T, class Context> template<typename T, class Context>
void InvStd( void InvStd(
const int n, const int n,
float eps, const float eps,
const T* x,
T* y,
Context* ctx);
template<typename T, class Context>
void Axpby(
const int n,
float alpha,
const T* x, const T* x,
float beta,
T* y, T* y,
Context* ctx); Context* ctx);
......
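Reference semantics for the relocated BLAS-like helpers above, as a plain CPU sketch (the real implementations dispatch per Context):

    // Axpby: y = alpha * x + beta * y   (Axpy is the beta == 1 case)
    void AxpbyRef(int n, float alpha, const float* x,
                  float beta, float* y) {
        for (int i = 0; i < n; ++i) y[i] = alpha * x[i] + beta * y[i];
    }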
...@@ -378,8 +378,8 @@ void NLLLoss( ...@@ -378,8 +378,8 @@ void NLLLoss(
const Tx* log_prob, const Tx* log_prob,
const Ty* labels, const Ty* labels,
const int* ignores, const int* ignores,
float* losses, Tx* losses,
float* flags, int* flags,
Context* ctx); Context* ctx);
template <typename Tx, typename Ty, class Context> template <typename Tx, typename Ty, class Context>
...@@ -392,7 +392,7 @@ void NLLLossGrad( ...@@ -392,7 +392,7 @@ void NLLLossGrad(
const Ty* labels, const Ty* labels,
const int* ignores, const int* ignores,
Tx* dx, Tx* dx,
float* flags, int* flags,
Context* ctx); Context* ctx);
/*! loss.sigmoid_ce_loss */ /*! loss.sigmoid_ce_loss */
...@@ -403,7 +403,7 @@ void SigmoidCrossEntropy( ...@@ -403,7 +403,7 @@ void SigmoidCrossEntropy(
const T* logits, const T* logits,
const T* targets, const T* targets,
T* losses, T* losses,
T* flags, int* flags,
Context* ctx); Context* ctx);
template <typename T, class Context> template <typename T, class Context>
...@@ -412,12 +412,12 @@ void SigmoidCrossEntropyGrad( ...@@ -412,12 +412,12 @@ void SigmoidCrossEntropyGrad(
const T* logits, const T* logits,
const T* targets, const T* targets,
T* dlogits, T* dlogits,
T* flags, int* flags,
Context* ctx); Context* ctx);
/*! loss.sigmoid_focal_loss */ /*! loss.sigmoid_focal_loss */
template <typename T, class Context> template <typename Tx, typename Ty, class Context>
void SigmoidFocalLoss( void SigmoidFocalLoss(
const int outer_dim, const int outer_dim,
const int axis_dim, const int axis_dim,
...@@ -426,13 +426,13 @@ void SigmoidFocalLoss( ...@@ -426,13 +426,13 @@ void SigmoidFocalLoss(
const float neg_alpha, const float neg_alpha,
const float gamma, const float gamma,
const int neg_id, const int neg_id,
const float* logits, const Tx* logits,
const float* targets, const Ty* targets,
float* losses, Tx* losses,
float* flags, int* flags,
Context* ctx); Context* ctx);
template <typename T, class Context> template <typename Tx, typename Ty, class Context>
void SigmoidFocalLossGrad( void SigmoidFocalLossGrad(
const int outer_dim, const int outer_dim,
const int axis_dim, const int axis_dim,
...@@ -441,10 +441,10 @@ void SigmoidFocalLossGrad( ...@@ -441,10 +441,10 @@ void SigmoidFocalLossGrad(
const float neg_alpha, const float neg_alpha,
const float gamma, const float gamma,
const int neg_id, const int neg_id,
const float* logits, const Tx* logits,
const float* targets, const Ty* targets,
float* dlogits, Tx* dlogits,
float* flags, int* flags,
Context* ctx); Context* ctx);
/*! loss.smooth_l1_loss */ /*! loss.smooth_l1_loss */
...@@ -477,7 +477,7 @@ void SoftmaxCrossEntropy( ...@@ -477,7 +477,7 @@ void SoftmaxCrossEntropy(
/*! loss.softmax_focal_loss */ /*! loss.softmax_focal_loss */
template <typename T, class Context> template <typename Tx, typename Ty, class Context>
void SoftmaxFocalLoss( void SoftmaxFocalLoss(
const int outer_dim, const int outer_dim,
const int axis_dim, const int axis_dim,
...@@ -487,14 +487,14 @@ void SoftmaxFocalLoss( ...@@ -487,14 +487,14 @@ void SoftmaxFocalLoss(
const float neg_alpha, const float neg_alpha,
const float gamma, const float gamma,
const int neg_id, const int neg_id,
const T* prob, const Tx* prob,
const T* labels, const Ty* labels,
const int* ignores, const int* ignores,
T* losses, Tx* losses,
T* flags, int* flags,
Context* ctx); Context* ctx);
template <typename T, class Context> template <typename Tx, typename Ty, class Context>
void SoftmaxFocalLossGrad( void SoftmaxFocalLossGrad(
const int outer_dim, const int outer_dim,
const int axis_dim, const int axis_dim,
...@@ -504,11 +504,11 @@ void SoftmaxFocalLossGrad( ...@@ -504,11 +504,11 @@ void SoftmaxFocalLossGrad(
const float neg_alpha, const float neg_alpha,
const float gamma, const float gamma,
const int neg_id, const int neg_id,
const T* prob, const Tx* prob,
const T* labels, const Ty* labels,
const int* ignores, const int* ignores,
T* dx, Tx* dx,
T* flags, int* flags,
Context* ctx); Context* ctx);
/*! loss.sparse_softmax_cross_entropy */ /*! loss.sparse_softmax_cross_entropy */
...@@ -522,8 +522,8 @@ void SparseSoftmaxCrossEntropy( ...@@ -522,8 +522,8 @@ void SparseSoftmaxCrossEntropy(
const Tx* prob, const Tx* prob,
const Ty* labels, const Ty* labels,
const int* ignores, const int* ignores,
float* losses, Tx* losses,
float* flags, int* flags,
Context* ctx); Context* ctx);
template <typename Tx, typename Ty, class Context> template <typename Tx, typename Ty, class Context>
...@@ -536,7 +536,7 @@ void SparseSoftmaxCrossEntropyGrad( ...@@ -536,7 +536,7 @@ void SparseSoftmaxCrossEntropyGrad(
const Ty* labels, const Ty* labels,
const int* ignores, const int* ignores,
Tx* dx, Tx* dx,
float* flags, int* flags,
Context* ctx); Context* ctx);
/*! misc.astype */ /*! misc.astype */
...@@ -548,6 +548,16 @@ void TypeA2B( ...@@ -548,6 +548,16 @@ void TypeA2B(
Tb* b, Tb* b,
Context* ctx); Context* ctx);
/*! misc.gradient */
template <typename T, class Context>
void GradientTwoSum(
const int count,
const T* dy1,
const T* dy2,
T* dx,
Context* ctx);
/*! misc.image_data */ /*! misc.image_data */
template <typename Tx, typename Ty, class Context> template <typename Tx, typename Ty, class Context>
...@@ -976,11 +986,18 @@ void SGDUpdate( ...@@ -976,11 +986,18 @@ void SGDUpdate(
/*! update.op_base */ /*! update.op_base */
template <typename T, class Context> template <typename T, class Context>
void MixedPrecisionL2Decay(
const int count,
const float alpha,
const T* w,
float* dx,
Context* ctx);
template <typename T, class Context>
void MixedPrecisionUpdate( void MixedPrecisionUpdate(
const int count, const int count,
const float* updates, const float* updates,
T* w, T* w,
T* g,
Context* ctx); Context* ctx);
/*! vision.bias_add */ /*! vision.bias_add */
......
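A reference CPU form of the new GradientTwoSum kernel declared above; whether it accumulates into dx or overwrites it is not shown, so accumulation is assumed here:

    void GradientTwoSumRef(int n, const float* dy1,
                           const float* dy2, float* dx) {
        for (int i = 0; i < n; ++i) dx[i] += dy1[i] + dy2[i];  // assumed +=
    }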
...@@ -37,6 +37,20 @@ inline std::vector<std::string> split( ...@@ -37,6 +37,20 @@ inline std::vector<std::string> split(
return ret; return ret;
} }
inline std::string replace_first(
const std::string& str,
const std::string& pattern,
const std::string& excepted) {
size_t pos = 0;
if ((pos = str.find(pattern)) != std::string::npos) {
std::string ret(str);
ret.replace(pos, pattern.size(), excepted);
return ret;
} else {
return str;
}
}
} // namespace str } // namespace str
} // namespace dragon } // namespace dragon
......
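Usage sketch: replace_first is what the argument-fetching macro earlier uses to expand "${ANCHOR}" in tensor descriptors; the descriptor and anchor below are illustrative:

    std::string desc = "/share/${ANCHOR}/starts";
    std::string key = dragon::str::replace_first(desc, "${ANCHOR}", "Crop_3");
    // key == "/share/Crop_3/starts"; only the first match is replaced.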
...@@ -269,7 +269,7 @@ void LoadONNXModel( ...@@ -269,7 +269,7 @@ void LoadONNXModel(
* * * *
* * * * * * * * * * * * * * * * * * * * */ * * * * * * * * * * * * * * * * * * * * */
void SetLogLevel(const std::string& level) { void SetLoggingLevel(const std::string& level) {
SetLogDestination(StrToLogSeverity(level)); SetLogDestination(StrToLogSeverity(level));
} }
......
...@@ -97,7 +97,7 @@ DRAGON_API std::string CreateGraph( ...@@ -97,7 +97,7 @@ DRAGON_API std::string CreateGraph(
DRAGON_API void RunGraph( DRAGON_API void RunGraph(
const std::string& graph_name, const std::string& graph_name,
Workspace_t ws, Workspace_t ws,
const int stream_id = 1); int stream_id = 0);
/* * * * * * * * * * * * * * * * * * * * * /* * * * * * * * * * * * * * * * * * * * *
* * * *
...@@ -156,7 +156,7 @@ DRAGON_API void LoadONNXModel( ...@@ -156,7 +156,7 @@ DRAGON_API void LoadONNXModel(
* * * *
* * * * * * * * * * * * * * * * * * * * */ * * * * * * * * * * * * * * * * * * * * */
DRAGON_API void SetLogLevel(const std::string& level); DRAGON_API void SetLoggingLevel(const std::string& level);
} // namespace dragon } // namespace dragon
......
...@@ -19,95 +19,45 @@ namespace dragon { ...@@ -19,95 +19,45 @@ namespace dragon {
namespace python { namespace python {
PyObject* CreateGradientDefsCC(PyObject* self, PyObject* args) { void AddGradientMethods(pybind11::module& m) {
PyObject* def_string = nullptr; m.def("CreateGradientDefs", [](
PyObject* py_g_outputs = nullptr; const string& forward_def,
if (!PyArg_ParseTuple(args, "SO!", const vector<string>& g_outputs) {
&def_string, &PyList_Type, &py_g_outputs)) { OperatorDef def;
PyErr_SetString(PyExc_ValueError, if (!def.ParseFromString(forward_def))
"Excepted a serialized string of OperatorDef " LOG(FATAL) << "Failed to parse the OperatorDef.";
"and a list containing outputs of this GradientOp."); if (!GradientRegistry()->Has(def.type()))
return nullptr; LOG(FATAL) << def.type() << "Op has no gradients.";
} Gradient grad = MakeGradientForOp(def, g_outputs);
OperatorDef def; vector<pybind11::bytes> grad_ops;
if (!def.ParseFromString(PyBytes_AsStringEx(def_string))) { for (const auto& e : grad.ops)
PyErr_SetString(PyExc_ValueError, grad_ops.push_back(e.SerializeAsString());
"Failed to parse the OperatorDef."); return std::tuple<
return nullptr; vector<pybind11::bytes>, vector<string>, vector<float>
} >(grad_ops, grad.g_inputs, grad.defaults);
if (!GradientRegistry()->Has(def.type())) { });
PyErr_SetString(PyExc_KeyError,
"This Operator does not register GradientOp.");
return nullptr;
}
vector<string> g_outputs;
PyList_AsVecString(py_g_outputs, g_outputs, "ignore");
Gradient grad = MakeGradientForOp(def, g_outputs);
PyObject* g_ops = PyList_New(grad.ops.size());
PyObject* g_input = PyList_New(grad.g_inputs.size());
PyObject* g_defaults = PyList_New(grad.defaults.size());
for (int i = 0; i < grad.ops.size(); i++) {
PyObject* e = String_AsPyBytes(grad.ops[i].SerializeAsString());
SetPyList(g_ops, i, e);
}
for (int i = 0; i < grad.g_inputs.size(); i++) {
PyObject* e = String_AsPyUnicode(grad.g_inputs[i]);
SetPyList(g_input, i, e);
}
for (int i = 0; i < grad.defaults.size(); i++) {
PyObject* e = PyFloat_FromDouble(grad.defaults[i]);
SetPyList(g_defaults, i, e);
}
PyObject* pack = PyTuple_Pack(3, g_ops, g_input, g_defaults);
Py_XDECREF(g_ops);
Py_XDECREF(g_input);
Py_XDECREF(g_defaults);
return pack;
}
PyObject* RunGradientFlowCC(PyObject* self, PyObject* args) { m.def("FlowGradients", [](
PyObject* py_fp_ops, *py_targets; const vector<OperatorDef*>& forward_ops,
PyObject* py_input_grads, *py_ignore_grads; const vector<string>& targets,
PyObject* py_share_grads, *py_export_graph; const vector<string>& input_grads,
if (!PyArg_ParseTuple(args, "OOOOOO", const vector<string>& ignore_grads,
&py_fp_ops, &py_targets, const bool is_sharing,
&py_input_grads, &py_ignore_grads, const bool verbose) {
&py_share_grads, &py_export_graph)) { // Make => Optimize => Run
PyErr_SetString(PyExc_ValueError, GraphDef backward_ops;
"Excepted a list of serialized input ops, targets, " GraphGradientMaker maker;
"input grads, ignore grads and whehter to share grads or log graph."); for (auto& grad : input_grads) maker.AddExternalGrad(grad);
return nullptr; for (auto& grad : ignore_grads) maker.AddIgnoreGrad(grad);
} maker.Make(forward_ops, targets, backward_ops);
// Make -> Optm -> Run if (is_sharing) maker.Share(backward_ops);
vector<string> targets, input_grads, ignore_grads; pybind11::gil_scoped_release g;
PyList_AsVecString(py_targets, targets, ""); for (auto& op : backward_ops.op()) {
PyList_AsVecString(py_input_grads, input_grads, ""); if (verbose) std::cout << op.DebugString() << std::endl;
PyList_AsVecString(py_ignore_grads, ignore_grads, ""); if (op.has_uid()) ws()->RunOperator(op);
GraphDef fp_ops, bp_ops; else ws()->RunOperatorOnce(op);
if (!fp_ops.ParseFromString(PyBytes_AsStringEx(py_fp_ops))) { }
PyErr_SetString(PyExc_RuntimeError, });
"Failed to parse the GraphDef of forward ops.");
return nullptr;
}
GraphGradientMaker maker;
for (auto& grad : input_grads) maker.AddExternalGrad(grad);
for (auto& grad : ignore_grads) maker.AddIgnoreGrad(grad);
maker.Make(fp_ops, targets, bp_ops);
bool share_grads = PyObject_IsTrue(py_share_grads) ? true : false;
bool export_graph = PyObject_IsTrue(py_export_graph) ? true : false;
if (share_grads) maker.Share("/share/buffer/grads", bp_ops);
if (export_graph) {
Tensor* tensor = ws()->CreateTensor(
"/graph_def/dynamic/gradient_flow")->Reshape({ 1 });
string* data = tensor->mutable_data<string, CPUContext>();
data[0] = bp_ops.SerializeAsString();
tensor = ws()->CreateTensor(
"/graph_def/dynamic/forward_flow")->Reshape({ 1 });
data = tensor->mutable_data<string, CPUContext>();
data[0] = fp_ops.SerializeAsString();
}
for (auto& op : bp_ops.op()) ws()->RunOperator(op);
Py_RETURN_TRUE;
} }
} // namespace python } // namespace python
......
...@@ -19,15 +19,10 @@ namespace dragon { ...@@ -19,15 +19,10 @@ namespace dragon {
namespace python { namespace python {
inline PyObject* SetLogLevelCC(PyObject* self, PyObject* args) { void AddConfigMethods(pybind11::module& m) {
char* cname; m.def("SetLoggingLevel", [](const string& level) {
if (!PyArg_ParseTuple(args, "s", &cname)) { SetLogDestination(StrToLogSeverity(level));
PyErr_SetString(PyExc_ValueError, });
"Excepted the logging level.");
return nullptr;
}
SetLogDestination(StrToLogSeverity(string(cname)));
Py_RETURN_TRUE;
} }
} // namespace python } // namespace python
......
...@@ -19,15 +19,34 @@ namespace python { ...@@ -19,15 +19,34 @@ namespace python {
#include "py_dragon.h" #include "py_dragon.h"
inline PyObject* IsCUDADriverSufficientCC(PyObject* self, PyObject* args) { void AddCUDAMethods(pybind11::module& m) {
m.def("IsCUDADriverSufficient", []() {
#ifdef WITH_CUDA #ifdef WITH_CUDA
int count; int count;
cudaError_t err = cudaGetDeviceCount(&count); cudaError_t err = cudaGetDeviceCount(&count);
if (err == cudaErrorInsufficientDriver) return PyBool_FromLong(0); if (err == cudaErrorInsufficientDriver) return false;
return PyBool_FromLong(1); return true;
#else #else
return PyBool_FromLong(0); return false;
#endif #endif
});
m.def("cudaGetDevice", []() {
return CUDAContext::active_device_id();
});
m.def("cudaStreamSynchronize", [](
int device_id, int stream_id) {
#ifdef WITH_CUDA
if (device_id < 0) device_id =
CUDAContext::active_device_id();
cudaStreamSynchronize(CUDAContext::cuda_object()
->GetStream(device_id, stream_id));
cudaError_t error = cudaGetLastError();
CHECK_EQ(error, cudaSuccess)
<< "\nCUDA Error: " << cudaGetErrorString(error);
#endif
});
} }
} // namespace python } // namespace python
......
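The binding pattern repeated in these files, reduced to a self-contained pybind11 sketch; the module and function names are illustrative:

    #include <pybind11/pybind11.h>
    namespace py = pybind11;

    PYBIND11_MODULE(example, m) {
        m.def("Synchronize", []() {
            py::gil_scoped_release g;  // let other Python threads run
            /* blocking device work would go here */
        });
    }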
...@@ -13,8 +13,9 @@ ...@@ -13,8 +13,9 @@
#ifndef DRAGON_PYTHON_PY_DRAGON_H_ #ifndef DRAGON_PYTHON_PY_DRAGON_H_
#define DRAGON_PYTHON_PY_DRAGON_H_ #define DRAGON_PYTHON_PY_DRAGON_H_
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include "py_types.h" #include "py_types.h"
#include "py_macros.h"
#include "core/common.h" #include "core/common.h"
#include "core/registry.h" #include "core/registry.h"
#include "core/context.h" #include "core/context.h"
...@@ -25,6 +26,9 @@ ...@@ -25,6 +26,9 @@
#include "core/workspace.h" #include "core/workspace.h"
#include "utils/caffemodel.h" #include "utils/caffemodel.h"
#include <pybind11/stl.h>
#include <pybind11/pybind11.h>
namespace dragon { namespace dragon {
namespace python { namespace python {
...@@ -32,83 +36,80 @@ namespace python { ...@@ -32,83 +36,80 @@ namespace python {
class TensorFetcherBase { class TensorFetcherBase {
public: public:
virtual ~TensorFetcherBase() {} virtual ~TensorFetcherBase() {}
virtual PyObject* Fetch(const Tensor& tensor) = 0; virtual pybind11::object Fetch(const Tensor& tensor) = 0;
}; };
class TensorFeederBase { class TensorFeederBase {
public: public:
virtual ~TensorFeederBase() {} virtual ~TensorFeederBase() {}
virtual PyObject* Feed( virtual void Feed(
const DeviceOption& option, const DeviceOption& option,
PyArrayObject* array, PyArrayObject* array,
Tensor* tensor) = 0; Tensor* tensor) = 0;
}; };
DECLARE_TYPED_REGISTRY(TensorFetcherRegistry, TypeId, TensorFetcherBase); DECLARE_TYPED_REGISTRY(TensorFetcherRegistry, TypeId, TensorFetcherBase);
#define REGISTER_TENSOR_FETCHER(type, ...) \ #define REGISTER_TENSOR_FETCHER(type, ...) \
REGISTER_TYPED_CLASS(TensorFetcherRegistry, type, __VA_ARGS__) REGISTER_TYPED_CLASS(TensorFetcherRegistry, type, __VA_ARGS__)
inline TensorFetcherBase* CreateFetcher(TypeId type) { inline TensorFetcherBase* CreateFetcher(TypeId type) {
return TensorFetcherRegistry()->Create(type); return TensorFetcherRegistry()->Create(type);
} }
DECLARE_TYPED_REGISTRY(TensorFeederRegistry, TypeId, TensorFeederBase); DECLARE_TYPED_REGISTRY(TensorFeederRegistry, TypeId, TensorFeederBase);
#define REGISTER_TENSOR_FEEDER(type, ...) \ #define REGISTER_TENSOR_FEEDER(type, ...) \
REGISTER_TYPED_CLASS(TensorFeederRegistry, type, __VA_ARGS__) REGISTER_TYPED_CLASS(TensorFeederRegistry, type, __VA_ARGS__)
class NumpyFetcher : public TensorFetcherBase { class NumpyFetcher : public TensorFetcherBase {
public: public:
PyObject* Fetch(const Tensor& tensor) override { pybind11::object Fetch(const Tensor& tensor) override {
CHECK_GT(tensor.count(), 0); CHECK_GT(tensor.count(), 0);
vector<npy_intp> npy_dims; vector<npy_intp> npy_dims;
for (const auto dim : tensor.dims()) npy_dims.push_back(dim); for (const auto dim : tensor.dims()) npy_dims.push_back(dim);
int npy_type = TypeMetaToNPY(tensor.meta()); int npy_type = TypeMetaToNPY(tensor.meta());
if (npy_type == -1) { if (npy_type == -1) {
string s = "The data type of Tensor(" + LOG(FATAL) << "The data type of Tensor(" +
tensor.name() + ") is unknown. Have you solved it ?"; tensor.name() + ") is unknown. Have you solved it ?";
PyErr_SetString(PyExc_RuntimeError, s.c_str());
return nullptr;
} }
CHECK(tensor.memory()) << "\nIllegal memory access.";
// Create a empty array with the same shape // Create a empty array with the same shape
PyObject* array = PyArray_SimpleNew( PyObject* array = PyArray_SimpleNew(
tensor.ndim(), npy_dims.data(), npy_type); tensor.ndim(), npy_dims.data(), npy_type);
// Copy the tensor data to the numpy array // Copy the tensor data to the numpy array
if (tensor.memory_state() == MixedMemory::STATE_AT_CUDA) { if (tensor.memory_state() == MixedMemory::STATE_AT_CUDA) {
CUDAContext::Memcpy<CPUContext, CUDAContext>(tensor.nbytes(), CUDAContext::MemcpyEx<CPUContext, CUDAContext>(tensor.nbytes(),
PyArray_DATA(reinterpret_cast<PyArrayObject*>(array)), PyArray_DATA(reinterpret_cast<PyArrayObject*>(array)),
tensor.raw_data<CUDAContext>()); tensor.raw_data<CUDAContext>(),
tensor.memory()->device_id());
} else { } else {
CPUContext::Memcpy<CPUContext, CPUContext>(tensor.nbytes(), CPUContext::Memcpy<CPUContext, CPUContext>(tensor.nbytes(),
PyArray_DATA(reinterpret_cast<PyArrayObject*>(array)), PyArray_DATA(reinterpret_cast<PyArrayObject*>(array)),
tensor.raw_data<CPUContext>()); tensor.raw_data<CPUContext>());
} }
return array; return pybind11::reinterpret_steal<pybind11::object>(array);
} }
}; };
class StringFetcher : public TensorFetcherBase { class StringFetcher : public TensorFetcherBase {
public: public:
PyObject* Fetch(const Tensor& tensor) override { pybind11::object Fetch(const Tensor& tensor) override {
CHECK_GT(tensor.count(), 0); CHECK_EQ(tensor.count(), 1);
return String_AsPyBytes(*tensor.data<string, CPUContext>()); return pybind11::bytes(tensor.data<string, CPUContext>()[0]);
} }
}; };
class NumpyFeeder : public TensorFeederBase { class NumpyFeeder : public TensorFeederBase {
public: public:
PyObject* Feed( void Feed(
const DeviceOption& option, const DeviceOption& option,
PyArrayObject* original_array, PyArrayObject* original_array,
Tensor* tensor) override { Tensor* tensor) override {
PyArrayObject* array = PyArray_GETCONTIGUOUS(original_array); PyArrayObject* array = PyArray_GETCONTIGUOUS(original_array);
const TypeMeta& meta = TypeNPYToMeta(PyArray_TYPE(array)); const TypeMeta& meta = TypeNPYToMeta(PyArray_TYPE(array));
if (meta.id() == 0) { if (meta.id() == 0) LOG(FATAL) << "Unsupported data type.";
PyErr_SetString(PyExc_TypeError, "Unsupported data type."); tensor->SetMeta(meta);
return nullptr;
}
if (meta.id() != tensor->meta().id() && tensor->meta().id() != 0)
LOG(WARNING) << "Feed Tensor(" << tensor->name() << ")"
<< " with different data type from original one.";
int ndim = PyArray_NDIM(array); int ndim = PyArray_NDIM(array);
npy_intp* npy_dims = PyArray_DIMS(array); npy_intp* npy_dims = PyArray_DIMS(array);
vector<int64_t> dims; vector<int64_t> dims;
...@@ -116,21 +117,22 @@ class NumpyFeeder : public TensorFeederBase { ...@@ -116,21 +117,22 @@ class NumpyFeeder : public TensorFeederBase {
tensor->Reshape(dims); tensor->Reshape(dims);
if (option.device_type() == PROTO_CUDA) { if (option.device_type() == PROTO_CUDA) {
#ifdef WITH_CUDA #ifdef WITH_CUDA
CUDAContext context(option); CUDAContext::MemcpyEx<CUDAContext, CPUContext>(
context.SwitchToDevice(); tensor->nbytes(),
auto* data = tensor->raw_mutable_data<CUDAContext>(meta); tensor->raw_mutable_data<CUDAContext>(),
context.Memcpy<CUDAContext, CPUContext>(tensor->nbytes(), static_cast<void*>(PyArray_DATA(array)),
data, static_cast<void*>(PyArray_DATA(array))); option.device_id());
#else #else
LOG(FATAL) << "CUDA was not compiled."; LOG(FATAL) << "CUDA was not compiled.";
#endif #endif
} else { } else {
auto* data = tensor->raw_mutable_data<CPUContext>(meta); auto* data = tensor->raw_mutable_data<CPUContext>();
CPUContext::Memcpy<CPUContext, CPUContext>(tensor->nbytes(), CPUContext::Memcpy<CPUContext, CPUContext>(
data, static_cast<void*>(PyArray_DATA(array))); tensor->nbytes(),
tensor->raw_mutable_data<CPUContext>(),
static_cast<void*>(PyArray_DATA(array)));
} }
Py_XDECREF(array); Py_XDECREF(array);
Py_RETURN_TRUE;
} }
}; };
......
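NumpyFeeder and NumpyFetcher are what a Python round-trip ultimately hits. A hedged sketch, assuming the workspace module wraps them as FeedTensor/FetchTensor (names not shown in this hunk):

    import numpy as np
    import dragon.core.workspace as workspace

    # Feeding routes through NumpyFeeder (host->device copy under CUDA)
    workspace.FeedTensor('x', np.ones((2, 3), dtype=np.float32))
    # Fetching routes through NumpyFetcher and returns a new ndarray
    print(workspace.FetchTensor('x'))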
...@@ -19,66 +19,41 @@ namespace dragon { ...@@ -19,66 +19,41 @@ namespace dragon {
namespace python { namespace python {
inline PyObject* CreateGraphCC(PyObject* self, PyObject* args) { void AddGraphMethods(pybind11::module& m) {
PyObject* graph_str, *verbose; /*! \brief Create a graph from the serialized def */
if (!PyArg_ParseTuple(args, "S|O", &graph_str, &verbose)) { m.def("CreateGraph", [](
PyErr_SetString(PyExc_ValueError, const string& serialized,
"Excepted a serialized string of GraphDef."); const bool verbose) {
return nullptr; GraphDef graph_def;
} if (!graph_def.ParseFromString(serialized))
if (verbose == nullptr) verbose = Py_False; LOG(FATAL) << "Failed to parse the GraphDef.";
auto* graph = ws()->CreateGraph(graph_def);
GraphDef graph_def; if (verbose) {
if (!graph_def.ParseFromString(PyBytes_AsStringEx(graph_str))) { // It is not a good design to print the debug string
PyErr_SetString(PyExc_RuntimeError,
"Failed to parse the GraphDef.");
return nullptr;
}
auto* graph = ws()->CreateGraph(graph_def);
if (!graph) {
PyErr_SetString(PyExc_RuntimeError,
"Failed to create the Graph.");
return nullptr;
} else {
// It is not a good design to print the debug string
if (PyObject_IsTrue(verbose) ? true : false) {
auto* graph_tensor = ws()->CreateTensor( auto* graph_tensor = ws()->CreateTensor(
"/graph_def/optimized/" + graph->name()); "/graph_def/optimized/" + graph->name());
if (graph_tensor->count() > 0) { if (graph_tensor->count() > 0) {
auto* data = graph_tensor->mutable_data<string, CPUContext>(); auto* data = graph_tensor->mutable_data<string, CPUContext>();
std::cout << data[0] << std::endl; std::cout << data[0] << std::endl;
} }
}
}
// Return the graph name may be different from the def
// We will make a unique dummy name on creating the graph
return String_AsPyUnicode(graph->name());
}
inline PyObject* RunGraphCC(PyObject* self, PyObject* args) {
char* cname, *include, *exclude;
if (!PyArg_ParseTuple(args, "sss",
&cname, &include, &exclude)) {
PyErr_SetString(PyExc_ValueError,
"Excepted the graph name, include and exclude rules.");
return nullptr;
}
ws()->RunGraph(
string(cname),
string(include),
string(exclude)
);
Py_RETURN_TRUE;
}
inline PyObject* GraphsCC(PyObject* self, PyObject* args) { }
vector<string> graphs = ws()->GetGraphs(); // The returned graph name may differ from the def;
PyObject* list = PyList_New(graphs.size()); // a unique dummy name is made when creating the graph
for (int i = 0; i < graphs.size(); i++) return graph->name();
CHECK_EQ(PyList_SetItem(list, i, String_AsPyUnicode(graphs[i])), 0); });
return list;
/*! \brief Run an existing graph */
m.def("RunGraph", [](
const string& name,
const string& include,
const string& exclude) {
pybind11::gil_scoped_release g;
ws()->RunGraph(name, include, exclude);
});
/*! \brief List all of the existing graphs */
m.def("Graphs", []() { ws()->GetGraphs(); });
} }
} // namespace python } // namespace python
......
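A usage sketch of the two bindings, assuming GraphDef is generated into dragon_pb2; note that CreateGraph returns the (possibly uniquified) graph name, which is what RunGraph expects:

    import dragon.import_c_api as C
    from dragon.proto import dragon_pb2 as pb

    graph_def = pb.GraphDef()
    graph_def.name = 'demo'
    graph_name = C.CreateGraph(graph_def.SerializeToString(), False)
    C.RunGraph(graph_name, '', '')  # empty include/exclude rules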
...@@ -19,48 +19,42 @@ namespace dragon { ...@@ -19,48 +19,42 @@ namespace dragon {
namespace python { namespace python {
inline PyObject* SnapshotCC(PyObject* self, PyObject* args) { void AddIOMethods(pybind11::module& m) {
char* path; int format; m.def("Snapshot", [](
PyObject* names; vector<Tensor*> tensors; const string& filename,
if (!PyArg_ParseTuple(args, "sOi", &path, &names, &format)) { vector<string>& names,
PyErr_SetString(PyExc_ValueError, const int format) {
"Excepted the model path, tensors, and data format."); vector<Tensor*> tensors;
return nullptr; switch (format) {
} case 0: // Pickle
switch (format) { LOG(FATAL) << "Format depends on Pickle. "
case 0: // Pickle "Can't be used in C++.";
PyErr_SetString(PyExc_NotImplementedError, break;
"Format depends on Pickle. Can't be used in C++."); case 1: // CaffeModel
break; for (const auto& e : names)
case 1: // CaffeModel tensors.emplace_back(ws()->GetTensor(e));
for (int i = 0; i < PyList_Size(names); i++) SavaCaffeModel(filename, tensors);
tensors.push_back(ws()->GetTensor( break;
PyString_AsString(PyList_GetItem(names, i)))); default:
SavaCaffeModel(path, tensors); LOG(FATAL) << "Unknwon format, code: " << format;
break; }
default: LOG(FATAL) << "Unknwon format, code: " << format; });
}
Py_RETURN_TRUE;
}
inline PyObject* RestoreCC(PyObject* self, PyObject* args) { m.def("Restore", [](
char* path; int format; const string& filename,
if (!PyArg_ParseTuple(args, "si", &path, &format)) { const int format) {
PyErr_SetString(PyExc_ValueError, switch (format) {
"Excepted the model path and data format."); case 0: // Pickle
return nullptr; LOG(FATAL) << "Format depends on Pickle. "
} "Can't be used in C++.";
switch (format) { break;
case 0: // Pickle case 1: // CaffeModel
PyErr_SetString(PyExc_NotImplementedError, LoadCaffeModel(filename, ws());
"Format depends on Pickle. Can't be used in C++."); break;
break; default:
case 1: // CaffeModel LOG(FATAL) << "Unknwon format, code: " << format;
LoadCaffeModel(path, ws()); }
break; });
default: LOG(FATAL) << "Unknwon format, code: " << format; LOG(FATAL) << "Unknown format, code: " << format;
}
Py_RETURN_TRUE;
} }
} // namespace python } // namespace python
......
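A sketch of the C++-side checkpointing, with a hypothetical path and tensor names; only format 1 (CaffeModel) is usable here, since format 0 requires Python's pickle:

    import dragon.import_c_api as C

    C.Snapshot('/tmp/net.caffemodel', ['conv1/weight', 'conv1/bias'], 1)
    C.Restore('/tmp/net.caffemodel', 1)  # loads back into the workspace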
/*!
* Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
*
* Licensed under the BSD 2-Clause License.
* You should have received a copy of the BSD 2-Clause License
* along with the software. If not, See,
*
* <https://opensource.org/licenses/BSD-2-Clause>
*
* ------------------------------------------------------------
*/
#ifndef DRAGON_PYTHON_PY_MACROS_H_
#define DRAGON_PYTHON_PY_MACROS_H_
#include <string>
#include <sstream>
#include <Python.h>
#include <numpy/arrayobject.h>
namespace dragon {
namespace python {
#ifdef WITH_PYTHON3
#define PyInt_FromLong PyLong_FromLong
#define _PyInt_AsInt _PyLong_AsInt
#define PyString_AsString PyUnicode_AsUTF8
#endif
/*!
* ------------------------------------------------------------
*
* <Having Fun with PyString>
*
* For Python3, Get/Return PyUnicode for regular string.
* For Python3, Get/Return PyBytes for google-protobuf.
* For Python2, Get/Return PyBytes only.
*
* ------------------------------------------------------------
*/
#define PyBytes_AsStringEx(pystring) \
std::string(PyBytes_AsString(pystring), PyBytes_Size(pystring))
// Return string to Python
inline PyObject* String_AsPyBytes(const std::string& cstring) {
return PyBytes_FromStringAndSize(cstring.c_str(), cstring.size());
}
inline PyObject* String_AsPyUnicode(const std::string& cstring) {
#ifdef WITH_PYTHON3
return PyUnicode_FromStringAndSize(cstring.c_str(), cstring.size());
#else
return PyBytes_FromStringAndSize(cstring.c_str(), cstring.size());
#endif
}
// Macros
#define PyList_AsVecString(plist, vs, defaults) \
for (int i = 0; i < PyList_Size(plist); i++) { \
PyObject* e = PyList_GetItem(plist, i); \
if (e == Py_None) vs.emplace_back(defaults); \
else vs.push_back(PyString_AsString(PyObject_Str(e))); \
}
#define SetPyList(plist, ix, e) \
PyList_SetItem(plist, ix, e)
#define SetPyDictS2S(object, key, value) \
PyDict_SetItemString(object, key, Py_BuildValue("s", value))
#define SetPyDictS2I(object, key, value) \
PyDict_SetItemString(object, key, Py_BuildValue("i", value))
// Misc
template <typename T>
inline void MakeStringInternal(std::stringstream& ss, const T& t) { ss << t; }
template <typename T,typename ... Args>
inline void MakeStringInternal(std::stringstream& ss, const T& t, const Args& ... args) {
MakeStringInternal(ss, t);
MakeStringInternal(ss, args...);
}
template <typename ... Args>
std::string MakeString(const Args&... args) {
std::stringstream ss;
MakeStringInternal(ss, args...);
return std::string(ss.str());
}
inline void PrErr_SetString(PyObject* type, const std::string& str) {
PyErr_SetString(type, str.c_str());
}
} // namespace python
} // namespace dragon
#endif // DRAGON_PYTHON_PY_MACROS_H_
\ No newline at end of file
...@@ -15,125 +15,126 @@ ...@@ -15,125 +15,126 @@
#include "py_dragon.h" #include "py_dragon.h"
namespace dragon {
namespace python {
#ifdef WITH_MPI #ifdef WITH_MPI
#include <mpi.h> #include <mpi.h>
#endif
inline PyObject* MPIInitCC(PyObject* self, PyObject* args) { namespace dragon {
int thread_type;
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &thread_type);
CHECK_EQ(thread_type, MPI_THREAD_MULTIPLE)
<< "\nRequire to enable <MPI_THREAD_MULTIPLE> support.";
Py_RETURN_TRUE;
}
inline PyObject* MPIFinalizeCC(PyObject* self, PyObject* args) { namespace python {
MPI_Finalize();
Py_RETURN_TRUE;
}
inline PyObject* MPIRankCC(PyObject* self, PyObject* args) { void AddMPIMethods(pybind11::module& m) {
int world_rank; m.def("MPIInit", []() {
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); #ifdef WITH_MPI
return PyInt_FromLong(world_rank); // Enabling multi-threading for Python is usually meaningless,
} // but we still keep this interface here
int thread_type;
char* mt_is_required = nullptr;
mt_is_required = getenv("DRAGON_MPI_THREADS_ENABLE");
if (mt_is_required != nullptr && string(mt_is_required) == "1") {
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &thread_type);
CHECK_EQ(thread_type, MPI_THREAD_MULTIPLE)
<< "\nRequire to enable <MPI_THREAD_MULTIPLE> support.";
} else {
MPI_Init_thread(NULL, NULL, MPI_THREAD_SINGLE, &thread_type);
}
#else
LOG(FATAL) << "MPI was not compiled.";
#endif
});
inline PyObject* MPISizeCC(PyObject* self, PyObject* args) { m.def("MPIRank", []() {
int world_size; #ifdef WITH_MPI
MPI_Comm_size(MPI_COMM_WORLD, &world_size); int world_rank;
return PyInt_FromLong(world_size); MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
} return world_rank;
#else
LOG(FATAL) << "MPI was not compiled.";
#endif
});
inline PyObject* MPICreateGroupCC(PyObject* self, PyObject* args) { m.def("MPISize", []() {
PyObject *incl, *excl, *ret; #ifdef WITH_MPI
int local_root, world_size; int world_size;
if (!PyArg_ParseTuple(args, "iOO", &local_root, &incl, &excl)) { MPI_Comm_size(MPI_COMM_WORLD, &world_size);
PyErr_SetString(PyExc_ValueError, return world_size;
"Excepted the local root, include and exclued list."); #else
return nullptr; LOG(FATAL) << "MPI was not compiled.";
} #endif
MPI_Group world_group, local_group; });
MPI_Comm local_comm;
int err_code; m.def("MPICreateGroup", [](
MPI_Comm_group(MPI_COMM_WORLD, &world_group); const int local_root,
MPI_Comm_size(MPI_COMM_WORLD, &world_size); const vector<int>& incl,
set<int> all_ranks; const vector<int>& excl) {
for (int i = 0; i < world_size; i++) all_ranks.insert(i); #ifdef WITH_MPI
local_group = world_group; int world_size;
MPI_Group world_group, local_group;
// Check inclue ranks MPI_Comm local_comm;
int size = (int)PyList_Size(incl); int err_code;
if (size > 0) { MPI_Comm_group(MPI_COMM_WORLD, &world_group);
all_ranks.clear(); MPI_Comm_size(MPI_COMM_WORLD, &world_size);
unique_ptr<int> incl_ranks(new int[size]);
int* ranks = incl_ranks.get(); set<int> all_ranks;
for (int i = 0; i < size; i++) { for (int i = 0; i < world_size; i++) all_ranks.insert(i);
ranks[i] = _PyInt_AsInt(PyList_GetItem(incl, i)); local_group = world_group;
all_ranks.insert(ranks[i]);
} // Check include ranks
err_code = MPI_Group_incl(world_group, size, ranks, &local_group); if (!incl.empty()) {
CHECK(err_code == MPI_SUCCESS) << "\nFail to create mpi group."; all_ranks.clear();
} for (auto e : incl) all_ranks.insert(e);
err_code = MPI_Group_incl(world_group,
// Check exclude ranks (int)incl.size(), incl.data(), &local_group);
size = (int)PyList_Size(excl); CHECK(err_code == MPI_SUCCESS)
if (size > 0) { << "\nFailed to create the MPI group.";
all_ranks.clear(); Set<int> tmp;
unique_ptr<int> excl_ranks(new int[size]);
int* ranks = excl_ranks.get();
for (int i = 0; i < size; i++) {
ranks[i] = _PyInt_AsInt(PyList_GetItem(excl, i));
tmp.insert(ranks[i]);
} }
for (int i = 0; i < world_size; i++)
if (!tmp.count(i)) all_ranks.insert(i);
err_code = MPI_Group_excl(world_group, size, ranks, &local_group);
CHECK(err_code == MPI_SUCCESS) << "Fail to create mpi group.";
}
err_code = MPI_Comm_create(MPI_COMM_WORLD, local_group, &local_comm); // Check exclude ranks
CHECK(err_code == MPI_SUCCESS) << "Fail to create mpi group."; if (!excl.empty()) {
all_ranks.clear(); Set<int> tmp;
for (auto e : excl) tmp.insert(e);
for (int i = 0; i < world_size; i++)
if (!tmp.count(i)) all_ranks.insert(i);
err_code = MPI_Group_excl(world_group,
(int)excl.size(), excl.data(), &local_group);
CHECK(err_code == MPI_SUCCESS)
<< "\nFail to create MPI Group.";
}
if (local_comm != MPI_COMM_NULL) { err_code = MPI_Comm_create(MPI_COMM_WORLD, local_group, &local_comm);
int world_rank, local_size; CHECK(err_code == MPI_SUCCESS) << "\nFailed to create the MPI group.";
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
if (world_rank == local_root) { if (local_comm != MPI_COMM_NULL) {
MPI_Comm_size(local_comm, &local_size); int world_rank, local_size;
std::stringstream ss; MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
ss << "Rank[" << world_rank << "]: " if (world_rank == local_root) {
<< "Create a mpi group of " << local_size << " members"; MPI_Comm_size(local_comm, &local_size);
ss << "\nGroup: ["; std::stringstream ss;
for (auto rank : all_ranks) { ss << "Rank[" << world_rank << "]: "
if (rank != local_root) ss << rank << ", "; << "Create an MPI group of " << local_size << " members";
else ss << rank << "*, "; ss << "\nGroup: [";
for (auto rank : all_ranks) {
if (rank != local_root) ss << rank << ", ";
else ss << rank << "*, ";
}
string log_info = ss.str(); log_info[log_info.size() - 2] = ']';
LOG(INFO) << log_info;
} }
string log_info = ss.str(); log_info[log_info.size() - 2] = ']';
LOG(INFO) << log_info;
} }
} return vector<long>({ (long)local_comm, (long)local_group });
ret = PyList_New(2); #else
PyList_SetItem(ret, 0, PyInt_FromLong((long)local_comm)); LOG(FATAL) << "MPI was not compiled.";
PyList_SetItem(ret, 1, PyInt_FromLong((long)local_group)); #endif
return ret; });
}
#else // WITH_MPI
#define MPI_NOT_IMPLEMENTED \
LOG(FATAL) << "MPI was not compiled."; \
Py_RETURN_TRUE
inline PyObject* MPIInitCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; } m.def("MPIFinalize", []() {
inline PyObject* MPIFinalizeCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; } #ifdef WITH_MPI
inline PyObject* MPIRankCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; } MPI_Finalize();
inline PyObject* MPISizeCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; } #else
inline PyObject* MPICreateGroupCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; } LOG(FATAL) << "MPI was not compiled.";
#endif
#endif // WITH_MPI });
}
} // namespace python } // namespace python
......
/*! /*!
* Copyright (c) 2017-present, SeetaTech, Co.,Ltd. * Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
* *
* Licensed under the BSD 2-Clause License. * Licensed under the BSD 2-Clause License.
* You should have received a copy of the BSD 2-Clause License * You should have received a copy of the BSD 2-Clause License
* along with the Xpensource.org/licenses/BSD-2-Clause> * along with the software. If not, See,
* *
* ------------------------------------------------------------ * <https://opensource.org/licenses/BSD-2-Clause>
*/ *
* ------------------------------------------------------------
*/
#ifndef DRAGON_PYTHON_PY_ONNX_H_ #ifndef DRAGON_PYTHON_PY_ONNX_H_
#define DRAGON_PYTHON_PY_ONNX_H_ #define DRAGON_PYTHON_PY_ONNX_H_
...@@ -19,21 +21,18 @@ namespace dragon { ...@@ -19,21 +21,18 @@ namespace dragon {
namespace python { namespace python {
inline PyObject* ImportONNXModelCC(PyObject* self, PyObject* args) { void AddONNXMethods(pybind11::module& m) {
char* model_path; m.def("ImportONNXModel", [](
if (!PyArg_ParseTuple(args, "s", &model_path)) { const string& model_path) {
PyErr_SetString(PyExc_ValueError, GraphDef init_graph, pred_graph;
"Excepted the model path."); onnx::ONNXBackend onnx_backend;
return nullptr; onnx_backend.Prepare(model_path, &init_graph, &pred_graph);
} // Serializing to Python is intractable
GraphDef init_graph, pred_graph; // We should apply the initializer immediately
onnx::ONNXBackend onnx_backend; ws()->CreateGraph(init_graph);
onnx_backend.Prepare(model_path, &init_graph, &pred_graph); ws()->RunGraph(init_graph.name(), "", "");
// Serializing to Python is intractable return pybind11::bytes(pred_graph.SerializeAsString());
// We should apply the initializer immediately });
ws()->CreateGraph(init_graph);
ws()->RunGraph(init_graph.name(), "", "");
return String_AsPyBytes(pred_graph.SerializeAsString());
} }
} // namespace python } // namespace python
......
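A sketch of importing an ONNX model, with a hypothetical path; the initializer graph is already run inside the binding, so only the serialized predict graph comes back:

    import dragon.import_c_api as C
    from dragon.proto import dragon_pb2 as pb

    serialized = C.ImportONNXModel('/path/to/model.onnx')
    pred_graph = pb.GraphDef()
    pred_graph.ParseFromString(serialized)
    graph_name = C.CreateGraph(serialized, False)  # ready to run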
...@@ -19,91 +19,38 @@ namespace dragon { ...@@ -19,91 +19,38 @@ namespace dragon {
namespace python { namespace python {
inline PyObject* RegisteredOperatorsCC(PyObject* self, PyObject* args) { void AddOperatorMethods(pybind11::module& m) {
set<string> all_keys; /*! \brief Return all the registered operators */
for (const auto& name : CPUOperatorRegistry()->keys()) all_keys.insert(name); m.def("RegisteredOperators", []() { return CPUOperatorRegistry()->keys(); });
PyObject* list = PyList_New(all_keys.size());
int idx = 0; /*! \brief Return all the operators without gradients */
for (const string& name : all_keys) m.def("NoGradientOperators", []() { return NoGradientRegistry()->keys(); });
CHECK_EQ(PyList_SetItem(list, idx++, String_AsPyUnicode(name)), 0);
return list; /*! \brief Run an operator from the def reference */
} m.def("RunOperator", [](
OperatorDef* def,
inline PyObject* NoGradientOperatorsCC(PyObject* self, PyObject* args) { const bool verbose) {
set<string> all_keys; pybind11::gil_scoped_release g;
for (const auto& name : NoGradientRegistry()->keys()) all_keys.insert(name); if (verbose) {
PyObject* list = PyList_New(all_keys.size()); // It is not a good design to print the debug string
int idx = 0; std::cout << def->DebugString() << std::endl;
for (const string& name : all_keys) }
CHECK_EQ(PyList_SetItem(list, idx++, String_AsPyUnicode(name)), 0); ws()->RunOperator(*def);
return list; });
}
/*! \brief Run an operator from the serialized def */
inline PyObject* RunOperatorCC(PyObject* self, PyObject* args) { m.def("RunOperator", [](
PyObject* op_str; const string& serialized,
if (!PyArg_ParseTuple(args, "S", &op_str)) { const bool verbose) {
PyErr_SetString(PyExc_ValueError, OperatorDef def;
"Excepted a serialized string of OperatorDef."); CHECK(def.ParseFromString(serialized));
return nullptr; pybind11::gil_scoped_release g;
} if (verbose) {
OperatorDef op_def; // It is not a good design to print the debug string
if (!op_def.ParseFromString(PyBytes_AsStringEx(op_str))) { std::cout << def.DebugString() << std::endl;
PyErr_SetString(PyExc_RuntimeError, }
"Failed to parse the OperatorDef."); ws()->RunOperatorOnce(def);
return nullptr; });
}
ws()->RunOperator(op_def);
Py_RETURN_TRUE;
}
inline PyObject* RunOperatorsCC(PyObject* self, PyObject* args) {
PyObject* py_ops;
if (!PyArg_ParseTuple(args, "O", &py_ops)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a list of serialized string of OperatorDef.");
return nullptr;
}
OperatorDef op_def;
for (int i = 0; i < PyList_Size(py_ops); i++) {
PyObject* op_str = PyList_GetItem(py_ops, i);
CHECK(op_def.ParseFromString(PyBytes_AsStringEx(op_str)));
ws()->RunOperator(op_def);
}
Py_RETURN_TRUE;
}
inline PyObject* CreatePersistentOpCC(PyObject* self, PyObject* args) {
PyObject* op_str;
if (!PyArg_ParseTuple(args, "S", &op_str)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a serialized string of OperatorDef.");
return nullptr;
}
OperatorDef op_def;
if (!op_def.ParseFromString(PyBytes_AsStringEx(op_str))) {
PyErr_SetString(PyExc_RuntimeError,
"Failed to parse the OperatorDef.");
return nullptr;
}
ws()->CreatePersistentOp(op_def);
Py_RETURN_TRUE;
}
inline PyObject* RunPersistentOpCC(PyObject* self, PyObject* args) {
char* key, *anchor;
PyObject* py_inputs, *py_outputs;
if (!PyArg_ParseTuple(args, "ssOO",
&key, &anchor, &py_inputs, &py_outputs)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a persistent key, anchor, "
"list of inputs and outputs.");
return nullptr;
}
vector<string> inputs, outputs;
PyList_AsVecString(py_inputs, inputs, "");
PyList_AsVecString(py_outputs, outputs, "");
ws()->RunPersistentOp(key, anchor, inputs, outputs);
Py_RETURN_TRUE;
} }
} // namespace python } // namespace python
......
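A sketch pairing the serialized overload with the MakeOperatorDef helper updated later in this commit; the op type and tensor names are illustrative:

    import dragon.import_c_api as C
    from dragon.core.proto_utils import MakeOperatorDef

    op = MakeOperatorDef('Relu', inputs=['x'], outputs=['y'])
    C.RunOperator(op.SerializeToString(), False)  # GIL released while running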
/*!
* Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
*
* Licensed under the BSD 2-Clause License.
* You should have received a copy of the BSD 2-Clause License
* along with the software. If not, See,
*
* <https://opensource.org/licenses/BSD-2-Clause>
*
* ------------------------------------------------------------
*/
#ifndef DRAGON_PYTHON_PY_PROTO_H_
#define DRAGON_PYTHON_PY_PROTO_H_
#include "py_dragon.h"
namespace dragon {
namespace python {
void AddProtoMethods(pybind11::module& m) {
/*! \brief Extended C-Style OperatorDef */
pybind11::class_<OperatorDef>(m, "OperatorDef")
.def(pybind11::init())
.def("CopyFrom", [](
OperatorDef* self,
OperatorDef* other) {
self->CopyFrom(*other);
}).def("ParseFrom", [](
OperatorDef* self,
const string& serialized) {
self->ParseFromString(serialized);
}).def("SerializeAs", [](
OperatorDef* self) {
return pybind11::bytes(self->SerializeAsString());
}).def("add_input", [](
OperatorDef* self,
const string& input) {
self->add_input(input);
}).def("add_output", [](
OperatorDef* self,
const string& output) {
self->add_output(output);
}).def_property("name",
[](OperatorDef* self) {
return self->name(); },
[](OperatorDef* self, const string& name) {
self->set_name(name);
}).def_property("type",
[](OperatorDef* self) {
return self->type(); },
[](OperatorDef* self, const string& type) {
self->set_type(type);
}).def_property("input",
[](OperatorDef* self) -> vector<string> {
return { self->input().begin(), self->input().end() }; },
[](OperatorDef* self, const vector<string>& input) {
*(self->mutable_input()) = { input.begin(), input.end() };
}).def_property("output",
[](OperatorDef* self) -> vector<string> {
return{ self->output().begin(), self->output().end() }; },
[](OperatorDef* self, const vector<string>& output) {
*(self->mutable_output()) = { output.begin(), output.end() };
});
m.def("TestOperatorDefs", [](vector<OperatorDef*> defs) {
for (auto* def : defs) {
std::cout << def->DebugString() << std::endl;
}
});
}
} // namespace python
} // namespace dragon
#endif  // DRAGON_PYTHON_PY_PROTO_H_
\ No newline at end of file
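A sketch of driving the bound class from Python; this is essentially what MakeCXXOperatorDef (added later in this commit) automates:

    import dragon.import_c_api as C

    op = C.OperatorDef()
    op.type = 'Relu'
    op.add_input('x')
    op.add_output('y')
    # The def-reference overload of RunOperator avoids reserializing
    C.RunOperator(op, False)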
...@@ -13,6 +13,7 @@ ...@@ -13,6 +13,7 @@
#ifndef DRAGON_PYTHON_PY_TYPES_H_ #ifndef DRAGON_PYTHON_PY_TYPES_H_
#define DRAGON_PYTHON_PY_TYPES_H_ #define DRAGON_PYTHON_PY_TYPES_H_
#include <string>
#include <numpy/arrayobject.h> #include <numpy/arrayobject.h>
#include "core/types.h" #include "core/types.h"
...@@ -31,6 +32,7 @@ inline const int TypeMetaToNPY(const TypeMeta& meta) { ...@@ -31,6 +32,7 @@ inline const int TypeMetaToNPY(const TypeMeta& meta) {
{ TypeMeta::Id<float16>(), NPY_FLOAT16 }, { TypeMeta::Id<float16>(), NPY_FLOAT16 },
{ TypeMeta::Id<float>(), NPY_FLOAT32 }, { TypeMeta::Id<float>(), NPY_FLOAT32 },
{ TypeMeta::Id<double>(), NPY_FLOAT64 }, { TypeMeta::Id<double>(), NPY_FLOAT64 },
{ TypeMeta::Id<std::string>(), NPY_OBJECT },
}; };
return m2npy_type_map.count(meta.id()) ? m2npy_type_map[meta.id()] : -1; return m2npy_type_map.count(meta.id()) ? m2npy_type_map[meta.id()] : -1;
} }
...@@ -45,6 +47,8 @@ inline const TypeMeta& TypeNPYToMeta(int npy_type) { ...@@ -45,6 +47,8 @@ inline const TypeMeta& TypeNPYToMeta(int npy_type) {
{ NPY_FLOAT16, TypeMeta::Make<float16>() }, { NPY_FLOAT16, TypeMeta::Make<float16>() },
{ NPY_FLOAT32, TypeMeta::Make<float>() }, { NPY_FLOAT32, TypeMeta::Make<float>() },
{ NPY_FLOAT64, TypeMeta::Make<double>() }, { NPY_FLOAT64, TypeMeta::Make<double>() },
{ NPY_UNICODE, TypeMeta::Make<std::string>() },
{ NPY_STRING, TypeMeta::Make<std::string>() },
}; };
static TypeMeta unknown_type; static TypeMeta unknown_type;
return npy2m_type_map.count(npy_type) ? return npy2m_type_map.count(npy_type) ?
......
...@@ -24,6 +24,7 @@ from dragon.core.tensor import Tensor ...@@ -24,6 +24,7 @@ from dragon.core.tensor import Tensor
import dragon.core.workspace as workspace import dragon.core.workspace as workspace
import dragon.core.tensor_utils as tensor_utils import dragon.core.tensor_utils as tensor_utils
import dragon.core.mpi as mpi import dragon.core.mpi as mpi
import dragon.core.cuda as cuda
import dragon.memonger as memonger import dragon.memonger as memonger
# Operators # Operators
......
...@@ -23,7 +23,7 @@ option = {} ...@@ -23,7 +23,7 @@ option = {}
# The current device, 'CPU', 'CUDA' or 'CNML' # The current device, 'CPU', 'CUDA' or 'CNML'
option['device'] = 'CPU' option['device'] = 'CPU'
# The device id # The device index
option['device_id'] = 0 option['device_id'] = 0
# Whether to use cuDNN if possible # Whether to use cuDNN if possible
...@@ -32,8 +32,8 @@ option['use_cudnn'] = False ...@@ -32,8 +32,8 @@ option['use_cudnn'] = False
# The global random seed # The global random seed
option['random_seed'] = 3 option['random_seed'] = 3
# Disable the memonger if true # Set the level of graph optimization
option['debug_mode'] = False option['graph_optimization_level'] = 3
# Whether to share grads # Whether to share grads
option['share_grads'] = True option['share_grads'] = True
...@@ -76,29 +76,13 @@ def EnableCPU(): ...@@ -76,29 +76,13 @@ def EnableCPU():
option['device'] = 'CPU' option['device'] = 'CPU'
def IsCUDADriverSufficient():
"""Is CUDADriver sufficient?
Returns
-------
boolean
``True`` if your device(s) support CUDA otherwise ``False``.
References
----------
The wrapper of ``IsCUDADriverSufficientCC``.
"""
return C.IsCUDADriverSufficientCC()
def EnableCUDA(gpu_id=0, use_cudnn=True): def EnableCUDA(gpu_id=0, use_cudnn=True):
"""Enable NVIDIA's CUDA mode globally. """Enable NVIDIA's CUDA mode globally.
Parameters Parameters
---------- ----------
gpu_id : int gpu_id : int
The id of GPU to use. The index of GPU to use.
use_cudnn : boolean use_cudnn : boolean
Whether to use cuDNN if available. Whether to use cuDNN if available.
...@@ -119,7 +103,7 @@ def EnableCNML(mlu_id=0): ...@@ -119,7 +103,7 @@ def EnableCNML(mlu_id=0):
Parameters Parameters
---------- ----------
device_id : int device_id : int
The id of MLU to use. The index of MLU to use.
Returns Returns
------- -------
...@@ -161,12 +145,12 @@ def GetRandomSeed(): ...@@ -161,12 +145,12 @@ def GetRandomSeed():
def SetGPU(id): def SetGPU(id):
"""Set the global id GPU. """Set the global index GPU.
Parameters Parameters
---------- ----------
id : int id : int
The id of GPU to use. The index of GPU to use.
Returns Returns
------- -------
...@@ -178,26 +162,26 @@ def SetGPU(id): ...@@ -178,26 +162,26 @@ def SetGPU(id):
def GetGPU(): def GetGPU():
"""Get the global id of GPU. """Get the global index of GPU.
Returns Returns
------- -------
int int
The global id of GPU. The global index of GPU.
""" """
return option['device_id'] return option['device_id']
def SetDebugMode(enabled=True): def SetGraphType(graph_type=''):
"""Enable Debug mode globally. """Set the graph type.
It will disable all memory sharing optimizations. If empty, the default DAG graph will be used.
Parameters Parameters
---------- ----------
enabled : boolean graph_type : str
Whether to enable debug mode. The graph type.
Returns Returns
------- -------
...@@ -205,18 +189,28 @@ def SetDebugMode(enabled=True): ...@@ -205,18 +189,28 @@ def SetDebugMode(enabled=True):
""" """
global option global option
option['debug_mode'] = enabled option['graph_type'] = graph_type
def SetGraphType(graph_type=''): def SetGraphOptimizationLevel(level=3):
"""Set the graph type. """Set the default level of graph optimization.
If empty, the default DAG graph will be used. We have predefined four levels:
-O0(level=0): Do nothing.
-O1(level=1): Prune the redundant nodes.
-O2(level=2): Apply the in-place optimization to outputs.
Note that the graph will no longer be a DAG.
-O3(level=3): Allocate the buffer for outputs.
This level is memory-efficient, but debugging becomes non-trivial.
Parameters Parameters
---------- ----------
graph_type : str level : {0, 1, 2, 3}, optional, default=3
The graph type. The level, see the documentation for details.
Returns Returns
------- -------
...@@ -224,7 +218,7 @@ def SetGraphType(graph_type=''): ...@@ -224,7 +218,7 @@ def SetGraphType(graph_type=''):
""" """
global option global option
option['graph_type'] = graph_type option['graph_optimization_level'] = level
def LogMetaGraph(enabled=True): def LogMetaGraph(enabled=True):
...@@ -301,7 +295,7 @@ def SetLoggingLevel(level): ...@@ -301,7 +295,7 @@ def SetLoggingLevel(level):
The default level is *INFO*. The default level is *INFO*.
""" """
C.SetLogLevelCC(level) C.SetLoggingLevel(level)
logging.set_verbosity({ logging.set_verbosity({
'DEBUG': logging.DEBUG, 'DEBUG': logging.DEBUG,
'INFO': logging.INFO, 'INFO': logging.INFO,
......
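A usage sketch of the new switch; -O1 keeps the graph a debuggable DAG, while the default -O3 trades debuggability for memory:

    import dragon.config as cfg

    cfg.SetGraphOptimizationLevel(1)  # prune redundant nodes only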
# ------------------------------------------------------------
# Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
#
# Licensed under the BSD 2-Clause License.
# You should have received a copy of the BSD 2-Clause License
# along with the software. If not, See,
#
# <https://opensource.org/licenses/BSD-2-Clause>
#
# ------------------------------------------------------------
"""List some useful CUDA C++ API."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import dragon.import_c_api as C
def IsCUDADriverSufficient():
"""Is cuda driver sufficient?
Returns
-------
boolean
``True`` if your device(s) support CUDA otherwise ``False``.
"""
return C.IsCUDADriverSufficient()
def GetDevice():
"""Get the current active cuda device.
Returns
-------
int
The device index.
"""
return C.cudaGetDevice()
def SynchronizeStream(device_id=None, stream_id=0):
"""Synchronize the specified cuda stream.
If ``device_id`` is *None*, the current active device will be selected.
Parameters
----------
device_id : int or None
The device index.
stream_id : int
The stream index.
"""
return C.cudaStreamSynchronize(
device_id if device_id is not None else -1, stream_id)
\ No newline at end of file
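A usage sketch of the new module; device_id=None maps to -1, which the C++ binding resolves to the active device:

    import dragon.core.cuda as cuda

    if cuda.IsCUDADriverSufficient():
        print('Device:', cuda.GetDevice())
        cuda.SynchronizeStream(device_id=None, stream_id=0)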
...@@ -49,9 +49,9 @@ class GraphGradientMaker(object): ...@@ -49,9 +49,9 @@ class GraphGradientMaker(object):
Parameters Parameters
---------- ----------
forward_op : dragon_pb2.OperatorDef forward_op : OperatorDef
The OperatorDef of ``ForwardOp``. The OperatorDef of ``ForwardOp``.
g_outputs : list of str or list of None g_outputs : list of str
The inputs of ``BackwardOp`` (Precomputed grads). The inputs of ``BackwardOp`` (Precomputed grads).
name : str, optional name : str, optional
The optional operator name. The optional operator name.
...@@ -61,13 +61,9 @@ class GraphGradientMaker(object): ...@@ -61,13 +61,9 @@ class GraphGradientMaker(object):
tuple tuple
The OpDef, outputs and defaults of ``BackwardOp``. The OpDef, outputs and defaults of ``BackwardOp``.
References
----------
The wrapper of ``CreateGradientDefsCC``.
""" """
g_ops, g_inputs, defaults = \ g_ops, g_inputs, defaults = C.CreateGradientDefs(
C.CreateGradientDefsCC(forward_op.SerializeToString(), g_outputs) forward_op.SerializeToString(), g_outputs)
for idx, g_op in enumerate(g_ops): for idx, g_op in enumerate(g_ops):
new_def = pb.OperatorDef() new_def = pb.OperatorDef()
new_def.ParseFromString(g_op) new_def.ParseFromString(g_op)
...@@ -80,13 +76,13 @@ class GraphGradientMaker(object): ...@@ -80,13 +76,13 @@ class GraphGradientMaker(object):
Parameters Parameters
---------- ----------
forward_op : dragon_pb2.OperatorDef forward_op : OperatorDef
The OperatorDef of ``ForwardOp``. The OperatorDef of ``ForwardOp``.
inputs_to_grads : dict inputs_to_grads : dict
The dict of <input, g_input>. The dict of <input, g_input>.
blacklist : set of str blacklist : set of str
The set of ``NoGradient`` tensors. The set of ``NoGradient`` tensors.
targets : list of str targets : sequence of str
The solving targets. The solving targets.
Returns Returns
...@@ -123,7 +119,7 @@ class GraphGradientMaker(object): ...@@ -123,7 +119,7 @@ class GraphGradientMaker(object):
Parameters Parameters
---------- ----------
forward_ops : list of dragon_pb2.OperatorDef forward_ops : sequence of OperatorDef
The operators of ``ForwardOp``. The operators of ``ForwardOp``.
targets : sequence of str targets : sequence of str
The solving targets. The solving targets.
...@@ -168,12 +164,12 @@ class GraphGradientMaker(object): ...@@ -168,12 +164,12 @@ class GraphGradientMaker(object):
is_skip, gen_grads = \ is_skip, gen_grads = \
cls.CheckGrad(forward_op, inputs_to_grads, blacklist, targets) cls.CheckGrad(forward_op, inputs_to_grads, blacklist, targets)
# Missing grads are represented as ``None`` # Missing grads are represented as ``'ignore'``
g_outputs = list(inputs_to_grads.get(name, None) for name in forward_op.output) g_outputs = list(inputs_to_grads.get(name, 'ignore') for name in forward_op.output)
g_ops, g_inputs, defaults = cls.CreateGrad(forward_op, g_outputs) g_ops, g_inputs, defaults = cls.CreateGrad(forward_op, g_outputs)
# Append ops # Append ops
if not is_skip: if not is_skip:
# --> GenOp # GradientGenerateOp
if len(gen_grads) > 0: if len(gen_grads) > 0:
op_inputs = []; op_outputs = []; values = [] op_inputs = []; op_outputs = []; values = []
for item in gen_grads: for item in gen_grads:
...@@ -185,7 +181,7 @@ class GraphGradientMaker(object): ...@@ -185,7 +181,7 @@ class GraphGradientMaker(object):
if forward_op.HasField('device_option'): if forward_op.HasField('device_option'):
gen_op.device_option.CopyFrom(forward_op.device_option) gen_op.device_option.CopyFrom(forward_op.device_option)
backward_ops.append(gen_op) backward_ops.append(gen_op)
# --> GradOp # GradientOp
for g_op in g_ops: for g_op in g_ops:
g_op.name = OperatorHelper.get_name() if auto_names else 'runtime' g_op.name = OperatorHelper.get_name() if auto_names else 'runtime'
backward_ops.append(g_op) backward_ops.append(g_op)
......
...@@ -33,7 +33,7 @@ class OperatorHelper(object): ...@@ -33,7 +33,7 @@ class OperatorHelper(object):
# Input(0) => Output(0), shape and data type unchanged. # Input(0) => Output(0), shape and data type unchanged.
'Relu', 'PRelu', 'Elu', 'SElu', 'Sigmoid', 'Tanh', 'Dropout', 'Softmax', 'Relu', 'PRelu', 'Elu', 'SElu', 'Sigmoid', 'Tanh', 'Dropout', 'Softmax',
'Add', 'Sub', 'Mul', 'Div', 'Clip', 'Log', 'Exp', 'Pow', 'Square', 'Sqrt', 'Add', 'Sub', 'Mul', 'Div', 'Clip', 'Log', 'Exp', 'Pow', 'Square', 'Sqrt',
'Affine', 'Copy', 'Compare', 'StopGradient', 'MovingAverage', 'MPIBroadcast', 'Accumulate', 'Affine', 'Copy', 'Compare', 'StopGradient', 'MPIBroadcast',
'BatchNorm', 'GroupNorm', 'L2Norm', 'LRN', 'BiasAdd', 'DropBlock2d', 'BatchNorm', 'GroupNorm', 'L2Norm', 'LRN', 'BiasAdd', 'DropBlock2d',
) )
...@@ -885,10 +885,6 @@ class OperatorHelper(object): ...@@ -885,10 +885,6 @@ class OperatorHelper(object):
def _apply_BilinearResize(cls, arguments, inputs, outputs): def _apply_BilinearResize(cls, arguments, inputs, outputs):
return cls._apply_NNResize(arguments, inputs, outputs) return cls._apply_NNResize(arguments, inputs, outputs)
@classmethod
def _apply_DenseConcat(cls, arguments, inputs, outputs):
return cls._apply_Concat(arguments, inputs, outputs)
class GradientHelper(object): class GradientHelper(object):
"""A helper to store the known gradient relations. """A helper to store the known gradient relations.
......
...@@ -43,8 +43,9 @@ def get_logger(): ...@@ -43,8 +43,9 @@ def get_logger():
logger = _logging.getLogger('dragon') logger = _logging.getLogger('dragon')
logger.setLevel(INFO) logger.setLevel(INFO)
logger.propagate = False
if not _logging.getLogger().handlers: if True:
# Determine whether we are in an interactive environment # Determine whether we are in an interactive environment
_interactive = False _interactive = False
try: try:
......
...@@ -9,31 +9,15 @@ ...@@ -9,31 +9,15 @@
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
"""List some useful MPI C++ API."""
from __future__ import absolute_import from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import numpy as np
import dragon.import_c_api as C import dragon.import_c_api as C
__all__ = [
'Init',
'Is_Init',
'Rank',
'Size',
'CreateGroup',
'Snapshot',
'AllowSnapshot',
'Parallel',
'AllowParallel',
'SetParallelMode',
'GetParallelMode',
'Finalize',
]
_GLOBAL_MPI_IS_INIT = False _GLOBAL_MPI_IS_INIT = False
_GLOBAL_MPI_SNAPSHOT_RANKS = [] _GLOBAL_MPI_SNAPSHOT_RANKS = []
_GLOBAL_MPI_PARALLEL_GROUPS = [] _GLOBAL_MPI_PARALLEL_GROUPS = []
...@@ -55,12 +39,8 @@ def Init(): ...@@ -55,12 +39,8 @@ def Init():
----- -----
This function can only be called once. This function can only be called once.
References
----------
The wrapper of ``MPIInitCC``
""" """
C.MPIInitCC() C.MPIInit()
global _GLOBAL_MPI_IS_INIT global _GLOBAL_MPI_IS_INIT
global _GLOBAL_MPI_SNAPSHOT_RANKS global _GLOBAL_MPI_SNAPSHOT_RANKS
_GLOBAL_MPI_IS_INIT = True _GLOBAL_MPI_IS_INIT = True
...@@ -86,13 +66,9 @@ def Rank(): ...@@ -86,13 +66,9 @@ def Rank():
int int
The world rank. The world rank.
References
----------
The wrapper of ``MPIRankCC``.
""" """
_check_init() _check_init()
return C.MPIRankCC() return C.MPIRank()
def Size(): def Size():
...@@ -103,13 +79,9 @@ def Size(): ...@@ -103,13 +79,9 @@ def Size():
int int
The world size. The world size.
References
----------
The wrapper of ``MPISizeCC``.
""" """
_check_init() _check_init()
return C.MPISizeCC() return C.MPISize()
def CreateGroup(root=0, incl=[], excl=[]): def CreateGroup(root=0, incl=[], excl=[]):
...@@ -129,14 +101,9 @@ def CreateGroup(root=0, incl=[], excl=[]): ...@@ -129,14 +101,9 @@ def CreateGroup(root=0, incl=[], excl=[]):
tuple tuple
The local comm and group id. The local comm and group id.
References
----------
The wrapper of ``MPICreateGroupCC``.
""" """
_check_init() _check_init()
comm, group = C.MPICreateGroupCC(root, incl, excl) return C.MPICreateGroup(root, incl, excl)
return np.int64(comm), np.int64(group)
def Snapshot(incl): def Snapshot(incl):
...@@ -193,6 +160,7 @@ def AllowSnapshot(): ...@@ -193,6 +160,7 @@ def AllowSnapshot():
Returns Returns
------- -------
boolean boolean
""" """
return Rank() in _GLOBAL_MPI_SNAPSHOT_RANKS return Rank() in _GLOBAL_MPI_SNAPSHOT_RANKS
...@@ -212,12 +180,12 @@ def AllowParallel(): ...@@ -212,12 +180,12 @@ def AllowParallel():
def SetParallelMode(mode): def SetParallelMode(mode):
"""Set the mode of data parallelism. """Set the communication mode of data parallelism.
Parameters Parameters
---------- ----------
mode : str mode : {'MPI', 'NCCL'}, optional
The mode, ``MPI``, ``NCCL`` or ``MIXED``. The communication mode.
Returns Returns
------- -------
...@@ -228,20 +196,18 @@ def SetParallelMode(mode): ...@@ -228,20 +196,18 @@ def SetParallelMode(mode):
The default mode is ``MPI``. The default mode is ``MPI``.
""" """
assert mode == 'MPI' or \ assert mode == 'MPI' or mode == 'NCCL'
mode == 'NCCL' or \
mode == 'MIXED'
global _GLOBAL_MPI_PARALLEL_MODE global _GLOBAL_MPI_PARALLEL_MODE
_GLOBAL_MPI_PARALLEL_MODE = mode _GLOBAL_MPI_PARALLEL_MODE = mode
def GetParallelMode(): def GetParallelMode():
"""Get the current mode of data parallelism. """Get the current communication mode of data parallelism.
Returns Returns
------- -------
str str : {'MPI', 'NCCL'}
The mode, ``MPI``, ``NCCL`` or ``MIXED``. The communication mode.
""" """
return _GLOBAL_MPI_PARALLEL_MODE return _GLOBAL_MPI_PARALLEL_MODE
...@@ -260,4 +226,4 @@ def Finalize(): ...@@ -260,4 +226,4 @@ def Finalize():
""" """
_check_init() _check_init()
C.MPIFinalizeCC() C.MPIFinalize()
\ No newline at end of file \ No newline at end of file
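A usage sketch of the renamed wrappers, to be launched under mpirun; MPICreateGroup returns a (comm, group) pair of handles:

    import dragon.core.mpi as mpi

    mpi.Init()
    rank, size = mpi.Rank(), mpi.Size()
    comm, group = mpi.CreateGroup(root=0, incl=list(range(size)))
    mpi.SetParallelMode('NCCL')  # 'MIXED' is no longer accepted
    mpi.Finalize()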
...@@ -21,6 +21,7 @@ import numpy as np ...@@ -21,6 +21,7 @@ import numpy as np
from google.protobuf.message import Message from google.protobuf.message import Message
import dragon.config as cfg import dragon.config as cfg
import dragon.import_c_api as C
from dragon.proto import dragon_pb2 as pb from dragon.proto import dragon_pb2 as pb
from dragon.core.scope import get_default_device from dragon.core.scope import get_default_device
...@@ -50,14 +51,15 @@ else: ...@@ -50,14 +51,15 @@ else:
argument.name = key argument.name = key
if type(value) is float: argument.f = value if type(value) is float: argument.f = value
elif type(value) in (bool, int, long, np.int64) : argument.i = value elif type(value) in (bool, int, long, np.int64) : argument.i = value
elif type(value) in (str, unicode): argument.s = value elif type(value) is str: argument.s = value
elif type(value) is unicode: argument.s = str(value)
elif isinstance(value, Message): argument.s = value.SerializeToString() elif isinstance(value, Message): argument.s = value.SerializeToString()
elif all(type(v) is float for v in value): argument.floats.extend(value) elif all(type(v) is float for v in value): argument.floats.extend(value)
elif all(type(v) is int for v in value): argument.ints.extend(value) elif all(type(v) is int for v in value): argument.ints.extend(value)
elif all(type(v) is long for v in value): argument.ints.extend(value) elif all(type(v) is long for v in value): argument.ints.extend(value)
elif all(type(v) is str for v in value): argument.strings.extend(value) elif all(type(v) is str for v in value): argument.strings.extend(value)
elif all(type(v) is unicode or type(v) is str for v in value): elif all(type(v) is unicode for v in value):
argument.strings.extend(value) argument.strings.extend([str(v) for v in value])
elif all(isinstance(v, Message) for v in value): elif all(isinstance(v, Message) for v in value):
argument.strings.extend([v.SerializeToString() for v in value]) argument.strings.extend([v.SerializeToString() for v in value])
else: else:
...@@ -67,8 +69,10 @@ else: ...@@ -67,8 +69,10 @@ else:
return argument return argument
def MakeOperatorDef(op_type, inputs, outputs, name='', def MakeOperatorDef(
device_option=None, arg=None, engine=None, **kwargs): op_type, inputs=(), outputs=(),
name='', uid=None, device_option=None,
arg=None, engine=None, **kwargs):
operator = pb.OperatorDef() operator = pb.OperatorDef()
operator.type = op_type operator.type = op_type
operator.name = name operator.name = name
...@@ -81,22 +85,29 @@ def MakeOperatorDef(op_type, inputs, outputs, name='', ...@@ -81,22 +85,29 @@ def MakeOperatorDef(op_type, inputs, outputs, name='',
if 'random_seed' in kwargs: if 'random_seed' in kwargs:
operator.device_option.random_seed = kwargs['random_seed'] operator.device_option.random_seed = kwargs['random_seed']
del kwargs['random_seed'] del kwargs['random_seed']
if arg is not None: if uid is not None: operator.uid = uid
operator.arg.extend(arg) if arg is not None: operator.arg.extend(arg)
for k,v in kwargs.items(): for k,v in kwargs.items():
if v is None: continue if v is None: continue
operator.arg.add().CopyFrom(MakeArgument(k,v)) operator.arg.add().CopyFrom(MakeArgument(k,v))
return operator return operator
def MutableOperatorDef(meta_def, inputs, outputs): def MakeCXXOperatorDef(
op = pb.OperatorDef(); op.CopyFrom(meta_def) op_type, inputs=(), outputs=(),
op.ClearField('input'); op.input.extend(inputs) name='', uid=None, device_option=None,
op.ClearField('output'); op.output.extend(outputs) arg=None, engine=None, **kwargs):
return op c_def = C.OperatorDef()
py_def = MakeOperatorDef(
op_type, inputs, outputs, name, uid,
device_option, arg, engine, **kwargs)
c_def.ParseFrom(py_def.SerializeToString())
return c_def
def MakeDeviceOption(device_type, device_id, engine=None, rng_seed=None): def MakeDeviceOption(
device_type, device_id,
engine=None, rng_seed=None):
option = pb.DeviceOption() option = pb.DeviceOption()
option.device_type = device_type option.device_type = device_type
option.device_id = device_id option.device_id = device_id
...@@ -121,7 +132,9 @@ for i in range(_PREDEFINED_DEVICE_LIMITS): ...@@ -121,7 +132,9 @@ for i in range(_PREDEFINED_DEVICE_LIMITS):
MakeDeviceOption(identify, i, 'CUDNN') MakeDeviceOption(identify, i, 'CUDNN')
def GetDeviceOption(device_type, device_id=0, engine=None, rng_seed=None): def GetDeviceOption(
device_type, device_id=0,
engine=None, rng_seed=None):
ctx = (device_type, device_id, engine if engine else '') ctx = (device_type, device_id, engine if engine else '')
option = _PREDEFINED_DEVICE_OPTION_DICT[ctx] option = _PREDEFINED_DEVICE_OPTION_DICT[ctx]
if rng_seed is not None: if rng_seed is not None:
......
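A sketch of the updated helper; extra keyword arguments become typed Arguments via MakeArgument, and the argument names here are illustrative:

    from dragon.core.proto_utils import MakeOperatorDef

    op = MakeOperatorDef(
        'Conv2d', inputs=['x', 'w'], outputs=['y'],
        kernel_shape=[3, 3], strides=[1, 1])  # -> Argument.ints
    print(op)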
...@@ -88,11 +88,11 @@ class WorkspaceScope(object): ...@@ -88,11 +88,11 @@ class WorkspaceScope(object):
self.prev = 'default' self.prev = 'default'
def __enter__(self): def __enter__(self):
self.prev = C.CurrentWorkspaceCC() self.prev = C.CurrentWorkspace()
C.SwitchWorkspaceCC(self.ws, True) C.SwitchWorkspace(self.ws, True)
def __exit__(self, type, value, traceback): def __exit__(self, type, value, traceback):
C.SwitchWorkspaceCC(self.prev, True) C.SwitchWorkspace(self.prev, True)
_GLOBAL_TENSOR_STACK = _ThreadLocalStack() _GLOBAL_TENSOR_STACK = _ThreadLocalStack()
......
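A usage sketch of the scope, assuming the class is importable from dragon.core.workspace (adjust the import to wherever WorkspaceScope actually lives):

    import dragon.core.workspace as workspace

    with workspace.WorkspaceScope('my_ws'):
        pass  # tensors touched here live in the 'my_ws' workspace
    # The previous workspace is restored by __exit__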
...@@ -355,10 +355,9 @@ class Tensor(object): ...@@ -355,10 +355,9 @@ class Tensor(object):
""" """
if inplace: if inplace:
return Tensor.CreateOperator( return Tensor.CreateOperator(
'AsType', [], existing_outputs=[self], dtype=dtype) 'Cast', [], existing_outputs=[self], dtype=dtype)
else: else:
return Tensor.CreateOperator( return Tensor.CreateOperator('Cast', self, dtype=dtype)
'AsType', self, dtype=dtype)
@property @property
def extra_targets(self): def extra_targets(self):
......
...@@ -9,6 +9,8 @@ ...@@ -9,6 +9,8 @@
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
"""List some extended Tensor C++ API."""
from __future__ import absolute_import from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
...@@ -23,21 +25,7 @@ from dragon.core.tensor import Tensor ...@@ -23,21 +25,7 @@ from dragon.core.tensor import Tensor
from dragon.core.proto_utils import GetDeviceOption from dragon.core.proto_utils import GetDeviceOption
__all__ = [ def FromShape(shape, dtype='float32', name=None):
'FromShape',
'SetShape',
'FromTensor',
'FromPyArray',
'SetPyArray',
'ToPyArray',
'ToPyArrayEx',
'ToCPUTensor',
'ToCUDATensor',
'GetTensorInfo',
]
def FromShape(shape, dtype='float32', ctx=None, name=None):
"""Create a Tensor from the shape. """Create a Tensor from the shape.
If specifying an existing tensor with a larger shape,
...@@ -49,8 +37,6 @@ def FromShape(shape, dtype='float32', ctx=None, name=None): ...@@ -49,8 +37,6 @@ def FromShape(shape, dtype='float32', ctx=None, name=None):
The shape info. The shape info.
dtype : str dtype : str
The data type. The data type.
ctx : dragon_pb2.DeviceOption
The context info.
name : str, optional name : str, optional
The optional tensor name. The optional tensor name.
...@@ -59,19 +45,14 @@ def FromShape(shape, dtype='float32', ctx=None, name=None): ...@@ -59,19 +45,14 @@ def FromShape(shape, dtype='float32', ctx=None, name=None):
Tensor Tensor
The tensor with the specific shape. The tensor with the specific shape.
References
----------
The wrapper of ``TensorFromShapeCC``.
""" """
tensor = _try_get_tensor(name) tensor = _try_get_tensor(name)
tensor.shape = list(shape)
if not isinstance(shape, (tuple, list)): if not isinstance(shape, (tuple, list)):
raise TypeError('The shape should be a tuple or list.') raise TypeError('The shape should be a tuple or list.')
if ctx is None: ctx = GetDeviceOption('CPU') C.TensorFromShape(
C.TensorFromShapeCC(
_stringify_tensor(tensor), _stringify_tensor(tensor),
list(shape), dtype, list(shape), dtype)
_stringify_proto(ctx))
return tensor return tensor
...@@ -91,12 +72,8 @@ def SetShape(tensor, shape, dtype='float32'): ...@@ -91,12 +72,8 @@ def SetShape(tensor, shape, dtype='float32'):
------- -------
None None
References
----------
The wrapper of ``TensorFromShapeCC``.
""" """
C.TensorFromShapeCC(_stringify_tensor(tensor), shape, dtype) C.TensorFromShape(_stringify_tensor(tensor), shape, dtype)
def FromTensor(src, src_ctx=None, name=None, ctx=None): def FromTensor(src, src_ctx=None, name=None, ctx=None):
...@@ -109,11 +86,11 @@ def FromTensor(src, src_ctx=None, name=None, ctx=None): ...@@ -109,11 +86,11 @@ def FromTensor(src, src_ctx=None, name=None, ctx=None):
---------- ----------
src : Tensor or str src : Tensor or str
The source tensor. The source tensor.
src_ctx : dragon_pb2.DeviceOption src_ctx : DeviceOption
The context of source tensor. The context of source tensor.
name : str name : str
The optional tensor name for destination tensor. The optional tensor name for destination tensor.
ctx : dragon_pb2.DeviceOption ctx : DeviceOption
The context for destination tensor. The context for destination tensor.
Returns Returns
...@@ -121,17 +98,13 @@ def FromTensor(src, src_ctx=None, name=None, ctx=None): ...@@ -121,17 +98,13 @@ def FromTensor(src, src_ctx=None, name=None, ctx=None):
Tensor Tensor
The tensor with the same data as source. The tensor with the same data as source.
References
----------
The wrapper of ``TensorFromTensorCC``.
""" """
tensor = _try_get_tensor(name) tensor = _try_get_tensor(name)
if src_ctx is None: src_ctx = GetDeviceOption('CPU') if src_ctx is None: src_ctx = GetDeviceOption('CPU')
if ctx is None: ctx = GetDeviceOption('CPU') if ctx is None: ctx = GetDeviceOption('CPU')
C.TensorFromTensorCC( C.TensorFromTensor(
_stringify_tensor(tensor), _stringify_tensor(src), _stringify_tensor(tensor), _stringify_tensor(src),
_stringify_proto(ctx), _stringify_proto(src_ctx)) _stringify_proto(ctx), _stringify_proto(src_ctx))
return tensor return tensor
...@@ -155,15 +128,11 @@ def FromPyArray(array, name=None): ...@@ -155,15 +128,11 @@ def FromPyArray(array, name=None):
Tensor Tensor
The tensor sharing the memory with original array. The tensor sharing the memory with original array.
References
----------
The wrapper of ``TensorFromPyArrayCC``.
""" """
tensor = _try_get_tensor(name) tensor = _try_get_tensor(name)
if not isinstance(array, np.ndarray): if not isinstance(array, np.ndarray):
raise TypeError('The given nd-array should be numpy.ndarray.') raise TypeError('The given nd-array should be numpy.ndarray.')
C.TensorFromPyArrayCC(_stringify_tensor(tensor), array) C.TensorFromPyArray(_stringify_tensor(tensor), array)
return tensor return tensor
...@@ -188,154 +157,58 @@ def SetPyArray(tensor, array): ...@@ -188,154 +157,58 @@ def SetPyArray(tensor, array):
The wrapper of ``TensorFromPyArrayCC``. The wrapper of ``TensorFromPyArrayCC``.
""" """
C.TensorFromPyArrayCC(_stringify_tensor(tensor), array) C.TensorFromPyArray(_stringify_tensor(tensor), array)
def ToPyArray(tensor): def ToPyArray(tensor, readonly=False):
"""Create a Array from a existing Tensor. """Create a Array from a existing Tensor.
Note that memory of Array are ``zero-copied``. Note that memory of Array are *zero-copied*.
Parameters Parameters
---------- ----------
tensor : Tensor or str tensor : Tensor or str
The input tensor. The input tensor.
readonly : boolean
Whether to sync the contents with device.
Returns Returns
------- -------
numpy.ndarray numpy.ndarray
The array sharing the memory with original tensor. The array sharing the memory with original tensor.
References
----------
The wrapper of ``TensorToPyArrayCC``.
"""
return C.TensorToPyArrayCC(_stringify_tensor(tensor))
def ToPyArrayEx(tensor):
"""Create a const Array from a existing Tensor.
Note that memory of Array are ``zero-copied`` and ``const``.
Parameters
----------
tensor : Tensor or str
The input tensor.
Returns
-------
numpy.ndarray
The array sharing the memory with the original tensor.
References
----------
The wrapper of ``TensorToPyArrayExCC``.
"""
return C.TensorToPyArrayExCC(_stringify_tensor(tensor))
def ToCPUTensor(tensor):
"""Switch the storage of a existing Tensor on cpu memory.
Parameters
----------
tensor : Tensor or str
The input tensor.
Returns
-------
None
References
----------
The wrapper of ``ToCPUTensorCC``.
""" """
return C.ToCPUTensorCC(_stringify_tensor(tensor)) return C.TensorToPyArray(_stringify_tensor(tensor), readonly)
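A sketch of the round trip with the new readonly flag (module path assumed; the exact sync semantics of readonly are an assumption from the docstring above):

>>> import numpy as np
>>> import dragon.core.tensor_utils as tensor_utils  # assumed module path
>>> t = tensor_utils.FromPyArray(np.ones((2, 3), 'float32'), name='buffer')
>>> arr = tensor_utils.ToPyArray(t)                    # mutable zero-copy view
>>> arr_ro = tensor_utils.ToPyArray(t, readonly=True)  # view that skips the sync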
def ToCUDATensor(tensor, device=0): def GetStorage(tensor):
"""Switch the storage of a existing Tensor on cuda memory. """Get the storage of a existing Tensor.
Parameters Parameters
---------- ----------
tensor : Tensor or str tensor : Tensor or str
The input tensor. The input tensor.
device : int
The id of the device to use.
Returns Returns
------- -------
None TensorStorage
The backend storage of the tensor.
References
----------
The wrapper of ``ToCUDATensorCC``.
""" """
return C.ToCUDATensorCC(_stringify_tensor(tensor), device) tensor = _stringify_tensor(tensor)
if not dg.workspace.HasTensor(tensor): return None
return C.GetTensor(tensor)
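Since GetStorage checks the workspace first, a sketch like the following (names assumed) distinguishes existing tensors from unknown ones:

>>> import numpy as np
>>> import dragon.core.tensor_utils as tensor_utils  # assumed module path
>>> t = tensor_utils.FromPyArray(np.ones((2,), 'float32'), name='buffer')
>>> tensor_utils.GetStorage('buffer') is None   # a TensorStorage is returned
False
>>> tensor_utils.GetStorage('missing') is None  # absent tensors yield None
True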
def GetTensorInfo(tensor, stream=1):
"""Get the info of a existing Tensor.
The string info contains following fields:
stream #1: ``dtype``, ``from_numpy``, ``init``, ``mem``, ``mem_at``, ``device_id``
stream #2: ``shape``
stream #3: #1 + #2
Parameters
----------
tensor : Tensor or str
The input tensor.
stream : int
The stream id.
Returns
-------
dict
The info.
References
----------
The wrapper of ``GetTensorInfoCC``.
"""
if not dg.workspace.HasTensor(_stringify_tensor(tensor)): return None
info = C.GetTensorInfoCC(_stringify_tensor(tensor), stream)
info['mem'] = []
if 'CPU' in info:
info['mem'].append('CPU'); info['device_id'] = 0
if 'CUDA' in info:
info['mem'].append('CUDA'); info['device_id'] = int(info['CUDA'])
if 'CNML' in info:
info['mem'].append('CNML'); info['device_id'] = int(info['CNML'])
info['init'] = len(info['mem']) > 0
return info
def _stringify_proto(obj): def _stringify_proto(obj):
"""Try to stringify a proto-buffer structure.""" """Try to stringify a proto-buffer structure."""
if obj is str: return obj return obj.SerializeToString()
elif isinstance(obj, Message): return obj.SerializeToString()
else: raise TypeError('Object can not be serialized as a string.')
def _stringify_tensor(obj): def _stringify_tensor(obj):
"""Try to stringify a tensor.""" """Try to stringify a tensor."""
if hasattr(obj, 'name'): return obj.name if hasattr(obj, 'name'): return obj.name
else: else: return str(obj)
try:
obj = str(obj)
except Exception as e:
raise TypeError('Object can not be used as a tensor. Error: {0}'.format(str(e)))
return obj
def _try_get_tensor(name=None): def _try_get_tensor(name=None):
......
...@@ -33,8 +33,8 @@ except ImportError as e: ...@@ -33,8 +33,8 @@ except ImportError as e:
sys.exit(1) sys.exit(1)
REGISTERED_OPERATORS = set(s for s in RegisteredOperatorsCC()) REGISTERED_OPERATORS = set(s for s in RegisteredOperators())
NO_GRADIENT_OPERATORS = set(s for s in NoGradientOperatorsCC()) NO_GRADIENT_OPERATORS = set(s for s in NoGradientOperators())
atexit.register(OnModuleExitCC) atexit.register(OnModuleExit)
\ No newline at end of file \ No newline at end of file
...@@ -100,8 +100,8 @@ class ArgumentHelper(object): ...@@ -100,8 +100,8 @@ class ArgumentHelper(object):
arguments[name] = None arguments[name] = None
arguments[name + '_desc'] = property.name arguments[name + '_desc'] = property.name
return arguments return arguments
extra_kwargs = {'gen_desc_{}'.format(name): Generator} kwargs.update({'gen_desc_{}'.format(name): Generator})
return op_func(*args, **kwargs, **extra_kwargs) return op_func(*args, **kwargs)
return Impl return Impl
return Decorator return Decorator
...@@ -138,8 +138,8 @@ class ArgumentHelper(object): ...@@ -138,8 +138,8 @@ class ArgumentHelper(object):
else: else:
arguments[desc_name] = properties arguments[desc_name] = properties
return arguments return arguments
extra_kwargs = {'gen_desc_{}'.format(name): Generator} kwargs.update({'gen_desc_{}'.format(name): Generator})
return op_func(*args, **kwargs, **extra_kwargs) return op_func(*args, **kwargs)
return Impl return Impl
return Decorator return Decorator
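The move from a second ``**`` unpacking to ``kwargs.update(...)`` matters because ``op_func(*args, **kwargs, **extra_kwargs)`` is a SyntaxError before Python 3.5. A minimal standalone sketch of the merged-dict pattern (all names hypothetical):

# Fold the generated descriptor into the caller's kwargs, then dispatch once.
def op_func(**kwargs):
    return sorted(kwargs)

kwargs = {'shape': (2, 3)}
kwargs.update({'gen_desc_shape': lambda arguments: arguments})  # hypothetical generator
print(op_func(**kwargs))  # ['gen_desc_shape', 'shape']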
......
...@@ -140,11 +140,13 @@ def Minimum(inputs, **kwargs): ...@@ -140,11 +140,13 @@ def Minimum(inputs, **kwargs):
@OpSchema.Inputs(1) @OpSchema.Inputs(1)
def Moments(inputs, axes=None, keep_dims=False, **kwargs): def Moments(inputs, axes=None, keep_dims=False, **kwargs):
"""Compute the mean and variance of inputs along the given axes. """Calculate the mean and variance of inputs along the given axes.
The data type of the moments is typically *float32*, The data type of the moments is typically *float32*,
except for *float64* inputs (which yield *float64* moments). except for *float64* inputs (which yield *float64* moments).
If ``axes`` is *None*, a Scalar will be returned.
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*) **Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters Parameters
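A hedged usage sketch (assuming the op is exported as ``dragon.ops.Moments``, as the aggregation module later in this diff suggests, and that the two outputs unpack as mean and variance):

>>> import dragon as dg
>>> x = dg.Tensor('x', dtype='float32').Variable()
>>> mean, var = dg.ops.Moments(x, axes=[0], keep_dims=True)
>>> mu, sigma2 = dg.ops.Moments(x)  # axes=None reduces to a Scalar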
...@@ -206,9 +208,9 @@ def Matmul(inputs, transA=False, transB=False, **kwargs): ...@@ -206,9 +208,9 @@ def Matmul(inputs, transA=False, transB=False, **kwargs):
---------- ----------
inputs : sequence of Tensor inputs : sequence of Tensor
The inputs, A and B. The inputs, A and B.
transA : bool transA : bool, optional, default=False
Whether to transpose A. Whether to transpose A.
transB : bool transB : bool, optional, default=False
Whether to transpose B. Whether to transpose B.
Returns Returns
...@@ -234,9 +236,9 @@ def Dot(inputs, transA=False, transB=False, **kwargs): ...@@ -234,9 +236,9 @@ def Dot(inputs, transA=False, transB=False, **kwargs):
---------- ----------
inputs : sequence of Tensor inputs : sequence of Tensor
The inputs, A and B. The inputs, A and B.
transA : bool transA : bool, optional, default=False
Whether to transpose A. Whether to transpose A.
transB : bool transB : bool, optional, default=False
Whether to transpose B. Whether to transpose B.
Returns Returns
...@@ -262,9 +264,9 @@ def FullyConnected(inputs, num_output, axis=1, transW=True, **kwargs): ...@@ -262,9 +264,9 @@ def FullyConnected(inputs, num_output, axis=1, transW=True, **kwargs):
The inputs, representing [X, W] + [b]. The inputs, representing [X, W] + [b].
num_output : int num_output : int
The output dim. The output dim.
axis : int, optional axis : int, optional, default=1
The start axis to calculate, can be negative. The start axis to calculate, can be negative.
transW : bool, optional transW : bool, optional, default=True
Whether to transpose the W. Whether to transpose the W.
Returns Returns
...@@ -346,7 +348,7 @@ def Exp(inputs, **kwargs): ...@@ -346,7 +348,7 @@ def Exp(inputs, **kwargs):
@OpSchema.Inputs(1) @OpSchema.Inputs(1)
def Pow(inputs, power, shift=None, scale=None, **kwargs): def Pow(inputs, power, shift=0., scale=1., **kwargs):
"""Calculate the power of input. """Calculate the power of input.
Formulation: |power_function| Formulation: |power_function|
...@@ -357,11 +359,11 @@ def Pow(inputs, power, shift=None, scale=None, **kwargs): ...@@ -357,11 +359,11 @@ def Pow(inputs, power, shift=None, scale=None, **kwargs):
---------- ----------
inputs : Tensor inputs : Tensor
The input tensor. The input tensor.
power : float power : float, required
The power factor. The power factor.
shift : float, optional shift : float, optional, default=0.
The shift magnitude. The shift magnitude.
scale : float, optional scale : float, optional, default=1.
The scale factor. The scale factor.
Returns Returns
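A sketch assuming |power_function| denotes the usual (shift + scale * x) ** power form and the op is exported as ``dragon.ops.Pow``:

>>> import dragon as dg
>>> x = dg.Tensor('x', dtype='float32').Variable()
>>> y = dg.ops.Pow(x, power=2., shift=1., scale=0.5)  # assumed: (1 + 0.5 * x) ** 2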
...@@ -414,7 +416,7 @@ def Sqrt(inputs, **kwargs): ...@@ -414,7 +416,7 @@ def Sqrt(inputs, **kwargs):
The sqrt result. The sqrt result.
""" """
return Tensor.CreateOperator('Pow', power=0.5, **ParseArgs(locals())) return Tensor.CreateOperator('Sqrt', **ParseArgs(locals()))
@OpSchema.Inputs(2, 3) @OpSchema.Inputs(2, 3)
...@@ -433,9 +435,9 @@ def Affine(inputs, axis=1, num_axes=1, **kwargs): ...@@ -433,9 +435,9 @@ def Affine(inputs, axis=1, num_axes=1, **kwargs):
---------- ----------
inputs : sequence of Tensor inputs : sequence of Tensor
The inputs, representing [x, A] + [b]. The inputs, representing [x, A] + [b].
axis : int, optional axis : int, optional, default=1
The start axis to scale, can be negative. The start axis to scale, can be negative.
num_axes : int, optional num_axes : int, optional, default=1
The number of axes to scale. The number of axes to scale.
Returns Returns
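A sketch of the [x, A] + [b] calling convention (tensor names and the ``dragon.ops.Affine`` export path are assumptions):

>>> import dragon as dg
>>> x = dg.Tensor('x', dtype='float32').Variable()
>>> A = dg.Tensor('A', dtype='float32').Variable()
>>> b = dg.Tensor('b', dtype='float32').Variable()
>>> y = dg.ops.Affine([x, A, b], axis=1, num_axes=1)  # scale by A, shift by b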
...@@ -459,7 +461,7 @@ def GramMatrix(inputs, axis=1, **kwargs): ...@@ -459,7 +461,7 @@ def GramMatrix(inputs, axis=1, **kwargs):
---------- ----------
inputs : Tensor inputs : Tensor
The input tensor. The input tensor.
axis : int, optional axis : int, optional, default=1
The start axis to calculate. The start axis to calculate.
Returns Returns
...@@ -469,3 +471,48 @@ def GramMatrix(inputs, axis=1, **kwargs): ...@@ -469,3 +471,48 @@ def GramMatrix(inputs, axis=1, **kwargs):
""" """
return Tensor.CreateOperator('GramMatrix', **ParseArgs(locals())) return Tensor.CreateOperator('GramMatrix', **ParseArgs(locals()))
@OpSchema.Inputs(1, INT_MAX)
def Accumulate(inputs, alpha=1., beta=1., **kwargs):
"""Calculate *y = alpha * x + beta * y*
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters
----------
inputs : sequence of Tensor
The inputs, i.e., the *x*.
alpha : float, optional, default=1.
The alpha value.
beta : float, optional, default=1.
The beta value.
Returns
-------
sequence of Tensor
The outputs, i.e., the *y*.
"""
return Tensor.CreateOperator('Accumulate', **ParseArgs(locals()))
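A sketch (export path assumed); with the defaults alpha=1 and beta=1 the op accumulates a running sum into the outputs:

>>> import dragon as dg
>>> x = dg.Tensor('x', dtype='float32').Variable()
>>> y = dg.ops.Accumulate([x], alpha=1., beta=1.)  # y = x + y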
@OpSchema.Inputs(1, INT_MAX)
def MovingAverage(inputs, decay, **kwargs):
"""Calculate the *y = (1 - decay) * x + decay * y*
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters
----------
inputs : sequence of Tensor
The inputs, i.e., the *x*.
decay : float, required
The decay factor.
Returns
-------
sequence of Tensor
The outputs, i.e., the *y*.
"""
return Accumulate(inputs, 1 - decay, decay, **kwargs)
\ No newline at end of file
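Because MovingAverage forwards to Accumulate with alpha = 1 - decay and beta = decay, the two calls below compute the same update (export paths assumed):

>>> import dragon as dg
>>> x = dg.Tensor('x', dtype='float32').Variable()
>>> y1 = dg.ops.MovingAverage([x], decay=0.9)
>>> y2 = dg.ops.Accumulate([x], alpha=0.1, beta=0.9)  # identical update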
...@@ -17,7 +17,7 @@ from . import * ...@@ -17,7 +17,7 @@ from . import *
@OpSchema.Inputs(1) @OpSchema.Inputs(1)
def AsType(inputs, dtype='float32', inplace=False, **kwargs): def Cast(inputs, dtype='float32', inplace=False, **kwargs):
"""Cast the data type of inputs to a specific one. """Cast the data type of inputs to a specific one.
If ``inplace`` is ``True``, cast ``self`` instead of returning a new one. If ``inplace`` is ``True``, cast ``self`` instead of returning a new one.
...@@ -41,7 +41,7 @@ def AsType(inputs, dtype='float32', inplace=False, **kwargs): ...@@ -41,7 +41,7 @@ def AsType(inputs, dtype='float32', inplace=False, **kwargs):
Examples Examples
-------- --------
>>> x = Tensor('x', dtype='float32').Variable() >>> x = Tensor('x', dtype='float32').Variable()
>>> y = AsType(x, 'int32') >>> y = Cast(x, 'int32')
>>> z = x.astype('int64') >>> z = x.astype('int64')
>>> xx = x.astype('float64', inplace=True) >>> xx = x.astype('float64', inplace=True)
>>> print(x.name, xx.name) >>> print(x.name, xx.name)
...@@ -53,7 +53,7 @@ def AsType(inputs, dtype='float32', inplace=False, **kwargs): ...@@ -53,7 +53,7 @@ def AsType(inputs, dtype='float32', inplace=False, **kwargs):
arguments['inputs'] = [] arguments['inputs'] = []
arguments['existing_outputs'] = [inputs] arguments['existing_outputs'] = [inputs]
return Tensor.CreateOperator('AsType', **arguments) return Tensor.CreateOperator('Cast', **arguments)
def Run(inputs, module, op, param_str='', num_outputs=1, **kwargs): def Run(inputs, module, op, param_str='', num_outputs=1, **kwargs):
...@@ -173,28 +173,4 @@ def StopGradient(inputs, **kwargs): ...@@ -173,28 +173,4 @@ def StopGradient(inputs, **kwargs):
An identity of the input. An identity of the input.
""" """
return Tensor.CreateOperator('StopGradient', **ParseArgs(locals())) return Tensor.CreateOperator('StopGradient', **ParseArgs(locals()))
\ No newline at end of file
@OpSchema.Inputs(1)
def MovingAverage(inputs, decay, **kwargs):
"""Calculate the moving average.
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters
----------
inputs : Tensor
The values to calculate moving average.
decay : float
The decay factor.
Returns
-------
Tensor
The output tensor, i.e., ``variable``, calculated as:
|moving_average_function|
"""
return Tensor.CreateOperator('MovingAverage', **ParseArgs(locals()))
\ No newline at end of file
...@@ -740,7 +740,6 @@ def Shape(inputs, **kwargs): ...@@ -740,7 +740,6 @@ def Shape(inputs, **kwargs):
return Tensor.CreateOperator('Shape', **ParseArgs(locals())) return Tensor.CreateOperator('Shape', **ParseArgs(locals()))
@OpSchema.Inputs(0)
@ArgumentHelper.Desc('start') @ArgumentHelper.Desc('start')
@ArgumentHelper.Desc('stop') @ArgumentHelper.Desc('stop')
@ArgumentHelper.Desc('step') @ArgumentHelper.Desc('step')
......
...@@ -62,7 +62,7 @@ def Conv2d( ...@@ -62,7 +62,7 @@ def Conv2d(
The dilation multiple(s) of convolution. The dilation multiple(s) of convolution.
group : int, optional, default=1 group : int, optional, default=1
The group size of convolution. The group size of convolution.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm. The padding algorithm.
data_format : {'NCHW', 'NHWC'}, optional data_format : {'NCHW', 'NHWC'}, optional
The data_format. The data_format.
...@@ -119,7 +119,7 @@ def DepthwiseConv2d( ...@@ -119,7 +119,7 @@ def DepthwiseConv2d(
The stride(s) of convolution. The stride(s) of convolution.
pads : sequence of int, optional, default=0 pads : sequence of int, optional, default=0
The zero padding size(s) of convolution. The zero padding size(s) of convolution.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm. The padding algorithm.
data_format : {'NCHW', 'NHWC'}, optional data_format : {'NCHW', 'NHWC'}, optional
The data_format. The data_format.
...@@ -183,7 +183,7 @@ def ConvTranspose2d( ...@@ -183,7 +183,7 @@ def ConvTranspose2d(
The padding value added to one side (right) of the output. The padding value added to one side (right) of the output.
output_shape : sequence of (int, Tensor), optional output_shape : sequence of (int, Tensor), optional
The deterministic output shape for **SAME** padding. The deterministic output shape for **SAME** padding.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm. The padding algorithm.
data_format : {'NCHW', 'NHWC'}, optional data_format : {'NCHW', 'NHWC'}, optional
The data_format. The data_format.
...@@ -224,7 +224,7 @@ def ConvTranspose2d( ...@@ -224,7 +224,7 @@ def ConvTranspose2d(
@OpSchema.Inputs(1) @OpSchema.Inputs(1)
def Pool2d( def Pool2d(
inputs, kernel_shape, strides, pads=0, padding='VALID', ceil=True, inputs, kernel_shape, strides, pads=0, padding='VALID', ceil_mode=True,
mode='MAX', data_format='NCHW', global_pooling=False, **kwargs): mode='MAX', data_format='NCHW', global_pooling=False, **kwargs):
"""2D Pooling, MAX or AVG. """2D Pooling, MAX or AVG.
...@@ -248,9 +248,9 @@ def Pool2d( ...@@ -248,9 +248,9 @@ def Pool2d(
The stride(s) of pooling. The stride(s) of pooling.
pads : sequence of int, optional, default=0 pads : sequence of int, optional, default=0
The zero padding size(s) of pooling. The zero padding size(s) of pooling.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm. The padding algorithm.
ceil : bool, optional ceil_mode : bool, optional, default=True
Whether to ceil the boundary. Whether to ceil the boundary.
mode : {'MAX', 'AVG'}, optional mode : {'MAX', 'AVG'}, optional
The pooling mode. The pooling mode.
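A sketch of the renamed argument (export path assumed); call sites passing ``ceil`` must switch to ``ceil_mode``:

>>> import dragon as dg
>>> x = dg.Tensor('x', dtype='float32').Variable()
>>> y = dg.ops.Pool2d(
...     x, kernel_shape=[3, 3], strides=[2, 2], pads=[1, 1],
...     padding='VALID', ceil_mode=False, mode='MAX')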
...@@ -505,48 +505,6 @@ def BiasAdd(inputs, data_format='NCHW', **kwargs): ...@@ -505,48 +505,6 @@ def BiasAdd(inputs, data_format='NCHW', **kwargs):
return Tensor.CreateOperator('BiasAdd', **arguments) return Tensor.CreateOperator('BiasAdd', **arguments)
@OpSchema.Inputs(2)
def DenseConcat(inputs, growth_rate=0, axis=1, **kwargs):
"""Memory-efficient concatenation for DenseNet `[Huang et.al, 2017] <http://arxiv.org/abs/1608.06993>`_.
This operator is forked from ``Concat``.
The memory optimization requires the following settings:
1. Set the ``growth_rate``; the value must be larger than ``0``.
2. Set the ``mirror_stage`` to True.
Parameters
----------
inputs : sequence of Tensor
The inputs, representing A (old) and B (new) respectively.
growth_rate : int, optional, default=0
The growth rate.
axis : int, optional
The axis to concatenate.
mirror_stage : bool, optional
Whether to share input A for output C. Default is ``False``.
Returns
-------
Tensor
The concatenated tensor, represents C.
Examples
--------
>>> A = Tensor().Variable()
>>> B = Tensor().Variable()
>>> C = DenseConcat([A, B], axis=1) # Simple concatenation
>>> import dragon.memonger as opt
>>> C = opt.Drop(DenseConcat, [A, B], axis=1) # Memory-efficient concatenation
>>> D = DenseConcat([A, B], axis=1, mirror_stage=True) # Memory-efficient concatenation, equivalent
"""
return Tensor.CreateOperator('DenseConcat', **ParseArgs(locals()))
@OpSchema.Inputs(1) @OpSchema.Inputs(1)
@ArgumentHelper.Desc('keep_prob', as_target=False) @ArgumentHelper.Desc('keep_prob', as_target=False)
def DropBlock2d( def DropBlock2d(
......
...@@ -52,7 +52,6 @@ LRN = vision_ops.LRN ...@@ -52,7 +52,6 @@ LRN = vision_ops.LRN
NNResize = vision_ops.NNResize NNResize = vision_ops.NNResize
BilinearResize = vision_ops.BilinearResize BilinearResize = vision_ops.BilinearResize
BiasAdd = vision_ops.BiasAdd BiasAdd = vision_ops.BiasAdd
DenseConcat = vision_ops.DenseConcat
DropBlock2d = vision_ops.DropBlock2d DropBlock2d = vision_ops.DropBlock2d
# Recurrent # Recurrent
...@@ -104,6 +103,8 @@ FullyConnected = math_ops.FullyConnected ...@@ -104,6 +103,8 @@ FullyConnected = math_ops.FullyConnected
Eltwise = math_ops.Eltwise Eltwise = math_ops.Eltwise
Affine = math_ops.Affine Affine = math_ops.Affine
GramMatrix = math_ops.GramMatrix GramMatrix = math_ops.GramMatrix
Accumulate = math_ops.Accumulate
MovingAverage = math_ops.MovingAverage
# Normalization # Normalization
BatchNorm = norm_ops.BatchNorm BatchNorm = norm_ops.BatchNorm
...@@ -137,19 +138,18 @@ Squeeze = array_ops.Squeeze ...@@ -137,19 +138,18 @@ Squeeze = array_ops.Squeeze
Shape = array_ops.Shape Shape = array_ops.Shape
Arange = array_ops.Arange Arange = array_ops.Arange
# ControlFlow # Control Flow
Copy = control_flow_ops.Copy Copy = control_flow_ops.Copy
Equal = control_flow_ops.Equal Equal = control_flow_ops.Equal
Less = control_flow_ops.Less Less = control_flow_ops.Less
Greater = control_flow_ops.Greater Greater = control_flow_ops.Greater
# Misc # Misc
Cast = AsType = misc_ops.AsType Cast = AsType = misc_ops.Cast
Run = misc_ops.Run Run = misc_ops.Run
Template = misc_ops.Template Template = misc_ops.Template
Accuracy = misc_ops.Accuracy Accuracy = misc_ops.Accuracy
StopGradient = misc_ops.StopGradient StopGradient = misc_ops.StopGradient
MovingAverage = misc_ops.MovingAverage
# MPI # MPI
MPIBroadcast = mpi_ops.MPIBroadcast MPIBroadcast = mpi_ops.MPIBroadcast
......