Commit d0fa332c by Ting PAN

Select pybind11 to expose the C++ API

1 parent 1d03e8e2
Showing with 1275 additions and 1352 deletions
......@@ -10,3 +10,6 @@
[submodule "ThirdParty/cub"]
path = ThirdParty/cub
url = https://github.com/NVlabs/cub
[submodule "ThirdParty/pybind11"]
path = ThirdParty/pybind11
url = https://github.com/pybind/pybind11
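
With pybind11 vendored as a submodule, the C++ core can be exported to Python through a handful of declarations. Below is a minimal, hypothetical sketch of such an export; the module name ``libdragon`` and the bound function are illustrative, not the actual Dragon bindings:

    #include <string>
    #include <pybind11/pybind11.h>

    namespace py = pybind11;

    // Hypothetical helper; the real bindings forward to the C++ Workspace.
    std::string CreateTensor(const std::string& name) {
        return name;  // placeholder: register the tensor and echo its name
    }

    PYBIND11_MODULE(libdragon, m) {
        m.doc() = "Dragon C++ API exposed via pybind11";
        // Each m.def() line exposes one C++ function to Python.
        m.def("CreateTensor", &CreateTensor, "Create a tensor by name");
    }
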
------------------------------------------------------------------------
The list of the most significant changes made to Dragon over time.
Dragon 0.3.0.0 (20190110)
Dragon 0.3.0.0 (20190309)
DRAGON_VERSION == 3000
Changes (w.r.t. Dragon 0.2.2.13):
......@@ -24,6 +24,8 @@ Preview Features:
- Use ``Eigen`` as the default cpu math library instead of ``OpenBLAS``.
- Use ``PyBind11`` as the default python module exporter.
- Integer data type support for common operators;
  see the documentation for more detailed information.
......@@ -32,6 +34,8 @@ Preview Features:
which unifies the naming of static and dynamic computation graphs.
- The behavior of accumulating gradients has been removed.
Bugs fixed:
......
......@@ -8,23 +8,22 @@
Quick Reference
---------------
========================== =============================================================================
=============================== =============================================================================
List Brief
========================== =============================================================================
=============================== =============================================================================
`EnableCPU`_ Enable CPU mode globally.
`IsCUDADriverSufficient`_ Is the CUDA driver sufficient?
`EnableCUDA`_ Enable CUDA mode globally.
`SetRandomSeed`_ Set the global random seed.
`GetRandomSeed`_ Get the global random seed.
`SetGPU`_ Set the global id of GPU.
`GetGPU`_ Get the global id of GPU.
`SetDebugMode`_ Enable Debug mode globally.
`SetGraphOptimizationLevel`_ Set the default level of graph optimization.
`LogMetaGraph`_ Enable logging of the meta graph globally.
`LogOptimizedGraph`_ Enable logging of the optimized graph globally.
`ExportMetaGraph`_ Enable exporting all runnable meta graphs to text files.
`SetLoggingLevel`_ Set the minimum level of logging.
`SetLoggingFile`_ Redirect the logging to a specific file.
========================== =============================================================================
=============================== =============================================================================
API Reference
-------------
......@@ -33,13 +32,12 @@ API Reference
:members:
.. _EnableCPU: #dragon.config.EnableCPU
.. _IsCUDADriverSufficient: #dragon.config.IsCUDADriverSufficient
.. _EnableCUDA: #dragon.config.EnableCUDA
.. _SetRandomSeed: #dragon.config.SetRandomSeed
.. _GetRandomSeed: #dragon.config.GetRandomSeed
.. _SetGPU: #dragon.config.SetGPU
.. _GetGPU: #dragon.config.GetGPU
.. _SetDebugMode: #dragon.config.SetDebugMode
.. _SetGraphOptimizationLevel: #dragon.config.SetGraphOptimizationLevel
.. _LogMetaGraph: #dragon.config.LogMetaGraph
.. _LogOptimizedGraph: #dragon.config.LogOptimizedGraph
.. _ExportMetaGraph: #dragon.config.ExportMetaGraph
......
......@@ -27,6 +27,7 @@ C++ Binding Wrapper
core/workspace
core/tensor_utils
core/mpi
core/cuda
core/gradient_maker
============================== =======================================================================
......@@ -34,11 +35,13 @@ List Brief
============================== =======================================================================
`dragon.core.workspace`_ The interfaces of Workspace, mostly wrappers of the C++ API.
`dragon.core.gradient_maker`_ The generator of GradientOps.
`dragon.core.tensor_utils`_ The Tensor utilities.
`dragon.core.mpi`_ The MPI utilities.
`dragon.core.tensor_utils`_ List some extended Tensor C++ APIs.
`dragon.core.mpi`_ List some useful MPI C++ APIs.
`dragon.core.cuda`_ List some useful CUDA C++ APIs.
============================== =======================================================================
.. _dragon.core.mpi: core/mpi.html
.. _dragon.core.cuda: core/cuda.html
.. _dragon.core.scope: core/scope.html
.. _dragon.core.tensor: core/tensor.html
.. _dragon.core.tensor_utils: core/tensor_utils.html
......
===========
:mod:`CUDA`
===========
.. toctree::
:hidden:
Quick Reference
---------------
============================== =============================================================================
List Brief
============================== =============================================================================
`IsCUDADriverSufficient`_ Is the CUDA driver sufficient?
`GetDevice`_ Get the currently active CUDA device.
`SynchronizeStream`_ Synchronize the specified CUDA stream.
============================== =============================================================================
.. automodule:: dragon.core.cuda
:members:
.. _IsCUDADriverSufficient: #dragon.core.cuda.IsCUDADriverSufficient
.. _GetDevice: #dragon.core.cuda.GetDevice
.. _SynchronizeStream: #dragon.core.cuda.SynchronizeStream
\ No newline at end of file
......@@ -16,10 +16,9 @@ List Brief
`FromPyArray`_ Create a Tensor from an existing Array.
`SetPyArray`_ Set a Tensor from an existing Array.
`ToPyArray`_ Create an Array from an existing Tensor.
`ToPyArrayEx`_ Create a const Array from an existing Tensor.
`GetStorage`_ Get the storage of an existing Tensor.
`ToCPUTensor`_ Switch the storage of an existing Tensor to CPU memory.
`ToCUDATensor`_ Switch the storage of an existing Tensor to CUDA memory.
`GetTensorInfo`_ Get the info of an existing Tensor.
============================== =============================================================================
API Reference
......@@ -33,7 +32,6 @@ API Reference
.. _FromPyArray: #dragon.core.tensor_utils.FromPyArray
.. _SetPyArray: #dragon.core.tensor_utils.SetPyArray
.. _ToPyArray: #dragon.core.tensor_utils.ToPyArray
.. _ToPyArrayEx: #dragon.core.tensor_utils.ToPyArrayEx
.. _GetStorage: #dragon.core.tensor_utils.GetStorage
.. _ToCPUTensor: #dragon.core.tensor_utils.ToCPUTensor
.. _ToCUDATensor: #dragon.core.tensor_utils.ToCUDATensor
\ No newline at end of file
.. _GetTensorInfo: #dragon.core.tensor_utils.GetTensorInfo
\ No newline at end of file
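
Since these utilities are described as wrappers of the C++ API, a rough pybind11 sketch of a ``FromPyArray``-style binding may help; the names and signature below are assumptions, not the real module::

    #include <cstdint>
    #include <string>
    #include <vector>
    #include <pybind11/pybind11.h>
    #include <pybind11/numpy.h>

    namespace py = pybind11;

    // Hypothetical sketch: read the shape of a numpy array so a named
    // tensor could be reshaped to match and share (or copy) the data.
    void FromPyArray(py::array arr, const std::string& name) {
        py::buffer_info info = arr.request();  // zero-copy view of the array
        std::vector<int64_t> dims(info.shape.begin(), info.shape.end());
        // ... reshape the tensor `name` and attach info.ptr ...
    }
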
......@@ -14,7 +14,7 @@ List Brief
`HasTensor`_ Query whether a tensor has been registered in the current workspace.
`CreateFiller`_ Create the filler in the backend.
`GetTensorName`_ Query the name represented in the current workspace.
`RenameTensor`_ Rename a tensor in the current workspace.
`SetTensorAlias`_ Bind an alias to an existing tensor.
`FeedTensor`_ Feed the values to the given tensor.
`FetchTensor`_ Fetch the values of given tensor.
`ResetTensor`_ Reset the memory of given tensor.
......@@ -27,7 +27,7 @@ Operator
============================== =============================================================================
List Brief
============================== =============================================================================
`RunOperator`_ Create and Run the operator in the VM backend.
`RunOperator`_ Run the operator in the VM backend.
============================== =============================================================================
......@@ -39,7 +39,6 @@ List Brief
============================== =============================================================================
`CreateGraph`_ Create the graph in the backend.
`RunGraph`_ Run the specific graph.
`RunGraphEx`_ Run the graph from the meta definition.
============================== =============================================================================
Misc
......@@ -73,14 +72,13 @@ API Reference
.. _CreateGraph: #dragon.core.workspace.CreateGraph
.. _HasTensor: #dragon.core.workspace.HasTensor
.. _GetTensorName: #dragon.core.workspace.GetTensorName
.. _RenameTensor: #dragon.core.workspace.RenameTensor
.. _SetTensorAlias: #dragon.core.workspace.SetTensorAlias
.. _CreateFiller: #dragon.core.workspace.CreateFiller
.. _FetchTensor: #dragon.core.workspace.FetchTensor
.. _FeedTensor: #dragon.core.workspace.FeedTensor
.. _ResetTensor: #dragon.core.workspace.ResetTensor
.. _RunOperator: #dragon.core.workspace.RunOperator
.. _RunGraph: #dragon.core.workspace.RunGraph
.. _RunGraphEx: #dragon.core.workspace.RunGraphEx
.. _Snapshot: #dragon.core.workspace.Snapshot
.. _Restore: #dragon.core.workspace.Restore
.. _LogMetaGraph: #dragon.core.workspace.LogMetaGraph
......
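
The rename from ``RenameTensor`` to ``SetTensorAlias`` suggests a non-destructive alias map rather than an in-place rename. A minimal sketch of that behavior, with assumed member names, might look like::

    #include <map>
    #include <string>

    // Hypothetical alias map: lookups resolve alias -> canonical name,
    // so the original tensor keeps its name and storage.
    static std::map<std::string, std::string> tensor_alias_map_;

    bool SetTensorAlias(const std::string& name, const std::string& alias) {
        if (tensor_alias_map_.count(alias) > 0 &&
            tensor_alias_map_[alias] == name) return false;  // already bound
        tensor_alias_map_[alias] = name;
        return true;
    }
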
......@@ -42,7 +42,6 @@ List Brief
`NNResize`_ Resize the image with the *Nearest-Neighbor* method.
`BilinearResize`_ Resize the image with the *Bi-Linear* method.
`BiasAdd`_ Add the bias across channels to a *NCHW* or *NHWC* input.
`DenseConcat`_ Memory-efficient concatenation for DenseNet. `[Huang et.al, 2017] <http://arxiv.org/abs/1608.06993>`_.
`DropBlock2d`_ Randomly drop the outputs according to the spatial blocks. `[Ghiasi et.al, 2018] <https://arxiv.org/abs/1810.12890>`_.
=================== ======================================================================
......@@ -113,7 +112,9 @@ List Brief
`Eltwise`_ Element-wise sum or product of an arbitrary number of inputs.
`Affine`_ Calculate *Y = Ax + b* along the given range of axes.
`GramMatrix`_ Calculate the gram matrix. `[Gatys et.al, 2016] <https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf>`_.
`Moments`_ Compute the mean and variance of inputs along the given axes.
`Moments`_ Calculate the mean and variance of inputs along the given axes.
`Accumulate`_ Calculate *y = alpha * x + beta * y*.
`MovingAverage`_ Calculate *y = (1 - decay) * x + decay * y*.
================== ======================================================================
Normalization
......@@ -174,12 +175,11 @@ Misc
================= ======================================================================
List Brief
================= ======================================================================
`AsType`_ Cast the data type of inputs to a specific one.
`Cast`_ Cast the data type of inputs to a specific one.
`Run`_ Run a custom operator. (Without GradientFlow)
`Template`_ Run a custom operator. (With GradientFlow)
`Accuracy`_ Calculate the Top-K accuracy.
`StopGradient`_ Return the identity of input with truncated gradient flow.
`MovingAverage`_ Calculate the moving average.
================= ======================================================================
Contrib
......@@ -268,6 +268,8 @@ List Brief
.. _Affine: operators/arithmetic.html#dragon.operators.arithmetic.Affine
.. _GramMatrix: operators/arithmetic.html#dragon.operators.arithmetic.GramMatrix
.. _Moments: operators/arithmetic.html#dragon.operators.arithmetic.Moments
.. _Accumulate: operators/arithmetic.html#dragon.operators.arithmetic.Accumulate
.. _MovingAverage: operators/arithmetic.html#dragon.operators.arithmetic.MovingAverage
.. _BatchNorm: operators/norm.html#dragon.operators.norm.BatchNorm
.. _GroupNorm: operators/norm.html#dragon.operators.norm.GroupNorm
......@@ -304,12 +306,11 @@ List Brief
.. _Less: operators/control_flow.html#dragon.operators.control_flow.Less
.. _Greater: operators/control_flow.html#dragon.operators.control_flow.Greater
.. _AsType: operators/misc.html#dragon.operators.misc.AsType
.. _Cast: operators/misc.html#dragon.operators.misc.Cast
.. _Run: operators/misc.html#dragon.operators.misc.Run
.. _Template: operators/misc.html#dragon.operators.misc.Template
.. _Accuracy: operators/misc.html#dragon.operators.misc.Accuracy
.. _StopGradient: operators/misc.html#dragon.operators.misc.StopGradient
.. _MovingAverage: operators/misc.html#dragon.operators.misc.MovingAverage
.. _Proposal: operators/contrib/rcnn.html#dragon.operators.contrib.rcnn.ops.Proposal
......
......@@ -19,16 +19,12 @@ ToolBox
:hidden:
tools/db
tools/im2db
tools/summary_writer
tools/tensorboard
==================== ====================================================================================
List Brief
==================== ====================================================================================
`LMDB`_ A wrapper of the LMDB package.
`IM2DB`_ Make the sequential database for images.
`SummaryWriter`_ Write summaries for DragonBoard.
`TensorBoard`_ Write summaries for TensorBoard.
==================== ====================================================================================
......@@ -38,8 +34,5 @@ List Brief
<p style="text-indent:1.5em; font-size: 18px; max-width: 830px;">
.. _pip: https://pypi.python.org/pypi/pip
.. _LMDB: tools/db.html
.. _IM2DB: tools/im2db.html
.. _SummaryWriter: tools/summary_writer.html
.. _TensorBoard: tools/tensorboard.html
====================
:mod:`SummaryWriter`
====================
.. toctree::
:hidden:
Quick Reference
---------------
==================== =============================================================================
List Brief
==================== =============================================================================
`ScalarSummary`_ Write a scalar summary.
==================== =============================================================================
API Reference
-------------
.. currentmodule:: dragon.tools.summary_writer
.. autoclass:: ScalarSummary
:members:
.. automethod:: __init__
.. _ScalarSummary: #dragon.tools.summary_writer.ScalarSummary
\ No newline at end of file
......@@ -2,40 +2,30 @@
:mod:`dragon.utils`
===================
Wrapper
-------
Vision
------
.. toctree::
:hidden:
utils/vision/database
utils/vision/data_batch
=================================== =====================================================================
List Brief
=================================== =====================================================================
`dragon.utils.vision.data_batch`_ Efficient batch data provider based on `LMDB`_.
=================================== =====================================================================
Component
---------
.. toctree::
:hidden:
utils/vision/data_reader
utils/vision/data_transformer
utils/vision/blob_fetcher
========================================== =====================================================================
========================================= =====================================================================
List Brief
========================================== =====================================================================
========================================= =====================================================================
`dragon.utils.vision.im2db`_ Make the sequential database for images.
`dragon.utils.vision.data_batch`_ Efficient batch data provider based on `LMDB`_.
`dragon.utils.vision.data_reader`_ Queue encoded strings from `LMDB`_.
`dragon.utils.vision.data_transformer`_ Queue transformed images from `DataReader`_.
`dragon.utils.vision.blob_fetcher`_ Queue blobs from `DataTransformer`_.
========================================== =====================================================================
========================================= =====================================================================
.. _LMDB: http://lmdb.readthedocs.io/en/release
.. _dragon.utils.vision.im2db: utils/vision/database.html
.. _DataReader: utils/vision/data_reader.html#dragon.utils.vision.data_reader
.. _DataTransformer: utils/vision/data_transformer.html#dragon.utils.vision.data_transformer
.. _dragon.utils.vision.data_batch: utils/vision/data_batch.html
......
============
:mod:`IM2DB`
============
===============
:mod:`Database`
===============
.. toctree::
:hidden:
......@@ -19,8 +19,8 @@ List Brief
API Reference
-------------
.. automodule:: dragon.tools.im2db
.. automodule:: dragon.utils.vision.im2db
:members:
.. _resize_image: #dragon.tools.im2db.resize_image
.. _make_db: #dragon.tools.im2db.make_db
\ No newline at end of file
.. _resize_image: #dragon.utils.vision.im2db.resize_image
.. _make_db: #dragon.utils.vision.im2db.make_db
\ No newline at end of file
......@@ -20,20 +20,23 @@ VirtualBox
vm/caffe
vm/theano
vm/torch
==================== ====================================================================================
List Brief
==================== ====================================================================================
`Theano`_ **Theano** is a progenitor of modern deep learning frameworks.
`Caffe`_ **Caffe** is one of the most famous deep learning frameworks for Computer Vision.
`PyTorch`_ **PyTorch** provides straightforward operations for research prototyping.
==================== ====================================================================================
.. |para| raw:: html
<p style="text-indent:1.5em; font-size: 18px; max-width: 830px;">
.. _TinyDragon: ../index.html#tinydragon
.. _Theano: vm/theano.html
.. _Caffe: vm/caffe.html
.. _PyTorch: vm/torch.html
.. _TensorFlow: ../index.html#tensorflow
......@@ -66,7 +66,6 @@ List Brief
`AddLayer`_ The extended implementation of ``EltwiseLayer``.
`ConcatLayer`_ The implementation of ``ConcatLayer``.
`SliceLayer`_ The implementation of ``SliceLayer``.
`DenseConcatLayer`_ The implementation for `DenseNet`_.
`CropLayer`_ The implementation of ``CropLayer``.
`ReshapeLayer`_ The implementation of ``ReshapeLayer``.
`PermuteLayer`_ The implementation of ``PermuteLayer``.
......@@ -180,7 +179,6 @@ API Reference
.. _AddLayer: #dragon.vm.caffe.layers.common.AddLayer
.. _ConcatLayer: #dragon.vm.caffe.layers.common.ConcatLayer
.. _SliceLayer: #dragon.vm.caffe.layers.common.SliceLayer
.. _DenseConcatLayer: #dragon.vm.caffe.layers.common.DenseConcatLayer
.. _CropLayer: #dragon.vm.caffe.layers.common.CropLayer
.. _ReshapeLayer: #dragon.vm.caffe.layers.common.ReshapeLayer
.. _PermuteLayer: #dragon.vm.caffe.layers.common.PermuteLayer
......@@ -210,12 +208,10 @@ API Reference
.. _MPIBroadcastLayer: #dragon.vm.caffe.layers.mpi.MPIBroadcastLayer
.. _MPIGatherLayer: #dragon.vm.caffe.layers.mpi.MPIGatherLayer
.. _Layer.Setup: #dragon.vm.caffe.layer.Layer.Setup
.. _Layer.Fill: #dragon.vm.caffe.layer.Layer.Fill
.. _LMDB: http://lmdb.readthedocs.io/en/release
.. _DenseNet: http://arxiv.org/abs/1608.06993
.. _LayerSetUp(layer.hpp, L91): https://github.com/BVLC/caffe/blob/effcdb0b62410b2a6a54f18f23cf90733a115673/include/caffe/layer.hpp#L91
.. _DataParameter.source: https://github.com/BVLC/caffe/blob/effcdb0b62410b2a6a54f18f23cf90733a115673/src/caffe/proto/caffe.proto#L647
.. _DataParameter.prefetch: https://github.com/BVLC/caffe/blob/effcdb0b62410b2a6a54f18f23cf90733a115673/src/caffe/proto/caffe.proto#L672
......
============
:mod:`Torch`
============
Abstraction
-----------
|para| `PyTorch`_ provides straightforward operations for research prototyping.
|para| We are aware that **Dragon** is a graph-based framework with strict naming
of tensors, operators, and workspaces, while `Torch`_ is not.
A simple way to bridge the difference is **JIT**, which traces the anonymous expressions
and dispatches a series of executions to the backend. In that case, **AutoGrad** becomes
a simple trick (remember the *Chain Rule*).
|para| Rewriting the GC (*Garbage Collection*) is crucial in this role,
as the costly deconstruction of memories and operators must be avoided.
Once they are named formally, we can either persist an Operator (i.e., a **Module**)
or reuse several memories in turn (i.e., a **MemoryPool**).
|para| We are still working hard to cover the original PyTorch operators;
meanwhile, a number of extended operators from other frameworks can be used.
Our **PyTorch** will be unique and more powerful than the official one.
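
As a toy illustration of the naming idea above: once buffers have formal identities (here, just their size class), a pool can hand them back instead of freeing them. This is a sketch, not Dragon's actual allocator::

    #include <cstdlib>
    #include <map>
    #include <vector>

    // Toy MemoryPool: buffers are keyed by size and reused in turn,
    // avoiding the costly free/alloc cycle mentioned above.
    class MemoryPool {
     public:
        void* Acquire(size_t nbytes) {
            auto& bucket = free_[nbytes];
            if (!bucket.empty()) {
                void* p = bucket.back();
                bucket.pop_back();
                return p;  // reuse instead of malloc
            }
            return std::malloc(nbytes);
        }
        void Release(void* p, size_t nbytes) {
            free_[nbytes].push_back(p);  // defer the deallocation
        }
     private:
        std::map<size_t, std::vector<void*>> free_;
    };
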
Related Work
------------
|paratitle| **Proto-based Intermediate Representation**
|para| In recent years, several powerful frameworks have chosen ProtocolBuffer to
describe operators with various arguments, including `Caffe`_, `Caffe2`_, `TensorFlow`_, and `ONNX`_.
The most important reason is that these descriptors can be easily serialized and sent to the backend.
With the help of the **Factory Pattern**, we have an elegant way to dispatch the executions
without calling them imperatively. This approach is also known as **Declarative Programming**.
|para| Attaching the IR (Intermediate Representation) brings the following advantages:
* Traceable pipelines, very helpful for visualization and debugging.
* Deterministic executions, so detailed optimizations can be applied.
* Efficient deployments, as the data flows are well organized.
|para| The good news is that the overhead of the IR can be kept below 5% of the computation time,
which means the dynamic graph can run almost as fast as the static graph while retaining its flexibility.
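
A condensed sketch of the factory dispatch described above; the type names are illustrative and differ from Dragon's real registry::

    #include <functional>
    #include <map>
    #include <memory>
    #include <string>

    struct OperatorDef { std::string type; /* serialized arguments... */ };
    struct OperatorBase { virtual void Run() = 0; virtual ~OperatorBase() {} };

    using Creator =
        std::function<std::unique_ptr<OperatorBase>(const OperatorDef&)>;

    // The declarative core: descriptors are looked up in a registry
    // instead of being invoked imperatively.
    std::map<std::string, Creator>& Registry() {
        static std::map<std::string, Creator> m;
        return m;
    }

    std::unique_ptr<OperatorBase> Dispatch(const OperatorDef& def) {
        return Registry().at(def.type)(def);  // factory-pattern dispatch
    }
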
|paratitle| **Caffe2**
|para| We have noticed that some developers discouraged **Declarative Programming** in 2017 and early 2018,
due to the counter-intuitive building of the computation graph. Actually, `Caffe2`_ has provided operator-wise execution
(a.k.a. *workspace.RunOperator()*) since 2016. In other words, **Imperative Programming** is a subset
of **Declarative Programming** if we process the declaration implicitly. This mechanism is also sometimes called **JIT**.
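
In that view, an imperative call is just a declaration that is built and consumed at once. Reusing the toy types from the sketch above, standing in for what *workspace.RunOperator()* amounts to::

    // Hypothetical: declare a def, dispatch it, run it immediately.
    void RunOperatorImperatively(const std::string& type) {
        OperatorDef def;
        def.type = type;
        auto op = Dispatch(def);  // declarative lookup, hidden from the caller
        op->Run();                // executed right away
    }
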
Architectures
-------------
.. toctree::
:hidden:
.. _Torch: http://torch.ch
.. _PyTorch: https://pytorch.org
.. _Caffe: http://caffe.berkeleyvision.org
.. _Caffe2: http://caffe2.ai
.. _TensorFlow: https://www.tensorflow.org
.. _ONNX: https://onnx.ai
.. |nbsp| raw:: html
&nbsp
.. |br| raw:: html
<br />
.. |paratitle| raw:: html
<p style="font-size: 20px">
.. |sectitle| raw:: html
<p style="text-indent:1em; font-size: 18px">
.. |para| raw:: html
<p style="text-indent:1.5em; font-size: 18px; max-width: 830px;">
.. |context| raw:: html
<p style="font-size: 18px; max-width: 830px;">
......@@ -97,6 +97,7 @@ include_directories(${PROJECT_SOURCE_DIR}/src)
if (BUILD_PYTHON_API)
include_directories(${PYTHON_INCLUDE_DIRS})
include_directories(${NUMPY_INCLUDE_DIR})
include_directories(${THIRD_PARTY_DIR}/pybind11/include)
endif()
if (WITH_CUDA)
include_directories(${CUDA_INCLUDE_DIRS})
......
......@@ -38,7 +38,7 @@ class CPUContext {
void SwitchToDevice() {}
/*! \brief Switch to the device with the given stream */
void SwitchToDevice(int stream_id) {}
void SwitchToDevice(const int stream_id) {}
/*! \brief Synchronize the dispatched operations */
void FinishDeviceCompution() {}
......@@ -106,6 +106,9 @@ class CPUContext {
/*! \brief Return the device id */
int device_id() const { return 0; }
/*! \brief Return the stream id */
int stream_id() const { return 0; }
/*! \brief Set the stream id */
void set_stream_id(int stream_id) {}
......
......@@ -32,6 +32,7 @@ class CNRTObject;
class CNMLContext {
public:
/*! \brief Default Constructor */
CNMLContext(const DeviceOption& option)
: device_id_(option.device_id()),
random_seed_(option.has_random_seed() ?
......@@ -39,34 +40,43 @@ class CNMLContext {
CHECK_EQ(option.device_type(), PROTO_CNML);
}
/*! \brief Constructor with the specified device id */
CNMLContext(const int device_id = 0)
: device_id_(device_id),
random_seed_(DEFAULT_RNG_SEED) {}
/*! \brief Switch to the device with the given stream */
void SwitchToDevice(int stream_id);
inline void SwitchToDevice() { SwitchToDevice(1); }
/*! \brief Switch to the device of this context */
inline void SwitchToDevice() { SwitchToDevice(0); }
/*! \brief Synchronize the dispatched operations */
void FinishDeviceCompution();
/*! \brief Malloc the memory */
static void* New(size_t nbytes);
/*! \brief Zero-Reset the memory */
static void Memset(
size_t nbytes,
void* ptr);
/*! \brief Zero-Reset the memory asynchronously */
inline void MemsetAsync(
size_t nbytes,
void* ptr) {
Memset(nbytes, ptr);
}
/*! \brief Copy the memory */
template<class DstContext, class SrcContext>
static void Memcpy(
size_t nbytes,
void* dst,
const void* src);
/*! \brief Copy the memory asynchronously */
template<class DstContext, class SrcContext>
inline void MemcpyAsync(
size_t nbytes,
......@@ -75,23 +85,33 @@ class CNMLContext {
Memcpy<DstContext, SrcContext>(dst, src, nbytes);
}
/*! \brief Free the memory */
static void Delete(void* data);
inline int device_id() const { return device_id_; }
/*! \brief Return the device id */
int device_id() const { return device_id_; }
inline void set_stream_id(int stream_id) { stream_id_ = stream_id; }
/*! \brief Return the stream id */
int stream_id() const { return stream_id_; }
inline cnrtStream_t cnrt_stream() {
/*! \brief Set the stream id */
void set_stream_id(int stream_id) { stream_id_ = stream_id; }
/*! \brief Return the internal cnrt stream */
cnrtStream_t cnrt_stream() {
return cnrt_stream(device_id_, stream_id_);
}
/*! \brief Return the specified cnrt stream */
static cnrtStream_t cnrt_stream(
int device_id,
int stream_id);
/*! \brief Return the global context locker */
static std::mutex& mutex() { static std::mutex m; return m; }
static CNRTObject* cuda_object();
/*! \brief Return the thread local cnrt object */
static CNRTObject* cnrt_object();
private:
int device_id_, stream_id_ = 1, random_seed_;
......
......@@ -80,11 +80,16 @@ class CUDAObject {
} return dev_streams[stream_id];
}
/*! \brief Return the default cuda stream */
/*! \brief Return the default cuda stream of current device */
cudaStream_t GetDefaultStream() {
return GetStream(CUDA_GET_DEVICE(), 0);
}
/*! \brief Return the default cuda stream of given device */
cudaStream_t GetDefaultStream(int device_id) {
return GetStream(device_id, 0);
}
/*! \brief Return the specified cublas handle */
cublasHandle_t GetCuBLASHandle(int device_id, int stream_id) {
vector<cublasHandle_t>& dev_handles = cublas_handles[device_id];
......@@ -141,13 +146,13 @@ class CUDAContext {
random_seed_(DEFAULT_RNG_SEED) {}
/*! \brief Switch to the device with the given stream */
void SwitchToDevice(int stream_id) {
void SwitchToDevice(const int stream_id) {
CUDA_CHECK(cudaSetDevice(device_id_));
stream_id_ = stream_id;
}
/*! \brief Switch to the device of this context */
void SwitchToDevice() { SwitchToDevice(1); }
void SwitchToDevice() { SwitchToDevice(0); }
/*! \brief Synchronize the dispatched operations */
void FinishDeviceCompution() {
......@@ -191,8 +196,19 @@ class CUDAContext {
size_t nbytes,
void* dst,
const void* src) {
MemcpyEx<DstContext, SrcContext>(
nbytes, dst, src, active_device_id());
}
/*! \brief Copy the memory [Extended] */
template<class DstContext, class SrcContext>
static void MemcpyEx(
size_t nbytes,
void* dst,
const void* src,
int device_id) {
cudaStream_t stream = CUDAContext::
cuda_object()->GetDefaultStream();
cuda_object()->GetDefaultStream(device_id);
CUDA_CHECK(cudaMemcpyAsync(dst, src, nbytes,
cudaMemcpyDefault, stream));
cudaError_t error = SynchronizeStream(stream);
......@@ -230,9 +246,15 @@ class CUDAContext {
return cudaGetLastError();
}
/*! \brief Return the device id */
/*! \brief Return the device id of this context */
int device_id() const { return device_id_; }
/*! \brief Return the active device id of current thread */
static int active_device_id() { return CUDA_GET_DEVICE(); }
/*! \brief Return the stream id */
int stream_id() const { return stream_id_; }
/*! \brief Set the stream id */
void set_stream_id(int stream_id) { stream_id_ = stream_id; }
......@@ -292,85 +314,48 @@ class CUDAContext {
}
private:
int device_id_, stream_id_ = 1, random_seed_;
int device_id_, stream_id_ = 0, random_seed_;
unique_ptr<std::mt19937> rand_generator_;
curandGenerator_t curand_generator_ = nullptr;
};
template <class Context>
class CUDAClosure {
public:
/*! \brief Default Constructor */
CUDAClosure() {}
/*! \brief Constructor with the given context */
explicit CUDAClosure(Context* ctx): ctx_(ctx) {}
/*! \brief Synchronize the dispatched operations */
void Sync() {
for (auto stream_id : active_streams_) {
cudaStreamSynchronize(cuda_object_
.GetStream(ctx_->device_id(), stream_id));
cudaError_t error = cudaGetLastError();
CHECK_EQ(error, cudaSuccess)
<< "\nCUDA Error: " << cudaGetErrorString(error);
}
active_streams_.clear();
}
/*! \brief Return the specified cuda stream */
cudaStream_t cuda_stream(int stream_id) {
active_streams_.push_back(stream_id);
return cuda_object_.GetStream(
ctx_->device_id(), stream_id);
}
/*! \brief Return the specified cublas handle */
cublasHandle_t cublas_handle(int stream_id) {
active_streams_.push_back(stream_id);
return cuda_object_.GetCuBLASHandle(
ctx_->device_id(), stream_id);
}
/*! \brief Return the specified cudnn handle */
#ifdef WITH_CUDNN
cudnnHandle_t cudnn_handle(int stream_id) {
active_streams_.push_back(stream_id);
return cuda_object_.GetCuDNNHandle(
ctx_->device_id(), stream_id);
}
#endif
protected:
Context* ctx_;
CUDAObject cuda_object_;
vector<int> active_streams_;
};
#else // WITH_CUDA
class CUDAContext {
public:
/*! \brief Default Constructor */
CUDAContext(const DeviceOption& option) { CUDA_NOT_COMPILED; }
/*! \brief Constructor with the specified device id */
CUDAContext(const int device_id = 0) { CUDA_NOT_COMPILED; }
void SwitchToDevice() { CUDA_NOT_COMPILED; }
/*! \brief Switch to the device with the given stream */
void SwitchToDevice(int stream_id) { CUDA_NOT_COMPILED; }
/*! \brief Switch to the device of this context */
void SwitchToDevice() { CUDA_NOT_COMPILED; }
/*! \brief Synchronize the dispatched operations */
void FinishDeviceCompution() { CUDA_NOT_COMPILED; }
/*! \brief Malloc the memory */
static void* New(size_t nbytes) { CUDA_NOT_COMPILED; }
/*! \brief Zero-Reset the memory */
static void Memset(
size_t nbytes,
void* ptr) {
CUDA_NOT_COMPILED;
}
/*! \brief Zero-Reset the memory asynchronously */
void MemsetAsync(
size_t nbytes,
void* ptr) {
CUDA_NOT_COMPILED;
}
/*! \brief Copy the memory */
template<class DstContext, class SrcContext>
static void Memcpy(
size_t nbytes,
......@@ -379,6 +364,17 @@ class CUDAContext {
CUDA_NOT_COMPILED;
}
/*! \brief Copy the memory [Extended] */
template<class DstContext, class SrcContext>
static void MemcpyEx(
size_t nbytes,
void* dst,
const void* src,
int device_id) {
CUDA_NOT_COMPILED;
}
/*! \brief Copy the memory asynchronously */
template<class DstContext, class SrcContext>
void MemcpyAsync(
size_t nbytes,
......@@ -387,7 +383,16 @@ class CUDAContext {
CUDA_NOT_COMPILED;
}
/*! \brief Return the device id */
int device_id() const { return 0; }
/*! \brief Return the active device id of current thread */
static int active_device_id() { return 0; }
/*! \brief Return the stream id */
int stream_id() const { return 0; }
/*! \brief Set the stream id */
void set_stream_id(int stream_id) {}
};
......
......@@ -20,80 +20,69 @@ namespace dragon {
class GraphBase {
public:
struct Node {
vector<string> parents;
vector<string> childs;
int op_idx = -1;
OperatorDef op_def;
};
/*! \brief Default constructor */
GraphBase(
const GraphDef& meta_graph,
Workspace* ws);
/*! \brief Default destructor */
virtual ~GraphBase() {}
GraphDef BuildUpdateOps(const GraphDef& input_def);
/*! \brief Create a graph from the optimized def */
virtual bool Create(
const GraphDef& optimized_graph,
Workspace* ws) = 0;
/*! \brief Run the graph once synchronously */
virtual bool Run(
const string& include,
const string& exclude,
const int stream_id = 1) = 0;
int stream_id = 0) = 0;
/*! \brief Return the name of this graph */
string name() const { return name_; }
protected:
/*! \brief Store the name and running phase */
string name_, phase_;
/*! \brief Store the defined arguments */
Map<string, Argument> args_;
/*! \brief Store the parent workspace */
Workspace* ws_;
};
class Graph : public GraphBase {
public:
/*! \brief Default constructor */
Graph(const GraphDef& meta_graph, Workspace* ws);
/*! \brief Default destructor */
virtual ~Graph() { for (auto* op : ops_) delete op; }
/*! \brief Create a graph from the optimized def */
bool Create(
const GraphDef& optimized_graph,
Workspace* ws) override;
/*! \brief Run the graph once synchronously */
bool Run(
const string& include,
const string& exclude,
const int stream_id = 1) override;
GraphDef Prune(const GraphDef& meta_graph);
GraphDef Share(const GraphDef& optimized_graph);
void ShareGrads(GraphDef& optimized_graph);
GraphDef BuildUpdateOps(const GraphDef& meta_graph);
void RecomputingAware(
const GraphDef& optimized_graph,
Workspace* ws);
int stream_id = 0) override;
/*! \brief Return the parent workspace */
Workspace* ws() const { return ws_; }
protected:
void ForwardShareDyeing(
const string& u,
const string& ancestor);
void ForwardPruneDyeing(
const string& u,
const string& leaf,
const vector<string>& path);
void BackwardPruneDyeing(string v);
/*! \brief Store the internal operators */
vector<OperatorBase*> ops_;
Map<string, Node> dag_;
Map<string, bool> visited_, colored_;
Map<string, string> renamed_;
Set<string> targets_;
};
/*! \brief Create a graph from the raw def */
GraphBase* NewGraph(
const GraphDef& meta_graph,
Workspace* ws);
......
......@@ -19,14 +19,19 @@ namespace dragon {
class GraphGradientMaker {
public:
GraphGradientMaker(): cur_op_idx_(0) {}
GraphGradientMaker()
: cur_op_idx_(0) {}
void Make(
const GraphDef& forward_def,
const vector<OperatorDef*>& forward_def,
const vector<string>& targets,
GraphDef& new_def);
void Share(const string& grads_prefix, GraphDef& graph);
void Make(
const GraphDef& forward_def,
GraphDef& backward_def);
void Share(GraphDef& graph);
void SetTerms(const Map<string, string>& terms) { terms_ = terms; }
void SetOperatorPrefix(const string& prefix) { op_prefix_ = prefix; }
......
/*!
* Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
*
* Licensed under the BSD 2-Clause License.
* You should have received a copy of the BSD 2-Clause License
* along with the software. If not, See,
*
* <https://opensource.org/licenses/BSD-2-Clause>
*
* ------------------------------------------------------------
*/
#ifndef DRAGON_CORE_GRAPH_OPTIMIZER_H_
#define DRAGON_CORE_GRAPH_OPTIMIZER_H_
#include "core/common.h"
namespace dragon {
class Workspace;
class GraphOptimizer {
public:
/*! \brief The simple node structure */
struct Node {
vector<string> parents;
vector<string> childs;
int op_idx = -1;
OperatorDef op_def;
};
/*! \brief Default constructor */
GraphOptimizer(Workspace* ws) : ws_(ws) {}
/*! \brief Prune the redundant nodes (-O1) */
GraphDef PruneNodes(const GraphDef& input_def);
/*! \brief Add the inplace for outputs (-O2) */
GraphDef AddInplace(const GraphDef& input_def);
/*! \brief Plan the recomputing for inputs (-O3) */
GraphDef MirrorStage(
const GraphDef& input_def,
Map< string, vector<int> >& op_indices);
/*! \brief Allocate the buffer for outputs (-O3) */
GraphDef SimulateGC(const GraphDef& input_def);
protected:
/*! \brief Traverse from input gradients to dye the nodes */
void ForwardPruneTraversal(
const string& u,
const string& leaf,
const vector<string>& path);
/*! \brief Traverse from targets to dye the nodes */
void BackwardPruneTraversal(const string& v);
/*! \brief Traverse from inputs to find the available inplace chain */
void InplaceTraversal(
const string& u,
const string& ancestor);
/* \brief Store the workspace of the parent graph */
Workspace* ws_;
/* \brief Store the DAG */
Map<string, Node> dag_;
/* \brief Store the traversal flags */
Map<string, bool> visited_, colored_;
/* \brief Store the inplace relations */
Map<string, string> renamed_;
};
} // namespace dragon
#endif // DRAGON_CORE_GRAPH_OPTIMIZER_H_
\ No newline at end of file
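
For illustration, a driver that chains these passes by optimization level might look as follows; this is an assumed sketch built only from the declarations above (MirrorStage is omitted since it also returns the recomputing indices):

    // Hypothetical driver applying the passes by level.
    GraphDef Optimize(const GraphDef& def, int level, Workspace* ws) {
        GraphOptimizer opt(ws);
        GraphDef g = def;
        if (level >= 1) g = opt.PruneNodes(g);  // -O1: drop redundant nodes
        if (level >= 2) g = opt.AddInplace(g);  // -O2: in-place outputs
        if (level >= 3) g = opt.SimulateGC(g);  // -O3: buffer planning
        return g;
    }
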
......@@ -35,8 +35,6 @@ class MixedMemory {
STATE_AT_CUDA,
/*! \brief Memory could be modified by CNMLContext last time */
STATE_AT_CNML,
/*! \brief Memory should be copied to another device next time */
SWITCHED,
/*! \brief Host and Device now hold the same contents */
SYNCED,
} State;
......@@ -46,7 +44,7 @@ class MixedMemory {
cuda_ptr_(nullptr), cnml_ptr_(nullptr) {}
/*! \brief Constructor with the known meta and size */
MixedMemory(const TypeMeta& meta, const size_t nbytes)
MixedMemory(const TypeMeta& meta, size_t nbytes)
: meta_(meta), nbytes_(nbytes), cpu_ptr_(nullptr),
cuda_ptr_(nullptr), cnml_ptr_(nullptr) {}
......@@ -54,19 +52,19 @@ class MixedMemory {
~MixedMemory();
/*! \brief Return the const data pointer on CPUContext */
const void* cpu_data();
const void* cpu_data(size_t nbytes = 0);
/*! \brief Return the const data pointer on CUDAContext */
const void* cuda_data();
const void* cuda_data(size_t nbytes = 0);
/*! \brief Return the const data pointer on CNMLContext */
const void* cnml_data();
/*! \brief Return the mutable data pointer on CPUContext */
void* mutable_cpu_data();
void* mutable_cpu_data(size_t nbytes = 0);
/*! \brief Return the mutable data pointer on CUDAContext */
void* mutable_cuda_data();
void* mutable_cuda_data(size_t nbytes = 0);
/*! \brief Return the mutable data pointer on CNMLContext */
void* mutable_cnml_data();
......@@ -86,10 +84,10 @@ class MixedMemory {
/*! \brief Set the cpu data pointer from external context */
void set_cpu_data(void* cpu_ptr, size_t nbytes);
/*! \brief Switch to the device set by Context before */
void SwitchToDevice();
/*! \brief Switch to the specified device */
void SwitchToDevice(int device_id);
/*! \brief Switch to the specified cuda device */
void SwitchToCUDADevice(int device_id);
/*! \brief Return the total bytes of this memory */
......@@ -110,14 +108,17 @@ class MixedMemory {
/*! \brief Set the storage order */
void set_order(StorageOrder order) { order_ = order; }
/*! \brief Return the device id of the memory on device */
int device_id() const { return ptr_device_; }
/*! \brief Return a map of strings describing the internal structure */
const Map<string, string> info() const;
/*! \brief Control the state machine to CPUContext */
void ToCPU();
void ToCPU(size_t nbytes = 0);
/*! \brief Control the state machine to CUDAContext */
void ToCUDA();
void ToCUDA(size_t nbytes = 0);
private:
/*! \brief The type meta to call the destructor */
......
......@@ -30,10 +30,10 @@ class Workspace;
class OperatorBase {
public:
/*! Default constructor */
/*! \brief Default constructor */
OperatorBase(const OperatorDef& def, Workspace* ws);
/*! Default deconstructor */
/*! \brief Default destructor */
virtual ~OperatorBase() {}
/*! \brief Return the specified input tensor */
......@@ -49,19 +49,13 @@ class OperatorBase {
int OutputSize() { return (int)outputs_.size(); }
/*! \brief Modify this operator according to the given def */
void MutableOp(const OperatorDef& def);
/*! \brief Modify this operator according to the given properties */
void MutableOp(
const vector<string>& inputs,
const vector<string>& outputs,
const string& anchor);
void UpdateFrom(const OperatorDef& def);
/*! \brief Switch the internal running phase */
void SwitchToPhase(const string& phase) { phase_ = phase; }
/*! \brief Run this operator on the specified stream */
virtual void Run(int stream_id = 1) { NOT_IMPLEMENTED; }
virtual void Run(int stream_id = 0) { NOT_IMPLEMENTED; }
/*! \brief Fuse this operator into the specified graph */
virtual void Fusion(void* graph) { NOT_IMPLEMENTED; }
......@@ -100,14 +94,14 @@ class OperatorBase {
/*! \brief Return the specified argument */
const Argument& arg(const string& name) { return *(args_[name]); }
typedef Map<string, vector<OperatorBase*> > RecomputeMap;
typedef Map<string, vector<OperatorBase*> > SubGraph;
/*! \brief Return the recomputing map of this operator */
RecomputeMap& recompute_map() { return recompute_map_; }
/*! \brief Return the recomputing subgraph of this operator */
SubGraph& subgraph() { return subgraph_; }
/*! \brief Set the given recomputing map */
void set_recompute_map(RecomputeMap recompute_map) {
recompute_map_ = recompute_map;
/*! \brief Set the given recomputing subgraph */
void set_subgraph(SubGraph subgraph) {
subgraph_ = subgraph;
}
/*! \brief Return the stored operator def */
......@@ -129,7 +123,7 @@ class OperatorBase {
protected:
string phase_, anchor_;
Map<std::string, const Argument*> args_;
Map<string, vector<OperatorBase*> > recompute_map_;
SubGraph subgraph_;
vector<Tensor*> inputs_, outputs_;
OperatorDef def_;
Workspace* ws_;
......@@ -138,50 +132,66 @@ class OperatorBase {
template <class Context>
class Operator : public OperatorBase {
public:
/*! \brief Default constructor */
Operator(const OperatorDef& def, Workspace* ws)
: OperatorBase(def, ws), ctx_(def.device_option()),
allow_recompute_(OperatorBase::Arg<bool>(
"recomputing_aware", false)),
allow_recomputing_(OperatorBase::Arg<bool>(
"allow_recomputing", false)),
do_sync_(OperatorBase::Arg<bool>(
"do_sync", true)) {
"do_sync", false)) {
allow_run_ = true;
allow_run_ &= _MPICheck();
allow_run_ &= MPICheck();
allow_run_ &= (!(OutputSize() == 1 &&
Output(0)->name() == "ignore"));
}
void Run(int stream_id = 1) final {
/*! \brief Run this operator on the specified stream */
void Run(int stream_id = 0) final {
if (!allow_run_) return;
if (allow_recompute_) MakeResource();
if (allow_recomputing_) PrepareResource();
ctx()->SwitchToDevice(stream_id);
MemorySwitch();
RunOnDevice();
if (do_sync_) ctx()->FinishDeviceCompution();
if (allow_recompute_) CleanResource();
if (do_sync_ || stream_id > 0) {
// Stream 0 is synchronized at specific points elsewhere
ctx()->FinishDeviceCompution();
}
if (allow_recomputing_) ReleaseResource();
}
/*! \brief Prepare the content of inputs */
virtual void PrepareResource();
virtual void ElimateCorruption();
virtual void MakeResource();
virtual void CleanResource();
/*! \brief Release the ownership of inputs */
virtual void ReleaseResource();
/*! \brief Coordinate the context of inputs and outputs */
virtual void MemorySwitch() {
for (auto* I : inputs_)
if(I->name() != "ignore") I->SwitchToDevice();
for (auto* O : outputs_)
if(O->name() != "ignore") O->SwitchToDevice();
for (auto* e : inputs_)
if(e->name() != "ignore")
e->SwitchToDevice(ctx()->device_id());
for (auto* e : outputs_)
if(e->name() != "ignore")
e->SwitchToDevice(ctx()->device_id());
}
/*! \brief Implement the detailed execution */
virtual void RunOnDevice() = 0;
/*! \brief Return the internal context */
Context* ctx() { return &ctx_; }
/*! \brief Whether this operator is allowed to run */
bool AllowRun() { return allow_run_; }
protected:
/*! \brief Store the internal context */
Context ctx_;
bool allow_run_, allow_recompute_, do_sync_;
bool allow_run_, allow_recomputing_, do_sync_;
private:
bool _MPICheck() {
/*! \brief Check the MPI conditions */
bool MPICheck() {
#ifndef WITH_MPI
return true;
#else
......@@ -197,7 +207,13 @@ class Operator : public OperatorBase {
}
};
OperatorBase* CreateOperator(const OperatorDef& def, Workspace* ws);
/*! \brief Create a new operator from the raw def */
OperatorBase* NewOperator(
const OperatorDef& def,
Workspace* ws);
/*! Macros */
#define USE_SIMPLE_CTOR_DTOR(name) \
name(const OperatorDef& def, Workspace* ws) \
......@@ -350,7 +366,9 @@ DECLARE_REGISTRY(
<< "\nExcepted the size of " << #argument \
<< " > " << idx << ". (Got " \
<< argument##_desc.size() << ")."; \
Tensor* argument##_tensor = ws()->GetTensor(argument##_desc[idx]); \
Tensor* argument##_tensor = ws()->GetTensor( \
str::replace_first(argument##_desc[idx], \
"${ANCHOR}", anchor())); \
CHECK(argument##_tensor->IsType<type>()) \
<< "\nThe type of " << #argument << " should be " << #type << "."; \
CHECK_EQ(argument##_tensor->count(), 1) \
......
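
For illustration, the substitution above resolves a templated descriptor per op before the workspace lookup; the values below are hypothetical:

    std::string desc = "outputs/${ANCHOR}/starts";  // assumed descriptor
    std::string key  = str::replace_first(desc, "${ANCHOR}", "Slice_3");
    // key == "outputs/Slice_3/starts": each op instance reads its own
    // argument tensor instead of colliding on a shared name.
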
......@@ -46,10 +46,17 @@ class GradientMakerBase {
virtual Gradient Make() {
vector<OperatorDef> new_defs = MakeDefs();
if (def.has_uid()) {
// Attach the anchor to the name if a UID is present
for (int i = 0; i < new_defs.size(); i++)
new_defs[i].set_name(def.name());
} else {
// Otherwise, just put it into the arguments
Argument anchor;
anchor.set_name("anchor"); anchor.set_s(def.name());
for (int i = 0; i < new_defs.size(); i++)
new_defs[i].add_arg()->CopyFrom(anchor);
}
return Gradient(new_defs, g_inputs_, DefaultValues());
};
......
......@@ -80,7 +80,7 @@ class Tensor {
int ndim() const { return (int)dims_.size(); }
/*! \brief Return the dimension of given axis */
int64_t dim(const int64_t i) const{ return dims_[axis(i)]; }
int64_t dim(int64_t i) const { return dims_[axis(i)]; }
/*! \brief Return all the dimensions */
const vector<int64_t>& dims() const { return dims_; }
......@@ -95,7 +95,7 @@ class Tensor {
size_t capacity() const { return capacity_; }
/*! \brief Return the number of elements along the [start, end) axes */
int64_t count(const int64_t start, const int64_t end) const {
int64_t count(int64_t start, int64_t end) const {
int64_t nelements = 1;
for (int64_t i = start; i < end; i++) nelements *= dim(i);
return nelements;
......@@ -105,10 +105,10 @@ class Tensor {
int64_t count() const { return (int64_t)size_; }
/*! \brief Return the number of elements from the start axis */
int64_t count(const int64_t start) const { return count(start, ndim()); }
int64_t count(int64_t start) const { return count(start, ndim()); }
/*! \brief Return the stride of given axis */
int64_t stride(const int64_t i) const { return strides_[axis(i)]; }
int64_t stride(int64_t i) const { return strides_[axis(i)]; }
/*! \brief Return all the strides */
const vector<int64_t>& strides() const { return strides_; }
......@@ -128,11 +128,11 @@ class Tensor {
/*! \brief Return a string to describe the dimensions of this tensor */
string DimString() const { return DimString(dims_); }
/*! \brief Whether the memory of this tensor is unstable */
bool is_corrupted() const { return is_corrupted_; }
/*! \brief Return the version of this tensor */
int version() const { return version_; }
/*! \brief Mark the internal memory to be unstable */
void Corrupt() { is_corrupted_ = true; }
/*! \brief Set the version of this tensor */
void set_version(int version) { version_ = version; }
/*! \brief Whether this tensor holds a valid memory */
bool has_memory() const { return memory_ || ex_memory_ != nullptr; }
......@@ -152,10 +152,10 @@ class Tensor {
return memory()->state();
}
/*! \brief Switch the memory to device set by Context before */
void SwitchToDevice() {
/*! \brief Switch the memory to the specific device */
void SwitchToDevice(int device_id) {
MixedMemory* mem = memory();
if (mem) mem->SwitchToDevice();
if (mem) mem->SwitchToDevice(device_id);
}
/*! \brief Return the type meta of this tensor */
......@@ -177,10 +177,10 @@ class Tensor {
} else {
if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CPUContext>()) {
*data_ptr = mem->mutable_cpu_data();
*data_ptr = mem->mutable_cpu_data(nbytes());
} else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CUDAContext>()) {
*data_ptr = mem->mutable_cuda_data();
*data_ptr = mem->mutable_cuda_data(nbytes());
} else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CNMLContext>()) {
*data_ptr = mem->mutable_cnml_data();
......@@ -198,10 +198,10 @@ class Tensor {
CHECK(mem) << "\nMemory access before allocating.";
if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CPUContext>()) {
return mem->cpu_data();
return mem->cpu_data(nbytes());
} else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CUDAContext>()) {
return mem->cuda_data();
return mem->cuda_data(nbytes());
} else if (TypeMeta::Id<Context>() ==
TypeMeta::Id<CNMLContext>()) {
return mem->cnml_data();
......@@ -258,10 +258,18 @@ class Tensor {
T* mutable_data() {
void* data_ptr;
mutable_data_ptr<Context>(&data_ptr);
if (data_ptr && meta_ == TypeMeta::Make<T>())
if (data_ptr) {
auto meta = TypeMeta::Make<T>();
if (meta_ == meta) {
return static_cast<T*>(data_ptr);
return static_cast<T*>(
raw_mutable_data<Context>(TypeMeta::Make<T>()));
} else if (capacity_ >=
size_ * meta.itemsize()) {
meta_ = meta;
return static_cast<T*>(data_ptr);
}
}
return static_cast<T*>(raw_mutable_data
<Context>(TypeMeta::Make<T>()));
}
/*! \brief Get the typed const data pointer */
......@@ -325,6 +333,9 @@ class Tensor {
/*! \brief Store the size and capacity */
size_t size_ = 0, capacity_ = 0;
/*! \brief Store the version for shared tensor */
int version_ = -1;
/*! \brief Store the dimensions and strides */
vector<int64_t> dims_, strides_;
......@@ -335,7 +346,7 @@ class Tensor {
MixedMemory* ex_memory_ = nullptr;
/*! \brief External memory indicators */
bool is_corrupted_ = false, is_shared_ = false, own_mem_ = true;
bool is_shared_ = false, own_mem_ = true;
};
} // namespace dragon
......
......@@ -69,8 +69,8 @@ class TypeMeta {
template <typename T>
static TypeId Id() {
// return T's id
// using a intptr_t as hash key
// Return T's id
// Using an intptr_t as hash key
return TypeRegister<T>::id();
}
......
......@@ -19,14 +19,12 @@
namespace dragon {
#define WORKSPACE_MAX_CORRUPTED_SIZE 2
class Workspace {
public:
typedef Map<string, Map<string, int64_t> > DummyNameMap;
typedef Map<string, unique_ptr<Tensor> > TensorMap;
typedef Map<string, string> TensorProxyMap;
typedef Map<string, string> TensorAliasMap;
typedef Map<string, TensorFillerProto> TensorFillerMap;
typedef Map<string, unique_ptr<OperatorBase> > OperatorMap;
......@@ -107,18 +105,14 @@ class Workspace {
return Tcaches;
}
/*! \brief Create a persistent operator in this workspace */
void CreatePersistentOp(const OperatorDef& def);
/*! \brief Create a operator in this workspace */
OperatorBase* CreateOperator(const OperatorDef& def);
/*! \brief Run the specified persistent operator */
void RunPersistentOp(
const string& key,
const string& anchor,
const vector<string>& inputs,
const vector<string>& outputs);
void RunOperator(const OperatorDef& def);
/*! \brief Try to run the operator in an adaptive mode */
void RunOperator(const OperatorDef& def);
void RunOperatorOnce(const OperatorDef& def);
/*! \brief Create a Graph in this workspace */
GraphBase* CreateGraph(const GraphDef& def);
......@@ -128,13 +122,13 @@ class Workspace {
const string& graph_name,
const string& include,
const string& exclude,
const int stream_id = 1);
int stream_id = 0);
/*! \brief Return all the stored graph names */
vector<string> GetGraphs() const;
/* \brief Set a proxy name for the tensor */
bool SetTensorProxy(const string& key, const string& proxy);
/* \brief Set an alias for the tensor */
bool SetTensorAlias(const string& name, const string& alias);
/* \brief Return a unique dummy name within this workspace */
string GetDummyName(
......@@ -157,7 +151,7 @@ class Workspace {
TensorFillerMap tensor_filler_map_;
/*! \brief Store the proxy name of tensors */
TensorProxyMap tensor_proxy_map_;
TensorAliasMap tensor_alias_map_;
/*! \brief Store the registered operators for dynamic graph */
OperatorMap operator_map_;
......
......@@ -99,6 +99,6 @@ class CuDNNSoftmaxGradientOp final : public Operator<Context> {
#endif // WITH_CUDNN
}
} // namespace dragon
#endif // DRAGON_OPERATORS_ACTIVATION_SOFTMAX_OP_H_
\ No newline at end of file
......@@ -10,29 +10,29 @@
* ------------------------------------------------------------
*/
#ifndef DRAGON_OPERATORS_UPDATE_MOVING_AVERAGE_OP_H_
#define DRAGON_OPERATORS_UPDATE_MOVING_AVERAGE_OP_H_
#ifndef DRAGON_OPERATORS_ARITHMETIC_ACCUMULATE_OP_H_
#define DRAGON_OPERATORS_ARITHMETIC_ACCUMULATE_OP_H_
#include "core/operator.h"
namespace dragon {
template <class Context>
class MovingAverageOp final : public Operator<Context> {
class AccumulateOp final : public Operator<Context> {
public:
MovingAverageOp(const OperatorDef& def, Workspace* ws)
AccumulateOp(const OperatorDef& def, Workspace* ws)
: Operator<Context>(def, ws),
decay(OperatorBase::Arg<float>("decay", 1.f)) {}
alpha(OperatorBase::Arg<float>("alpha", 1.f)),
beta(OperatorBase::Arg<float>("beta", 1.f)) {}
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
template <typename T> void RunWithType();
template <typename T> void RunWithType(Tensor* X, Tensor* Y);
protected:
float decay;
float alpha, beta;
};
} // namespace dragon
#endif // DRAGON_OPERATORS_UPDATE_MOVING_AVERAGE_OP_H_
\ No newline at end of file
#endif // DRAGON_OPERATORS_ARITHMETIC_ACCUMULATE_OP_H_
\ No newline at end of file
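
To make the renaming concrete: the operator computes a generalized accumulation, and the old moving average is the special case alpha = 1 - decay, beta = decay. A sketch of the math on a flat buffer (not the actual kernel):

    // Sketch: y = alpha * x + beta * y, applied element-wise.
    template <typename T>
    void Accumulate(int n, float alpha, float beta, const T* x, T* y) {
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + beta * y[i];
    }
    // MovingAverage(decay) is Accumulate with alpha = 1 - decay, beta = decay.
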
......@@ -46,12 +46,12 @@ class AffineGradientOp final : public Operator<Context> {
void RunOnDevice() override;
template <typename T> void BiasRunWithType();
template <typename T> void ScaleRunWithType();
template <typename T> void ComputeScaleGradient(T* dYxX, T* dA);
template <typename T> void RunWithType();
protected:
int64_t axis, num_axes;
int64_t outer_dim, inner_dim, scale_dim, sum_dim, dim;
Tensor sum_result;
};
#ifdef WITH_CUDNN
......@@ -125,18 +125,12 @@ public:
template <typename DT, typename CT>
void ComputeScaleGradient(DT* dYxX, DT* dA);
template <typename DT, typename CT>
void ComputeBiasGradient(const DT* dY, DT* dB);
template <typename T> void ComputeScaleGradient_v2(T* dYxX, T* dA);
template <typename T> void ComputeBiasGradient_v2(const T* dY, T* dB);
template <typename DT, typename CT> void RunWithType();
protected:
USE_CUDNN_AFFINE_FUCNTIONS;
int64_t outer_dim, inner_dim, scale_dim, dim, sum_dim;
Tensor sum_result;
};
#endif
......
......@@ -10,36 +10,33 @@
* ------------------------------------------------------------
*/
#ifndef DRAGON_OPERATORS_VISION_DENSE_CONCAT_OP_H_
#define DRAGON_OPERATORS_VISION_DENSE_CONCAT_OP_H_
#ifndef DRAGON_OPERATORS_ARITHMETIC_SQRT_OP_H_
#define DRAGON_OPERATORS_ARITHMETIC_SQRT_OP_H_
#include "operators/ndarray/concat_op.h"
#include "core/operator.h"
namespace dragon {
template <class Context>
class DenseConcatOp final : public ConcatOp<Context> {
class SqrtOp final : public Operator<Context> {
public:
DenseConcatOp(const OperatorDef& def, Workspace* ws)
: ConcatOp<Context>(def, ws) {}
USE_SIMPLE_CTOR_DTOR(SqrtOp);
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
template <typename T> void RunWithType();
};
template <class Context>
class DenseConcatGradientOp final : public ConcatGradientOp<Context> {
class SqrtGradientOp final : public Operator<Context> {
public:
DenseConcatGradientOp(const OperatorDef& def, Workspace* ws)
: ConcatGradientOp<Context>(def, ws),
growth_rate(OperatorBase::Arg<int64_t>("growth_rate", 0)) {}
USE_SIMPLE_CTOR_DTOR(SqrtGradientOp);
USE_OPERATOR_FUNCTIONS;
void ElimateCorruption() override;
template <typename T> void RestoreX1();
protected:
int64_t growth_rate;
void RunOnDevice() override;
template <typename T> void RunWithType();
};
} // namespace dragon
#endif // DRAGON_OPERATORS_VISION_DENSE_CONCAT_OP_H_
\ No newline at end of file
#endif // DRAGON_OPERATORS_ARITHMETIC_SQRT_OP_H_
\ No newline at end of file
......@@ -19,7 +19,7 @@ namespace dragon {
template <class Context>
class SquareOp final : public Operator<Context> {
public:
public:
USE_SIMPLE_CTOR_DTOR(SquareOp);
USE_OPERATOR_FUNCTIONS;
......@@ -29,7 +29,7 @@ public:
template <class Context>
class SquareGradientOp final : public Operator<Context> {
public:
public:
USE_SIMPLE_CTOR_DTOR(SquareGradientOp);
USE_OPERATOR_FUNCTIONS;
......
......@@ -37,7 +37,7 @@ class SigmoidFocalLossOp
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
template <typename T> void RunWithType();
template <typename Tx, typename Ty> void RunWithType();
protected:
float alpha, gamma, pos_alpha, neg_alpha;
......@@ -66,7 +66,7 @@ class SigmoidFocalLossGradientOp
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
template <typename T> void RunWithType();
template <typename Tx, typename Ty> void RunWithType();
protected:
float alpha, gamma, pos_alpha, neg_alpha;
......
......@@ -37,7 +37,7 @@ class SoftmaxFocalLossOp
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
template <typename T> void RunWithType();
template <typename Tx, typename Ty> void RunWithType();
protected:
float alpha, gamma, pos_alpha, neg_alpha;
......@@ -66,7 +66,7 @@ class SoftmaxFocalLossGradientOp
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
template <typename T> void RunWithType();
template <typename Tx, typename Ty> void RunWithType();
protected:
float alpha, gamma, pos_alpha, neg_alpha;
......
......@@ -10,17 +10,17 @@
* ------------------------------------------------------------
*/
#ifndef DRAGON_OPERATORS_MISC_ASTYPE_OP_H_
#define DRAGON_OPERATORS_MISC_ASTYPE_OP_H_
#ifndef DRAGON_OPERATORS_MISC_CAST_OP_H_
#define DRAGON_OPERATORS_MISC_CAST_OP_H_
#include "core/operator.h"
namespace dragon {
template <class Context>
class AsTypeOp final : public Operator<Context> {
class CastOp final : public Operator<Context> {
public:
AsTypeOp(const OperatorDef& def, Workspace* ws)
CastOp(const OperatorDef& def, Workspace* ws)
: Operator<Context>(def, ws),
dtype(OperatorBase::Arg<string>("dtype", "float32")),
inplace(OperatorBase::Arg<bool>("inplace", false)) {}
......@@ -33,6 +33,18 @@ class AsTypeOp final : public Operator<Context> {
bool inplace;
};
template <class Context>
class CastGradientOp final : public Operator<Context> {
public:
USE_SIMPLE_CTOR_DTOR(CastGradientOp);
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
protected:
string dtype;
};
} // namespace dragon
#endif // DRAGON_OPERATORS_MISC_ASTYPE_OP_H_
\ No newline at end of file
#endif // DRAGON_OPERATORS_MISC_CAST_OP_H_
\ No newline at end of file
......@@ -128,7 +128,7 @@ public:
template <class Context>
class TruncatedNormalOp final : public InitializeOp<Context> {
public:
public:
TruncatedNormalOp(const OperatorDef& def, Workspace* ws)
: InitializeOp<Context>(def, ws) {
this->filler_proto.set_type("truncated_normal");
......
......@@ -25,8 +25,7 @@ class AdamUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override;
void ComputeRunWithFloat16() override;
void ComputeUpdates(Tensor* dX) override;
protected:
int t; float lr, beta1, beta2, eps;
......
......@@ -75,7 +75,6 @@ class CollectiveUpdateOp final : public Operator<Context> {
#ifdef WITH_NCCL
ncclComm_t nccl_comm;
CUDAClosure<Context> closure;
#endif
};
......
......@@ -25,8 +25,7 @@ class NesterovUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override;
void ComputeRunWithFloat16() override;
void ComputeUpdates(Tensor* dX) override;
protected:
float lr, momentum;
......
......@@ -25,8 +25,7 @@ class RMSPropUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override;
void ComputeRunWithFloat16() override;
void ComputeUpdates(Tensor* dX) override;
protected:
float lr, decay, eps;
......
......@@ -26,8 +26,7 @@ class SGDUpdateOp final : public UpdateOpBase<Context> {
USE_OPERATOR_FUNCTIONS;
USE_UPDATER_FUNCTIONS(Context);
void ComputeRunWithFloat32() override;
void ComputeRunWithFloat16() override;
void ComputeUpdates(Tensor* dX) override;
protected:
float old_lr, lr, momentum, correction;
......
......@@ -24,29 +24,29 @@ class UpdateOpBase : public Operator<Context> {
: Operator<Context>(def, ws),
lr_mult(OperatorBase::Arg<float>("lr_mult", 1.f)),
decay_mult(OperatorBase::Arg<float>("decay_mult", 1.f)),
slot(OperatorBase::Arg<string>("slot", "")),
zero_grad(OperatorBase::Arg<bool>("zero_grad", true)) {
slot(OperatorBase::Arg<string>("slot", "")) {
CHECK(!slot.empty()) << "\nRequired a non-empty slot";
}
USE_OPERATOR_FUNCTIONS;
string Slot() { return slot + "/" + Output(0)->name(); }
float Param(const string& name) const;
string Slot();
void RunOnDevice() override;
template <typename T> void PreprocessRunWithType();
template <typename T>
void ProcessGradients(Tensor* dX, Tensor* X);
virtual void ComputeRunWithFloat32() = 0;
virtual void ComputeRunWithFloat16() = 0;
virtual void ComputeUpdates(Tensor* dX) = 0;
void UpdateRunWithFloat32();
void UpdateRunWithFloat16();
template <typename T>
void ApplyUpdates(Tensor* dX, Tensor* X);
void RunOnDevice() override;
protected:
float lr_mult, decay_mult;
float l2_decay, clip_thresh, scale_factor;
string slot;
bool zero_grad;
};
#define USE_UPDATER_FUNCTIONS(context) \
......
......@@ -88,6 +88,7 @@ class CuDNNConv2dOp final : public Conv2dOp<Context> {
}
void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc();
template <typename T> void RunWithType();
......@@ -101,7 +102,7 @@ class CuDNNConv2dOp final : public Conv2dOp<Context> {
cudnnFilterDescriptor_t filter_desc;
size_t fwd_data_size;
int64_t cudnn_group;
vector<int64_t> input_dims;
vector<int64_t> input_dims, filter_dims;
bool enable_tensor_core;
};
......@@ -142,6 +143,7 @@ class CuDNNConv2dGradientOp final : public Conv2dGradientOp<Context> {
}
void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc();
template <typename T> void RunWithType();
......@@ -156,7 +158,7 @@ class CuDNNConv2dGradientOp final : public Conv2dGradientOp<Context> {
cudnnFilterDescriptor_t filter_desc;
size_t bwd_filter_size, bwd_data_size;
int64_t cudnn_group;
vector<int64_t> input_dims;
vector<int64_t> input_dims, filter_dims;
bool enable_tensor_core;
};
......
......@@ -95,6 +95,7 @@ class CuDNNConvTranspose2dOp final
}
void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc();
template <typename T> void RunWithType();
......@@ -108,7 +109,7 @@ class CuDNNConvTranspose2dOp final
cudnnFilterDescriptor_t filter_desc;
size_t fwd_data_size;
int64_t cudnn_group;
vector<int64_t> input_dims;
vector<int64_t> output_dims, filter_dims;
bool enable_tensor_core;
};
......@@ -152,6 +153,7 @@ public:
}
void RunOnDevice() override;
void SetConvDescFromInputs();
template <typename T> void ResetDesc();
template <typename T> void RunWithType();
......@@ -166,7 +168,7 @@ public:
cudnnFilterDescriptor_t filter_desc;
size_t bwd_filter_size, bwd_data_size;
int64_t cudnn_group;
vector<int64_t> input_dims;
vector<int64_t> output_dims, filter_dims;
bool enable_tensor_core;
};
......
......@@ -55,6 +55,7 @@ class NNResizeGradientOp final : public Operator<Context> {
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
void RunWithFloat16();
template <typename T> void RunWithType();
protected:
......
......@@ -26,7 +26,7 @@ class Pool2dOp : public Operator<Context> {
data_format(OperatorBase::Arg<string>("data_format", "NCHW")),
padding(OperatorBase::Arg<string>("padding", "VALID")),
global_pooling(OperatorBase::Arg<bool>("global_pooling", false)),
ceil_mode(OperatorBase::Arg<bool>("ceil", true)) {
ceil_mode(OperatorBase::Arg<bool>("ceil_mode", true)) {
auto ks = OperatorBase::Args<int64_t>("kernel_shape");
auto s = OperatorBase::Args<int64_t>("strides");
auto p = OperatorBase::Args<int64_t>("pads");
......@@ -68,7 +68,7 @@ class Pool2dGradientOp : public Operator<Context> {
data_format(OperatorBase::Arg<string>("data_format", "NCHW")),
padding(OperatorBase::Arg<string>("padding", "VALID")),
global_pooling(OperatorBase::Arg<bool>("global_pooling", false)),
ceil_mode(OperatorBase::Arg<bool>("ceil", true)) {
ceil_mode(OperatorBase::Arg<bool>("ceil_mode", true)) {
auto ks = OperatorBase::Args<int64_t>("kernel_shape");
auto s = OperatorBase::Args<int64_t>("strides");
auto p = OperatorBase::Args<int64_t>("pads");
......
......@@ -54,6 +54,7 @@ class ROIAlignGradientOp final : public Operator<Context> {
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
void RunWithFloat16();
template <typename T> void RunWithType();
protected:
......
......@@ -49,6 +49,7 @@ class ROIPoolGradientOp final : public Operator<Context> {
USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override;
void RunWithFloat16();
template <typename T> void RunWithType();
protected:
......
......@@ -12,7 +12,7 @@ namespace dragon {
template <typename T>
using BlockReduce = cub::BlockReduce<T, CUDA_THREADS>;
}
} // namespace dragon
#endif // WITH_CUDA
......
......@@ -102,7 +102,7 @@ template <typename T, class Context>
void Set(
const int n,
const T alpha,
T* x,
T* y,
Context* ctx);
template <typename T, class Context>
......@@ -122,6 +122,15 @@ void Axpy(
Context* ctx);
template<typename T, class Context>
void Axpby(
const int n,
const float alpha,
const T* x,
const float beta,
T* y,
Context* ctx);
template<typename T, class Context>
void AddScalar(
const int n,
const float alpha,
......@@ -141,17 +150,8 @@ void AddScalar(
template<typename T, class Context>
void InvStd(
const int n,
float eps,
const T* x,
T* y,
Context* ctx);
template<typename T, class Context>
void Axpby(
const int n,
float alpha,
const float eps,
const T* x,
float beta,
T* y,
Context* ctx);
......
......@@ -378,8 +378,8 @@ void NLLLoss(
const Tx* log_prob,
const Ty* labels,
const int* ignores,
float* losses,
float* flags,
Tx* losses,
int* flags,
Context* ctx);
template <typename Tx, typename Ty, class Context>
......@@ -392,7 +392,7 @@ void NLLLossGrad(
const Ty* labels,
const int* ignores,
Tx* dx,
float* flags,
int* flags,
Context* ctx);
/*! loss.sigmoid_ce_loss */
......@@ -403,7 +403,7 @@ void SigmoidCrossEntropy(
const T* logits,
const T* targets,
T* losses,
T* flags,
int* flags,
Context* ctx);
template <typename T, class Context>
......@@ -412,12 +412,12 @@ void SigmoidCrossEntropyGrad(
const T* logits,
const T* targets,
T* dlogits,
T* flags,
int* flags,
Context* ctx);
/*! loss.sigmoid_focal_loss */
template <typename T, class Context>
template <typename Tx, typename Ty, class Context>
void SigmoidFocalLoss(
const int outer_dim,
const int axis_dim,
......@@ -426,13 +426,13 @@ void SigmoidFocalLoss(
const float neg_alpha,
const float gamma,
const int neg_id,
const float* logits,
const float* targets,
float* losses,
float* flags,
const Tx* logits,
const Ty* targets,
Tx* losses,
int* flags,
Context* ctx);
template <typename T, class Context>
template <typename Tx, typename Ty, class Context>
void SigmoidFocalLossGrad(
const int outer_dim,
const int axis_dim,
......@@ -441,10 +441,10 @@ void SigmoidFocalLossGrad(
const float neg_alpha,
const float gamma,
const int neg_id,
const float* logits,
const float* targets,
float* dlogits,
float* flags,
const Tx* logits,
const Ty* targets,
Tx* dlogits,
int* flags,
Context* ctx);
/*! loss.smooth_l1_loss */
......@@ -477,7 +477,7 @@ void SoftmaxCrossEntropy(
/*! loss.softmax_focal_loss */
template <typename T, class Context>
template <typename Tx, typename Ty, class Context>
void SoftmaxFocalLoss(
const int outer_dim,
const int axis_dim,
......@@ -487,14 +487,14 @@ void SoftmaxFocalLoss(
const float neg_alpha,
const float gamma,
const int neg_id,
const T* prob,
const T* labels,
const Tx* prob,
const Ty* labels,
const int* ignores,
T* losses,
T* flags,
Tx* losses,
int* flags,
Context* ctx);
template <typename T, class Context>
template <typename Tx, typename Ty, class Context>
void SoftmaxFocalLossGrad(
const int outer_dim,
const int axis_dim,
......@@ -504,11 +504,11 @@ void SoftmaxFocalLossGrad(
const float neg_alpha,
const float gamma,
const int neg_id,
const T* prob,
const T* labels,
const Tx* prob,
const Ty* labels,
const int* ignores,
T* dx,
T* flags,
Tx* dx,
int* flags,
Context* ctx);
/*! loss.sparse_softmax_cross_entropy */
......@@ -522,8 +522,8 @@ void SparseSoftmaxCrossEntropy(
const Tx* prob,
const Ty* labels,
const int* ignores,
float* losses,
float* flags,
Tx* losses,
int* flags,
Context* ctx);
template <typename Tx, typename Ty, class Context>
......@@ -536,7 +536,7 @@ void SparseSoftmaxCrossEntropyGrad(
const Ty* labels,
const int* ignores,
Tx* dx,
float* flags,
int* flags,
Context* ctx);
/*! misc.astype */
......@@ -548,6 +548,16 @@ void TypeA2B(
Tb* b,
Context* ctx);
/*! misc.gradient */
template <typename T, class Context>
void GradientTwoSum(
const int count,
const T* dy1,
const T* dy2,
T* dx,
Context* ctx);
/*! misc.image_data */
template <typename Tx, typename Ty, class Context>
......@@ -976,11 +986,18 @@ void SGDUpdate(
/*! update.op_base */
template <typename T, class Context>
void MixedPrecisionL2Decay(
const int count,
const float alpha,
const T* w,
float* dx,
Context* ctx);
template <typename T, class Context>
void MixedPrecisionUpdate(
const int count,
const float* updates,
T* w,
T* g,
Context* ctx);
/*! vision.bias_add */
......
......@@ -37,6 +37,20 @@ inline std::vector<std::string> split(
return ret;
}
inline std::string replace_first(
const std::string& str,
const std::string& pattern,
const std::string& replacement) {
size_t pos = 0;
if ((pos = str.find(pattern)) != std::string::npos) {
std::string ret(str);
ret.replace(pos, pattern.size(), replacement);
return ret;
} else {
return str;
}
}
} // namespace str
} // namespace dragon
......
......@@ -269,7 +269,7 @@ void LoadONNXModel(
* *
* * * * * * * * * * * * * * * * * * * * */
void SetLogLevel(const std::string& level) {
void SetLoggingLevel(const std::string& level) {
SetLogDestination(StrToLogSeverity(level));
}
......
......@@ -97,7 +97,7 @@ DRAGON_API std::string CreateGraph(
DRAGON_API void RunGraph(
const std::string& graph_name,
Workspace_t ws,
const int stream_id = 1);
int stream_id = 0);
/* * * * * * * * * * * * * * * * * * * * *
* *
......@@ -156,7 +156,7 @@ DRAGON_API void LoadONNXModel(
* *
* * * * * * * * * * * * * * * * * * * * */
DRAGON_API void SetLogLevel(const std::string& level);
DRAGON_API void SetLoggingLevel(const std::string& level);
} // namespace dragon
......
......@@ -19,95 +19,45 @@ namespace dragon {
namespace python {
PyObject* CreateGradientDefsCC(PyObject* self, PyObject* args) {
PyObject* def_string = nullptr;
PyObject* py_g_outputs = nullptr;
if (!PyArg_ParseTuple(args, "SO!",
&def_string, &PyList_Type, &py_g_outputs)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a serialized string of OperatorDef "
"and a list containing outputs of this GradientOp.");
return nullptr;
}
void AddGradientMethods(pybind11::module& m) {
m.def("CreateGradientDefs", [](
const string& forward_def,
const vector<string>& g_outputs) {
OperatorDef def;
if (!def.ParseFromString(PyBytes_AsStringEx(def_string))) {
PyErr_SetString(PyExc_ValueError,
"Failed to parse the OperatorDef.");
return nullptr;
}
if (!GradientRegistry()->Has(def.type())) {
PyErr_SetString(PyExc_KeyError,
"This Operator does not register GradientOp.");
return nullptr;
}
vector<string> g_outputs;
PyList_AsVecString(py_g_outputs, g_outputs, "ignore");
if (!def.ParseFromString(forward_def))
LOG(FATAL) << "Failed to parse the OperatorDef.";
if (!GradientRegistry()->Has(def.type()))
LOG(FATAL) << def.type() << "Op has no gradients.";
Gradient grad = MakeGradientForOp(def, g_outputs);
PyObject* g_ops = PyList_New(grad.ops.size());
PyObject* g_input = PyList_New(grad.g_inputs.size());
PyObject* g_defaults = PyList_New(grad.defaults.size());
for (int i = 0; i < grad.ops.size(); i++) {
PyObject* e = String_AsPyBytes(grad.ops[i].SerializeAsString());
SetPyList(g_ops, i, e);
}
for (int i = 0; i < grad.g_inputs.size(); i++) {
PyObject* e = String_AsPyUnicode(grad.g_inputs[i]);
SetPyList(g_input, i, e);
}
for (int i = 0; i < grad.defaults.size(); i++) {
PyObject* e = PyFloat_FromDouble(grad.defaults[i]);
SetPyList(g_defaults, i, e);
}
PyObject* pack = PyTuple_Pack(3, g_ops, g_input, g_defaults);
Py_XDECREF(g_ops);
Py_XDECREF(g_input);
Py_XDECREF(g_defaults);
return pack;
}
vector<pybind11::bytes> grad_ops;
for (const auto& e : grad.ops)
grad_ops.push_back(e.SerializeAsString());
return std::tuple<
vector<pybind11::bytes>, vector<string>, vector<float>
>(grad_ops, grad.g_inputs, grad.defaults);
});
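As a quick illustration, a minimal sketch of calling the new binding from Python (the ``Relu`` def and gradient names are hypothetical; the import path follows ``dragon.import_c_api`` as used elsewhere in this commit):

import dragon.import_c_api as C
from dragon.proto import dragon_pb2 as pb

# A hypothetical forward def whose gradient defs we request
fwd = pb.OperatorDef(type='Relu', input=['x'], output=['y'], name='relu1')
# One gradient name per output of the forward op
g_ops, g_inputs, defaults = C.CreateGradientDefs(
    fwd.SerializeToString(), ['y_grad'])
for s in g_ops:
    bwd = pb.OperatorDef()
    bwd.ParseFromString(s)  # each element is a serialized OperatorDef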
PyObject* RunGradientFlowCC(PyObject* self, PyObject* args) {
PyObject* py_fp_ops, *py_targets;
PyObject* py_input_grads, *py_ignore_grads;
PyObject* py_share_grads, *py_export_graph;
if (!PyArg_ParseTuple(args, "OOOOOO",
&py_fp_ops, &py_targets,
&py_input_grads, &py_ignore_grads,
&py_share_grads, &py_export_graph)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a list of serialized input ops, targets, "
"input grads, ignore grads and whehter to share grads or log graph.");
return nullptr;
}
// Make -> Optimize -> Run
vector<string> targets, input_grads, ignore_grads;
PyList_AsVecString(py_targets, targets, "");
PyList_AsVecString(py_input_grads, input_grads, "");
PyList_AsVecString(py_ignore_grads, ignore_grads, "");
GraphDef fp_ops, bp_ops;
if (!fp_ops.ParseFromString(PyBytes_AsStringEx(py_fp_ops))) {
PyErr_SetString(PyExc_RuntimeError,
"Failed to parse the GraphDef of forward ops.");
return nullptr;
}
m.def("FlowGradients", [](
const vector<OperatorDef*>& forward_ops,
const vector<string>& targets,
const vector<string>& input_grads,
const vector<string>& ignore_grads,
const bool is_sharing,
const bool verbose) {
// Make => Optimize => Run
GraphDef backward_ops;
GraphGradientMaker maker;
for (auto& grad : input_grads) maker.AddExternalGrad(grad);
for (auto& grad : ignore_grads) maker.AddIgnoreGrad(grad);
maker.Make(fp_ops, targets, bp_ops);
bool share_grads = PyObject_IsTrue(py_share_grads) ? true : false;
bool export_graph = PyObject_IsTrue(py_export_graph) ? true : false;
if (share_grads) maker.Share("/share/buffer/grads", bp_ops);
if (export_graph) {
Tensor* tensor = ws()->CreateTensor(
"/graph_def/dynamic/gradient_flow")->Reshape({ 1 });
string* data = tensor->mutable_data<string, CPUContext>();
data[0] = bp_ops.SerializeAsString();
tensor = ws()->CreateTensor(
"/graph_def/dynamic/forward_flow")->Reshape({ 1 });
data = tensor->mutable_data<string, CPUContext>();
data[0] = fp_ops.SerializeAsString();
maker.Make(forward_ops, targets, backward_ops);
if (is_sharing) maker.Share(backward_ops);
pybind11::gil_scoped_release g;
for (auto& op : backward_ops.op()) {
if (verbose) std::cout << op.DebugString() << std::endl;
if (op.has_uid()) ws()->RunOperator(op);
else ws()->RunOperatorOnce(op);
}
for (auto& op : bp_ops.op()) ws()->RunOperator(op);
Py_RETURN_TRUE;
});
}
} // namespace python
......
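A hedged sketch of driving the new ``FlowGradients`` binding from Python; the op and tensor names are hypothetical, and ``C.OperatorDef`` refers to the pybind11-wrapped def added in py_proto.h below:

import dragon.import_c_api as C

# Build a hypothetical forward op on the C++ side
op = C.OperatorDef()
op.type, op.input, op.output = 'Relu', ['x'], ['y']
C.FlowGradients(
    [op],    # forward ops
    ['y'],   # solving targets
    [],      # external input grads
    [],      # grads to ignore
    True,    # share the gradient buffers
    False)   # verbose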
......@@ -19,15 +19,10 @@ namespace dragon {
namespace python {
inline PyObject* SetLogLevelCC(PyObject* self, PyObject* args) {
char* cname;
if (!PyArg_ParseTuple(args, "s", &cname)) {
PyErr_SetString(PyExc_ValueError,
"Excepted the logging level.");
return nullptr;
}
SetLogDestination(StrToLogSeverity(string(cname)));
Py_RETURN_TRUE;
void AddConfigMethods(pybind11::module& m) {
m.def("SetLoggingLevel", [](const string& level) {
SetLogDestination(StrToLogSeverity(level));
});
}
} // namespace python
......
......@@ -19,15 +19,34 @@ namespace python {
#include "py_dragon.h"
inline PyObject* IsCUDADriverSufficientCC(PyObject* self, PyObject* args) {
void AddCUDAMethods(pybind11::module& m) {
m.def("IsCUDADriverSufficient", []() {
#ifdef WITH_CUDA
int count;
cudaError_t err = cudaGetDeviceCount(&count);
if (err == cudaErrorInsufficientDriver) return PyBool_FromLong(0);
return PyBool_FromLong(1);
if (err == cudaErrorInsufficientDriver) return false;
return true;
#else
return PyBool_FromLong(0);
return false;
#endif
});
m.def("cudaGetDevice", []() {
return CUDAContext::active_device_id();
});
m.def("cudaStreamSynchronize", [](
int device_id, int stream_id) {
#ifdef WITH_CUDA
if (device_id < 0) device_id =
CUDAContext::active_device_id();
cudaStreamSynchronize(CUDAContext::cuda_object()
->GetStream(device_id, stream_id));
cudaError_t error = cudaGetLastError();
CHECK_EQ(error, cudaSuccess)
<< "\nCUDA Error: " << cudaGetErrorString(error);
#endif
});
}
} // namespace python
......
......@@ -13,8 +13,9 @@
#ifndef DRAGON_PYTHON_PY_DRAGON_H_
#define DRAGON_PYTHON_PY_DRAGON_H_
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include "py_types.h"
#include "py_macros.h"
#include "core/common.h"
#include "core/registry.h"
#include "core/context.h"
......@@ -25,6 +26,9 @@
#include "core/workspace.h"
#include "utils/caffemodel.h"
#include <pybind11/stl.h>
#include <pybind11/pybind11.h>
namespace dragon {
namespace python {
......@@ -32,19 +36,20 @@ namespace python {
class TensorFetcherBase {
public:
virtual ~TensorFetcherBase() {}
virtual PyObject* Fetch(const Tensor& tensor) = 0;
virtual pybind11::object Fetch(const Tensor& tensor) = 0;
};
class TensorFeederBase {
public:
virtual ~TensorFeederBase() {}
virtual PyObject* Feed(
virtual void Feed(
const DeviceOption& option,
PyArrayObject* array,
Tensor* tensor) = 0;
};
DECLARE_TYPED_REGISTRY(TensorFetcherRegistry, TypeId, TensorFetcherBase);
#define REGISTER_TENSOR_FETCHER(type, ...) \
REGISTER_TYPED_CLASS(TensorFetcherRegistry, type, __VA_ARGS__)
......@@ -53,62 +58,58 @@ inline TensorFetcherBase* CreateFetcher(TypeId type) {
}
DECLARE_TYPED_REGISTRY(TensorFeederRegistry, TypeId, TensorFeederBase);
#define REGISTER_TENSOR_FEEDER(type, ...) \
REGISTER_TYPED_CLASS(TensorFeederRegistry, type, __VA_ARGS__)
class NumpyFetcher : public TensorFetcherBase {
public:
PyObject* Fetch(const Tensor& tensor) override {
pybind11::object Fetch(const Tensor& tensor) override {
CHECK_GT(tensor.count(), 0);
vector<npy_intp> npy_dims;
for (const auto dim : tensor.dims()) npy_dims.push_back(dim);
int npy_type = TypeMetaToNPY(tensor.meta());
if (npy_type == -1) {
string s = "The data type of Tensor(" +
LOG(FATAL) << "The data type of Tensor(" +
tensor.name() + ") is unknown. Have you solved it ?";
PyErr_SetString(PyExc_RuntimeError, s.c_str());
return nullptr;
}
CHECK(tensor.memory()) << "\nIllegal memory access.";
// Create an empty array with the same shape
PyObject* array = PyArray_SimpleNew(
tensor.ndim(), npy_dims.data(), npy_type);
// Copy the tensor data to the numpy array
if (tensor.memory_state() == MixedMemory::STATE_AT_CUDA) {
CUDAContext::Memcpy<CPUContext, CUDAContext>(tensor.nbytes(),
CUDAContext::MemcpyEx<CPUContext, CUDAContext>(tensor.nbytes(),
PyArray_DATA(reinterpret_cast<PyArrayObject*>(array)),
tensor.raw_data<CUDAContext>());
tensor.raw_data<CUDAContext>(),
tensor.memory()->device_id());
} else {
CPUContext::Memcpy<CPUContext, CPUContext>(tensor.nbytes(),
PyArray_DATA(reinterpret_cast<PyArrayObject*>(array)),
tensor.raw_data<CPUContext>());
}
return array;
return pybind11::reinterpret_steal<pybind11::object>(array);
}
};
class StringFetcher : public TensorFetcherBase {
public:
PyObject* Fetch(const Tensor& tensor) override {
CHECK_GT(tensor.count(), 0);
return String_AsPyBytes(*tensor.data<string, CPUContext>());
pybind11::object Fetch(const Tensor& tensor) override {
CHECK_EQ(tensor.count(), 1);
return pybind11::bytes(tensor.data<string, CPUContext>()[0]);
}
};
class NumpyFeeder : public TensorFeederBase {
public:
PyObject* Feed(
void Feed(
const DeviceOption& option,
PyArrayObject* original_array,
Tensor* tensor) override {
PyArrayObject* array = PyArray_GETCONTIGUOUS(original_array);
const TypeMeta& meta = TypeNPYToMeta(PyArray_TYPE(array));
if (meta.id() == 0) {
PyErr_SetString(PyExc_TypeError, "Unsupported data type.");
return nullptr;
}
if (meta.id() != tensor->meta().id() && tensor->meta().id() != 0)
LOG(WARNING) << "Feed Tensor(" << tensor->name() << ")"
<< " with different data type from original one.";
if (meta.id() == 0) LOG(FATAL) << "Unsupported data type.";
tensor->SetMeta(meta);
int ndim = PyArray_NDIM(array);
npy_intp* npy_dims = PyArray_DIMS(array);
vector<int64_t> dims;
......@@ -116,21 +117,22 @@ class NumpyFeeder : public TensorFeederBase {
tensor->Reshape(dims);
if (option.device_type() == PROTO_CUDA) {
#ifdef WITH_CUDA
CUDAContext context(option);
context.SwitchToDevice();
auto* data = tensor->raw_mutable_data<CUDAContext>(meta);
context.Memcpy<CUDAContext, CPUContext>(tensor->nbytes(),
data, static_cast<void*>(PyArray_DATA(array)));
CUDAContext::MemcpyEx<CUDAContext, CPUContext>(
tensor->nbytes(),
tensor->raw_mutable_data<CUDAContext>(),
static_cast<void*>(PyArray_DATA(array)),
option.device_id());
#else
LOG(FATAL) << "CUDA was not compiled.";
#endif
} else {
auto* data = tensor->raw_mutable_data<CPUContext>(meta);
CPUContext::Memcpy<CPUContext, CPUContext>(tensor->nbytes(),
data, static_cast<void*>(PyArray_DATA(array)));
auto* data = tensor->raw_mutable_data<CPUContext>();
CPUContext::Memcpy<CPUContext, CPUContext>(
tensor->nbytes(),
tensor->raw_mutable_data<CPUContext>(),
static_cast<void*>(PyArray_DATA(array)));
}
Py_XDECREF(array);
Py_RETURN_TRUE;
}
};
......
......@@ -19,66 +19,41 @@ namespace dragon {
namespace python {
inline PyObject* CreateGraphCC(PyObject* self, PyObject* args) {
PyObject* graph_str, *verbose;
if (!PyArg_ParseTuple(args, "S|O", &graph_str, &verbose)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a serialized string of GraphDef.");
return nullptr;
}
if (verbose == nullptr) verbose = Py_False;
void AddGraphMethods(pybind11::module& m) {
/*! \brief Create a graph from the serialized def */
m.def("CreateGraph", [](
const string& serialized,
const bool verbose) {
GraphDef graph_def;
if (!graph_def.ParseFromString(PyBytes_AsStringEx(graph_str))) {
PyErr_SetString(PyExc_RuntimeError,
"Failed to parse the GraphDef.");
return nullptr;
}
if (!graph_def.ParseFromString(serialized))
LOG(FATAL) << "Failed to parse the GraphDef.";
auto* graph = ws()->CreateGraph(graph_def);
if (!graph) {
PyErr_SetString(PyExc_RuntimeError,
"Failed to create the Graph.");
return nullptr;
} else {
if (verbose) {
// It is not a good design to print the debug string
if (PyObject_IsTrue(verbose) ? true : false) {
auto* graph_tensor = ws()->CreateTensor(
"/graph_def/optimized/" + graph->name());
if (graph_tensor->count() > 0) {
auto* data = graph_tensor->mutable_data<string, CPUContext>();
std::cout << data[0] << std::endl;
}
}
}
// The returned graph name may differ from the given def,
// as a unique dummy name is generated when creating the graph
return String_AsPyUnicode(graph->name());
}
inline PyObject* RunGraphCC(PyObject* self, PyObject* args) {
char* cname, *include, *exclude;
if (!PyArg_ParseTuple(args, "sss",
&cname, &include, &exclude)) {
PyErr_SetString(PyExc_ValueError,
"Excepted the graph name, include and exclude rules.");
return nullptr;
}
ws()->RunGraph(
string(cname),
string(include),
string(exclude)
);
Py_RETURN_TRUE;
}
inline PyObject* GraphsCC(PyObject* self, PyObject* args) {
vector<string> graphs = ws()->GetGraphs();
PyObject* list = PyList_New(graphs.size());
for (int i = 0; i < graphs.size(); i++)
CHECK_EQ(PyList_SetItem(list, i, String_AsPyUnicode(graphs[i])), 0);
return list;
return graph->name();
});
/*! \brief Run an existing graph */
m.def("RunGraph", [](
const string& name,
const string& include,
const string& exclude) {
pybind11::gil_scoped_release g;
ws()->RunGraph(name, include, exclude);
});
/*! \brief List all of the existing graphs */
m.def("Graphs", []() { ws()->GetGraphs(); });
}
} // namespace python
......
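For reference, a minimal sketch of the Python side (assuming ``GraphDef`` from ``dragon_pb2``; the graph here is empty and purely illustrative):

import dragon.import_c_api as C
from dragon.proto import dragon_pb2 as pb

graph_def = pb.GraphDef(name='demo')  # a hypothetical, empty graph
# The returned name may differ from the def (a unique dummy name)
graph_name = C.CreateGraph(graph_def.SerializeToString(), False)
C.RunGraph(graph_name, '', '')  # empty include/exclude rules
print(C.Graphs())               # list the existing graphs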
......@@ -19,48 +19,42 @@ namespace dragon {
namespace python {
inline PyObject* SnapshotCC(PyObject* self, PyObject* args) {
char* path; int format;
PyObject* names; vector<Tensor*> tensors;
if (!PyArg_ParseTuple(args, "sOi", &path, &names, &format)) {
PyErr_SetString(PyExc_ValueError,
"Excepted the model path, tensors, and data format.");
return nullptr;
}
void AddIOMethods(pybind11::module& m) {
m.def("Snapshot", [](
const string& filename,
vector<string>& names,
const int format) {
vector<Tensor*> tensors;
switch (format) {
case 0: // Pickle
PyErr_SetString(PyExc_NotImplementedError,
"Format depends on Pickle. Can't be used in C++.");
LOG(FATAL) << "Format depends on Pickle. "
"Can't be used in C++.";
break;
case 1: // CaffeModel
for (int i = 0; i < PyList_Size(names); i++)
tensors.push_back(ws()->GetTensor(
PyString_AsString(PyList_GetItem(names, i))));
SavaCaffeModel(path, tensors);
for (const auto& e : names)
tensors.emplace_back(ws()->GetTensor(e));
SavaCaffeModel(filename, tensors);
break;
default: LOG(FATAL) << "Unknwon format, code: " << format;
default:
LOG(FATAL) << "Unknwon format, code: " << format;
}
Py_RETURN_TRUE;
}
});
inline PyObject* RestoreCC(PyObject* self, PyObject* args) {
char* path; int format;
if (!PyArg_ParseTuple(args, "si", &path, &format)) {
PyErr_SetString(PyExc_ValueError,
"Excepted the model path and data format.");
return nullptr;
}
m.def("Restore", [](
const string& filename,
const int format) {
switch (format) {
case 0: // Pickle
PyErr_SetString(PyExc_NotImplementedError,
"Format depends on Pickle. Can't be used in C++.");
LOG(FATAL) << "Format depends on Pickle. "
"Can't be used in C++.";
break;
case 1: // CaffeModel
LoadCaffeModel(path, ws());
LoadCaffeModel(filename, ws());
break;
default: LOG(FATAL) << "Unknwon format, code: " << format;
default:
LOG(FATAL) << "Unknwon format, code: " << format;
}
Py_RETURN_TRUE;
});
}
} // namespace python
......
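A short usage sketch with hypothetical tensor names; format code ``1`` selects the CaffeModel path, while ``0`` (Pickle) is rejected on the C++ side:

import dragon.import_c_api as C

C.Snapshot('/tmp/net.caffemodel',
           ['conv1/weight', 'conv1/bias'],  # hypothetical tensor names
           1)  # 1 = CaffeModel
C.Restore('/tmp/net.caffemodel', 1)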
/*!
* Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
*
* Licensed under the BSD 2-Clause License.
* You should have received a copy of the BSD 2-Clause License
* along with the software. If not, See,
*
* <https://opensource.org/licenses/BSD-2-Clause>
*
* ------------------------------------------------------------
*/
#ifndef DRAGON_PYTHON_PY_MACROS_H_
#define DRAGON_PYTHON_PY_MACROS_H_
#include <string>
#include <sstream>
#include <Python.h>
#include <numpy/arrayobject.h>
namespace dragon {
namespace python {
#ifdef WITH_PYTHON3
#define PyInt_FromLong PyLong_FromLong
#define _PyInt_AsInt _PyLong_AsInt
#define PyString_AsString PyUnicode_AsUTF8
#endif
/*!
* ------------------------------------------------------------
*
* <Having Fun with PyString>
*
* For Python3, Get/Return PyUnicode for regular string.
* For Python3, Get/Return PyBytes for google-protobuf.
* For Python2, Get/Return PyBytes only.
*
* ------------------------------------------------------------
*/
#define PyBytes_AsStringEx(pystring) \
std::string(PyBytes_AsString(pystring), PyBytes_Size(pystring))
// Return string to Python
inline PyObject* String_AsPyBytes(const std::string& cstring) {
return PyBytes_FromStringAndSize(cstring.c_str(), cstring.size());
}
inline PyObject* String_AsPyUnicode(const std::string& cstring) {
#ifdef WITH_PYTHON3
return PyUnicode_FromStringAndSize(cstring.c_str(), cstring.size());
#else
return PyBytes_FromStringAndSize(cstring.c_str(), cstring.size());
#endif
}
// Macros
#define PyList_AsVecString(plist, vs, defaults) \
for (int i = 0; i < PyList_Size(plist); i++) { \
PyObject* e = PyList_GetItem(plist, i); \
if (e == Py_None) vs.emplace_back(defaults); \
else vs.push_back(PyString_AsString(PyObject_Str(e))); \
}
#define SetPyList(plist, ix, e) \
PyList_SetItem(plist, ix, e)
#define SetPyDictS2S(object, key, value) \
PyDict_SetItemString(object, key, Py_BuildValue("s", value))
#define SetPyDictS2I(object, key, value) \
PyDict_SetItemString(object, key, Py_BuildValue("i", value))
// Misc
template <typename T>
inline void MakeStringInternal(std::stringstream& ss, const T& t) { ss << t; }
template <typename T,typename ... Args>
inline void MakeStringInternal(std::stringstream& ss, const T& t, const Args& ... args) {
MakeStringInternal(ss, t);
MakeStringInternal(ss, args...);
}
template <typename ... Args>
std::string MakeString(const Args&... args) {
std::stringstream ss;
MakeStringInternal(ss, args...);
return std::string(ss.str());
}
inline void PrErr_SetString(PyObject* type, const std::string& str) {
PyErr_SetString(type, str.c_str());
}
} // namespace python
} // namespace dragon
#endif // DRAGON_PYTHON_PY_MACROS_H_
\ No newline at end of file
......@@ -15,88 +15,94 @@
#include "py_dragon.h"
#ifdef WITH_MPI
#include <mpi.h>
#endif
namespace dragon {
namespace python {
void AddMPIMethods(pybind11::module& m) {
m.def("MPIInit", []() {
#ifdef WITH_MPI
#include <mpi.h>
inline PyObject* MPIInitCC(PyObject* self, PyObject* args) {
// Enabling multi-threading for Python is mostly meaningless,
// but we still keep this interface here
int thread_type;
char* mt_is_required = nullptr;
mt_is_required = getenv("DRAGON_MPI_THREADS_ENABLE");
if (mt_is_required != nullptr && string(mt_is_required) == "1") {
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &thread_type);
CHECK_EQ(thread_type, MPI_THREAD_MULTIPLE)
<< "\nRequire to enable <MPI_THREAD_MULTIPLE> support.";
Py_RETURN_TRUE;
}
inline PyObject* MPIFinalizeCC(PyObject* self, PyObject* args) {
MPI_Finalize();
Py_RETURN_TRUE;
}
} else {
MPI_Init_thread(NULL, NULL, MPI_THREAD_SINGLE, &thread_type);
}
#else
LOG(FATAL) << "MPI was not compiled.";
#endif
});
inline PyObject* MPIRankCC(PyObject* self, PyObject* args) {
m.def("MPIRank", []() {
#ifdef WITH_MPI
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
return PyInt_FromLong(world_rank);
}
return world_rank;
#else
LOG(FATAL) << "MPI was not compiled.";
#endif
});
inline PyObject* MPISizeCC(PyObject* self, PyObject* args) {
m.def("MPISize", []() {
#ifdef WITH_MPI
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
return PyInt_FromLong(world_size);
}
inline PyObject* MPICreateGroupCC(PyObject* self, PyObject* args) {
PyObject *incl, *excl, *ret;
int local_root, world_size;
if (!PyArg_ParseTuple(args, "iOO", &local_root, &incl, &excl)) {
PyErr_SetString(PyExc_ValueError,
"Excepted the local root, include and exclued list.");
return nullptr;
}
return world_size;
#else
LOG(FATAL) << "MPI was not compiled.";
#endif
});
m.def("MPICreateGroup", [](
const int local_root,
const vector<int>& incl,
const vector<int>& excl) {
#ifdef WITH_MPI
int world_size;
MPI_Group world_group, local_group;
MPI_Comm local_comm;
int err_code;
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
set<int> all_ranks;
for (int i = 0; i < world_size; i++) all_ranks.insert(i);
local_group = world_group;
// Check include ranks
int size = (int)PyList_Size(incl);
if (size > 0) {
// Check include ranks
if (!incl.empty()) {
all_ranks.clear();
unique_ptr<int> incl_ranks(new int[size]);
int* ranks = incl_ranks.get();
for (int i = 0; i < size; i++) {
ranks[i] = _PyInt_AsInt(PyList_GetItem(incl, i));
all_ranks.insert(ranks[i]);
}
err_code = MPI_Group_incl(world_group, size, ranks, &local_group);
CHECK(err_code == MPI_SUCCESS) << "\nFail to create mpi group.";
for (auto e : incl) all_ranks.insert(e);
err_code = MPI_Group_incl(world_group,
(int)incl.size(), incl.data(), &local_group);
CHECK(err_code == MPI_SUCCESS)
<< "\nFail to create MPI Group.";
}
// Check exclude ranks
size = (int)PyList_Size(excl);
if (size > 0) {
if (!excl.empty()) {
all_ranks.clear(); Set<int> tmp;
unique_ptr<int> excl_ranks(new int[size]);
int* ranks = excl_ranks.get();
for (int i = 0; i < size; i++) {
ranks[i] = _PyInt_AsInt(PyList_GetItem(excl, i));
tmp.insert(ranks[i]);
}
for (auto e : excl) tmp.insert(e);
for (int i = 0; i < world_size; i++)
if (!tmp.count(i)) all_ranks.insert(i);
err_code = MPI_Group_excl(world_group, size, ranks, &local_group);
CHECK(err_code == MPI_SUCCESS) << "Fail to create mpi group.";
err_code = MPI_Group_excl(world_group,
(int)excl.size(), excl.data(), &local_group);
CHECK(err_code == MPI_SUCCESS)
<< "\nFail to create MPI Group.";
}
err_code = MPI_Comm_create(MPI_COMM_WORLD, local_group, &local_comm);
CHECK(err_code == MPI_SUCCESS) << "Fail to create mpi group.";
CHECK(err_code == MPI_SUCCESS) << "\nFail to create MPI Group.";
if (local_comm != MPI_COMM_NULL) {
int world_rank, local_size;
......@@ -115,25 +121,20 @@ inline PyObject* MPICreateGroupCC(PyObject* self, PyObject* args) {
LOG(INFO) << log_info;
}
}
ret = PyList_New(2);
PyList_SetItem(ret, 0, PyInt_FromLong((long)local_comm));
PyList_SetItem(ret, 1, PyInt_FromLong((long)local_group));
return ret;
}
#else // WITH_MPI
#define MPI_NOT_IMPLEMENTED \
LOG(FATAL) << "MPI was not compiled."; \
Py_RETURN_TRUE
inline PyObject* MPIInitCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; }
inline PyObject* MPIFinalizeCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; }
inline PyObject* MPIRankCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; }
inline PyObject* MPISizeCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; }
inline PyObject* MPICreateGroupCC(PyObject* self, PyObject* args) { MPI_NOT_IMPLEMENTED; }
return vector<long>({ (long)local_comm, (long)local_group });
#else
LOG(FATAL) << "MPI was not compiled.";
#endif
});
#endif // WITH_MPI
m.def("MPIFinalize", []() {
#ifdef WITH_MPI
MPI_Finalize();
#else
LOG(FATAL) << "MPI was not compiled.";
#endif
});
}
} // namespace python
......
/*!
 * Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
 *
 * Licensed under the BSD 2-Clause License.
 * You should have received a copy of the BSD 2-Clause License
 * along with the software. If not, See,
 *
 * <https://opensource.org/licenses/BSD-2-Clause>
 *
 * ------------------------------------------------------------
 */
#ifndef DRAGON_PYTHON_PY_ONNX_H_
#define DRAGON_PYTHON_PY_ONNX_H_
......@@ -19,13 +21,9 @@ namespace dragon {
namespace python {
inline PyObject* ImportONNXModelCC(PyObject* self, PyObject* args) {
char* model_path;
if (!PyArg_ParseTuple(args, "s", &model_path)) {
PyErr_SetString(PyExc_ValueError,
"Excepted the model path.");
return nullptr;
}
void AddONNXMethods(pybind11::module& m) {
m.def("ImportONNXModel", [](
const string& model_path) {
GraphDef init_graph, pred_graph;
onnx::ONNXBackend onnx_backend;
onnx_backend.Prepare(model_path, &init_graph, &pred_graph);
......@@ -33,7 +31,8 @@ inline PyObject* ImportONNXModelCC(PyObject* self, PyObject* args) {
// We should apply the initializer immediately
ws()->CreateGraph(init_graph);
ws()->RunGraph(init_graph.name(), "", "");
return String_AsPyBytes(pred_graph.SerializeAsString());
return pybind11::bytes(pred_graph.SerializeAsString());
});
}
} // namespace python
......
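A minimal sketch of the Python call (the model path is hypothetical; the returned bytes hold the serialized predict ``GraphDef``):

import dragon.import_c_api as C
from dragon.proto import dragon_pb2 as pb

serialized = C.ImportONNXModel('/tmp/model.onnx')  # hypothetical path
pred_graph = pb.GraphDef()
pred_graph.ParseFromString(serialized)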
......@@ -19,91 +19,38 @@ namespace dragon {
namespace python {
inline PyObject* RegisteredOperatorsCC(PyObject* self, PyObject* args) {
set<string> all_keys;
for (const auto& name : CPUOperatorRegistry()->keys()) all_keys.insert(name);
PyObject* list = PyList_New(all_keys.size());
int idx = 0;
for (const string& name : all_keys)
CHECK_EQ(PyList_SetItem(list, idx++, String_AsPyUnicode(name)), 0);
return list;
}
inline PyObject* NoGradientOperatorsCC(PyObject* self, PyObject* args) {
set<string> all_keys;
for (const auto& name : NoGradientRegistry()->keys()) all_keys.insert(name);
PyObject* list = PyList_New(all_keys.size());
int idx = 0;
for (const string& name : all_keys)
CHECK_EQ(PyList_SetItem(list, idx++, String_AsPyUnicode(name)), 0);
return list;
}
inline PyObject* RunOperatorCC(PyObject* self, PyObject* args) {
PyObject* op_str;
if (!PyArg_ParseTuple(args, "S", &op_str)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a serialized string of OperatorDef.");
return nullptr;
}
OperatorDef op_def;
if (!op_def.ParseFromString(PyBytes_AsStringEx(op_str))) {
PyErr_SetString(PyExc_RuntimeError,
"Failed to parse the OperatorDef.");
return nullptr;
}
ws()->RunOperator(op_def);
Py_RETURN_TRUE;
}
inline PyObject* RunOperatorsCC(PyObject* self, PyObject* args) {
PyObject* py_ops;
if (!PyArg_ParseTuple(args, "O", &py_ops)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a list of serialized string of OperatorDef.");
return nullptr;
void AddOperatorMethods(pybind11::module& m) {
/*! \brief Return all the registered operators */
m.def("RegisteredOperators", []() { return CPUOperatorRegistry()->keys(); });
/*! \brief Return all the operators without gradients */
m.def("NoGradientOperators", []() { return NoGradientRegistry()->keys(); });
/*! \brief Run an operator from the def reference */
m.def("RunOperator", [](
OperatorDef* def,
const bool verbose) {
pybind11::gil_scoped_release g;
if (verbose) {
// It is not a good design to print the debug string
std::cout << def->DebugString() << std::endl;
}
OperatorDef op_def;
for (int i = 0; i < PyList_Size(py_ops); i++) {
PyObject* op_str = PyList_GetItem(py_ops, i);
CHECK(op_def.ParseFromString(PyBytes_AsStringEx(op_str)));
ws()->RunOperator(op_def);
}
Py_RETURN_TRUE;
}
inline PyObject* CreatePersistentOpCC(PyObject* self, PyObject* args) {
PyObject* op_str;
if (!PyArg_ParseTuple(args, "S", &op_str)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a serialized string of OperatorDef.");
return nullptr;
}
OperatorDef op_def;
if (!op_def.ParseFromString(PyBytes_AsStringEx(op_str))) {
PyErr_SetString(PyExc_RuntimeError,
"Failed to parse the OperatorDef.");
return nullptr;
}
ws()->CreatePersistentOp(op_def);
Py_RETURN_TRUE;
}
inline PyObject* RunPersistentOpCC(PyObject* self, PyObject* args) {
char* key, *anchor;
PyObject* py_inputs, *py_outputs;
if (!PyArg_ParseTuple(args, "ssOO",
&key, &anchor, &py_inputs, &py_outputs)) {
PyErr_SetString(PyExc_ValueError,
"Excepted a persistent key, anchor, "
"list of inputs and outputs.");
return nullptr;
ws()->RunOperator(*def);
});
/*! \brief Run an operator from the serialized def */
m.def("RunOperator", [](
const string& serialized,
const bool verbose) {
OperatorDef def;
CHECK(def.ParseFromString(serialized));
pybind11::gil_scoped_release g;
if (verbose) {
// It is not a good design to print the debug string
std::cout << def.DebugString() << std::endl;
}
vector<string> inputs, outputs;
PyList_AsVecString(py_inputs, inputs, "");
PyList_AsVecString(py_outputs, outputs, "");
ws()->RunPersistentOp(key, anchor, inputs, outputs);
Py_RETURN_TRUE;
ws()->RunOperatorOnce(def);
});
}
} // namespace python
......
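Both overloads can be exercised from Python; a hedged sketch with a hypothetical ``Relu`` def:

import dragon.import_c_api as C
from dragon.proto import dragon_pb2 as pb

# Overload 1: pass the C++ def by reference (no re-serialization)
c_def = C.OperatorDef()
c_def.type, c_def.input, c_def.output = 'Relu', ['x'], ['y']
C.RunOperator(c_def, False)

# Overload 2: pass a serialized OperatorDef
py_def = pb.OperatorDef(type='Relu', input=['x'], output=['y'])
C.RunOperator(py_def.SerializeToString(), False)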
/*!
* Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
*
* Licensed under the BSD 2-Clause License.
* You should have received a copy of the BSD 2-Clause License
* along with the software. If not, See,
*
* <https://opensource.org/licenses/BSD-2-Clause>
*
* ------------------------------------------------------------
*/
#ifndef DRAGON_PYTHON_PY_PROTO_H_
#define DRAGON_PYTHON_PY_PROTO_H_
#include "py_dragon.h"
namespace dragon {
namespace python {
void AddProtoMethods(pybind11::module& m) {
/*! \brief Extended C-Style OperatorDef */
pybind11::class_<OperatorDef>(m, "OperatorDef")
.def(pybind11::init())
.def("CopyFrom", [](
OperatorDef* self,
OperatorDef* other) {
self->CopyFrom(*other);
}).def("ParseFrom", [](
OperatorDef* self,
const string& serialized) {
self->ParseFromString(serialized);
}).def("SerializeAs", [](
OperatorDef* self) {
return pybind11::bytes(self->SerializeAsString());
}).def("add_input", [](
OperatorDef* self,
const string& input) {
self->add_input(input);
}).def("add_output", [](
OperatorDef* self,
const string& output) {
self->add_output(output);
}).def_property("name",
[](OperatorDef* self) {
return self->name(); },
[](OperatorDef* self, const string& name) {
self->set_name(name);
}).def_property("type",
[](OperatorDef* self) {
return self->type(); },
[](OperatorDef* self, const string& type) {
self->set_type(type);
}).def_property("input",
[](OperatorDef* self) -> vector<string> {
return { self->input().begin(), self->input().end() }; },
[](OperatorDef* self, const vector<string>& input) {
*(self->mutable_input()) = { input.begin(), input.end() };
}).def_property("output",
[](OperatorDef* self) -> vector<string> {
return{ self->output().begin(), self->output().end() }; },
[](OperatorDef* self, const vector<string>& output) {
*(self->mutable_output()) = { output.begin(), output.end() };
});
m.def("TestOperatorDefs", [](vector<OperatorDef*> defs) {
for (auto* def : defs) {
std::cout << def->DebugString() << std::endl;
}
});
}
} // namespace python
} // namespace dragon
#endif // DRAGON_PYTHON_PY_PROTO_H_
\ No newline at end of file
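A brief sketch of the wrapped def from Python (names are hypothetical):

import dragon.import_c_api as C

op = C.OperatorDef()
op.name, op.type = 'relu1', 'Relu'
op.add_input('x')
op.add_output('y')
clone = C.OperatorDef()
clone.CopyFrom(op)            # deep copy on the C++ side
data = clone.SerializeAs()    # -> Python bytes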
......@@ -13,6 +13,7 @@
#ifndef DRAGON_PYTHON_PY_TYPES_H_
#define DRAGON_PYTHON_PY_TYPES_H_
#include <string>
#include <numpy/arrayobject.h>
#include "core/types.h"
......@@ -31,6 +32,7 @@ inline const int TypeMetaToNPY(const TypeMeta& meta) {
{ TypeMeta::Id<float16>(), NPY_FLOAT16 },
{ TypeMeta::Id<float>(), NPY_FLOAT32 },
{ TypeMeta::Id<double>(), NPY_FLOAT64 },
{ TypeMeta::Id<std::string>(), NPY_OBJECT },
};
return m2npy_type_map.count(meta.id()) ? m2npy_type_map[meta.id()] : -1;
}
......@@ -45,6 +47,8 @@ inline const TypeMeta& TypeNPYToMeta(int npy_type) {
{ NPY_FLOAT16, TypeMeta::Make<float16>() },
{ NPY_FLOAT32, TypeMeta::Make<float>() },
{ NPY_FLOAT64, TypeMeta::Make<double>() },
{ NPY_UNICODE, TypeMeta::Make<std::string>() },
{ NPY_STRING, TypeMeta::Make<std::string>() },
};
static TypeMeta unknown_type;
return npy2m_type_map.count(npy_type) ?
......
......@@ -24,6 +24,7 @@ from dragon.core.tensor import Tensor
import dragon.core.workspace as workspace
import dragon.core.tensor_utils as tensor_utils
import dragon.core.mpi as mpi
import dragon.core.cuda as cuda
import dragon.memonger as memonger
# Operators
......
......@@ -23,7 +23,7 @@ option = {}
# The current device, 'CPU', 'CUDA' or 'CNML'
option['device'] = 'CPU'
# The device id
# The device index
option['device_id'] = 0
# Whether to use cuDNN if possible
......@@ -32,8 +32,8 @@ option['use_cudnn'] = False
# The global random seed
option['random_seed'] = 3
# Disable the memonger if true
option['debug_mode'] = False
# Set the level of graph optimization
option['graph_optimization_level'] = 3
# Whether to share grads
option['share_grads'] = True
......@@ -76,29 +76,13 @@ def EnableCPU():
option['device'] = 'CPU'
def IsCUDADriverSufficient():
"""Is CUDADriver sufficient?
Returns
-------
boolean
``True`` if your device(s) support CUDA, otherwise ``False``.
References
----------
The wrapper of ``IsCUDADriverSufficientCC``.
"""
return C.IsCUDADriverSufficientCC()
def EnableCUDA(gpu_id=0, use_cudnn=True):
"""Enable NVIDIA's CUDA mode globally.
Parameters
----------
gpu_id : int
The id of GPU to use.
The index of GPU to use.
use_cudnn : boolean
Whether to use cuDNN if available.
......@@ -119,7 +103,7 @@ def EnableCNML(mlu_id=0):
Parameters
----------
mlu_id : int
The id of MLU to use.
The index of MLU to use.
Returns
-------
......@@ -161,12 +145,12 @@ def GetRandomSeed():
def SetGPU(id):
"""Set the global id GPU.
"""Set the global index GPU.
Parameters
----------
id : int
The id of GPU to use.
The index of GPU to use.
Returns
-------
......@@ -178,26 +162,26 @@ def SetGPU(id):
def GetGPU():
"""Get the global id of GPU.
"""Get the global index of GPU.
Returns
-------
int
The global id of GPU.
The global index of GPU.
"""
return option['device_id']
def SetDebugMode(enabled=True):
"""Enable Debug mode globally.
def SetGraphType(graph_type=''):
"""Set the graph type.
It will disable all memory sharing optimizations.
If empty, the default DAG graph will be used.
Parameters
----------
enabled : boolean
Whether to enable debug mode.
graph_type : str
The graph type.
Returns
-------
......@@ -205,18 +189,28 @@ def SetDebugMode(enabled=True):
"""
global option
option['debug_mode'] = enabled
option['graph_type'] = graph_type
def SetGraphType(graph_type=''):
"""Set the graph type.
def SetGraphOptimizationLevel(level=3):
"""Set the default level of graph optimization.
If empty, the default DAG graph will be used.
We have predefined four levels:
-O0(level=0): Do nothing.
-O1(level=1): Prune the redundant nodes.
-O2(level=2): Apply in-place rewriting to the outputs.
Note that the graph will no longer be a DAG.
-O3(level=3): Allocate shared buffers for the outputs.
This level is memory-efficient, but debugging will be non-trivial.
Parameters
----------
graph_type : str
The graph type.
level : {0, 1, 2, 3}, optional, default=3
The level, see the documentation for details.
Returns
-------
......@@ -224,7 +218,7 @@ def SetGraphType(graph_type=''):
"""
global option
option['graph_type'] = graph_type
option['graph_optimization_level'] = level
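For example, a conservative setting while debugging (a sketch; the import follows the documented ``dragon.config`` module):

import dragon.config as cfg

# -O1 prunes redundant nodes but keeps outputs inspectable
# (no in-place rewriting or buffer sharing)
cfg.SetGraphOptimizationLevel(1)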
def LogMetaGraph(enabled=True):
......@@ -301,7 +295,7 @@ def SetLoggingLevel(level):
The default level is *INFO*.
"""
C.SetLogLevelCC(level)
C.SetLoggingLevel(level)
logging.set_verbosity({
'DEBUG': logging.DEBUG,
'INFO': logging.INFO,
......
# ------------------------------------------------------------
# Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
#
# Licensed under the BSD 2-Clause License.
# You should have received a copy of the BSD 2-Clause License
# along with the software. If not, See,
#
# <https://opensource.org/licenses/BSD-2-Clause>
#
# ------------------------------------------------------------
"""List some useful CUDA C++ API."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import dragon.import_c_api as C
def IsCUDADriverSufficient():
"""Is cuda driver sufficient?
Returns
-------
boolean
``True`` if your device(s) support CUDA otherwise ``False``.
"""
return C.IsCUDADriverSufficient()
def GetDevice():
"""Get the current active cuda device.
Returns
-------
int
The device index.
"""
return C.cudaGetDevice()
def SynchronizeStream(device_id=None, stream_id=0):
"""Synchronize the specified cuda stream.
If ``device_id`` is *None*, the current active device will be selected.
Parameters
----------
device_id : int or None
The device index.
stream_id : int
The stream index.
"""
return C.cudaStreamSynchronize(
device_id if device_id else -1, stream_id)
\ No newline at end of file
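A small usage sketch of this new module (device and stream indices are illustrative):

import dragon.core.cuda as cuda

if cuda.IsCUDADriverSufficient():
    print(cuda.GetDevice())           # the active device index
    cuda.SynchronizeStream(None, 0)   # stream 0 on the active device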
......@@ -49,9 +49,9 @@ class GraphGradientMaker(object):
Parameters
----------
forward_op : dragon_pb2.OperatorDef
forward_op : OperatorDef
The OperatorDef of ``ForwardOp``.
g_outputs : list of str or list of None
g_outputs : list of str
The inputs of ``BackwardOp`` (Precomputed grads).
name : str, optional
The optional operator name.
......@@ -61,13 +61,9 @@ class GraphGradientMaker(object):
tuple
The OpDef, outputs and defaults of ``BackwardOp``.
References
----------
The wrapper of ``CreateGradientDefsCC``.
"""
g_ops, g_inputs, defaults = \
C.CreateGradientDefsCC(forward_op.SerializeToString(), g_outputs)
g_ops, g_inputs, defaults = C.CreateGradientDefs(
forward_op.SerializeToString(), g_outputs)
for idx, g_op in enumerate(g_ops):
new_def = pb.OperatorDef()
new_def.ParseFromString(g_op)
......@@ -80,13 +76,13 @@ class GraphGradientMaker(object):
Parameters
----------
forward_op : dragon_pb2.OperatorDef
forward_op : OperatorDef
The OperatorDef of ``ForwardOp``.
inputs_to_grads : dict
The dict of <input, g_input>.
blacklist : set of str
The set of ``NoGradient`` tensors.
targets : list of str
targets : sequence of str
The solving targets.
Returns
......@@ -123,7 +119,7 @@ class GraphGradientMaker(object):
Parameters
----------
forward_ops : list of dragon_pb2.OperatorDef
forward_ops : sequence of OperatorDef
The operators of ``ForwardOp``.
targets : sequence of str
The solving targets.
......@@ -168,12 +164,12 @@ class GraphGradientMaker(object):
is_skip, gen_grads = \
cls.CheckGrad(forward_op, inputs_to_grads, blacklist, targets)
# Missing grads are represented as ``ignore``
g_outputs = list(inputs_to_grads.get(name, None) for name in forward_op.output)
g_outputs = list(inputs_to_grads.get(name, 'ignore') for name in forward_op.output)
g_ops, g_inputs, defaults = cls.CreateGrad(forward_op, g_outputs)
# Append ops
if not is_skip:
# --> GenOp
# GradientGenerateOp
if len(gen_grads) > 0:
op_inputs = []; op_outputs = []; values = []
for item in gen_grads:
......@@ -185,7 +181,7 @@ class GraphGradientMaker(object):
if forward_op.HasField('device_option'):
gen_op.device_option.CopyFrom(forward_op.device_option)
backward_ops.append(gen_op)
# --> GradOp
# GradientOp
for g_op in g_ops:
g_op.name = OperatorHelper.get_name() if auto_names else 'runtime'
backward_ops.append(g_op)
......
......@@ -33,7 +33,7 @@ class OperatorHelper(object):
# Input(0) => Output(0), shape and data type unchanged.
'Relu', 'PRelu', 'Elu', 'SElu', 'Sigmoid', 'Tanh', 'Dropout', 'Softmax',
'Add', 'Sub', 'Mul', 'Div', 'Clip', 'Log', 'Exp', 'Pow', 'Square', 'Sqrt',
'Affine', 'Copy', 'Compare', 'StopGradient', 'MovingAverage', 'MPIBroadcast',
'Accumulate', 'Affine', 'Copy', 'Compare', 'StopGradient', 'MPIBroadcast',
'BatchNorm', 'GroupNorm', 'L2Norm', 'LRN', 'BiasAdd', 'DropBlock2d',
)
......@@ -885,10 +885,6 @@ class OperatorHelper(object):
def _apply_BilinearResize(cls, arguments, inputs, outputs):
return cls._apply_NNResize(arguments, inputs, outputs)
@classmethod
def _apply_DenseConcat(cls, arguments, inputs, outputs):
return cls._apply_Concat(arguments, inputs, outputs)
class GradientHelper(object):
"""A helper to store the known gradient relations.
......
......@@ -43,8 +43,9 @@ def get_logger():
logger = _logging.getLogger('dragon')
logger.setLevel(INFO)
logger.propagate = False
if not _logging.getLogger().handlers:
if True:
# Determine whether we are in an interactive environment
_interactive = False
try:
......
......@@ -9,31 +9,15 @@
#
# ------------------------------------------------------------
"""List some useful MPI C++ API."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import dragon.import_c_api as C
__all__ = [
'Init',
'Is_Init',
'Rank',
'Size',
'CreateGroup',
'Snapshot',
'AllowSnapshot',
'Parallel',
'AllowParallel',
'SetParallelMode',
'GetParallelMode',
'Finalize',
]
_GLOBAL_MPI_IS_INIT = False
_GLOBAL_MPI_SNAPSHOT_RANKS = []
_GLOBAL_MPI_PARALLEL_GROUPS = []
......@@ -55,12 +39,8 @@ def Init():
-----
This function can only be called once.
References
----------
The wrapper of ``MPIInitCC``
"""
C.MPIInitCC()
C.MPIInit()
global _GLOBAL_MPI_IS_INIT
global _GLOBAL_MPI_SNAPSHOT_RANKS
_GLOBAL_MPI_IS_INIT = True
......@@ -86,13 +66,9 @@ def Rank():
int
The world rank.
References
----------
The wrapper of ``MPIRankCC``.
"""
_check_init()
return C.MPIRankCC()
return C.MPIRank()
def Size():
......@@ -103,13 +79,9 @@ def Size():
int
The world size.
References
----------
The wrapper of ``MPISizeCC``.
"""
_check_init()
return C.MPISizeCC()
return C.MPISize()
def CreateGroup(root=0, incl=[], excl=[]):
......@@ -129,14 +101,9 @@ def CreateGroup(root=0, incl=[], excl=[]):
tuple
The local common and group id.
References
----------
The wrapper of ``MPICreateGroupCC``.
"""
_check_init()
comm, group = C.MPICreateGroupCC(root, incl, excl)
return np.int64(comm), np.int64(group)
return C.MPICreateGroup(root, incl, excl)
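A hedged sketch of forming a group over hypothetical ranks 0 and 1:

import dragon.core.mpi as mpi

mpi.Init()
if mpi.Size() >= 2:
    comm, group = mpi.CreateGroup(root=0, incl=[0, 1])
mpi.Finalize()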
def Snapshot(incl):
......@@ -193,6 +160,7 @@ def AllowSnapshot():
Returns
-------
boolean
"""
return Rank() in _GLOBAL_MPI_SNAPSHOT_RANKS
......@@ -212,12 +180,12 @@ def AllowParallel():
def SetParallelMode(mode):
"""Set the mode of data parallelism.
"""Set the communication mode of data parallelism.
Parameters
----------
mode : str
The mode, ``MPI``, ``NCCL`` or ``MIXED``.
mode : {'MPI', 'NCCL'}, optional
The communication mode.
Returns
-------
......@@ -228,20 +196,18 @@ def SetParallelMode(mode):
The default mode is ``MPI``.
"""
assert mode == 'MPI' or \
mode == 'NCCL' or \
mode == 'MIXED'
assert mode == 'MPI' or mode == 'NCCL'
global _GLOBAL_MPI_PARALLEL_MODE
_GLOBAL_MPI_PARALLEL_MODE = mode
def GetParallelMode():
"""Get the current mode of data parallelism.
"""Get the current communication mode of data parallelism.
Returns
-------
str
The mode, ``MPI``, ``NCCL`` or ``MIXED``.
str : {'MPI', 'NCCL'}
The communication mode.
"""
return _GLOBAL_MPI_PARALLEL_MODE
......@@ -260,4 +226,4 @@ def Finalize():
"""
_check_init()
C.MPIFinalizeCC()
\ No newline at end of file
C.MPIFinalize()
\ No newline at end of file
......@@ -21,6 +21,7 @@ import numpy as np
from google.protobuf.message import Message
import dragon.config as cfg
import dragon.import_c_api as C
from dragon.proto import dragon_pb2 as pb
from dragon.core.scope import get_default_device
......@@ -50,14 +51,15 @@ else:
argument.name = key
if type(value) is float: argument.f = value
elif type(value) in (bool, int, long, np.int64) : argument.i = value
elif type(value) in (str, unicode): argument.s = value
elif type(value) is str: argument.s = value
elif type(value) is unicode: argument.s = str(value)
elif isinstance(value, Message): argument.s = value.SerializeToString()
elif all(type(v) is float for v in value): argument.floats.extend(value)
elif all(type(v) is int for v in value): argument.ints.extend(value)
elif all(type(v) is long for v in value): argument.ints.extend(value)
elif all(type(v) is str for v in value): argument.strings.extend(value)
elif all(type(v) is unicode or type(v) is str for v in value):
argument.strings.extend(value)
elif all(type(v) is unicode for v in value):
argument.strings.extend([str(v) for v in value])
elif all(isinstance(v, Message) for v in value):
argument.strings.extend([v.SerializeToString() for v in value])
else:
......@@ -67,8 +69,10 @@ else:
return argument
def MakeOperatorDef(op_type, inputs, outputs, name='',
device_option=None, arg=None, engine=None, **kwargs):
def MakeOperatorDef(
op_type, inputs=(), outputs=(),
name='', uid=None, device_option=None,
arg=None, engine=None, **kwargs):
operator = pb.OperatorDef()
operator.type = op_type
operator.name = name
......@@ -81,22 +85,29 @@ def MakeOperatorDef(op_type, inputs, outputs, name='',
if 'random_seed' in kwargs:
operator.device_option.random_seed = kwargs['random_seed']
del kwargs['random_seed']
if arg is not None:
operator.arg.extend(arg)
if uid is not None: operator.uid = uid
if arg is not None: operator.arg.extend(arg)
for k,v in kwargs.items():
if v is None: continue
operator.arg.add().CopyFrom(MakeArgument(k,v))
return operator
def MutableOperatorDef(meta_def, inputs, outputs):
op = pb.OperatorDef(); op.CopyFrom(meta_def)
op.ClearField('input'); op.input.extend(inputs)
op.ClearField('output'); op.output.extend(outputs)
return op
def MakeCXXOperatorDef(
op_type, inputs=(), outputs=(),
name='', uid=None, device_option=None,
arg=None, engine=None, **kwargs):
c_def = C.OperatorDef()
py_def = MakeOperatorDef(
op_type, inputs, outputs, name, uid,
device_option, arg, engine, **kwargs)
c_def.ParseFrom(py_def.SerializeToString())
return c_def
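For instance (a sketch; the op and tensor names are hypothetical):

from dragon.core.proto_utils import MakeCXXOperatorDef

# Builds the def in protobuf, then parses it into a C++-side
# C.OperatorDef that RunOperator can take by reference
c_def = MakeCXXOperatorDef('Relu', inputs=['x'], outputs=['y'], name='relu1')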
def MakeDeviceOption(device_type, device_id, engine=None, rng_seed=None):
def MakeDeviceOption(
device_type, device_id,
engine=None, rng_seed=None):
option = pb.DeviceOption()
option.device_type = device_type
option.device_id = device_id
......@@ -121,7 +132,9 @@ for i in range(_PREDEFINED_DEVICE_LIMITS):
MakeDeviceOption(identify, i, 'CUDNN')
def GetDeviceOption(device_type, device_id=0, engine=None, rng_seed=None):
def GetDeviceOption(
device_type, device_id=0,
engine=None, rng_seed=None):
ctx = (device_type, device_id, engine if engine else '')
option = _PREDEFINED_DEVICE_OPTION_DICT[ctx]
if rng_seed is not None:
......
......@@ -88,11 +88,11 @@ class WorkspaceScope(object):
self.prev = 'default'
def __enter__(self):
self.prev = C.CurrentWorkspaceCC()
C.SwitchWorkspaceCC(self.ws, True)
self.prev = C.CurrentWorkspace()
C.SwitchWorkspace(self.ws, True)
def __exit__(self, type, value, traceback):
C.SwitchWorkspaceCC(self.prev, True)
C.SwitchWorkspace(self.prev, True)
_GLOBAL_TENSOR_STACK = _ThreadLocalStack()
......
......@@ -355,10 +355,9 @@ class Tensor(object):
"""
if inplace:
return Tensor.CreateOperator(
'AsType', [], existing_outputs=[self], dtype=dtype)
'Cast', [], existing_outputs=[self], dtype=dtype)
else:
return Tensor.CreateOperator(
'AsType', self, dtype=dtype)
return Tensor.CreateOperator('Cast', self, dtype=dtype)
@property
def extra_targets(self):
......
......@@ -9,6 +9,8 @@
#
# ------------------------------------------------------------
"""List some extended Tensor C++ API."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
......@@ -23,21 +25,7 @@ from dragon.core.tensor import Tensor
from dragon.core.proto_utils import GetDeviceOption
__all__ = [
'FromShape',
'SetShape',
'FromTensor',
'FromPyArray',
'SetPyArray',
'ToPyArray',
'ToPyArrayEx',
'ToCPUTensor',
'ToCUDATensor',
'GetTensorInfo',
]
def FromShape(shape, dtype='float32', ctx=None, name=None):
def FromShape(shape, dtype='float32', name=None):
"""Create a Tensor from the shape.
If specifying an existing tensor with a larger shape,
......@@ -49,8 +37,6 @@ def FromShape(shape, dtype='float32', ctx=None, name=None):
The shape info.
dtype : str
The data type.
ctx : dragon_pb2.DeviceOption
The context info.
name : str, optional
The optional tensor name.
......@@ -59,19 +45,14 @@ def FromShape(shape, dtype='float32', ctx=None, name=None):
Tensor
The tensor with the specific shape.
References
----------
The wrapper of ``TensorFromShapeCC``.
"""
tensor = _try_get_tensor(name)
tensor.shape = list(shape)
if not isinstance(shape, (tuple, list)):
raise TypeError('The shape should be a tuple or list.')
if ctx is None: ctx = GetDeviceOption('CPU')
C.TensorFromShapeCC(
C.TensorFromShape(
_stringify_tensor(tensor),
list(shape), dtype,
_stringify_proto(ctx))
list(shape), dtype)
return tensor
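A short sketch of the simplified signature, now that the ``ctx`` argument is gone (the tensor name is illustrative):

>>> from dragon.core.tensor_utils import FromShape
>>> t = FromShape((2, 3), dtype='float32', name='t')
>>> t.shape  # [2, 3]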
......@@ -91,12 +72,8 @@ def SetShape(tensor, shape, dtype='float32'):
-------
None
References
----------
The wrapper of ``TensorFromShapeCC``.
"""
C.TensorFromShapeCC(_stringify_tensor(tensor), shape, dtype)
C.TensorFromShape(_stringify_tensor(tensor), shape, dtype)
def FromTensor(src, src_ctx=None, name=None, ctx=None):
......@@ -109,11 +86,11 @@ def FromTensor(src, src_ctx=None, name=None, ctx=None):
----------
src : Tensor or str
The source tensor (or its name).
src_ctx : dragon_pb2.DeviceOption
src_ctx : DeviceOption
The context of source tensor.
name : str
The optional tensor name for destination tensor.
ctx : dragon_pb2.DeviceOption
ctx : DeviceOption
The context for destination tensor.
Returns
......@@ -121,15 +98,11 @@ def FromTensor(src, src_ctx=None, name=None, ctx=None):
Tensor
The tensor with the same data as source.
References
----------
The wrapper of ``TensorFromTensorCC``.
"""
tensor = _try_get_tensor(name)
if src_ctx is None: src_ctx = GetDeviceOption('CPU')
if ctx is None: ctx = GetDeviceOption('CPU')
C.TensorFromTensorCC(
C.TensorFromTensor(
_stringify_tensor(tensor), _stringify_tensor(src),
_stringify_proto(ctx), _stringify_proto(src_ctx))
return tensor
......@@ -155,15 +128,11 @@ def FromPyArray(array, name=None):
Tensor
The tensor sharing the memory with original array.
References
----------
The wrapper of ``TensorFromPyArrayCC``.
"""
tensor = _try_get_tensor(name)
if not isinstance(array, np.ndarray):
raise TypeError('The given nd-array should be numpy.ndarray.')
C.TensorFromPyArrayCC(_stringify_tensor(tensor), array)
C.TensorFromPyArray(_stringify_tensor(tensor), array)
return tensor
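A sketch of the zero-copy wrap (the array and name are illustrative):

>>> import numpy as np
>>> from dragon.core.tensor_utils import FromPyArray
>>> a = np.ones((2, 3), dtype='float32')
>>> t = FromPyArray(a, name='shared')  # the tensor shares memory with `a`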
......@@ -188,154 +157,58 @@ def SetPyArray(tensor, array):
The wrapper of ``TensorFromPyArrayCC``.
"""
C.TensorFromPyArrayCC(_stringify_tensor(tensor), array)
C.TensorFromPyArray(_stringify_tensor(tensor), array)
def ToPyArray(tensor):
def ToPyArray(tensor, readonly=False):
"""Create a Array from a existing Tensor.
Note that memory of Array are ``zero-copied``.
Note that the memory of the Array is *zero-copied*.
Parameters
----------
tensor : Tensor or str
The input tensor.
readonly : bool, optional, default=False
Whether to sync the contents with the device.
Returns
-------
numpy.ndarray
The array sharing the memory with original tensor.
References
----------
The wrapper of ``TensorToPyArrayCC``.
"""
return C.TensorToPyArrayCC(_stringify_tensor(tensor))
def ToPyArrayEx(tensor):
"""Create a const Array from a existing Tensor.
Note that memory of Array are ``zero-copied`` and ``const``.
Parameters
----------
tensor : Tensor or str
The input tensor.
Returns
-------
numpy.ndarray
The array sharing the memory with original tensor.
References
----------
The wrapper of ``TensorToPyArrayExCC``.
"""
return C.TensorToPyArrayExCC(_stringify_tensor(tensor))
def ToCPUTensor(tensor):
"""Switch the storage of a existing Tensor on cpu memory.
Parameters
----------
tensor : Tensor or str
The input tensor.
Returns
-------
None
References
----------
The wrapper of ``ToCPUTensorCC``.
"""
return C.ToCPUTensorCC(_stringify_tensor(tensor))
return C.TensorToPyArray(_stringify_tensor(tensor), readonly)
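Continuing the sketch above, the merged ``readonly`` flag replaces the old ``ToPyArrayEx`` variant:

>>> arr = ToPyArray('shared')                    # writable, synced view
>>> arr_ro = ToPyArray('shared', readonly=True)  # presumably skips the writable sync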
def ToCUDATensor(tensor, device=0):
"""Switch the storage of a existing Tensor on cuda memory.
def GetStorage(tensor):
"""Get the storage of a existing Tensor.
Parameters
----------
tensor : Tensor or str
The input tensor.
device : int
The id of the device to use.
Returns
-------
None
References
----------
The wrapper of ``ToCUDATensorCC``.
TensorStorage
The storage of the backend.
"""
return C.ToCUDATensorCC(_stringify_tensor(tensor), device)
def GetTensorInfo(tensor, stream=1):
"""Get the info of a existing Tensor.
The string info contains following fields:
stream #1: ``dtype``, ``from_numpy``, ``init``, ``mem``, ``mem_at``, ``device_id``
stream #2: ``shape``
stream #3: #1 + #2
Parameters
----------
tensor : Tensor or str
The input tensor.
stream : int
The stream id.
Returns
-------
dict
The info.
References
----------
The wrapper of ``GetTensorInfoCC``.
"""
if not dg.workspace.HasTensor(_stringify_tensor(tensor)): return None
info = C.GetTensorInfoCC(_stringify_tensor(tensor), stream)
info['mem'] = []
if 'CPU' in info:
info['mem'].append('CPU'); info['device_id'] = 0
if 'CUDA' in info:
info['mem'].append('CUDA'); info['device_id'] = int(info['CUDA'])
if 'CNML' in info:
info['mem'].append('CNML'); info['device_id'] = int(info['CNML'])
info['init'] = len(info['mem']) > 0
return info
tensor = _stringify_tensor(tensor)
if not dg.workspace.HasTensor(tensor): return None
return C.GetTensor(tensor)
def _stringify_proto(obj):
"""Try to stringify a proto-buffer structure."""
if isinstance(obj, str): return obj
elif isinstance(obj, Message): return obj.SerializeToString()
else: raise TypeError('Object can not be serialized as a string.')
return obj.SerializeToString()
def _stringify_tensor(obj):
"""Try to stringify a tensor."""
if hasattr(obj, 'name'): return obj.name
else:
try:
obj = str(obj)
except Exception as e:
raise TypeError('Object can not be used as a tensor. Error: {0}'.format(str(e)))
return obj
else: return str(obj)
def _try_get_tensor(name=None):
......
......@@ -33,8 +33,8 @@ except ImportError as e:
sys.exit(1)
REGISTERED_OPERATORS = set(s for s in RegisteredOperatorsCC())
NO_GRADIENT_OPERATORS = set(s for s in NoGradientOperatorsCC())
REGISTERED_OPERATORS = set(s for s in RegisteredOperators())
NO_GRADIENT_OPERATORS = set(s for s in NoGradientOperators())
atexit.register(OnModuleExitCC)
\ No newline at end of file
atexit.register(OnModuleExit)
\ No newline at end of file
......@@ -100,8 +100,8 @@ class ArgumentHelper(object):
arguments[name] = None
arguments[name + '_desc'] = property.name
return arguments
extra_kwargs = {'gen_desc_{}'.format(name): Generator}
return op_func(*args, **kwargs, **extra_kwargs)
kwargs.update({'gen_desc_{}'.format(name): Generator})
return op_func(*args, **kwargs)
return Impl
return Decorator
......@@ -138,8 +138,8 @@ class ArgumentHelper(object):
else:
arguments[desc_name] = properties
return arguments
extra_kwargs = {'gen_desc_{}'.format(name): Generator}
return op_func(*args, **kwargs, **extra_kwargs)
kwargs.update({'gen_desc_{}'.format(name): Generator})
return op_func(*args, **kwargs)
return Impl
return Decorator
......
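The switch from unpacking two keyword dicts to ``kwargs.update`` is presumably for Python 2 compatibility: a call of the form ``op_func(*args, **kwargs, **extra_kwargs)`` is only valid syntax on Python 3.5+. A minimal sketch of the pattern (names are illustrative):

>>> def op_func(**kwargs): return sorted(kwargs)
>>> kwargs = {'inputs': None}
>>> kwargs.update({'gen_desc_shape': None})  # merge first, unpack once
>>> op_func(**kwargs)
['gen_desc_shape', 'inputs']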
......@@ -140,11 +140,13 @@ def Minimum(inputs, **kwargs):
@OpSchema.Inputs(1)
def Moments(inputs, axes=None, keep_dims=False, **kwargs):
"""Compute the mean and variance of inputs along the given axes.
"""Calculate the mean and variance of inputs along the given axes.
The data type of the moments is typically *float32*,
unless the inputs are *float64* (then *float64* moments are returned).
If ``axes`` is *None*, a Scalar will be returned.
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters
......@@ -206,9 +208,9 @@ def Matmul(inputs, transA=False, transB=False, **kwargs):
----------
inputs : sequence of Tensor
The inputs, A and B.
transA : bool
transA : bool, optional, default=False
Whether to transpose A.
transB : bool
transB : bool, optional, default=False
Whether to transpose B.
Returns
......@@ -234,9 +236,9 @@ def Dot(inputs, transA=False, transB=False, **kwargs):
----------
inputs : sequence of Tensor
The inputs, A and B.
transA : bool
transA : bool, optional, default=False
Whether to transpose A.
transB : bool
transB : bool, optional, default=False
Whether to transpose B.
Returns
......@@ -262,9 +264,9 @@ def FullyConnected(inputs, num_output, axis=1, transW=True, **kwargs):
The inputs, represent [X, W] + [b].
num_output : int
The output dim.
axis : int, optional
axis : int, optional, default=1
The start axis to calculate, can be negative.
transW : bool, optional
transW : bool, optional, default=True
Whether to transpose the W.
Returns
......@@ -346,7 +348,7 @@ def Exp(inputs, **kwargs):
@OpSchema.Inputs(1)
def Pow(inputs, power, shift=None, scale=None, **kwargs):
def Pow(inputs, power, shift=0., scale=1., **kwargs):
"""Calculate the power of input.
Formulation: |power_function|
......@@ -357,11 +359,11 @@ def Pow(inputs, power, shift=None, scale=None, **kwargs):
----------
inputs : Tensor
The input tensor.
power : float
power : float, required
The power factor.
shift : float, optional
shift : float, optional, default=0.
The shift magnitude.
scale : float, optional
scale : float, optional, default=1.
The scale factor.
Returns
......@@ -414,7 +416,7 @@ def Sqrt(inputs, **kwargs):
The sqrt result.
"""
return Tensor.CreateOperator('Pow', power=0.5, **ParseArgs(locals()))
return Tensor.CreateOperator('Sqrt', **ParseArgs(locals()))
@OpSchema.Inputs(2, 3)
......@@ -433,9 +435,9 @@ def Affine(inputs, axis=1, num_axes=1, **kwargs):
----------
inputs : sequence of Tensor
The inputs, represent [x, A] + [b].
axis : int, optional
axis : int, optional, default=1
The start axis to scale, can be negative.
num_axes : int, optional
num_axes : int, optional, default=1
The number of axes to scale.
Returns
......@@ -459,7 +461,7 @@ def GramMatrix(inputs, axis=1, **kwargs):
----------
inputs : Tensor
The input tensor.
axis : int, optional
axis : int, optional, default=1
The start axis to calculate.
Returns
......@@ -469,3 +471,48 @@ def GramMatrix(inputs, axis=1, **kwargs):
"""
return Tensor.CreateOperator('GramMatrix', **ParseArgs(locals()))
@OpSchema.Inputs(1, INT_MAX)
def Accumulate(inputs, alpha=1., beta=1., **kwargs):
"""Calculate *y = alpha * x + beta * y*
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters
----------
inputs : sequence of Tensor
The inputs, i.e., the *x*.
alpha : float, optional, default=1.
The alpha value.
beta : float, optional, default=1.
The beta value.
Returns
-------
sequence of Tensor
The outputs, i.e., the *y*.
"""
return Tensor.CreateOperator('Accumulate', **ParseArgs(locals()))
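A usage sketch (the imports and the input tensor are illustrative):

>>> import dragon.ops as ops
>>> from dragon.core.tensor import Tensor
>>> x = Tensor('x', dtype='float32').Variable()
>>> y = ops.Accumulate([x], alpha=0.5, beta=0.5)  # y = 0.5 * x + 0.5 * y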
@OpSchema.Inputs(1, INT_MAX)
def MovingAverage(inputs, decay, **kwargs):
"""Calculate the *y = (1 - decay) * x + decay * y*
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters
----------
inputs : sequence of Tensor
The inputs, i.e., the *x*.
decay : float, required
The decay factor.
Returns
-------
sequence of Tensor
The outputs, i.e., the *y*.
"""
return Accumulate(inputs, 1 - decay, decay, **kwargs)
\ No newline at end of file
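Continuing the sketch above, the two calls below are equivalent, since ``MovingAverage`` now simply forwards to ``Accumulate`` with ``alpha = 1 - decay`` and ``beta = decay``:

>>> y1 = ops.MovingAverage([x], decay=0.9)
>>> y2 = ops.Accumulate([x], alpha=0.1, beta=0.9)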
......@@ -17,7 +17,7 @@ from . import *
@OpSchema.Inputs(1)
def AsType(inputs, dtype='float32', inplace=False, **kwargs):
def Cast(inputs, dtype='float32', inplace=False, **kwargs):
"""Cast the data type of inputs to a specific one.
If ``inplace`` is ``True``, cast ``self`` instead of returning a new one.
......@@ -41,7 +41,7 @@ def AsType(inputs, dtype='float32', inplace=False, **kwargs):
Examples
--------
>>> x = Tensor('x', dtype='float32').Variable()
>>> y = AsType(x, 'int32')
>>> y = Cast(x, 'int32')
>>> z = x.astype('int64')
>>> xx = x.astype('float64', inplace=True)
>>> print(x.name, xx.name)
......@@ -53,7 +53,7 @@ def AsType(inputs, dtype='float32', inplace=False, **kwargs):
arguments['inputs'] = []
arguments['existing_outputs'] = [inputs]
return Tensor.CreateOperator('AsType', **arguments)
return Tensor.CreateOperator('Cast', **arguments)
def Run(inputs, module, op, param_str='', num_outputs=1, **kwargs):
......@@ -174,27 +174,3 @@ def StopGradient(inputs, **kwargs):
"""
return Tensor.CreateOperator('StopGradient', **ParseArgs(locals()))
\ No newline at end of file
@OpSchema.Inputs(1)
def MovingAverage(inputs, decay, **kwargs):
"""Calculate the moving average.
**Type Constraints**: (*int8*, *uint8*, *int32*, *int64*, *float16*, *float32*, *float64*)
Parameters
----------
inputs : Tensor
The values to calculate moving average.
decay : float
The decay factor.
Returns
-------
Tensor
The output tensor, i.e., ``variable``, calculated as:
|moving_average_function|
"""
return Tensor.CreateOperator('MovingAverage', **ParseArgs(locals()))
\ No newline at end of file
......@@ -740,7 +740,6 @@ def Shape(inputs, **kwargs):
return Tensor.CreateOperator('Shape', **ParseArgs(locals()))
@OpSchema.Inputs(0)
@ArgumentHelper.Desc('start')
@ArgumentHelper.Desc('stop')
@ArgumentHelper.Desc('step')
......
......@@ -62,7 +62,7 @@ def Conv2d(
The dilation multiple(s) of convolution.
group : int, optional, default=1
The group size of convolution.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional
padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm.
data_format : {'NCHW', 'NHWC'}, optional
The data_format.
......@@ -119,7 +119,7 @@ def DepthwiseConv2d(
The stride(s) of convolution.
pads : sequence of int, optional, default=0
The zero padding size(s) of convolution.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional
padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm.
data_format : {'NCHW', 'NHWC'}, optional
The data_format.
......@@ -183,7 +183,7 @@ def ConvTranspose2d(
The padding value add to one side(right) of the output.
output_shape : sequence of (int, Tensor), optional
The deterministic output shape for **SAME** padding.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional
padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm.
data_format : {'NCHW', 'NHWC'}, optional
The data_format.
......@@ -224,7 +224,7 @@ def ConvTranspose2d(
@OpSchema.Inputs(1)
def Pool2d(
inputs, kernel_shape, strides, pads=0, padding='VALID', ceil=True,
inputs, kernel_shape, strides, pads=0, padding='VALID', ceil_mode=True,
mode='MAX', data_format='NCHW', global_pooling=False, **kwargs):
"""2D Pooling, MAX or AVG.
......@@ -248,9 +248,9 @@ def Pool2d(
The stride(s) of pooling.
pads : sequence of int, optional, default=0
The zero padding size(s) of pooling.
padding : {'VALID', 'SAME, 'SAME_UPPER', 'SAME_LOWER'}, optional
padding : {'VALID', 'SAME', 'SAME_UPPER', 'SAME_LOWER'}, optional
The padding algorithm.
ceil : bool, optional
ceil_mode : bool, optional, default=True
Whether to ceil the boundary.
mode : {'MAX', 'AVG'}, optional
The pooling mode.
......@@ -505,48 +505,6 @@ def BiasAdd(inputs, data_format='NCHW', **kwargs):
return Tensor.CreateOperator('BiasAdd', **arguments)
@OpSchema.Inputs(2)
def DenseConcat(inputs, growth_rate=0, axis=1, **kwargs):
"""Memory-efficient concatenation for DenseNet `[Huang et.al, 2017] <http://arxiv.org/abs/1608.06993>`_.
This operator is forked from ``Concat``.
The memory optimization requires the following settings:
1. Set the ``growth_rate``; the value must be larger than ``0``.
2. Set the ``mirror_stage`` to True.
Parameters
----------
inputs : sequence of Tensor
The inputs, represent A(old) and B(new) respectively.
growth_rate : int, optional, default=0
The growth rate.
axis : int, optional
The axis to concatenate.
mirror_stage : bool, optional
Whether to share input A for output C. Default is ``False``.
Returns
-------
Tensor
The concatenated tensor, represents C.
Examples
--------
>>> A = Tensor().Variable()
>>> B = Tensor().Variable()
>>> C = DenseConcat([A, B], axis=1) # Simple concatenation
>>> import dragon.memonger as opt
>>> C = opt.Drop(DenseConcat, [A, B], axis=1) # Memory-efficient concatenation
>>> D = DenseConcat([A, B], axis=1, mirror_stage=True) # Memory-efficient concatenation, equivalent
"""
return Tensor.CreateOperator('DenseConcat', **ParseArgs(locals()))
@OpSchema.Inputs(1)
@ArgumentHelper.Desc('keep_prob', as_target=False)
def DropBlock2d(
......
......@@ -52,7 +52,6 @@ LRN = vision_ops.LRN
NNResize = vision_ops.NNResize
BilinearResize = vision_ops.BilinearResize
BiasAdd = vision_ops.BiasAdd
DenseConcat = vision_ops.DenseConcat
DropBlock2d = vision_ops.DropBlock2d
# Recurrent
......@@ -104,6 +103,8 @@ FullyConnected = math_ops.FullyConnected
Eltwise = math_ops.Eltwise
Affine = math_ops.Affine
GramMatrix = math_ops.GramMatrix
Accumulate = math_ops.Accumulate
MovingAverage = math_ops.MovingAverage
# Normalization
BatchNorm = norm_ops.BatchNorm
......@@ -137,19 +138,18 @@ Squeeze = array_ops.Squeeze
Shape = array_ops.Shape
Arange = array_ops.Arange
# ControlFlow
# Control Flow
Copy = control_flow_ops.Copy
Equal = control_flow_ops.Equal
Less = control_flow_ops.Less
Greater = control_flow_ops.Greater
# Misc
Cast = AsType = misc_ops.AsType
Cast = AsType = misc_ops.Cast
Run = misc_ops.Run
Template = misc_ops.Template
Accuracy = misc_ops.Accuracy
StopGradient = misc_ops.StopGradient
MovingAverage = misc_ops.MovingAverage
# MPI
MPIBroadcast = mpi_ops.MPIBroadcast
......
......@@ -65,12 +65,13 @@ message Argument {
}
message OperatorDef {
repeated string input = 1;
repeated string output = 2;
optional string name = 3;
optional string type = 4;
repeated Argument arg = 5;
optional DeviceOption device_option = 6;
optional string uid = 1;
repeated string input = 2;
repeated string output = 3;
optional string name = 4;
optional string type = 5;
repeated Argument arg = 6;
optional DeviceOption device_option = 7;
}
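Note that renumbering the existing fields (``input`` 1→2, ``output`` 2→3, and so on) breaks wire compatibility with OperatorDefs serialized by older versions. A construction sketch, assuming the generated module keeps the ``dragon_pb2`` name used elsewhere in these docstrings:

>>> from dragon.proto import dragon_pb2 as pb
>>> op = pb.OperatorDef(type='Add', name='add_1', uid='op_1')
>>> op.input.extend(['a', 'b']); op.output.extend(['c'])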
message GradientProto {
......
......@@ -83,7 +83,6 @@ class LMDB(object):
self.env = lmdb.open(database_path, readonly=True, lock=False)
self._total_size = self.env.info()['map_size']
if mode == 'w':
assert not os.path.isdir(database_path), 'database path is not invalid'
self.env = lmdb.open(database_path, writemap=True)
self.txn = self.env.begin(write=(mode == 'w'))
self.cursor = self.txn.cursor()
......
# ------------------------------------------------------------
# Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
#
# Licensed under the BSD 2-Clause License.
# You should have received a copy of the BSD 2-Clause License
# along with the software. If not, See,
#
# <https://opensource.org/licenses/BSD-2-Clause>
#
# ------------------------------------------------------------
import os
from dragon.core.tensor import Tensor
import dragon.core.workspace as ws
class ScalarSummary(object):
"""Write scalar summary.
Examples
--------
>>> sw = ScalarSummary(log_dir='logs')
>>> sw.add_summary(('loss', 2.333), 0)
"""
def __init__(self, log_dir='logs'):
"""Construct a ScalarSummary writer.
Parameters
----------
log_dir : str
The root folder of logs.
Returns
-------
ScalarSummary
The scalar writer.
"""
self.log_dir = os.path.join(log_dir, 'scalar')
def add_summary(self, scalar, global_step):
"""Add a summary.
Parameters
----------
scalar : tuple or Tensor
The scalar.
global_step : int
The time step of this summary.
Returns
-------
None
"""
if isinstance(scalar, Tensor):
key, value = scalar.name, ws.FetchTensor(scalar)[0]
elif isinstance(scalar, tuple): key, value = scalar
else: raise TypeError('The scalar should be a Tensor or a (key, value) tuple.')
key = key.replace('/', '_')
if not os.path.exists(self.log_dir): os.makedirs(self.log_dir)
with open(os.path.join(self.log_dir, key + '.txt'), 'a') as f:
f.write(str(global_step) + ' ' + str(value) + '\n')
\ No newline at end of file
......@@ -32,8 +32,12 @@ class BaseUpdater(object):
# Store the global unique slot index
_DEFAULT_UNIQUE_SLOT_ID = 0
def __init__(self, scale_gradient=1.0, clip_gradient=-1.0,
l2_decay=-1.0, slot=None, verbose=True):
def __init__(self,
scale_gradient=1.0,
clip_gradient=-1.0,
l2_decay=-1.0,
slot=None,
verbose=True):
"""Construct a Updater to optimize the objectives.
Parameters
......
......@@ -29,6 +29,7 @@ class DataBatch(object):
"""DataBatch aims to prefetch data by *Triple-Buffering*.
It takes full advantage of Python's processes and threads,
providing a remarkable I/O speed-up for scalable distributed training.
"""
......
......@@ -115,11 +115,13 @@ def make_db(args):
now_time = time.time()
print('{0} / {1} in {2:.2f} sec'.format(count, total_line, now_time - start_time))
db.put('size', str(count))
db.put('zfill', str(args.zfill))
db.commit()
db.close()
# Compress the empty space
db.open(args.database, mode='w')
db.commit()
shutil.copy(args.list, args.database + '/image_list.txt')
end_time = time.time()
print('{0} images have been stored in the database.'.format(total_line))
......
......@@ -29,7 +29,7 @@ class Layer(object):
Parameters
----------
LayerParameter : caffe_pb2.LayerParameter
LayerParameter : LayerParameter
The parameter of ``Layer``.
Returns
......
......@@ -91,5 +91,5 @@ from .common import (
ExpandDimsLayer,
StopGradientLayer,
ProposalLayer,
DenseConcatLayer,
CastLayer,
)
\ No newline at end of file
......@@ -184,29 +184,6 @@ class SliceLayer(Layer):
return dragon.ops.Slice(bottom, **self.arguments)
class DenseConcatLayer(Layer):
"""The extended implementation for `DenseNet`_.
Parameters
----------
axis : int
The axis to concatenate. Refer `ConcatParameter.axis`_.
growth_rate : int
The growth rate.
"""
def __init__(self, LayerParameter):
super(DenseConcatLayer, self).__init__(LayerParameter)
param = LayerParameter.dense_concat_param
self.arguments = {
'axis': param.axis,
'growth_rate': param.growth_rate,
}
def LayerSetup(self, bottom):
return dragon.ops.DenseConcat(bottom, **self.arguments)
class CropLayer(Layer):
"""The implementation of ``CropLayer``.
......@@ -692,3 +669,21 @@ class ProposalLayer(Layer):
def LayerSetup(self, bottom):
return dragon.ops.Proposal(bottom, **self.arguments)
class CastLayer(Layer):
"""The implementation of ``CastLayer``.
Parameters
----------
dtype : str
The target data type. Refer ``CastParameter.dtype``.
"""
def __init__(self, LayerParameter):
super(CastLayer, self).__init__(LayerParameter)
param = LayerParameter.cast_param
self.arguments = {'dtype': param.dtype.lower()}
def LayerSetup(self, bottom):
return dragon.ops.Cast(bottom, **self.arguments)
\ No newline at end of file