Commit 9d12d142 by Ting PAN

Add Model Zoo

1 parent d240a4fd
Showing with 1877 additions and 1614 deletions
[flake8]
max-line-length = 120
ignore = E741, # ambiguous variable name
F403, # ‘from module import *’ used; unable to detect undefined names
F405, # name may be undefined, or defined from star imports: module
F811, # redefinition of unused name from line N
F821, # undefined name
W503, # line break before binary operator
W504, # line break after binary operator
# module imported but unused
per-file-ignores = __init__.py: F401
exclude = seetadet/utils/pycocotools
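A quick way to exercise these rules locally is to run flake8 from the repository root. The sketch below assumes flake8 is installed and that the block above lives in the repository's flake8 configuration (e.g. setup.cfg or a .flake8 file), so the settings are picked up automatically:
```bash
# Install the linter and lint the package with the settings above.
# max-line-length, ignore and per-file-ignores are read from the config file;
# seetadet/utils/pycocotools is excluded as configured.
pip install flake8
flake8 seetadet
```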
...@@ -43,8 +43,13 @@ __pycache__
# VSCode files
.vscode
# IDEA files
.idea
# OSX dir files
.DS_Store
# Android files
.gradle
*.iml
local.properties
------------------------------------------------------------------------
The list of most significant changes made over time in SeetaDet.
SeetaDet 0.4.3 (20200724)
Dragon Minimum Required (Version 0.3.0.dev20200723)
Changes:
- Adapt to the latest dragon preview version.
Preview Features:
- None
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.4.2 (20200707)
Dragon Minimum Required (Version 0.3.0.dev20200707)
Changes:
- Adapt to the latest dragon preview version.
Preview Features:
- None
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.4.1 (20200421)
Dragon Minimum Required (Version 0.3.0.dev20200421)
Changes:
- Queue testing images on demand instead of reading them all at once.
Preview Features:
- None
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.4.0 (20200408)
Dragon Minimum Required (Version 0.3.0.dev20200408)
Changes:
Preview Features:
- Optimize the code structure.
- DALI support for SSD, RetinaNet, and Faster R-CNN.
- Use KPLRecord instead of SeetaRecord.
Bugs fixed:
- Fix the frozen Affine issue.
------------------------------------------------------------------------
SeetaDet 0.3.0 (20191121)
Dragon Minimum Required (Version 0.3.0.dev20191121)
Changes:
Preview Features:
- New algorithm: Mask R-CNN.
- Add MobileNet (V2 and NAS) as backbones.
- Refactor the testing module; multi-GPU testing is supported.
Bugs fixed:
- Remove rotated boxes; use Mask R-CNN instead.
------------------------------------------------------------------------
SeetaDet 0.2.3 (20191101)
Dragon Minimum Required (Version 0.3.0.dev20191021)
Changes:
Preview Features:
- Refactor the API of rotated boxes.
- Simplify the solver by adding LRScheduler.
- Change the ``ITER`` naming to ``STEP``.
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.2.2 (20191021)
Dragon Minimum Required (Version 0.3.0.dev20191021)
Changes:
Preview Features:
- Add the dumping of detection results.
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.2.1 (20191017)
Dragon Minimum Required (Version 0.3.0.dev20191017)
Changes:
Preview Features:
- Rotated boxes and FPN support for SSD.
- Freeze the graph to speed up inference.
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.2.0 (20190929)
Dragon Minimum Required (Version 0.3.0.dev20190929)
Changes:
Preview Features:
- Use SeetaRecord instead of LMDB.
- Flatten the implementation of layers.
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.1.2 (20190723)
Dragon Minimum Required (Version 0.3.0.0)
Changes:
Preview Features:
- Change to the PEP8 code style.
- Adapt to the new Dragon API.
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.1.1 (20190409)
Dragon Minimum Required (Version 0.3.0.0)
Changes:
Preview Features:
- Add RandomCrop/RandomPad for ScaleJittering.
- Add ResNet18/ResNet34/AirNet for R-CNN and RetinaNet.
- Use the C++ implemented decoder for RetinaNet instead.
Bugs fixed:
- None
------------------------------------------------------------------------
SeetaDet 0.1.0 (20190314)
Dragon Minimum Required (Version 0.3.0.0)
Changes:
Preview Features:
- Init repository.
Bugs fixed:
- None
Copyright (c) 2017, SeetaTech, Co.,Ltd. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Benchmark and Model Zoo
## Introduction
### ImageNet Pretrained Models
#### ResNet Models
- [R-50.pkl](https://dragon.seetatech.com/download/models/seetadet/imagenet/R-50.pkl)
- [R-101.pkl](https://dragon.seetatech.com/download/models/seetadet/imagenet/R-101.pkl)
#### VGG Models
- [VGG16.SSD.pkl](https://dragon.seetatech.com/download/models/seetadet/imagenet/VGG16.SSD.pkl)
#### MobileNet Models
- [MobileNetV2.pkl](https://dragon.seetatech.com/download/models/seetadet/imagenet/MobileNetV2.pkl)
- [ProxylessMobile.pkl](https://dragon.seetatech.com/download/models/seetadet/imagenet/ProxylessMobile.pkl)
#### AirNet Models
- [AirNet.pkl](https://dragon.seetatech.com/download/models/seetadet/imagenet/AirNet.pkl)
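The YAML baselines below reference these weights through ``TRAIN.WEIGHTS`` (for example ``'/model/R-50.pkl'``). A minimal sketch for fetching a backbone into that location, assuming ``/model`` is writable on your machine:
```bash
# Download the ImageNet ResNet-50 weights to the path the example configs expect.
mkdir -p /model
wget -P /model https://dragon.seetatech.com/download/models/seetadet/imagenet/R-50.pkl
```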
## Baselines
### Faster R-CNN
Please refer to [Faster R-CNN](configs/faster_rcnn) for details.
### Mask R-CNN
Please refer to [Mask R-CNN](configs/mask_rcnn) for details.
### RetinaNet
Please refer to [RetinaNet](configs/retinanet) for details.
### SSD
Please refer to [SSD](configs/ssd) for details.
# SeetaDet

SeetaDet is a platform implementing popular object detection algorithms.
This repository is based on [seeta-dragon](https://github.com/seetaresearch/dragon),
while the code style follows PyTorch.
The torch-style code helps us simplify the hierarchical pipeline of modern detection.

## Requirements

seeta-dragon >= 0.3.0.dev20201014

## Installation

### Build From Source

If you prefer to develop modules as well as run experiments,
the following command builds the package without installing it to ***site-packages***:

```bash
cd seetadet && python setup.py build
```

### Install From Source

Clone this repository to local disk and install:

```bash
cd seetadet && python setup.py install
```

### Install From Git

You can also install it from the remote repository:

...@@ -45,16 +40,16 @@ pip install git+https://gitlab.seetatech.com/seetaresearch/seetadet.git@master

## Quick Start

### Train a detection model

```bash
cd tools
python train.py --cfg <MODEL_YAML>
```

We have provided default YAML examples in [configs](configs).
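For example, a single-node run of the COCO R-50-FPN 1x Faster R-CNN baseline could be launched as below; this is a sketch that assumes the dataset records and the pretrained backbone already sit at the paths written in that config:
```bash
# Train with one of the provided YAML files; snapshots are written according
# to SOLVER.SNAPSHOT_PREFIX and SOLVER.SNAPSHOT_EVERY in the config.
cd tools
python train.py --cfg ../configs/faster_rcnn/coco_faster_rcnn_R-50-FPN_800_1x.yml
```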
### Test a detection model

```bash
cd tools
...@@ -64,42 +59,33 @@ Or

```bash
cd tools
python test_all.py --cfg <MODEL_YAML> --exp_dir <EXP_DIR> --last 1
```

### Export a detection model to ONNX

```bash
cd tools
python export.py --cfg <MODEL_YAML> --exp_dir <EXP_DIR> --iter <ITERATION>
```
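Concretely, exporting the final snapshot of the 1x COCO Faster R-CNN run above might look like the following; ``<EXP_DIR>`` is the experiment directory produced by training, and 90000 is simply the ``MAX_STEPS`` value of that config:
```bash
# Export the checkpoint written at step 90000 to an ONNX model.
cd tools
python export.py \
    --cfg ../configs/faster_rcnn/coco_faster_rcnn_R-50-FPN_800_1x.yml \
    --exp_dir <EXP_DIR> \
    --iter 90000
```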
## Benchmark and Model Zoo

Results and models are available in the [Model Zoo](MODEL_ZOO.md).

### Supported Backbones

- [ResNet](MODEL_ZOO.md#resnet-models)
- [VGG](MODEL_ZOO.md#vgg-models)
- [MobileNet](MODEL_ZOO.md#mobilenet-models)
- [AirNet](MODEL_ZOO.md#airnet-models)

### Supported Algorithms

- [Faster R-CNN](configs/faster_rcnn)
- [Mask R-CNN](configs/mask_rcnn)
- [SSD](configs/ssd)
- [RetinaNet](configs/retinanet)

## License

[BSD 2-Clause license](LICENSE)
# Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
## Introduction
```
@article{Ren_2017,
title={Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
publisher={Institute of Electrical and Electronics Engineers (IEEE)},
author={Ren, Shaoqing and He, Kaiming and Girshick, Ross and Sun, Jian},
year={2017},
month={Jun},
}
```
## COCO Object Detection Baselines
| Model | Lr sched | Infer time (s/im) | box AP | Download |
| :---: | :------: | :---------------: | :----: | :------: |
| [R-50-FPN-800](coco_faster_rcnn_R-50-FPN_800_1x.yml) | 1x | 0.046 | 38.3 | [model](https://dragon.seetatech.com/download/models/seetadet/faster_rcnn/coco_faster_rcnn_R-50-FPN_800_1x/model_final.pkl) |
| [R-50-FPN-800](coco_faster_rcnn_R-50-FPN_800_2x.yml) | 2x | 0.046 | 39.7 | [model](https://dragon.seetatech.com/download/models/seetadet/faster_rcnn/coco_faster_rcnn_R-50-FPN_800_2x/model_final.pkl) |
## Pascal VOC Object Detection Baselines
| Model | Infer time (s/im) | AP@0.5 | Download |
| :---: | :---------------: | :----: | :------: |
| [R-50-FPN-640](voc_faster_rcnn_R-50-FPN_640.yml) | 0.030 | 80.8 | [model](https://dragon.seetatech.com/download/models/seetadet/faster_rcnn/voc_faster_rcnn_R-50-FPN_640_1x/model_final.pkl) |
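To evaluate a released checkpoint without retraining, one option is to place ``model_final.pkl`` into an experiment directory and point ``test_all.py`` at it; this is a hedged sketch (only the download URL and the ``test_all.py`` flags come from this repository, the directory layout is an assumption):
```bash
# Fetch the released VOC Faster R-CNN weights and evaluate them.
mkdir -p <EXP_DIR>
wget -P <EXP_DIR> https://dragon.seetatech.com/download/models/seetadet/faster_rcnn/voc_faster_rcnn_R-50-FPN_640_1x/model_final.pkl
cd tools
python test_all.py --cfg ../configs/faster_rcnn/voc_faster_rcnn_R-50-FPN_640.yml --exp_dir <EXP_DIR>
```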
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: faster_rcnn
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
...@@ -19,30 +19,28 @@ MODEL:
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
SOLVER:
BASE_LR: 0.02
LR_POLICY: steps_with_decay
DECAY_STEPS: [60000, 80000]
MAX_STEPS: 90000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: coco_faster_rcnn_R-50-FPN_800_1x
FRCNN:
BATCH_SIZE: 512
ROI_XFORM_RESOLUTION: 7
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 2
SCALES: [640, 672, 704, 736, 768, 800]
MAX_SIZE: 1333
USE_DIFF: False  # Do not use crowd objects
TEST:
DATASET: '/data/coco_2014_minival'
JSON_FILE: '/data/instances_minival2014.json'
PROTOCOL: 'coco'
IMS_PER_BATCH: 1
SCALES: [800]
MAX_SIZE: 1333
NMS: 0.5
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: faster_rcnn
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
...@@ -19,29 +19,28 @@ MODEL:
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
SOLVER:
BASE_LR: 0.02
LR_POLICY: steps_with_decay
DECAY_STEPS: [120000, 160000]
MAX_STEPS: 180000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: coco_faster_rcnn_R-50-FPN_800_2x
FRCNN:
BATCH_SIZE: 512
ROI_XFORM_RESOLUTION: 7
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 2
SCALES: [640, 672, 704, 736, 768, 800]
MAX_SIZE: 1333
USE_DIFF: False  # Do not use crowd objects
TEST:
DATASET: '/data/coco_2014_minival'
JSON_FILE: '/data/instances_minival2014.json'
PROTOCOL: 'coco'
IMS_PER_BATCH: 1
SCALES: [800]
MAX_SIZE: 1333
NMS: 0.5
NUM_GPUS: 1
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: faster_rcnn
BACKBONE: resnet50.fpn
...@@ -10,27 +10,26 @@ MODEL:
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
FRCNN:
BATCH_SIZE: 128
ROI_XFORM_RESOLUTION: 7
SOLVER:
BASE_LR: 0.002
DECAY_STEPS: [80000, 100000]
MAX_STEPS: 120000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_faster_rcnn_R-50-FPN_640
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 2
SCALES: [480, 512, 544, 576, 608, 640]
MAX_SIZE: 1066
USE_COLOR_JITTER: True
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007'
IMS_PER_BATCH: 1
SCALES: [640]
MAX_SIZE: 1066
NMS: 0.45
NUM_GPUS: 1
VIS: False
ENABLE_TENSOR_BOARD: False
MODEL:
TYPE: faster_rcnn
BACKBONE: vgg16.c4
CLASSES: ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
NUM_CLASSES: 21
SOLVER:
BASE_LR: 0.001
WEIGHT_DECAY: 0.0005
DECAY_STEPS: [100000, 140000]
MAX_STEPS: 140000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_faster_rcnn
RPN:
STRIDES: [16]
SCALES: [8, 16, 32] # RField: [128, 256, 512]
ASPECT_RATIOS: [0.5, 1.0, 2.0]
FRCNN:
ROI_XFORM_METHOD: RoIPool
ROI_XFORM_RESOLUTION: 7
MLP_HEAD_DIM: 4096
TRAIN:
WEIGHTS: '/model/VGG16.RCNN.pth'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 2
BATCH_SIZE: 128
SCALES: [600]
MAX_SIZE: 1000
RPN_MIN_SIZE: 16
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007' # 'voc2007', 'voc2010', 'coco'
SCALES: [600]
MAX_SIZE: 1000
RPN_MIN_SIZE: 16
NMS: 0.45
RPN_POST_NMS_TOP_N: 300
\ No newline at end of file
# Mask R-CNN
## Introduction
```
@article{He_2017,
title={Mask R-CNN},
journal={2017 IEEE International Conference on Computer Vision (ICCV)},
publisher={IEEE},
author={He, Kaiming and Gkioxari, Georgia and Dollar, Piotr and Girshick, Ross},
year={2017},
month={Oct}
}
```
## COCO Instance Segmentation Baselines
| Model | Lr sched | Infer time (s/im) | box AP | mask AP | Download |
| :---: | :------: | :---------------: | :----: | :-----: | :------: |
| [R-50-FPN-800](coco_mask_rcnn_R-50-FPN_800_1x.yml) | 1x | 0.056 | 39.2 | 34.8 | [model](https://dragon.seetatech.com/download/models/seetadet/mask_rcnn/coco_mask_rcnn_R-50-FPN_800_1x/model_final.pkl) |
| [R-50-FPN-800](coco_mask_rcnn_R-50-FPN_800_2x.yml) | 2x | 0.056 | 41.4 | 36.5 | [model](https://dragon.seetatech.com/download/models/seetadet/mask_rcnn/coco_mask_rcnn_R-50-FPN_800_2x/model_final.pkl) |
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: mask_rcnn
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
...@@ -19,25 +19,22 @@ MODEL:
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
SOLVER:
BASE_LR: 0.02
DECAY_STEPS: [60000, 80000]
MAX_STEPS: 90000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: coco_mask_rcnn_R-50-FPN_800_1x
FRCNN:
BATCH_SIZE: 512
ROI_XFORM_RESOLUTION: 7
MRCNN:
ROI_XFORM_RESOLUTION: 14
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 2
SCALES: [640, 672, 704, 736, 768, 800]
MAX_SIZE: 1333
USE_DIFF: False  # Do not use crowd objects
TEST:
...@@ -47,5 +44,3 @@ TEST:
SCALES: [800]
MAX_SIZE: 1333
NMS: 0.5
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: mask_rcnn
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
...@@ -19,25 +19,22 @@ MODEL:
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
SOLVER:
BASE_LR: 0.02
DECAY_STEPS: [120000, 160000]
MAX_STEPS: 180000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: coco_mask_rcnn_R-50-FPN_800_2x
FRCNN:
BATCH_SIZE: 512
ROI_XFORM_RESOLUTION: 7
MRCNN:
ROI_XFORM_RESOLUTION: 14
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 2
SCALES: [640, 672, 704, 736, 768, 800]
MAX_SIZE: 1333
USE_DIFF: False  # Do not use crowd objects
TEST:
...@@ -47,4 +44,3 @@ TEST:
SCALES: [800]
MAX_SIZE: 1333
NMS: 0.5
# Focal Loss for Dense Object Detection
## Introduction
```
@inproceedings{lin2017focal,
title={Focal loss for dense object detection},
author={Lin, Tsung-Yi and Goyal, Priya and Girshick, Ross and He, Kaiming and Doll{\'a}r, Piotr},
booktitle={Proceedings of the IEEE international conference on computer vision},
year={2017}
}
```
## COCO Object Detection Baselines
| Model | Lr sched | Infer time (s/im) | box AP | Download |
| :---: | :------: | :---------------: | :----: | :------: |
| [R-50-FPN-416](coco_retinanet_R-50-FPN_416_6x.yml) | 6x | 0.019 | 34.4 | [model](https://dragon.seetatech.com/download/models/seetadet/retinanet/coco_retinanet_R-50-FPN_416_6x/model_final.pkl) |
| [R-50-FPN-512](coco_retinanet_R-50-FPN_512_6x.yml) | 6x | 0.022 | 36.4 | [model](https://dragon.seetatech.com/download/models/seetadet/retinanet/coco_retinanet_R-50-FPN_512_6x/model_final.pkl) |
| [R-50-FPN-800](coco_retinanet_R-50-FPN_800_1x.yml) | 1x | 0.051 | 37.4 | [model](https://dragon.seetatech.com/download/models/seetadet/retinanet/coco_retinanet_R-50-FPN_800_1x/model_final.pkl) |
| [R-50-FPN-800](coco_retinanet_R-50-FPN_800_2x.yml) | 2x | 0.051 | 39.1 | [model](https://dragon.seetatech.com/download/models/seetadet/retinanet/coco_retinanet_R-50-FPN_800_2x/model_final.pkl) |
## Pascal VOC Object Detection Baselines
| Model | Infer time (s/im) | AP@0.5 | Download |
| :---: | :---------------: | :----: | :------: |
| [R-50-FPN-416](voc_retinanet_R-50-FPN_416.yml) | 0.015 | 82.3 | [model](https://dragon.seetatech.com/download/models/seetadet/retinanet/voc_retinanet_R-50-FPN_416/model_final.pkl) |
| [R-50-FPN-512](voc_retinanet_R-50-FPN_512.yml) | 0.017 | 83.0 | [model](https://dragon.seetatech.com/download/models/seetadet/retinanet/voc_retinanet_R-50-FPN_512/model_final.pkl) |
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: retinanet
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench',
'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant',
'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife',
'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli',
'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 7
SOLVER:
BASE_LR: 0.01
LR_POLICY: steps_with_decay
DECAY_STEPS: [90000, 120000]
MAX_STEPS: 135000
SNAPSHOT_EVERY: 2500
SNAPSHOT_PREFIX: coco_retinanet_R-50-FPN_416_6x
PIPELINE:
TYPE: 'ssd'
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 8
SCALES: [416]
USE_DIFF: False # Do not use crowd objects
TEST:
DATASET: '/data/coco_2014_minival'
JSON_FILE: '/data/instances_minival2014.json'
PROTOCOL: 'coco'
IMS_PER_BATCH: 1
SCALES: [416]
NMS: 0.5
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: retinanet
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench',
'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant',
'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife',
'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli',
'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 7
SOLVER:
BASE_LR: 0.01
LR_POLICY: steps_with_decay
DECAY_STEPS: [90000, 120000]
MAX_STEPS: 135000
SNAPSHOT_EVERY: 2500
SNAPSHOT_PREFIX: coco_retinanet_R-50-FPN_512_6x
PIPELINE:
TYPE: 'ssd'
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 8
SCALES: [512]
USE_DIFF: False # Do not use crowd objects
TEST:
DATASET: '/data/coco_2014_minival'
JSON_FILE: '/data/instances_minival2014.json'
PROTOCOL: 'coco'
IMS_PER_BATCH: 1
SCALES: [512]
NMS: 0.5
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: retinanet
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench',
'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant',
'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife',
'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli',
'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 7
SOLVER:
BASE_LR: 0.01
LR_POLICY: steps_with_decay
DECAY_STEPS: [60000, 80000]
MAX_STEPS: 90000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: coco_retinanet_R-50-FPN_800_1x
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 2
SCALES: [640, 672, 704, 736, 768, 800]
MAX_SIZE: 1333
USE_DIFF: False # Do not use crowd objects
TEST:
DATASET: '/data/coco_2014_minival'
JSON_FILE: '/data/instances_minival2014.json'
PROTOCOL: 'coco'
IMS_PER_BATCH: 1
SCALES: [800]
MAX_SIZE: 1333
NMS: 0.5
NUM_GPUS: 8
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: retinanet
BACKBONE: resnet50.fpn
...@@ -19,28 +19,28 @@ MODEL:
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 7
SOLVER:
BASE_LR: 0.01
LR_POLICY: steps_with_decay
DECAY_STEPS: [120000, 160000]
MAX_STEPS: 180000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: coco_retinanet_R-50-FPN_800_2x
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/coco_2014_trainval35k'
IMS_PER_BATCH: 2
SCALES: [640, 672, 704, 736, 768, 800]
MAX_SIZE: 1333
USE_DIFF: False  # Do not use crowd objects
TEST:
DATASET: '/data/coco_2014_minival'
JSON_FILE: '/data/instances_minival2014.json'
PROTOCOL: 'coco'
IMS_PER_BATCH: 1
SCALES: [800]
MAX_SIZE: 1333
NMS: 0.5
NUM_GPUS: 1
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: retinanet
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 7
RETINANET:
NUM_CONVS: 2
SOLVER:
BASE_LR: 0.01
DECAY_STEPS: [80000, 100000]
MAX_STEPS: 120000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_retinanet_R-50-FPN_416
PIPELINE:
TYPE: 'ssd'
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 16
SCALES: [416]
RANDOM_SCALES: [0.25, 1.0]
USE_COLOR_JITTER: True
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007'
IMS_PER_BATCH: 1
SCALES: [416]
NMS: 0.45
RETINANET_PRE_NMS_TOP_N: 1000
NUM_GPUS: 2
PIXEL_STDS: [57.375, 57.12, 58.395]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: retinanet
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 7
RETINANET:
NUM_CONVS: 2
SOLVER:
BASE_LR: 0.01
DECAY_STEPS: [80000, 100000]
MAX_STEPS: 120000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_retinanet_R-50-FPN_512
PIPELINE:
TYPE: 'ssd'
TRAIN:
WEIGHTS: '/model/R-50.pkl'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 8
SCALES: [512]
RANDOM_SCALES: [0.25, 1.0]
USE_COLOR_JITTER: True
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007'
IMS_PER_BATCH: 1
SCALES: [512]
NMS: 0.45
RETINANET_PRE_NMS_TOP_N: 1000
# SSD: Single Shot MultiBox Detector
## Introduction
```
@article{Liu_2016,
title={SSD: Single Shot MultiBox Detector},
journal={ECCV},
author={Liu, Wei and Anguelov, Dragomir and Erhan, Dumitru and Szegedy, Christian and Reed, Scott and Fu, Cheng-Yang and Berg, Alexander C.},
year={2016},
}
```
## Pascal VOC Object Detection Baselines
| Model | Infer time (s/im) | AP@0.5 | Download |
| :---: | :---------------: | :----: | :------: |
| [VGG-16-300](voc_ssd_VGG-16_300.yml) | 0.012 | 78.3 | [model](https://dragon.seetatech.com/download/models/seetadet/ssd/voc_ssd_VGG-16_300/model_final.pkl) |
| [VGG-16-512](voc_ssd_VGG-16_512.yml) | 0.021 | 80.1 | [model](https://dragon.seetatech.com/download/models/seetadet/ssd/voc_ssd_VGG-16_512/model_final.pkl) |
NUM_GPUS: 1
VIS: False
ENABLE_TENSOR_BOARD: False
MODEL:
TYPE: ssd
BACKBONE: airnet.fpn
CLASSES: ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
NUM_CLASSES: 21
SOLVER:
BASE_LR: 0.001
DECAY_STEPS: [80000, 100000, 120000]
MAX_STEPS: 120000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_ssd_320
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 8
SSD:
NUM_CONVS: 2
MULTIBOX:
STRIDES: [8, 16, 32, 64, 100, 300]
MIN_SIZES: [30, 60, 110, 162, 213, 264]
MAX_SIZES: [60, 110, 162, 213, 264, 315]
ASPECT_RATIOS: [
[1, 2, 0.5],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5],
[1, 2, 0.5],
]
TRAIN:
WEIGHTS: '/model/AirNet.Affine.pth'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 32
SCALES: [320]
RANDOM_SCALES: [0.25, 1.00]
USE_COLOR_JITTER: True
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007' # 'voc2007', 'voc2010', 'coco'
IMS_PER_BATCH: 8
SCALES: [320]
NMS_TOP_K: 400
NMS: 0.45
SCORE_THRESH: 0.01
DETECTIONS_PER_IM: 200
\ No newline at end of file
NUM_GPUS: 1
VIS: False
ENABLE_TENSOR_BOARD: False
MODEL:
TYPE: ssd
BACKBONE: resnet50.fpn
CLASSES: ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
NUM_CLASSES: 21
FPN:
RPN_MIN_LEVEL: 3
RPN_MAX_LEVEL: 8
SOLVER:
BASE_LR: 0.001
DECAY_STEPS: [80000, 100000, 120000]
MAX_STEPS: 120000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_ssd_320
SSD:
NUM_CONVS: 2
MULTIBOX:
STRIDES: [8, 16, 32, 64, 100, 300]
MIN_SIZES: [30, 60, 110, 162, 213, 264]
MAX_SIZES: [60, 110, 162, 213, 264, 315]
ASPECT_RATIOS: [
[1, 2, 0.5],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5],
[1, 2, 0.5]
]
TRAIN:
WEIGHTS: '/model/R-50.Affine.pth'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 32
SCALES: [320]
RANDOM_SCALES: [0.25, 1.00]
USE_COLOR_JITTER: True
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007' # 'voc2007', 'voc2010', 'coco'
IMS_PER_BATCH: 8
SCALES: [320]
NMS_TOP_K: 400
NMS: 0.45
SCORE_THRESH: 0.01
DETECTIONS_PER_IM: 200
NUM_GPUS: 1
PIXEL_STDS: [1.0, 1.0, 1.0]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: ssd
BACKBONE: vgg16_reduced_300
COARSEST_STRIDE: 0
CLASSES: ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
SSD:
STRIDES: [8, 16, 32, 64, 100, 300]
ANCHOR_SIZES: [[30, 60],
[60, 110],
[110, 162],
[162, 213],
[213, 264],
[264, 315]]
ASPECT_RATIOS: [[1, 2, 0.5],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5],
[1, 2, 0.5]]
SOLVER:
BASE_LR: 0.001
WEIGHT_DECAY: 0.0005
DECAY_STEPS: [80000, 100000]
MAX_STEPS: 120000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_ssd_VGG-16_300
TRAIN:
WEIGHTS: '/model/VGG16.SSD.pkl'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 16
SCALES: [300]
RANDOM_SCALES: [0.25, 1.0]
USE_COLOR_JITTER: True
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007'
IMS_PER_BATCH: 1
SCALES: [300]
NMS: 0.45
SCORE_THRESH: 0.01
NUM_GPUS: 2
PIXEL_STDS: [1.0, 1.0, 1.0]
PIXEL_MEANS: [103.53, 116.28, 123.675]
MODEL:
TYPE: ssd
BACKBONE: vgg16_reduced_512
CLASSES: ['__background__',
'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair',
'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor']
SSD:
STRIDES: [8, 16, 32, 64, 128, 256, 512]
ANCHOR_SIZES: [[35.84, 76.8],
[76.8, 153.6],
[153.6, 230.4],
[230.4, 307.2],
[307.2, 384.0],
[384.0, 460.8],
[460.8, 537.6]]
ASPECT_RATIOS: [[1, 2, 0.5],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5, 3, 0.33],
[1, 2, 0.5],
[1, 2, 0.5]]
SOLVER:
BASE_LR: 0.001
WEIGHT_DECAY: 0.0005
DECAY_STEPS: [80000, 100000]
MAX_STEPS: 120000
SNAPSHOT_EVERY: 5000
SNAPSHOT_PREFIX: voc_ssd_VGG-16_512
TRAIN:
WEIGHTS: '/model/VGG16.SSD.pkl'
DATASET: '/data/voc_0712_trainval'
IMS_PER_BATCH: 8
SCALES: [512]
RANDOM_SCALES: [0.25, 1.0]
USE_COLOR_JITTER: True
TEST:
DATASET: '/data/voc_2007_test'
PROTOCOL: 'voc2007'
IMS_PER_BATCH: 1
SCALES: [512]
NMS: 0.45
SCORE_THRESH: 0.01
...@@ -7,7 +7,6 @@ template <class Context>
template <typename T>
void NonMaxSuppressionOp<Context>::DoRunWithType() {
  int num_selected;
  utils::detection::ApplyNMS(
      Output(0)->count(),
      Output(0)->count(),
...@@ -16,7 +15,6 @@ void NonMaxSuppressionOp<Context>::DoRunWithType() {
      Output(0)->template mutable_data<int64_t, CPUContext>(),
      num_selected,
      ctx());
  Output(0)->Reshape({num_selected});
}
...@@ -24,14 +22,13 @@ template <class Context>
void NonMaxSuppressionOp<Context>::RunOnDevice() {
  CHECK(Input(0).ndim() == 2 && Input(0).dim(1) == 5)
      << "\nThe dimensions of boxes should be (num_boxes, 5).";
  Output(0)->Reshape({Input(0).dim(0)});
  DispatchHelper<TensorTypes<float>>::Call(this, Input(0));
}

DEPLOY_CPU_OPERATOR(NonMaxSuppression);
#ifdef USE_CUDA
DEPLOY_CUDA_OPERATOR(NonMaxSuppression);
#endif

OPERATOR_SCHEMA(NonMaxSuppression).NumInputs(1).NumOutputs(1);
......
...@@ -22,7 +22,7 @@ class NonMaxSuppressionOp final : public Operator<Context> {
 public:
  NonMaxSuppressionOp(const OperatorDef& def, Workspace* ws)
      : Operator<Context>(def, ws),
        iou_threshold_(OP_SINGLE_ARG(float, "iou_threshold", 0.5f)) {}
  USE_OPERATOR_FUNCTIONS;

  void RunOnDevice() override;
......
...@@ -10,50 +10,48 @@ template <typename T>
void RetinaNetDecoderOp<Context>::DoRunWithType() {
  using BT = float;  // DType of BBox
  using BC = CPUContext;  // Context of BBox
  int total_proposals = 0;
  auto* batch_scores = Input(SCORES).template data<T, Context>();
  auto* batch_deltas = Input(DELTAS).template data<T, BC>();
  auto* im_info = Input(IMAGE_INFO).template data<BT, BC>();
  auto* all_proposals = Output(0)->template mutable_data<BT, BC>();
  for (int im_idx = 0; im_idx < num_images_; ++im_idx) {
    BT im_h = im_info[0];
    BT im_w = im_info[1];
    BT im_scale_h = im_info[2];
    BT im_scale_w = im_info[2];
    if (Input(IMAGE_INFO).dim(1) == 4) im_scale_w = im_info[3];
    CHECK_EQ(strides_.size(), InputSize() - 3)
        << "\nGiven " << strides_.size() << " strides "
        << "and " << InputSize() - 3 << " features";
    // Select the top-k candidates as proposals
    auto num_boxes = Input(SCORES).dim(1);
    auto num_classes = Input(SCORES).dim(2);
    utils::detection::SelectProposals(
        Input(SCORES).count(1),
        score_thr_,
        batch_scores + im_idx * Input(SCORES).stride(0),
        roi_scores_,
        roi_indices_,
        ctx());
    auto num_candidates = (int)roi_scores_.size();
    auto num_proposals = std::min(num_candidates, (int)pre_nms_topn_);
    utils::detection::ArgPartition(
        num_candidates, num_proposals, true, roi_scores_.data(), indices_);
    scores_.resize(indices_.size());
    for (int i = 0; i < num_proposals; ++i) {
      scores_[i] = roi_scores_[indices_[i]];
      indices_[i] = roi_indices_[indices_[i]];
    }
    // Decode proposals via anchors
    int stride_offset = 0;
    for (int i = 0; i < strides_.size(); i++) {
      auto feature_h = Input(i).dim(2);
      auto feature_w = Input(i).dim(3);
      auto K = feature_h * feature_w;
      auto A = int(ratios_.size() * scales_.size());
      anchors_.resize((size_t)(A * 4));
      utils::detection::GenerateAnchors(
          strides_[i],
...@@ -62,35 +60,35 @@ void RetinaNetDecoderOp<Context>::DoRunWithType() {
          ratios_.data(),
          scales_.data(),
          anchors_.data());
      utils::detection::GetShiftedAnchors(
          num_proposals,
          num_classes,
          A,
          feature_h,
          feature_w,
          strides_[i],
          stride_offset,
          anchors_.data(),
          indices_.data(),
          all_proposals);
      stride_offset += (A * K);
    }
    utils::detection::GenerateDetections(
        num_proposals,
        num_boxes,
        num_classes,
        im_idx,
        im_h,
        im_w,
        im_scale_h,
        im_scale_w,
        scores_.data(),
        batch_deltas + im_idx * Input(DELTAS).stride(0),
        indices_.data(),
        all_proposals);
    total_proposals += num_proposals;
    all_proposals += (num_proposals * 7);
    im_info += Input(IMAGE_INFO).dim(1);
  }
  Output(0)->Reshape({total_proposals, 7});
...@@ -99,20 +97,20 @@ void RetinaNetDecoderOp<Context>::DoRunWithType() {
template <class Context>
void RetinaNetDecoderOp<Context>::RunOnDevice() {
  num_images_ = Input(0).dim(0);
  CHECK_EQ(Input(-1).dim(0), num_images_)
      << "\nExcepted " << num_images_ << " groups info, got "
      << Input(-1).dim(0) << ".";
  Output(0)->Reshape({num_images_ * pre_nms_topn_, 7});
  DispatchHelper<TensorTypes<float>>::Call(this, Input(SCORES));
}

DEPLOY_CPU_OPERATOR(RetinaNetDecoder);
#ifdef USE_CUDA
DEPLOY_CUDA_OPERATOR(RetinaNetDecoder);
#endif

OPERATOR_SCHEMA(RetinaNetDecoder).NumInputs(3, INT_MAX).NumOutputs(1, INT_MAX);
NO_GRADIENT(RetinaNetDecoder);

}  // namespace dragon
...@@ -22,11 +22,11 @@ class RetinaNetDecoderOp final : public Operator<Context> {
 public:
  RetinaNetDecoderOp(const OperatorDef& def, Workspace* ws)
      : Operator<Context>(def, ws),
        strides_(OP_REPEATED_ARG(int64_t, "strides")),
        ratios_(OP_REPEATED_ARG(float, "ratios")),
        scales_(OP_REPEATED_ARG(float, "scales")),
        pre_nms_topn_(OP_SINGLE_ARG(int64_t, "pre_nms_top_n", 6000)),
        score_thr_(OP_SINGLE_ARG(float, "score_thresh", 0.05f)) {}
  USE_OPERATOR_FUNCTIONS;

  void RunOnDevice() override;
...@@ -34,10 +34,13 @@ class RetinaNetDecoderOp final : public Operator<Context> {
  template <typename T>
  void DoRunWithType();

  enum INPUT_TAGS { SCORES = -3, DELTAS = -2, IMAGE_INFO = -1 };

 protected:
  float score_thr_;
  vec64_t strides_, indices_, roi_indices_;
  vector<float> ratios_, scales_, anchors_;
  vector<float> scores_, roi_scores_;
  int64_t num_images_, pre_nms_topn_;
};
......
...@@ -15,153 +15,81 @@ void RPNDecoderOp<Context>::DoRunWithType() { ...@@ -15,153 +15,81 @@ void RPNDecoderOp<Context>::DoRunWithType() {
int total_rois = 0, num_rois; int total_rois = 0, num_rois;
int num_candidates, num_proposals; int num_candidates, num_proposals;
auto* batch_scores = Input(-3).template data<T, BC>(); auto* batch_scores = Input(SCORES).template data<T, BC>();
auto* batch_deltas = Input(-2).template data<T, BC>(); auto* batch_deltas = Input(DELTAS).template data<T, BC>();
auto* im_info = Input(-1).template data<BT, BC>(); auto* im_info = Input(IMAGE_INFO).template data<BT, BC>();
auto* y = Output(0)->template mutable_data<BT, BC>(); auto* all_rois = Output(0)->template mutable_data<BT, BC>();
for (int n = 0; n < num_images_; ++n) { for (int im_idx = 0; im_idx < num_images_; ++im_idx) {
const BT im_h = im_info[0]; const BT im_h = im_info[0];
const BT im_w = im_info[1]; const BT im_w = im_info[1];
const BT scale = im_info[2]; auto* scores = batch_scores + im_idx * Input(SCORES).stride(0);
const BT min_box_h = min_size_ * scale; auto* deltas = batch_deltas + im_idx * Input(DELTAS).stride(0);
const BT min_box_w = min_size_ * scale; CHECK_EQ(strides_.size(), InputSize() - 3)
auto* scores = batch_scores + n * Input(-3).stride(0); << "\nGiven " << strides_.size() << " strides "
auto* deltas = batch_deltas + n * Input(-2).stride(0); << "and " << InputSize() - 3 << " feature inputs";
if (strides_.size() == 1) { CHECK_EQ(strides_.size(), scales_.size())
// Case 1: single stride << "\nGiven " << strides_.size() << " strides "
feat_h = Input(0).dim(2); << "and " << scales_.size() << " scales";
feat_w = Input(0).dim(3); // Select the top-k candidates as proposals
num_candidates = Input(SCORES).dim(1);
num_proposals = std::min(num_candidates, (int)pre_nms_top_n_);
utils::math::ArgPartition(
num_candidates, num_proposals, true, scores, indices_);
// Decode the candidates
int stride_offset = 0;
proposals_.Reshape({num_proposals, 5});
auto* proposals = proposals_.template mutable_data<BT, BC>();
for (int i = 0; i < strides_.size(); i++) {
feat_h = Input(i).dim(2);
feat_w = Input(i).dim(3);
K = feat_h * feat_w; K = feat_h * feat_w;
A = int(ratios_.size() * scales_.size()); A = (int)ratios_.size();
// Select the Top-K candidates as proposals
num_candidates = A * K;
num_proposals = std::min(num_candidates, (int)pre_nms_topn_);
utils::math::ArgPartition(
num_candidates, num_proposals, true, scores, indices_);
// Decode the candidates
anchors_.resize((size_t)(A * 4)); anchors_.resize((size_t)(A * 4));
proposals_.Reshape({num_proposals, 5});
utils::detection::GenerateAnchors( utils::detection::GenerateAnchors(
strides_[0], strides_[i],
(int)ratios_.size(), (int)ratios_.size(),
(int)scales_.size(), 1,
ratios_.data(), ratios_.data(),
scales_.data(), scales_.data(),
anchors_.data()); anchors_.data());
utils::detection::GenerateGridAnchors( utils::detection::GetShiftedAnchors(
num_proposals, num_proposals,
A, A,
feat_h, feat_h,
feat_w, feat_w,
strides_[0], strides_[i],
0, stride_offset,
anchors_.data(), anchors_.data(),
indices_.data(), indices_.data(),
proposals_.template mutable_data<BT, BC>());
utils::detection::GenerateSSProposals(
K,
num_proposals,
im_h,
im_w,
min_box_h,
min_box_w,
scores,
deltas,
indices_.data(),
proposals_.template mutable_data<BT, BC>());
// Sort, NMS and Retrieve
utils::detection::SortProposals(
0,
num_proposals - 1,
num_proposals,
proposals_.template mutable_data<BT, BC>());
utils::detection::ApplyNMS(
num_proposals,
post_nms_topn_,
nms_thr_,
proposals_.template mutable_data<BT, Context>(),
roi_indices_.data(),
num_rois,
ctx());
utils::detection::RetrieveRoIs(
num_rois,
n,
proposals_.template data<BT, BC>(),
roi_indices_.data(),
y);
} else if (strides_.size() > 1) {
// Case 2: multiple strides
CHECK_EQ(strides_.size(), InputSize() - 3)
<< "\nGiven " << strides_.size() << " strides "
<< "and " << InputSize() - 3 << " feature inputs";
CHECK_EQ(strides_.size(), scales_.size())
<< "\nGiven " << strides_.size() << " strides "
<< "and " << scales_.size() << " scales";
// Select the top-k candidates as proposals
num_candidates = Input(-3).dim(1);
num_proposals = std::min(num_candidates, (int)pre_nms_topn_);
utils::math::ArgPartition(
num_candidates, num_proposals, true, scores, indices_);
// Decode the candidates
int base_offset = 0;
proposals_.Reshape({num_proposals, 5});
auto* proposals = proposals_.template mutable_data<BT, BC>();
for (int i = 0; i < strides_.size(); i++) {
feat_h = Input(i).dim(2);
feat_w = Input(i).dim(3);
K = feat_h * feat_w;
A = (int)ratios_.size();
anchors_.resize((size_t)(A * 4));
utils::detection::GenerateAnchors(
strides_[i],
(int)ratios_.size(),
1,
ratios_.data(),
scales_.data(),
anchors_.data());
utils::detection::GenerateGridAnchors(
num_proposals,
A,
feat_h,
feat_w,
strides_[i],
base_offset,
anchors_.data(),
indices_.data(),
proposals);
base_offset += (A * K);
}
utils::detection::GenerateMSProposals(
num_candidates,
num_proposals,
im_h,
im_w,
min_box_h,
min_box_w,
scores,
deltas,
&indices_[0],
proposals); proposals);
// Sort, NMS and Retrieve stride_offset += (A * K);
utils::detection::SortProposals(
0, num_proposals - 1, num_proposals, proposals);
utils::detection::ApplyNMS(
num_proposals,
post_nms_topn_,
nms_thr_,
proposals_.template mutable_data<BT, Context>(),
roi_indices_.data(),
num_rois,
ctx());
utils::detection::RetrieveRoIs(
num_rois, n, proposals, roi_indices_.data(), y);
} else {
LOG(FATAL) << "Expected at least one stride for proposals.";
} }
utils::detection::GenerateProposals(
num_candidates,
num_proposals,
im_h,
im_w,
scores,
deltas,
&indices_[0],
proposals);
// Sort, NMS and Retrieve
utils::detection::SortProposals(
0, num_proposals - 1, num_proposals, proposals);
utils::detection::ApplyNMS(
num_proposals,
post_nms_top_n_,
nms_thr_,
proposals_.template mutable_data<BT, Context>(),
roi_indices_.data(),
num_rois,
ctx());
utils::detection::RetrieveRoIs(
num_rois, im_idx, proposals, roi_indices_.data(), all_rois);
total_rois += num_rois; total_rois += num_rois;
y += (num_rois * 5); all_rois += (num_rois * 5);
im_info += Input(-1).dim(1); im_info += Input(IMAGE_INFO).dim(1);
} }
Output(0)->Reshape({total_rois, 5}); Output(0)->Reshape({total_rois, 5});
...@@ -202,22 +130,21 @@ void RPNDecoderOp<Context>::DoRunWithType() { ...@@ -202,22 +130,21 @@ void RPNDecoderOp<Context>::DoRunWithType() {
template <class Context> template <class Context>
void RPNDecoderOp<Context>::RunOnDevice() { void RPNDecoderOp<Context>::RunOnDevice() {
num_images_ = Input(0).dim(0); num_images_ = Input(0).dim(0);
CHECK_EQ(Input(IMAGE_INFO).dim(0), num_images_)
CHECK_EQ(Input(-1).dim(0), num_images_)
<< "\nExcepted " << num_images_ << " groups info, got " << "\nExcepted " << num_images_ << " groups info, got "
<< Input(-1).dim(0) << "."; << Input(IMAGE_INFO).dim(0) << ".";
roi_indices_.resize(post_nms_top_n_);
roi_indices_.resize(post_nms_topn_); Output(0)->Reshape({num_images_ * post_nms_top_n_, 5});
Output(0)->Reshape({num_images_ * post_nms_topn_, 5}); DispatchHelper<TensorTypes<float>>::Call(this, Input(SCORES));
DispatchHelper<TensorTypes<float>>::Call(this, Input(-3));
} }
DEPLOY_CPU(RPNDecoder); DEPLOY_CPU_OPERATOR(RPNDecoder);
#ifdef USE_CUDA #ifdef USE_CUDA
DEPLOY_CUDA(RPNDecoder); DEPLOY_CUDA_OPERATOR(RPNDecoder);
#endif #endif
OPERATOR_SCHEMA(RPNDecoder).NumInputs(3, INT_MAX).NumOutputs(1, INT_MAX); OPERATOR_SCHEMA(RPNDecoder).NumInputs(3, INT_MAX).NumOutputs(1, INT_MAX);
NO_GRADIENT(RPNDecoder);
} // namespace dragon } // namespace dragon
...@@ -22,17 +22,16 @@ class RPNDecoderOp final : public Operator<Context> { ...@@ -22,17 +22,16 @@ class RPNDecoderOp final : public Operator<Context> {
public: public:
RPNDecoderOp(const OperatorDef& def, Workspace* ws) RPNDecoderOp(const OperatorDef& def, Workspace* ws)
: Operator<Context>(def, ws), : Operator<Context>(def, ws),
strides_(OpArgs<int64_t>("strides")), strides_(OP_REPEATED_ARG(int64_t, "strides")),
ratios_(OpArgs<float>("ratios")), ratios_(OP_REPEATED_ARG(float, "ratios")),
scales_(OpArgs<float>("scales")), scales_(OP_REPEATED_ARG(float, "scales")),
pre_nms_topn_(OpArg<int64_t>("pre_nms_top_n", 6000)), pre_nms_top_n_(OP_SINGLE_ARG(int64_t, "pre_nms_top_n", 6000)),
post_nms_topn_(OpArg<int64_t>("post_nms_top_n", 300)), post_nms_top_n_(OP_SINGLE_ARG(int64_t, "post_nms_top_n", 1000)),
nms_thr_(OpArg<float>("nms_thresh", 0.7f)), nms_thr_(OP_SINGLE_ARG(float, "nms_thresh", 0.7f)),
min_size_(OpArg<int64_t>("min_size", 16)), min_level_(OP_SINGLE_ARG(int64_t, "min_level", 2)),
min_level_(OpArg<int64_t>("min_level", 2)), max_level_(OP_SINGLE_ARG(int64_t, "max_level", 5)),
max_level_(OpArg<int64_t>("max_level", 5)), canonical_level_(OP_SINGLE_ARG(int64_t, "canonical_level", 4)),
canonical_level_(OpArg<int64_t>("canonical_level", 4)), canonical_scale_(OP_SINGLE_ARG(int64_t, "canonical_scale", 224)) {}
canonical_scale_(OpArg<int64_t>("canonical_scale", 224)) {}
USE_OPERATOR_FUNCTIONS; USE_OPERATOR_FUNCTIONS;
void RunOnDevice() override; void RunOnDevice() override;
...@@ -40,11 +39,13 @@ class RPNDecoderOp final : public Operator<Context> { ...@@ -40,11 +39,13 @@ class RPNDecoderOp final : public Operator<Context> {
template <typename T> template <typename T>
void DoRunWithType(); void DoRunWithType();
enum INPUT_TAGS { SCORES = -3, DELTAS = -2, IMAGE_INFO = -1 };
protected: protected:
float nms_thr_; float nms_thr_;
vec64_t strides_, indices_, roi_indices_; vec64_t strides_, indices_, roi_indices_;
vector<float> ratios_, scales_, scores_, anchors_; vector<float> ratios_, scales_, scores_, anchors_;
int64_t min_size_, pre_nms_topn_, post_nms_topn_; int64_t pre_nms_top_n_, post_nms_top_n_;
int64_t num_images_, min_level_, max_level_; int64_t num_images_, min_level_, max_level_;
int64_t canonical_level_, canonical_scale_; int64_t canonical_level_, canonical_scale_;
Tensor proposals_; Tensor proposals_;
......
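The new INPUT_TAGS enum leans on negative input indexing, so the last three inputs are addressed the same way no matter how many feature maps precede them. A tiny illustrative Python sketch (the list names are made up):

inputs = ['feat_p2', 'feat_p3', 'feat_p4', 'scores', 'deltas', 'im_info']
SCORES, DELTAS, IMAGE_INFO = -3, -2, -1
# The last three entries are always reachable at the same negative offsets.
print(inputs[SCORES], inputs[DELTAS], inputs[IMAGE_INFO])  # scores deltas im_info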
...@@ -8,7 +8,6 @@ ...@@ -8,7 +8,6 @@
# <https://opensource.org/licenses/BSD-2-Clause> # <https://opensource.org/licenses/BSD-2-Clause>
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
"""Build cxx sources.""" """Build cxx sources."""
from __future__ import absolute_import from __future__ import absolute_import
...@@ -16,14 +15,14 @@ from __future__ import division ...@@ -16,14 +15,14 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import glob import glob
from distutils.core import setup
from dragon.tools import cpp_extension from dragon.tools import cpp_extension
if cpp_extension.CUDA_HOME is not None and \ from setuptools import setup
cpp_extension._cuda.is_available():
Extension = cpp_extension.CUDAExtension Extension = cpp_extension.CppExtension
else: if cpp_extension.CUDA_HOME is not None:
Extension = cpp_extension.CppExtension if cpp_extension._cuda.is_available():
Extension = cpp_extension.CUDAExtension
def find_sources(*dirs): def find_sources(*dirs):
...@@ -44,11 +43,12 @@ ext_modules = [ ...@@ -44,11 +43,12 @@ ext_modules = [
Extension( Extension(
name='install.lib.modules._C', name='install.lib.modules._C',
sources=find_sources('**'), sources=find_sources('**'),
define_macros=[('THRUST_IGNORE_CUB_VERSION_CHECK', None)],
), ),
] ]
setup( setup(
name='SeetaDet', name='SeetaDet',
ext_modules=ext_modules, ext_modules=ext_modules,
cmdclass={'build_ext': cpp_extension.BuildExtension} cmdclass={'build_ext': cpp_extension.BuildExtension},
) )
...@@ -47,6 +47,26 @@ void ApplyNMS<float, CPUContext>( ...@@ -47,6 +47,26 @@ void ApplyNMS<float, CPUContext>(
num_keep = count; num_keep = count;
} }
template <>
void SelectProposals<float, CPUContext>(
const int count,
const float score_thresh,
const float* input_scores,
vector<float>& output_scores,
vector<int64_t>& output_indices,
CPUContext* ctx) {
int num_proposals = 0;
output_indices.resize(count);  // Presize so indices can be written in place.
for (int i = 0; i < count; ++i) {
if (input_scores[i] > score_thresh) {
output_indices[num_proposals++] = i;
}
}
output_scores.resize(num_proposals);
output_indices.resize(num_proposals);  // Shrink to the kept candidates.
for (int i = 0; i < num_proposals; ++i) {
output_scores[i] = input_scores[output_indices[i]];
}
}
} // namespace detection } // namespace detection
} // namespace utils } // namespace utils
......
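A rough NumPy equivalent of the SelectProposals CPU path above, assuming scores arrive as a flat float array; the function name here is illustrative only:

import numpy as np

def select_proposals(scores, score_thresh):
    # Keep only candidates whose score exceeds the threshold and remember
    # where they came from, mirroring the CPU SelectProposals above.
    indices = np.nonzero(scores > score_thresh)[0]
    return scores[indices], indices

scores = np.array([0.10, 0.92, 0.05, 0.71], dtype=np.float32)
kept_scores, kept_indices = select_proposals(scores, 0.5)
print(kept_scores, kept_indices)  # [0.92 0.71] [1 3]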
#ifdef USE_CUDA #ifdef USE_CUDA
#include <dragon/core/context_cuda.h> #include <dragon/core/context_cuda.h>
#include <dragon/core/workspace.h>
#include <dragon/utils/device/common_cub.h>
#include <dragon/utils/device/common_thrust.h>
#include "detection_utils.h" #include "detection_utils.h"
namespace dragon { namespace dragon {
...@@ -15,6 +18,16 @@ namespace detection { ...@@ -15,6 +18,16 @@ namespace detection {
namespace { namespace {
template <typename T> template <typename T>
struct ThresholdFunctor {
ThresholdFunctor(float thresh) : thresh_(thresh) {}
inline __device__ bool operator()(
const thrust::tuple<int64_t, T>& key_val) const {
return thrust::get<1>(key_val) > thresh_;
}
float thresh_;
};
template <typename T>
__device__ bool _CheckIoU(const T* a, const T* b, const float thresh) { __device__ bool _CheckIoU(const T* a, const T* b, const float thresh) {
const T x1 = max(a[0], b[0]); const T x1 = max(a[0], b[0]);
const T y1 = max(a[1], b[1]); const T y1 = max(a[1], b[1]);
...@@ -72,6 +85,41 @@ __global__ void _NonMaxSuppression( ...@@ -72,6 +85,41 @@ __global__ void _NonMaxSuppression(
} // namespace } // namespace
template <> template <>
void SelectProposals<float, CUDAContext>(
const int count,
const float score_thresh,
const float* in_scores,
vector<float>& out_scores,
vector<int64_t>& out_indices,
CUDAContext* ctx) {
auto* in_indices = ctx->workspace()->template data<int64_t, CUDAContext>(
{count}, "data:1")[0];
auto iter = thrust::make_zip_iterator(
thrust::make_tuple(in_indices, const_cast<float*>(in_scores)));
auto policy = thrust::cuda::par.on(ctx->cuda_stream());
thrust::counting_iterator<int64_t> offset(0);
thrust::copy(policy, offset, offset + count, in_indices);
auto last = thrust::partition(
policy, iter, iter + count, ThresholdFunctor<float>(score_thresh));
size_t num_proposals = last - iter;
out_scores.resize(num_proposals);
out_indices.resize(num_proposals);
CUDA_CHECK(cudaMemcpyAsync(
out_scores.data(),
in_scores,
num_proposals * sizeof(float),
cudaMemcpyDeviceToHost,
ctx->cuda_stream()));
CUDA_CHECK(cudaMemcpyAsync(
out_indices.data(),
in_indices,
num_proposals * sizeof(int64_t),
cudaMemcpyDeviceToHost,
ctx->cuda_stream()));
ctx->FinishDeviceComputation();
}
template <>
void ApplyNMS<float, CUDAContext>( void ApplyNMS<float, CUDAContext>(
const int num_boxes, const int num_boxes,
const int max_keeps, const int max_keeps,
...@@ -83,7 +131,8 @@ void ApplyNMS<float, CUDAContext>( ...@@ -83,7 +131,8 @@ void ApplyNMS<float, CUDAContext>(
const int num_blocks = DIV_UP(num_boxes, NUM_THREADS); const int num_blocks = DIV_UP(num_boxes, NUM_THREADS);
vector<uint64_t> mask_host(num_boxes * num_blocks); vector<uint64_t> mask_host(num_boxes * num_blocks);
auto* mask_dev = (uint64_t*)ctx->New(mask_host.size() * sizeof(uint64_t)); auto* mask_dev = (uint64_t*)ctx->workspace()->data<CUDAContext>(
{mask_host.size() * sizeof(uint64_t)}, "data:1")[0];
_NonMaxSuppression<<< _NonMaxSuppression<<<
dim3(num_blocks, num_blocks), dim3(num_blocks, num_blocks),
...@@ -115,9 +164,7 @@ void ApplyNMS<float, CUDAContext>( ...@@ -115,9 +164,7 @@ void ApplyNMS<float, CUDAContext>(
if (num_selected == max_keeps) break; if (num_selected == max_keeps) break;
} }
} }
num_keep = num_selected; num_keep = num_selected;
ctx->Delete(mask_dev);
} }
} // namespace detection } // namespace detection
......
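For reference, a plain NumPy greedy NMS with the same semantics as ApplyNMS (keep the highest-scoring box, drop everything overlapping it above the threshold, stop at max_keeps). This is a readable sketch, not the bitmask kernel used above:

import numpy as np

def greedy_nms(boxes, scores, nms_thresh, max_keeps):
    # boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,).
    order = np.argsort(-scores)
    areas = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    keep = []
    while order.size > 0 and len(keep) < max_keeps:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1 + 1) * np.maximum(0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= nms_thresh]
    return np.array(keep)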
...@@ -24,45 +24,37 @@ namespace detection { ...@@ -24,45 +24,37 @@ namespace detection {
#define ROUND(x) ((int)((x) + (T)0.5)) #define ROUND(x) ((int)((x) + (T)0.5))
/*! /*!
* Box API * Functional API
*/ */
template <typename T> template <typename T>
inline int FilterBoxes( inline void ArgPartition(
const T dx, const int count,
const T dy, const int kth,
const T d_log_w, const bool descend,
const T d_log_h, const T* v,
const T im_w, vec64_t& indices) {
const T im_h, indices.resize(count);
const T min_box_w, std::iota(indices.begin(), indices.end(), 0);
const T min_box_h, if (descend) {
T* bbox) { std::nth_element(
const T w = bbox[2] - bbox[0] + 1; indices.begin(),
const T h = bbox[3] - bbox[1] + 1; indices.begin() + kth,
const T ctr_x = bbox[0] + (T)0.5 * w; indices.end(),
const T ctr_y = bbox[1] + (T)0.5 * h; [&v](int64_t lhs, int64_t rhs) { return v[lhs] > v[rhs]; });
} else {
const T pred_ctr_x = dx * w + ctr_x; std::nth_element(
const T pred_ctr_y = dy * h + ctr_y; indices.begin(),
const T pred_w = exp(d_log_w) * w; indices.begin() + kth,
const T pred_h = exp(d_log_h) * h; indices.end(),
[&v](int64_t lhs, int64_t rhs) { return v[lhs] < v[rhs]; });
bbox[0] = pred_ctr_x - (T)0.5 * pred_w; }
bbox[1] = pred_ctr_y - (T)0.5 * pred_h;
bbox[2] = pred_ctr_x + (T)0.5 * pred_w;
bbox[3] = pred_ctr_y + (T)0.5 * pred_h;
bbox[0] = std::max((T)0, std::min(bbox[0], im_w - 1));
bbox[1] = std::max((T)0, std::min(bbox[1], im_h - 1));
bbox[2] = std::max((T)0, std::min(bbox[2], im_w - 1));
bbox[3] = std::max((T)0, std::min(bbox[3], im_h - 1));
const T bbox_w = bbox[2] - bbox[0] + 1;
const T bbox_h = bbox[3] - bbox[1] + 1;
return (bbox_w >= min_box_w) * (bbox_h >= min_box_h);
} }
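The new ArgPartition helper is a thin wrapper over std::nth_element; np.argpartition yields the same partial ordering of indices, for example:

import numpy as np

scores = np.array([0.2, 0.9, 0.4, 0.8, 0.1], dtype=np.float32)
kth = 2
# Indices of the two largest scores, in no particular order,
# matching ArgPartition(count, kth, descend=True, ...).
top_idx = np.argpartition(-scores, kth)[:kth]
print(sorted(top_idx.tolist()))  # [1, 3]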
/*!
* Box API
*/
template <typename T> template <typename T>
inline void BBoxTransform( inline void BBoxTransform(
const T dx, const T dx,
...@@ -126,28 +118,28 @@ inline void GenerateAnchors( ...@@ -126,28 +118,28 @@ inline void GenerateAnchors(
} }
template <typename T> template <typename T>
inline void GenerateGridAnchors( inline void GetShiftedAnchors(
const int num_proposals, const int num_proposals,
const int num_anchors, const int num_anchors,
const int feat_h, const int feat_h,
const int feat_w, const int feat_w,
const int stride, const int stride,
const int base_offset, const int stride_offset,
const T* anchors, const T* base_anchors,
const int64_t* indices, const int64_t* indices,
T* proposals) { T* shifted_anchors) {
T x, y; T x, y;
int idx_3d, a, h, w; int idx_3d, a, h, w;
int idx_range = num_anchors * feat_h * feat_w; int idx_range = num_anchors * feat_h * feat_w;
for (int i = 0; i < num_proposals; ++i) { for (int i = 0; i < num_proposals; ++i) {
idx_3d = (int)indices[i] - base_offset; idx_3d = (int)indices[i] - stride_offset;
if (idx_3d >= 0 && idx_3d < idx_range) { if (idx_3d >= 0 && idx_3d < idx_range) {
w = idx_3d % feat_w; w = idx_3d % feat_w;
h = (idx_3d / feat_w) % feat_h; h = (idx_3d / feat_w) % feat_h;
a = idx_3d / feat_w / feat_h; a = idx_3d / feat_w / feat_h;
x = (T)w * stride, y = (T)h * stride; x = (T)w * stride, y = (T)h * stride;
auto* A = anchors + a * 4; auto* A = base_anchors + a * 4;
auto* P = proposals + i * 5; auto* P = shifted_anchors + i * 5;
P[0] = x + A[0], P[1] = y + A[1]; P[0] = x + A[0], P[1] = y + A[1];
P[2] = x + A[2], P[3] = y + A[3]; P[2] = x + A[2], P[3] = y + A[3];
} }
...@@ -155,20 +147,20 @@ inline void GenerateGridAnchors( ...@@ -155,20 +147,20 @@ inline void GenerateGridAnchors(
} }
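The renamed GetShiftedAnchors only materializes the grid cells hit by the selected indices; a dense NumPy version of the same shift, shown purely for illustration (the operator's internal anchor/cell ordering may differ), looks like:

import numpy as np

def dense_shifted_anchors(base_anchors, feat_h, feat_w, stride):
    # Tile the A base anchors over every (h, w) cell of the feature map,
    # shifting each by stride-spaced offsets.
    shift_x = np.arange(feat_w) * stride
    shift_y = np.arange(feat_h) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    # (K, 1, 4) + (1, A, 4) -> (K * A, 4)
    return (shifts[:, None, :] + base_anchors[None, :, :]).reshape(-1, 4)

base = np.array([[-8., -8., 8., 8.]])
print(dense_shifted_anchors(base, feat_h=2, feat_w=2, stride=16).shape)  # (4, 4)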
template <typename T> template <typename T>
inline void GenerateGridAnchors( inline void GetShiftedAnchors(
const int num_proposals, const int num_proposals,
const int num_classes, const int num_classes,
const int num_anchors, const int num_anchors,
const int feat_h, const int feat_h,
const int feat_w, const int feat_w,
const int stride, const int stride,
const int base_offset, const int stride_offset,
const T* anchors, const T* base_anchors,
const int64_t* indices, const int64_t* indices,
T* proposals) { T* shifted_anchors) {
T x, y; T x, y;
int idx_4d, a, h, w; int idx_4d, a, h, w;
int lr = num_classes * base_offset; int lr = num_classes * stride_offset;
int rr = num_classes * (num_anchors * feat_h * feat_w); int rr = num_classes * (num_anchors * feat_h * feat_w);
for (int i = 0; i < num_proposals; ++i) { for (int i = 0; i < num_proposals; ++i) {
idx_4d = (int)indices[i] - lr; idx_4d = (int)indices[i] - lr;
...@@ -178,8 +170,8 @@ inline void GenerateGridAnchors( ...@@ -178,8 +170,8 @@ inline void GenerateGridAnchors(
h = (idx_4d / feat_w) % feat_h; h = (idx_4d / feat_w) % feat_h;
a = idx_4d / feat_w / feat_h; a = idx_4d / feat_w / feat_h;
x = (T)w * stride, y = (T)h * stride; x = (T)w * stride, y = (T)h * stride;
auto* A = anchors + a * 4; auto* A = base_anchors + a * 4;
auto* P = proposals + i * 7 + 1; auto* P = shifted_anchors + i * 7 + 1;
P[0] = x + A[0], P[1] = y + A[1]; P[0] = x + A[0], P[1] = y + A[1];
P[2] = x + A[2], P[3] = y + A[3]; P[2] = x + A[2], P[3] = y + A[3];
} }
...@@ -190,22 +182,30 @@ inline void GenerateGridAnchors( ...@@ -190,22 +182,30 @@ inline void GenerateGridAnchors(
* Proposal API * Proposal API
*/ */
template <typename T, class Context>
void SelectProposals(
const int count,
const float score_thresh,
const T* input_scores,
vector<T>& output_scores,
vector<int64_t>& output_indices,
Context* ctx);
template <typename T> template <typename T>
void GenerateSSProposals( void GenerateProposals_v1(
const int K, const int K,
const int num_proposals, const int num_proposals,
const float im_h, const float im_h,
const float im_w, const float im_w,
const float min_box_h,
const float min_box_w,
const T* scores, const T* scores,
const T* deltas, const T* deltas,
const int64_t* indices, const int64_t* indices,
T* proposals) { T* proposals) {
// Shifted anchors in format: [K, A, 4]
int64_t index, a, k; int64_t index, a, k;
const float* delta; const T* delta;
float* proposal = proposals; T* proposal = proposals;
float dx, dy, d_log_w, d_log_h; T dx, dy, d_log_w, d_log_h;
for (int i = 0; i < num_proposals; ++i) { for (int i = 0; i < num_proposals; ++i) {
index = indices[i]; index = indices[i];
a = index / K, k = index % K; a = index / K, k = index % K;
...@@ -214,61 +214,42 @@ void GenerateSSProposals( ...@@ -214,61 +214,42 @@ void GenerateSSProposals(
dy = delta[(a * 4 + 1) * K]; dy = delta[(a * 4 + 1) * K];
d_log_w = delta[(a * 4 + 2) * K]; d_log_w = delta[(a * 4 + 2) * K];
d_log_h = delta[(a * 4 + 3) * K]; d_log_h = delta[(a * 4 + 3) * K];
proposal[4] = FilterBoxes( BBoxTransform(dx, dy, d_log_w, d_log_h, im_w, im_h, T(1), T(1), proposal);
dx, proposal[4] = scores[index];
dy,
d_log_w,
d_log_h,
im_w,
im_h,
min_box_w,
min_box_h,
proposal) *
scores[index];
proposal += 5; proposal += 5;
} }
} }
template <typename T> template <typename T>
void GenerateMSProposals( void GenerateProposals(
const int num_candidates, const int num_candidates,
const int num_proposals, const int num_proposals,
const float im_h, const float im_h,
const float im_w, const float im_w,
const float min_box_h,
const float min_box_w,
const T* scores, const T* scores,
const T* deltas, const T* deltas,
const int64_t* indices, const int64_t* indices,
T* proposals) { T* proposals) {
// Shifted anchors in format: [4, A, K]
int64_t index; int64_t index;
int64_t num_candidates_2x = 2 * num_candidates; int64_t num_candidates_2x = 2 * num_candidates;
int64_t num_candidates_3x = 3 * num_candidates; int64_t num_candidates_3x = 3 * num_candidates;
float* proposal = proposals; T* proposal = proposals;
float dx, dy, d_log_w, d_log_h; T dx, dy, d_log_w, d_log_h;
for (int i = 0; i < num_proposals; ++i) { for (int i = 0; i < num_proposals; ++i) {
index = indices[i]; index = indices[i];
dx = deltas[index]; dx = deltas[index];
dy = deltas[num_candidates + index]; dy = deltas[num_candidates + index];
d_log_w = deltas[num_candidates_2x + index]; d_log_w = deltas[num_candidates_2x + index];
d_log_h = deltas[num_candidates_3x + index]; d_log_h = deltas[num_candidates_3x + index];
proposal[4] = FilterBoxes( BBoxTransform(dx, dy, d_log_w, d_log_h, im_w, im_h, T(1), T(1), proposal);
dx, proposal[4] = scores[index];
dy,
d_log_w,
d_log_h,
im_w,
im_h,
min_box_w,
min_box_h,
proposal) *
scores[index];
proposal += 5; proposal += 5;
} }
} }
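GenerateProposals now defers the box math entirely to BBoxTransform and drops the min-size filtering. Assuming unit image scales, as the T(1), T(1) arguments above suggest, the decoding is the usual delta parameterization; a NumPy sketch:

import numpy as np

def bbox_transform_inv(anchors, deltas, im_h, im_w):
    # Decode (dx, dy, d_log_w, d_log_h) against anchors, then clip to the image.
    w = anchors[:, 2] - anchors[:, 0] + 1.0
    h = anchors[:, 3] - anchors[:, 1] + 1.0
    ctr_x = anchors[:, 0] + 0.5 * w
    ctr_y = anchors[:, 1] + 0.5 * h
    pred_ctr_x = deltas[:, 0] * w + ctr_x
    pred_ctr_y = deltas[:, 1] * h + ctr_y
    pred_w = np.exp(deltas[:, 2]) * w
    pred_h = np.exp(deltas[:, 3]) * h
    boxes = np.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                      pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], axis=1)
    boxes[:, 0::2] = boxes[:, 0::2].clip(0, im_w - 1)
    boxes[:, 1::2] = boxes[:, 1::2].clip(0, im_h - 1)
    return boxes

anchors = np.array([[0., 0., 15., 15.]])
deltas = np.zeros((1, 4))
print(bbox_transform_inv(anchors, deltas, im_h=600, im_w=800))  # anchor is unchanged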
template <typename T> template <typename T>
void GenerateMCProposals( void GenerateDetections(
const int num_proposals, const int num_proposals,
const int num_boxes, const int num_boxes,
const int num_classes, const int num_classes,
...@@ -280,11 +261,11 @@ void GenerateMCProposals( ...@@ -280,11 +261,11 @@ void GenerateMCProposals(
const T* scores, const T* scores,
const T* deltas, const T* deltas,
const int64_t* indices, const int64_t* indices,
T* proposals) { T* detections) {
int64_t index, cls; int64_t index, cls;
int64_t num_boxes_2x = 2 * num_boxes; int64_t num_boxes_2x = 2 * num_boxes;
int64_t num_boxes_3x = 3 * num_boxes; int64_t num_boxes_3x = 3 * num_boxes;
float* proposal = proposals; T* detection = detections;
float dx, dy, d_log_w, d_log_h; float dx, dy, d_log_w, d_log_h;
for (int i = 0; i < num_proposals; ++i) { for (int i = 0; i < num_proposals; ++i) {
cls = indices[i] % num_classes; cls = indices[i] % num_classes;
...@@ -293,7 +274,7 @@ void GenerateMCProposals( ...@@ -293,7 +274,7 @@ void GenerateMCProposals(
dy = deltas[num_boxes + index]; dy = deltas[num_boxes + index];
d_log_w = deltas[num_boxes_2x + index]; d_log_w = deltas[num_boxes_2x + index];
d_log_h = deltas[num_boxes_3x + index]; d_log_h = deltas[num_boxes_3x + index];
proposal[0] = im_idx; detection[0] = im_idx;
BBoxTransform( BBoxTransform(
dx, dx,
dy, dy,
...@@ -303,10 +284,11 @@ void GenerateMCProposals( ...@@ -303,10 +284,11 @@ void GenerateMCProposals(
im_h, im_h,
im_scale_h, im_scale_h,
im_scale_w, im_scale_w,
proposal + 1); detection + 1);
proposal[5] = scores[indices[i]]; // detection[5] = scores[indices[i]];
proposal[6] = cls + 1; detection[5] = scores[i];
proposal += 7; detection[6] = cls + 1;
detection += 7;
} }
} }
......
...@@ -8,7 +8,6 @@ ...@@ -8,7 +8,6 @@
# <https://opensource.org/licenses/BSD-2-Clause> # <https://opensource.org/licenses/BSD-2-Clause>
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
"""Compile the cython extensions.""" """Compile the cython extensions."""
from __future__ import absolute_import from __future__ import absolute_import
...@@ -36,7 +35,7 @@ ext_modules = [ ...@@ -36,7 +35,7 @@ ext_modules = [
include_dirs=[np.get_include()] include_dirs=[np.get_include()]
), ),
Extension( Extension(
'install.lib.pycocotools._mask', 'install.lib.utils.pycocotools._mask',
['maskApi.c', '_mask.pyx'], ['maskApi.c', '_mask.pyx'],
include_dirs=[np.get_include(), os.path.dirname(os.path.abspath(__file__))], include_dirs=[np.get_include(), os.path.dirname(os.path.abspath(__file__))],
extra_compile_args=['-w'] extra_compile_args=['-w']
......
...@@ -8,7 +8,6 @@ ...@@ -8,7 +8,6 @@
# <https://opensource.org/licenses/BSD-2-Clause> # <https://opensource.org/licenses/BSD-2-Clause>
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
"""Make record file for COCO dataset.""" """Make record file for COCO dataset."""
from __future__ import absolute_import from __future__ import absolute_import
...@@ -27,14 +26,12 @@ if __name__ == '__main__': ...@@ -27,14 +26,12 @@ if __name__ == '__main__':
# Encode masks to RLE bytes # Encode masks to RLE bytes
if not os.path.exists('build'): if not os.path.exists('build'):
os.makedirs('build') os.makedirs('build')
make_mask('train', '2014', COCO_ROOT) make_mask('train', '2014', COCO_ROOT)
make_mask('valminusminival', '2014', COCO_ROOT) make_mask('valminusminival', '2014', COCO_ROOT)
make_mask('minival', '2014', COCO_ROOT) make_mask('minival', '2014', COCO_ROOT)
merge_mask('trainval35k', '2014', [ merge_mask('trainval35k', '2014', ['build/coco_2014_train_mask.pkl',
'build/coco_2014_train_mask.pkl', 'build/coco_2014_valminusminival_mask.pkl'])
'build/coco_2014_valminusminival_mask.pkl']
)
# coco_2014_trainval35k # coco_2014_trainval35k
make_record( make_record(
......
...@@ -10,17 +10,13 @@ ...@@ -10,17 +10,13 @@
# ------------------------------------------------------------ # ------------------------------------------------------------
import os import os
import pickle
import time import time
import cv2 import cv2
import dragon import dragon
import numpy as np import numpy as np
try:
import cPickle
except:
import pickle as cPickle
def make_example(image_file, mask_objects, im_scale=None): def make_example(image_file, mask_objects, im_scale=None):
filename = os.path.split(image_file)[-1] filename = os.path.split(image_file)[-1]
...@@ -52,6 +48,7 @@ def make_example(image_file, mask_objects, im_scale=None): ...@@ -52,6 +48,7 @@ def make_example(image_file, mask_objects, im_scale=None):
'xmax': x2, 'xmax': x2,
'ymax': y2, 'ymax': y2,
'mask': obj['mask'], 'mask': obj['mask'],
'polygons': obj['polygons'],
'difficult': obj.get('crowd', 0), 'difficult': obj.get('crowd', 0),
}) })
...@@ -80,7 +77,7 @@ def make_record( ...@@ -80,7 +77,7 @@ def make_record(
if mask_file is not None: if mask_file is not None:
with open(mask_file, 'rb') as f: with open(mask_file, 'rb') as f:
all_masks = cPickle.load(f) all_masks = pickle.load(f)
else: else:
all_masks = {} all_masks = {}
...@@ -101,6 +98,7 @@ def make_record( ...@@ -101,6 +98,7 @@ def make_record(
'xmax': 'float64', 'xmax': 'float64',
'ymax': 'float64', 'ymax': 'float64',
'mask': 'bytes', 'mask': 'bytes',
'polygons': [['float64']],
'difficult': 'int64', 'difficult': 'int64',
}] }]
} }
...@@ -111,10 +109,22 @@ def make_record( ...@@ -111,10 +109,22 @@ def make_record(
for db_idx, split in enumerate(splits): for db_idx, split in enumerate(splits):
split_file = os.path.join(splits_path[db_idx], split + '.txt') split_file = os.path.join(splits_path[db_idx], split + '.txt')
assert os.path.exists(split_file) if not os.path.exists(split_file):
with open(split_file, 'r') as f: # Fall back to the split provided in JSON format
lines = f.readlines() split_file = os.path.join(splits_path[db_idx], split + '.json')
total_line += len(lines) if not os.path.exists(split_file):
raise FileNotFoundError('Unable to find the split: ' + split)
with open(split_file, 'r') as f:
import json
images_info = json.load(f)
total_line += len(images_info['images'])
lines = []
for info in images_info['images']:
lines.append(os.path.splitext(info['file_name'])[0])
else:
with open(split_file, 'r') as f:
lines = f.readlines()
total_line += len(lines)
for line in lines: for line in lines:
count += 1 count += 1
if count % 2000 == 0: if count % 2000 == 0:
...@@ -123,10 +133,8 @@ def make_record( ...@@ -123,10 +133,8 @@ def make_record(
count, total_line, now_time - start_time)) count, total_line, now_time - start_time))
filename = line.strip() filename = line.strip()
image_file = os.path.join(images_path[db_idx], filename + ext) image_file = os.path.join(images_path[db_idx], filename + ext)
mask_objects = all_masks[filename] if filename in all_masks else None mask_objects = all_masks[filename] if filename in all_masks else {}
if mask_objects is None: writer.write(make_example(image_file, mask_objects, im_scale))
raise ValueError('The image({}) takes invalid mask settings.'.format(filename))
writer.write( make_example(image_file, mask_objects, im_scale))
now_time = time.time() now_time = time.time()
print('{} / {} in {:.2f} sec'.format(count, total_line, now_time - start_time)) print('{} / {} in {:.2f} sec'.format(count, total_line, now_time - start_time))
......
...@@ -9,19 +9,17 @@ ...@@ -9,19 +9,17 @@
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import os import os
import sys
import os.path as osp import os.path as osp
from collections import OrderedDict import pickle
try:
import cPickle
except:
import pickle as cPickle
sys.path.insert(0, '../..') from seetadet.utils.pycocotools import mask_utils
from seetadet.pycocotools.coco import COCO from seetadet.utils.pycocotools.coco import COCO
from seetadet.pycocotools import mask_utils
class COCOWrapper(object): class COCOWrapper(object):
...@@ -31,7 +29,7 @@ class COCOWrapper(object): ...@@ -31,7 +29,7 @@ class COCOWrapper(object):
self._data_path = osp.join(data_dir) self._data_path = osp.join(data_dir)
self.invalid_cnt = 0 self.invalid_cnt = 0
self.ignore_cnt = 0 self.ignore_cnt = 0
# Load COCO API, classes, class <-> id mappings # Load COCO API, classes, class <-> id mappings
self._COCO = COCO(self._get_ann_file()) self._COCO = COCO(self._get_ann_file())
cats = self._COCO.loadCats(self._COCO.getCatIds()) cats = self._COCO.loadCats(self._COCO.getCatIds())
...@@ -39,9 +37,8 @@ class COCOWrapper(object): ...@@ -39,9 +37,8 @@ class COCOWrapper(object):
self._class_to_ind = dict(zip(self._classes, range(self.num_classes))) self._class_to_ind = dict(zip(self._classes, range(self.num_classes)))
self._ind_to_class = dict(zip(range(self.num_classes), self._classes)) self._ind_to_class = dict(zip(range(self.num_classes), self._classes))
self._class_to_cat_id = dict(zip([c['name'] for c in cats], self._COCO.getCatIds())) self._class_to_cat_id = dict(zip([c['name'] for c in cats], self._COCO.getCatIds()))
self._cat_id_to_class_id = dict([(self._class_to_cat_id[cls], self._cat_id_to_class_id = dict([(self._class_to_cat_id[cls], self._class_to_ind[cls])
self._class_to_ind[cls]) for cls in self._classes[1:]])
for cls in self._classes[1:]])
self._data_name = { self._data_name = {
# 5k ``val2014`` subset # 5k ``val2014`` subset
'minival2014': 'val2014', 'minival2014': 'val2014',
...@@ -56,10 +53,10 @@ class COCOWrapper(object): ...@@ -56,10 +53,10 @@ class COCOWrapper(object):
if self._image_set.find('test') == -1 \ if self._image_set.find('test') == -1 \
else 'image_info' else 'image_info'
return osp.join( return osp.join(
self._data_path, self._data_path,
'annotations', 'annotations',
prefix + '_' + prefix + '_' +
self._image_set + self._image_set +
self._year + '.json' self._year + '.json'
) )
...@@ -107,31 +104,32 @@ class COCOWrapper(object): ...@@ -107,31 +104,32 @@ class COCOWrapper(object):
y1 = float(max(0, obj['bbox'][1])) y1 = float(max(0, obj['bbox'][1]))
x2 = float(min(width - 1, x1 + max(0, obj['bbox'][2] - 1))) x2 = float(min(width - 1, x1 + max(0, obj['bbox'][2] - 1)))
y2 = float(min(height - 1, y1 + max(0, obj['bbox'][3] - 1))) y2 = float(min(height - 1, y1 + max(0, obj['bbox'][3] - 1)))
mask, polygons = b'', []
if isinstance(obj['segmentation'], list): if isinstance(obj['segmentation'], list):
for p in obj['segmentation']: for p in obj['segmentation']:
if len(p) < 6: if len(p) < 6:
print('Remove Invalid segm.') print('Remove Invalid segm.')
# Valid polygons have >= 3 points, so require >= 6 coordinates # Valid polygons have >= 3 points, so require >= 6 coordinates
poly = [p for p in obj['segmentation'] if len(p) >= 6] polygons = [p for p in obj['segmentation'] if len(p) >= 6]
mask_bytes = mask_utils.poly2bytes(poly, height, width) # mask_bytes = mask_utils.poly2bytes(poly, height, width)
else: else:
# Crowd masks # Crowd masks
# Some are encoded with height or width # Some are encoded with height or width
# running out of the image bound # running out of the image bound
# Do not use them or decoding error is inevitable # Do not use them or decoding error is inevitable
mask_bytes = mask_utils.poly2bytes(obj['segmentation'], height, width) mask = mask_utils.poly2bytes(obj['segmentation'], height, width)
if obj['area'] > 0 and x2 > x1 and y2 > y1: if obj['area'] > 0 and x2 > x1 and y2 > y1:
obj['clean_bbox'] = [x1, y1, x2, y2] obj['clean_bbox'] = [x1, y1, x2, y2]
valid_objects.append({ valid_objects.append({
'bbox': [x1, y1, x2, y2], 'bbox': [x1, y1, x2, y2],
'mask': mask_bytes, 'mask': mask,
'polygons': polygons,
'category_id': obj['category_id'], 'category_id': obj['category_id'],
'class_id': self._cat_id_to_class_id[obj['category_id']], 'class_id': self._cat_id_to_class_id[obj['category_id']],
'crowd': obj['iscrowd'], 'crowd': obj['iscrowd'],
}) })
valid_objects[-1]['name'] = \ valid_objects[-1]['name'] = \
self._ind_to_class[valid_objects[-1]['class_id']] self._ind_to_class[valid_objects[-1]['class_id']]
return height, width, valid_objects return height, width, valid_objects
@property @property
...@@ -150,31 +148,35 @@ def make_mask(split, year, data_dir): ...@@ -150,31 +148,35 @@ def make_mask(split, year, data_dir):
if not osp.exists(osp.join(coco._data_path, 'splits')): if not osp.exists(osp.join(coco._data_path, 'splits')):
os.makedirs(osp.join(coco._data_path, 'splits')) os.makedirs(osp.join(coco._data_path, 'splits'))
gt_recs = OrderedDict() gt_recs = collections.OrderedDict()
for i in range(coco.num_images): for i in range(coco.num_images):
filename = (coco.image_path_at(i).split('/')[-1]).split('.')[0] filename = osp.basename(coco.image_path_at(i)).split('.')[0]
h, w, objects = coco.annotation_at(i) h, w, objects = coco.annotation_at(i)
gt_recs[filename] = objects gt_recs[filename] = objects
with open(osp.join('build', 'coco_' + year + '_' + split + '_mask.pkl'), 'wb') as f: with open(osp.join('build',
cPickle.dump(gt_recs, f, cPickle.HIGHEST_PROTOCOL) 'coco_' + year +
'_' + split + '_mask.pkl'), 'wb') as f:
pickle.dump(gt_recs, f, pickle.HIGHEST_PROTOCOL)
with open(osp.join(coco._data_path, 'splits', split + '.txt'), 'w') as f: with open(osp.join(coco._data_path, 'splits', split + '.txt'), 'w') as f:
for i in range(coco.num_images): for i in range(coco.num_images):
filename = (coco.image_path_at(i).split('/')[-1]).split('.')[0] filename = str(osp.basename(coco.image_path_at(i)).split('.')[0])
if i != coco.num_images - 1: if i != coco.num_images - 1:
filename += '\n' filename += '\n'
f.write(filename) f.write(filename)
def merge_mask(split, year, mask_files): def merge_mask(split, year, mask_files):
gt_recs = OrderedDict() gt_recs = collections.OrderedDict()
data_path = os.path.dirname(mask_files[0]) data_path = os.path.dirname(mask_files[0])
for mask_file in mask_files: for mask_file in mask_files:
with open(mask_file, 'rb') as f: with open(mask_file, 'rb') as f:
recs = cPickle.load(f) recs = pickle.load(f)
gt_recs.update(recs) gt_recs.update(recs)
with open(osp.join(data_path, 'coco_' + year + '_' + split + '_mask.pkl'), 'wb') as f: with open(osp.join(data_path,
cPickle.dump(gt_recs, f, cPickle.HIGHEST_PROTOCOL) 'coco_' + year +
'_' + split + '_mask.pkl'), 'wb') as f:
pickle.dump(gt_recs, f, pickle.HIGHEST_PROTOCOL)
...@@ -132,4 +132,3 @@ def make_record( ...@@ -132,4 +132,3 @@ def make_record(
data_size = os.path.getsize(record_file + '/root.data') * 1e-6 data_size = os.path.getsize(record_file + '/root.data') * 1e-6
print('{} images take {:.2f} MB in {:.2f} sec.' print('{} images take {:.2f} MB in {:.2f} sec.'
.format(len(entries), data_size, end_time - start_time)) .format(len(entries), data_size, end_time - start_time))
...@@ -8,7 +8,6 @@ ...@@ -8,7 +8,6 @@
# <https://opensource.org/licenses/BSD-2-Clause> # <https://opensource.org/licenses/BSD-2-Clause>
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
"""Make record file for VOC dataset.""" """Make record file for VOC dataset."""
from __future__ import absolute_import from __future__ import absolute_import
...@@ -29,7 +28,7 @@ if __name__ == '__main__': ...@@ -29,7 +28,7 @@ if __name__ == '__main__':
annotations_path=[osp.join(voc_root, 'VOCdevkit2007/VOC2007/Annotations'), annotations_path=[osp.join(voc_root, 'VOCdevkit2007/VOC2007/Annotations'),
osp.join(voc_root, 'VOCdevkit2012/VOC2012/Annotations')], osp.join(voc_root, 'VOCdevkit2012/VOC2012/Annotations')],
splits_path=[osp.join(voc_root, 'VOCdevkit2007/VOC2007/ImageSets/Main'), splits_path=[osp.join(voc_root, 'VOCdevkit2007/VOC2007/ImageSets/Main'),
osp.join(voc_root, 'VOCdevkit2012/VOC2012/ImageSets/Main')], osp.join(voc_root, 'VOCdevkit2012/VOC2012/ImageSets/Main')],
splits=['trainval', 'trainval'] splits=['trainval', 'trainval']
) )
......
...@@ -8,3 +8,11 @@ ...@@ -8,3 +8,11 @@
# <https://opensource.org/licenses/BSD-2-Clause> # <https://opensource.org/licenses/BSD-2-Clause>
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
"""A platform implementing popular object detection algorithms."""
from __future__ import absolute_import as _absolute_import
from __future__ import division as _division
from __future__ import print_function as _print_function
# Version
from seetadet.version import version as __version__
...@@ -8,3 +8,9 @@ ...@@ -8,3 +8,9 @@
# <https://opensource.org/licenses/BSD-2-Clause> # <https://opensource.org/licenses/BSD-2-Clause>
# #
# ------------------------------------------------------------ # ------------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from seetadet.algo.common.anchor_sampler import AnchorSampler
# ------------------------------------------------------------
# Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
#
# Licensed under the BSD 2-Clause License.
# You should have received a copy of the BSD 2-Clause License
# along with the software. If not, See,
#
# <https://opensource.org/licenses/BSD-2-Clause>
#
# ------------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from seetadet.core.config import cfg
class AnchorSampler(object):
"""Sample precomputed anchors asynchronously."""
def __init__(self):
self._rpn_target = None
self._retinanet_target = None
self._ssd_target = None
if 'rcnn' in cfg.MODEL.TYPE:
from seetadet.algo.faster_rcnn import anchor_target
self._rpn_target = anchor_target.AnchorTarget()
elif cfg.MODEL.TYPE == 'retinanet':
from seetadet.algo.retinanet import anchor_target
self._retinanet_target = anchor_target.AnchorTarget()
elif cfg.MODEL.TYPE == 'ssd':
from seetadet.algo.ssd import anchor_target
self._ssd_target = anchor_target.AnchorTarget()
def __call__(self, **inputs):
"""Return the sample anchors."""
if self._rpn_target:
fg_inds, bg_inds = \
self._rpn_target.sample_anchors(
gt_boxes=inputs['gt_boxes'],
im_info=inputs['im_info'],
)
return {'fg_inds': fg_inds, 'bg_inds': bg_inds}
if self._retinanet_target:
fg_inds, ignore_inds = \
self._retinanet_target.sample_anchors(
gt_boxes=inputs['gt_boxes'],
im_info=inputs['im_info'],
)
return {'fg_inds': fg_inds, 'bg_inds': ignore_inds}
if self._ssd_target:
fg_inds, neg_inds = \
self._ssd_target.sample_anchors(
gt_boxes=inputs['gt_boxes'],
)
return {'fg_inds': fg_inds, 'bg_inds': neg_inds}
return {}
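A rough usage sketch of the new AnchorSampler; it assumes an 'rcnn'-style cfg.MODEL.TYPE and the gt_boxes/im_info layout produced by the transformer further below, so take it as orientation rather than a verified snippet:

import numpy as np
from seetadet.algo.common.anchor_sampler import AnchorSampler

sampler = AnchorSampler()  # picks the target module from cfg.MODEL.TYPE
gt_boxes = np.array([[10., 20., 120., 180., 1.]], dtype=np.float32)  # x1, y1, x2, y2, class
im_info = (600, 800, 1.0)  # height, width, scale
targets = sampler(gt_boxes=gt_boxes, im_info=im_info)
# e.g. {'fg_inds': ..., 'bg_inds': ...} for RPN; an empty dict if no target module matched.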
...@@ -17,7 +17,3 @@ from seetadet.algo.faster_rcnn.anchor_target import AnchorTarget ...@@ -17,7 +17,3 @@ from seetadet.algo.faster_rcnn.anchor_target import AnchorTarget
from seetadet.algo.faster_rcnn.data_loader import DataLoader from seetadet.algo.faster_rcnn.data_loader import DataLoader
from seetadet.algo.faster_rcnn.proposal import Proposal from seetadet.algo.faster_rcnn.proposal import Proposal
from seetadet.algo.faster_rcnn.proposal_target import ProposalTarget from seetadet.algo.faster_rcnn.proposal_target import ProposalTarget
from seetadet.algo.faster_rcnn.utils import generate_grid_anchors
from seetadet.algo.faster_rcnn.utils import map_blobs_by_levels
from seetadet.algo.faster_rcnn.utils import map_rois_to_levels
from seetadet.algo.faster_rcnn.utils import map_returns_to_blobs
...@@ -13,8 +13,11 @@ from __future__ import absolute_import ...@@ -13,8 +13,11 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import collections
import multiprocessing as mp import multiprocessing as mp
import time import time
import threading
import queue
import dragon import dragon
import dragon.vm.torch as torch import dragon.vm.torch as torch
...@@ -23,8 +26,8 @@ import numpy as np ...@@ -23,8 +26,8 @@ import numpy as np
from seetadet.algo.faster_rcnn import data_transformer from seetadet.algo.faster_rcnn import data_transformer
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.datasets.factory import get_dataset from seetadet.datasets.factory import get_dataset
from seetadet.utils import blob as blob_util
from seetadet.utils import logger from seetadet.utils import logger
from seetadet.utils.blob import im_list_to_blob
class DataLoader(object): class DataLoader(object):
...@@ -33,28 +36,24 @@ class DataLoader(object): ...@@ -33,28 +36,24 @@ class DataLoader(object):
def __init__(self): def __init__(self):
super(DataLoader, self).__init__() super(DataLoader, self).__init__()
dataset = get_dataset(cfg.TRAIN.DATASET) dataset = get_dataset(cfg.TRAIN.DATASET)
if cfg.USE_DALI: self.iterator = Iterator(**{
from seetadet.dali import rcnn_pipeline as pipe 'dataset': dataset.cls,
self.iterator = pipe.new_iterator(dataset.source) 'source': dataset.source,
else: 'classes': dataset.classes,
self.iterator = Iterator(**{ 'shuffle': cfg.TRAIN.USE_SHUFFLE,
'dataset': dataset.cls, 'batch_size': cfg.TRAIN.IMS_PER_BATCH * 2,
'source': dataset.source, 'num_transformers': cfg.TRAIN.NUM_THREADS - 1,
'classes': dataset.classes, })
'shuffle': cfg.TRAIN.USE_SHUFFLE, self.iterator.start()
'num_chunks': cfg.TRAIN.SHUFFLE_CHUNKS,
'batch_size': cfg.TRAIN.IMS_PER_BATCH * 2,
'num_transformers': cfg.TRAIN.NUM_THREADS - 1,
})
def __call__(self): def __call__(self):
outputs = self.iterator.next() outputs = self.iterator.next()
if isinstance(outputs['data'], np.ndarray): if isinstance(outputs['image'], np.ndarray):
outputs['data'] = torch.from_numpy(outputs['data']) outputs['image'] = torch.from_numpy(outputs['image'])
return outputs return outputs
class Iterator(mp.Process): class Iterator(threading.Thread):
"""Iterator to return the batch of data.""" """Iterator to return the batch of data."""
def __init__(self, **kwargs): def __init__(self, **kwargs):
...@@ -68,17 +67,16 @@ class Iterator(mp.Process): ...@@ -68,17 +67,16 @@ class Iterator(mp.Process):
rank = dragon.distributed.get_rank(process_group) rank = dragon.distributed.get_rank(process_group)
# Configuration # Configuration
self._prefetch = kwargs.get('prefetch', 5)
self._batch_size = kwargs.get('batch_size', 2) self._batch_size = kwargs.get('batch_size', 2)
self._num_readers = kwargs.get('num_readers', 1) self._num_readers = kwargs.get('num_readers', 1)
self._num_transformers = kwargs.get('num_transformers', 3) self._num_transformers = kwargs.get('num_transformers', 3)
self.daemon = True self.daemon = True
# Initialize queues # Initialize queues
num_batches = self._prefetch * self._num_readers num_batches = self._num_readers
self.q_in = mp.Queue(num_batches * self._batch_size) self._queue1 = mp.Queue(num_batches * self._batch_size)
self.q1_out = mp.Queue(num_batches * self._batch_size) self._queue2 = mp.Queue(num_batches * self._batch_size)
self.q2_out = mp.Queue(num_batches * self._batch_size) self._queue3 = queue.Queue(num_batches)
# Initialize readers # Initialize readers
self._readers = [] self._readers = []
...@@ -89,7 +87,7 @@ class Iterator(mp.Process): ...@@ -89,7 +87,7 @@ class Iterator(mp.Process):
self._readers.append(dragon.io.DataReader( self._readers.append(dragon.io.DataReader(
part_idx=part_idx, num_parts=num_parts, **kwargs)) part_idx=part_idx, num_parts=num_parts, **kwargs))
self._readers[i]._seed += part_idx self._readers[i]._seed += part_idx
self._readers[i].q_out = self.q_in self._readers[i].q_out = self._queue1
self._readers[i].start() self._readers[i].start()
time.sleep(0.1) time.sleep(0.1)
...@@ -98,8 +96,7 @@ class Iterator(mp.Process): ...@@ -98,8 +96,7 @@ class Iterator(mp.Process):
for i in range(self._num_transformers): for i in range(self._num_transformers):
p = data_transformer.DataTransformer(**kwargs) p = data_transformer.DataTransformer(**kwargs)
p._seed += (i + rank * self._num_transformers) p._seed += (i + rank * self._num_transformers)
p.q_in = self.q_in p.q_in, p.q_out = self._queue1, self._queue2
p.q1_out, p.q2_out = self.q1_out, self.q2_out
p.start() p.start()
self._transformers.append(p) self._transformers.append(p)
time.sleep(0.1) time.sleep(0.1)
...@@ -122,35 +119,43 @@ class Iterator(mp.Process): ...@@ -122,35 +119,43 @@ class Iterator(mp.Process):
"""Return the next batch of data.""" """Return the next batch of data."""
return self.__next__() return self.__next__()
def run(self):
"""Main loop."""
num_images = cfg.TRAIN.IMS_PER_BATCH
num_batches = cfg.TRAIN.ASPECT_GROUPING
logger.info('Initialize prefetching batches...')
example_buffer = [self._queue2.get()
for _ in range(num_images * num_batches)]
next_examples = []
while True:
# Use cached buffer for next N examples
# Examples are sorted to simulate aspect grouping
if len(next_examples) == 0:
next_examples = example_buffer
next_examples.sort(key=lambda d: d['aspect_ratio'])
example_buffer = []
# Prepare the next batch
outputs = collections.defaultdict(list)
for i in range(num_images):
example = next_examples.pop(0)
outputs['image'].append(example['image'])
outputs['gt_boxes'].append(example['boxes'])
outputs['im_info'].append(example['im_info'])
outputs['fg_inds'].append(example.get('fg_inds', None))
outputs['bg_inds'].append(example.get('bg_inds', None))
example_buffer.append(self._queue2.get())
outputs['image'] = blob_util.im_list_to_blob(
outputs['image'], coarsest_stride=cfg.MODEL.COARSEST_STRIDE)
# Send batch data to consumer
self._queue3.put(outputs)
def __iter__(self): def __iter__(self):
"""Return the iterator self.""" """Return the iterator self."""
return self return self
def __next__(self): def __next__(self):
"""Return the next batch of data.""" """Return the next batch of data."""
q_out = None return self._queue3.get()
# Two queues to implement aspect-grouping
# This is necessary to reduce the gpu memory
# from fetching a huge square batch blob
while q_out is None:
if self.q1_out.qsize() >= cfg.TRAIN.IMS_PER_BATCH:
q_out = self.q1_out
elif self.q2_out.qsize() >= cfg.TRAIN.IMS_PER_BATCH:
q_out = self.q2_out
self.q1_out, self.q2_out = self.q2_out, self.q1_out
images, images_info, boxes_to_pack = [], [], []
for i in range(cfg.TRAIN.IMS_PER_BATCH):
image, image_scale, boxes = q_out.get()
images.append(image)
images_info.append(list(image.shape[:2]) + [image_scale])
gt_boxes = np.zeros((boxes.shape[0], boxes.shape[1] + 1), 'float32')
gt_boxes[:, :boxes.shape[1]], gt_boxes[:, -1] = boxes, i
boxes_to_pack.append(gt_boxes)
return {
'data': im_list_to_blob(images),
'ims_info': np.array(images_info, dtype=np.float32),
'gt_boxes': np.concatenate(boxes_to_pack),
}
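The rewritten Iterator replaces the two-queue trick with a sorted example buffer. Stripped of the queue plumbing, the grouping idea is simply the following toy sketch:

import numpy as np

def group_by_aspect_ratio(examples, ims_per_batch):
    # Sort by aspect ratio so each batch holds similarly shaped images,
    # which keeps the zero-padded blob from im_list_to_blob small.
    examples = sorted(examples, key=lambda d: d['aspect_ratio'])
    return [examples[i:i + ims_per_batch]
            for i in range(0, len(examples), ims_per_batch)]

buffer = [{'aspect_ratio': float(r)} for r in np.random.uniform(0.5, 2.0, size=8)]
for batch in group_by_aspect_ratio(buffer, ims_per_batch=2):
    print([round(d['aspect_ratio'], 2) for d in batch])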
...@@ -15,109 +15,122 @@ from __future__ import print_function ...@@ -15,109 +15,122 @@ from __future__ import print_function
import multiprocessing import multiprocessing
import cv2
import numpy as np import numpy as np
import numpy.random as npr
from seetadet.algo import common as algo_common
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.datasets.example import Example from seetadet.datasets.example import Example
from seetadet.utils import boxes as box_util from seetadet.utils import boxes as box_util
from seetadet.utils.blob import prep_im_for_blob from seetadet.utils import image as image_util
class DataTransformer(multiprocessing.Process): class DataTransformer(multiprocessing.Process):
def __init__(self, **kwargs): def __init__(self, **kwargs):
super(DataTransformer, self).__init__() super(DataTransformer, self).__init__()
self._scales = cfg.TRAIN.SCALES self._scales = cfg.TRAIN.SCALES
self._random_scales = cfg.TRAIN.RANDOM_SCALES
self._max_size = cfg.TRAIN.MAX_SIZE self._max_size = cfg.TRAIN.MAX_SIZE
self._seed = cfg.RNG_SEED self._seed = cfg.RNG_SEED
self._use_flipped = cfg.TRAIN.USE_FLIPPED
self._use_diff = cfg.TRAIN.USE_DIFF self._use_diff = cfg.TRAIN.USE_DIFF
self._use_flipped = cfg.TRAIN.USE_FLIPPED
self._use_distort = cfg.TRAIN.USE_COLOR_JITTER
self._classes = kwargs.get('classes', ('__background__',)) self._classes = kwargs.get('classes', ('__background__',))
self._num_classes = len(self._classes) self._num_classes = len(self._classes)
self._class_to_ind = dict(zip(self._classes, range(self._num_classes))) self._class_to_ind = dict(zip(self._classes, range(self._num_classes)))
self.q_in = self.q1_out = self.q2_out = None self._anchor_sampler = algo_common.AnchorSampler()
self.q_in = self.q_out = None
self.daemon = True self.daemon = True
def make_roi_dict(self, example, im_scale, apply_flip=False): def get_boxes(self, example, im_scale):
objects, n_objects = example.objects, 0 objects, num_objects = example.objects, 0
height, width = example.height, example.width height, width = example.height, example.width
if not self._use_diff: if not self._use_diff:
for obj in objects: for obj in objects:
if obj.get('difficult', 0) == 0: if obj.get('difficult', 0) == 0:
n_objects += 1 num_objects += 1
else: else:
n_objects = len(objects) num_objects = len(objects)
roi_dict = { boxes = np.zeros((num_objects, 4), 'float32')
'boxes': np.zeros((n_objects, 4), 'float32'), gt_classes = np.zeros((num_objects,), 'float32')
'gt_classes': np.zeros((n_objects,), 'int32'),
}
# Filter the difficult instances # Filter the difficult instances
object_idx = 0 object_idx = 0
for obj in objects: for obj in objects:
if not self._use_diff and \ if not self._use_diff and obj.get('difficult', 0) > 0:
obj.get('difficult', 0) > 0:
continue continue
bbox = obj['bbox'] bbox = obj['bbox']
roi_dict['boxes'][object_idx, :] = [ boxes[object_idx, :] = [max(0, bbox[0]),
max(0, bbox[0]), max(0, bbox[1]),
max(0, bbox[1]), min(bbox[2], width - 1),
min(bbox[2], width - 1), min(bbox[3], height - 1)]
min(bbox[3], height - 1), gt_classes[object_idx] = self._class_to_ind[obj['name']]
]
roi_dict['gt_classes'][object_idx] = \
self._class_to_ind[obj['name']]
object_idx += 1 object_idx += 1
# Flip the boxes if necessary
if apply_flip:
roi_dict['boxes'] = \
box_util.flip_boxes(
roi_dict['boxes'],
width,
)
# Scale the boxes to the detecting scale # Scale the boxes to the detecting scale
roi_dict['boxes'] *= im_scale boxes *= im_scale
# Attach the classes
gt_boxes = np.empty((num_objects, 5), dtype=np.float32)
gt_boxes[:, :4], gt_boxes[:, 4] = boxes, gt_classes
return roi_dict return gt_boxes
def get(self, example): def get(self, example):
example = Example(example) example = Example(example)
img = example.image
# Scale # Resize
target_size = self._scales[np.random.randint(len(self._scales))] img, im_scale = image_util.resize_image_with_target_size(
img, im_scale = prep_im_for_blob(img, target_size, self._max_size) example.image,
target_size=npr.choice(self._scales),
max_size=self._max_size,
random_scales=self._random_scales,
)
# Flip # Flip
apply_flip = False flipped = False
if self._use_flipped: if self._use_flipped and npr.randint(2) > 0:
if np.random.randint(2) > 0: img = img[:, ::-1]
img = img[:, ::-1] flipped = True
apply_flip = True
# Distort
if self._use_distort:
img = image_util.distort_image(img)
# Boxes
boxes = self.get_boxes(example, im_scale)
# Flip the boxes if necessary
if flipped:
boxes = box_util.flip_boxes(boxes, img.shape[1])
# Example -> RoIDict # Standard outputs.
roi_dict = self.make_roi_dict(example, im_scale, apply_flip) outputs = {'image': img,
'boxes': boxes,
'im_info': img.shape[:2] + (im_scale,)}
# Post-Process for gt boxes # Attach precomputed targets.
# Shape like: [num_objects, {x1, y1, x2, y2, cls}] if len(boxes) > 0:
gt_boxes = np.empty((len(roi_dict['gt_classes']), 5), dtype=np.float32) outputs.update(
gt_boxes[:, :4], gt_boxes[:, 4] = roi_dict['boxes'], roi_dict['gt_classes'] self._anchor_sampler(
gt_boxes=boxes,
im_info=outputs['im_info']))
return img, im_scale, gt_boxes return outputs
def run(self): def run(self):
# Fix the process-local random seed # Disable the opencv threading.
cv2.setNumThreads(1)
# Fix the process-local random seed.
np.random.seed(self._seed) np.random.seed(self._seed)
# Main prefetch loop # Main prefetch loop
while True: while True:
outputs = self.get(self.q_in.get()) outputs = self.get(self.q_in.get())
if len(outputs[2]) < 1: if len(outputs['boxes']) < 1:
continue # Ignore the non-object image continue # Ignore non-object image.
aspect_ratio = float(outputs[0].shape[0]) / outputs[0].shape[1] height, width = outputs['image'].shape[:2]
if aspect_ratio > 1.: outputs['aspect_ratio'] = float(height) / float(width)
self.q1_out.put(outputs) self.q_out.put(outputs)
else:
self.q2_out.put(outputs)
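resize_image_with_target_size itself is not shown in this diff; one common rule it presumably follows (scale the short side to the target, cap the long side at max_size) is sketched here as an assumption, ignoring the random_scales jitter:

def resize_scale(height, width, target_size, max_size):
    # Scale so the shorter side hits target_size, unless that would push
    # the longer side past max_size, in which case the longer side wins.
    im_min, im_max = min(height, width), max(height, width)
    scale = float(target_size) / im_min
    if round(scale * im_max) > max_size:
        scale = float(max_size) / im_max
    return scale

print(resize_scale(480, 640, target_size=600, max_size=1000))  # 1.25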
...@@ -17,8 +17,8 @@ import collections ...@@ -17,8 +17,8 @@ import collections
import numpy as np import numpy as np
from seetadet.algo.faster_rcnn.generate_anchors import generate_anchors from seetadet.algo.faster_rcnn import generate_anchors as anchor_util
from seetadet.algo.faster_rcnn.utils import generate_grid_anchors from seetadet.algo.faster_rcnn import utils as rcnn_util
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.utils import boxes as box_util from seetadet.utils import boxes as box_util
from seetadet.utils import nms from seetadet.utils import nms
...@@ -29,59 +29,50 @@ class Proposal(object): ...@@ -29,59 +29,50 @@ class Proposal(object):
def __init__(self): def __init__(self):
super(Proposal, self).__init__() super(Proposal, self).__init__()
# Load the basic configs # Load basic configs
self.scales = cfg.RPN.SCALES self.scales = cfg.RPN.SCALES
self.strides = cfg.RPN.STRIDES self.strides = cfg.RPN.STRIDES
self.ratios = cfg.RPN.ASPECT_RATIOS self.ratios = cfg.RPN.ASPECT_RATIOS
self.num_strides = len(self.strides) self.num_strides = len(self.strides)
self.defaults = collections.OrderedDict([ self.defaults = collections.OrderedDict([
('rois', np.array([[-1, 0, 0, 1, 1]], 'float32')), ('rois', np.array([[-1, 0, 0, 1, 1]], 'float32'))])
]) self.bbox_transform_clip = \
np.log(cfg.TRAIN.MAX_SIZE / min(self.strides))
# Generate base anchors # Generate base anchors
self.base_anchors = [] self.base_anchors = []
for i in range(self.num_strides): for i in range(self.num_strides):
self.base_anchors.append( self.base_anchors.append(
generate_anchors( anchor_util.generate_anchors(
self.strides[i], self.strides[i],
self.ratios, self.ratios,
np.array([self.scales[i]]) np.array([self.scales[i]])
if self.num_strides > 1 if self.num_strides > 1
else np.array(self.scales) else np.array(self.scales)))
)
)
def __call__(self, features, cls_prob, bbox_pred, ims_info): def __call__(self, **inputs):
num_images = cfg.TRAIN.IMS_PER_BATCH
pre_nms_top_n = cfg.TRAIN.RPN_PRE_NMS_TOP_N pre_nms_top_n = cfg.TRAIN.RPN_PRE_NMS_TOP_N
post_nms_top_n = cfg.TRAIN.RPN_POST_NMS_TOP_N post_nms_top_n = cfg.TRAIN.RPN_POST_NMS_TOP_N
nms_thresh = cfg.TRAIN.RPN_NMS_THRESH nms_thresh = cfg.TRAIN.RPN_NMS_THRESH
min_size = cfg.TRAIN.RPN_MIN_SIZE
# Get resources # Get resources
num_images = ims_info.shape[0] shapes = [f.shape[-2:] for f in inputs['features']]
grid_shapes = [f.shape[-2:] for f in features] all_anchors = rcnn_util.get_shifted_anchors(
all_anchors = generate_grid_anchors( shapes, self.base_anchors, self.strides)
grid_shapes, self.base_anchors, self.strides)
# Prepare for the outputs # Prepare for the outputs
batch_rois = [] batch_rois = []
cls_prob = cls_prob.numpy() cls_prob = inputs['cls_prob'].numpy()
bbox_pred = bbox_pred.numpy() # (?, 4, A * K) -> (?, A * K, 4)
if self.num_strides > 1: bbox_pred = inputs['bbox_pred'].numpy()
# (?, 4, A * K) -> (?, A * K, 4) bbox_pred = bbox_pred.transpose((0, 2, 1))
bbox_pred = bbox_pred.transpose((0, 2, 1))
else:
# (?, A * 4, H, W) -> (?, H, W, A * 4)
cls_prob = cls_prob.transpose((0, 2, 3, 1))
bbox_pred = bbox_pred.transpose((0, 2, 3, 1))
# Extract RoIs separately # Extract RoIs separately
for ix in range(num_images): for ix in range(num_images):
# [?, N] -> [? * N, 1] # [?, N] -> [? * N, 1]
scores = cls_prob[ix].reshape((-1, 1)) scores = cls_prob[ix].reshape((-1, 1))
if self.num_strides > 1: deltas = bbox_pred[ix]
deltas = bbox_pred[ix] im_info = inputs['im_info'][ix]
else:
deltas = bbox_pred[ix].reshape((-1, 4))
if pre_nms_top_n <= 0 or pre_nms_top_n >= len(scores): if pre_nms_top_n <= 0 or pre_nms_top_n >= len(scores):
order = np.argsort(-scores.squeeze()) order = np.argsort(-scores.squeeze())
...@@ -97,15 +88,11 @@ class Proposal(object): ...@@ -97,15 +88,11 @@ class Proposal(object):
scores = scores[order] scores = scores[order]
# Convert anchors into proposals via bbox transformations # Convert anchors into proposals via bbox transformations
proposals = box_util.bbox_transform_inv(anchors, deltas) proposals = box_util.bbox_transform_inv(
anchors, deltas, clip=self.bbox_transform_clip)
# Clip predicted boxes to image # Clip predicted boxes to image
proposals = box_util.clip_tiled_boxes(proposals, ims_info[ix, :2]) proposals = box_util.clip_tiled_boxes(proposals, im_info[:2])
# Remove predicted boxes with either height or width < threshold
keep = box_util.filter_boxes(proposals, min_size * ims_info[ix, 2])
proposals = proposals[keep, :]
scores = scores[keep]
# Apply nms (e.g. threshold = 0.7) # Apply nms (e.g. threshold = 0.7)
# Take after_nms_topN (e.g. 300) # Take after_nms_topN (e.g. 300)
......
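The new bbox_transform_clip bounds the predicted log-size deltas before exponentiation, so a single wild regression output cannot blow a proposal up beyond the largest possible image side. A small numeric illustration with made-up stand-ins for the cfg values:

import numpy as np

max_size, min_stride = 1000, 4           # illustrative stand-ins for the cfg values
clip = np.log(max_size / min_stride)     # as in bbox_transform_clip above
d_log_w = np.array([0.3, 2.0, 9.0])
print(np.exp(d_log_w))                   # unclipped: the last entry explodes (~8103x)
print(np.exp(np.minimum(d_log_w, clip))) # clipped: at most max_size / min_stride = 250x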
...@@ -30,19 +30,17 @@ class ProposalTarget(object): ...@@ -30,19 +30,17 @@ class ProposalTarget(object):
def __init__(self): def __init__(self):
super(ProposalTarget, self).__init__() super(ProposalTarget, self).__init__()
self.num_strides = len(cfg.RPN.STRIDES) self.num_strides = len(cfg.RPN.STRIDES)
self.num_classes = cfg.MODEL.NUM_CLASSES self.num_classes = len(cfg.MODEL.CLASSES)
self.defaults = collections.OrderedDict([ self.defaults = collections.OrderedDict([
('rois', np.array([[-1, 0, 0, 1, 1]], 'float32')), ('rois', np.array([[-1, 0, 0, 1, 1]], 'float32')),
('labels', np.array([-1], 'int64')), ('labels', np.array([-1], 'int64')),
('bbox_targets', np.zeros((1, 4), 'float32')), ('bbox_targets', np.zeros((1, 4), 'float32')),
]) ])
def __call__(self, rpn_rois, gt_boxes): def __call__(self, **inputs):
num_images = cfg.TRAIN.IMS_PER_BATCH num_images = cfg.TRAIN.IMS_PER_BATCH
# Proposal ROIs (0, x1, y1, x2, y2) coming from RPN # Proposal ROIs (0, x1, y1, x2, y2) coming from RPN
all_rois = rpn_rois all_rois = inputs['rois']
# GT boxes (x1, y1, x2, y2, label)
gt_boxes_wide = box_util.dismantle_boxes(gt_boxes, num_images)
# Prepare for the outputs # Prepare for the outputs
keys = self.defaults.keys() keys = self.defaults.keys()
...@@ -50,22 +48,22 @@ class ProposalTarget(object): ...@@ -50,22 +48,22 @@ class ProposalTarget(object):
# Generate targets separately # Generate targets separately
for ix in range(num_images): for ix in range(num_images):
gt_boxes = gt_boxes_wide[ix] # GT boxes (x1, y1, x2, y2, label)
gt_boxes = inputs['gt_boxes'][ix]
# Extract proposals for this image # Extract proposals for this image
rois = all_rois[np.where(all_rois[:, 0].astype('int32') == ix)[0]] rois = all_rois[np.where(all_rois[:, 0].astype('int32') == ix)[0]]
# Include ground-truth boxes in the set of candidate rois # Include ground-truth boxes in the set of candidate rois
inds = np.ones((gt_boxes.shape[0], 1), gt_boxes.dtype) * ix inds = np.ones((gt_boxes.shape[0], 1), gt_boxes.dtype) * ix
rois = np.vstack((rois, np.hstack((inds, gt_boxes[:, :4])))) rois = np.vstack((rois, np.hstack((inds, gt_boxes[:, :4]))))
# Sample a batch of RoIs for training # Sample a batch of RoIs for training
rois_per_image = cfg.TRAIN.BATCH_SIZE rois_per_image = cfg.FRCNN.BATCH_SIZE
fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image) fg_rois_per_image = np.round(cfg.FRCNN.FG_FRACTION * rois_per_image)
rcnn_util.map_returns_to_blobs( rcnn_util.map_returns_to_blobs(
sample_rois( sample_rois(rois,
rois, gt_boxes,
gt_boxes, rois_per_image,
rois_per_image, fg_rois_per_image),
fg_rois_per_image, blobs, keys,
), blobs, keys,
) )
# Stack into continuous blobs # Stack into continuous blobs
...@@ -95,7 +93,7 @@ class ProposalTarget(object): ...@@ -95,7 +93,7 @@ class ProposalTarget(object):
return { return {
'rois': [new_tensor(rois) for rois in rois_wide], 'rois': [new_tensor(rois) for rois in rois_wide],
'labels': new_tensor(blobs['labels']), 'labels': new_tensor(blobs['labels']),
'bbox_indices': new_tensor(cls_inds[fg_inds] + blobs['labels'][fg_inds]), 'bbox_inds': new_tensor(cls_inds[fg_inds] + blobs['labels'][fg_inds]),
'bbox_targets': new_tensor(blobs['bbox_targets'][fg_inds].astype('float32')), 'bbox_targets': new_tensor(blobs['bbox_targets'][fg_inds].astype('float32')),
'bbox_anchors': new_tensor(blobs['rois'][fg_inds, 1:].astype('float32')), 'bbox_anchors': new_tensor(blobs['rois'][fg_inds, 1:].astype('float32')),
} }
...@@ -108,8 +106,8 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois): ...@@ -108,8 +106,8 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois):
max_overlaps = overlaps.max(axis=1) max_overlaps = overlaps.max(axis=1)
labels = gt_boxes[gt_assignment, 4].astype('int64') labels = gt_boxes[gt_assignment, 4].astype('int64')
# Select foreground RoIs as those with >= FG_THRESH overlap # Select foreground RoIs as those with >= POSITIVE_OVERLAP
fg_thresh = cfg.TRAIN.FG_THRESH fg_thresh = cfg.FRCNN.POSITIVE_OVERLAP
fg_inds = np.where(max_overlaps >= fg_thresh)[0] fg_inds = np.where(max_overlaps >= fg_thresh)[0]
while fg_inds.size == 0: while fg_inds.size == 0:
fg_thresh -= 0.01 fg_thresh -= 0.01
...@@ -119,9 +117,10 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois): ...@@ -119,9 +117,10 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois):
fg_rois_per_this_image = int(min(num_fg_rois, fg_inds.size)) fg_rois_per_this_image = int(min(num_fg_rois, fg_inds.size))
fg_inds = npr.choice(fg_inds, fg_rois_per_this_image, False) fg_inds = npr.choice(fg_inds, fg_rois_per_this_image, False)
# Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI) # Select background RoIs as those within
bg_inds = np.where((max_overlaps < cfg.TRAIN.BG_THRESH_HI) & # [NEGATIVE_OVERLAP_LO, NEGATIVE_OVERLAP_HI)
(max_overlaps >= cfg.TRAIN.BG_THRESH_LO))[0] bg_inds = np.where((max_overlaps < cfg.FRCNN.NEGATIVE_OVERLAP_HI) &
(max_overlaps >= cfg.FRCNN.NEGATIVE_OVERLAP_LO))[0]
# Compute number of background RoIs to take from this image # Compute number of background RoIs to take from this image
bg_rois_per_this_image = num_rois - fg_rois_per_this_image bg_rois_per_this_image = num_rois - fg_rois_per_this_image
bg_rois_per_this_image = min(bg_rois_per_this_image, bg_inds.size) bg_rois_per_this_image = min(bg_rois_per_this_image, bg_inds.size)
...@@ -129,7 +128,7 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois): ...@@ -129,7 +128,7 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois):
if bg_inds.size > 0: if bg_inds.size > 0:
bg_inds = npr.choice(bg_inds, bg_rois_per_this_image, False) bg_inds = npr.choice(bg_inds, bg_rois_per_this_image, False)
# The indices that we're selecting (both fg and bg) # The selected indices (both fg and bg)
keep_inds = np.append(fg_inds, bg_inds) keep_inds = np.append(fg_inds, bg_inds)
# Select sampled values from various arrays # Select sampled values from various arrays
rois, labels = all_rois[keep_inds], labels[keep_inds] rois, labels = all_rois[keep_inds], labels[keep_inds]
...@@ -137,12 +136,9 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois): ...@@ -137,12 +136,9 @@ def sample_rois(all_rois, gt_boxes, num_rois, num_fg_rois):
labels[fg_rois_per_this_image:] = 0 labels[fg_rois_per_this_image:] = 0
# Compute the target from RoIs # Compute the target from RoIs
return [ outputs = [rois, labels]
rois, outputs += [box_util.bbox_transform(
labels, rois[:, 1:5],
box_util.bbox_transform( gt_boxes[gt_assignment[keep_inds], :4],
rois[:, 1:5], cfg.BBOX_REG_WEIGHTS)]
gt_boxes[gt_assignment[keep_inds], :4], return outputs
cfg.BBOX_REG_WEIGHTS,
)
]
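sample_rois hands its sampled (roi, matched gt) pairs to box_util.bbox_transform to produce regression targets. A minimal sketch of the usual encoding (the inverse of the proposal decoding), assuming cfg.BBOX_REG_WEIGHTS is a 4-tuple of per-coordinate weights; encode_boxes is an illustrative name only.

import numpy as np

def encode_boxes(ex_rois, gt_rois, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return (dx, dy, dw, dh) targets that map ex_rois onto gt_rois."""
    ex_w = ex_rois[:, 2] - ex_rois[:, 0] + 1.0
    ex_h = ex_rois[:, 3] - ex_rois[:, 1] + 1.0
    ex_cx = ex_rois[:, 0] + 0.5 * ex_w
    ex_cy = ex_rois[:, 1] + 0.5 * ex_h
    gt_w = gt_rois[:, 2] - gt_rois[:, 0] + 1.0
    gt_h = gt_rois[:, 3] - gt_rois[:, 1] + 1.0
    gt_cx = gt_rois[:, 0] + 0.5 * gt_w
    gt_cy = gt_rois[:, 1] + 0.5 * gt_h
    wx, wy, ww, wh = weights
    return np.stack([wx * (gt_cx - ex_cx) / ex_w,
                     wy * (gt_cy - ex_cy) / ex_h,
                     ww * np.log(gt_w / ex_w),
                     wh * np.log(gt_h / ex_h)], axis=1).astype('float32')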
...@@ -20,97 +20,131 @@ import numpy as np ...@@ -20,97 +20,131 @@ import numpy as np
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.modeling.detector import new_detector from seetadet.modeling.detector import new_detector
from seetadet.utils import blob as blob_util
from seetadet.utils import boxes as box_util from seetadet.utils import boxes as box_util
from seetadet.utils import image as image_util
from seetadet.utils import logger
from seetadet.utils import nms as nms_util from seetadet.utils import nms as nms_util
from seetadet.utils import time_util from seetadet.utils import time_util
from seetadet.utils.blob import im_list_to_blob
from seetadet.utils.image import scale_image
def im_detect(detector, raw_image): def get_data(raw_images):
"""Detect a image, with single or multiple scales.""" """Return the test data."""
ims, ims_scale = scale_image(raw_image) max_size = cfg.TEST.MAX_SIZE
images_wide = []
# Prepare blobs image_shapes_wide, image_scales_wide = [], []
data = im_list_to_blob(ims) for img in raw_images:
ims_info = np.array([list(data.shape[1:3]) + [im_scale] images, image_scales = image_util.scale_image(
for im_scale in ims_scale], dtype=np.float32) img, scales=cfg.TEST.SCALES, max_size=max_size)
images_wide += images
# Do Forward image_scales_wide += image_scales
data = torch.from_numpy(data) image_shapes_wide += [img.shape[:2] for img in images]
ims_info = torch.from_numpy(ims_info) images = blob_util.im_list_to_blob(
images_wide, coarsest_stride=cfg.MODEL.COARSEST_STRIDE)
image_shapes = np.array(image_shapes_wide)
image_scales = np.array(image_scales_wide).reshape((len(images), -1))
images_info = np.hstack([image_shapes, image_scales]).astype('float32')
return images, images_info
def ims_detect(detector, raw_images, timer=None):
"""Detect images at single or multiple scales."""
images, images_info = get_data(raw_images)
timer.tic() if timer else timer
# Do forward
inputs = {'image': torch.from_numpy(images),
'im_info': torch.from_numpy(images_info)}
if not hasattr(detector, 'script_forward'): if not hasattr(detector, 'script_forward'):
def script_forward(self, data, ims_info): def script_forward(self, image, im_info):
return self.forward({'data': data, 'ims_info': ims_info}) return self.forward({'image': image, 'im_info': im_info})
detector.script_forward = torch.jit.trace( detector.script_forward = torch.jit.trace(
func=types.MethodType(script_forward, detector), func=types.MethodType(script_forward, detector),
example_inputs=[data, ims_info], example_inputs=[inputs['image'], inputs['im_info']],
) )
outputs = detector.script_forward(inputs['image'], inputs['im_info'])
outputs = detector.script_forward(data, ims_info)
outputs = dict((k, outputs[k].numpy()) for k in outputs.keys()) outputs = dict((k, outputs[k].numpy()) for k in outputs.keys())
# Decode results # Decode results
all_scores, all_boxes = [], [] batch_pred = box_util.bbox_transform_inv(
pred_boxes = box_util.bbox_transform_inv(
outputs['rois'][:, 1:5], outputs['rois'][:, 1:5],
outputs['bbox_pred'], outputs['bbox_pred'],
cfg.BBOX_REG_WEIGHTS, cfg.BBOX_REG_WEIGHTS)
) results = [([], []) for _ in range(len(raw_images))]
for i in range(len(images)):
for i in range(len(ims)): ii = i // len(cfg.TEST.SCALES)
inds = np.where(outputs['rois'][:, 0].astype(np.int32) == i)[0] inds = np.where(outputs['rois'][:, 0].astype(np.int32) == i)[0]
boxes = pred_boxes[inds] / ims_scale[i] boxes = batch_pred[inds] / images_info[i][2]
all_scores.append(outputs['cls_prob'][inds]) boxes = box_util.clip_tiled_boxes(boxes, raw_images[ii].shape)
all_boxes.append(box_util.clip_tiled_boxes(boxes, raw_image.shape)) results[ii][0].append(outputs['cls_prob'][inds])
results[ii][1].append(boxes)
return np.vstack(all_scores), np.vstack(all_boxes)
# Merge from multiple scales
ret = [(np.vstack(s), np.vstack(b)) for s, b in results]
def test_net(weights, num_classes, q_in, q_out, device): timer.toc() if timer else timer
num_classes, cfg.GPU_ID = num_classes, device return ret
def test_net(weights, q_in, q_out, device, root_logger=True):
"""Test a network trained with Faster R-CNN algorithm."""
cfg.GPU_ID = device
num_classes = len(cfg.MODEL.CLASSES)
logger.set_root_logger(root_logger)
detector = new_detector(device, weights) detector = new_detector(device, weights)
_t = time_util.new_timers('im_detect', 'misc') must_stop = False
timers = time_util.new_timers('im_detect_bbox', 'misc')
empty_detections = np.zeros((0, 5), 'float32')
while True: while True:
i, raw_image = q_in.get() if must_stop:
if i < 0:
break break
indices, raw_images = [], []
boxes_this_image = [[]] for _ in range(cfg.TEST.IMS_PER_BATCH):
i, raw_image = q_in.get()
with _t['im_detect'].tic_and_toc(): if i < 0:
scores, boxes = im_detect(detector, raw_image) must_stop = True
break
_t['misc'].tic() indices.append(i)
for j in range(1, num_classes): raw_images.append(raw_image)
inds = np.where(scores[:, j] > cfg.TEST.SCORE_THRESH)[0]
cls_scores = scores[inds, j] if len(raw_images) == 0:
cls_boxes = boxes[inds, j * 4:(j + 1) * 4] continue
cls_detections = np.hstack(
(cls_boxes, cls_scores[:, np.newaxis]) results = ims_detect(detector, raw_images, timers['im_detect_bbox'])
).astype(np.float32, copy=False)
if cfg.TEST.USE_SOFT_NMS: for i, (scores, boxes) in enumerate(results):
keep = nms_util.soft_nms( timers['misc'].tic()
cls_detections, boxes_this_image = [[]]
thresh=cfg.TEST.NMS, for j in range(1, num_classes):
method=cfg.TEST.SOFT_NMS_METHOD, inds = np.where(scores[:, j] > cfg.TEST.SCORE_THRESH)[0]
sigma=cfg.TEST.SOFT_NMS_SIGMA, if len(inds) == 0:
) boxes_this_image.append(empty_detections)
else: continue
keep = nms_util.nms( cls_scores = scores[inds, j]
cls_detections, cls_boxes = boxes[inds, j * 4:(j + 1) * 4]
thresh=cfg.TEST.NMS, cls_detections = np.hstack(
) (cls_boxes, cls_scores[:, np.newaxis])) \
cls_detections = cls_detections[keep, :] .astype(np.float32, copy=False)
boxes_this_image.append(cls_detections) if cfg.TEST.USE_SOFT_NMS:
_t['misc'].toc() keep = nms_util.soft_nms(
cls_detections,
q_out.put(( thresh=cfg.TEST.NMS,
i, method=cfg.TEST.SOFT_NMS_METHOD,
dict([('im_detect', _t['im_detect'].average_time), sigma=cfg.TEST.SOFT_NMS_SIGMA,
('misc', _t['misc'].average_time)]), )
dict([('boxes', boxes_this_image)]), else:
)) keep = nms_util.nms(
cls_detections,
thresh=cfg.TEST.NMS,
)
cls_detections = cls_detections[keep, :]
boxes_this_image.append(cls_detections)
timers['misc'].toc()
q_out.put((
indices[i],
dict([('im_detect', timers['im_detect_bbox'].average_time),
('misc', timers['misc'].average_time)]),
dict([('boxes', boxes_this_image)]),
))
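nms_util.nms is used above as a black box with a single IoU threshold. For orientation, here is a plain-NumPy sketch of the greedy suppression such a hard-NMS routine conventionally performs; this is illustrative, not the repository's optimized implementation.

import numpy as np

def greedy_nms(dets, thresh=0.7):
    """Keep (x1, y1, x2, y2, score) rows whose IoU with any higher-scoring
    kept box does not exceed ``thresh``."""
    x1, y1, x2, y2, scores = (dets[:, i] for i in range(5))
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = (np.maximum(0.0, xx2 - xx1 + 1) *
                 np.maximum(0.0, yy2 - yy1 + 1))
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[np.where(iou <= thresh)[0] + 1]
    return keep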
...@@ -19,43 +19,78 @@ import numpy as np ...@@ -19,43 +19,78 @@ import numpy as np
from seetadet.core.config import cfg from seetadet.core.config import cfg
def generate_grid_anchors(grid_shapes, base_anchors, strides): def get_shifted_coords(shapes, base_anchors):
num_strides = len(strides) """Return the x-y coordinates of shifted anchors."""
if len(grid_shapes) != num_strides: xs, ys = [], []
raise ValueError( for i in range(len(shapes)):
'Given %d grids for %d strides.' height, width = shapes[i]
% (len(grid_shapes), num_strides) x, y = np.arange(0, width), np.arange(0, height)
) x, y = np.meshgrid(x, y)
# Generate proposals from shifted anchors # Add A anchors (A,) to cell K shifts (K,)
# to get shift coords (A, K)
xs.append(np.tile(x.flatten(), base_anchors[i].shape[0]))
ys.append(np.tile(y.flatten(), base_anchors[i].shape[0]))
return np.concatenate(xs), np.concatenate(ys)
def get_shifted_anchors(shapes, base_anchors, strides):
"""Return the shifted anchors on given shapes."""
anchors_to_pack = [] anchors_to_pack = []
for i in range(len(grid_shapes)): for i in range(len(shapes)):
height, width = grid_shapes[i] height, width = shapes[i]
shift_x = np.arange(0, width) * strides[i] shift_x = np.arange(0, width) * strides[i]
shift_y = np.arange(0, height) * strides[i] shift_y = np.arange(0, height) * strides[i]
shift_x, shift_y = np.meshgrid(shift_x, shift_y) shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(), shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
shift_x.ravel(), shift_y.ravel())).transpose() shift_x.ravel(), shift_y.ravel())).transpose()
# Add a anchors (1, a, 4) to # Add A anchors (A, 1, 4) to cell K shifts (1, K, 4)
# cell k shifts (k, 1, 4) to get # to get shift anchors (A, K, 4)
# shift anchors (k, a, 4)
# Reshape to (k * a, 4) shifted anchors
a = base_anchors[i].shape[0] a = base_anchors[i].shape[0]
k = shifts.shape[0] k = shifts.shape[0]
anchors = (base_anchors[i].reshape((1, a, 4)) + anchors = (base_anchors[i].reshape((a, 1, 4)) +
shifts.reshape((1, k, 4)).transpose((1, 0, 2))) shifts.reshape((1, k, 4)))
if num_strides > 1: anchors_to_pack.append(anchors.reshape((a * k, 4)))
# Transpose from (K, A, 4) to (A, K, 4)
# We will pack it with other strides to
# match the data format of (N, C, H, W)
anchors = anchors.transpose((1, 0, 2))
anchors = anchors.reshape((a * k, 4))
anchors_to_pack.append(anchors)
else:
# Original order of Faster R-CNN
return anchors.reshape((k * a, 4))
return np.vstack(anchors_to_pack) return np.vstack(anchors_to_pack)
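The (A, 1, 4) + (1, K, 4) broadcast in get_shifted_anchors can be sanity-checked with a toy layout; the numbers below are made up and only illustrate how A base anchors and K = H * W cell shifts expand to A * K shifted anchors.

import numpy as np

base = np.array([[-8, -8, 8, 8],
                 [-16, -8, 16, 8],
                 [-8, -16, 8, 16]], 'float32')        # A = 3 base anchors
h, w, stride = 2, 3, 16                               # K = h * w = 6 cells
sx, sy = np.meshgrid(np.arange(w) * stride, np.arange(h) * stride)
shifts = np.vstack((sx.ravel(), sy.ravel(),
                    sx.ravel(), sy.ravel())).transpose()  # (K, 4)
anchors = base.reshape((3, 1, 4)) + shifts.reshape((1, 6, 4))
print(anchors.reshape((-1, 4)).shape)                 # (18, 4) == (A * K, 4)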
def narrow_anchors(
all_coords,
base_anchors,
max_shapes,
shapes,
inds,
remapping=None,
):
"""Return the valid shifted anchors on given shapes."""
x_coords, y_coords = all_coords
inds_wide, remapping_wide = [], []
offset = num = 0
for i in range(len(max_shapes)):
num += base_anchors[i].shape[0] * np.prod(max_shapes[i])
inds_inside = np.where((inds >= offset) & (inds < num))[0]
inds_wide.append(inds[inds_inside])
if remapping is not None:
remapping_wide.append(remapping[inds_inside])
offset = num
offset1 = offset2 = num1 = num2 = 0
for i in range(len(max_shapes)):
num1 += base_anchors[i].shape[0] * np.prod(max_shapes[i])
num2 += base_anchors[i].shape[0] * np.prod(shapes[i])
inds = inds_wide[i]
x, y = x_coords[inds], y_coords[inds]
a = ((inds - offset1) // max_shapes[i][1]) // max_shapes[i][0]
inds = (a * shapes[i][0] + y) * shapes[i][1] + x + offset2
inds_mask = np.where((x < shapes[i][1]) & (y < shapes[i][0]))[0]
inds_wide[i] = inds[inds_mask]
if remapping is not None:
remapping_wide[i] = remapping_wide[i][inds_mask]
offset1, offset2 = num1, num2
outputs = [np.concatenate(inds_wide)]
if remapping is not None:
outputs += [np.concatenate(remapping_wide)]
return outputs[0] if len(outputs) == 1 else outputs
def map_returns_to_blobs(returns, blobs, keys): def map_returns_to_blobs(returns, blobs, keys):
"""Map returns of image to blobs.""" """Map returns of image to blobs."""
for i, key in enumerate(keys): for i, key in enumerate(keys):
...@@ -83,6 +118,5 @@ def map_blobs_by_levels(blobs, defaults, lvl_inds): ...@@ -83,6 +118,5 @@ def map_blobs_by_levels(blobs, defaults, lvl_inds):
outputs[key].append( outputs[key].append(
blob[inds] blob[inds]
if len(inds) > 0 if len(inds) > 0
else defaults[key] else defaults[key])
)
return outputs return outputs
...@@ -13,8 +13,11 @@ from __future__ import absolute_import ...@@ -13,8 +13,11 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import collections
import multiprocessing as mp import multiprocessing as mp
import time import time
import threading
import queue
import dragon import dragon
import dragon.vm.torch as torch import dragon.vm.torch as torch
...@@ -23,9 +26,8 @@ import numpy as np ...@@ -23,9 +26,8 @@ import numpy as np
from seetadet.algo.mask_rcnn import data_transformer from seetadet.algo.mask_rcnn import data_transformer
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.datasets.factory import get_dataset from seetadet.datasets.factory import get_dataset
from seetadet.utils import blob as blob_util
from seetadet.utils import logger from seetadet.utils import logger
from seetadet.utils.blob import im_list_to_blob
from seetadet.utils.blob import mask_list_to_blob
class DataLoader(object): class DataLoader(object):
...@@ -39,19 +41,19 @@ class DataLoader(object): ...@@ -39,19 +41,19 @@ class DataLoader(object):
'source': dataset.source, 'source': dataset.source,
'classes': dataset.classes, 'classes': dataset.classes,
'shuffle': cfg.TRAIN.USE_SHUFFLE, 'shuffle': cfg.TRAIN.USE_SHUFFLE,
'num_chunks': cfg.TRAIN.SHUFFLE_CHUNKS,
'batch_size': cfg.TRAIN.IMS_PER_BATCH * 2, 'batch_size': cfg.TRAIN.IMS_PER_BATCH * 2,
'num_transformers': cfg.TRAIN.NUM_THREADS - 1, 'num_transformers': cfg.TRAIN.NUM_THREADS - 1,
}) })
self.iterator.start()
def __call__(self): def __call__(self):
outputs = self.iterator.next() outputs = self.iterator.next()
if isinstance(outputs['data'], np.ndarray): if isinstance(outputs['image'], np.ndarray):
outputs['data'] = torch.from_numpy(outputs['data']) outputs['image'] = torch.from_numpy(outputs['image'])
return outputs return outputs
class Iterator(mp.Process): class Iterator(threading.Thread):
"""Iterator to return the batch of data.""" """Iterator to return the batch of data."""
def __init__(self, **kwargs): def __init__(self, **kwargs):
...@@ -65,17 +67,16 @@ class Iterator(mp.Process): ...@@ -65,17 +67,16 @@ class Iterator(mp.Process):
rank = dragon.distributed.get_rank(process_group) rank = dragon.distributed.get_rank(process_group)
# Configuration # Configuration
self._prefetch = kwargs.get('prefetch', 5)
self._batch_size = kwargs.get('batch_size', 2) self._batch_size = kwargs.get('batch_size', 2)
self._num_readers = kwargs.get('num_readers', 1) self._num_readers = kwargs.get('num_readers', 1)
self._num_transformers = kwargs.get('num_transformers', 3) self._num_transformers = kwargs.get('num_transformers', 3)
self.daemon = True self.daemon = True
# Initialize queues # Initialize queues
num_batches = self._prefetch * self._num_readers num_batches = self._num_readers
self.q_in = mp.Queue(num_batches * self._batch_size) self._queue1 = mp.Queue(num_batches * self._batch_size)
self.q1_out = mp.Queue(num_batches * self._batch_size) self._queue2 = mp.Queue(num_batches * self._batch_size)
self.q2_out = mp.Queue(num_batches * self._batch_size) self._queue3 = queue.Queue(num_batches)
# Initialize readers # Initialize readers
self._readers = [] self._readers = []
...@@ -86,7 +87,7 @@ class Iterator(mp.Process): ...@@ -86,7 +87,7 @@ class Iterator(mp.Process):
self._readers.append(dragon.io.DataReader( self._readers.append(dragon.io.DataReader(
part_idx=part_idx, num_parts=num_parts, **kwargs)) part_idx=part_idx, num_parts=num_parts, **kwargs))
self._readers[i]._seed += part_idx self._readers[i]._seed += part_idx
self._readers[i].q_out = self.q_in self._readers[i].q_out = self._queue1
self._readers[i].start() self._readers[i].start()
time.sleep(0.1) time.sleep(0.1)
...@@ -95,8 +96,7 @@ class Iterator(mp.Process): ...@@ -95,8 +96,7 @@ class Iterator(mp.Process):
for i in range(self._num_transformers): for i in range(self._num_transformers):
p = data_transformer.DataTransformer(**kwargs) p = data_transformer.DataTransformer(**kwargs)
p._seed += (i + rank * self._num_transformers) p._seed += (i + rank * self._num_transformers)
p.q_in = self.q_in p.q_in, p.q_out = self._queue1, self._queue2
p.q1_out, p.q2_out = self.q1_out, self.q2_out
p.start() p.start()
self._transformers.append(p) self._transformers.append(p)
time.sleep(0.1) time.sleep(0.1)
...@@ -119,38 +119,44 @@ class Iterator(mp.Process): ...@@ -119,38 +119,44 @@ class Iterator(mp.Process):
"""Return the next batch of data.""" """Return the next batch of data."""
return self.__next__() return self.__next__()
def run(self):
"""Main loop."""
num_images = cfg.TRAIN.IMS_PER_BATCH
num_batches = cfg.TRAIN.ASPECT_GROUPING
logger.info('Initialize prefetching batches...')
example_buffer = [self._queue2.get()
for _ in range(num_images * num_batches)]
next_examples = []
while True:
# Use cached buffer for next N examples
# Examples are sorted to simulate aspect grouping
if len(next_examples) == 0:
next_examples = example_buffer
next_examples.sort(key=lambda d: d['aspect_ratio'])
example_buffer = []
# Prepare the next batch
outputs = collections.defaultdict(list)
for i in range(num_images):
example = next_examples.pop(0)
outputs['image'].append(example['image'])
outputs['gt_boxes'].append(example['boxes'])
outputs['gt_segms'].append(example['segms'])
outputs['im_info'].append(example['im_info'])
outputs['fg_inds'].append(example.get('fg_inds', None))
outputs['bg_inds'].append(example.get('bg_inds', None))
example_buffer.append(self._queue2.get())
outputs['image'] = blob_util.im_list_to_blob(
outputs['image'], coarsest_stride=cfg.MODEL.COARSEST_STRIDE)
# Send batch data to consumer
self._queue3.put(outputs)
def __iter__(self): def __iter__(self):
"""Return the iterator self.""" """Return the iterator self."""
return self return self
def __next__(self): def __next__(self):
"""Return the next batch of data.""" """Return the next batch of data."""
q_out = None return self._queue3.get()
# Two queues to implement aspect-grouping
# This is necessary to reduce the gpu memory
# from fetching a huge square batch blob
while q_out is None:
if self.q1_out.qsize() >= cfg.TRAIN.IMS_PER_BATCH:
q_out = self.q1_out
elif self.q2_out.qsize() >= cfg.TRAIN.IMS_PER_BATCH:
q_out = self.q2_out
self.q1_out, self.q2_out = self.q2_out, self.q1_out
images, images_info = [], []
boxes_to_pack, masks_to_pack = [], []
for i in range(cfg.TRAIN.IMS_PER_BATCH):
image, image_scale, boxes, masks = q_out.get()
images.append(image)
images_info.append(list(image.shape[:2]) + [image_scale])
gt_boxes = np.zeros((boxes.shape[0], boxes.shape[1] + 1), 'float32')
gt_boxes[:, :boxes.shape[1]], gt_boxes[:, -1] = boxes, i
boxes_to_pack.append(gt_boxes)
masks_to_pack.append(masks)
return {
'data': im_list_to_blob(images),
'ims_info': np.array(images_info, 'float32'),
'gt_boxes': np.concatenate(boxes_to_pack),
'gt_masks': mask_list_to_blob(masks_to_pack),
}
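The rewritten Iterator.run above buffers examples and sorts them by aspect ratio before cutting batches, replacing the old two-queue scheme. A small illustrative sketch of that grouping idea; group_by_aspect and the toy buffer are made up, not part of the repository.

def group_by_aspect(examples, ims_per_batch):
    """Sort a buffered list of examples by aspect ratio and cut it into
    batches, so each batch mixes images of similar shape (less padding)."""
    examples = sorted(examples, key=lambda d: d['aspect_ratio'])
    return [examples[i:i + ims_per_batch]
            for i in range(0, len(examples), ims_per_batch)]

buffer = [{'aspect_ratio': 1.5}, {'aspect_ratio': 0.7},
          {'aspect_ratio': 0.8}, {'aspect_ratio': 1.4}]
for batch in group_by_aspect(buffer, 2):
    print([d['aspect_ratio'] for d in batch])  # [0.7, 0.8] then [1.4, 1.5]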
...@@ -15,134 +15,136 @@ from __future__ import print_function ...@@ -15,134 +15,136 @@ from __future__ import print_function
import multiprocessing import multiprocessing
import cv2
import numpy as np import numpy as np
import numpy.random as npr
from seetadet.algo import common as algo_common
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.datasets.example import Example from seetadet.datasets.example import Example
from seetadet.pycocotools import mask_utils from seetadet.utils.pycocotools import mask_utils
from seetadet.utils import boxes as box_util from seetadet.utils import boxes as box_util
from seetadet.utils.blob import prep_im_for_blob from seetadet.utils import image as image_util
class DataTransformer(multiprocessing.Process): class DataTransformer(multiprocessing.Process):
def __init__(self, **kwargs): def __init__(self, **kwargs):
super(DataTransformer, self).__init__() super(DataTransformer, self).__init__()
self._scales = cfg.TRAIN.SCALES self._scales = cfg.TRAIN.SCALES
self._random_scales = cfg.TRAIN.RANDOM_SCALES
self._max_size = cfg.TRAIN.MAX_SIZE self._max_size = cfg.TRAIN.MAX_SIZE
self._seed = cfg.RNG_SEED self._seed = cfg.RNG_SEED
self._use_flipped = cfg.TRAIN.USE_FLIPPED
self._use_diff = cfg.TRAIN.USE_DIFF self._use_diff = cfg.TRAIN.USE_DIFF
self._use_flipped = cfg.TRAIN.USE_FLIPPED
self._use_distort = cfg.TRAIN.USE_COLOR_JITTER
self._classes = kwargs.get('classes', ('__background__',)) self._classes = kwargs.get('classes', ('__background__',))
self._num_classes = len(self._classes) self._num_classes = len(self._classes)
self._class_to_ind = dict(zip(self._classes, range(self._num_classes))) self._class_to_ind = dict(zip(self._classes, range(self._num_classes)))
self.q_in = self.q1_out = self.q2_out = None self._anchor_sampler = algo_common.AnchorSampler()
self.q_in = self.q_out = None
self.daemon = True self.daemon = True
def make_roi_dict(self, example, im_scale, apply_flip=False): def get_boxes_and_segms(self, example, im_scale, flipped):
objects, n_objects = example.objects, 0 objects, num_objects = example.objects, 0
height, width = example.height, example.width height, width = example.height, example.width
if not self._use_diff: if not self._use_diff:
for obj in objects: for obj in objects:
if obj.get('difficult', 0) == 0: if obj.get('difficult', 0) == 0:
n_objects += 1 num_objects += 1
else: else:
n_objects = len(objects) num_objects = len(objects)
roi_dict = { boxes, segms = np.zeros((num_objects, 4), 'float32'), []
'boxes': np.zeros((n_objects, 4), 'float32'), gt_classes = np.zeros((num_objects,), 'float32')
'masks': np.empty((n_objects, height, width), 'uint8'), segm_flags = np.ones((num_objects,), 'float32')
'gt_classes': np.zeros((n_objects, 1), 'int32'),
'mask_flags': np.ones((n_objects, 1), 'float32'),
}
# Filter the difficult instances # Filter the difficult instances.
object_idx = 0 object_idx = 0
for obj in objects: for obj in objects:
if not self._use_diff and \ if not self._use_diff and obj.get('difficult', 0) > 0:
obj.get('difficult', 0) > 0:
continue continue
bbox, mask = obj['bbox'], obj['mask'] bbox = obj['bbox']
roi_dict['boxes'][object_idx, :] = [ boxes[object_idx, :] = [max(0, bbox[0]),
max(0, bbox[0]), max(0, bbox[1]),
max(0, bbox[1]), min(bbox[2], width - 1),
min(bbox[2], width - 1), min(bbox[3], height - 1)]
min(bbox[3], height - 1), if 'mask' in obj:
] mask_img = mask_utils.bytes2img(obj['mask'], height, width)
if mask is not None: segms.append(mask_img[:, ::-1] if flipped else mask_img)
roi_dict['masks'][object_idx] = ( elif 'polygons' in obj:
mask_utils.bytes2img( polygons = obj['polygons']
obj['mask'], segms.append(box_util.flip_polygons(
height, polygons, width) if flipped else polygons)
width,
))
else: else:
roi_dict['mask_flags'][object_idx] = 0. segms.append(None)
roi_dict['gt_classes'][object_idx] = \ segm_flags[object_idx] = 0.
self._class_to_ind[obj['name']] gt_classes[object_idx] = self._class_to_ind[obj['name']]
object_idx += 1 object_idx += 1
# Flip the boxes if necessary # Scale the boxes to the detecting scale.
if apply_flip: boxes *= im_scale
roi_dict['boxes'] = \
box_util.flip_boxes(
roi_dict['boxes'],
width,
)
# Scale the boxes to the detecting scale # Attach the classes and mask flags.
roi_dict['boxes'] *= im_scale gt_boxes = np.empty((num_objects, 6), dtype=np.float32)
gt_boxes[:, :4], gt_boxes[:, 4] = boxes, gt_classes
gt_boxes[:, 5] = segm_flags # Has segmentation or not.
return roi_dict return gt_boxes, segms
def get(self, example): def get(self, example):
example = Example(example) example = Example(example)
img = example.image
# Scale
target_size = self._scales[np.random.randint(len(self._scales))]
img, im_scale = prep_im_for_blob(img, target_size, self._max_size)
# Flip
apply_flip = False
if self._use_flipped:
if np.random.randint(2) > 0:
img = img[:, ::-1]
apply_flip = True
# Example -> RoIDict
roi_dict = self.make_roi_dict(example, im_scale, apply_flip)
# Post-Process for gt boxes
# Shape like: [num_objects, {x1, y1, x2, y2, cls, flag}]
gt_boxes = \
np.concatenate([
roi_dict['boxes'],
roi_dict['gt_classes'],
roi_dict['mask_flags']
], axis=1)
# Post-Process for gt masks
# Shape like: [num_objects, im_h, im_w]
if gt_boxes.shape[0] > 0:
gt_masks = roi_dict['masks']
if apply_flip:
gt_masks = gt_masks[:, :, ::-1]
else:
gt_masks = None
return img, im_scale, gt_boxes, gt_masks # Resize.
img, im_scale = image_util.resize_image_with_target_size(
example.image,
target_size=npr.choice(self._scales),
max_size=self._max_size,
random_scales=self._random_scales,
)
# Flip.
flipped = False
if self._use_flipped and npr.randint(2) > 0:
img = img[:, ::-1]
flipped = True
# Distort.
if self._use_distort:
img = image_util.distort_image(img)
# Boxes and segmentations.
boxes, segms = self.get_boxes_and_segms(example, im_scale, flipped)
# Flip the boxes if necessary.
if flipped:
boxes = box_util.flip_boxes(boxes, img.shape[1])
# Standard outputs.
outputs = {'image': img,
'boxes': boxes,
'segms': segms,
'im_info': img.shape[:2] + (im_scale,)}
# Attach precomputed targets.
if len(boxes) > 0:
outputs.update(
self._anchor_sampler(
gt_boxes=boxes,
im_info=outputs['im_info']))
return outputs
def run(self): def run(self):
# Fix the process-local random seed # Disable the opencv threading.
cv2.setNumThreads(1)
# Fix the process-local random seed.
np.random.seed(self._seed) np.random.seed(self._seed)
# Main prefetch loop # Main prefetch loop
while True: while True:
outputs = self.get(self.q_in.get()) outputs = self.get(self.q_in.get())
if len(outputs[2]) < 1: if len(outputs['boxes']) < 1:
continue # Ignore the non-object image continue # Ignore non-object image.
aspect_ratio = float(outputs[0].shape[0]) / outputs[0].shape[1] height, width = outputs['image'].shape[:2]
if aspect_ratio > 1.: outputs['aspect_ratio'] = float(height) / float(width)
self.q1_out.put(outputs) self.q_out.put(outputs)
else:
self.q2_out.put(outputs)
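box_util.flip_boxes is applied above after a horizontal image flip. A minimal sketch of the usual mirror transform, assuming the same inclusive pixel coordinates used elsewhere in this code; the flip_boxes below is illustrative, not the repository implementation.

import numpy as np

def flip_boxes(boxes, width):
    """Mirror (x1, y1, x2, y2) boxes horizontally in an image of ``width``."""
    flipped = boxes.copy()
    flipped[:, 0] = width - boxes[:, 2] - 1  # new x1 comes from old x2
    flipped[:, 2] = width - boxes[:, 0] - 1  # new x2 comes from old x1
    return flipped

boxes = np.array([[10., 20., 30., 40.]])
print(flip_boxes(boxes, 100))  # [[69. 20. 89. 40.]]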
...@@ -31,7 +31,7 @@ class ProposalTarget(object): ...@@ -31,7 +31,7 @@ class ProposalTarget(object):
def __init__(self): def __init__(self):
super(ProposalTarget, self).__init__() super(ProposalTarget, self).__init__()
self.resolution = cfg.MRCNN.RESOLUTION self.resolution = cfg.MRCNN.RESOLUTION
self.num_classes = cfg.MODEL.NUM_CLASSES self.num_classes = len(cfg.MODEL.CLASSES)
self.defaults = collections.OrderedDict([ self.defaults = collections.OrderedDict([
('rois', np.array([[-1, 0, 0, 1, 1]], 'float32')), ('rois', np.array([[-1, 0, 0, 1, 1]], 'float32')),
('labels', np.array([-1], 'int64')), ('labels', np.array([-1], 'int64')),
...@@ -39,18 +39,10 @@ class ProposalTarget(object): ...@@ -39,18 +39,10 @@ class ProposalTarget(object):
('mask_targets', -np.ones((1, self.resolution, self.resolution), 'float32')), ('mask_targets', -np.ones((1, self.resolution, self.resolution), 'float32')),
]) ])
def __call__(self, rpn_rois, gt_boxes, gt_masks, ims_info): def __call__(self, **inputs):
num_images = cfg.TRAIN.IMS_PER_BATCH num_images = cfg.TRAIN.IMS_PER_BATCH
# Proposal ROIs (0, x1, y1, x2, y2) coming from RPN # Proposal ROIs (0, x1, y1, x2, y2) coming from RPN
all_rois = rpn_rois all_rois = inputs['rois']
# GT boxes (x1, y1, x2, y2, label)
# GT masks (num_objects, im_h, im_w)
gt_boxes_wide, gt_masks_wide = \
mask_util.dismantle_masks(
gt_boxes,
gt_masks,
num_images,
)
# Prepare for the outputs # Prepare for the outputs
keys = self.defaults.keys() keys = self.defaults.keys()
...@@ -58,24 +50,25 @@ class ProposalTarget(object): ...@@ -58,24 +50,25 @@ class ProposalTarget(object):
# Generate targets separately # Generate targets separately
for ix in range(num_images): for ix in range(num_images):
gt_boxes = gt_boxes_wide[ix] # GT boxes (x1, y1, x2, y2, label)
gt_masks = gt_masks_wide[ix] gt_boxes = inputs['gt_boxes'][ix]
gt_segms = inputs['gt_segms'][ix]
# Extract proposals for this image # Extract proposals for this image
rois = all_rois[np.where(all_rois[:, 0].astype('int32') == ix)[0]] rois = all_rois[np.where(all_rois[:, 0].astype('int32') == ix)[0]]
# Include ground-truth boxes in the set of candidate rois # Include ground-truth boxes in the set of candidate rois
inds = np.ones((gt_boxes.shape[0], 1), gt_boxes.dtype) * ix inds = np.ones((gt_boxes.shape[0], 1), gt_boxes.dtype) * ix
rois = np.vstack((rois, np.hstack((inds, gt_boxes[:, :4])))) rois = np.vstack((rois, np.hstack((inds, gt_boxes[:, :4]))))
# Sample a batch of RoIs for training # Sample a batch of RoIs for training
rois_per_image = cfg.TRAIN.BATCH_SIZE rois_per_image = cfg.FRCNN.BATCH_SIZE
fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image) fg_rois_per_image = np.round(cfg.FRCNN.FG_FRACTION * rois_per_image)
rcnn_util.map_returns_to_blobs( rcnn_util.map_returns_to_blobs(
sample_rois( sample_rois(
rois, rois,
gt_boxes, gt_boxes,
gt_masks, gt_segms,
rois_per_image, rois_per_image,
fg_rois_per_image, fg_rois_per_image,
ims_info[ix][2], inputs['im_info'][ix][2],
), blobs, keys, ), blobs, keys,
) )
...@@ -122,10 +115,10 @@ class ProposalTarget(object): ...@@ -122,10 +115,10 @@ class ProposalTarget(object):
'rois': [new_tensor(rois_wide[i]) for i in range(num_levels)], 'rois': [new_tensor(rois_wide[i]) for i in range(num_levels)],
'mask_rois': [new_tensor(mask_rois_wide[i]) for i in range(num_levels)], 'mask_rois': [new_tensor(mask_rois_wide[i]) for i in range(num_levels)],
'labels': new_tensor(blobs['labels']), 'labels': new_tensor(blobs['labels']),
'bbox_indices': new_tensor(bbox_cls_inds[fg_inds] + blobs['labels'][fg_inds]), 'bbox_inds': new_tensor(bbox_cls_inds[fg_inds] + blobs['labels'][fg_inds]),
'bbox_targets': new_tensor(blobs['bbox_targets'][fg_inds].astype('float32')), 'bbox_targets': new_tensor(blobs['bbox_targets'][fg_inds].astype('float32')),
'bbox_anchors': new_tensor(blobs['rois'][fg_inds, 1:].astype('float32')), 'bbox_anchors': new_tensor(blobs['rois'][fg_inds, 1:].astype('float32')),
'mask_indices': new_tensor(mask_cls_inds + mask_labels), 'mask_inds': new_tensor(mask_cls_inds + mask_labels),
'mask_targets': new_tensor(blobs['mask_targets']), 'mask_targets': new_tensor(blobs['mask_targets']),
} }
...@@ -134,7 +127,7 @@ def compute_targets( ...@@ -134,7 +127,7 @@ def compute_targets(
ex_rois, ex_rois,
gt_rois, gt_rois,
gt_labels, gt_labels,
gt_masks, gt_segms,
mask_flags, mask_flags,
mask_size, mask_size,
im_scale, im_scale,
...@@ -150,29 +143,25 @@ def compute_targets( ...@@ -150,29 +143,25 @@ def compute_targets(
# Compute mask classification targets # Compute mask classification targets
mask_shape = [mask_size] * 2 mask_shape = [mask_size] * 2
ex_rois_ori = np.round(ex_rois / im_scale).astype(int) ex_rois_ori = np.round(ex_rois / im_scale).astype(int)
gt_rois_ori = np.round(gt_rois / im_scale).astype(int)
mask_targets = -np.ones([len(gt_labels)] + mask_shape, 'float32') mask_targets = -np.ones([len(gt_labels)] + mask_shape, 'float32')
for i in fg_inds: for i in fg_inds:
if mask_flags[i] > 0: if mask_flags[i] > 0:
box_mask = \ if isinstance(gt_segms[i], list):
mask_util.intersect_box_mask( ret = mask_util.warp_mask_via_polygons(
ex_rois_ori[i], gt_segms[i], ex_rois_ori[i], mask_shape)
gt_rois_ori[i], else:
gt_masks[i], gt_rois_ori = np.round(gt_rois / im_scale).astype(int)
) ret = mask_util.warp_mask_via_intersection(
if box_mask is not None: gt_segms[i], ex_rois_ori[i], gt_rois_ori[i], mask_shape)
mask_targets[i] = \ if ret is not None:
mask_util.resize_mask( mask_targets[i] = ret.astype('float32')
mask=box_mask,
size=mask_shape,
)
return bbox_targets, mask_targets return bbox_targets, mask_targets
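compute_targets delegates the mask warping to mask_util helpers (warp_mask_via_polygons / warp_mask_via_intersection). Conceptually, for a full-image binary mask the per-RoI target is the mask cropped to the rounded image-space RoI and resized to RESOLUTION x RESOLUTION. A hedged OpenCV sketch; crop_and_resize_mask is an illustrative name, and nearest-neighbour interpolation is assumed to keep the target binary.

import cv2
import numpy as np

def crop_and_resize_mask(mask, roi, size):
    """Crop a full-image uint8 mask to an integer (x1, y1, x2, y2) RoI and
    resize it to (size, size) for use as a mask target."""
    x1, y1, x2, y2 = roi
    crop = mask[y1:y2 + 1, x1:x2 + 1]
    return cv2.resize(crop, (size, size), interpolation=cv2.INTER_NEAREST)

mask = np.zeros((480, 640), 'uint8')
mask[100:200, 150:300] = 1
target = crop_and_resize_mask(mask, (140, 90, 310, 210), 28)
print(target.shape)  # (28, 28)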
def sample_rois( def sample_rois(
all_rois, all_rois,
gt_boxes, gt_boxes,
gt_masks, gt_segms,
num_rois, num_rois,
num_fg_rois, num_fg_rois,
im_scale, im_scale,
...@@ -184,15 +173,15 @@ def sample_rois( ...@@ -184,15 +173,15 @@ def sample_rois(
labels = gt_boxes[gt_assignment, 4].astype('int64') labels = gt_boxes[gt_assignment, 4].astype('int64')
# Select foreground RoIs as those with >= FG_THRESH overlap # Select foreground RoIs as those with >= FG_THRESH overlap
fg_inds = np.where(max_overlaps >= cfg.TRAIN.FG_THRESH)[0] fg_inds = np.where(max_overlaps >= cfg.FRCNN.POSITIVE_OVERLAP)[0]
fg_rois_per_this_image = int(min(num_fg_rois, fg_inds.size)) fg_rois_per_this_image = int(min(num_fg_rois, fg_inds.size))
# Sample foreground regions without replacement # Sample foreground regions without replacement
if fg_inds.size > 0: if fg_inds.size > 0:
fg_inds = npr.choice(fg_inds, fg_rois_per_this_image, False) fg_inds = npr.choice(fg_inds, fg_rois_per_this_image, False)
# Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI) # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
bg_inds = np.where((max_overlaps < cfg.TRAIN.BG_THRESH_HI) & bg_inds = np.where((max_overlaps < cfg.FRCNN.NEGATIVE_OVERLAP_HI) &
(max_overlaps >= cfg.TRAIN.BG_THRESH_LO))[0] (max_overlaps >= cfg.FRCNN.NEGATIVE_OVERLAP_LO))[0]
# Compute number of background RoIs to take from this image # Compute number of background RoIs to take from this image
bg_rois_per_this_image = num_rois - fg_rois_per_this_image bg_rois_per_this_image = num_rois - fg_rois_per_this_image
bg_rois_per_this_image = min(bg_rois_per_this_image, bg_inds.size) bg_rois_per_this_image = min(bg_rois_per_this_image, bg_inds.size)
...@@ -213,7 +202,7 @@ def sample_rois( ...@@ -213,7 +202,7 @@ def sample_rois(
rois[:, 1:5], rois[:, 1:5],
gt_boxes[gt_assignment[keep_inds], :4], gt_boxes[gt_assignment[keep_inds], :4],
labels, labels,
gt_masks[gt_assignment[fg_inds]], [gt_segms[i] for i in gt_assignment[fg_inds]],
gt_boxes[gt_assignment[fg_inds], 5], gt_boxes[gt_assignment[fg_inds], 5],
cfg.MRCNN.RESOLUTION, cfg.MRCNN.RESOLUTION,
im_scale, im_scale,
......
...@@ -13,13 +13,15 @@ from __future__ import absolute_import ...@@ -13,13 +13,15 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import collections
import math
import numpy as np import numpy as np
from seetadet.algo.faster_rcnn import generate_anchors as anchor_util
from seetadet.algo.faster_rcnn import utils as rcnn_util
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.algo.faster_rcnn.generate_anchors import generate_anchors_v2
from seetadet.algo.faster_rcnn.utils import generate_grid_anchors
from seetadet.utils import boxes as box_util from seetadet.utils import boxes as box_util
from seetadet.utils import logger
from seetadet.utils.env import new_tensor from seetadet.utils.env import new_tensor
...@@ -41,95 +43,113 @@ class AnchorTarget(object): ...@@ -41,95 +43,113 @@ class AnchorTarget(object):
(2 ** (octave / float(scales_per_octave))) (2 ** (octave / float(scales_per_octave)))
for octave in range(scales_per_octave)] for octave in range(scales_per_octave)]
self.base_anchors.append( self.base_anchors.append(
generate_anchors_v2( anchor_util.generate_anchors_v2(
stride=stride, stride=stride,
ratios=self.ratios, ratios=self.ratios,
sizes=sizes, sizes=sizes))
)) # Plan the maximum anchor layout
# Store the cached grid anchors max_size = cfg.TRAIN.MAX_SIZE
self.last_grid_shapes = None if max_size == 0:
self.last_grid_anchors = None max_size = cfg.TRAIN.SCALES[0]
if cfg.MODEL.COARSEST_STRIDE > 0:
stride = float(cfg.MODEL.COARSEST_STRIDE)
max_size = int(math.ceil(max_size / stride) * stride)
self.max_shapes = [[math.ceil(max_size / stride)] * 2
for stride in self.strides]
self.all_coords = rcnn_util.get_shifted_coords(
self.max_shapes, self.base_anchors)
self.all_anchors = rcnn_util.get_shifted_anchors(
self.max_shapes, self.base_anchors, self.strides)
def sample_anchors(self, gt_boxes, im_info, all_anchors=None):
all_anchors = self.all_anchors \
if all_anchors is None else all_anchors
# Remove anchors that fall outside the image
inds_inside = np.where((all_anchors[:, 0] < im_info[1]) &
(all_anchors[:, 1] < im_info[0]))[0]
anchors = all_anchors[inds_inside, :]
num_inside = len(anchors)
labels = np.empty((num_inside,), dtype='int32')
labels.fill(-1)
# Overlaps between the anchors and the gt boxes.
overlaps = box_util.bbox_overlaps(anchors, gt_boxes)
argmax_overlaps = overlaps.argmax(axis=1)
max_overlaps = overlaps[np.arange(num_inside), argmax_overlaps]
# Foreground: for each gt, anchor with highest overlap.
gt_argmax_overlaps = overlaps.argmax(axis=0)
gt_max_overlaps = overlaps[gt_argmax_overlaps, np.arange(overlaps.shape[1])]
gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]
gt_assignment = argmax_overlaps[gt_argmax_overlaps]
labels[gt_argmax_overlaps] = gt_boxes[gt_assignment, 4]
# Foreground: above threshold IoU.
inds = max_overlaps >= cfg.RETINANET.POSITIVE_OVERLAP
gt_assignment = argmax_overlaps[inds]
labels[inds] = gt_boxes[gt_assignment, 4]
# Background: below threshold IoU.
labels[max_overlaps < cfg.RETINANET.NEGATIVE_OVERLAP] = 0
# Retract the background clamping if no foreground was found.
fg_inds = np.where(labels > 0)[0]
if len(fg_inds) == 0:
gt_assignment = argmax_overlaps[gt_argmax_overlaps]
labels[gt_argmax_overlaps] = gt_boxes[gt_assignment, 4]
fg_inds = np.where(labels > 0)[0]
# Select ignore labels to avoid too many negatives
# (~100x faster for 200 background indices)
ignore_inds = np.where(labels < 0)[0]
return inds_inside[fg_inds], inds_inside[ignore_inds]
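Most of the assignment logic in sample_anchors rests on box_util.bbox_overlaps. For orientation, a minimal NumPy sketch of the pairwise IoU matrix (N anchors by M ground-truth boxes) such a helper conventionally returns; pairwise_iou is an illustrative name, not the repository function.

import numpy as np

def pairwise_iou(anchors, gt_boxes):
    """Return an (N, M) IoU matrix between (x1, y1, x2, y2) boxes."""
    area1 = ((anchors[:, 2] - anchors[:, 0] + 1) *
             (anchors[:, 3] - anchors[:, 1] + 1))
    area2 = ((gt_boxes[:, 2] - gt_boxes[:, 0] + 1) *
             (gt_boxes[:, 3] - gt_boxes[:, 1] + 1))
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = (np.maximum(0, x2 - x1 + 1) *
             np.maximum(0, y2 - y1 + 1))
    return inter / (area1[:, None] + area2[None, :] - inter)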
def __call__(self, features, gt_boxes): def __call__(self, **inputs):
num_images = cfg.TRAIN.IMS_PER_BATCH num_images = cfg.TRAIN.IMS_PER_BATCH
gt_boxes_wide = box_util.dismantle_boxes(gt_boxes, num_images) shapes = [f.shape[-2:] for f in inputs['features']]
image_stride = sum(self.base_anchors[i].shape[0] * np.prod(shapes[i])
if len(gt_boxes_wide) != num_images: for i in range(len(inputs['features'])))
logger.fatal(
'Input {} images, got {} slices of gt boxes.'
.format(num_images, len(gt_boxes_wide))
)
# Generate grid anchors from base
grid_shapes = [f.shape[-2:] for f in features]
if grid_shapes == self.last_grid_shapes:
all_anchors = self.last_grid_anchors
else:
self.last_grid_shapes = grid_shapes
self.last_grid_anchors = all_anchors = \
generate_grid_anchors(
grid_shapes,
self.base_anchors,
self.strides,
)
num_anchors = all_anchors.shape[0]
# Label: ``1`` is positive, ``0`` is negative, ``-1`` is don't care narrow_args = [self.all_coords, self.base_anchors, self.max_shapes, shapes]
labels_wide = -np.ones((num_images, num_anchors,), 'int64') outputs = collections.defaultdict(list)
bbox_indices_wide, bbox_anchors_wide, bbox_targets_wide = [], [], []
# Different from R-CNN, all anchors will be used # Label: ``1`` is positive, ``0`` is negative, ``-1`` is don't care
inds_inside, anchors = np.arange(num_anchors), all_anchors output_labels = np.zeros((num_images, image_stride,), 'int64')
num_inside = len(inds_inside)
for ix in range(num_images): for ix in range(num_images):
# GT boxes (x1, y1, x2, y2, label) fg_inds = inputs['fg_inds'][ix]
gt_boxes = gt_boxes_wide[ix] ignore_inds = inputs['bg_inds'][ix]
gt_boxes = inputs['gt_boxes'][ix]
# label: 1 is positive, 0 is negative, -1 is don't care
labels = np.empty((num_inside,), dtype='int64') # Narrow anchors to match the feature layout
labels.fill(-1) anchors = self.all_anchors[fg_inds]
ignore_inds = rcnn_util.narrow_anchors(*(narrow_args + [ignore_inds]))
# Overlaps between the anchors and the gt boxes _, anchors = rcnn_util.narrow_anchors(*(narrow_args + [fg_inds, anchors]))
overlaps = box_util.bbox_overlaps(anchors, gt_boxes) fg_inds = rcnn_util.narrow_anchors(*(narrow_args + [fg_inds]))
argmax_overlaps = overlaps.argmax(1)
max_overlaps = overlaps[np.arange(num_inside), argmax_overlaps] # Compute bbox targets
gt_assignment = box_util.bbox_overlaps(anchors, gt_boxes).argmax(axis=1)
# Foreground: for each gt, anchor with highest overlap bbox_targets = box_util.bbox_transform(anchors, gt_boxes[gt_assignment, :4])
gt_argmax_overlaps = overlaps.argmax(0) outputs['bbox_anchors'].append(anchors)
gt_max_overlaps = overlaps[gt_argmax_overlaps, np.arange(overlaps.shape[1])] outputs['bbox_targets'].append(bbox_targets)
gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]
gt_inds = argmax_overlaps[gt_argmax_overlaps] # Compute label assignments
labels[gt_argmax_overlaps] = gt_boxes[gt_inds, 4] output_labels[ix, ignore_inds] = -1
output_labels[ix, fg_inds] = gt_boxes[gt_assignment, 4]
# Foreground: above threshold IoU
inds = max_overlaps >= cfg.RETINANET.POSITIVE_OVERLAP # Compute sparse indices
gt_inds = argmax_overlaps[inds] fg_inds += ix * image_stride
labels[inds] = gt_boxes[gt_inds, 4] outputs['bbox_inds'].extend([fg_inds])
fg_inds = np.where(labels > 0)[0]
# Background: below threshold IoU
labels[max_overlaps < cfg.RETINANET.NEGATIVE_OVERLAP] = 0
# Retract the clamping if we don't have one
if len(fg_inds) == 0:
gt_inds = argmax_overlaps[gt_argmax_overlaps]
labels[gt_argmax_overlaps] = gt_boxes[gt_inds, 4]
fg_inds = np.where(labels > 0)[0]
labels_wide[ix, inds_inside] = labels
bbox_anchors_wide.append(anchors[fg_inds])
bbox_indices_wide.append(fg_inds + (num_anchors * ix))
bbox_targets_wide.append(
box_util.bbox_transform(
anchors[fg_inds],
gt_boxes[argmax_overlaps[fg_inds], :4],
)
)
return { return {
'labels': new_tensor(labels_wide), 'labels': new_tensor(output_labels),
'bbox_indices': new_tensor(np.concatenate(bbox_indices_wide)), 'bbox_inds': new_tensor(
'bbox_anchors': new_tensor(np.concatenate(bbox_anchors_wide).astype('float32')), np.concatenate(outputs['bbox_inds'])),
'bbox_targets': new_tensor(np.concatenate(bbox_targets_wide).astype('float32')), 'bbox_targets': new_tensor(
np.concatenate(outputs['bbox_targets']).astype('float32')),
'bbox_anchors': new_tensor(
np.concatenate(outputs['bbox_anchors']).astype('float32')),
} }
...@@ -22,7 +22,10 @@ class DataLoader(object): ...@@ -22,7 +22,10 @@ class DataLoader(object):
"""Provide mini-batches of data.""" """Provide mini-batches of data."""
def __new__(cls): def __new__(cls):
if cfg.TRAIN.MAX_SIZE > 0: pipeline_type = cfg.PIPELINE.TYPE.lower()
if pipeline_type == 'default' or pipeline_type == 'rcnn':
return faster_rcnn.DataLoader() return faster_rcnn.DataLoader()
else: elif pipeline_type == 'ssd':
return ssd.DataLoader() return ssd.DataLoader()
else:
raise ValueError('Unsupported pipeline: ' + pipeline_type)
...@@ -20,60 +20,79 @@ import numpy as np ...@@ -20,60 +20,79 @@ import numpy as np
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.modeling.detector import new_detector from seetadet.modeling.detector import new_detector
from seetadet.utils import blob as blob_util
from seetadet.utils import image as image_util
from seetadet.utils import logger
from seetadet.utils import nms as nms_util from seetadet.utils import nms as nms_util
from seetadet.utils import time_util from seetadet.utils import time_util
from seetadet.utils.blob import im_list_to_blob
from seetadet.utils.image import scale_image
def ims_detect(detector, raw_images): def get_data(raw_images):
"""Detect images, with single or multiple scales.""" """Return the test data."""
ims, ims_scale = [], [] max_size = cfg.TEST.MAX_SIZE
for i in range(len(raw_images)): if cfg.PIPELINE.TYPE.lower() == 'ssd':
im, im_scale = scale_image(raw_images[i]) max_size = 0 # Warped to a fixed size
ims += im images_wide = []
ims_scale += im_scale image_shapes_wide, image_scales_wide = [], []
for img in raw_images:
num_scales = len(ims_scale) // len(raw_images) images, image_scales = image_util.scale_image(
ims_shape = np.array([im.shape[:2] for im in ims]) img, scales=cfg.TEST.SCALES, max_size=max_size)
ims_scale = np.array(ims_scale).reshape((len(ims), -1)) images_wide += images
image_scales_wide += image_scales
# Prepare blobs image_shapes_wide += [img.shape[:2] for img in images]
data = im_list_to_blob(ims) images = blob_util.im_list_to_blob(
ims_info = np.hstack([ims_shape, ims_scale]).astype('float32') images_wide, coarsest_stride=cfg.MODEL.COARSEST_STRIDE)
image_shapes = np.array(image_shapes_wide)
image_scales = np.array(image_scales_wide).reshape((len(images), -1))
images_info = np.hstack([image_shapes, image_scales]).astype('float32')
return images, images_info
def ims_detect(detector, raw_images, timer=None):
"""Detect images at single or multiple scales."""
images, images_info = get_data(raw_images)
timer.tic() if timer else timer
# Do Forward # Do Forward
data = torch.from_numpy(data) inputs = {'image': torch.from_numpy(images),
ims_info = torch.from_numpy(ims_info) 'im_info': torch.from_numpy(images_info)}
# with torch.no_grad():
# outputs = detector.forward(inputs)
if not hasattr(detector, 'script_forward'): if not hasattr(detector, 'script_forward'):
def script_forward(self, data, ims_info): def script_forward(self, image, im_info):
return self.forward({'data': data, 'ims_info': ims_info}) return self.forward({'image': image, 'im_info': im_info})
detector.script_forward = torch.jit.trace( detector.script_forward = torch.jit.trace(
func=types.MethodType(script_forward, detector), func=types.MethodType(script_forward, detector),
example_inputs=[data, ims_info], example_inputs=[inputs['image'], inputs['im_info']],
) )
outputs = detector.script_forward(data, ims_info) outputs = detector.script_forward(inputs['image'], inputs['im_info'])
outputs = dict((k, outputs[k].numpy()) for k in outputs.keys()) outputs = dict((k, outputs[k].numpy()) for k in outputs.keys())
timer.toc() if timer else timer
# Unpack results
results = outputs['detections'] # Decode results
detections = [[] for _ in range(len(raw_images))] detections = outputs['detections']
results = [[] for _ in range(len(raw_images))]
for i in range(len(ims)): for i in range(len(images)):
inds = np.where(results[:, 0].astype(np.int32) == i)[0] inds = np.where(detections[:, 0].astype(np.int32) == i)[0]
detections[i // num_scales].append(results[inds, 1:]) results[i // len(cfg.TEST.SCALES)].append(detections[inds, 1:])
return [np.vstack(detections[i]) for i in range(len(raw_images))] # Merge from multiple scales
ret = [np.vstack(d) for d in results]
timer.toc() if timer else timer
def test_net(weights, num_classes, q_in, q_out, device): return ret
num_classes, cfg.GPU_ID = num_classes, device
def test_net(weights, q_in, q_out, device, root_logger=True):
"""Test a network trained with RetinaNet algorithm."""
cfg.GPU_ID = device
num_classes = len(cfg.MODEL.CLASSES)
logger.set_root_logger(root_logger)
detector = new_detector(device, weights) detector = new_detector(device, weights)
must_stop = False must_stop = False
_t = time_util.new_timers('im_detect', 'misc') timers = time_util.new_timers('im_detect_bbox', 'misc')
empty_detections = np.zeros((0, 5), 'float32')
while True: while True:
if must_stop: if must_stop:
...@@ -91,17 +110,19 @@ def test_net(weights, num_classes, q_in, q_out, device): ...@@ -91,17 +110,19 @@ def test_net(weights, num_classes, q_in, q_out, device):
continue continue
# Run detecting on specific scales # Run detecting on specific scales
with _t['im_detect'].tic_and_toc(): results = ims_detect(detector, raw_images, timers['im_detect_bbox'])
results = ims_detect(detector, raw_images)
# Post-Processing # Post-processing
for i, detections in enumerate(results): for i, detections in enumerate(results):
_t['misc'].tic() timers['misc'].tic()
boxes_this_image = [[]] boxes_this_image = [[]]
# {x1, y1, x2, y2, score, cls} # Detection format: (x1, y1, x2, y2, score, cls)
detections = np.array(detections) detections = np.array(detections)
for j in range(1, num_classes): for j in range(1, num_classes):
cls_indices = np.where(detections[:, 5].astype(np.int32) == j)[0] cls_indices = np.where(detections[:, 5].astype(np.int32) == j)[0]
if len(cls_indices) == 0:
boxes_this_image.append(empty_detections)
continue
cls_boxes = detections[cls_indices, :4] cls_boxes = detections[cls_indices, :4]
cls_scores = detections[cls_indices, 4] cls_scores = detections[cls_indices, 4]
cls_detections = np.hstack(( cls_detections = np.hstack((
...@@ -121,11 +142,11 @@ def test_net(weights, num_classes, q_in, q_out, device): ...@@ -121,11 +142,11 @@ def test_net(weights, num_classes, q_in, q_out, device):
) )
cls_detections = cls_detections[keep, :] cls_detections = cls_detections[keep, :]
boxes_this_image.append(cls_detections) boxes_this_image.append(cls_detections)
_t['misc'].toc() timers['misc'].toc()
q_out.put(( q_out.put((
indices[i], indices[i],
dict([('im_detect', _t['im_detect'].average_time), dict([('im_detect', timers['im_detect_bbox'].average_time),
('misc', _t['misc'].average_time)]), ('misc', timers['misc'].average_time)]),
dict([('boxes', boxes_this_image)]), dict([('boxes', boxes_this_image)]),
)) ))
...@@ -14,7 +14,4 @@ from __future__ import division ...@@ -14,7 +14,4 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
from seetadet.algo.ssd.data_loader import DataLoader from seetadet.algo.ssd.data_loader import DataLoader
from seetadet.algo.ssd.hard_mining import HardMining from seetadet.algo.ssd.anchor_target import AnchorTarget
from seetadet.algo.ssd.multibox import MultiBoxMatch
from seetadet.algo.ssd.multibox import MultiBoxTarget
from seetadet.algo.ssd.priorbox import PriorBox
# ------------------------------------------------------------
# Copyright (c) 2017-present, SeetaTech, Co.,Ltd.
#
# Licensed under the BSD 2-Clause License.
# You should have received a copy of the BSD 2-Clause License
# along with the software. If not, See,
#
# <https://opensource.org/licenses/BSD-2-Clause>
#
# ------------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import math
import numpy as np
from seetadet.algo.ssd import generate_anchors as anchor_util
from seetadet.algo.ssd import utils as ssd_util
from seetadet.core.config import cfg
from seetadet.utils import boxes as box_util
from seetadet.utils.env import new_tensor
class AnchorTarget(object):
"""Assign ground-truth targets to anchors."""
def __init__(self):
super(AnchorTarget, self).__init__()
# Load the basic configs
self.strides = cfg.SSD.STRIDES
anchor_sizes = cfg.SSD.ANCHOR_SIZES
aspect_ratios = cfg.SSD.ASPECT_RATIOS
self.base_anchors = []
for i in range(len(anchor_sizes)):
ratios = aspect_ratios[i]
if not isinstance(ratios, (tuple, list)):
# All strides share the same ratios
ratios = aspect_ratios
self.base_anchors.append(
anchor_util.generate_anchors(
min_sizes=[anchor_sizes[i][0]],
max_sizes=[anchor_sizes[i][1]],
ratios=ratios))
# Plan the fixed anchor layout
max_size = cfg.TRAIN.SCALES[0]
if cfg.MODEL.COARSEST_STRIDE > 0:
stride = float(cfg.MODEL.COARSEST_STRIDE)
max_size = int(math.ceil(max_size / stride) * stride)
shapes = [[math.ceil(max_size / stride)] * 2
for stride in self.strides]
self.all_anchors = ssd_util.get_shifted_anchors(
shapes, self.base_anchors, self.strides)
def sample_anchors(self, gt_boxes, all_anchors=None):
anchors = self.all_anchors \
if all_anchors is None else all_anchors
num_anchors = len(anchors)
labels = np.empty((num_anchors,), dtype='int32')
labels.fill(-1)
# Overlaps between the anchors and the gt boxes.
overlaps = box_util.bbox_overlaps(anchors, gt_boxes)
argmax_overlaps = overlaps.argmax(axis=1)
max_overlaps = overlaps[np.arange(num_anchors), argmax_overlaps]
# Foreground: for each gt, anchor with highest overlap.
gt_argmax_overlaps = overlaps.argmax(axis=0)
gt_assignment = argmax_overlaps[gt_argmax_overlaps]
labels[gt_argmax_overlaps] = gt_boxes[gt_assignment, 4]
# Foreground: above threshold IoU.
inds = max_overlaps >= cfg.SSD.POSITIVE_OVERLAP
gt_assignment = argmax_overlaps[inds]
labels[inds] = gt_boxes[gt_assignment, 4]
fg_inds = np.where(labels > 0)[0]
# Negative: not matched and below threshold IoU.
neg_inds = np.where(labels <= 0)[0]
neg_overlaps = max_overlaps[neg_inds]
eligible_neg_inds = np.where(neg_overlaps < cfg.SSD.NEGATIVE_OVERLAP)[0]
neg_inds = neg_inds[eligible_neg_inds]
return fg_inds, neg_inds
def __call__(self, **inputs):
num_images = cfg.TRAIN.IMS_PER_BATCH
neg_pos_ratio = cfg.SSD.NEGATIVE_POSITIVE_RATIO
image_stride = self.all_anchors.shape[0]
cls_prob = inputs['cls_prob'].numpy()
outputs = collections.defaultdict(list)
# Label: ``> 0`` is positive (class index), ``0`` is negative, ``-1`` is don't care
output_labels = np.empty((num_images, image_stride,), 'int64')
output_labels.fill(-1)
for ix in range(num_images):
fg_inds = inputs['fg_inds'][ix]
neg_inds = inputs['bg_inds'][ix]
gt_boxes = inputs['gt_boxes'][ix]
# Mining hard negatives as background.
num_pos, num_neg = len(fg_inds), len(neg_inds)
num_bg = min(int(num_pos * neg_pos_ratio), num_neg)
neg_loss = -np.log(np.maximum(
cls_prob[ix, neg_inds][np.arange(num_neg),
np.zeros((num_neg,), 'int32')],
np.finfo(float).eps))
bg_inds = neg_inds[np.argsort(-neg_loss)][:num_bg]
# Compute bbox targets.
anchors = self.all_anchors[fg_inds]
gt_assignment = box_util.bbox_overlaps(
anchors, gt_boxes).argmax(axis=1)
bbox_targets = box_util.bbox_transform(
anchors, gt_boxes[gt_assignment, :4],
cfg.BBOX_REG_WEIGHTS)
outputs['bbox_anchors'].append(anchors)
outputs['bbox_targets'].append(bbox_targets)
# Compute label assignments.
output_labels[ix, bg_inds] = 0
output_labels[ix, fg_inds] = gt_boxes[gt_assignment, 4]
# Compute sparse indices.
fg_inds += ix * image_stride
outputs['bbox_inds'].extend([fg_inds])
return {
'labels': new_tensor(output_labels),
'bbox_inds': new_tensor(
np.concatenate(outputs['bbox_inds'])),
'bbox_targets': new_tensor(
np.concatenate(outputs['bbox_targets']).astype('float32')),
'bbox_anchors': new_tensor(
np.concatenate(outputs['bbox_anchors']).astype('float32')),
}
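The assignment rule implemented by `AnchorTarget` can be summarized as follows: an anchor becomes foreground when its best IoU against any ground-truth box reaches `cfg.SSD.POSITIVE_OVERLAP` (and each ground-truth box keeps its best anchor regardless), candidate negatives are the unmatched anchors below `cfg.SSD.NEGATIVE_OVERLAP`, and the final backgrounds are the hardest negatives, ranked by `-log` of the predicted background probability and capped at `cfg.SSD.NEGATIVE_POSITIVE_RATIO` times the number of positives. A self-contained sketch of that logic follows; the `iou` helper and the 0.5/0.5/3 defaults are illustrative stand-ins for `box_util.bbox_overlaps` and the `cfg.SSD.*` settings.

import numpy as np

def iou(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (K, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def match_and_mine(anchors, gt_boxes, bg_prob,
                   pos_thresh=0.5, neg_thresh=0.5, neg_pos_ratio=3):
    """Label anchors against (K, 5) gt boxes of [x1, y1, x2, y2, class]."""
    overlaps = iou(anchors, gt_boxes[:, :4])
    argmax = overlaps.argmax(axis=1)
    max_overlaps = overlaps[np.arange(len(anchors)), argmax]
    labels = np.full((len(anchors),), -1, 'int64')
    # Each gt box keeps its best-overlapping anchor as foreground.
    labels[overlaps.argmax(axis=0)] = gt_boxes[:, 4]
    # Anchors above the positive threshold take the class of their best gt.
    fg = max_overlaps >= pos_thresh
    labels[fg] = gt_boxes[argmax[fg], 4]
    fg_inds = np.where(labels > 0)[0]
    # Candidate negatives: unmatched anchors below the negative threshold.
    neg_inds = np.where((labels <= 0) & (max_overlaps < neg_thresh))[0]
    # Hard-negative mining: keep the negatives the classifier is least
    # confident about, up to neg_pos_ratio * num_positives.
    num_bg = min(int(len(fg_inds) * neg_pos_ratio), len(neg_inds))
    neg_loss = -np.log(np.maximum(bg_prob[neg_inds], np.finfo(float).eps))
    bg_inds = neg_inds[np.argsort(-neg_loss)][:num_bg]
    labels[bg_inds] = 0
    return labels, fg_inds, bg_inds

In the class above, the same quantities are split across `sample_anchors()`, which produces the fg/bg index lists, and `__call__()`, which performs the hard-negative selection and the bbox-target encoding per image.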
...@@ -13,8 +13,11 @@ from __future__ import absolute_import ...@@ -13,8 +13,11 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import collections
import multiprocessing as mp import multiprocessing as mp
import time import time
import threading
import queue
import dragon import dragon
import dragon.vm.torch as torch import dragon.vm.torch as torch
...@@ -23,6 +26,7 @@ import numpy as np ...@@ -23,6 +26,7 @@ import numpy as np
from seetadet.algo.ssd import data_transformer from seetadet.algo.ssd import data_transformer
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.datasets.factory import get_dataset from seetadet.datasets.factory import get_dataset
from seetadet.utils import blob as blob_util
from seetadet.utils import logger from seetadet.utils import logger
...@@ -32,28 +36,24 @@ class DataLoader(object): ...@@ -32,28 +36,24 @@ class DataLoader(object):
def __init__(self): def __init__(self):
super(DataLoader, self).__init__() super(DataLoader, self).__init__()
dataset = get_dataset(cfg.TRAIN.DATASET) dataset = get_dataset(cfg.TRAIN.DATASET)
if cfg.USE_DALI: self.iterator = Iterator(**{
from seetadet.dali import ssd_pipeline as pipe 'dataset': dataset.cls,
self.iterator = pipe.new_iterator(dataset.source) 'source': dataset.source,
else: 'classes': dataset.classes,
self.iterator = Iterator(**{ 'shuffle': cfg.TRAIN.USE_SHUFFLE,
'dataset': dataset.cls, 'batch_size': cfg.TRAIN.IMS_PER_BATCH * 2,
'source': dataset.source, 'num_transformers': cfg.TRAIN.NUM_THREADS - 1,
'classes': dataset.classes, })
'shuffle': cfg.TRAIN.USE_SHUFFLE, self.iterator.start()
'num_chunks': cfg.TRAIN.SHUFFLE_CHUNKS,
'batch_size': cfg.TRAIN.IMS_PER_BATCH * 2,
'num_transformers': cfg.TRAIN.NUM_THREADS - 1,
})
def __call__(self): def __call__(self):
outputs = self.iterator.next() outputs = self.iterator.next()
if isinstance(outputs['data'], np.ndarray): if isinstance(outputs['image'], np.ndarray):
outputs['data'] = torch.from_numpy(outputs['data']) outputs['image'] = torch.from_numpy(outputs['image'])
return outputs return outputs
class Iterator(object): class Iterator(threading.Thread):
"""Iterator to return the batch of data.""" """Iterator to return the batch of data."""
def __init__(self, **kwargs): def __init__(self, **kwargs):
...@@ -67,15 +67,16 @@ class Iterator(object): ...@@ -67,15 +67,16 @@ class Iterator(object):
rank = dragon.distributed.get_rank(process_group) rank = dragon.distributed.get_rank(process_group)
# Configuration # Configuration
self._prefetch = kwargs.get('prefetch', 5) self._batch_size = kwargs.get('batch_size', 8)
self._batch_size = kwargs.get('batch_size', 32)
self._num_readers = kwargs.get('num_readers', 1) self._num_readers = kwargs.get('num_readers', 1)
self._num_transformers = kwargs.get('num_transformers', 3) self._num_transformers = kwargs.get('num_transformers', 3)
self.daemon = True
# Initialize queues # Initialize queues
num_batches = self._prefetch * self._num_readers num_batches = self._num_readers
self.q_in = mp.Queue(num_batches * self._batch_size) self._queue1 = mp.Queue(num_batches * self._batch_size)
self.q_out = mp.Queue(num_batches * self._batch_size) self._queue2 = mp.Queue(num_batches * self._batch_size)
self._queue3 = queue.Queue(num_batches)
# Initialize readers # Initialize readers
self._readers = [] self._readers = []
...@@ -86,7 +87,7 @@ class Iterator(object): ...@@ -86,7 +87,7 @@ class Iterator(object):
self._readers.append(dragon.io.DataReader( self._readers.append(dragon.io.DataReader(
part_idx=part_idx, num_parts=num_parts, **kwargs)) part_idx=part_idx, num_parts=num_parts, **kwargs))
self._readers[i]._seed += part_idx self._readers[i]._seed += part_idx
self._readers[i].q_out = self.q_in self._readers[i].q_out = self._queue1
self._readers[i].start() self._readers[i].start()
time.sleep(0.1) time.sleep(0.1)
...@@ -95,7 +96,7 @@ class Iterator(object): ...@@ -95,7 +96,7 @@ class Iterator(object):
for i in range(self._num_transformers): for i in range(self._num_transformers):
p = data_transformer.DataTransformer(**kwargs) p = data_transformer.DataTransformer(**kwargs)
p._seed += (i + rank * self._num_transformers) p._seed += (i + rank * self._num_transformers)
p.q_in, p.q_out = self.q_in, self.q_out p.q_in, p.q_out = self._queue1, self._queue2
p.start() p.start()
self._transformers.append(p) self._transformers.append(p)
time.sleep(0.1) time.sleep(0.1)
...@@ -118,26 +119,41 @@ class Iterator(object): ...@@ -118,26 +119,41 @@ class Iterator(object):
"""Return the next batch of data.""" """Return the next batch of data."""
return self.__next__() return self.__next__()
def run(self):
"""Main loop."""
num_images = cfg.TRAIN.IMS_PER_BATCH
num_batches = cfg.TRAIN.ASPECT_GROUPING
logger.info('Initialize prefetching batches...')
example_buffer = [self._queue2.get()
for _ in range(num_images * num_batches)]
next_examples = []
while True:
# Use cached buffer for next N examples
if len(next_examples) == 0:
next_examples = example_buffer
example_buffer = []
# Prepare the next batch
outputs = collections.defaultdict(list)
for i in range(num_images):
example = next_examples.pop(0)
outputs['image'].append(example['image'])
outputs['gt_boxes'].append(example['boxes'])
outputs['im_info'].append(example['im_info'])
outputs['fg_inds'].append(example.get('fg_inds', None))
outputs['bg_inds'].append(example.get('bg_inds', None))
example_buffer.append(self._queue2.get())
outputs['image'] = blob_util.im_list_to_blob(
outputs['image'], coarsest_stride=cfg.MODEL.COARSEST_STRIDE)
# Send batch data to consumer
self._queue3.put(outputs)
def __iter__(self): def __iter__(self):
"""Return the iterator self.""" """Return the iterator self."""
return self return self
def __next__(self): def __next__(self):
"""Return the next batch of data.""" """Return the next batch of data."""
n = cfg.TRAIN.IMS_PER_BATCH return self._queue3.get()
h = w = cfg.TRAIN.SCALES[0]
boxes_to_pack = []
image, boxes = self.q_out.get()
images = np.zeros((n, h, w, 3), image.dtype)
for i in range(n):
images[i] = image
gt_boxes = np.zeros((boxes.shape[0], boxes.shape[1] + 1), 'float32')
gt_boxes[:, :boxes.shape[1]], gt_boxes[:, -1] = boxes, i
boxes_to_pack.append(gt_boxes)
if i != (cfg.TRAIN.IMS_PER_BATCH - 1):
image, boxes = self.q_out.get()
boxes_to_pack = np.concatenate(boxes_to_pack)
return {'data': images, 'gt_boxes': boxes_to_pack}
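The reworked loader replaces the ad-hoc batching in `__next__` with a three-stage pipeline: reader processes fill `_queue1`, transformer processes fill `_queue2` with single examples, and the `Iterator` thread's `run()` groups them into image batches on `_queue3` for `__next__` to consume. The pattern itself can be sketched with the standard library alone; the class name, queue sizes, and toy producer below are illustrative, not the project's API.

import queue
import threading

class BatchCollector(threading.Thread):
    """Group single examples from an input queue into fixed-size batches."""

    def __init__(self, example_queue, batch_size, max_batches=4):
        super(BatchCollector, self).__init__(daemon=True)
        self.example_queue = example_queue
        self.batch_queue = queue.Queue(max_batches)
        self.batch_size = batch_size

    def run(self):
        while True:
            batch = [self.example_queue.get() for _ in range(self.batch_size)]
            self.batch_queue.put(batch)

    def next(self):
        return self.batch_queue.get()

# Toy usage: a producer thread stands in for the DataReader/DataTransformer
# processes that feed the example queue in the real loader.
examples = queue.Queue(maxsize=64)
collector = BatchCollector(examples, batch_size=2)
collector.start()

def produce():
    for i in range(8):
        examples.put({'image': i, 'boxes': []})

threading.Thread(target=produce, daemon=True).start()
print(collector.next())  # -> a list of 2 example dicts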
...@@ -14,8 +14,12 @@ from __future__ import division ...@@ -14,8 +14,12 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import multiprocessing import multiprocessing
import cv2
import numpy as np import numpy as np
import numpy.random as npr
from seetadet.algo import common as algo_common
from seetadet.algo.ssd import transforms from seetadet.algo.ssd import transforms
from seetadet.core.config import cfg from seetadet.core.config import cfg
from seetadet.datasets.example import Example from seetadet.datasets.example import Example
...@@ -27,108 +31,95 @@ class DataTransformer(multiprocessing.Process): ...@@ -27,108 +31,95 @@ class DataTransformer(multiprocessing.Process):
super(DataTransformer, self).__init__() super(DataTransformer, self).__init__()
self._scale = cfg.TRAIN.SCALES[0] self._scale = cfg.TRAIN.SCALES[0]
self._seed = cfg.RNG_SEED self._seed = cfg.RNG_SEED
self._mirror = cfg.TRAIN.USE_FLIPPED
self._use_diff = cfg.TRAIN.USE_DIFF self._use_diff = cfg.TRAIN.USE_DIFF
self._use_flipped = cfg.TRAIN.USE_FLIPPED
self._classes = kwargs.get('classes', ('__background__',)) self._classes = kwargs.get('classes', ('__background__',))
self._num_classes = len(self._classes) self._num_classes = len(self._classes)
self._class_to_ind = dict(zip(self._classes, range(self._num_classes))) self._class_to_ind = dict(zip(self._classes, range(self._num_classes)))
self.augment_image = \ self._anchor_sampler = algo_common.AnchorSampler()
transforms.Compose( self._apply_transform = transforms.Compose(transforms.Distort(),
transforms.Distort(), # Color augmentation transforms.Expand(),
transforms.Expand(), # Expand and padding transforms.Sample(),
transforms.Sample(), # Sample a patch randomly transforms.Resize())
transforms.Resize(), # Resize to a fixed scale
)
self.q_in = self.q_out = None self.q_in = self.q_out = None
self.daemon = True self.daemon = True
def make_roi_dict(self, example, apply_flip=False): def get_boxes(self, example):
objects, n_objects = example.objects, 0 objects, num_objects = example.objects, 0
height, width = example.height, example.width height, width = example.height, example.width
if not self._use_diff: if not self._use_diff:
for obj in objects: for obj in objects:
if obj.get('difficult', 0) == 0: if obj.get('difficult', 0) == 0:
n_objects += 1 num_objects += 1
else: else:
n_objects = len(objects) num_objects = len(objects)
roi_dict = { boxes = np.zeros((num_objects, 4), 'float32')
'boxes': np.zeros((n_objects, 4), 'float32'), gt_classes = np.zeros((num_objects,), 'int32')
'gt_classes': np.zeros((n_objects,), 'int32'),
}
# Filter the difficult instances # Filter the difficult instances.
object_idx = 0 object_idx = 0
for obj in objects: for obj in objects:
if not self._use_diff and \ if not self._use_diff and obj.get('difficult', 0) > 0:
obj.get('difficult', 0) > 0:
continue continue
bbox = obj['bbox'] bbox = obj['bbox']
roi_dict['boxes'][object_idx, :] = [ boxes[object_idx, :] = [max(0, bbox[0]),
max(0, bbox[0]), max(0, bbox[1]),
max(0, bbox[1]), min(bbox[2], width - 1),
min(bbox[2], width - 1), min(bbox[3], height - 1)]
min(bbox[3], height - 1), gt_classes[object_idx] = self._class_to_ind[obj['name']]
]
roi_dict['gt_classes'][object_idx] = \
self._class_to_ind[obj['name']]
object_idx += 1 object_idx += 1
if apply_flip: # Normalize.
roi_dict['boxes'] = \ boxes[:, 0::2] /= width
box_util.flip_boxes( boxes[:, 1::2] /= height
roi_dict['boxes'],
width,
)
# Normalize to unit sizes # Attach the classes.
roi_dict['boxes'][:, 0::2] /= width gt_boxes = np.empty((num_objects, 5), dtype=np.float32)
roi_dict['boxes'][:, 1::2] /= height gt_boxes[:, :4], gt_boxes[:, 4] = boxes, gt_classes
return roi_dict return gt_boxes
def get(self, example): def get(self, example):
example = Example(example) example = Example(example)
img = example.image
# Flip
apply_flip = False
if self._mirror:
if np.random.randint(2) > 0:
img = img[:, ::-1]
apply_flip = True
# Example -> RoIDict # Boxes.
roi_dict = self.make_roi_dict(example, apply_flip) boxes = self.get_boxes(example)
if len(boxes) == 0:
return {'boxes': boxes}
# Post-Process for gt boxes # Distort => Expand => Sample => Resize
# Shape like: [num_objects, {x1, y1, x2, y2, cls}] img, boxes = self._apply_transform(example.image, boxes)
gt_boxes = np.empty((roi_dict['gt_classes'].size, 5), 'float32')
gt_boxes[:, :4], gt_boxes[:, 4] = roi_dict['boxes'], roi_dict['gt_classes']
if len(gt_boxes) == 0: # Restore to the blob scale.
# Ignore the non-object image boxes[:, :4] *= self._scale
return img, gt_boxes
# Distort => Expand => Sample => Resize # Flip.
img, gt_boxes = self.augment_image(img, gt_boxes) if self._use_flipped and npr.randint(2) > 0:
img = img[:, ::-1]
boxes = box_util.flip_boxes(boxes, img.shape[1])
# Restore to the blob scale # Standard outputs.
gt_boxes[:, :4] *= self._scale outputs = {'image': img, 'boxes': boxes, 'im_info': img.shape[:2]}
# Post-Process for image # Attach precomputed targets.
if img.dtype == 'uint16': if len(boxes) > 0:
img = img.astype('float32') / 256. outputs.update(
self._anchor_sampler(
gt_boxes=boxes,
im_info=outputs['im_info']))
return img, gt_boxes return outputs
def run(self): def run(self):
# Fix the process-local random seed # Disable the opencv threading.
cv2.setNumThreads(1)
# Fix the process-local random seed.
np.random.seed(self._seed) np.random.seed(self._seed)
# Main prefetch loop # Main prefetch loop
while True: while True:
outputs = self.get(self.q_in.get()) outputs = self.get(self.q_in.get())
if len(outputs[1]) < 1: if len(outputs['boxes']) < 1:
continue # Ignore the non-object image continue # Ignore non-object image.
self.q_out.put(outputs) self.q_out.put(outputs)
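The `transforms.Compose(Distort(), Expand(), Sample(), Resize())` chain used above relies on each transform being a callable that maps `(image, boxes)` to `(image, boxes)`. A minimal sketch of that composition pattern, with a made-up `HorizontalFlip` transform standing in for the repo's SSD transforms and boxes reduced to their normalized coordinate columns:

import numpy as np

class Compose(object):
    """Chain (image, boxes) -> (image, boxes) transforms."""

    def __init__(self, *transforms):
        self.transforms = transforms

    def __call__(self, image, boxes):
        for transform in self.transforms:
            image, boxes = transform(image, boxes)
        return image, boxes

class HorizontalFlip(object):
    """Mirror the image and the normalized [x1, y1, x2, y2] boxes."""

    def __call__(self, image, boxes):
        image = image[:, ::-1]
        boxes = boxes.copy()
        boxes[:, [0, 2]] = 1.0 - boxes[:, [2, 0]]
        return image, boxes

# Toy usage on an 8x8 image with one normalized box.
image = np.zeros((8, 8, 3), 'uint8')
boxes = np.array([[0.1, 0.2, 0.4, 0.6]], 'float32')
pipeline = Compose(HorizontalFlip())
image, boxes = pipeline(image, boxes)
print(boxes)  # -> [[0.6, 0.2, 0.9, 0.6]]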