Xiang Li1*, Lin Zhang1*, Yau Pun Chen1, Yu-Wing Tai2, Chi-Keung Tang1
1Hong Kong University of Science and Technology
2Tencent
Abstract
Deep learning has revolutionized object detection thanks to large-scale datasets, but the object categories these datasets cover remain limited. In this paper, we attempt to enrich such categories by addressing the one-shot object detection problem, where only a single annotated training example is available for learning an unseen class.
We introduce a two-stage model consisting of a first-stage Matching-FCOS network and a second-stage Structure-Aware Relation Module, whose combination integrates metric learning with an anchor-free, Faster R-CNN-style detection pipeline and eliminates the need to fine-tune on the support images. We also propose novel training strategies that effectively improve detection performance. Extensive quantitative and qualitative evaluations show that our method exceeds the state-of-the-art one-shot performance, which requires fine-tuning, by 12.1% on the PASCAL VOC 2007 test set.
Model
Overview of our architecture. The query and support images are first fed into a shared siamese backbone network. Our Matching-FCOS then produces a set of high-recall proposals. The second stage, which we term the Structure-Aware Relation Module (SARM), learns to classify and regress bounding boxes by focusing on structure-aware local features. The goal is to detect objects in the query image of the same class as the support object.
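To make the weight sharing concrete, below is a minimal PyTorch sketch of the siamese backbone step, assuming a ResNet-50 trunk; the `SiameseBackbone` class name and the single-level output are illustrative simplifications, not the authors' released code (a full implementation would also expose the C3–C5 stages for the FPN).

```python
import torch.nn as nn
import torchvision

# A minimal sketch of the shared siamese backbone (assumption: ResNet-50).
# Both the query image and the support crop pass through the SAME weights,
# so their features live in a common embedding space for matching.
class SiameseBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Drop the average pool and classifier; keep the convolutional trunk.
        # A real implementation would tap the C3-C5 stages separately.
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, query_img, support_img):
        query_feat = self.trunk(query_img)      # features of the query scene
        support_feat = self.trunk(support_img)  # features of the one-shot support
        return query_feat, support_feat
```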
Matching-FCOS network as the first stage of our model. C3–C5 refer to feature maps of the backbone and P3–P7 refer to feature maps of the FPN.
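The following sketch illustrates one plausible form of the matching step, under stated assumptions: the support features are pooled into a per-channel kernel and correlated channel-wise with every FPN level before the FCOS towers. The function name and the correlation choice are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

# Hedged sketch: modulate each FPN level (P3-P7) of the query by a pooled
# support kernel, producing similarity maps for the FCOS heads to consume.
def match_support_to_fpn(fpn_feats, support_feat):
    # Global-average-pool the support features into a (B, C, 1, 1) kernel.
    kernel = support_feat.mean(dim=(2, 3), keepdim=True)
    matched = []
    for p in fpn_feats:  # p: (B, C, H_l, W_l) for each pyramid level
        # Channel-wise (depth-wise) correlation: responses are high where
        # query features resemble the support object.
        sim = p * kernel
        matched.append(F.relu(sim))
    # These maps feed the FCOS classification / centerness / box towers,
    # which emit the high-recall proposals for the second stage.
    return matched
```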
Structure-Aware Relation Module (SARM) at the second stage. We first pool the query proposal features and support features into K × K feature maps and concatenate them. We then use pixel-wise convolutional layers to compare structure-aware local features. Here, the cat is decomposed into structural parts such as ears and feet. By processing these features locally, our module can discover more relevant cues and achieve higher detection precision.
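A minimal sketch of SARM under stated assumptions: K = 7, 256 channels, a batch of one support image, and proposal boxes already expressed in feature-map coordinates; the two 1 × 1 ("pixel-wise") comparison layers and the head sizes are illustrative, and the released implementation may differ.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

# Hedged sketch of the Structure-Aware Relation Module.
class SARM(nn.Module):
    def __init__(self, channels=256, k=7, num_classes=2):
        super().__init__()
        self.k = k
        # 1x1 convolutions compare the concatenated query/support features
        # at each of the K x K spatial cells independently, so local parts
        # (ears, feet, ...) are matched against local parts.
        self.compare = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(),
        )
        self.cls_head = nn.Linear(channels * k * k, num_classes)  # match / no-match
        self.reg_head = nn.Linear(channels * k * k, 4)            # box refinement

    def forward(self, query_feat, boxes, support_feat):
        # Pool each proposal and the support features onto the same K x K grid.
        # Assumption: `boxes` is a list of (L, 4) tensors in feature-map coords.
        q = roi_align(query_feat, boxes, output_size=(self.k, self.k))
        s = nn.functional.adaptive_avg_pool2d(support_feat, (self.k, self.k))
        s = s.expand(q.size(0), -1, -1, -1)             # one support per proposal
        fused = self.compare(torch.cat([q, s], dim=1))  # pixel-wise comparison
        flat = fused.flatten(1)
        return self.cls_head(flat), self.reg_head(flat)
```

The 1 × 1 kernels are what make the comparison "structure-aware": each spatial cell of the proposal is compared only against the corresponding cell of the support, rather than against a single globally pooled vector.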
Performance
Comparison with previous work LSTD (which requires fine-tuning) in one-shot settings, evaluated on task I (trained on COCO and tested on ImageNet-LOC) and task II (trained on COCO excluding the PASCAL VOC classes and tested on PASCAL VOC) following the LSTD paper, along with an ablation study of our model on task II.