Unsupervised Visual Knowledge Discovery and Accumulation in Dynamic Environments

2019-11-13T16:00:30Z (GMT) by Ziyin Wang
Developing unsupervised vision systems for Dynamic Environments is one of the next challenges in Computer Vision. In Dynamic Environments, we usually lack complete domain knowledge of the target environment before deployment, and computation is limited both by the need for prompt reaction and by on-board computational capacity. This thesis studies a series of key Computer Vision problems in Dynamic Environments.

First, we propose a stream clustering algorithm and a number of variants for unsupervised feature learning and object discovery. These possess several crucial characteristics required by applications in Dynamic Environments, including fully progressive operation, support for arbitrary similarity measures, and object matching while the feature space is growing. We give strong provable guarantees on clustering accuracy from a statistical point of view. Building on these approaches, we tackle the problem of discovering aerial objects on-the-fly, where we assume that all objects are unknown at the beginning of deployment. The vision system is required to progress from low-level features to salient objects on-the-fly without any supervision. We propose a number of approaches to object proposal, tracking, recognition, and localization that achieve real-time performance. Extensive experiments on widely used aerial video datasets showed that the approaches discover salient ground objects efficiently and accurately.
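
To make "fully progressive" and "arbitrary similarity measure" concrete, the following is a minimal sketch of a generic single-pass (leader-style) clustering loop. It is not the algorithm proposed in the thesis and carries none of its guarantees. The class name StreamClusterer, the threshold parameter, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (one possible measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class StreamClusterer:
    """Generic single-pass clustering sketch; NOT the thesis's algorithm."""

    def __init__(self, similarity=cosine_similarity, threshold=0.9):
        self.similarity = similarity  # any callable (a, b) -> float
        self.threshold = threshold    # hypothetical: minimum similarity to join a cluster
        self.centers = []             # running mean of each cluster
        self.counts = []              # number of samples absorbed by each cluster

    def update(self, x):
        """Consume one sample and return the index of the cluster it joined."""
        x = [float(v) for v in x]
        if self.centers:
            sims = [self.similarity(x, c) for c in self.centers]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= self.threshold:
                self.counts[best] += 1
                n = self.counts[best]
                # Incremental mean update: no past samples are stored.
                self.centers[best] = [
                    c + (v - c) / n for c, v in zip(self.centers[best], x)
                ]
                return best
        # No sufficiently similar cluster exists: start a new one.
        self.centers.append(x)
        self.counts.append(1)
        return len(self.centers) - 1


if __name__ == "__main__":
    clusterer = StreamClusterer(threshold=0.9)
    for sample in ([1.0, 0.0], [0.99, 0.01], [0.0, 1.0]):
        print(sample, "-> cluster", clusterer.update(sample))
```

Each sample is seen once and then discarded, and the similarity function is passed in as a parameter, so swapping the measure does not change the loop. A fixed threshold is the simplest possible assignment rule and is used here only to keep the sketch short.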

To explore complex and deep architectures in Dynamic Environments, we propose Unsupervised Deep Encoding, which unifies traditional Visual Encoding and Convolutional Neural Networks. We found strong relationships between single-layer Neural Networks and Clustering, and thus performed unsupervised feature learning at each layer from the feature maps of the previous layer. We replaced the dot product inside each neuron with a similarity measure, the same one used in unsupervised feature learning. The weight vectors of our network are initialized by cluster centers. Therefore, each feature map is a visual encoding of the previous feature map. We applied this mechanism to pre-training Convolutional Neural Networks for image classification. Extensive experiments show that this pre-training gives the network more reliable learning dynamics (e.g., fast convergence without Batch Normalization) and better classification accuracy.
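
The idea of a neuron that computes a similarity instead of a dot product, with weights taken from cluster centers, can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the Gaussian similarity, its bandwidth parameter, and the names gaussian_similarity and SimilarityLayer are hypothetical, and the cluster centers are supplied by hand rather than learned.

```python
import math


def gaussian_similarity(a, b, bandwidth=1.0):
    """exp(-||a - b||^2 / (2 * bandwidth^2)); an assumed similarity measure."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


class SimilarityLayer:
    """One layer whose neurons output similarity(input, weight) instead of a dot product."""

    def __init__(self, centers, similarity=gaussian_similarity):
        # Weight vectors are initialized by cluster centers, one neuron per center.
        self.weights = [list(c) for c in centers]
        self.similarity = similarity

    def forward(self, patches):
        """Encode each input patch as its vector of similarities to the weights."""
        return [[self.similarity(p, w) for w in self.weights] for p in patches]


if __name__ == "__main__":
    # Hypothetical centers, e.g. obtained by clustering patches of the previous feature map.
    layer = SimilarityLayer(centers=[[0.0, 0.0], [1.0, 1.0]])
    for code in layer.forward([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]):
        print([round(v, 4) for v in code])
```

Because every output is a similarity to a cluster center, the output of the layer can be read as an encoding of its input against a learned dictionary, which is the sense in which one feature map encodes the previous one.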