Reducing Wide-Area Satellite Data to Concise Sets for More Efficient Training and Testing of Land-Cover Classifiers

2019-06-10T17:27:49Z (GMT) by Tommy Y. Chang
Obtaining an accurate estimate of a land-cover classifier's performance over a wide geographic area is a challenging problem due to the need to generate the ground truth that covers the entire area that may be thousands of square kilometers in size. The current best approach constructs a testing dataset by drawing samples randomly from the entire area --- with a human supplying the true label for each such sample --- with the hope that the selections thus made statistically capture all of the data diversity in the area. A major shortcoming of this approach is that it is difficult for a human to ensure that the information provided by the next data element chosen by the random sampler is non-redundant with respect to the data already collected. In order to reduce the annotation burden, it makes sense to remove any redundancies from the entire dataset before presenting its samples to a human for annotation. This dissertation presents a framework that uses a combination of clustering and compression to create a concise-set representation of the land-cover data for a large geographic area. Whereas clustering is achieved by applying Locality Sensitive Hashing (LSH) to the data elements, compression is achieved through choosing a single data element to represent a given cluster. This framework reduces the annotation burden on the human and makes it more likely that the human would persevere during the annotation stage. We validate our framework experimentally by comparing it with the traditional random sampling approach using WorldView2 satellite imagery.