In recent years, image-level weakly supervised semantic segmentation (WSSS) has developed rapidly in natural scenes thanks to the easy availability of classification tags. However, owing to the complex backgrounds, multi-category scenes, and dense small targets of remote sensing (RS) images, relatively little research has been conducted in this field. To alleviate the impact of these problems in RS scenes, a self-supervised Siamese network based on an explicit pixel-level constraints framework is proposed, which greatly improves the quality of class activation maps and the positioning accuracy in multi-category RS scenes. Specifically, three novel designs are introduced to raise performance to a new level: (a) a pixel-soft classification loss, which imposes explicit constraints on pixels during image-level training; (b) a pixel global awareness module, which captures high-level semantic context and low-level pixel spatial information to improve the consistency and accuracy of RS object segmentation; and (c) a dynamic multi-scale fusion module with a gating mechanism, which enhances feature representation and improves the positioning accuracy of RS objects, particularly for small and dense objects. Experiments on two challenging RS datasets demonstrate that the proposed modules achieve new state-of-the-art results using only image-level labels, improving mIoU to 36.79% on iSAID and 45.43% on ISPRS in the WSSS task. To the best of our knowledge, this is the first work to perform image-level WSSS on multi-class RS scenes.
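As a rough illustration of the third component, the sketch below shows one plausible way a gating-based dynamic multi-scale fusion could be implemented in PyTorch; the class name, scale set, and gating design are assumptions, since the abstract does not describe the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiScaleFusion(nn.Module):
    """Hypothetical sketch: features are pooled to several scales, and a
    softmax over per-scale gate maps decides how much each scale contributes
    at every spatial position."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One 1x1 conv per scale produces a single-channel gate map.
        self.gates = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in scales])
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats, gate_maps = [], []
        for scale, gate_conv in zip(self.scales, self.gates):
            feat = F.avg_pool2d(x, kernel_size=scale) if scale > 1 else x
            feat = F.interpolate(feat, size=(h, w), mode="bilinear",
                                 align_corners=False)
            feats.append(feat)
            gate_maps.append(gate_conv(feat))
        # Softmax across scales so the gates form a convex combination.
        gates = torch.softmax(torch.cat(gate_maps, dim=1), dim=1)
        fused = sum(gates[:, i:i + 1] * f for i, f in enumerate(feats))
        return self.project(fused)

# Usage (assumed shapes): out = GatedMultiScaleFusion(256)(torch.randn(1, 256, 64, 64))
```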
Semantic segmentation of aerial images has become an indispensable part of remote sensing image understanding owing to its extensive application prospects. Jointly reasoning about 2D appearance and 3D information, and acquiring discriminative global context, is crucial for better segmentation. However, previous approaches require accurate elevation data (e.g., nDSM and DSM) as additional inputs to segment semantics, which severely limits their applicability. On the other hand, due to the varied forms of objects in complex scenes, the global context is generally dominated by features of salient patterns (e.g., large objects) and tends to smooth out inconspicuous patterns (e.g., small stuff and boundaries). In this article, a novel joint framework named Height-Embedding Context Reassembly Network (HECR-Net) is proposed. First, because the corresponding elevation data is often unavailable while the height information remains valuable, our method alleviates this data constraint by simultaneously predicting semantic labels and height maps from single aerial images, implicitly distilling height-aware embeddings. Second, we introduce a novel context-aware reorganization (CAR) module to generate discriminative features in which the global context is appropriately assigned to each local position; it benefits from both a global context aggregation module for ambiguity elimination and a local feature redistribution module for detail refinement. Third, we make full use of the learned height-aware embeddings to promote semantic segmentation performance by introducing a modality-affinitive propagation (MAP) block. Finally, segmentation results on the ISPRS Vaihingen and Potsdam datasets show that the proposed HECR-Net achieves state-of-the-art performance.
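The sketch below illustrates, under assumptions, the general idea of implicitly distilling height-aware embeddings: a single shared embedding feeds both a semantic head and a height head, so height supervision is only needed during training and no nDSM/DSM is required at test time. All names and layer choices are hypothetical and not taken from HECR-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSemanticHeightHead(nn.Module):
    """Hypothetical sketch of a joint head: a shared 'height-aware' embedding
    is decoded into both semantic logits and a single-channel height map."""
    def __init__(self, in_channels, num_classes, embed_channels=128):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(in_channels, embed_channels, 3, padding=1),
            nn.BatchNorm2d(embed_channels),
            nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(embed_channels, num_classes, 1)
        self.height_head = nn.Conv2d(embed_channels, 1, 1)

    def forward(self, feat):
        emb = self.embed(feat)
        return self.seg_head(emb), self.height_head(emb)

def joint_loss(seg_logits, height_pred, seg_gt, height_gt, lam=1.0):
    # Height labels are consumed only at training time; lam is an assumed
    # balancing hyperparameter.
    seg_loss = F.cross_entropy(seg_logits, seg_gt)
    height_loss = F.l1_loss(height_pred.squeeze(1), height_gt)
    return seg_loss + lam * height_loss
```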
Numerous deep-learning methods have been successfully applied to semantic segmentation and height estimation of remote-sensing imagery. It has also been shown that such frameworks can be reused across multiple tasks to reduce computational resource overhead. However, some technical limitations remain due to the semantic inconsistency between 3-D and 2-D features and the strong interference of different objects with similar spectral-spatial properties. Previous works have sought to address these issues through hard or soft parameter sharing schemes, but because of indiscriminate integration, the specific information transmitted between tasks is either unclear or highly redundant. Furthermore, manually tuning the weights between classification and regression loss functions is challenging. In this paper, a novel multi-task learning method, termed ASSEH, is proposed to associatively segment semantics and estimate height from monocular remote-sensing imagery. First, considering the semantic inconsistency across tasks, we design a task-specific distillation (TSD) module containing a set of task-specific gating units for each task at the cost of few additional parameters. The module allows task-specific features to be tailored from the backbone while allowing task-shared features to be transmitted. Second, we leverage the proposed cross-task propagation (CTP) module to construct and diffuse local pattern graphlets at common positions across tasks. This high-order recursive method bridges the two tasks explicitly to effectively settle semantic ambiguities caused by similar spectral characteristics, with less computational burden and memory requirement. Third, a dynamic weighted geometric mean (DWGeoMean) strategy is introduced to dynamically learn the weight of each task and be more robust to the magnitude of the loss functions. Finally, results on the ISPRS Vaihingen and Urban Semantic 3D datasets demonstrate that our ASSEH achieves state-of-the-art performance.
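A minimal sketch of what a dynamic weighted geometric-mean combination of task losses could look like, assuming learnable mixing weights normalized by a softmax and a log-space computation for robustness to loss magnitudes; the abstract does not give the actual DWGeoMean formulation, so this is illustrative only.

```python
import torch
import torch.nn as nn

class DynamicWeightedGeoMeanLoss(nn.Module):
    """Hypothetical sketch: the total loss is a weighted geometric mean of
    the per-task losses, with the mixing weights learned jointly with the
    network parameters."""
    def __init__(self, num_tasks=2, eps=1e-8):
        super().__init__()
        # Unconstrained parameters, mapped to positive weights via softmax.
        self.logits = nn.Parameter(torch.zeros(num_tasks))
        self.eps = eps

    def forward(self, task_losses):
        w = torch.softmax(self.logits, dim=0)  # weights sum to 1
        logs = torch.stack([torch.log(l + self.eps) for l in task_losses])
        # exp(sum_i w_i * log L_i): working in log space makes the mixture
        # insensitive to the raw magnitude of each task loss.
        return torch.exp((w * logs).sum())

# Usage (assumed): total = DynamicWeightedGeoMeanLoss(2)([seg_loss, height_loss])
```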
Building 3D reconstruction from remote sensing images has a wide range of applications in smart cities, photogrammetry, and other fields. Methods for automatic 3D urban building modeling typically employ multi-view images as input to recover point clouds and 3D models of buildings. However, such models rely heavily on multi-view images of buildings, which are time-intensive to acquire and limit the applicability and practicality of the models. To address these issues, we design an efficient DSM estimation-driven reconstruction framework (Building3D), which aims to reconstruct 3D building models from a single-view remote sensing image. Existing DSM estimation networks suffer from an imbalance between local and global features, which leads to over-smooth DSM estimates at instance boundaries. To address this issue, we propose a Semantic Flow Field-guided DSM Estimation (SFFDE) network, which utilizes the proposed concept of elevation semantic flow to achieve registration of local and global features. First, to make the network semantics globally aware, we propose an Elevation Semantic Globalization (ESG) module to realize the semantic globalization of instances. Further, to bridge the semantic gap between the global features and the original local features, we propose a Local-to-Global Elevation Semantic Registration (L2G-ESR) module based on elevation semantic flow. Our Building3D is rooted in the SFFDE network for building elevation prediction, synchronized with a building extraction network for building masks, and then sequentially performs point cloud reconstruction and surface reconstruction (or CityGML model reconstruction). On this basis, Building3D can optionally generate CityGML models or surface mesh models of the buildings. Extensive experiments on the ISPRS Vaihingen and DFC2019 datasets for the DSM estimation task show that our SFFDE significantly improves upon the state of the art, raising the δ1, δ2, and δ3 metrics to 0.595, 0.897, and 0.970, respectively. Furthermore, our Building3D achieves impressive results in 3D point cloud and 3D model reconstruction.
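To illustrate the flavor of flow-guided registration of local and global features, the following sketch predicts a 2-channel flow field from the two feature maps and warps the upsampled global features onto the local grid with grid_sample; the layer layout and fusion by addition are assumptions, not the actual L2G-ESR design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowGuidedRegistration(nn.Module):
    """Hypothetical sketch: a flow field is predicted from concatenated
    local and (upsampled) global features and used to resample the global
    features so they align with the local feature grid."""
    def __init__(self, channels):
        super().__init__()
        self.flow = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, local_feat, global_feat):
        b, _, h, w = local_feat.shape
        global_up = F.interpolate(global_feat, size=(h, w),
                                  mode="bilinear", align_corners=False)
        flow = self.flow(torch.cat([local_feat, global_up], dim=1))
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=flow.device),
            torch.linspace(-1, 1, w, device=flow.device),
            indexing="ij")
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Scale the predicted pixel offsets to normalized coordinates.
        offset = flow.permute(0, 2, 3, 1) / torch.tensor(
            [w, h], device=flow.device, dtype=flow.dtype)
        warped = F.grid_sample(global_up, base + offset,
                               mode="bilinear", align_corners=False)
        return local_feat + warped
```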
Unsupervised domain adaptation (UDA) is essential because manually labeling pixel-level annotations is time-consuming and expensive. Since domain discrepancies have not been well addressed, existing UDA approaches yield poor performance compared with supervised learning approaches. In this paper, we propose a novel sequential learning network (SLNet) for unsupervised cross-scene aerial image segmentation. The whole system is decoupled into two sequential parts: an image translation model and a segmentation adaptation model. Specifically, we introduce a spectral space transferring (SST) approach to narrow the visual discrepancy; the high-frequency components between the source images and the translated images are transferred in the Fourier spectral space to better preserve important identity information and fine-grained details. To further alleviate the distribution discrepancy, an efficient pseudo-label revising (PLR) approach is developed to guide pseudo-label learning via entropy minimization: without additional parameters, the entropy map works as an adaptive threshold, constantly revising the pseudo labels for the target domain. Furthermore, extensive experiments on single-category and multi-category UDA segmentation demonstrate that our SLNet achieves state-of-the-art performance.
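The Fourier spectral transfer described above can be sketched as follows, assuming a simple square low-frequency mask: low frequencies are kept from the translated image while high frequencies are copied back from the source to preserve identity and fine-grained details. The cutoff radius and masking scheme are assumptions, not the paper's exact SST procedure.

```python
import torch

def spectral_transfer(translated, source, radius=0.15):
    """Hypothetical sketch of spectral-space transferring between a source
    image and its translated counterpart (tensors of shape [..., H, W])."""
    fft_t = torch.fft.fftshift(torch.fft.fft2(translated), dim=(-2, -1))
    fft_s = torch.fft.fftshift(torch.fft.fft2(source), dim=(-2, -1))

    h, w = translated.shape[-2:]
    cy, cx = h // 2, w // 2
    ry, rx = int(radius * h), int(radius * w)
    # Low-frequency mask centred on the shifted spectrum; 'radius' is an
    # assumed hyperparameter controlling the cutoff.
    mask = torch.zeros(h, w, device=translated.device)
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0

    # Low frequencies from the translated image, high frequencies from source.
    fft_mix = fft_t * mask + fft_s * (1.0 - mask)
    mixed = torch.fft.ifft2(torch.fft.ifftshift(fft_mix, dim=(-2, -1)))
    return mixed.real

# Usage (assumed shapes): img = spectral_transfer(translated_batch, source_batch)
```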