Deep learning is increasingly used to model the spatiotemporal dynamics of air pollution, but designing a high-performance architecture from the many available neural network components remains difficult. This study aims to identify a suitable deep learning architecture for deriving spatiotemporal distributions of surface NO₂ across China. Using satellite retrievals and geographic variables, we compared a series of deep learning models against a benchmark model (random forest) in predicting daily NO₂ during 2016-2017 on a 0.1° grid. The deep learning models tested were hybrids of fully connected layers (FC), autoencoders (AE), residual networks (RES), and convolutional layers (CNN); their ensemble forms were also evaluated. The integrated results of the sample-, cell-, and date-based holdout tests show that the ensemble FC-AE-RES-CNN model (R² = 0.71; RMSE = 9.9 μg/m³) outperformed the random forest (R² = 0.67; RMSE = 10.8 μg/m³), whereas none of the single-form networks did (R² from 0.56 to 0.61; RMSE from 11.6 to 12.2 μg/m³). The superior performance of the ensemble FC-AE-RES-CNN model can be attributed to its lower bias and lower variance, which it owes to the sophisticated model structure and the ensemble strategy. Consequently, its predictions reproduced the spatiotemporal distribution of surface NO₂ with higher fidelity. In light of these comprehensive comparisons, we expect that ensemble deep learning models will come to replace traditional machine learning models in environmental spatiotemporal mapping applications.
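The three holdout tests and the two reported metrics can be illustrated with a minimal sketch. It is not the study's code: the function names (`holdout_mask`, `r2_rmse`, `ensemble_mean`), the 20% holdout fraction, the use of the coefficient of determination for R², and simple averaging as the ensemble rule are all assumptions made for illustration. The idea is that a sample-based test withholds random observations, whereas cell- and date-based tests withhold every observation from a random subset of grid cells or days, so that the model is scored on locations or dates it has never seen.

```python
import numpy as np

def holdout_mask(n, groups=None, frac=0.2, seed=0):
    """Boolean test mask of length n (hypothetical helper).

    groups=None  -> sample-based holdout: withhold random observations.
    groups=array -> group-based holdout: withhold all observations whose
                    group id (grid cell or date) falls in a random subset.
    """
    rng = np.random.default_rng(seed)
    if groups is None:
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=int(round(frac * n)), replace=False)] = True
        return mask
    unique = np.unique(groups)
    n_held = max(1, int(round(frac * unique.size)))
    held = rng.choice(unique, size=n_held, replace=False)
    return np.isin(groups, held)

def r2_rmse(y_true, y_pred):
    """Coefficient of determination and root-mean-square error.

    Assumption: R^2 is computed as 1 - SS_res / SS_tot; the study may
    instead report the squared correlation coefficient.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    r2 = float(1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    return r2, rmse

def ensemble_mean(member_preds):
    """Average the predictions of several members (assumed ensemble rule)."""
    return np.mean(np.asarray(member_preds, dtype=float), axis=0)

if __name__ == "__main__":
    # Synthetic stand-in data: 50 grid cells observed on 30 days.
    rng = np.random.default_rng(42)
    cell = np.repeat(np.arange(50), 30)
    date = np.tile(np.arange(30), 50)
    y = rng.gamma(shape=4.0, scale=8.0, size=cell.size)  # fake NO2 values

    # Five noisy "members" standing in for independently trained networks.
    members = [y + rng.normal(0.0, 10.0, size=y.size) for _ in range(5)]

    for name, groups in [("sample", None), ("cell", cell), ("date", date)]:
        test = holdout_mask(y.size, groups=groups, seed=1)
        single = r2_rmse(y[test], members[0][test])
        ens = r2_rmse(y[test], ensemble_mean(members)[test])
        print(name, "single:", single, "ensemble:", ens)
```

In this toy setup the members differ only by independent noise, so averaging them reduces the error variance; real networks have correlated errors, and the gain from an ensemble would be smaller.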