In Water Research
Dissolved Organic Carbon (DOC) in inland waters plays an essential role in the global carbon cycle and has significant public health effects. Machine learning (ML) together with remote sensing has emerged as a powerful and promising combination to quantify water quality parameters from space. However, DOC sample data for inland waters are limited, so little is known about the potential to quantify DOC content in inland waters, especially over large areas. This study presents the first attempt to estimate DOC in inland waters over a large area using satellite data and ML methods with the newly published open-source dataset AquaSat. Four ML approaches, namely Random Forest Regression (RFR), Support Vector Regression (SVR), Gaussian Process Regression (GPR), and a Multilayer Backpropagation Neural Network (MBPNN), were trained using more than 16,000 samples across the continental United States matched with satellite data from the Landsat 5, 7, and 8 missions. The Landsat data were further extended with environmental data from the ERA5-Land product and used as input to train the ML algorithms. Our results show that including environmental data as inputs considerably improved the prediction of DOC for all ML algorithms, with GPR showing the most promising performance, with moderate estimation errors (RMSE: 4.08 mg/L). Permutation feature importance analysis showed that the visible green band (from Landsat) and the monthly average air temperature (from ERA5-Land) were the most important variables for the ML approaches. The results demonstrate the predictive strength of GPR and its useful ability to provide per-pixel standard deviations for detailed analysis. Our results further highlight the importance of considering environmental processes to explain DOC variations over large scales.
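As an illustration of the permutation feature importance analysis and the RMSE metric mentioned above, the following is a minimal sketch in plain Python. It is not the study's implementation: the data are synthetic, the feature names are placeholders, and `toy_model` is a hypothetical stand-in for a trained regressor. The idea is to shuffle one input column at a time and record how much the RMSE increases relative to the unshuffled baseline.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean RMSE increase when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return baseline, importances


# Placeholder feature names; values are synthetic and unitless (assumption).
FEATURES = ["green_band", "air_temperature", "unrelated_feature"]


def make_data(n=500, seed=42):
    """Synthetic target that depends on the first two features plus noise."""
    rng = random.Random(seed)
    X = [[rng.random(), rng.random(), rng.random()] for _ in range(n)]
    y = [5.0 * g + 3.0 * t + rng.gauss(0.0, 0.1) for g, t, _ in X]
    return X, y


def toy_model(row):
    """Hypothetical stand-in for a trained regressor; ignores the third feature."""
    return 5.0 * row[0] + 3.0 * row[1]


X, y = make_data()
baseline_rmse, importances = permutation_importance(toy_model, X, y)
print(f"baseline RMSE: {baseline_rmse:.3f}")
for name, imp in zip(FEATURES, importances):
    print(f"{name}: mean RMSE increase {imp:.3f}")
```

Because the procedure only needs a `predict` function, the same loop could in principle be applied to any of the four model types named in the abstract; the feature the model ignores receives an importance of zero.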
The application and performance of GPR in mapping spatiotemporal variations of DOC across an entire water body were discussed, taking Lake Okeechobee (the 8th largest freshwater lake in the U.S.) as an illustrative example. While performance evaluation showed that DOC concentrations can be retrieved with adequate accuracy, algorithm development was challenged by the heterogeneous nature of large-scale open-source in situ data, issues related to atmospheric correction, and the low spatial and temporal resolution of the environmental predictors. This research demonstrates how open-source, large-scale datasets like AquaSat, in combination with ML and satellite remote sensing, can make research toward large-scale estimation of inland water DOC more realistic, while highlighting the remaining limitations and challenges.
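For readers unfamiliar with how GPR yields the per-pixel standard deviations mentioned in the abstract, the textbook predictive equations for a zero-mean Gaussian process with Gaussian observation noise are sketched below. The notation is an assumption introduced here, not taken from the study: $K$ is the kernel matrix over the training inputs, $\mathbf{k}_*$ the vector of kernel values between the training inputs and a new pixel's input $\mathbf{x}_*$, $\mathbf{y}$ the training DOC values, and $\sigma_n^2$ the noise variance. The kernel and hyperparameters used by the authors are not stated in the abstract.

```latex
\begin{aligned}
\mu_* &= \mathbf{k}_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}, \\
\sigma_*^2 &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{k}_*
\end{aligned}
```

Here $\mu_*$ is the predicted value for the pixel and $\sigma_*$ the standard deviation of the latent function at that pixel; adding $\sigma_n^2$ to $\sigma_*^2$ gives the variance of a noisy observation. Evaluating both quantities for every pixel produces a prediction map and an accompanying uncertainty map.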
Harkort Lasse, Duan Zheng
2022-Dec-09
Dissolved organic carbon, Landsat, Machine learning, Open source data, Remote sensing, Water quality