Statistical combination of different types of chlorofyll-a measurements in the Dutch North Sea

4. New method – data aggregation

In this chapter, we present an alternative approach for determining chl-a growing season means and their associated confidence ratings, addressing the imbalances between in-situ and EO measurements. This method considers the variation in spatio-temporal distribution during the growing season and accounts for missing data.

Given the discrepancies between in-situ and EO measurements, an aggregation of these datasets is recommended before performing any further calculations, as the weighting method currently overemphasizes the influence of in-situ samples (chapter 3). Moreover, variations in the spatio-temporal distribution of both EO and in-situ measurements necessitate aggregation prior to calculating growing season means (Figure 6). Before combining in-situ and EO data into a single growing season mean, separate means are calculated. However, these means are derived by simply averaging the data over the entire growing season without accounting for temporal and spatial variations (Figure 6). In certain months and regions, more measurements may be available due to favorable weather conditions, leading to overrepresentation of specific areas or periods (for both EO and in-situ data). This approach can result in a skewed representation of chl-a concentrations, particularly as coastal regions and certain months tend to exhibit higher concentrations (Figure 3 and Figure 4).

Our approach will thus treat in-situ measurements as individual data points alongside EO data, aggregating both on a year-month-grid basis before proceeding with analyses. The focus will be on 1) determining the optimal grid size for aggregation, 2) imputing missing values, 3) calculating growing season means, and 4) introducing a new method for confidence ratings.

4.1 Optimal grid size

As stated above, a grid-based approach can effectively cope with temporal and spatial variation in chl-a measurements. The optimal grid size may be determined by identifying the sample size at which the variation in measurements becomes less pronounced. Figure 7 illustrates this relationship by plotting the standard deviation against sample size for four random 10x10 km grids across different months in 2008. While the sample size at which the standard deviation stabilizes varies between grids, a sample size of approximately 100 appears to provide a reliable estimate of chl-a. When dividing the SNS into 5x5 km grids, 12% of the grids have fewer than 100 samples, whereas only 6% of grids fall below this threshold when using a 10x10 km grid.

Grids are assigned to assessment areas based on their initial shapes, and may thus overlap between two neighboring areas. However, only the corresponding datapoints that lie within the initial shape of the assessment area are considered for further calculations. Dividing the study area into grids introduces edge effects, particularly where only part of the geographical area falls within the boundary of a grid cell (Figure 6). This results in smaller sample sizes at the edges. While in a large area like the SNS, these effects do not significantly impact the overall confidence rating, they become more pronounced in smaller coastal OSPAR assessment areas, where edge areas make up a relatively larger portion (Figure 8). Over the entire study period, using a 5x5 km grid results in 46% and 40% of grids containing fewer than 100 samples in RHPM and MPM, respectively. In contrast, using a larger 10x10 km grid reduces these percentages to 28% and 25%, thereby improving the confidence rating. Also in later years, with higher resolution EO data, edge effects are more prominent in smaller assessment areas.

Another important consideration when determining grid size is its impact on the mean growing season values. Utilizing a smaller grid size helps minimize the loss of resolution during the aggregation process. Table 4 illustrates the differences in growing season means for various grid sizes. Across the whole study period, the mean growing season difference between the 5x5 and 10x10 grids is 0.05 μg L-1 chl-a, while the difference between the 5x5 and 25x25 grids is 0.2 μg L-1 chl-a. Although increasing the sample size up to approximately 250 samples can improve the reliability of chl-a estimates, selecting a grid size that is too coarse beyond this threshold may lead to a significant loss of resolution. Do note that mean values in the coarser grid size are consistently higher than the smaller grid size, due to the relatively higher contribution of grids with higher mean chl-a values (high chl-a concentrations can sometimes be 100 times higher than background concentrations, which impacts mean concentrations).

A 10x10 km grid may represent the optimal choice, as it offers a sufficient sample size to provide reliable chl-a estimates on a year – month – grid basis while minimizing edge effects and resolution losses. Consequently, further analyses will be conducted using this grid size.

Table 4. Chlorophyll-a growing season means per year using a first year – month – grid aggregation step with both in-situ and EO data. For this first step the SNS assessment area was divided into 5x5 (2637 grid cells), 10x10 (714 grid cells), and 25x25 (143 grid cells) km grids
yearchl-a (mean 5x5)chl-a (mean 10x10)chl-a (mean 25x25)
19983,083,123,15
19993,783,843,92
20002,662,712,75
20014,004,054,15
20023,003,053,18
20033,633,673,86
20043,063,113,22
20053,063,113,23
20063,093,143,26
20073,503,573,78
20083,763,823,98
20093,023,073,26
20103,393,443,66
20113,213,263,44
20122,822,883,08
20133,943,984,13
20143,273,333,51
20153,243,313,47
20163,183,233,44
20172,712,762,87
20183,033,083,26
20192,762,802,94
20202,912,963,09

4.2 Missing values

Particularly in the beginning of the study period, certain year–month–grid combinations are missing (Table 5; SNS area). Since only a small portion of the data is missing, the remaining data can be used to impute these missing values. One approach is the use of a random forest algorithm, implemented through the R package missForest7). This method preserves the relationships and distributions of the observed data, is resistant to overfitting, and is capable of handling high-dimensional data.

The accuracy of the imputation using missForest was evaluated using the Normalized Root Mean Square Error (NRMSE) metric. An exceptionally low NRMSE of 3.4 × 10-7 was obtained, indicating that the imputed values closely align with the observed data and demonstrate strong model performance with minimal prediction error.

Given the small proportion of missing data, the impact of imputing these values on the overall mean per growing season is minimal (Table 6), with a maximum difference of only 0.02 μg L-1 chl-a in 1998. For the SNS area, imputing missing values may therefore not be necessary, as the effect on the overall mean is negligible, and the process introduces an additional step in data handling. However, this may not be the case for other OSPAR assessment areas, where imputing missing values may be more relevant for a comprehensive analysis. For instance, the proportion of missing year–month–grid data in the MPM and RHPM regions combined varies from 11 to 17% between 1998 and 2001 (Table 5; MPM & RHPM), although missing values do not pose an issue in later years.

Table 5. Number of missing and measured year–month–grid (10x10) combinations per year in the SNS (left side) and the MPM and RHPM assessment areas (right side)
SNSMPM & RHPM
yearn missingn measuredn missingn measured
19981024917112623
199937498280655
2000714948128607
2001744945112623
200216500322713
2003050190735
2004050190735
2005150182733
2006050190735
2007050190735
2008350160735
2009050190735
2010050190735
2011350160735
2012050190735
2013250174731
2014050190735
2015050190735
2016050192733
2017050190735
2018050190735
2019150180735
2020050190735

Table 6. Chlorophyll-a growing season means per year of in-situ and EO data using a 10x10 km aggregation step in the SNS assessment areas, taking into account only observations and observations and imputations including the difference between both values
yearchl-a (mean with imputations)chl-a (mean only observations)difference
19983,133,120,0187
19993,853,840,0096
20002,732,710,0229
20014,054,050,0047
20023,063,050,0092
20033,673,670
20043,113,110
20053,113,110,0006
20063,143,140
20073,573,570
20083,823,820,0002
20093,083,080
20103,443,440
20113,263,260,0009
20122,882,880
20133,983,980,0006
20143,333,330
20153,313,310
20163,233,230
20172,762,760
20183,083,080
20192,802,800,0002
20202,962,960

4.3 Trends

To illustrate the effect of the new method to calculate chl-a growing season means as compared to the current weighting method, we calculated the trend over the years using TrendSpotter8). This approach detects and estimates non-linear trends in environmental time series, and calculates standard errors and concomitant confidence intervals based on the Kalman-filter framework. In addition, the difference between the modelled trend value (the red line in figure 10) in each year and the model value in the last year is calculated. This enables the assessment of multiplicative trends (the percentage yearly change) and confidence intervals9), which are the base for a classification in increasing, stable, decreasing or uncertain trends (Figure 9).

By giving the in-situ data 50% weight in most years, the high in-situ chl-a concentrations, especially at the start of the monitoring period, influence the overall mean and this results in an overall moderate decrease in chl-a between 1998 and 2020, while the trend is stable in the new method (Figure 10). In the new method, in-situ and EO data were first aggregated on the 10x10 km scale, and missing values were imputed (Table 6).

4.4 Confidence rating – new method

In the current COMPEAT assessment the confidence rating per OSPAR assessment area for the chl-a measurements for both in-situ and EO data depends on the same criteria in Annex 13 of the OSPAR eutrophication status procedure6. These criteria, originally based on in-situ measurements, assign weights of 50%, 30%, or 10% to the in-situ data, though these weights lack a formal statistical foundation. Here, we propose a method that assigns a confidence rating to each year–month–grid combination based on sample size, which can then be aggregated to derive an overall confidence rating for the chl-a growing season mean each year.

To determine the confidence classes with sample size for each OSPAR assessment area, we will use the relative margin of error (MOE). The MOE quantifies the amount of random sampling error and represents the radius (= half) of the confidence interval. A 95% confidence interval reflects the 95% likelihood that the true value lies within that interval and is bounded by one MOE below and one MOE above the estimated mean. By using the relative MOE (the MOE divided by the mean), we adjust for variations in mean chl-a values across year–month–grid combinations. As shown in Figure 11, the relative MOE decreases as the sample size increases.

The sample size at which the relative MOE falls below a specified threshold can be used to define the boundaries for ‘low’, ‘moderate,’ or ‘high’ confidence ratings. The choice of relative MOE thresholds for confidence ratings depends on the acceptable error range. Given the typically large sample sizes per year–month–grid combination, we suggest to set the ‘moderate confidence’ boundary at a relative MOE of 0.1 and the ‘high confidence’ boundary at 0.05, corresponding to an error range of 10% and 5% respectively. For example, in Figure 11, the relative MOE drops below a threshold of 0.1 at a sample size of approximately 50 for grid 668 in July 2008. At this sample size, the mean chl-a value for that specific year–month–grid combination has a 10% error range (= 0.1 on the Y-axis) with 95% confidence. Likewise, the relative MOE for grid 453 in June drops below 0.05 at a sample size of about 200, so the error range for this year–month–grid combination is 5% at that sample size.

Despite small differences between year–month–grid combinations, the relative MOE curves in Figure 11 show a comparable pattern. Therefore, general sample size thresholds for each year–month–grid combination in the entire SNS area were assessed. We randomly sampled 100 grids for each year–month combination with a sample size greater than 100. For each grid, chl-a measurements were sampled up to a size of 800, with replacement, and the relative MOE was calculated (as shown in Figure 11). An exponentially decaying function was then fitted to each year–month–grid combination:

relative MOE =a×sample sizeb

with the nonlinear least squares (nls) function in R, where the a and b parameters were estimated by the model. Based on the fitted values, the sample size corresponding to the 0.1 and 0.05 MOE thresholds was determined for each year–month–grid. The means of these sample sizes serve as the overall confidence class boundaries for the SNS area. A threshold of 0.1 corresponds to a sample size of approximately 50 (‘moderate confidence boundary’), and a threshold of 0.05 corresponds to a sample size of around 200 (‘high confidence boundary’).

Per year–month–grid confidence classes can thus be assigned based on sample size. For subsequent aggregation per year–grid or per year classes were assigned numerical values of 1 (‘low confidence’), 2 (‘moderate confidence’) or 3 (‘high confidence’), where the overall mean determines the confidence rating. A score smaller than 1.5 represents a low confidence rating, between 1.5 and 2.5 a moderate confidence rating and a score higher than 2.5 a high confidence rating, as shown on the map in Figure 12. Results aggregated per year are shown in Table 7, including numerical scores. Some edge effects are present, as discussed in chapter 4.1, but these do not significantly affect the confidence rating.

The same method can be used for the MPM and RHPM coastal areas, where we first determined the MOE thresholds based on chl-a measurements and sample sizes. As a result of the higher variation in chl-a concentrations closer to shore, the threshold of 0.1 corresponds to a sample size of approximately 60 (‘moderate boundary’), and the threshold of 0.05 corresponds to a sample size of around 260 (‘high boundary’). Consequently, confidence rating is on average lower than in the SNS area (Figure 13 and Table 7).

Table 7. Confidence rating and scores per year in the different assessment areas
SNS MPM & RHPM
yearconfidence scoreconfidence classyearconfidence scoreconfidence class
19982,58high19981,60moderate
19992,81high19991,85moderate
20002,67high20001,68moderate
20012,68high20011,81moderate
20022,85high20022,12moderate
20032,90high20032,45moderate
20042,89high20042,37moderate
20052,85high20052,31moderate
20062,89high20062,38moderate
20072,88high20072,35moderate
20082,88high20082,36moderate
20092,89high20092,41moderate
20102,88high20102,35moderate
20112,88high20112,36moderate
20122,89high20122,39moderate
20132,89high20132,40moderate
20142,89high20142,37moderate
20152,89high20152,39moderate
20162,90high20162,48moderate
20172,90high20172,46moderate
20182,90high20182,50moderate
20192,90high20192,45moderate
20202,90high20202,53high

4.5 Conclusions new method

We propose to first aggregate chl-a data of different methods on a 10x10 km grid scale. This approach allows for adjustments based on variations in sampling effort, while its impact on the overall yearly means is expected to be minimal. Additionally, we recommend a new, more objective method for assigning confidence ratings, where boundaries can be set based on acceptable error range.

7) Stekhoven & Bühlmann (2012) MissForest—non-parametric missing value imputation for mixed-type data.
8) Visser, H., Estimation and detection of flexible trends (2004). Atmospheric Environment 38:4135–4145
9) Soldaat, L., Visser, H., Van Roomen, M., Van Strien, A., (2007). Smoothing and trend detection in waterbird monitoring data using structural time-series analysis and the Kalman filter. J. Ornithol. 148 (2): 351–357.