Carbomont / CarboEuropeIP Gap Filling Tool: the Details

This Gap Filling Tool for eddy covariance and meteorological data is based on the "mean diurnal approach". It fills gaps in every input variable except the date and time columns. Only values flagged as missing are replaced; no new lines are produced. This makes it possible to fill gaps in partial datasets (e.g. a subselection of data from a specific footprint direction).

Usage

There are only the following requirements for the tool to work properly:
  1. You must specify how many header lines there are
  2. You must specify where date and time information resides (two column numbers starting with 1 as the first column in your table)
  3. You must tell the tool how missing values (the ones that should be replaced by gap filled data) are specified
Concerning the column separator, you should use the same character throughout the file. The gap filler simply counts how often each non-numeric character occurs and then makes the smart (?) decision that the most frequent one is your column separator. So there is no need to convert your file. Your input file must, however, be plain ASCII, so no Excel workbooks and such.
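The separator guess described above can be sketched as follows. This is only an illustration, not the tool's actual code: the function name guess_separator and the exact set of characters regarded as numeric are assumptions, and header lines are assumed to have been skipped already.

```python
from collections import Counter

def guess_separator(lines):
    """Return the most frequent character that cannot be part of a number."""
    # Assumption: digits, signs, the decimal point and exponent letters
    # count as numeric; everything else is a separator candidate.
    numeric_chars = set("0123456789+-.eE")
    counts = Counter()
    for line in lines:
        for ch in line.rstrip("\r\n"):
            if ch not in numeric_chars:
                counts[ch] += 1
    if not counts:
        raise ValueError("no candidate separator found")
    return counts.most_common(1)[0][0]

sample = ["1;0.5;12.3;NA", "1;1.0;NA;4.1", "1;1.5;11.9;4.0"]
print(guess_separator(sample))
```

With this approach a text marker for missing values such as "NA" also contributes characters to the count, which is one reason to use the same separator consistently so that it clearly dominates.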

Header Lines

This is just for convenience. You can specify as many header lines as you want. The gap filling tool will just copy these lines from your input file to the output file. Since it does not need to know what the columns mean, you are also free about the naming of your columns.

Date and Time Information

This is essential information. The gap filling tool expects two columns:
  1. one that specifies the Julian day (that is, the number of the day in the year, with 1 being the first of January)
  2. the other with the time of day, either as decimal hours (11.5) or in HHMM notation (1130) for "half past eleven"

You will get a notification in the log file if the gap filling tool detects the HHMM format, which is not a decimal time but will be converted to decimal time.
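The conversion from HHMM to decimal time might look like the sketch below. How the tool actually detects the HHMM format is not documented here; the rule used in this example (any value above 24 means the whole column is HHMM) and the name to_decimal_hours are assumptions made for illustration.

```python
def to_decimal_hours(values):
    """Convert a time column to decimal hours.

    Returns the converted list and a flag telling whether HHMM
    notation was detected.
    """
    # Assumption: decimal hours never exceed 24, so a larger value
    # can only be HHMM notation.
    is_hhmm = any(v > 24 for v in values)
    if not is_hhmm:
        return list(values), False
    converted = []
    for v in values:
        hours, minutes = divmod(int(round(v)), 100)
        converted.append(hours + minutes / 60.0)
    return converted, True

times, detected = to_decimal_hours([1130, 30, 0])
print(times, detected)
```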

Limitation: the gap filling tool only works within one year. If you want to fill gaps across several years, you must either process each year separately or let the day-of-year variable increase continuously, such that 1 January of the second year is DOY 366 (or 367 if the first year is a leap year), and so on. If you mix years you will get rubbish as output, since the gap filling tool only knows day of year and time.
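If your data carry a year column, a continuously increasing day number can be prepared before running the tool. The helper below is a hypothetical preprocessing step on the user's side, not part of the gap filling tool; the name continuous_doy is an assumption.

```python
import calendar

def continuous_doy(year, doy, first_year):
    """Day number that keeps increasing across year boundaries."""
    # Add the length of every full year between first_year and year.
    offset = sum(366 if calendar.isleap(y) else 365
                 for y in range(first_year, year))
    return doy + offset

# 2004 is a leap year, so 1 January 2005 becomes day 367.
print(continuous_doy(2005, 1, first_year=2004))
```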

Missing Values

Basically the gap filling tool replaces all occurrences of the missing value that you specified with gap filled data. It will not produce extra lines that are not present in the input file. Be aware of this!

If you specify a string such as "NA" you should check whether uppercase and lowercase notations were correctly identified by the gap filler (they should be).

If you specify a numeric value, it is compared numerically, and all values that fall within a small range around it are treated as missing to avoid problems with rounding errors. Thus, if you specify -999.99, be aware that any value within roughly -999.99 plus or minus 0.1 could be treated as missing. It is generally a good idea anyway to use a missing value in your dataset that is not too similar to true data.
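The two kinds of missing-value markers can be illustrated with the following sketch. The function name make_missing_test is hypothetical, and the tolerance of 0.1 is only the example figure from the text above, not a confirmed setting of the tool.

```python
import math

def make_missing_test(marker, tolerance=0.1):
    """Build a function that tells whether a field counts as missing."""
    try:
        target = float(marker)
        if math.isnan(target):
            target = None  # treat markers like "NaN" as text
    except ValueError:
        target = None      # text marker such as "NA"

    def is_missing(field):
        text = field.strip()
        if target is None:
            # Text markers are compared without regard to case.
            return text.lower() == marker.strip().lower()
        try:
            return abs(float(text) - target) <= tolerance
        except ValueError:
            return False

    return is_missing

is_missing = make_missing_test("-999.99")
print(is_missing("-999.9"), is_missing("12.3"))
```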

Procedure

A two-step procedure is used to fill the gaps. Most importantly, each column is treated independently of all other columns.

Step 1

In a first pass, short gaps are filled by linear interpolation, unless they lie at the very beginning or end of the data file. Therefore be aware that if you have missing data in the first and/or last record of your file you will still have missing values in the output! You will need to replace them manually using your preferred subjective guess method.

The current settings are as follows: length_of_short_gap is set to 4 (that means 4 missing values in a row are bridged in step 1; if you have 30-minute resolution, this corresponds to a 2-hour gap).
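A minimal sketch of step 1 is given below, assuming equally spaced records and None as the internal marker for missing values. The function name fill_short_gaps is hypothetical; only the parameter name length_of_short_gap and its value of 4 come from the description above.

```python
def fill_short_gaps(values, length_of_short_gap=4):
    """Linearly interpolate runs of None up to length_of_short_gap long.

    Runs touching the first or last record are left untouched because
    they lack a valid neighbour on one side.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # index of the first valid value after the run, or n
        run = end - start
        if start == 0 or end == n or run > length_of_short_gap:
            continue
        left, right = filled[start - 1], filled[end]
        for k in range(run):
            frac = (k + 1) / (run + 1)
            filled[start + k] = left + frac * (right - left)
    return filled

print(fill_short_gaps([1.0, None, None, None, 5.0]))
```

Gaps longer than length_of_short_gap are deliberately left alone here so that they can be handled in step 2.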

Step 2

In a second pass, the remaining gaps are filled with mean diurnal cycles of the respective variable. Gaps that were filled in step 1 are now treated like measured data. Be aware of this assumption. Moreover, we assume that you provided 30-minute averages; if you use a different averaging interval, the result of the second step may differ from your expectations.

To fill the gaps we collect the valid data values within each 30-minute interval from a certain number of days before and after the gap (specified by search_days; see below). We additionally require at least min_available values within each 30-minute time segment. If fewer are found, we use a trick: we add the values found for the previous and the following 30-minute interval and compute a weighted average for the time segment we work in, using a weight of 0.5 for the data found within our own time segment and 0.25 each for the data from the previous and following intervals.
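The fallback with neighbouring intervals can be read in more than one way. The sketch below shows one possible reading, in which the weights 0.5, 0.25 and 0.25 are applied to the mean of each interval and renormalised when an interval has no data at all. The function name slot_estimate and the renormalisation are assumptions, not confirmed behaviour of the tool.

```python
def slot_estimate(current, previous, following, min_available=4):
    """Estimate for one 30-minute slot from lists of valid values."""
    if len(current) >= min_available:
        # Enough data in the slot itself: plain mean.
        return sum(current) / len(current)
    weighted_sum = 0.0
    total_weight = 0.0
    groups = ((current, 0.5), (previous, 0.25), (following, 0.25))
    for values, weight in groups:
        if values:
            weighted_sum += weight * sum(values) / len(values)
            total_weight += weight
    if total_weight == 0.0:
        return None  # nothing available, the gap stays open
    return weighted_sum / total_weight

# Only two values in the slot itself, so the neighbours are used as well.
print(slot_estimate([2.0, 4.0], [1.0], [5.0]))
```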

Now we replace the missing values in the gap by the mean obtained for the corresponding 30-minute interval. Generally a gap is unlikely to remain unfilled, since by definition a gap is bounded by available data. The worst case, for example, is a file with only three days (e.g. days 1 to 3) where the middle day (day 2) consists of missing data. In such a case the min_available condition cannot be met, and looking at the previous and following 30-minute intervals may not yield enough information to fill the gaps either. But hopefully you do not challenge this gap filling procedure with such questionable input!

The current settings are as follows: search_days is set to 3; this is the number of days before and after the gap (so, twice as many in total) that are searched for valid data. min_available is set to 4. Since at most 3 values can come from before the gap and at most 3 from after it (with search_days being 3), at least one of the values was always measured before the gap and at least one after it.


Further reading

Several gap filling methods were compared and discussed by Falge et al. (2001), who listed several advantages of the "mean diurnal approach". First, this method has the potential to capture non-linearity due to diurnal or temporal changes in response. Second, interactions between light and temperature are captured in a simple manner, as they show a lagged co-variance over the course of the day. Further, there is no need to calculate functional response curves. This method, however, has its limitations. The functional responses usually found in the data are not always respected, and there is a potential for some bias under certain climatic conditions. Finally, this approach was chosen because it is very time-efficient and CPU-friendly in computing the gap filled datasets. It also works outside the growing period, where no clear light response is expected in the NEE data.


Date modified: 02.12.2004 by Werner Eugster