Top-down source level estimate

This document describes the methodology used to perform the top down source level estimates used in a reconciliation analysis.

The source data is from Bridger surveys. For the case discussed here we are using data for operator CU912 from a single survey and extrapolating this out based on infrastructure data.

The aim of this analysis is to get a single value for the total methane emission volume for the entire inventory for a given year. We have three main data sources:

The Bridger survey
The operator infrastructure data which consists of multiple files with varying levels of information regarding equipment and site level attributes.
Tank pressure data for high pressure events for a subset of the operator's tanks.

We need to take the observations from a single survey and use this to estimate emissions over the entire population and year. Ideally we would have data over a longer time period, as an aerial survey is only a snapshot on two different days. However, since it an effectively two instantaneous (relative to the length of a year) snapshots we are exchanging temporal resolution for spatial resolution and breadth.

The first assumption we must make is that all equipment across the operator's inventory is similar enough, and that the emission profiles remain similar through time, such that extrapolating the survey subset out to the entire population across the entire year is acceptable. This assumption is not entirely valid, and using it alone would introduce significant error. To improve this we carefully choose certain attributes for each component that have an influence on the emission profile, and stratify the sampling accordingly. So our first major correction is:

Stratified bootstrapping. Identifying sites with similar characteristics and use only survey data from the same subset when extrapolating.

The second assumption is that the sites surveyed are representative of the population. Even if on average the population is reasonably homogenous in terms of emission profile, with some variations allowed above and below the mean, we must still sample a representative portion of this to get fair result. However, we are aware that this is not the case and have verified this by analysis of the tank pressure data. So our second correction will be:

Activity adjustments. We use the tank pressure data which has excellent temporal coverage to identify site activity over the year, and adjust our sampling to account for this through downward adjustment of detection probabilities.

The activity adjustments are only applicable to the tank component, whereas stratification we do for all equipment types except Other. Both these methods will be explained in depth below.

Bootstrapping

Bootstrapping is a technique used to estimate the mean of a given statistic and its uncertainty when only a limited sample of data is available. The process is:

Draw a random sample (with replacement) from the surveyed sites
Compute the total emission rate for that draw
Repeat 1,000 times
The spread of the 1,000 estimates forms the confidence interval

So by repeating drawing at random from the survey sample we can end up with a "survey" of the entire population, no matter how much larger that may be than our original survey. By repeating this process many times we are able to simulate the randomness that we would expect to see if we really did survey the entire population.

Bridger survey data generally involves up to two flights over a site. We will treat each of these as different observations, and for the sake of bootstrapping, they are effectively different sites. So each observation is unique and independent. We have experimented with using only the first flight to determine the detection frequency, but this creates some other issues, which are described below in Section Adjustment for Bridger second flight bias.

Stratification

Rather than treating all sites as identical, sites are grouped into strata - subsets sharing similar characteristics such as site type or size category. Bootstrapping is performed independently within each stratum, and results are summed at the end. This ensures each type of site contributes to the final estimate in proportion to how common it is in the population, not how well-represented it happened to be in the survey.

Activity-Based Detection Correction

The problem

The tank pressure data gives us a measure of activity at a site. The file consists of pressure events, including a start time, end time and value for the pressure that occurred (including mean, maximum and minimum). While there may be some value in the exact values relative to the tank's particular pressure limits, the quality of our survey data and the errors within which we are working are such that this level of detail is of little value. Instead we simplify this down to a count of the number of unique days for each site that a pressure event occurred. This table can now be used as a measure of activity at each site based on the data from the entire year.

The activity data is also filtered out to remove outliers. We find that there are a few sites with many more events than the rest, to avoid bringing the average up significantly we remove any data with a count over 100 days. This is also motivated by the fact that we actually have a Bridger non-detection at the same time as a pressure event at one of these outlier sites - the site in question has 225 unique days with an event, given these two facts combined it is important to remove this from the data as it is possible this a faulty sensor.

Although the pressure data only covers a subset of the survey data, it also covers a subset of the rest of the population. Using our derived activity measure we find that the surveyed sites have an activity approximately twice as high as the rest of the population. We make the inference that this pattern extends beyond the coverage of the pressure data and assign the following activity values to the rest of the population.

In Bridger survey	In pressure data	Assigned activity
Yes	Yes	From activity data
No	Yes	From activity data
Yes	No	Mean of survey activity
No	No	Mean of non-survey activity

So now we have a measure of site activity — a proxy for either site complexity, site throughput or a general measure of how well managed a site is. We make the assumption that higher activity is associated with a higher probability of an emission occurring; so it is not surprising that the survey sample which is biased toward sites that are inherently more likely to have detections also has a higher activity using this metric. Without correction the bootstrapped detection rate would overestimate the true population detection rate.

Deriving the correction factor

The correction is motivated by the Poisson distribution, which models the number of times a rare event occurs given an expected average rate λ (lambda). A key property of the Poisson distribution is that λ scales linearly with the underlying activity rate - a site twice as active has twice the expected number of emission events.

If a site has an activity above the mean for the population then it would have a higher value for lambda. We can scale it down to what it would be at the population level using the activity ratio

λ_population = λ_survey / activity_ratio

In a true Poisson distribution the probability of observing at least one detection would be

P(k >= 1) = 1 - e^−λ

For rare events (lambda << 1), like the high emission tank events we are dealing with here, this can be approximated as $1 - e^{- λ} \approx λ$ . Using this assumption we can scale the probability of detection for this site to the rest of the population

P(detection_population) \approx P(detection_survey) / activity_ratio

In our case we do not know what the value for P(detection_survey) is at every site, it is different for all and dependent on the activity. However, we do not need to know it, as we can use the ratio here to adjust the probability of a given draw from the sample being an actual detection.

For each detected site drawn in a bootstrap sample, we ask: given that this site is more active than the average population site, should we retain its detection or cancel it? When drawing samples in the bootstrapping exercise if we get a non-detection then we continue, no adjustment is made. If we get a detection, then we need to apply the adjustment from this particular site's probability of a detection P(site) (which we don't actually know) to the real population probability. The probability of retaining a detection is scaled inversely with the site's activity ratio:

P(retain detection) = min(1.0, 1 / activity_ratio)

This preserves the correct proportional scaling from the Poisson reasoning while ensuring that sites at or below the population mean are never penalised.

This result can be understood intuitively:

A site at exactly the population mean activity (activity_ratio = 1.0) has P(retain) = 1.0 — its detection is always kept; no correction is applied as it is a representative site.
A site below the population mean (activity_ratio < 1.0) also has P(retain) = 1.0 — because these are rare events, low-activity sites are already naturally under-represented in our detected sample. Capping the probability at 1.0 ensures we do not introduce massive variance by artificially inflating these rare detections.
A site above the population mean (activity_ratio > 1.0) has P(retain) < 1.0 — the more over-active it is, the more likely its detection is turned into a non-detect to strip away the selection bias.

Step by step

Step 1 — Carry activity data through to the sample

Before bootstrapping begins, the activity value (pressure event days) for each surveyed site are attached to its detection record. So when a site is drawn during resampling it brings its own activity value with it, which we can then use to adjust that detection.

Step 2 — Draw sites as normal

A bootstrap sample of n_sites is drawn with replacement from the survey data. This is the number of sites for the given strata. Each drawn survey data record carries its own activity value from the site that it was observed at.

Step 3 — Compute each site's activity ratio

We then use the activity value from the sampled site to down weight it if the activity is above the population mean.

activity_ratio = site_activity / population_mean_activity

Sites without activity data are assigned the population mean, giving activity_ratio = 1.0 and P(retain) = 1.0 — no correction is applied. Sites below the population mean are also not adjusted. This is the conservative choice: where we have no information or the surveyed site is actually less active, we make no adjustment.

Step 4 — Compute the retention probability and draw the outcome

python

p_retain = np.minimum(1.0, 1.0 / activity_ratio)
adjusted_detections = (rng.uniform(0, 1, size=len(p_retain)) < p_retain).astype(int)
adjusted_detections = adjusted_detections * base_detection  # non-detections always stay 0

For each detected site, a uniform random number is drawn between 0 and 1. If it falls below p_retain, the detection is kept; otherwise it is cancelled. The final multiplication by base_detection ensures that sites which were not detected in the survey always remain non-detections regardless of the random draw.

Worked example

Take a surveyed site with 25 pressure event days that had a detection event (detection = 1):

activity_ratio = 25 / 12.74 = 1.96
P(retain) = min(1.0, 1 / 1.96) = 0.51
P(cancel) = 1 - 0.51 = 0.49

This site has roughly a 50/50 chance of retaining its detection in any given bootstrap iteration. A uniform random number is drawn — if it is below 0.51 the detection is kept and an emission rate is drawn as normal; if it is above 0.51 the detection is cancelled and no emission rate is drawn.

The table below shows how retention probability varies across different activity levels:

Site activity (days)	Activity ratio	P(retain detection)	P(cancel detection)
6.4 (below population mean)	0.50	100%	0%
12.7 (population mean)	1.00	100%	0%
19.1 (1.5× population mean)	1.50	67%	33%
25.5 (2× population mean)	2.00	50%	50%
38.2 (3× population mean)	3.00	33%	67%

Sites at or below the population mean are never corrected. Sites above the mean are corrected in proportion to how much they exceed it.

Non-detected sites

Sites with a survey detection of 0 always remain 0, enforced by multiplying by base_detection after the draw. The correction only ever acts on sites that were actually detected in the survey.

Emission rates are unchanged

If a detection is retained, its emission rate is drawn independently from the full pool of observed emission rates, unchanged from the uncorrected method. The assumption is that the magnitude of an emission, given that one occurs, is not strongly driven by activity level — only the probability of occurrence is. This assumption debatable, a site with higher activity may actually have larger emissions as well. However, there is no justifiable scaling we can do between our activity measure and emission volume. This is because, as explained above, activity relates well to event frequency, they are both measuring a frequency in time. As we only have one detected emission which lines up with a tank pressure event, we do not have enough data to even infer what this relationship might be.

What the Correction Achieves

By applying the activity correction inside every bootstrap iteration the method we are able to:

Reduce the mean emission estimate - most surveyed sites have above-average activity, so a detection drawn from such a site is dropped more often than not across bootstrap samples.
Widen the confidence interval - the random retention draw introduces additional uncertainty on top of the resampling, reflecting that we do not know exactly how much of the observed detection rate is due to over-activity.
Apply the correction proportionally using each site's own data - highly over-active sites are corrected more strongly than mildly over-active ones.
Not over-correct - observations at sites at or below population mean activity are never adjusted, regardless of their detection outcome.

Adjustment for runtime (flares only)

The flaring infrastructure data has a runtime column. In almost all cases this is 8760 hours - the length of a year. To account for the sites that have a lower value we make an adjustment for runtime.

This adjustment is made through a simple scaling of the total number of detections and the total emission rate for each bootstrapping sample by a runtime factor. The runtime factor is calculated as the total number of hours across all sites over the total number of hours in a year multiplied by the number of sites.

For example, assume we have a runtime factor of 0.8. If we do one bootstrapping run and have 100 detections and a total emission rate of 100 kg/h, then our adjust result for that run is 80 detections and 80 kg/h. This is a basic adjustment and does not account for bias in the survey, since at most our survey sites would have 8760 hours runtime which is very close to the mean. However, this adjustment does handle any concerns at the population scale due to over counting the effective number of sites with flares.

Adjustment for Bridger second flight bias

Note: the following has been changed. The reason is given below the text here explaining what I was originally doing.

In the first step where we sample from the survey data (with replacement) we make an adjustment. The Bridger survey data is split into two tables:

1.) The data from the first observation of every site only (by day of observation). This includes detection and non-detection observations.
2.) The data from all observations, including return visits to a site on a later day, that had a non-zero emission. So this is all survey data with a non-zero detection rate.

Table 1.) is used to determine if a site (in the set of n_sites we are sampling for) has a detection or not. It is a binary result. This results in a column of 0 and 1 values. We then take n_site samples from Table 2.) and multiple the detected rate column by this column of binary values.

The result of the procedure is a detection probability based on first site visits only, and an emission rate based on all the data available.

More information on the Bridger second flight bias is given below.

So I have decided not to implement this because it has a problematic consequence for when we want to split the data up. It creates a weird statistical issues that is kind of hard to understand. The best way to see is using the following example.

Consider three sites with tanks. We observe both of these sites on two different flights. So we have the following observational data.

Observed emission flux	Site A	Site B	Site C
First flight	100	1	0
Second flight	1	0	1

Now using the approach described above we would have calculated a detection probability of 2/3 and a mean emission rate of 25.75. So, if our “population” is 10 sites in total with the same characteristics we have a total flux of 171.67 kg/h.

But now we want to split the data into large emissions (non-normal events perhaps) and “normal” emissions. So we have two datasets now.

Observed emission flux (high)	Site A	Site B	Site C
First flight	100	0	0
Second flight	0	0	0

Observed emission flux (normal)	Site A	Site B	Site C
First flight	0	1	0
Second flight	1	0	1

Now when we compute the same statistics for each data set we get:

For the high emissions. Mean emission rate is 100 kg/h, detection probability is 1/3. So for the entire population we have 333 kg/h.
For the low emission events we get a mean emission rate of 1 kg/h and a detection probability of 1/3 again (remember we are only using observations from the first flight for detection probability). So the population value is 3.3 kg/h.

Combining these together now gives us a total population rate of 336 kg/h. This is much higher than our value before splitting up the data! The same outcome occurs when we do the bootstrapping, if we split the data up, the totals increase.

The reason for this is that this technique of combining all (non-zero only) emissions means the larger the dataset the more diluted the large emission events become by the second flight. In the first table above, the 100 kg/h event is immediately compensated for by the existence of a 1 kg/h events in the next flight, whereas the detection probability is not taking these events into account.

There are two ways to fix this. If we only consider the first flight for all statistics. Then we get the result of 333 kg/h from both the first combined set, and the sum of the split data. Or we can use all the data we have for computing both the detection probability and the mean emission rate. Let us work through that below.

For the combined dataset we get a mean of 25.75 kg/h and the detection probability is still 2/3. This may or may not change depending on the data. For the next two we get:

For the high emissions. Mean emission rate is 100 kg/h, but now the detection probability is 1/6. So for the entire population we have 166.67 kg/h.
For the low emission events we get a mean emission rate of 1 kg/h and a detection probability of 1/2. So the population value is 5 kg/h.

Now the total of the split data is 171.67, which is exactly the same as we get from the combined data.

This exercise has demonstrated an import bias that can be created by using the approach described in the text at the start of this section. For this reason will not use this approach and we are comfortable doing this as we see no strong reason to do so (see the section below where we consider the bias due to Bridger’s second flights).

Top-down source level estimate ​

Bootstrapping ​

Stratification ​

Activity-Based Detection Correction ​

The problem ​

Deriving the correction factor ​

Step by step ​

Worked example ​

Non-detected sites ​

Emission rates are unchanged ​

What the Correction Achieves ​

Adjustment for runtime (flares only) ​

Adjustment for Bridger second flight bias ​