Extract emission data and images from Bridger KMZ files

The KMZ files supplied by Bridger are archives containing a KML file (an XML document) and a number of PNG plume images. This tool extracts the emission data, flight coverage, site polygons and equipment from the KML file, together with the images. It uses the image bounds defined in the KML file to convert the PNG plume images to GeoTIFF format.
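As a rough illustration of how image bounds in a KML file can georeference a plume image, the sketch below reads the `LatLonBox` of a `GroundOverlay` and derives a GDAL-style affine geotransform. It is not the tool's actual implementation: the function names, the sample KML and the image size are assumptions, and any overlay rotation is ignored.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Hypothetical KML fragment; real Bridger files nest overlays inside folders.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <GroundOverlay>
      <name>plume_001</name>
      <Icon><href>images/plume_001.png</href></Icon>
      <LatLonBox>
        <north>32.10</north><south>32.00</south>
        <east>-103.00</east><west>-103.20</west>
      </LatLonBox>
    </GroundOverlay>
  </Document>
</kml>"""


def overlay_bounds(kml_text):
    """Return {image href: (west, south, east, north)} for each GroundOverlay."""
    root = ET.fromstring(kml_text)
    bounds = {}
    for overlay in root.iter("{http://www.opengis.net/kml/2.2}GroundOverlay"):
        href = overlay.findtext("kml:Icon/kml:href", namespaces=KML_NS)
        box = overlay.find("kml:LatLonBox", KML_NS)
        west, south, east, north = (
            float(box.findtext(f"kml:{edge}", namespaces=KML_NS))
            for edge in ("west", "south", "east", "north")
        )
        bounds[href] = (west, south, east, north)
    return bounds


def geotransform(bounds, width_px, height_px):
    """GDAL-style geotransform for a north-up image covering the given bounds."""
    west, south, east, north = bounds
    pixel_width = (east - west) / width_px
    pixel_height = (north - south) / height_px
    # (x origin, pixel width, row rotation, y origin, column rotation, -pixel height)
    return (west, pixel_width, 0.0, north, 0.0, -pixel_height)


bounds = overlay_bounds(SAMPLE_KML)
# Assumed image size of 200 x 100 pixels for the example overlay.
print(geotransform(bounds["images/plume_001.png"], width_px=200, height_px=100))
```

A raster library can then use this tuple, with a WGS 84 coordinate system, when writing the GeoTIFF (for example via GDAL's `SetGeoTransform`).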

The KML file contains HTML tables (designed for display in a UI) that hold the emission data. While this is not an ideal format, it is sufficiently structured to be parsed. The overall structure of the KML file (a nesting of "folders") is not consistent across the files we have tested. This tool is designed to be flexible enough to handle these differences, but some scenarios may still require further modifications.
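To give a feel for what parsing such a table involves, here is a minimal sketch using only the Python standard library. The two-column label/value layout and the field names in the sample are assumptions for illustration; the real Bridger tables may be laid out differently.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_dict(html):
    """Turn a two-column label/value table into a dictionary."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


# Hypothetical placemark description; the field names are made up.
SAMPLE = (
    "<table>"
    "<tr><td>Emission Rate (kg/hr)</td><td>12.5</td></tr>"
    "<tr><td>Equipment</td><td>Tank</td></tr>"
    "</table>"
)
print(table_to_dict(SAMPLE))
```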

CLI Tool

This tool is designed to be used on the command line to extract data from a KMZ file. To run the CLI tool, you need either Docker installed on your machine or a manually configured Python environment (see the Dockerfile for the steps to follow). Use the following command.

Process emissions only

docker-compose run cli-tool --input "input_files/CVX0014 - Processed Report.kmz" --process_images --output_folder "my_output"

It will not run if the output folder already exists. If you want to force it to remove the output folder you can use the -f flag.

Enrich emissions data with equipment type

The processed reports do not link emissions with equipment. This linkage seems to be available only in the equipment report Excel files; it is not even available in the equipment report KMZs, perhaps because these are meant to be viewed on a map, so it was not considered necessary. We therefore load the tab "Emission Location Extended" from the equipment report XLSX file and use it to link with the emissions. The result is an emissions table that contains the equipment type and ID.

To link to the equipment report Excel file, include the following argument.

--equipment_report_xlsx "/cvx_import/CVX0006/CVX0006 Equipment Report v2.xlsx"

Generate non-detections

The equipment report XLSX file only contains the equipment for which there was a detection. To generate a list of non-detections we need to get the full equipment list and combine it with the swath coverage.

The XLSX file does sometimes contain the full list of equipment, but it is not always there or easy to parse. It also only gives the number of scans, the first time and the last time; it does not contain all the individual times.

So instead we use the equipment report KMZ file to create a list of equipment, locations and types. We then merge this with the swath coverage from the processed emission report KMZ file to create a list of what equipment was covered. Finally, we merge this with the emissions to create a list of non-detections.
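The merge described above can be sketched roughly as follows, using a plain ray-casting point-in-polygon test. This is only an illustration under stated assumptions: the record layouts and field names are hypothetical, emissions are matched to swaths by exact time, and the real tool may use a geospatial library and different matching rules.

```python
from datetime import datetime


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count the edges crossed by a ray heading east from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def non_detections(equipment, swaths, emissions):
    """Covered (equipment, time) pairs that have no emission at that time."""
    detected = {(e["equipment_id"], e["time"]) for e in emissions}
    rows = []
    for item in equipment:
        for swath in swaths:
            if not point_in_polygon(item["lon"], item["lat"], swath["polygon"]):
                continue  # this swath did not cover the equipment
            if (item["id"], swath["time"]) in detected:
                continue  # an emission was reported, so not a non-detection
            rows.append({
                "equipment_id": item["id"],
                "type": item["type"],
                "time": swath["time"],
                "emission_detected": False,
            })
    return rows


# Hypothetical sample data.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
t1, t2 = datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 14, 0)
equipment = [
    {"id": "T-1", "type": "Tank", "lon": 0.5, "lat": 0.5},
    {"id": "W-9", "type": "Well", "lon": 5.0, "lat": 5.0},  # outside all swaths
]
swaths = [{"time": t1, "polygon": square}, {"time": t2, "polygon": square}]
emissions = [{"equipment_id": "T-1", "time": t1}]

for row in non_detections(equipment, swaths, emissions):
    print(row)
```

In this sample, the tank is covered twice and has an emission on the first pass, so only the second pass is a non-detection; the well is never covered and produces no rows.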

To provide the equipment report KMZ, use the following argument with the CLI tool.

--equipment_report_kmz "/cvx_import/CVX0006/CVX0006 Equipment Report v2.kmz"

The original Bridger files contain non-detects, but only for equipment that has at least one detection. There are never non-detects for equipment that never has an emission. This is the primary reason we created this method. Note that non-detections computed using this method may differ slightly from those provided by Bridger. Because we use clustering, there are cases where Bridger reports two "non-detections" within a minute or two of each other, whereas our approach combines these into a single non-detection event. This results in a slight increase in the detection frequency.
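The effect of combining closely spaced observations can be illustrated with a simple time-gap clustering. The clustering method and threshold actually used by the tool are not stated here, so the `cluster_times` helper and its five-minute `max_gap` are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def cluster_times(times, max_gap=timedelta(minutes=5)):
    """Group timestamps so consecutive members are at most max_gap apart."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)  # close enough to the previous observation
        else:
            clusters.append([t])    # start a new non-detection event
    return clusters


# Two passes 90 seconds apart, then a separate pass later in the day.
times = [
    datetime(2023, 5, 1, 10, 0, 0),
    datetime(2023, 5, 1, 10, 1, 30),
    datetime(2023, 5, 1, 11, 15, 0),
]
events = cluster_times(times)
print(len(times), "raw observations ->", len(events), "non-detection events")
```

Fewer non-detection events means fewer observations in the denominator, which is why the detection frequency rises slightly.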

Outputs

Depending on the inputs provided (see the section above) several output files are created.

`emissions.csv`

This is the primary emissions table. It is formatted for upload into the Aershed Platform. This table contains some non-detections (emission_detected=False) for equipment that had a detection at another time. Equipment that never has a detection does not appear in this file.

`non_detections.csv`

This file contains a list of non-detections for all equipment found in the equipment report. This is the fully processed final form with emissions removed. If an item of equipment has an emission at a specific time in the emissions table, then it will not appear in this list at that specific time.

`combined.csv` and `combined_for_aggregate.csv`

These files are simply the emissions and non-detections concatenated into one table. The two files differ in their column names, which are intended for different tools, and in that combined.csv includes extra entries for entire sites that have non-detections. These extra entries avoid a downstream issue where sites were not counted if they had no detections. They need to be filtered out in some cases.

combined_for_aggregate is specifically intended for use with the bridger_kmz/aggregate_output.py script. This script aggregates data into the form of the original equipment reports from Bridger. It takes data from multiple surveys and creates a single row for each item of equipment, with a first scan and last scan time, and also computes the persistent or intermittent classification. It creates an output identical to the original TEP data we have used in the past for trend analysis.

combined.csv is in the format required for the A/B analysis.

Neither of these files is suitable for upload into the Aershed Platform due to the column names used.

`equipment.csv`

This file contains all the equipment from the equipment report KMZ file. This is still a work in progress, as it may contain some duplicates (same location but different equipment ID). Once complete, it will be the source of truth for equipment in that survey.

`infrastructure.csv`

Note: will rename this to sites.csv for next run and fix site names

This file contains a list of sites and site polygons. This will eventually be combined with the equipment file.

`swath_coverage.csv`

This file contains the coverage of the survey. It is a direct copy of the information from the processed report KMZ file. It is used to construct the non-detections.

`aggregate/*.csv`

This folder at the higher level contains the combined aggregate data. The files and what they contain are defined in the script bridger_kmz/aggregate_output.py. For more details see the section Aggregating output below.

Removing duplicate equipment

In some cases the same equipment is listed multiple times in the infrastructure list provided by Bridger in the equipment report. These duplicates affect the results because we end up creating additional non-detections. It is even possible to have a "non-detection" at exactly the same time and location as a detection because of them. We therefore remove duplicates, prioritising equipment that has a reported emission. If there are no emissions, we keep the item with the numerically smaller ID.
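The priority rule can be sketched as below. How the tool decides that two entries are duplicates is not spelled out here, so grouping by rounded coordinates is an assumption, as are the function name, the field names and the integer IDs.

```python
def deduplicate(equipment, emitting_ids):
    """Keep one item per location: emitters first, then the smaller ID."""
    best = {}
    for item in equipment:
        # Assumption: duplicates share the same coordinates (to 6 decimals).
        key = (round(item["lat"], 6), round(item["lon"], 6))
        # Lower rank wins: False sorts before True, then smaller ID.
        rank = (item["id"] not in emitting_ids, item["id"])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, item)
    return [item for _, item in best.values()]


# Hypothetical sample: two locations, each listed twice.
equipment = [
    {"id": 101, "lat": 32.0, "lon": -103.0},
    {"id": 57, "lat": 32.0, "lon": -103.0},   # duplicate of 101
    {"id": 300, "lat": 32.5, "lon": -103.5},
    {"id": 12, "lat": 32.5, "lon": -103.5},   # duplicate of 300
]
emitting_ids = {101}  # equipment with at least one reported emission

kept = deduplicate(equipment, emitting_ids)
print([item["id"] for item in kept])
```

In this sample, 101 is kept over 57 because it has a reported emission, and 12 is kept over 300 because neither has an emission and 12 is the smaller ID.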

Removing duplicates for CVX0002 results in 56 items of equipment being removed. The impact on the detection frequency is given in the table below.

With duplications removed:

Equipment Type	Observations	Detections	Detection Frequency
Compressor	250	82	32.80%
Facility Piping	30	12	40.00%
Flare	195	20	10.26%
Other	24	20	83.33%
Separator	1327	35	2.64%
Tank	1792	65	3.63%
VRU	30	0	0.00%
Well	568	4	0.70%
Overall	4216	238	5.65%

With duplications included:

Equipment Type	Observations	Detections	Detection Frequency
Compressor	252	82	32.54%
Facility Piping	30	12	40.00%
Flare	195	20	10.26%
Other	24	20	83.33%
Separator	1352	35	2.59%
Tank	1825	65	3.56%
VRU	30	0	0.00%
Well	574	4	0.70%
Overall	4282	238	5.56%
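The detection frequency in these tables is the number of detections divided by the number of observations. A short check of the overall rows, using a hypothetical helper name, shows why removing duplicates raises the frequency: the detection count stays at 238 while the observation count drops.

```python
def detection_frequency(detections, observations):
    """Detection frequency as a percentage, rounded to two decimals."""
    return round(100 * detections / observations, 2)


deduplicated = detection_frequency(238, 4216)
with_duplicates = detection_frequency(238, 4282)
print(deduplicated, with_duplicates)  # prints: 5.65 5.56
```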

Note that this is a simple deduplication process and not the same as the more comprehensive process used to create the infrastructure list described below.

Aggregating output

For the trend analysis we need to aggregate the data in a way that is consistent with how the old TEP data was aggregated. We therefore aggregate all observations for each item of equipment in each survey, giving one row per equipment item per survey. The CLI tool prepares some output for this process but does not yet do the complete aggregation.
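A minimal sketch of this kind of aggregation is shown below. It is not the logic of bridger_kmz/aggregate_output.py: the field names are assumptions, and the persistent or intermittent classification is left out because its rule is not given in this document.

```python
from datetime import datetime


def aggregate(observations):
    """One row per (survey, equipment_id) with counts and first/last scan."""
    rows = {}
    for obs in observations:
        key = (obs["survey"], obs["equipment_id"])
        row = rows.setdefault(key, {
            "survey": obs["survey"],
            "equipment_id": obs["equipment_id"],
            "scans": 0,
            "detections": 0,
            "first_scan": obs["time"],
            "last_scan": obs["time"],
        })
        row["scans"] += 1
        row["detections"] += int(obs["emission_detected"])
        row["first_scan"] = min(row["first_scan"], obs["time"])
        row["last_scan"] = max(row["last_scan"], obs["time"])
    return list(rows.values())


# Hypothetical combined emissions and non-detections.
observations = [
    {"survey": "CVX0002", "equipment_id": "T-1",
     "time": datetime(2023, 5, 1, 14, 0), "emission_detected": False},
    {"survey": "CVX0002", "equipment_id": "T-1",
     "time": datetime(2023, 5, 1, 10, 0), "emission_detected": True},
    {"survey": "CVX0002", "equipment_id": "W-9",
     "time": datetime(2023, 5, 1, 10, 5), "emission_detected": False},
]
for row in aggregate(observations):
    print(row)
```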

We need a special step for CVX0003 because we do not have an infrastructure list for this survey. Instead, we source infrastructure from other surveys to create a complete list. This is then used with the swath data and emissions data from the CVX0003 survey to create a list of non-detections. We run this fix using:

python bridger_kmz/cvx3_fix.py

We can then aggregate all outputs into their respective phases (as defined in the script referenced in the following command) using:

python bridger_kmz/aggregate_output.py

This will create the outputs in the cvx_processed/aggregate folder.

Equipment non-detections

Non-detections are computed based on the equipment list provided and the swath coverage.

Non-detections with missing swath coverage

In one particular survey (HES0007), the swath coverage provided was incomplete. The equipment report contained many more sites than were included in the swath coverage. Inspection of the coverage file showed that only sites with a detection were included. In general, the coverage file has swaths over many sites with no emission locations underneath them, and every item of equipment in the equipment report has at least one flyover. In addition to these inconsistencies in the supplied data, the operator was confident that all sites were covered, which is consistent with the fact that all their other surveys in the same basin had much larger coverage. Finally, the detection probabilities for this survey were significantly higher than for all other surveys for this operator in this basin: 8% instead of a value closer to 3%.

Considering these facts it is most likely that a data supply issue is the cause of the missing coverage. We have created a solution to impute the missing coverage in this special case. This process is based on the assumption that the missing coverage is only non-detections and that all emission events have otherwise been provided.

Typically, during the calculation of non-detections, all equipment items are spatially joined with the swath coverage polygons. The intersection is retained and forms the equipment non-detections file; because some sites have coverage at multiple times, equipment can appear more than once in this file. Generally all equipment items are covered. However, in the case of HES0007 there are equipment items that are not covered. When the parameter fill_missing_coverage is set to True, we take the items without coverage and create a single observation for each of them. The time chosen for this observation is arbitrary for most of our reporting, as long as it is within the range of the swath coverage data, so we simply use the first time from the swath data for all imputed records.
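The imputation step can be sketched as follows. The function signature, field names and the `imputed` marker column are assumptions added for illustration; only the rule itself (one observation per uncovered item, stamped with the first swath time) comes from the description above.

```python
from datetime import datetime


def fill_missing_coverage(equipment, covered_rows, swath_times):
    """Add one non-detection, at the first swath time, per uncovered item."""
    covered_ids = {row["equipment_id"] for row in covered_rows}
    first_time = min(swath_times)
    imputed = [
        {
            "equipment_id": item["id"],
            "time": first_time,
            "emission_detected": False,
            "imputed": True,  # hypothetical flag so these rows can be filtered
        }
        for item in equipment
        if item["id"] not in covered_ids
    ]
    return covered_rows + imputed


# Hypothetical sample: W-9 has no swath coverage in the supplied data.
swath_times = [datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 14, 0)]
equipment = [{"id": "T-1"}, {"id": "W-9"}]
covered_rows = [
    {"equipment_id": "T-1", "time": swath_times[1], "emission_detected": False},
]

for row in fill_missing_coverage(equipment, covered_rows, swath_times):
    print(row)
```

Marking imputed rows explicitly is one way to keep them out of analyses or uploads that depend on the real observation time.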

This process can only give an estimate of the true coverage. In many cases sites are flown over multiple times. Fortunately, repeat site visits are generally only made when a detection was made on the first flight, so if our primary assumption holds (that only non-detections are missing), a single site visit will mostly be appropriate. This will not always be the case, as sites can be flown multiple times, but it is the best solution we can provide.

While most reporting we currently do does not depend on the actual observation time, for data validity in downstream processes we need to put a value here. However, we must be cautious of any future analysis that may actually make inferences from it. We are also not able to import this into the Aershed platform as it could create serious inconsistencies with other data sources (for example, if a third party had a detection at the time of our arbitrarily chosen non-detection time).

To run the process with this fix included, the following parameter must be added to the configuration file.

    fill_missing_coverage: True
