Extract emission data and images from Bridger KMZ files

The KMZ files supplied by Bridger are archives containing a KML file (an XML document) and a number of PNG plume images. This tool extracts the emission data, flight coverage, site polygons and equipment from the KML file, together with the images. It uses the image bounds defined in the KML file to convert the PNG plume images to GeoTIFF format.
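As a rough illustration of how image bounds can georeference a PNG, the sketch below reads the `LatLonBox` of each `GroundOverlay` and derives a GDAL-style six-value geotransform for a north-up image. The sample KML, the function name `overlay_geotransforms` and the image size are assumptions made for this example, not taken from the tool. Any rotation element is ignored, and writing the actual GeoTIFF (which needs a raster library) is not shown.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Hypothetical KML fragment; real Bridger files may be structured differently.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <GroundOverlay>
      <name>plume_1</name>
      <Icon><href>images/plume_1.png</href></Icon>
      <LatLonBox>
        <north>32.5</north>
        <south>32.0</south>
        <east>-103.0</east>
        <west>-104.0</west>
      </LatLonBox>
    </GroundOverlay>
  </Document>
</kml>"""


def overlay_geotransforms(kml_text, width, height):
    """Return {href: geotransform} for every GroundOverlay in the KML text.

    The geotransform is (west, pixel_width, 0, north, 0, -pixel_height),
    which maps pixel (column, row) to (longitude, latitude).
    width and height are the PNG size in pixels (assumed known here).
    """
    root = ET.fromstring(kml_text)
    result = {}
    for overlay in root.iterfind(".//kml:GroundOverlay", KML_NS):
        href = overlay.findtext("kml:Icon/kml:href", namespaces=KML_NS)
        box = overlay.find("kml:LatLonBox", KML_NS)
        north, south, east, west = (
            float(box.findtext(f"kml:{tag}", namespaces=KML_NS))
            for tag in ("north", "south", "east", "west")
        )
        pixel_width = (east - west) / width
        pixel_height = (north - south) / height
        result[href] = (west, pixel_width, 0.0, north, 0.0, -pixel_height)
    return result


transforms = overlay_geotransforms(SAMPLE_KML, width=200, height=100)
print(transforms["images/plume_1.png"])
```

The top-left corner of the image sits at (west, north), and the negative last value reflects that image rows run southwards.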

The KML file contains HTML tables (designed for display in a UI) which hold the emission data. While this is not an ideal format, it is sufficiently structured to be parsed and extracted. The overall structure of the KML file (a nesting of "folders") is not consistent across the files we have tested. This tool is designed to be flexible enough to handle these differences, but some scenarios may still require further modifications.
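To show what parsing such a table can look like, here is a minimal sketch using only the standard library's `html.parser`. The sample description, its field names, and the helper names `TableRows` and `description_to_dict` are hypothetical; the real tables may have a different layout, and the tool itself may parse them differently.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def description_to_dict(html):
    """Turn a two-column key/value table into a dict (other rows are skipped)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


# Hypothetical placemark description; the field names are illustrative only.
SAMPLE = """
<table>
  <tr><th>Emitter ID</th><td>E-001</td></tr>
  <tr><th>Emission Rate (kg/hr)</th><td>12.5</td></tr>
</table>
"""
record = description_to_dict(SAMPLE)
print(record)
```

All values come back as strings, so numeric fields would still need converting before use.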

CLI Tool

This tool is designed to be used on the command line to extract data from a KMZ file. To run the CLI tool, you need either Docker installed on your machine or a manually configured Python environment (see the Dockerfile for the steps to follow). Use the following command.

Process emissions only

```bash
docker-compose run cli-tool --input "input_files/CVX0014 - Processed Report.kmz" --process_images --output_folder "my_output"
```

It will not run if the output folder already exists. If you want to force it to remove the output folder you can use the -f flag.

Enrich emissions data with equipment type

The processed reports do not link emissions with equipment. This linkage seems to be available only in the equipment report Excel files; it is not even available in the equipment report KMZs, perhaps because these are meant to be viewed on a map, so it wasn't considered necessary. We therefore load the tab "Emission Location Extended" from the equipment report XLSX file and use it to link equipment with the emissions. The result is an emissions table that contains the equipment type and ID.

To link to an Excel file for the equipment report, add the following argument to the CLI command.

```bash
--equipment_report_xlsx "/cvx_import/CVX0006/CVX0006 Equipment Report v2.xlsx"
```

Generate non-detections

The equipment report XLSX file only contains the equipment for which there was a detection. To generate a list of non-detections we need to get the full equipment list and combine it with the swath coverage.

The XLSX file does sometimes contain the full list of equipment, but it is not always there or easy to parse. It also only gives the number of scans, the first time and the last time; it does not contain all the individual times.

So instead we use the equipment report KMZ file to create a list of equipment, locations and types. We then merge this with the swath coverage from the processed emission report KMZ file to create a list of what equipment was covered. Finally, we merge this with the emissions to create a list of non-detections.
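The merge described above can be sketched roughly as follows. This is a simplified illustration, not the tool's implementation: the field names, the sample data, the use of a point-in-polygon test for coverage, and the exact-match comparison of times are all assumptions made for the example.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def non_detections(equipment, swaths, emissions):
    """List equipment covered by a swath with no emission at that time."""
    detected = {(e["equipment_id"], e["time"]) for e in emissions}
    rows = []
    for item in equipment:
        for swath in swaths:
            covered = point_in_polygon(item["lon"], item["lat"], swath["polygon"])
            if covered and (item["equipment_id"], swath["time"]) not in detected:
                rows.append({
                    "equipment_id": item["equipment_id"],
                    "type": item["type"],
                    "time": swath["time"],
                    "emission_detected": False,
                })
    return rows


# Hypothetical data: two items of equipment, two swaths, one emission.
equipment = [
    {"equipment_id": "T-1", "type": "Tank", "lon": 0.5, "lat": 0.5},
    {"equipment_id": "W-1", "type": "Well", "lon": 2.5, "lat": 0.5},
]
swaths = [
    {"time": "2023-01-01T10:00", "polygon": [(0, 0), (1, 0), (1, 1), (0, 1)]},
    {"time": "2023-01-01T10:05", "polygon": [(0, 0), (3, 0), (3, 1), (0, 1)]},
]
emissions = [{"equipment_id": "T-1", "time": "2023-01-01T10:05"}]

result = non_detections(equipment, swaths, emissions)
for row in result:
    print(row["equipment_id"], row["time"])
```

Here the tank is covered by both swaths but has an emission during the second, so it yields one non-detection; the well is covered only by the second swath and also yields one.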

To provide the equipment report KMZ use the following argument with the CLI tool.

```bash
--equipment_report_kmz "/cvx_import/CVX0006/CVX0006 Equipment Report v2.kmz"
```

The original Bridger files contain non-detects, but only for equipment that has at least one detection; equipment that never has an emission never appears as a non-detect. This is the primary reason we created this method. Note that non-detections computed using this method may differ slightly from those provided by Bridger because we use clustering: where Bridger reports two "non-detections" within a minute or two of each other, our approach combines them into a single non-detection event. Fewer non-detection events mean fewer observations overall, so the detection frequency increases slightly.
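The effect of combining closely spaced non-detections can be illustrated with a simple time-gap clustering. This is only a sketch: the source does not specify the clustering method or threshold, so the five-minute gap, the function name `cluster_times` and the sample times are assumptions.

```python
from datetime import datetime, timedelta


def cluster_times(times, max_gap=timedelta(minutes=5)):
    """Group timestamps so that consecutive members are at most max_gap apart."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)  # close enough: same event
        else:
            clusters.append([t])    # start a new event
    return clusters


# Hypothetical non-detection times for one item of equipment.
times = [
    datetime(2023, 1, 1, 10, 0, 0),
    datetime(2023, 1, 1, 10, 1, 30),
    datetime(2023, 1, 1, 10, 40, 0),
]
clusters = cluster_times(times)

# With one detection, compare the detection frequency before and after clustering.
detections = 1
freq_unclustered = detections / (detections + len(times))
freq_clustered = detections / (detections + len(clusters))
print(len(clusters), freq_unclustered, freq_clustered)
```

The first two times fall into one event, so three non-detections become two and the detection frequency for this item rises from 1/4 to 1/3.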

Outputs

Depending on the inputs provided (see the section above) several output files are created.

emissions.csv

This is the primary emissions table. It is formatted for upload into the Aershed Platform. This table contains some non-detections (emission_detected=False) for equipment that had a detection at another time. Equipment that never has a detection does not appear in this file.

non_detections.csv

This file contains a list of non-detections for all equipment found in the equipment report. This is the fully processed final form with emissions removed. If an item of equipment has an emission at a specific time in the emissions table, then it will not appear in this list at that specific time.

combined.csv and combined_for_aggregate.csv

These files are simply the emissions and non-detections concatenated into one table. They differ in two ways: the column names, which are intended for different tools, and the extra entries that combined.csv includes for entire sites that have non-detections. These extra entries avoid a downstream issue where sites were not counted if they had no detections. They need to be filtered out in some cases.

combined_for_aggregate.csv is intended for use with the bridger_kmz/aggregate_output.py script, which aggregates data into the form of the original equipment reports from Bridger. It takes data from multiple surveys and creates a single row for each item of equipment, with a first scan and last scan time, and computes the persistent or intermittent classification. Its output is identical to the original TEP data we have used in the past for trend analysis.

combined.csv is in the format required for the A/B analysis.

Neither of these files is suitable for upload into the Aershed Platform because of the column names used.

equipment.csv

This file contains all the equipment from the equipment report KMZ file. This is still a work in progress as it may contain some duplicates (same location but different equipment ID). Once complete it will be a source of truth of equipment for that survey.

infrastructure.csv

Note: will rename this to sites.csv for next run and fix site names

This file contains a list of sites and site polygons. This will eventually be combined with the equipment file.

swath_coverage.csv

This file contains the coverage of the survey. It is a direct copy of the information from the processed report KMZ file. It is used to construct the non-detections.

aggregate/*.csv

This folder at the higher level contains the combined aggregate data. The files and what they contain are defined in the script bridger_kmz/aggregate_output.py. For more details see the section Aggregating output below.

Removing duplicate equipment

In some cases the same equipment is listed multiple times in the infrastructure list provided by Bridger in the equipment report. These duplicates affect the results because they create additional non-detections. It is even possible to have a "non-detection" at the exact same time and location as a detection because of them. So we remove duplicates, prioritising equipment that has a reported emission. If none of the duplicates has an emission, we keep the one with the numerically smaller ID.
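The priority rule can be sketched as below. This is an illustration under assumptions, not the tool's code: duplicates are taken to be entries with exactly the same coordinates, IDs are integers, and when more than one duplicate has an emission the smaller ID is kept (the source does not say how that case is handled). The function name `deduplicate` and the sample data are hypothetical.

```python
def deduplicate(equipment, ids_with_emissions):
    """Keep one entry per location: prefer an emitter, then the smaller ID."""
    best = {}
    for item in equipment:
        key = (item["lat"], item["lon"])
        # Lower rank wins: False (has an emission) sorts before True,
        # then the numerically smaller ID breaks the tie.
        rank = (item["equipment_id"] not in ids_with_emissions, item["equipment_id"])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, item)
    return [item for _, item in best.values()]


# Hypothetical equipment list with two duplicated locations.
equipment = [
    {"equipment_id": 101, "lat": 1.0, "lon": 2.0},
    {"equipment_id": 57, "lat": 1.0, "lon": 2.0},
    {"equipment_id": 300, "lat": 5.0, "lon": 6.0},
    {"equipment_id": 12, "lat": 5.0, "lon": 6.0},
]
kept = deduplicate(equipment, ids_with_emissions={101})
kept_ids = [item["equipment_id"] for item in kept]
print(kept_ids)
```

At the first location ID 101 is kept because it has an emission, even though 57 is smaller; at the second, neither has an emission, so the smaller ID 12 is kept.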

Removing duplicates for CVX0002 results in 56 items of equipment being removed. The impact on the detection frequency is given in the table below.

With duplications removed:

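| Equipment Type | Observations | Detections | Detection Frequency |
| --- | --- | --- | --- |
| Compressor | 250 | 82 | 32.8% |
| Facility Piping | 30 | 12 | 40.0% |
| Flare | 195 | 20 | 10.26% |
| Other | 24 | 20 | 83.33% |
| Separator | 1327 | 35 | 2.64% |
| Tank | 1792 | 65 | 3.63% |
| VRU | 30 | 0 | 0.0% |
| Well | 568 | 4 | 0.7% |
| Overall | 4216 | 238 | 5.65% |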

With duplications included:

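| Equipment Type | Observations | Detections | Detection Frequency |
| --- | --- | --- | --- |
| Compressor | 252 | 82 | 32.54% |
| Facility Piping | 30 | 12 | 40.0% |
| Flare | 195 | 20 | 10.26% |
| Other | 24 | 20 | 83.33% |
| Separator | 1352 | 35 | 2.59% |
| Tank | 1825 | 65 | 3.56% |
| VRU | 30 | 0 | 0.0% |
| Well | 574 | 4 | 0.7% |
| Overall | 4282 | 238 | 5.56% |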
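The detection frequency in these tables is the number of detections divided by the number of observations. As a quick check, the short sketch below recomputes the two "Overall" figures from the values above; the function name `detection_frequency` is chosen for this example only.

```python
def detection_frequency(detections, observations):
    """Detection frequency as a percentage of observations."""
    return 100.0 * detections / observations


# "Overall" rows from the two tables above.
deduplicated = detection_frequency(238, 4216)
with_duplicates = detection_frequency(238, 4282)
print(f"{deduplicated:.2f}% vs {with_duplicates:.2f}%")  # 5.65% vs 5.56%
```

The number of detections is unchanged, so the difference comes entirely from the extra observations that the duplicated equipment adds.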

Note that this is a simple deduplication process and not the same as the more comprehensive process used to create the infrastructure list described below.

Aggregating output

For the trend analysis we need to aggregate the data in a way that is consistent with how the old TEP data was aggregated. So we aggregate all observations for each item of equipment in each survey, giving one row per equipment item per survey. The CLI tool prepares some output for this process, but does not do the complete aggregation yet.
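A minimal sketch of the "one row per equipment item per survey" idea is shown below. It is not the logic of aggregate_output.py: the field names other than emission_detected, the sample rows and the function name `aggregate` are assumptions, and the persistent or intermittent classification is left out because its rule is not described here.

```python
from collections import defaultdict


def aggregate(observations):
    """Collapse observations to one row per (survey, equipment_id)."""
    groups = defaultdict(list)
    for obs in observations:
        groups[(obs["survey"], obs["equipment_id"])].append(obs)

    rows = []
    for survey, equipment_id in sorted(groups):
        group = groups[(survey, equipment_id)]
        times = [o["time"] for o in group]  # ISO strings sort chronologically
        rows.append({
            "survey": survey,
            "equipment_id": equipment_id,
            "first_scan": min(times),
            "last_scan": max(times),
            "scans": len(group),
            "detections": sum(1 for o in group if o["emission_detected"]),
        })
    return rows


# Hypothetical combined emissions and non-detections for one survey.
observations = [
    {"survey": "CVX0002", "equipment_id": "A", "time": "2023-01-01T10:00", "emission_detected": False},
    {"survey": "CVX0002", "equipment_id": "A", "time": "2023-01-01T10:30", "emission_detected": True},
    {"survey": "CVX0002", "equipment_id": "B", "time": "2023-01-01T11:00", "emission_detected": False},
]
rows = aggregate(observations)
for row in rows:
    print(row)
```

Equipment A collapses to a single row with two scans and one detection, and equipment B to a row with one scan and no detections.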

CVX0003 needs a special step because we do not have an infrastructure list for it. Instead, we source infrastructure from other surveys to create a complete list, which is then used with the swath data and emissions data from the CVX0003 survey to create a list of non-detections. We run this fix using:

```bash
python bridger_kmz/cvx3_fix.py
```

Then we can aggregate all outputs into their respective phases (as defined in the script referenced in the following command) using:

```bash
python bridger_kmz/aggregate_output.py
```

This will create the outputs in the cvx_processed/aggregate folder.