This documentation covers the integration of the GHGSat API into the datalab and the process for importing emissions, non-detections and site data into the Aershed platform.

GHGSat API client

The client for interacting with the GHGSat API is defined in the file ghgsat/client.py and contains an interface with a function for each data type we require. Authentication is done once at initialisation. The auth token is fetched based on the following environment variables, which must be set in your base or environment-specific .env file:

GHGSAT_API_URL=https://spectra-api.ghgsat.com/api/public/v1
GHGSAT_EMAIL=
GHGSAT_PASSWORD=

The email used is the Aerscape ops email; these details can be found in BitWarden.

The auth token can be cached between runs by setting the configuration parameter cache_token to True. This will store the token in a JSON file in your job's output directory. On the next run the client will attempt to use this token; if that fails, it will fetch a new token and try again. It is preferable to cache the token when making multiple requests while developing or testing.
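The caching behaviour described above can be sketched as follows. This is an illustrative outline only, assuming hypothetical method names and a `{"token": ...}` JSON shape; the real client lives in ghgsat/client.py and may differ.

```python
import json
import os
from pathlib import Path


class GHGSatClient:
    """Sketch of the auth/token-caching flow (not the actual client)."""

    def __init__(self, output_dir, cache_token=False):
        self.base_url = os.environ.get(
            "GHGSAT_API_URL", "https://spectra-api.ghgsat.com/api/public/v1"
        )
        self.cache_token = cache_token
        self.cache_path = Path(output_dir) / "token.json"
        # Try the cached token first; fall back to fetching a fresh one.
        self.token = self._load_cached_token() if cache_token else None
        if self.token is None:
            self.token = self._fetch_token()

    def _load_cached_token(self):
        # Reuse a previously cached token if the JSON file exists.
        if self.cache_path.exists():
            return json.loads(self.cache_path.read_text()).get("token")
        return None

    def _fetch_token(self):
        # Placeholder: the real client authenticates using GHGSAT_EMAIL
        # and GHGSAT_PASSWORD against the GHGSat auth endpoint.
        token = "new-token"
        if self.cache_token:
            self.cache_path.write_text(json.dumps({"token": token}))
        return token
```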

Sites

The GHGSat API provides a limited number of operator sites around the world. Most of these we already have, but some are new. We need to stay up to date with this data for two reasons:

  • We use the GHGSat site identifier to correctly match observations to site non-detections.
  • If we do not already have the same infrastructure in Aershed then we need to import it to avoid missing a matched emission event.

GHGSat provides a unique id and type for each site. We rename these to ghgsat_id and ghgsat_site_type and then perform an infrastructure update against the Aershed database. GHGSat sites are point locations, and in most cases these points fall inside existing Aershed sites, so we simply add these two fields as extra data.

In some cases there are multiple GHGSat sites inside an existing polygon. In this case we concatenate the ghgsat_id and type (if different) into a semicolon-separated list. This list is then expanded later when matching non-detection events to sites (to avoid missing them by matching exactly on the ghgsat_id field).
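The concatenate-then-expand behaviour above can be sketched like this. The helper names and dict keys are assumptions for illustration, not the actual datalab functions.

```python
def combine_site_fields(ghgsat_sites):
    """Combine multiple GHGSat point sites that fall inside one Aershed
    polygon into semicolon-separated fields (illustrative helper)."""
    ids = ";".join(s["ghgsat_id"] for s in ghgsat_sites)
    # Only keep distinct types, preserving first-seen order.
    types = []
    for s in ghgsat_sites:
        if s["ghgsat_site_type"] not in types:
            types.append(s["ghgsat_site_type"])
    return {"ghgsat_id": ids, "ghgsat_site_type": ";".join(types)}


def expand_site_ids(field):
    """Expand the semicolon-separated list back into individual ids when
    matching non-detection events to sites."""
    return field.split(";")
```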

To create an infrastructure update use the configuration in config/ghgsat/sites.yaml. In this file you can optionally specify to filter to a geographic "zone". This process will download all available sites and perform an infrastructure update against the chosen environment.

Emissions

Emissions are split into GHGSat provided observations and reprocessed third-party data. The details of each are given below.

In each case emissions are downloaded with a fixed start date defined in the configuration file. This is not currently a rolling filter. The appropriate emissions API for each type is used to get a list of events from the start date and matching the other filters. This is then compared against emission events already imported into the Aershed database (by making a database query) and the list filtered to only new events.

Once the list of new events has been created the corresponding plume images are downloaded. This means only new assets are downloaded and we avoid having to download all items every time the process is run. Details on plume image processing are given below.
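The two filtering steps above (new events only, then assets for new events only) can be sketched as follows. The record shapes and field names here are assumptions, not the exact datalab schema.

```python
def filter_new_events(api_events, existing_ids):
    """Keep only events not already imported into Aershed.

    `api_events` is a list of event dicts from the emissions endpoint;
    `existing_ids` is a set of event ids returned by a database query.
    """
    return [e for e in api_events if e["id"] not in existing_ids]


def assets_to_download(new_events):
    # Only plume images for new events are fetched, so repeated runs
    # avoid re-downloading every asset.
    return [e["plume_image_url"] for e in new_events if e.get("plume_image_url")]
```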

GHGSat emissions

These emission events are observed by GHGSat satellites and are available from the deliverables/emissions/ endpoint. We only download events with gas type CH4 and type DELIVERABLE_SITE_OBSERVATION. In configuration and code we refer to these as the emissions.

Global Emissions

These emission events come from third-party satellites and are reprocessed by GHGSat. They are retrieved from the deliverables/emissions_global_survey/ endpoint. We only download events with gas type CH4 and type DELIVERABLE_SITE_OBSERVATION. In configuration and code we refer to these as the global_emissions.

The sensor field is mapped to an Aerscape third-party data provider name and used for the secondary_data_source field on the import record. The mapping is, at the time of writing:

| GHGSat sensor name | Aershed secondary data source |
| --- | --- |
| Sentinel-2 | Sentinel-2 - ESA |
| Sentinel-5p | Sentinel-5P/TROPOMI - ESA |
| EMIT | EMIT - NASA |
| Landsat-8 | Landsat - NASA/USGS |

If a new sensor name appears that is not defined in our mapping, a warning is logged and a Slack notification is sent. To add a new mapping, look up the secondary data source values in the Django admin page under the GHGSat emission provider, then add the corresponding value in the ghgsat/get_global_emissions.py file.
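The mapping and the unknown-sensor warning can be sketched as below. The dict mirrors the table above; the constant and function names are illustrative (the real mapping lives in ghgsat/get_global_emissions.py, and the real process also sends a Slack notification).

```python
import logging

logger = logging.getLogger(__name__)

# GHGSat sensor name -> Aershed secondary data source (from the table above).
SENSOR_TO_SECONDARY_SOURCE = {
    "Sentinel-2": "Sentinel-2 - ESA",
    "Sentinel-5p": "Sentinel-5P/TROPOMI - ESA",
    "EMIT": "EMIT - NASA",
    "Landsat-8": "Landsat - NASA/USGS",
}


def secondary_data_source(sensor):
    """Resolve secondary_data_source for an event; log unknown sensors."""
    source = SENSOR_TO_SECONDARY_SOURCE.get(sensor)
    if source is None:
        logger.warning("Unknown GHGSat sensor name: %s", sensor)
    return source
```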

Non-detections

Non-detections are inferred based on observations and emission events. Observations are downloaded from the deliverables/observations endpoint using the filter for type DELIVERABLE_SITE_OBSERVATION. Non-detections are only processed and imported for GHGSat emissions, not for global emissions.

The process of creating non-detections is as follows. First we download both observations and emissions directly from the GHGSat API. These are then joined on ghgsat_id and observation_id, and we filter to only the observations that do not match an emission. We then download sites from the Aershed database and link them to the remaining observations on ghgsat_id. The final non-detections file is then created from the observation time, Aershed site identifier and detection limit. The GHGSat observation ID is also stored.

The non-detections file is then compared with existing non-detections in the Aershed database. Any new non-detections are then written to file and imported in the next step of the process.

When linking with sites, if there are observations remaining without a matched site then a warning is logged. The missing sites will be imported the next time the infrastructure update process runs. If missing-site errors persist, further investigation is required to understand why the corresponding ghgsat_id is not present on any sites stored for this operator.
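The anti-join and site-linking steps above can be sketched with plain dicts. The record shapes and function name are assumptions for illustration; the datalab implementation may use dataframes and different field names.

```python
def build_non_detections(observations, emissions, sites):
    """Derive non-detections from observations without a matching emission.

    observations: list of {"observation_id", "ghgsat_id", "time", "detection_limit"}
    emissions:    list of {"observation_id", "ghgsat_id"}
    sites:        list of {"ghgsat_id", "aershed_site_id"} from the Aershed DB
    """
    matched = {(e["ghgsat_id"], e["observation_id"]) for e in emissions}
    # Anti-join: keep only observations with no corresponding emission.
    unmatched = [
        o for o in observations
        if (o["ghgsat_id"], o["observation_id"]) not in matched
    ]
    site_lookup = {s["ghgsat_id"]: s["aershed_site_id"] for s in sites}
    non_detections, missing = [], []
    for o in unmatched:
        site_id = site_lookup.get(o["ghgsat_id"])
        if site_id is None:
            missing.append(o)  # the real process logs a warning for these
            continue
        non_detections.append({
            "site_id": site_id,
            "time": o["time"],
            "detection_limit": o["detection_limit"],
            "observation_id": o["observation_id"],
        })
    return non_detections, missing
```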

Plume image processing

The plume images from the emissions and global emissions products can differ. Even within the global emissions product the plume images vary, as each provider can use a different format. For this reason we clean all GHGSat images ourselves, taking care of the processing prior to import, so that the Aershed platform always receives images in a consistent state.

To process a sample set of images using the GHGSat image processing script, use the example configuration in config/ghgsat/test-images.yaml. Create a folder in your environment of choice; for example, if your PROCESSING_DIR is the data folder in the root of the datalab tools then make a directory at data/local/plume_images/raw and place the raw plume images there.

Then run the program with the following command:

./scripts/run.sh config/ghgsat/test-images.yaml --env local

Imports

All imports of GHGSat data are now handled directly here in the datalab tools package. Importers are defined in the importers/ folder. Each importer has the following main features. Although the processing scripts already filter out existing records for the given environment, the importer performs an additional check that the records to be imported are not already in the Aershed database. The importer also checks that the owner matches that specified in the CSV file (for import types with the owner in the file) and performs additional validity checks.

Scheduled processing

The GHGSat import processes are set up to run on a schedule. Each process is designed to be run repeatedly, only creating import files for new changes and then only running an import if required.

We use cron to run the scheduled jobs. The following line is an example of one of the cron job definitions:

0 */12 * * * /usr/bin/bash -c '/usr/bin/bash /home/USER/datalab-tools/scripts/run.sh config/ghgsat/sites.yaml --force --env qf487-prod >> /home/USER/logs/ghgsat-sites.log 2>&1'

This runs the job at minute 0 every 12 hours (i.e. at 00:00 and 12:00).

When setting up a new instance the following must be done:

  • Create a new directory for logs.
  • Update the cron file using the command sudo crontab -e and copy the line above, pointing to your datalab tools repository and logs directory.
  • Make sure that the datalab tools Docker image is built and tested before leaving this to run automatically.

The following jobs are currently set up to run every 12 hours:

| Minutes past hour | Job | Configuration file |
| --- | --- | --- |
| 0 | Sites | ghgsat/sites.yaml |
| 5 | Emissions | ghgsat/emissions.yaml |
| 10 | Non-detections | ghgsat/non_detections.yaml |
| 15 | Global emissions | ghgsat/global_emissions.yaml |

Each job involves both download and processing of data and the import step.

The output folder for each job has a dynamic name based on the timestamp. For example, the global emissions job has the following configuration:

base_output_dir: "${PROCESSING_DIR}/qf487/ghgsat/global-emissions-${TIMESTAMP}/"

This enables each job to run independently and allows auditing of previous jobs, as raw downloaded data is never overwritten.
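The placeholder expansion in the configuration above can be sketched as follows. The function name and the timestamp format are assumptions for illustration; the datalab's actual substitution mechanism may differ.

```python
from datetime import datetime, timezone
from string import Template


def resolve_output_dir(template, processing_dir, now=None):
    """Expand the ${PROCESSING_DIR} and ${TIMESTAMP} placeholders in a
    job's base_output_dir setting (illustrative sketch)."""
    now = now or datetime.now(timezone.utc)
    return Template(template).substitute(
        PROCESSING_DIR=processing_dir,
        # Hypothetical timestamp format; the real jobs may format differently.
        TIMESTAMP=now.strftime("%Y%m%d-%H%M%S"),
    )
```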