Infrastructure update user guide
The infrastructure update process takes a CSV file and compares it with any existing infrastructure in the database. The result is a JSON file, conforming to the protocol of the infrastructure import API, that contains the appropriate updates. For more information on how the update process works, see the infrastructure process documentation.
This guide steps through how to take a CSV file and create a JSON file ready for import.
First follow the steps in the introduction for setting up the Data Lab tools project. Use the Docker guidelines for the quickest and simplest setup.
Example scenario
See the example sites import CSV file at datalab_tools/data/example/sites.csv. This contains some polygon and point sites and some metadata such as site names, status and a unique ID.
We will create an import using the configuration file config/infrastructure_updates/example.yaml. Inspect this file alongside the guide below to understand how the process works and what the options do. If you want to test an import, create a fake company and then replace the owner value at the top of the configuration file with that company's identifier.
To create an import run the following command.
docker compose run --rm cli-tool python datalab_tools/cli.py config/infrastructure_updates/example.yaml

This will create a file in the folder data/example/processed/data_imports_new_only (see below for more details on the outputs from this program).
After importing, run this process again to verify that no new sites or updates to sites are created. To test creating an update, modify a value in the CSV file or add a whole new column (and reference it in the configuration file), then run the process again.
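For example, to test an update you could add a new column to sites.csv and reference it in the configuration file. The following is a minimal sketch of the relevant configuration, assuming a hypothetical new CSV column named Inspection Notes:

infrastructure_update:
  - output_folder: ""
    extra_data:
      - "inspection_notes"                        # hypothetical column pulled through to the import
    updated_data:
      - path: "sites.csv"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"
        field_mapping:
          "Inspection Notes": "inspection_notes"  # rename the CSV header to the name listed in extra_data

Running the command again with this configuration should then produce update records for the affected sites.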
Infrastructure update program
The infrastructure_update program can combine different types of data inputs and perform the required transformations to produce an update file ready for import. The following describes what this program can do, its input requirements, and the available configuration options.
See any of the YAML files in the config/infrastructure_updates/ folder for examples.
Input files
The infrastructure_update program accepts CSV, KMZ and KML files as input. If a file is not in one of these formats then it should first be transformed into a CSV file using an ad-hoc custom script; see the datalab_tools/infrastructure/custom folder for some examples. The KMZ and KML input types are fragile and expect certain properties; these will be made more flexible in the future.
CSV input file requirements
Input files must have at least the following columns present:
- geometry: the geometry as a POINT or POLYGON in WKT format (longitude, latitude) in the EPSG:4326 projection.
- Type: either Site or Equipment.
Some site types can still end up as equipment if they are found to be within a site polygon. See the process documentation for more details.
Optional fields that will be mapped to the corresponding fields in the output file are:
- site_name: The name of the site.
- operator_status: The operational status of the site.
- operator_unique_id: A unique identifier for the operator.
- other_operator_ids: Additional identifiers for the operator.
- subpart_w_segment: A subpart or segment associated with the site.
- date_of_installation: The installation date of the site or equipment.
- start_date: The start date as a string in the format YYYY-MM-DD.
- end_date: The end date as a string in the format YYYY-MM-DD (can also be set at a per-file level, see below).
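For illustration, a minimal input CSV containing the required columns plus a few of the optional ones might look like this (the rows below are invented):

geometry,Type,site_name,operator_status,operator_unique_id
"POINT (-104.95 40.42)",Site,Example Pad A,Active,SITE-001
"POLYGON ((-104.96 40.43, -104.95 40.43, -104.95 40.44, -104.96 40.44, -104.96 40.43))",Site,Example Pad B,Producing,SITE-002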
The following values must be set in the YAML file for each input file; see below for more information.
- geometry_source: The source of the geometry data.
- data_source: The source of the site data.
Any additional columns can be pulled through the process, but only if they are specified in the configuration file under the extra_data key (see below). If these columns are specified then they will also be extracted and considered for differential updates when pulling data from the Aershed database.
Outputs
This program will create the following output files relative to the base_output_dir and output_folder defined in the configuration file.
- data_imports_new_only/DATE_infrastructure_upload_new_OPERATOR.json: this file contains only CREATE_SITE records. This is useful if you only want to import new data.
- data_imports/DATE_infrastructure_upload_updated_OPERATOR.json: this file contains both create and update records. This covers all changes, so it can be used without needing the new-only file.
There will also be one or two KML files (depending on whether there are any updates) created in the root of the datalab_tools project; these are ad-hoc files that are useful for inspection in Google Earth.
Configuration file
The following explains the possible values in the configuration for the infrastructure_update program. An example configuration is given below.
infrastructure_update:
  - output_folder: ""
    overlap_threshold: 0.5
    handle_nearby_points: true
    number_of_batches: 1
    process_equipment: false
    nearby_point_threshold: 50
    extra_data:
      - "extra_info"
    import_parameters:
      cross_operator_match_distance_m: 25
    inactive_status_values:
      - "Reclaimed"
      - "Abandoned"
    updated_data:
      - path: "sites.csv"
        geometry_column: "geometry"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"
        create_polygons_width: 1e-5
        field_mapping:
          "name": "site_name"
          "Status": "operator_status"
          "unique-id": "operator_unique_id"
      - path: "extra/polygons.kml"
        type: "kml"
        priority: 3
        geometry_source: "DRAWN_BY_AERSCAPE"
        load_points: false

At the program level we have:
- output_folder: if not null then a sub-folder will be created under base_output_dir and all outputs will be relative to that directory.
- overlap_threshold: default value is 0.9, corresponding to 90%. If two sites have an overlap greater than this value then we consider them to be duplicates and remove one. If both sites have the same priority value then we keep the site with the largest area, otherwise the site with the highest priority is kept.
- number_of_batches: (default 1) how many batches to split the data into for the update process. New data is clustered spatially into the set number of clusters, and existing data is filtered to the same spatial extent with a buffer around it.
- handle_nearby_points: (default False) if set to true then points close to polygons are considered in the matching. If a point is sufficiently close then it can be used to update a polygon, such as providing a site name to a nearby polygon that does not have one, and combining nearby points into sites. See the process documentation and example scenarios for more information.
- nearby_point_threshold: (default 50) the approximate distance in metres within which point sites will be merged with polygon sites. The value is converted to degrees using the approximate conversion degrees = nearby_point_threshold / 111320 (for the default of 50 this is roughly 0.00045 degrees).
- process_equipment: (default True) process equipment infrastructure. This option exists to enable turning off equipment processing for operators with large amounts of infrastructure that can slow down the update process.
- extra_data: a list of additional columns that should be pulled through the update process. These will appear with the given names in the data import JSON file and will be put into the extra_data field of the infrastructure in the Aershed database.
- inactive_status_values: an array of status values that are considered inactive. Sites with these values in the operator_status column may be ignored; for more details on the scenarios in which they are still imported see the process documentation.
- import_parameters: an object of any type that will be passed directly, in the same form, into the params field in the infrastructure import file.
- handle_nearby_points_in_new_data: (default True) this is intended as an override for testing only. Generally we always want to avoid importing new data with points very close to polygons, but setting this to False will enable that to happen. For consistent processing handle_nearby_points must also be undefined (default value) or set to False (see the sketch after this list).
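As a minimal sketch, a program-level configuration that uses the testing override described above could look like the following; the values are illustrative only and assume the example sites.csv input:

infrastructure_update:
  - output_folder: "testing"
    number_of_batches: 2                     # split the update into two spatial batches
    handle_nearby_points: false              # must be false (or left unset) for the override below
    handle_nearby_points_in_new_data: false  # testing only: allow new points close to polygons
    updated_data:
      - path: "sites.csv"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"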
At the input file level we have (under the updated_data key):
- path: the path to the file relative to the base_input_dir.
- geometry_source: the value to put into the geometry_source field in the resultant output. Must be one of UNKNOWN, BRIDGER, OPERATOR_PROVIDED, DRAWN_BY_AERSCAPE, SCRAPED_FROM_PUBLIC_DATA, GROUND_SURVEY.
- geometry_column: default is geometry, but it can be changed to the name of the column in the input file that contains the WKT data.
- polygon_quality: indicates the quality of the polygons in the input data. The value is represented as an integer in the database. The possible input values for this key are LOWEST, LOW, MEDIUM, HIGH, BEST. To include the polygon_quality in the import you must also include this column in the extra_data field at the job-level configuration (see above) as it is not a default column. See the sketch after the filter example below.
- type: the type of import; can be csv, kml or kmz.
- priority: the priority to give to this data when comparing it with other files and the existing database contents. Can be any integer value except for 1; the existing data in the Aershed database has a priority of 1, so new data can be ranked below (<1) or above (>1) it.
- data_source: the value to put into the data_source field in the resultant output; can be any string value.
- field_mapping: a list of key-value pairs used to rename columns in the input data (CSV only for now).
- load_points: KML and KMZ only; if set to false it will ignore any points in the file.
- end_date: if an existing site has changed to an inactive status, or a site must be imported despite having an inactive status, then this value is used as the end_date if the end_date column is null or not provided. This will override any value for end_date in the import file.
- import_inactive_sites: if set to true then sites from this data set are imported even if they have a status considered inactive.
- create_polygons_width: if this is defined then any point sites in the data file will be transformed into polygon sites of the given width (in degrees).
- filters: generic filtering of input data by simple string-based filters. See below for more information.
Data can also be filtered by simple string-based filters on input. These filters can be defined in the updated_data configuration object and allow filtering out rows based on the value of any input column. The filters are defined in infrastructure/import_filter.py and can be extended as required. Currently the available filters are "contains" and "does not contain", and a filter has the form:
- type: "does not contain"
column: "site_name"
value: "inactive"