Infrastructure update user guide

The infrastructure update process takes a CSV file and compares it with any existing infrastructure in the database. The result is a JSON file, conforming to the protocol of the infrastructure import API, that contains the appropriate updates. For more information on how the update process works, see the infrastructure process documentation.

This guide steps through how to take a CSV file and create a JSON file ready for import.

First follow the steps in the introduction for setting up the Data Lab tools project. Use the Docker guidelines for the quickest and simplest setup.

Example scenario

See the example sites import CSV file at datalab_tools/data/example/sites.csv. This contains some polygon and point sites and some metadata such as site names, status and a unique ID.
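
For illustration, the shape of this file is roughly as follows (the rows and values here are a sketch, not the actual contents of the example file):

CSV
geometry,Type,name,Status,unique-id
"POINT (-110.05 55.28)",Site,Example pad,Active,SITE-001
"POLYGON ((-110.06 55.27, -110.05 55.27, -110.05 55.28, -110.06 55.28, -110.06 55.27))",Site,Example battery,Abandoned,SITE-002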

We will create an import using the configuration file config/infrastructure_updates/example.yaml. Inspect this file alongside the guide below to understand how the process works and what the options do. If you want to test an import, create a fake company and then replace the owner value at the top of the configuration file with this company's identifier.

To create an import, run the following command.

docker compose run --rm cli-tool python datalab_tools/cli.py config/infrastructure_updates/example.yaml

This will create a file in the folder data/example/processed/data_imports_new_only (see below for more details on the outputs from this program).

After importing, run this process again to verify that no new sites or updates to sites are created. To test creating an update, modify a value in the CSV file or add a whole new column (and reference it in the configuration file), then run the process again.
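
For example, a minimal way to trigger an update is to append a new column to sites.csv and reference it under extra_data in example.yaml (the fence_type column name and value are hypothetical):

CSV
geometry,Type,name,Status,unique-id,fence_type
"POINT (-110.05 55.28)",Site,Example pad,Active,SITE-001,chain-link

YAML
extra_data:
  - "extra_info"
  - "fence_type"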

Infrastructure update program

The infrastructure_update program can combine different types of data inputs and perform the required transformations to produce an update file ready for import. The following sections describe what this program can do, the input requirements, and the available configuration options.

See any of the YAML files in the config/infrastructure_updates/ folder for examples.

Input files

The infrastructure_update program accepts CSV, KMZ, and KML files as input. If a file is not in one of these formats, it should first be transformed into a CSV file using an ad-hoc custom script; see the datalab_tools/infrastructure/custom folder for some examples. The KMZ and KML input types are fragile and expect certain properties; they will be made more flexible in future.
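
As a sketch of what such a conversion script might look like (the input file name and record shape are assumptions, not one of the real scripts in that folder):

Python
# Ad-hoc conversion sketch: turn a hypothetical JSON export into an
# import-ready CSV with the required geometry and Type columns.
import csv
import json

with open("wells.json") as f:  # hypothetical source file
    records = json.load(f)

with open("wells.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["geometry", "Type", "name", "Status", "unique-id"]
    )
    writer.writeheader()
    for rec in records:
        writer.writerow({
            # WKT coordinates are longitude latitude in EPSG:4326
            "geometry": f"POINT ({rec['lon']} {rec['lat']})",
            "Type": "Site",
            "name": rec.get("name", ""),
            "Status": rec.get("status", ""),
            "unique-id": rec["id"],
        })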

CSV input file requirements

Input files must have at least the following columns present:

  • geometry: the geometry as a POINT or POLYGON in WKT format (longitude, latitude) in the EPSG:4326 projection.
  • Type: either Site or Equipment.

Some site types can still end up as equipment if they are found to be within a site polygon. See the process documentation for more details.

Optional fields that will be mapped to the corresponding fields in the output file are:

  • site_name: The name of the site.
  • operator_status: The operational status of the site.
  • operator_unique_id: A unique identifier for the operator.
  • other_operator_ids: Additional identifiers for the operator.
  • subpart_w_segment: A subpart or segment associated with the site.
  • date_of_installation: The installation date of the site or equipment.
  • start_date: The start date as a string in the format YYYY-MM-DD.
  • end_date: The end date as a string in the format YYYY-MM-DD (can also be set at a per file level, see below).

The following columns must be set in the YAML file for each input file; see below for more information.

  • geometry_source: The source of the geometry data.
  • site_data_source: The source of the site data.

Any additional columns can be pulled through the process, but only if they are listed in the configuration file under the extra_data key (see below). For example, the configuration below lists extra_info under extra_data, so an extra_info column in the input CSV is carried into the output. If specified, these columns will also be extracted and considered for differential updates when pulling data from the Aershed database.

Outputs

This program will create the following output files relative to the base_output_dir and output_folder defined in the configuration file.

  • data_imports_new_only/DATE_infrastructure_upload_new_OPERATOR.json: this file contains only CREATE_SITE records. This is useful if you only want to import new data.
  • data_imports/DATE_infrastructure_upload_updated_OPERATOR.json: this file contains both create and update records. Because it includes all changes, it can be used without the new-only file.
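
To give a feel for the shape of these files, the sketch below shows a single CREATE_SITE record. The exact schema is defined by the infrastructure import API (see the process documentation); the records envelope and field layout here are assumptions for illustration only.

JSON
{
  "params": { "cross_operator_match_distance_m": 25 },
  "records": [
    {
      "type": "CREATE_SITE",
      "site_name": "Example pad",
      "operator_status": "Active",
      "operator_unique_id": "SITE-001",
      "geometry": "POINT (-110.05 55.28)",
      "geometry_source": "OPERATOR_PROVIDED",
      "data_source": "Test data",
      "extra_data": { "extra_info": "example value" }
    }
  ]
}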

There will also be one or two KML files (depending on whether there are any updates) created in the root of the datalab_tools project; these are ad-hoc files that are useful for inspection in Google Earth.

Configuration file

The following explains the possible values in the configuration for the infrastructure_update program. An example configuration is given below.

YAML
infrastructure_update:
  - output_folder: ""
    overlap_threshold: 0.5
    handle_nearby_points: true
    nearby_point_threshold: 50
    extra_data:
      - "extra_info"
    import_parameters:
      cross_operator_match_distance_m: 25
    inactive_status_values:
      - "Reclaimed"
      - "Abandoned"
    updated_data:
      - path: "sites.csv"
        geometry_column: "geometry"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"
        field_mapping:
          "name": "site_name"
          "Status": "operator_status"
          "unique-id": "operator_unique_id"
      - path: "extra/polygons.kml"
        type: "kml"
        priority: 3
        geometry_source: "DRAWN_BY_AERSCAPE"
        load_points: false

At the program level we have:

  • output_folder: if not null then a sub-folder will be created under base_output_dir and all outputs will be relative to that directory.
  • overlap_threshold: default value is 0.9 corresponding to 90%. If two sites have an overlap greater than this value then we consider them to be duplicates and remove one. If both sites have the same priority value then we keep the site with the largest area, otherwise the site with the highest priority is kept.
  • handle_nearby_points: (default False) if set to true then points close to polygons are considered in the matching. If a point is sufficiently close then it can be used to update a polygon, such as providing a site name to a nearby polygon that does not have one, and combining nearby points into sites. See the process documentation and example scenarios for more information.
  • nearby_point_threshold: (default 50) the approximate distance in metres within which point sites will be merged with polygon sites. The value is converted to degrees using the approximate conversion degrees = nearby_point_threshold / 111320 (for the default of 50 m this is roughly 0.00045 degrees).
  • extra_data: this is a list of additional columns that should be pulled through the update process. These will appear with the given names in the data import JSON file and will be put into the extra_data field of the infrastructure in the Aershed database.
  • inactive_status_values: an array of status values that are considered inactive. Sites with these values in the operator_status column may be ignored; see the process documentation for details on the scenarios in which they are still imported.
  • import_parameters: this is an object of any type that will be passed directly in the same form into the params field in the infrastructure import file.
  • handle_nearby_points_in_new_data: (default True) this is intended as an override for testing only. Generally we always want to avoid importing new data with points very close to polygons, but setting this to False allows that to happen. For consistent processing, handle_nearby_points must also be left undefined (its default) or set to False.
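
For example, a minimal program-level configuration using this testing override, consistent with the rule above (the output folder and file entry are illustrative):

YAML
infrastructure_update:
  - output_folder: "test_run"
    handle_nearby_points: false
    handle_nearby_points_in_new_data: false
    updated_data:
      - path: "sites.csv"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"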

At the input file level we have (under the updated_data key):

  • path: the path to the file relative to the base_input_dir.
  • geometry_source: the value to put into the geometry_source field in the resultant output; must be one of UNKNOWN, BRIDGER, OPERATOR_PROVIDED, DRAWN_BY_AERSCAPE, SCRAPED_FROM_PUBLIC_DATA, GROUND_SURVEY.
  • geometry_column: the name of the column in the input file that contains the WKT data; defaults to geometry.
  • type: the type of import, can be csv, kml or kmz.
  • priority: the priority to give to this data when comparing it with other files and the existing database contents. Can be any integer value except for 1, which is reserved for the existing data in the Aershed database. New data can therefore be ranked below (<1) or above (>1) the existing data.
  • data_source: the value to put into the data_source field in the resultant output, can be any string value.
  • field_mapping: a mapping of input column names to output field names, used to rename columns in the input data (CSV only for now).
  • load_points: KML and KMZ only, if set to false it will ignore any points in the file.
  • end_date: if an existing site has changed to an inactive status, or a site must be imported despite having an inactive status, this value is used as the end_date whenever the end_date column is null or not provided. It overrides any end_date value in the import file.
  • import_inactive_sites: If set to true then import sites from this data set even if they have a status considered inactive.
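
As a sketch, here is a file entry combining the remaining per-file options that the example configuration above does not show (the path, column name, and values are illustrative):

YAML
updated_data:
  - path: "legacy/wells.csv"
    type: "csv"
    priority: 2
    geometry_column: "wkt_geom"
    geometry_source: "SCRAPED_FROM_PUBLIC_DATA"
    data_source: "Public well registry"
    end_date: "2024-12-31"
    import_inactive_sites: true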