Infrastructure update user guide
The infrastructure update process takes a CSV file and compares it with any existing infrastructure in the database. The result is a JSON file, conforming to the protocol of the infrastructure import API, that contains the appropriate updates. For more information on how the update process works, see the infrastructure process documentation.
This guide steps through how to take a CSV file and create a JSON file ready for import.
First follow the steps in the introduction for setting up the Data Lab tools project. Use the Docker guidelines for the quickest and simplest setup.
Example scenario
See the example sites import CSV file at datalab_tools/data/example/sites.csv. This contains some polygon and point sites and some metadata such as site names, status and a unique ID.
We will create an import using the configuration file config/infrastructure_updates/example.yaml. Inspect this file alongside the guide below to understand how the process works and what the options do. If you want to test an import, create a fake company and then replace the owner value at the top of the configuration file with that company's identifier.
To create an import run the following command.
docker compose run --rm cli-tool python datalab_tools/cli.py config/infrastructure_updates/example.yaml

This will create a file in the folder data/example/processed/data_imports_new_only (see below for more details on the outputs from this program).
After importing, run this process again to verify that no new sites or updates to sites are created. To test creating an update, modify a value in the CSV file or add a whole new column (and reference it in the configuration file), then run the process again.
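For example, to test an update you could add a new column to sites.csv and reference it in the configuration file. The following is a minimal sketch of the relevant configuration, assuming a hypothetical new CSV column named Inspection Notes:

infrastructure_update:
  - output_folder: ""
    extra_data:
      - "inspection_notes"                        # hypothetical column pulled through to the import
    updated_data:
      - path: "sites.csv"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"
        field_mapping:
          "Inspection Notes": "inspection_notes"  # rename the CSV header to the name listed in extra_data

Running the command again with this configuration should then produce update records for the affected sites.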
Infrastructure update program
The infrastructure_update program can combine different types of data inputs and perform the required transformations to produce an update file ready for import. The following describes what this program can do, its input requirements, and the available configuration options.
See any of the YAML files in the config/infrastructure_updates/ folder for examples.
Input files
The infrastructure_update program accepts CSV, KMZ and KML files as input. If a file is not in one of these formats then it should first be transformed into a CSV file using an ad-hoc custom script; see the datalab_tools/infrastructure/custom folder for some examples. The KMZ and KML input types are fragile and expect certain properties; these will be made more flexible in the future.
CSV input file requirements
Input files must have at least the following columns present:
- geometry: the geometry as a POINT or POLYGON in WKT format (longitude, latitude) in the EPSG:4326 projection.
- Type: either Site or Equipment.
Some site types can still end up as equipment if they are found to be within a site polygon. See the process documentation for more details.
Optional fields that will be mapped to the corresponding fields in the output file are:
- site_name: The name of the site.
- operator_status: The operational status of the site.
- operator_unique_id: A unique identifier for the operator.
- other_operator_ids: Additional identifiers for the operator.
- subpart_w_segment: A subpart or segment associated with the site.
- date_of_installation: The installation date of the site or equipment.
- start_date: The start date as a string in the format YYYY-MM-DD.
- end_date: The end date as a string in the format YYYY-MM-DD (can also be set at a per-file level, see below).
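For illustration, a minimal input CSV containing the required columns plus a few of the optional ones might look like this (the rows below are invented):

geometry,Type,site_name,operator_status,operator_unique_id
"POINT (-104.95 40.42)",Site,Example Pad A,Active,SITE-001
"POLYGON ((-104.96 40.43, -104.95 40.43, -104.95 40.44, -104.96 40.44, -104.96 40.43))",Site,Example Pad B,Producing,SITE-002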
The following values must be set in the YAML file for each input file; see below for more information.
- geometry_source: The source of the geometry data.
- data_source: The source of the site data.
Any additional columns can be pulled through the process, but only if they are specified in the configuration file under the extra_data key (see below). If these columns are specified then they will also be extracted and considered for differential updates when pulling data from the Aershed database.
Outputs
This program will create the following output files relative to the base_output_dir and output_folder defined in the configuration file.
- data_imports_new_only/DATE_infrastructure_upload_new_OPERATOR.json: this file contains only CREATE_SITE records. This is useful if you only want to import new data.
- data_imports/DATE_infrastructure_upload_updated_OPERATOR.json: this file contains both create and update records. This covers all changes, so it can be used without needing the new-only file.
There will also be one or two KML files (depending on whether there are any updates) created in the root of the datalab_tools project; these are ad-hoc files that are useful for inspection in Google Earth.
Configuration file
The following explains the possible values in the configuration for the infrastructure_update program. An example configuration is given below.
infrastructure_update:
  - output_folder: ""
    overlap_threshold: 0.5
    handle_nearby_points: true
    number_of_batches: 1
    process_equipment: false
    nearby_point_threshold: 50
    extra_data:
      - "extra_info"
    import_parameters:
      cross_operator_match_distance_m: 25
    inactive_status_values:
      - "Reclaimed"
      - "Abandoned"
    updated_data:
      - path: "sites.csv"
        geometry_column: "geometry"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"
        create_polygons_width: 1e-5
        field_mapping:
          "name": "site_name"
          "Status": "operator_status"
          "unique-id": "operator_unique_id"
      - path: "extra/polygons.kml"
        type: "kml"
        priority: 3
        geometry_source: "DRAWN_BY_AERSCAPE"
        load_points: false

At the program level we have:
- output_folder: if not null then a sub-folder will be created under base_output_dir and all outputs will be relative to that directory.
- overlap_threshold: default value is 0.9, corresponding to 90%. If two sites have an overlap greater than this value then we consider them to be duplicates and remove one. If both sites have the same priority value then we keep the site with the largest area, otherwise the site with the highest priority is kept.
- number_of_batches: (default 1) how many batches to split the data into for the update process. New data is clustered spatially into the set number of clusters, and existing data is filtered to the same spatial extent with a buffer around it.
- handle_nearby_points: (default False) if set to true then points close to polygons are considered in the matching. If a point is sufficiently close then it can be used to update a polygon, such as providing a site name to a nearby polygon that does not have one, and combining nearby points into sites. See the process documentation and example scenarios for more information.
- nearby_point_threshold: (default 50) the approximate distance in metres within which point sites will be merged with polygon sites. The value is converted to degrees using the approximate conversion degrees = nearby_point_threshold / 111320 (for the default of 50 this is roughly 0.00045 degrees).
- process_equipment: (default True) process equipment infrastructure. This option exists to enable turning off equipment processing for operators with large amounts of infrastructure that can slow down the update process.
- extra_data: a list of additional columns that should be pulled through the update process. These will appear with the given names in the data import JSON file and will be put into the extra_data field of the infrastructure in the Aershed database.
- inactive_status_values: an array of status values that are considered inactive. Sites with these values in the operator_status column may be ignored; for more details on the scenarios in which they are still imported see the process documentation.
- import_parameters: an object of any type that will be passed directly, in the same form, into the params field in the infrastructure import file.
- handle_nearby_points_in_new_data: (default True) this is intended as an override for testing only. Generally we always want to avoid importing new data with points very close to polygons, but setting this to False will enable that to happen. For consistent processing handle_nearby_points must also be undefined (default value) or set to False (see the sketch after this list).
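As a minimal sketch, a program-level configuration that uses the testing override described above could look like the following; the values are illustrative only and assume the example sites.csv input:

infrastructure_update:
  - output_folder: "testing"
    number_of_batches: 2                     # split the update into two spatial batches
    handle_nearby_points: false              # must be false (or left unset) for the override below
    handle_nearby_points_in_new_data: false  # testing only: allow new points close to polygons
    updated_data:
      - path: "sites.csv"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"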
At the input file level we have (under the updated_data key):
- path: the path to the file relative to the base_input_dir.
- geometry_source: the value to put into the geometry_source field in the resultant output. Must be one of UNKNOWN, BRIDGER, OPERATOR_PROVIDED, DRAWN_BY_AERSCAPE, SCRAPED_FROM_PUBLIC_DATA, GROUND_SURVEY.
- geometry_column: default is geometry, but it can be changed to the name of the column in the input file that contains the WKT data.
- polygon_quality: indicates the quality of the polygons in the input data. The value is represented as an integer in the database. The possible input values for this key are LOWEST, LOW, MEDIUM, HIGH, BEST. To include the polygon_quality in the import you must also include this column in the extra_data field at the job-level configuration (see above) as it is not a default column. See the sketch after the filter example below.
- type: the type of import; can be csv, kml or kmz.
- priority: the priority to give to this data when comparing it with other files and the existing database contents. Can be any integer value except for 1; the existing data in the Aershed database has a priority of 1, so new data can be ranked below (<1) or above (>1) it.
- data_source: the value to put into the data_source field in the resultant output; can be any string value.
- field_mapping: a list of key-value pairs used to rename columns in the input data (CSV only for now).
- load_points: KML and KMZ only; if set to false it will ignore any points in the file.
- end_date: if an existing site has changed to an inactive status, or a site must be imported despite having an inactive status, then this value is used as the end_date if the end_date column is null or not provided. This will override any value for end_date in the import file.
- import_inactive_sites: if set to true then sites from this data set are imported even if they have a status considered inactive.
- create_polygons_width: if this is defined then any point sites in the data file will be transformed into polygon sites of the given width (in degrees).
- filters: generic filtering of input data by simple string-based filters. See below for more information.
Data can also be filtered by simple string-based filters on input. These filters can be defined in the updated_data configuration object and allow filtering out rows based on the value of any input column. The filters are defined in infrastructure/import_filter.py and can be extended as required. Currently the available filters are "contains" and "does not contain", and a filter has the form:
- type: "does not contain"
column: "site_name"
value: "inactive"