# Infrastructure update user guide
The infrastructure update process takes a CSV file and compares it with the existing infrastructure in the database. The result is a JSON file, conforming to the protocol of the infrastructure import API, that contains the appropriate updates. For more information on how the update process works, see the infrastructure process documentation.
This guide steps through how to take a CSV file and create a JSON file ready for import.
First follow the steps in the introduction for setting up the Data Lab tools project. Use the Docker guidelines for the quickest and simplest setup.
## Example scenario
See the example sites import CSV file at `datalab_tools/data/example/sites.csv`. This contains some polygon and point sites and some metadata such as site names, status and a unique ID.
We will create an import using the configuration file `config/infrastructure_updates/example.yaml`. Inspect this file alongside the guide below to understand how the process works and to see what the options do. If you want to test an import, create a fake company and then replace the `owner` value at the top of the configuration file with this company identifier.
To create an import, run the following command.

```sh
docker compose run --rm cli-tool python datalab_tools/cli.py config/infrastructure_updates/example.yaml
```
This will create a file in the folder `data/example/processed/data_imports_new_only` (see below for more details on the outputs from this program).
After importing, run this process again to verify that no new sites or updates to sites are created. To test creating an update, modify a value in the CSV file or add a whole new column (and reference it in the configuration file, as sketched below), then run the process again.
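For instance, if you added a hypothetical `inspection_notes` column to the CSV, you could reference it under the `extra_data` key of the example configuration (the column name here is invented):

```yaml
extra_data:
  - "inspection_notes"  # hypothetical new CSV column to pull through
```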
## Infrastructure update program
The `infrastructure_update` program can combine different types of data inputs and perform the required transformations to make an update file ready for import. The following describes what this program can do, sets out the input requirements and defines the configuration options.
See any of the YAML files in the `config/infrastructure_update/` folder for examples.
### Input files
The `infrastructure_update` program accepts CSV, KML and KMZ files as input. If a file is not in one of these formats, it should first be transformed into a CSV file using an ad-hoc custom script; see the `datalab_tools/infrastructure/custom` folder for some examples, and the sketch below. The KMZ and KML input types are fragile and expect certain properties; these will be made more flexible in future.
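As an illustration of such a pre-processing step, the sketch below converts a GeoJSON file of points into the CSV layout described in the next section. This is not one of the real scripts in that folder; the input file, its structure and the column choices are all assumptions.

```python
# Hypothetical ad-hoc transform: GeoJSON points -> CSV with WKT geometry.
# The required output columns (geometry, Type) are described in the
# "CSV input file requirements" section below.
import csv
import json

with open("wells.geojson") as f:  # hypothetical input file
    features = json.load(f)["features"]

with open("wells.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["geometry", "Type", "site_name"])
    writer.writeheader()
    for feature in features:
        lon, lat = feature["geometry"]["coordinates"][:2]
        writer.writerow({
            # WKT order is longitude then latitude, matching the requirement below
            "geometry": f"POINT ({lon} {lat})",
            "Type": "Site",
            "site_name": feature["properties"].get("name", ""),
        })
```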
#### CSV input file requirements
Input files must have at least the following columns present:

- `geometry`: the geometry as a POINT or POLYGON in WKT format (longitude, latitude) in the EPSG:4326 projection.
- `Type`: either `Site` or `Equipment`.

Some site types can still end up as equipment if they are found to be within a site polygon. See the process documentation for more details.
Optional fields that will be mapped to the corresponding fields in the output file are:

- `site_name`: The name of the site.
- `operator_status`: The operational status of the site.
- `operator_unique_id`: A unique identifier for the operator.
- `other_operator_ids`: Additional identifiers for the operator.
- `subpart_w_segment`: A subpart or segment associated with the site.
- `date_of_installation`: The installation date of the site or equipment.
- `start_date`: The start date as a string in the format `YYYY-MM-DD`.
- `end_date`: The end date as a string in the format `YYYY-MM-DD` (can also be set at a per-file level, see below).
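Putting the required and optional columns together, a minimal input file might look like the following (all names, IDs and coordinates are illustrative only):

```csv
geometry,Type,site_name,operator_status,operator_unique_id,start_date
"POINT (-110.50 54.20)",Equipment,North Compressor,Active,OP-001,2020-01-15
"POLYGON ((-110.51 54.21, -110.50 54.21, -110.50 54.20, -110.51 54.20, -110.51 54.21))",Site,Central Battery,Producing,OP-002,2018-06-01
```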
The following columns must be set in the YAML file for each input file; see below for more information.

- `geometry_source`: The source of the geometry data.
- `site_data_source`: The source of the site data.
Any additional columns can be pulled through the process, but only if they are specified in the configuration file under the `extra_data` key (see below). If these columns are specified, they will also be extracted and considered for differential updates when pulling data from the Aershed database.
### Outputs
This program will create the following output files relative to the `base_output_dir` and `output_folder` defined in the configuration file.

- `data_imports_new_only/DATE_infrastructure_upload_new_OPERATOR.json`: this file contains only `CREATE_SITE` records. This is useful if you only want to import new data.
- `data_imports/DATE_infrastructure_upload_updated_OPERATOR.json`: this file contains both create and update records. Because it contains all changes, it can be used without needing the new-only file.
There will also be one or two KML files (depending on whether there are updates) created in the root of the `datalab_tools` project; these are ad-hoc files that are useful for inspection in Google Earth.
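For example, with the example configuration you could inspect the generated files like this (the date and operator in the file name will vary; the name shown is illustrative only):

```sh
ls data/example/processed/data_imports_new_only/
# e.g. 2024-05-01_infrastructure_upload_new_ExampleCo.json
```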
### Configuration file
The following explains the possible values in the configuration for the `infrastructure_update` program. An example configuration covering the main options is given below.
```yaml
infrastructure_update:
  - output_folder: ""
    overlap_threshold: 0.5
    handle_nearby_points: true
    nearby_point_threshold: 50
    extra_data:
      - "extra_info"
    import_parameters:
      cross_operator_match_distance_m: 25
    inactive_status_values:
      - "Reclaimed"
      - "Abandoned"
    updated_data:
      - path: "sites.csv"
        geometry_column: "geometry"
        type: "csv"
        priority: 2
        geometry_source: "OPERATOR_PROVIDED"
        data_source: "Test data"
        field_mapping:
          "name": "site_name"
          "Status": "operator_status"
          "unique-id": "operator_unique_id"
      - path: "extra/polygons.kml"
        type: "kml"
        priority: 3
        geometry_source: "DRAWN_BY_AERSCAPE"
        load_points: false
```
At the program level we have:

- `output_folder`: if not null, a sub-folder with this name will be created under `base_output_dir` and all outputs will be relative to that directory.
- `overlap_threshold`: default value is 0.9, corresponding to 90%. If two sites have an overlap greater than this value then we consider them to be duplicates and remove one. If both sites have the same priority value then we keep the site with the largest area; otherwise the site with the highest priority is kept.
- `handle_nearby_points`: (default `False`) if set to `true` then points close to polygons are considered in the matching. If a point is sufficiently close then it can be used to update a polygon, such as providing a site name to a nearby polygon that does not have one, and combining nearby points into sites. See the process documentation and example scenarios for more information.
- `nearby_point_threshold`: (default `50`) the approximate distance in metres within which point sites will be merged with polygon sites. The value is converted to degrees using the approximate conversion `degrees = nearby_point_threshold / 111320` (so the default of 50 m corresponds to roughly 0.00045 degrees).
- `extra_data`: a list of additional columns that should be pulled through the update process. These will appear with the given names in the data import JSON file and will be put into the `extra_data` field of the infrastructure in the Aershed database.
- `inactive_status_values`: an array of status values that are considered inactive. Sites with these values in the `operator_status` column may be ignored; for details on the scenarios in which they are still imported, see the process documentation.
- `import_parameters`: an object of any type that will be passed directly, in the same form, into the `params` field in the infrastructure import file.
- `handle_nearby_points_in_new_data`: (default `True`) this is intended as an override for testing only. Generally we always want to avoid importing new data with points very close to polygons, but setting this to `False` will enable that to happen. For consistent processing, `handle_nearby_points` must also be undefined (its default value) or set to `False`.
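As a sketch, the program-level options above could be combined like this. The values are illustrative, and this includes the testing-only override, which the full example above omits; `updated_data` is left empty for brevity, though a real run needs at least one input file (see the next section).

```yaml
infrastructure_update:
  - output_folder: "test_run"               # sub-folder created under base_output_dir
    overlap_threshold: 0.9                  # treat >90% overlap as duplicate sites
    handle_nearby_points: false             # default; required for the override below
    handle_nearby_points_in_new_data: false # testing-only override (see above)
    extra_data:
      - "extra_info"
    inactive_status_values:
      - "Reclaimed"
    import_parameters:
      cross_operator_match_distance_m: 25   # passed through into the params field
    updated_data: []                        # input files would be listed here
```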
At the input file level we have (under the `updated_data` key):

- `path`: the path to the file relative to the `base_input_dir`.
- `geometry_source`: the value to put into the `geometry_source` field in the resultant output. Must be one of `UNKNOWN`, `BRIDGER`, `OPERATOR_PROVIDED`, `DRAWN_BY_AERSCAPE`, `SCRAPED_FROM_PUBLIC_DATA`, `GROUND_SURVEY`.
- `geometry_column`: default is `geometry`, but can be changed to the name of the column in the input file that contains the WKT data.
- `type`: the type of import; can be `csv`, `kml` or `kmz`.
- `priority`: the priority to give to this data when comparing it with other files and the existing database contents. Can be any integer value except for `1`; the existing data in the Aershed database has a priority of `1`. New data can be ranked below or above it (`>1`).
- `data_source`: the value to put into the `data_source` field in the resultant output; can be any string value.
- `field_mapping`: a set of key-value pairs used to rename columns in the input data (CSV only for now).
- `load_points`: KML and KMZ only; if set to `false`, any points in the file are ignored.
- `end_date`: if an existing site has changed to an inactive status, or a site must be imported despite having an inactive status, then this value is used as the `end_date` when the `end_date` column is null or not provided. This will override any value for `end_date` in the import file.
- `import_inactive_sites`: if set to `true`, sites from this data set are imported even if they have a status considered inactive.
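As an illustration of the per-file options, an entry that imports inactive sites and supplies a fallback end date might look like this (the path and all values are hypothetical):

```yaml
updated_data:
  - path: "legacy/abandoned_sites.csv"      # relative to base_input_dir
    type: "csv"
    priority: 2
    geometry_source: "SCRAPED_FROM_PUBLIC_DATA"
    data_source: "Public records"
    import_inactive_sites: true             # import even sites with an inactive status
    end_date: "2023-12-31"                  # fallback when the end_date column is null
```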