Infrastructure update process
The infrastructure update process is used to compare new data with what is already in the database. The result is an import file that indicates what changes need to be made to the database following the standard prescribed for the infrastructure import API.
New sites
These are new sites that are not spatially connected to any existing sites. They are simply added to the database using the CREATE_SITES operation.
New sites combined
This case can include the following scenarios.
There are multiple point sites at the same location or extremely close to each other (this tolerance is configurable, and is currently set to 2 meters). In this case all existing point sites are marked to be archived and a new site is created.
There are multiple sites, either points or polygons, contained within a new polygon.
When combining existing sites it is ambiguous which of the existing sites the new site should inherit metadata from, so only metadata from the new record can be retained. However, a link can be created to the existing sites by referring to the list of IDs from the existing sites that will be present in the aerscape_id column.
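The first scenario above (several point sites within the tolerance of each other) can be sketched as follows. This is only an illustration, not the actual implementation: the function names, the record layout, the ";" separator and the use of single-linkage grouping are all assumptions, and coordinates are assumed to be in a projected CRS with metre units.

```python
from math import hypot


def cluster_points(points, tolerance_m=2.0):
    """Group point sites that lie within tolerance_m of each other.

    points: dict mapping site ID -> (x, y) in metres (assumed projected CRS).
    Grouping is single linkage: a chain of close points forms one cluster.
    Returns a list of sorted ID lists, one per cluster.
    """
    ids = list(points)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            (ax, ay), (bx, by) = points[a], points[b]
            if hypot(ax - bx, ay - by) <= tolerance_m:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return [sorted(c) for c in clusters.values()]


def combined_record(cluster, sep=";"):
    """Hypothetical record for a new site replacing a cluster of existing sites."""
    return {"aerscape_id": sep.join(cluster), "archive": list(cluster)}
```

With a 2 metre tolerance, points at (0, 0) and (1.5, 0) would fall into one cluster, while a point at (10, 10) would remain on its own.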
New sites split
In this scenario an existing site is split into multiple smaller sites. The new sites are created and the existing site is marked for deletion. This may occur when a large polygon is replaced with multiple smaller polygons.
Updated sites
These are sites that have been updated, where it is clear that the new site is spatially related to the existing site. This can include changes to the geometry or metadata. In this case no records are marked for archiving, and the existing site is updated with the new data. The aerscape_id
in the update record will match the ID of the site to be updated.
Archived sites
These are sites that need to be removed from the database for the new data to be uploaded. This does not include sites that simply didn't appear in the new data, as this would require the new data to be a complete snapshot every time, which is not likely. These instead are the result of site combinations or splitting as described above.
Process documentation
The following is a work in progress and probably doesn't reflect exactly what is happening any more. It will be reviewed and improved soon!
Handling polygon site updates
Discuss overlaps (minor and more than 90%), unions and cuts. Include discussion of what metadata is carried over and when sites are updated versus deleted.
Combining geometries
If there is sufficient overlap between two sites then we combine them into one site by taking the union of the two polygons. Sites are combined when they overlap by more than 30% but by no more than 90%; above 90% they are treated as duplicates instead (see Handling overlapping sites). If two sites overlap by less than 30% then we treat them as separate sites, and the overlapping part is removed from one of the polygons.
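A minimal sketch of the threshold logic, under stated assumptions: the overlap fraction is taken to be the intersection area divided by the area of the smaller site (the exact definition is not given here), axis-aligned rectangles stand in for real polygons to keep the example dependency-free, and an overlap of exactly 30% is treated as below the threshold. All names are hypothetical.

```python
def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax); zero if inverted."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap_fraction(a, b):
    """Intersection area over the smaller rectangle's area (assumed definition)."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    smaller = min(rect_area(a), rect_area(b))
    return rect_area(inter) / smaller if smaller > 0 else 0.0


def classify_overlap(fraction, combine_threshold=0.30, duplicate_threshold=0.90):
    """Map an overlap fraction to the handling described in the text."""
    if fraction > duplicate_threshold:
        return "duplicate"  # handled by the overlapping-sites process
    if fraction > combine_threshold:
        return "combine"    # union of the two polygons
    if fraction > 0:
        return "trim"       # separate sites, overlap removed from one polygon
    return "separate"       # no overlap at all
```

For example, two 10 by 10 squares offset by 5 units overlap by half of the smaller area and would be combined, while an offset of 9 units leaves a 10% overlap that is only trimmed.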
When two sites are combined we need to handle the passing on of the metadata. There are multiple decisions that must be made.
1.) If the two sites are in the same survey then we choose the metadata from the site with the largest polygon.
2.) If the two sites are in different surveys then we choose the metadata from the most recent survey. If several of the sites being combined are in the latest survey, then again the choice is made based on the largest area.
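Both rules reduce to a single ordering: most recent survey first, then largest area. The sketch below assumes each site is a dict with hypothetical keys id, survey_date and area; the real column names may differ.

```python
from datetime import date


def choose_metadata_source(sites):
    """Pick the site whose metadata is carried over to the combined site.

    Sites are ranked by survey date first and polygon area second, so sites
    from the same survey are separated by area alone.
    """
    return max(sites, key=lambda s: (s["survey_date"], s["area"]))


# Usage with made-up sites: "c" wins as the largest site in the latest survey.
example = [
    {"id": "a", "survey_date": date(2023, 1, 1), "area": 500.0},
    {"id": "b", "survey_date": date(2024, 1, 1), "area": 100.0},
    {"id": "c", "survey_date": date(2024, 1, 1), "area": 200.0},
]
chosen = choose_metadata_source(example)
```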
All sites are given a transaction value to record that a union has taken place. The unique ID value is concatenated with the unique IDs of the sites that have been combined. This is used to identify and track the sites that have been combined.
For example with MX960 we get the following output for a new combined site:
996cbfb2-c524-4ad1-953b-991d19e095fc;8e924458-01a0-4131-a295-8cfd1ae7a41c;87026fe2-0de7-4919-aadd-a786c9d2e9c0;64e1a7a4-7c53-4b17-90be-bd43b2864003
In this case all of these existing sites have been combined together. What has happened here is the following:
- When the new data is loaded we clean it. The new data also contained these four overlapping sites.
- During the initial clean these four sites in the new data set are combined into a union and from then on they are treated as one site.
- Now this one site is compared to the existing data, and then each of the existing sites is absorbed into this new larger site (since by the above step this new site is simply a union of the four sites).
This process can be tracked by the transaction column which has a value of:
union;updated_geometry;updated_geometry;updated_geometry;updated_geometry
This represents that the four sites have been combined into one site, with each updated_geometry entry reflecting a smaller site being absorbed into the new larger site.
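The two semicolon-separated values from this example can be read back with a few lines of code. This is a sketch only; the function names and the returned keys are hypothetical, and the ";" separator is inferred from the example output.

```python
def build_transaction(absorbed_count, sep=";"):
    """Build a transaction value: one union followed by one entry per absorbed site."""
    return sep.join(["union"] + ["updated_geometry"] * absorbed_count)


def summarize_combination(id_value, transaction_value, sep=";"):
    """Split the combined ID and transaction values into a small summary."""
    steps = transaction_value.split(sep)
    return {
        "site_ids": id_value.split(sep),
        "had_union": "union" in steps,
        "absorbed_count": steps.count("updated_geometry"),
    }


ids = (
    "996cbfb2-c524-4ad1-953b-991d19e095fc;8e924458-01a0-4131-a295-8cfd1ae7a41c;"
    "87026fe2-0de7-4919-aadd-a786c9d2e9c0;64e1a7a4-7c53-4b17-90be-bd43b2864003"
)
summary = summarize_combination(ids, build_transaction(4))
```

For the MX960 example this gives four site IDs, one union and four absorbed geometries.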
Handling overlapping sites
If site polygons have a maximum overlap of more than 90%, or the value defined by the configuration, then we treat these as duplicates and follow a process to decide which site to keep.
If the priority column is present then we choose the site with the highest priority. This means that if a site is split into two then we will end up with two sites. The unique Aerscape site ID (or multiple update IDs if there was more than one site under the new site) will be retained in the aerscape_id column. All other metadata will be taken from the site with the highest priority, unless the value is null, in which case it is taken from the old site. This means we will have a single delete record for the original site and one or more new records for the new sites. The new sites will all contain the same aerscape_id value, so this must be treated with care.
If the priority column is not present then we choose the site with the largest area.
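A minimal sketch of the selection and the null fallback, assuming sites are plain dicts, that a larger priority number means higher priority, and that "priority column present" means every candidate has a non-null priority. These are assumptions, as are the function names; the special handling of aerscape_id is not shown.

```python
def choose_site_to_keep(sites):
    """Pick the duplicate to keep: highest priority if available, else largest area."""
    if all(s.get("priority") is not None for s in sites):
        return max(sites, key=lambda s: s["priority"])
    return max(sites, key=lambda s: s["area"])


def merge_metadata(preferred, old):
    """Take each value from the preferred site, falling back to the old site on null."""
    keys = list(dict.fromkeys([*preferred, *old]))  # union of keys, order kept
    return {
        k: preferred.get(k) if preferred.get(k) is not None else old.get(k)
        for k in keys
    }
```

For instance, a preferred site with a null operator would inherit the operator from the old site while keeping its own name.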
Handling point sites
Point sites are sites with a geometry defined by a point rather than a polygon. There are point sites in the database already so we need to account for these as well.
During the update process we check for point sites that are contained within polygons from other sites. These can be point sites from the existing data already in the database contained within the polygon of a new site, or a new point site that is contained within an existing site's polygon.
The following scenarios are handled.
New point site inside existing polygon site
If the new data contains a point site that is contained within an existing site's polygon (and is the only one) then we keep the existing site and discard the point site. The metadata from the point and polygon sites are merged using the metadata merge process.
Multiple new point sites inside existing polygon site
If the new data contains multiple point sites that are contained within an existing site's polygon then we keep the existing site and discard the point sites. The metadata is not transferred in this case as there are multiple point sites so it is ambiguous which metadata should be transferred. No update record is produced.
New and existing data contain exactly the same point site
If geometries match they will be handled by the identical geometry step.
New polygon site containing existing single point site
The existing site geometry is replaced with the polygon. Metadata is merged using the metadata merge process.
New polygon site containing multiple existing point sites
If the new data contains a site with a polygon that contains any number of existing point sites then we create a delete record for those point sites and replace them with the new polygon site. Metadata is merged so that values are brought in from the points only where the polygon has null values; in that case the values from the point sites are concatenated column by column.
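The null-filling merge for this scenario could look like the sketch below. It assumes metadata rows are dicts, that only columns present on the polygon are considered, and that concatenated values are joined with ";" without removing repeats; none of these details are confirmed here and the function name is hypothetical.

```python
def merge_points_into_polygon(polygon_meta, point_metas, sep=";"):
    """Fill null polygon columns with the concatenated values from the point sites."""
    merged = dict(polygon_meta)
    for column, value in polygon_meta.items():
        if value is not None:
            continue  # the polygon's own value always wins
        values = [p[column] for p in point_metas if p.get(column) is not None]
        if values:
            merged[column] = sep.join(str(v) for v in values)
    return merged


polygon = {"name": "Pad 1", "operator": None, "county": None}
points = [{"operator": "Acme", "county": None}, {"operator": "Beta"}]
result = merge_points_into_polygon(polygon, points)
```

In this made-up case the operator column becomes "Acme;Beta", the name is untouched, and county stays null because no point supplies a value.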
Any new point sites that are not covered in the above scenarios are added as new sites.
There is one other scenario that arises because the existing data has point sites. There are cases where a site polygon contains another point site. This seems like it should not be possible, but for MX960 there are three cases where it occurs:
Point 066ac1c7-9517-46e6-98fa-6416d0208792 contained in polygon b80a8c84-7741-4ae7-9f11-a61e632adb4c
Point 528b1a08-7895-4dcc-b112-14a1527081a0 contained in polygon 670b5be3-3f70-44e5-a8ae-4a84aa39b94e
Point d6d04f1d-731d-4823-b1e2-18cafa74bf82 contained in polygon e33724ff-ec60-4d35-95ef-94d6b920e158
To handle these cases the point sites appear in the infrastructure update table as sites to be deleted. There is no change to the polygon site (unless it is modified for some other reason) so it will not appear in the update table.
Filtering by status and setting an end-date
The end date for a site can be set either directly from an import file (make sure the column exists with name end_date or that there is a mapping defined in the configuration file) or it can be inferred from a status change. The process for when this occurs in different scenarios is described here.
First we must define when a site should have an end date.
1.) If a site that already exists in the database has changed status, or is removed entirely from the operator's inventory, then we assign it an appropriate end date rather than deleting the record entirely. This allows us to still match historic emission events to that site.
2.) We may want to import a new site into the database with an end date already defined. This would be required when the end date was recent and we have emission data that we will want to match to this site.
We may also have sites that we do not want to import at all based on the value provided in the operator_status column. So we need sufficient control over the infrastructure update process to cover all these scenarios.
In the configuration file we can define a list of status values, such as inactive or retired, for sites that should either be given an end date, or not imported at all.
New sites with an inactive status can be imported by setting the value import_inactive_sites in the configuration file at the import file level to true. If a site appears in multiple files, and only has an inactive status in one file, the import_inactive_sites flag will force it to be imported regardless of which import file the flag was set on. For example, this allows us to combine historic Bridger data with operator supplied infrastructure while ensuring the Bridger sites will still be imported even if they are inactive in the latest operator files. In this case the site will still be imported, but an end date must be set.
If the existence of an end date is to be inferred from a change in site status then we must set a value for the date. This can be given on the data source configuration (at the file import level) under the field end_date. This will override any value (null or not) in the import file. If an inactive site (according to status) is required to be imported, and no end date is defined (either in the configuration file or in the data file) an error will be raised.
Therefore a site with an end date will only be created if it appears in at least one import file with the import_inactive_sites flag set to true, or if the site already exists in the database.
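The rules in this section can be summarised as a small decision function. This is a hedged sketch rather than the actual implementation: the function name, argument names and return values are hypothetical, the configured end_date is assumed to override the file value only for sites with an inactive status, and any_file_allows_inactive stands for import_inactive_sites being true on at least one import file containing the site.

```python
from datetime import date


def resolve_inactive_site(status, exists_in_db, any_file_allows_inactive,
                          config_end_date, file_end_date,
                          inactive_statuses=("inactive", "retired")):
    """Return (action, end_date) for one site, following the rules above."""
    if status not in inactive_statuses:
        # Active sites are imported as usual; any end date comes from the file.
        return ("import", file_end_date)
    if not (exists_in_db or any_file_allows_inactive):
        # Inactive, new to the database, and no file asks for it: not imported.
        return ("skip", None)
    # Assumption: the configured date overrides the file value when it is set.
    end_date = config_end_date if config_end_date is not None else file_end_date
    if end_date is None:
        raise ValueError("inactive site must be imported but has no end date")
    return ("import_with_end_date", end_date)


# Usage: an existing site that became retired takes the configured end date.
action, end = resolve_inactive_site(
    "retired", True, False, date(2024, 1, 1), date(2023, 6, 30)
)
```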
Identifying changes
One case that is possible is when an old site is split into multiple smaller sites. If the new data contains multiple smaller sites then we must respect this change. So the old site will need to be deprecated and two (or more) new sites will be created. This only occurs when the old site completely overlaps with the new sites, and therefore they will be marked as updated geometries within the same area. One example is West Seminole SA 1214 which is split into two new sites WEST SEMINOLE SAN ANDRES TRACT 12 SAT and W SEMINOLE SA UNIT 1214.
This scenario could result in equipment being orphaned. In this case the closest site polygon is extended to include the equipment. TODO: old equipment should be marked as deprecated and not result in the site polygon being modified.