Pricing estimates
The pricing model
The pricing estimates module computes the estimated cost of a survey given a list of sites and a cost model. The cost model is based on the following:
- Each survey area must be of at least a prescribed minimum size. This minimum survey area must be a square.
- A survey area can be larger than the minimum size. If this is the case then the cost scales proportionally to the area.
- The cost is calculated based on the total required survey area and the cost per unit area regardless of the shape of the survey area (assuming the above two conditions are followed).
Therefore the cost model is described by two parameters, the minimum area (assumed to be a square) and the cost per unit area.
In the current implementation we have defined two cost models with the following parameters:
- The less than 30 day product with a minimum area of 100 km2 and a cost of $9 per km2.
- The greater than 30 day product with a minimum area of 10 km2 and a cost of $3.375 per km2.
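As a minimal sketch of how the two parameters combine, the cost of a region can be written as a small function. The class and attribute names below are hypothetical and are not taken from the module. The sketch also assumes that a region smaller than the minimum is billed at the minimum area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical container for the two cost model parameters."""
    name: str
    min_area_km2: float   # minimum (square) survey area in km2
    cost_per_km2: float   # price per km2 in dollars

    def cost(self, area_km2: float) -> float:
        # Assumption: anything below the minimum is billed at the minimum area.
        billable = max(area_km2, self.min_area_km2)
        return billable * self.cost_per_km2


UNDER_30_DAY = CostModel("less than 30 day", 100.0, 9.0)
OVER_30_DAY = CostModel("greater than 30 day", 10.0, 3.375)

print(UNDER_30_DAY.cost(100.0))  # 900.0
print(OVER_30_DAY.cost(10.0))    # 33.75
```

Under these assumptions a single minimum-size box costs $900 for the less than 30 day product and $33.75 for the greater than 30 day product.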
The process
The process to compute the estimated cost is given below.
1. Sites are clustered into groups with a tolerance of approximately 1 km. This reduces the computational cost of the algorithm, since the calculations are needed only for the centroids of the clusters instead of every site. This is important as some of the following steps are computationally expensive.
2. At each cluster we create a square ("box") area of interest with the minimum area defined by the cost model. This box is centred on the centroid of the cluster.
3. We then match clusters to boxes and record which clusters are within each box. Due to the density of sites, many of these boxes will cover multiple clusters of sites.
4. We then execute an algorithm that finds the minimum set of these boxes that covers all the clusters. This is a variant of the set cover problem.
5. We then identify the boxes that overlap with other boxes and merge each group of overlapping boxes into a single connected region (otherwise we would double count the overlapping areas).
6. Finally we calculate the area of each standalone box and each connected region, and multiply the total by the cost per unit area to get the total cost.
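The steps from box creation to the final area can be sketched in a few lines of plain Python. This is an illustration only, under several assumptions: cluster centroids are given as planar (x, y) coordinates in km, the set cover is solved with the standard greedy heuristic (the module may use a different variant), and all function names are hypothetical. Because the boxes are axis-aligned, summing the area of the union of the chosen boxes gives the same total as merging overlapping boxes into connected regions and adding up their areas.

```python
def boxes_for(centroids, min_area_km2):
    """Square box of the minimum area centred on each cluster centroid."""
    half = (min_area_km2 ** 0.5) / 2.0
    return [(x - half, y - half, x + half, y + half) for x, y in centroids]


def covers(box, point):
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def greedy_cover(centroids, boxes):
    """Greedy set cover: repeatedly pick the box covering most uncovered clusters."""
    members = [{j for j, p in enumerate(centroids) if covers(b, p)} for b in boxes]
    uncovered = set(range(len(centroids)))
    chosen = []
    while uncovered:
        best = max(range(len(boxes)), key=lambda i: len(members[i] & uncovered))
        chosen.append(best)
        uncovered -= members[best]
    return chosen


def union_area(rects):
    """Area of the union of axis-aligned rectangles (no double counting)."""
    xs = sorted({v for r in rects for v in (r[0], r[2])})
    ys = sorted({v for r in rects for v in (r[1], r[3])})
    area = 0.0
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            # A grid cell counts once if any rectangle fully contains it.
            if any(r[0] <= xa and xb <= r[2] and r[1] <= ya and yb <= r[3]
                   for r in rects):
                area += (xb - xa) * (yb - ya)
    return area


centroids = [(0.0, 0.0), (8.0, 0.0), (50.0, 50.0)]
boxes = boxes_for(centroids, min_area_km2=100.0)
chosen = greedy_cover(centroids, boxes)
area = union_area([boxes[i] for i in chosen])
print(chosen, area, area * 9.0)  # [0, 1, 2] 280.0 2520.0
```

In this toy example the first two boxes overlap by 20 km2, so the total area is 280 km2 and not the 300 km2 that a naive sum of three boxes would give.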
This algorithm gives us a close approximation to the minimum required area. It is not the theoretical minimum because of the choice in step 2 to place the boxes at the centroids of the clusters. This is not a strict requirement, and relaxing it could yield a solution with a smaller total area. However, doing so would be computationally expensive and is not necessary for the purposes of this project, as it would give only a marginal improvement in the cost estimate.
The algorithm may also underestimate what a desirable survey area would actually be. This is because, when finding the optimal cover in step 4, some site clusters end up right on the edge of a survey area. In practice this means that the site itself is covered, but we may want to see a larger area beyond the site to observe large plume structures. This could be addressed by adding a buffer around each site. The buffer is not currently implemented, as it makes only a marginal difference and would make the process more convoluted.
Usage
The algorithm is implemented in the estimate_pricing program, accessible via the CLI tool. The program uses sites from the Aershed Platform database. The program needs to be provided with an owner ID and an output folder name.
Output
The pricing estimates program produces the following outputs:
- A CSV file (coverage_areas.csv) with a row for each box or connected region. For each row we provide the area in km2 for that region, the basin name and the number of sites contained within it.
- For each cost model we provide the following:
  - An HTML file showing the centroid of each cluster of sites and a polygon representing the computed coverage area. Clicking on each centroid marker will show a label with the number of sites in that cluster.
  - A text file containing a markdown style table summarising the costs for the survey.
An example of the output table is given below.
| Basin | N single | N > 1 | N > 100 | Single ($) | N > 1 ($) | N > 100 ($) | Total area | Total price ($) |
|---|---|---|---|---|---|---|---|---|
| DJ | 1 | 5 | 0 | 900 | 38,351 | 0 | 4,362 | 39,251 |
| Permian | 8 | 27 | 0 | 7,200 | 63,884 | 0 | 7,900 | 71,084 |
| Haynesville | 0 | 3 | 1 | 0 | 7,100 | 3,097 | 789 | 7,100 |
| Other | 64 | 49 | 3 | 57,600 | 93,835 | 7,849 | 17,101 | 151,435 |
| Total | 73 | 84 | 4 | 65,700 | 203,170 | 10,946 | 30,152 | 268,870 |
This table shows the cost split by different types of coverage areas. The columns are as follows:
- Basin: The name of the basin.
- N single: The number of coverage areas that contain a single site. By the definition of the model, each of these is a square box of the minimum required area, so these are the least cost-effective sites to survey.
- N > 1: The number of survey regions with more than one site. It is still possible that some of these are areas of the minimum size.
- N > 100: The number of survey regions with more than 100 sites. These are the most cost effective areas to survey and are likely to be large connected regions of more interesting shapes.
- Single ($): The total cost of the single site areas.
- N > 1 ($): The total cost of the areas with more than one site (note that this includes the N > 100 areas too).
- N > 100 ($): The cost of the areas with more than 100 sites.
- Total area: The total area of all the survey regions in km2.
- Total price ($): The total cost of the survey if all the regions are surveyed.
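The column definitions above can be illustrated with a short aggregation over coverage-area rows. This is a sketch and not the program's own code: the row keys (basin, area_km2, n_sites) are hypothetical stand-ins for the columns of coverage_areas.csv, the input numbers are made up for illustration, and the sketch assumes that the N > 1 count includes the N > 100 regions in the same way that the N > 1 ($) cost does.

```python
from collections import defaultdict


def summarise(rows, cost_per_km2):
    """Aggregate coverage-area rows into the per-basin summary columns."""
    table = defaultdict(lambda: {
        "n_single": 0, "n_gt1": 0, "n_gt100": 0,
        "single_cost": 0.0, "gt1_cost": 0.0, "gt100_cost": 0.0,
        "total_area": 0.0, "total_price": 0.0,
    })
    for row in rows:
        out = table[row["basin"]]
        cost = row["area_km2"] * cost_per_km2
        if row["n_sites"] == 1:
            out["n_single"] += 1
            out["single_cost"] += cost
        elif row["n_sites"] > 1:
            # Assumption: regions with more than 100 sites are counted here too.
            out["n_gt1"] += 1
            out["gt1_cost"] += cost
        if row["n_sites"] > 100:
            out["n_gt100"] += 1
            out["gt100_cost"] += cost
        out["total_area"] += row["area_km2"]
        out["total_price"] += cost
    return dict(table)


# Made-up rows for illustration only.
rows = [
    {"basin": "DJ", "area_km2": 100.0, "n_sites": 1},
    {"basin": "DJ", "area_km2": 250.0, "n_sites": 12},
    {"basin": "DJ", "area_km2": 1200.0, "n_sites": 150},
]
summary = summarise(rows, cost_per_km2=9.0)
print(summary["DJ"])
```

With these rows the total price equals the single-site cost plus the N > 1 cost, which is the same relationship that holds in each row of the example table.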
Emission based price optimisation
The pricing estimates can be adjusted to fit a fixed budget while finding a survey area that maximises the probability of observing emissions. The optimisation we present here is done in two steps. First we select all areas that have had historic emission events over a given threshold. If the cost of surveying these areas is less than our budget, we then expand out from them in such a way as to either (1) maximise the number of sites covered or (2) maximise the expected number of emission events.
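One simple way to picture the two-step selection is a greedy pass over candidate regions. The sketch below is an assumption-laden illustration and not the implemented optimisation: the region keys (cost, n_sites, expected_events, max_emission) are hypothetical, the expansion ranks regions by value per dollar, and it ignores spatial adjacency, so it does not literally "expand out" from the selected areas.

```python
def select_regions(regions, budget, threshold, value_key="n_sites"):
    """Two-step selection: high-emission regions first, then greedy expansion."""
    # Step 1: regions with a historic emission event over the threshold.
    chosen = [r for r in regions if r["max_emission"] > threshold]
    spent = sum(r["cost"] for r in chosen)
    # Step 2: spend the remaining budget on the best value per dollar.
    rest = [r for r in regions if r["max_emission"] <= threshold]
    rest.sort(key=lambda r: r[value_key] / r["cost"], reverse=True)
    for r in rest:
        if spent + r["cost"] <= budget:
            chosen.append(r)
            spent += r["cost"]
    return chosen, spent


# Made-up regions for illustration only.
regions = [
    {"name": "A", "cost": 900.0, "n_sites": 1, "expected_events": 0.4, "max_emission": 500.0},
    {"name": "B", "cost": 1800.0, "n_sites": 20, "expected_events": 0.5, "max_emission": 0.0},
    {"name": "C", "cost": 900.0, "n_sites": 3, "expected_events": 0.9, "max_emission": 50.0},
    {"name": "D", "cost": 2700.0, "n_sites": 40, "expected_events": 0.2, "max_emission": 10.0},
]

by_sites, spent_sites = select_regions(regions, budget=4000.0, threshold=100.0)
by_events, spent_events = select_regions(
    regions, budget=4000.0, threshold=100.0, value_key="expected_events")
print([r["name"] for r in by_sites], spent_sites)    # ['A', 'D'] 3600.0
print([r["name"] for r in by_events], spent_events)  # ['A', 'C', 'B'] 3600.0
```

The two objectives can select different regions for the same budget, which is why the choice between them matters.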
To maximise the expected number of emission events we need to consider the probability of a detection being observed at each site. This is difficult to estimate at the site level, but we can make some approximations. If a site has no historic detections, we assign it the same baseline probability that applies to any site. If a site does have historic detections and its estimated probability is above the average, that higher value could be taken into account.
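The rule described above can be sketched as follows. This is one possible reading, not a confirmed implementation: it assumes the baseline is the pooled detection rate over all sites, that a site's own rate is its detections divided by its observations, and that the site-specific rate replaces the baseline only when it is higher. The function names and the input layout are hypothetical.

```python
def site_probabilities(history):
    """history maps a site id to (n_detections, n_observations)."""
    total_det = sum(det for det, _ in history.values())
    total_obs = sum(obs for _, obs in history.values())
    # Assumption: the baseline is the pooled detection rate over all sites.
    baseline = total_det / total_obs if total_obs else 0.0
    probs = {}
    for site, (det, obs) in history.items():
        own = det / obs if obs else 0.0
        # Sites with no detections, or at or below the average, keep the baseline.
        probs[site] = own if det > 0 and own > baseline else baseline
    return probs


def expected_events(probs, site_ids):
    """Expected number of emission events for the sites in one region."""
    return sum(probs[s] for s in site_ids)


# Made-up history for illustration only.
history = {"a": (0, 10), "b": (5, 10), "c": (1, 20)}
probs = site_probabilities(history)
print(probs)
print(expected_events(probs, ["a", "b", "c"]))
```

Here the pooled rate is 6 detections in 40 observations, so sites "a" and "c" fall back to the baseline while site "b" keeps its own higher rate.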