Introduction

This package contains programs for processing and importing data into the Aershed platform, analysis and reporting, and various other custom tasks. It is a multi-purpose tool designed to support a range of use cases. The intention is to enable all processing jobs, including one-off and custom tasks, to be done within a structured, version-controlled framework with the ability to leverage common code. The broad range of use cases is summarised below.

  • Production processes for preparing raw data for import.
  • Importing data into the Aershed platform.
  • Retrieving data from the Aershed platform and generating reports and analysis.
  • Combining alternative data sources with Aershed data to generate reports and analysis.
  • Custom analysis and processing that do not involve the Aershed platform.
  • Any task which involves the manipulation of data that would benefit from the common tooling and structure available in this package.

Framework

There is one entry point for this application, in the cli.py file. Any task must be defined as a program in the programs.py file, and each program must implement a standard interface. Programs are run using a configuration file, whose structure is explained in the example in the Program Setup section below.

Programs may be trivially simple, or may involve complex tasks and make use of interfaces to other systems. For more details see the developer notes in the Framework chapter of this documentation.
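As a rough illustration of what a program might look like, the sketch below defines a minimal program that takes a configuration mapping and produces a small result. The function name and the parameter keys (`output_folder`, `column_name`) are taken from the example configuration later in this document; the exact interface is defined in programs.py, so treat this as an assumption rather than the real signature.

```python
# Hypothetical sketch of a program with an assumed standard interface.
# The real interface is defined in programs.py; names here are illustrative.

def example_program(config):
    """A trivial program: builds a message from its configuration."""
    output_folder = config.get("output_folder", "example")
    message = config.get("column_name", "hello")
    # In the real tool the folder would be joined with base_output_dir
    # and the result written to disk; here we just return a string.
    return f"{output_folder}: {message}"
```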

Setup and usage

Docker - first time user

One of the biggest challenges with using Python for data processing and analysis is the setup of various packages in different environments. Spatial analysis in particular involves system libraries such as gdal which, if not installed correctly, can cause issues at runtime. To simplify this and create a consistent experience for all users, this tool should be used within a Docker container.

To make this simple, all programs can be run using docker-compose via a bash script. The following steps will get your environment set up and test that it is working correctly.

Run the following commands.

git clone https://github.com/Aerscape/datalab-tools.git
cd datalab-tools
git submodule update --init --recursive
touch .env

The .env file is where you set local configuration required for all environments. The following parameters must be specified.

RAW_DATA_DIR=/home/username/data/
PROCESSING_DIR=/home/username/data/
S3_DIR=/home/username/data/

For a simple local use case without an S3 directory mounted you can set these values to anything; each simply needs to point to a directory that exists so that Docker can correctly mount the volumes.

The RAW_DATA_DIR is where we store customer data that has been provided by means other than S3, so we keep it on the DataLab disk (ideally this will eventually be moved to S3 for persistent storage). It can be set to any value you want, especially if you are not intending to use it for real data, but it must be a full path (run pwd to get the full path to your current directory and use that if you need to).

If you are interacting with the Aershed environment (in most situations you will be) then you need to specify environment-specific credentials. For this, create a separate .env file for each environment, such as .env.dev, .env.prod-testing or .env.prod. The following is an example environment-specific file.

DB_HOST=
DB_USER=
DB_PASSWORD=
API_KEY=

Here API_KEY is the key used for importing to Aershed, and can be obtained from the administration page under the data imports tab.

If you are using the DataLab you may find that Docker is not set up correctly for your user. In that case, run the following command:

sudo ./scripts/add-docker-user.sh

This will avoid the need to use sudo for any future docker commands.

Then to build the container run the following command.

./scripts/rebuild.sh

To run the tests and check everything is set up correctly, run the following command.

./scripts/run-tests.sh

Docker - after updating

If you have recently updated the version of the repository then you may need to rebuild the docker image. Use the following command to do this:

./scripts/rebuild.sh
./scripts/run-tests.sh

Program setup

In this example we are using the example configuration file to run a simple test program. Once it has completed, a file will be created in the data directory. Additional command line arguments can be supplied after the configuration file path.

Environment variables must be set for any processes that require them, such as accessing databases or external APIs which require tokens. To set these create a file called .env in the datalab-tools directory. An example .env file is given below.

DB_HOST=123.11.1.00
DB_USER=myuser
DB_PASSWORD=<...>
PL_API_KEY=<...>

This file sets the configuration required to access the Aershed database and the Planet API. Other services require additional parameters; add these as required.
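To show how a program might pick these values up at runtime, here is a small sketch that reads the database settings from the environment. The variable names match the example .env file above; the function name and the fallback defaults are illustrative assumptions, not part of the tool.

```python
import os

# Illustrative sketch: reading .env-provided settings from the environment.
# The env parameter defaults to os.environ but accepts any mapping,
# which makes the function easy to test.
def load_db_settings(env=None):
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),  # placeholder default
        "user": env.get("DB_USER", ""),
        "password": env.get("DB_PASSWORD", ""),
    }
```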

All programs take a Config object. This object contains configuration sourced from the supplied YAML configuration file, additional command line arguments or environment variables.
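The merging behaviour described above could be sketched as follows. Note that the precedence shown here (command line arguments over YAML configuration over environment variables) is an assumption for illustration; the actual precedence is defined by the tool itself.

```python
# Sketch of a Config object merging the three configuration sources.
# Precedence (CLI > YAML > env) is assumed, not confirmed by the docs.
class Config:
    def __init__(self, yaml_config=None, cli_args=None, env=None):
        self.yaml_config = yaml_config or {}
        self.cli_args = cli_args or {}
        self.env = env or {}

    def get(self, key, default=None):
        # Check each source in priority order and return the first hit.
        for source in (self.cli_args, self.yaml_config, self.env):
            if key in source:
                return source[key]
        return default
```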

Each run is defined by a configuration file. This file controls what programs run and the configuration that is passed into them. One configuration file can be used to drive several programs, or multiple runs of the same program with different configuration.

config:
  base_output_dir: "data/example/"

example_program:
  - output_folder: "first-test"
    column_name: "test 1"
  - output_folder: "second-test"
    column_name: "test 2"

This configuration file runs the example_program program twice, with different parameters for output_folder and column_name. The output_folder should be specified for all programs that produce file output; if left blank, a default value is used. The value is then concatenated with the base_output_dir specified at the top of the file.
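The path construction implied here could look something like the sketch below. The function name and the default folder name are hypothetical; only the joining of base_output_dir and output_folder comes from the text above.

```python
import os

# Assumed behaviour: the final output path joins base_output_dir with
# output_folder, falling back to a default folder name when none is given.
def resolve_output_dir(base_output_dir, output_folder=None, default="output"):
    return os.path.join(base_output_dir, output_folder or default)
```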

The configuration file allows us to define multiple individual jobs to be run in series. The first block (with key config) defines global parameters for all jobs. Each key after this must be the name of a program in programs.py. Below each of these keys there must be an array with at least one element, and within that array are the parameters specific to that job. This enables us to run multiple jobs using the same program in a row, each with different parameters. This is particularly useful for processing many different raw files.
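The run loop implied by this structure can be sketched as follows. The `programs` registry stands in for programs.py, and the merging of global parameters into each job is an assumption about how the global config block is applied.

```python
# Sketch of the implied runner: skip the global "config" block, then run
# each program's jobs in series, merging global parameters into each job.
def run_config(config_data, programs):
    global_params = config_data.get("config", {})
    results = []
    for name, jobs in config_data.items():
        if name == "config":
            continue  # global parameters, not a program
        program = programs[name]  # must match a name in programs.py
        for job_params in jobs:
            params = {**global_params, **job_params}
            results.append(program(params))
    return results
```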

Parameters not specified in the configuration file can also be provided in the command line arguments. However, these must also be defined in arguments.py in order to be recognised.
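A registration in arguments.py might resemble the argparse sketch below. The flag name and description are illustrative assumptions; the point is only that each extra parameter needs an explicit definition before the CLI will accept it.

```python
import argparse

# Sketch of how arguments.py might register CLI parameters.
# The flag names here are illustrative, not the tool's actual flags.
def build_parser():
    parser = argparse.ArgumentParser(description="datalab-tools runner")
    parser.add_argument("config_file", help="path to the YAML configuration")
    parser.add_argument("--output-folder", help="override output_folder")
    return parser
```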

Tests

To run all tests, use the command

./scripts/run-tests.sh

To run a specific test, add an additional argument

./scripts/run-tests.sh tests/test_timezone_finder.py::test_timezone_finder_daylight_savings

Development

The devcontainer is not actively maintained right now. Due to issues with excessive memory use on the EC2 instance, development currently happens outside the container, with all tasks run via docker-compose.

For development we have a Visual Studio Code devcontainer setup.

Once inside the dev container you can install the package in development mode to make it easy to test changes. To do this, use the following command

pip install -e .

If there are any issues with installation, try installing the dependencies using the following commands

pip install -r requirements.txt
pip install pytest

We also need to install the LaTeX packages (used for fonts on figures), which require an update from the base container. This needs to be fixed properly, but for now simply run the following to get them installed

apt-get update
apt-get install -y latexmk dvipng texlive-latex-extra texlive-fonts-recommended cm-super

Tests can be run using the command

pytest