CLI tools for data processing

This package contains programs for processing data and infrastructure. It is a multi-purpose package intended to be a place to keep various processes that are not part of the Aershed platform. This includes cleaning of raw data and processes that require more manual intervention or are not possible to fully automate. It can also be a place to build processes that will eventually become part of the production data processing pipeline.

Setup and usage

Docker

All programs can be run using Docker Compose, which avoids the need to install dependencies locally. Programs are defined in datalab_tools/programs.py. Run the following commands to test that the tool is working correctly.

git clone https://github.com/Aerscape/datalab-tools.git
cd datalab-tools
docker compose run --rm cli-tool python datalab_tools/cli.py config/example.yaml

If the container has been updated but your local image has not been rebuilt, add the --build flag, as in the following command.

docker compose run --rm --build cli-tool python datalab_tools/cli.py config/example.yaml

In this case, we are using the example configuration file to run a simple test program. Once it has completed, a file will be created in the data directory. Additional command line arguments can be supplied after the configuration file path.

Environment variables must be set for any processes that require them, such as accessing databases or external APIs that require tokens. To set these, create a file called .env in the datalab-tools directory. An example .env file is given below.

DB_HOST=prod-readonly.<...>.rds.amazonaws.com
DB_USER=myuser
DB_PASSWORD=<...>
DB_NAME=postgres
PL_API_KEY=<...>

This file sets the configuration required to access the Aershed database and the Planet API.
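A program that depends on these variables can check for them up front and fail with a clear message. The following is a minimal sketch, not the package's own code: the function name `read_required_env` is hypothetical, and the variable names are simply those from the example .env file above.

```python
import os

# Variable names taken from the example .env file; adjust to what a program needs.
REQUIRED_VARS = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME", "PL_API_KEY")


def read_required_env(names=REQUIRED_VARS, environ=None):
    """Return the named variables as a dict, raising if any are missing or empty.

    `environ` defaults to os.environ; passing a plain dict makes testing easier.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Checking all variables at once means a single error lists everything that still needs to be added to the .env file.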

Local installation

To install the package locally, follow the instructions in the Development section below. This can be done in a virtual environment or in the devcontainer.

Once installed, programs can be run using the following command:

python datalab_tools/cli.py <config_file_path> <additional_parameters>

Program and configuration structure

All programs take a Config object. This object contains configuration sourced from the supplied YAML configuration file, additional command line arguments or environment variables.
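One way to picture how three sources can feed a single object is a layered lookup. The sketch below is an illustration only and does not reproduce the package's Config class: `build_config` is a hypothetical helper, and the precedence it uses (command line first, then YAML, then environment) is an assumption, since the precedence is not stated here.

```python
import os
from collections import ChainMap


def build_config(yaml_values, cli_args=None, environ=None):
    """Layer the three sources; the first layer containing a key wins.

    Assumed order (not confirmed by the package): CLI, then YAML, then environment.
    """
    environ = os.environ if environ is None else environ
    # Drop CLI options that were not supplied so they do not mask YAML values.
    supplied = {k: v for k, v in (cli_args or {}).items() if v is not None}
    return ChainMap(supplied, dict(yaml_values), dict(environ))
```

With this layering, a value given on the command line replaces the same key from the YAML file, while keys found only in the environment remain reachable through the same object.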

Each run is defined by a configuration file. This file controls what programs run and the configuration that is passed into them. One configuration file can be used to drive several programs, or multiple runs of the same program with different configuration.

config:
  base_output_dir: "data/example/"

example_program:
  - output_folder: "first-test"
    column_name: "test 1"
  - output_folder: "second-test"
    column_name: "test 2"

This configuration file runs example_program twice, with different values for output_folder and column_name. The output_folder should be specified for every program that writes files. If it is left blank, a default value is used. In either case, the folder name is appended to the base_output_dir specified at the top of the file.
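The way an output location is derived from these two values can be sketched as follows. This is an assumption-laden illustration: `resolve_output_dir` is a hypothetical helper, and `DEFAULT_OUTPUT_FOLDER` is a placeholder because the actual default value is not given here.

```python
from pathlib import Path

# Placeholder only; the package's real default folder name is not documented here.
DEFAULT_OUTPUT_FOLDER = "output"


def resolve_output_dir(base_output_dir, output_folder=None):
    """Join the global base directory with the job's folder, or the default if blank."""
    folder = output_folder or DEFAULT_OUTPUT_FOLDER
    return Path(base_output_dir) / folder
```

For the example configuration, `resolve_output_dir("data/example/", "first-test")` yields the path `data/example/first-test`.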

The configuration file allows us to define multiple individual jobs to be run in series. The first block (with key config) defines global parameters for all jobs. Each key after this must be the name of a program in programs.py. Below each of these keys there must be an array with at least one element, and within that array are the parameters specific to that job. This enables us to run multiple jobs using the same program in a row, each with different parameters. This is particularly useful for processing many different raw files.
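The structure described above can be sketched as a small loop over an already parsed configuration (for instance, the dictionary a YAML parser would return). This is not the package's runner: `run_jobs`, the `PROGRAMS` registry, and the body of `example_program` are hypothetical, and a plain dict stands in for the Config object.

```python
def example_program(config):
    """Hypothetical program body: report where output would go and for which column."""
    return f"{config['base_output_dir']}{config['output_folder']} -> {config['column_name']}"


# Stand-in for the programs defined in programs.py.
PROGRAMS = {"example_program": example_program}


def run_jobs(parsed, programs=PROGRAMS):
    """Run every job in series, combining global parameters with each job's own."""
    global_cfg = parsed.get("config") or {}
    results = []
    for program_name, jobs in parsed.items():
        if program_name == "config":
            continue  # the global block is not a program
        if program_name not in programs:
            raise KeyError(f"Unknown program: {program_name}")
        if not isinstance(jobs, list) or not jobs:
            raise ValueError(f"{program_name} needs a list with at least one job")
        for job in jobs:
            # Job-specific parameters override global ones on key clashes (assumed).
            results.append(programs[program_name]({**global_cfg, **job}))
    return results
```

Rejecting unknown program names and empty job lists early keeps a typo in the configuration file from silently skipping work.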

Parameters not specified in the configuration file can also be provided as command line arguments. However, these must also be defined in arguments.py in order to be recognised.
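The requirement that arguments be declared before they are recognised is how Python's standard argparse module behaves, so a sketch along those lines may help. The contents of arguments.py are not shown here, so everything below is an assumption: `build_parser` and the two options are illustrative names borrowed from the example configuration.

```python
import argparse


def build_parser():
    """Build a parser with one positional config path and some optional overrides."""
    parser = argparse.ArgumentParser(description="Run programs from a YAML configuration file.")
    parser.add_argument("config_file", help="Path to the YAML configuration file")
    # Each extra parameter must be declared here, otherwise argparse rejects it.
    parser.add_argument("--output_folder", default=None, help="Override the job's output folder")
    parser.add_argument("--column_name", default=None, help="Override the job's column name")
    return parser
```

Calling `build_parser().parse_args(["config/example.yaml", "--column_name", "test 3"])` returns a namespace with `column_name` set and `output_folder` left as `None`, while an undeclared option makes argparse exit with a usage error.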

Tests

To run tests via the Docker container use the following command.

docker compose run --rm --build datalab-tools-tests pytest tests/*

Development

For development we have a Visual Studio Code devcontainer setup. You can use the following command to start the devcontainer.

Once inside the devcontainer, you can install the package in development mode to make it easy to test changes. To do so, use the following command:

pip install -e .

If there are any issues with installation, try installing the dependencies using the following commands:

pip install -r requirements.txt
pip install pytest

We also need to install the LaTeX packages (used for fonts on figures), which are not included in the base container. This needs to be fixed properly, but for now run the following commands to install them:

apt-get update
apt-get install -y latexmk dvipng texlive-latex-extra texlive-fonts-recommended cm-super

Tests can then be run using the following command:

pytest