CLI tools for data processing
This package contains programs for processing data and infrastructure. It is a multi-purpose package intended to be a place to keep various processes that are not part of the Aershed platform. This includes cleaning of raw data and processes that require more manual intervention or are not possible to fully automate. It can also be a place to build processes that will eventually become part of the production data processing pipeline.
Setup and usage
Docker
All programs can be run using Docker Compose, which avoids the need to install dependencies locally. Programs are defined in datalab_tools/programs.py. Run the following commands to check that the tool is working correctly.
git clone https://github.com/Aerscape/datalab-tools.git
cd datalab-tools
docker compose run --rm cli-tool python datalab_tools/cli.py config/example.yaml
If the container has been updated but your local version has not been rebuilt, add the --build flag as in the following command.
docker compose run --rm --build cli-tool python datalab_tools/cli.py config/example.yaml
In this case we are using the example configuration file to run a simple test program. Once it has completed, a file will be created in the data directory. Additional command line arguments can be supplied after the configuration file path.
Environment variables must be set for any processes that require them, such as accessing databases or external APIs that require tokens. To set these, create a file called .env in the datalab-tools directory. An example .env file is given below.
DB_HOST=prod-readonly.<...>.rds.amazonaws.com
DB_USER=myuser
DB_PASSWORD=<...>
DB_NAME=postgres
PL_API_KEY=<...>
This file sets the configuration required to access the Aershed database and the Planet API.
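As a rough illustration of how a program might pick these values up, the sketch below reads them from the process environment using only the standard library. The helper name load_settings is hypothetical and not part of datalab-tools, and the sketch assumes the variables have already been exported into the environment the program runs in.

```python
import os

# Variable names taken from the example .env file above.
REQUIRED_VARS = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME", "PL_API_KEY")


def load_settings(environ=None):
    """Return the required settings as a dict, failing early if any are missing.

    Hypothetical helper for illustration only; not part of datalab-tools.
    """
    environ = os.environ if environ is None else environ
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Failing early with the full list of missing names tends to be easier to debug than a connection error later in the run.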
Local installation
To install the package locally, follow the instructions in the Development section below. This can be done in a virtual environment or in the devcontainer.
Once installed, programs can be run using the following command:
python datalab_tools/cli.py <config_file_path> <additional_parameters>
Program and configuration structure
All programs take a Config object. This object contains configuration sourced from the supplied YAML configuration file, additional command line arguments, or environment variables.
Each run is defined by a configuration file. This file controls what programs run and the configuration that is passed into them. One configuration file can be used to drive several programs, or multiple runs of the same program with different configuration.
config:
  base_output_dir: "data/example/"
example_program:
  - output_folder: "first-test"
    column_name: "test 1"
  - output_folder: "second-test"
    column_name: "test 2"
This configuration file runs the example_program program twice with different values for the output_folder and column_name parameters. The output_folder should be specified for every program that writes files. If it is left blank, a default value is used. In either case the value is concatenated with the base_output_dir specified at the top of the file.
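The path handling described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function name resolve_output_dir and the default folder name "output" are hypothetical, and the sketch assumes that "concatenated" means joining the two values as path components.

```python
from pathlib import PurePosixPath

# Assumed placeholder; the real default value is not documented here.
DEFAULT_OUTPUT_FOLDER = "output"


def resolve_output_dir(base_output_dir, output_folder=None):
    """Join the global base directory with a job's output folder.

    Falls back to DEFAULT_OUTPUT_FOLDER when output_folder is missing or blank.
    """
    folder = output_folder or DEFAULT_OUTPUT_FOLDER
    return str(PurePosixPath(base_output_dir) / folder)


first = resolve_output_dir("data/example/", "first-test")
fallback = resolve_output_dir("data/example/")
```

With the example configuration, first is "data/example/first-test", and fallback uses the assumed default folder instead.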
The configuration file allows us to define multiple individual jobs to be run in series. The first block (with key config) defines global parameters for all jobs. Each key after this must be the name of a program in programs.py. Below each of these keys there must be an array with at least one element, and within that array are the parameters specific to that job. This enables us to run multiple jobs using the same program in a row, each with different parameters. This is particularly useful for processing many different raw files.
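To make the structure above concrete, here is a minimal sketch of how such a file could be walked once it has been parsed into Python objects. The PROGRAMS registry, the run_jobs function and the body of example_program are all hypothetical stand-ins. The rule that job parameters override global ones is also an assumption, and the actual dispatch logic in datalab-tools may differ.

```python
def example_program(job_config):
    """Stand-in for a program defined in programs.py (hypothetical body)."""
    return (
        f"{job_config['base_output_dir']}{job_config['output_folder']}: "
        f"{job_config['column_name']}"
    )


# Hypothetical registry mapping configuration keys to program functions.
PROGRAMS = {"example_program": example_program}

# The example configuration file as it would look after YAML parsing.
parsed = {
    "config": {"base_output_dir": "data/example/"},
    "example_program": [
        {"output_folder": "first-test", "column_name": "test 1"},
        {"output_folder": "second-test", "column_name": "test 2"},
    ],
}


def run_jobs(parsed_config, programs):
    """Run every job in file order, merging global parameters into each job."""
    global_params = parsed_config.get("config", {})
    results = []
    for program_name, jobs in parsed_config.items():
        if program_name == "config":
            continue  # global block, not a program
        if program_name not in programs:
            raise KeyError(f"Unknown program: {program_name}")
        if not isinstance(jobs, list) or not jobs:
            raise ValueError(f"{program_name} needs a list with at least one job")
        for job in jobs:
            # Job-specific values override global ones (an assumption).
            results.append(programs[program_name]({**global_params, **job}))
    return results


results = run_jobs(parsed, PROGRAMS)
```

Validating the key names and the list shape before running anything means a typo in the configuration file is reported up front rather than partway through a long series of jobs.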
Parameters not specified in the configuration file can also be provided as command line arguments. However, these must also be defined in arguments.py in order to be recognised.
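The sketch below shows one way declared command line arguments could fill in parameters that a job leaves out, using the standard argparse module. It is not the contents of arguments.py: the --column_name option, the function names, and the rule that configuration file values take precedence over command line values are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare the arguments the CLI recognises; undeclared ones are rejected."""
    parser = argparse.ArgumentParser(description="Sketch of a config-driven CLI")
    parser.add_argument("config_file_path", help="Path to the YAML configuration file")
    # Hypothetical extra parameter, declared so that argparse accepts it.
    parser.add_argument("--column_name", default=None)
    return parser


def apply_cli_defaults(job, cli_args):
    """Fill in parameters the job does not set, using command line values."""
    merged = dict(job)
    for key, value in vars(cli_args).items():
        if key != "config_file_path" and value is not None:
            # Values from the configuration file win (an assumption).
            merged.setdefault(key, value)
    return merged


args = build_parser().parse_args(["config/example.yaml", "--column_name", "from-cli"])
job = apply_cli_defaults({"output_folder": "first-test"}, args)
```

Here job gains column_name from the command line because the job itself did not set it. A job that already defines column_name would keep its own value.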
Tests
To run tests via the Docker container use the following command.
docker compose run --rm --build datalab-tools-tests pytest tests/*
Development
For development we have a Visual Studio Code devcontainer setup. You can use the following command to start the devcontainer.
Once inside the devcontainer, you can install the package in development mode to make it easy to test changes. To do this, use the following command:
pip install -e .
If there are any issues with installation, try installing the dependencies using the following commands:
pip install -r requirements.txt
pip install pytest
We also need to install the LaTeX packages (used for fonts on figures), which require an update on top of the base container. This needs to be fixed properly, but for now run the following to install them:
apt-get update
apt-get install -y latexmk dvipng texlive-latex-extra texlive-fonts-recommended cm-super
Tests can be run using the following command:
pytest