Introduction

Setup and usage

Docker - first time user

All programs can be run using docker-compose to avoid the need to install dependencies. Programs are defined in datalab_tools/programs.py. Run the following commands to test that the tool is working correctly.
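To make the idea of "programs defined in datalab_tools/programs.py" concrete, here is a minimal sketch of how such a program and its lookup table might look. All names here are illustrative assumptions, not the real API.

```python
# Hypothetical sketch of a program as it might appear in
# datalab_tools/programs.py -- names are illustrative, not the real API.

def example_program(params):
    """A minimal program: reads its parameters from a mapping and acts on them."""
    output_folder = params.get("output_folder", "default-output")
    column_name = params.get("column_name", "value")
    return f"writing column '{column_name}' to {output_folder}"

# A simple name-to-function registry lets a runner look programs up by the
# key used in the configuration file.
PROGRAMS = {
    "example_program": example_program,
}

print(PROGRAMS["example_program"]({"output_folder": "first-test",
                                   "column_name": "test 1"}))
```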

git clone https://github.com/Aerscape/datalab-tools.git
cd datalab-tools
touch .env

The .env file is where you set sensitive or local configuration. A complete file would look like the following.

RAW_DATA_DIR=/home/username/data/
DB_HOST=
DB_USER=
DB_PASSWORD=
BOX_CLIENT_ID=
BOX_CLIENT_SECRET=
BOX_ACCESS_TOKEN=

The RAW_DATA_DIR must be supplied. This is where we store customer data that has been provided by means other than s3, so we keep it on the DataLab disk (ideally this will be moved to s3 soon). You can set it to any value you want, especially if you do not intend to use it for real data, but it must be a full path (run pwd to get the full path to your current directory). The other credentials are self-explanatory or explained elsewhere in this documentation.
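As a rough sketch of the constraints described above, the snippet below parses a .env-style block and checks that RAW_DATA_DIR is an absolute path. This is illustrative only; docker-compose reads the real .env file itself, and the parser here is a deliberately minimal stand-in.

```python
import os

def parse_env(lines):
    """Minimal .env parser: one KEY=VALUE per line; blank lines and
    '#' comments are skipped. Illustrative only -- not the real loader."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

sample = """\
RAW_DATA_DIR=/home/username/data/
DB_HOST=
"""
env = parse_env(sample.splitlines())

# RAW_DATA_DIR must be a full (absolute) path, as required above.
assert os.path.isabs(env["RAW_DATA_DIR"])
```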

If you are using the datalab, the docker user may not be set up properly for your account. In that case, run the following command:

sudo ./scripts/add-docker-user.sh

This will avoid the need to use sudo for any future docker commands.

Then to build the container run the following command.

./scripts/rebuild.sh

To run the tests and check everything is set up okay, run the following command.

./scripts/run-tests.sh

Docker - after updating

If you have recently updated the repository then you may need to rebuild the docker image. Use the following commands to do this:

./scripts/rebuild.sh
./scripts/run-tests.sh

Program setup

In this example we are using the example configuration file to run a simple test program. Once it has completed, a file will be created in the data directory. Additional command line arguments can be supplied after the configuration file path.

Environment variables must be set for any processes that require them, such as accessing databases or external APIs which require tokens. To set these create a file called .env in the datalab-tools directory. An example .env file is given below.

DB_HOST=123.11.1.00
DB_USER=myuser
DB_PASSWORD=<...>
PL_API_KEY=<...>

From November 2025 we no longer require the DB_NAME=postgres parameter as this is handled automatically based on the user name.

This file sets the configuration required to access the Aershed database and the Planet API. Other services require additional parameters; add these as required.

All programs take a Config object. This object contains configuration sourced from the supplied YAML configuration file, additional command line arguments or environment variables.
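A minimal sketch of how such a Config object could merge its three sources is shown below. The merge order (environment variables, then YAML, then command line arguments) is an assumption for illustration, not the tool's documented behaviour.

```python
import os

class Config:
    """Illustrative Config: merges environment variables, YAML parameters,
    and command line arguments. Priority shown here is an assumption."""

    def __init__(self, yaml_params, cli_args, env=None):
        env = os.environ if env is None else env
        merged = dict(env)          # lowest priority: environment variables
        merged.update(yaml_params)  # then the YAML configuration file
        merged.update(cli_args)     # command line arguments win
        self._values = merged

    def get(self, key, default=None):
        return self._values.get(key, default)

cfg = Config(
    {"output_folder": "first-test", "column_name": "test 1"},  # from YAML
    {"output_folder": "cli-test"},                             # from the CLI
    env={"DB_HOST": "localhost"},
)
```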

Each run is defined by a configuration file. This file controls what programs run and the configuration that is passed into them. One configuration file can be used to drive several programs, or multiple runs of the same program with different configuration.

config:
  base_output_dir: "data/example/"

example_program:
  - output_folder: "first-test"
    column_name: "test 1"
  - output_folder: "second-test"
    column_name: "test 2"

This configuration file runs the example_program program twice with different parameters for output_folder and column_name. The output_folder should be specified for every program that produces file output; if left blank, a default value is used, which is then concatenated with the base_output_dir specified at the top of the file.

The configuration file allows us to define multiple individual jobs to be run in series. The first block (with key config) defines global parameters for all jobs. Each key after this must be the name of a program in programs.py. Below each of these keys there must be an array with at least one element, and within that array are the parameters specific to that job. This enables us to run multiple jobs using the same program in a row, each with different parameters. This is particularly useful for processing many different raw files.
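The run loop described above can be sketched as follows. The parsed dict stands in for what yaml.safe_load would return for the example file; the runner and program functions are hypothetical names for illustration.

```python
# Sketch of the implied run loop: the `config` block supplies globals,
# every other top-level key names a program in programs.py, and each list
# entry under it is one job, run in series.

parsed = {  # what yaml.safe_load would return for the example file above
    "config": {"base_output_dir": "data/example/"},
    "example_program": [
        {"output_folder": "first-test", "column_name": "test 1"},
        {"output_folder": "second-test", "column_name": "test 2"},
    ],
}

def run_all(parsed, programs):
    globals_ = parsed.get("config", {})
    results = []
    for program_name, jobs in parsed.items():
        if program_name == "config":
            continue  # the config block is not a program
        for job in jobs:  # each array element is one job, run in series
            params = {**globals_, **job}  # job params override globals
            results.append(programs[program_name](params))
    return results

def example_program(params):
    # output_folder is concatenated with base_output_dir, as described above.
    folder = params.get("output_folder") or "default"
    return params["base_output_dir"] + folder

print(run_all(parsed, {"example_program": example_program}))
```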

Parameters not specified in the configuration file can also be provided in the command line arguments. However, these must also be defined in arguments.py in order to be recognised.
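As a hedged illustration of what "defined in arguments.py" might mean, the sketch below declares command line parameters with argparse. The argument names here are placeholders, not the real ones.

```python
# Hypothetical sketch of how arguments.py might declare recognised
# command line parameters; the flag names are invented for illustration.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="datalab-tools runner")
    parser.add_argument("config_file",
                        help="path to the YAML configuration file")
    parser.add_argument("--output-folder",
                        help="override output_folder from the config file")
    parser.add_argument("--column-name",
                        help="override column_name from the config file")
    return parser

# Arguments after the configuration file path override the file's values.
args = build_parser().parse_args(["run.yaml", "--output-folder", "cli-test"])
```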

Tests

To run all tests, use the command

./scripts/run-tests.sh

To run a specific test, add an additional argument

./scripts/run-tests.sh tests/test_timezone_finder.py::test_timezone_finder_daylight_savings

Development

Note: the devcontainer is not actively maintained right now. Due to issues with excessive memory use on the EC2 instance, development currently happens outside the container, with all tasks run via docker compose.

For development we have a Visual Studio Code devcontainer setup.

Once inside the dev container, you can install the package in development mode to make it easy to test changes. To do this, use the following command

pip install -e .

If there are any issues with installation, try installing the dependencies using the following commands

pip install -r requirements.txt
pip install pytest

We also need to install the latex packages (used for fonts on figures), which require an update to the base container. This needs to be fixed properly, but for now simply run the following to get them installed

apt-get update
apt-get install -y latexmk dvipng texlive-latex-extra texlive-fonts-recommended cm-super

Tests can be run using the command

pytest
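For orientation, a pytest test is just a plain function whose name starts with test_. The example below shows the shape; the helper under test and its behaviour are invented for illustration, not taken from the repository.

```python
# A minimal pytest-style test, showing the shape of tests in tests/.
# The helper and assertions are illustrative, not real repository code.

def slugify(name):
    """Example helper under test: lowercases and hyphenates a column name."""
    return name.strip().lower().replace(" ", "-")

def test_slugify():
    # pytest collects any function named test_* and runs its assertions.
    assert slugify("Test 1") == "test-1"
```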