Box integration
Box is a data storage service that some customers use to provide data to Aerscape. The data stored there can be viewed and downloaded manually, but ideally we pull it directly into the datalab environment.
The Box integration is set up to treat the files in Box as if they were on a local disk. When writing a script that uses these files, no special handling is required if you use the IFileLoader interface. To use the Box file system instead of the local file system, set box_integration to True in the configuration file.
The following environment variables must also be set in either the base or the environment-specific .env file:
BOX_CLIENT_ID=
BOX_CLIENT_SECRET=
BOX_ACCESS_TOKEN=
Indexing
The Box API does not allow us to download a file based on the path at which it appears on the Box website. Instead, every file is referenced by an identifier, which makes it difficult to treat Box as a normal file system. To allow us to specify the files we want by path, we run a process that walks through the whole Box repository and builds a list of paths and their file identifiers. This index is cached in the directory given in the configuration file. The following is an example of how to cache this.
config:
  owner: "TEST"
  base_input_dir: ""
  base_output_dir: ""
  box_integration: True
  box_cache_directory: "${PROCESSING_DIR}/box_cache"
cache_box_index:
  - output_folder: ""
The cache_box_index program must be run before any program that depends on the index.
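The indexing step described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual cache_box_index implementation: the folder tree is a hypothetical in-memory stand-in for Box (no Box API calls are made), and the names FAKE_BOX_TREE, walk, build_index, save_index and load_index, as well as the JSON file format, are invented for the example.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple

# Hypothetical stand-in for a Box repository: a folder maps each name to
# either a nested dict (sub-folder) or a string (the file's identifier).
FAKE_BOX_TREE = {
    "customer_a": {
        "2023": {"flight_01.csv": "1001", "flight_02.csv": "1002"},
        "readme.txt": "1003",
    },
    "shared.json": "1004",
}


def walk(folder: dict, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield a (path, file_id) pair for every file below `folder`."""
    for name, entry in folder.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(entry, dict):
            # Sub-folder: recurse, extending the path prefix.
            yield from walk(entry, path)
        else:
            # File: the entry is its identifier.
            yield path, entry


def build_index(root: dict) -> Dict[str, str]:
    """Build the path -> identifier lookup for the whole tree."""
    return dict(walk(root))


def save_index(index: Dict[str, str], cache_dir: Path) -> Path:
    """Write the index as JSON into the cache directory (assumed format)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / "box_index.json"
    target.write_text(json.dumps(index, indent=2))
    return target


def load_index(cache_dir: Path) -> Dict[str, str]:
    """Read back an index previously written by save_index."""
    return json.loads((cache_dir / "box_index.json").read_text())


index = build_index(FAKE_BOX_TREE)
```

Once such an index exists, a script can resolve a path like "customer_a/2023/flight_01.csv" to its identifier with a plain dictionary lookup, which is why the index has to be built before anything that reads files by path.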
Caching of files
Files downloaded from Box are also cached to avoid downloading them again on subsequent runs. If the same output directory is used, the Box file loader first looks for the file locally and only downloads it if it is not found.
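The look-locally-first behaviour can be sketched as below. This is a minimal illustration, not the real Box file loader: fetch_cached and its download callback are hypothetical names, and the callback stands in for whatever actually retrieves the bytes from Box.

```python
from pathlib import Path
from typing import Callable


def fetch_cached(relative_path: str, cache_dir: Path,
                 download: Callable[[str], bytes]) -> Path:
    """Return a local copy of the file, downloading it only if it is missing."""
    local = cache_dir / relative_path
    if local.exists():
        # Cache hit: reuse the copy left by an earlier run.
        return local
    # Cache miss: create the parent folders, then store the downloaded bytes.
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(download(relative_path))
    return local
```

With this shape, a second run that points at the same directory never calls the download callback for files it already holds, while a run with a fresh directory downloads everything again.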