When working with large datasets, it may not be possible to store all of the necessary data in one place while running computations on it. Instead, the data has to be processed in smaller sections where possible. Without automation, this would require manually transferring data and deleting data that is no longer needed. The Globus CLI tool lets us automate both the transfer and the cleanup of data from within a job script.

Data Transfer Job Script

This job script loads the Globus CLI tool, transfers an entire folder from Blackmore to Tempest, and then waits for the transfer to finish completely before the job itself finishes. This keeps the processing job script from running while data is still being transferred. The first block of code below logs in to Globus with the CLI and grants the session the required consents. This is a one-time requirement that should be run outside of the job; it gives the CLI access to both the Blackmore and Tempest collections.

module load Globus-CLI
globus login
globus session consent 'urn:globus:auth:scope:transfer.api.globus.org:all[*https://auth.globus.org/scopes/5485832e-723e-4b52-8472-0410e90902ad/data_access *https://auth.globus.org/scopes/0dc1297f-9868-4c68-8637-c9b6bd65d3aa/data_access]'

Once you are logged in, you can use the tool by loading the module and running its built-in commands; the full Globus CLI documentation describes every available command and option. The code below transfers an entire folder using the '--recursive' option. It can be placed in a job that requests minimal resources compared to the job script that runs the processing code.

module load Globus-CLI

# Log in to Globus (the one-time login above may already cover this).
globus login

# Verify which Globus identity you are logged in as.
globus whoami

# Collection UUIDs for Blackmore and Tempest.
export blackmore=5485832e-723e-4b52-8472-0410e90902ad
export tempest=0dc1297f-9868-4c68-8637-c9b6bd65d3aa

# Recursively transfer the source folder from Blackmore to Tempest.
globus transfer $blackmore:~/path/to/source/ $tempest:~/path/to/dst/ --recursive
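
Because the downstream jobs depend on the data being fully staged, the transfer job should not exit until Globus reports the task as complete. The sketch below shows one way to do this, assuming the same placeholder paths as above: the CLI's '--jmespath' and '--format unix' output options capture just the task ID, and 'globus task wait' blocks until that task finishes.

# Start the transfer and capture only the task ID from the CLI output.
task_id=$(globus transfer $blackmore:~/path/to/source/ $tempest:~/path/to/dst/ \
    --recursive --jmespath 'task_id' --format unix)

# Wait until the transfer task finishes so that the job only completes
# (and releases its dependency) once the data is fully on Tempest.
globus task wait "$task_id"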

Processing Job Script

This portion of the workflow operates like any regular job script, with resources requested to match whatever the job entails. You can reuse a previous job script that ran code on a specific dataset here, as sketched below.
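
For reference, here is a minimal sketch of what run_processData.sbatch (used in the dependency script below) might look like. The resource requests, module, program name, and paths are all placeholders to be replaced with whatever your actual workload needs.

#!/bin/bash
#SBATCH --job-name=processData   # placeholder job name
#SBATCH --cpus-per-task=4        # adjust to your workload
#SBATCH --mem=16G                # adjust to your workload
#SBATCH --time=02:00:00          # adjust to your workload

# Load whatever software your analysis needs (placeholder module).
module load Python

# Run your processing code on the transferred data (placeholder script and path).
python process_data.py ~/path/to/dst/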

Cleanup Job Script

Once we have our results and no longer need the data on Tempest, we can simply remove the transferred files. This can be run at the end of the processing job script, or put in its own job script that, like the data transfer script, needs only minimal resources.

rm -rf folder/that/contains/transfer/data
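
If the cleanup is submitted as its own job (run_cleanupData.sbatch in the dependency script below), a minimal sketch might look like the following; the resource requests and folder path are placeholders.

#!/bin/bash
#SBATCH --job-name=cleanupData   # placeholder job name
#SBATCH --cpus-per-task=1        # cleanup needs minimal resources
#SBATCH --mem=1G
#SBATCH --time=00:10:00

# Remove the transferred data now that the results have been produced.
rm -rf folder/that/contains/transfer/data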

Dependency Script

Since the job scripts must run in a specific order, we can use job dependencies so that each job launches only after the job it depends on has completed. This lets us launch all of the jobs with a single command.

#!/bin/bash
  
# Submit the data transfer job and store its job ID in the transferJob variable.
# The cut command trims sbatch's output down to just the job ID.
transferJob=$(sbatch run_transferData.sbatch | cut -d ' ' -f 4)

# Submit the data processing job to run after the data transfer job completes successfully.
processJob=$(sbatch --dependency=afterok:"$transferJob" run_processData.sbatch | cut -d ' ' -f 4)

# Submit the cleanup job to run after the data processing job completes successfully.
sbatch --dependency=afterok:"$processJob" run_cleanupData.sbatch
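
Saving the lines above as a script (the name submit_pipeline.sh below is just a placeholder) and running it submits all three jobs at once; the processing and cleanup jobs wait in the queue with a '(Dependency)' reason until the job they depend on finishes successfully.

# Submit the whole pipeline, then check on the queued jobs.
bash submit_pipeline.sh
squeue --user "$USER"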