Step Data Transfer

Objective

  • Learn how to pass data between steps
  • Use the built-in data transfer system
  • Create Python scripts that output data
  • Connect steps using input/output configurations

Understanding Step Data Transfer

The orchestrator provides a built-in system for transferring data between steps. This allows one step to produce data that can be used by subsequent steps. Data is transferred through a special output format that works in both shell commands and Python scripts.

Creating Data-Producing Scripts

Let's create two Python scripts: one that generates some data and another that processes it.

generate_data.py
import random
from cosmotech.orchestrator.utils.logger import log_data

# Generate a random temperature between 0 and 30
temperature = random.uniform(0, 30)

# Output the temperature using the data logger
log_data("temperature", f"{temperature:.2f}")

# Regular logging still works
print(f"Generated temperature: {temperature:.2f}°C")

process_data.py
import os

# Get the temperature from environment variable
temperature = float(os.environ["INPUT_TEMP"])

# Process the temperature
if temperature < 10:
    category = "Cold"
elif temperature < 20:
    category = "Mild"
else:
    category = "Hot"

print(f"Temperature {temperature:.2f}°C is categorized as: {category}")

Writing the Orchestration File

Now let's create an orchestration file that connects these scripts using the data transfer system:

temperature_analysis.json
{
  "steps": [
    {
      "id": "generate-temp",
      "command": "python generate_data.py",
      "description": "Generate a random temperature",
      "outputs": {
        "temperature": {
          "description": "Generated temperature value",
          "defaultValue": "20.0"
        }
      }
    },
    {
      "id": "analyze-temp",
      "command": "python process_data.py",
      "description": "Analyze the temperature",
      "inputs": {
        "temp": {
          "stepId": "generate-temp",
          "output": "temperature",
          "as": "INPUT_TEMP",
          "defaultValue": "15.0"
        }
      },
      "precedents": ["generate-temp"]
    }
  ]
}

Let's break down the key elements:

The generate-temp step:

  • Defines an output named "temperature"
  • Uses the log_data function to output the value
  • Provides a default value as fallback

The analyze-temp step:

  • Defines an input that connects to the previous step's output
  • Maps the input to an environment variable named "INPUT_TEMP"
  • Lists "generate-temp" as a precedent to ensure correct ordering
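Conceptually, the wiring above works like a producer/consumer pipe: the orchestrator scans a step's stdout for output markers, stores the captured values, and injects them into downstream steps as environment variables. The following is a self-contained sketch of that mechanism, not the orchestrator's actual internals:

```python
import os
import subprocess
import sys

# Producer step: emits a value using the marker format the orchestrator scans for.
producer = subprocess.run(
    [sys.executable, "-c", 'print("CSM-OUTPUT-DATA:temperature:23.50")'],
    capture_output=True, text=True,
)

# Capture phase: collect every marked output from the producer's stdout.
outputs = {}
for line in producer.stdout.splitlines():
    if line.startswith("CSM-OUTPUT-DATA:"):
        _, name, value = line.split(":", 2)
        outputs[name] = value

# Transfer phase: expose the captured value to the consumer as INPUT_TEMP.
consumer_env = {**os.environ, "INPUT_TEMP": outputs["temperature"]}
consumer = subprocess.run(
    [sys.executable, "-c", 'import os; print(os.environ["INPUT_TEMP"])'],
    capture_output=True, text=True, env=consumer_env,
)
print(consumer.stdout.strip())  # → 23.50
```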

Running the Example

You can run this example with:

csm-orc run temperature_analysis.json

The output will show:

  • The generated temperature
  • The analysis result
  • Debug logs showing the data transfer

Alternative Output Methods

There are two ways to output data from a step:

The Python logger (recommended for Python scripts):

from cosmotech.orchestrator.utils.logger import log_data
log_data("name", "value")

The direct output format (good for shell commands):

echo "CSM-OUTPUT-DATA:name:value"

Both methods achieve the same result, but the logger provides a cleaner interface for Python code.
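Assuming log_data simply writes the same marker to stdout (the real implementation may differ), a minimal stand-in makes the equivalence concrete:

```python
# Hypothetical stand-in for cosmotech.orchestrator.utils.logger.log_data,
# assuming it emits the CSM-OUTPUT-DATA marker on stdout.
def log_data(name: str, value: str) -> None:
    print(f"CSM-OUTPUT-DATA:{name}:{value}")

# Produces the same line as: echo "CSM-OUTPUT-DATA:temperature:23.50"
log_data("temperature", "23.50")
```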

Advanced Features

The data transfer system supports several advanced features:

Optional outputs and inputs:

{
  "outputs": {
    "debug_data": {
      "description": "Optional debug information",
      "optional": true
    }
  }
}
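A step can emit an optional output only when it has something to report; the orchestrator should not treat its absence as an error. A sketch using the direct output format (the DEBUG_MODE flag is a hypothetical name for this illustration):

```python
import os

# Emit the optional "debug_data" output only when debugging is enabled.
# DEBUG_MODE is a hypothetical flag, not part of the orchestrator.
if os.environ.get("DEBUG_MODE") == "1":
    print("CSM-OUTPUT-DATA:debug_data:cache hits=42")
else:
    print("No debug output emitted")
```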

Default values as fallbacks:

{
  "inputs": {
    "data": {
      "stepId": "previous-step",
      "output": "result",
      "as": "INPUT_DATA",
      "defaultValue": "fallback-value"
    }
  }
}
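On the consuming side, the orchestrator injects either the transferred value or the declared defaultValue into the environment. A script can still read the variable defensively, which lets it run outside the orchestrator too:

```python
import os

# INPUT_DATA is set by the orchestrator to either the upstream output or
# the declared defaultValue; the script-level fallback only matters when
# the script is run on its own.
data = os.environ.get("INPUT_DATA", "fallback-value")
print(f"Received: {data}")
```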

Debug logging of transfers:
You can enable detailed logging of data transfers by setting the LOG_LEVEL environment variable:

# Using environment variable
LOG_LEVEL=debug csm-orc run temperature_analysis.json

# Or using command line flag
csm-orc --log-level debug run temperature_analysis.json

This will show:

  • When data is captured from a step's output
  • When data is transferred between steps
  • Default value usage
  • Missing value warnings

Best Practices

  • Always provide default values for critical inputs
  • Use descriptive names for outputs and inputs
  • Document the expected format of data
  • Use the Python logger in Python scripts
  • Test with debug logging enabled during development