Step Data Transfer

Objective

  • Learn how to pass data between steps
  • Use the built-in data transfer system
  • Create Python scripts that output data
  • Connect steps using input/output configurations

Understanding Step Data Transfer

The orchestrator provides a built-in system for transferring data between steps. This allows one step to produce data that can be used by subsequent steps. Data is transferred through a special output format that works in both shell commands and Python scripts.

Creating Data-Producing Scripts

Let's create two Python scripts: one that generates some data and another that processes it.

generate_data.py
import random
from cosmotech.orchestrator.utils.logger import log_data

# Generate a random temperature between 0 and 30
temperature = random.uniform(0, 30)

# Output the temperature using the data logger
log_data("temperature", f"{temperature:.2f}")

# Regular logging still works
print(f"Generated temperature: {temperature:.2f}°C")

process_data.py
import os

# Get the temperature from environment variable
temperature = float(os.environ["INPUT_TEMP"])

# Process the temperature
if temperature < 10:
    category = "Cold"
elif temperature < 20:
    category = "Mild"
else:
    category = "Hot"

print(f"Temperature {temperature:.2f}°C is categorized as: {category}")

Writing the Orchestration File

Now let's create an orchestration file that connects these scripts using the data transfer system:

temperature_analysis.json
{
  "steps": [
    {
      "id": "generate-temp",
      "command": "python generate_data.py",
      "description": "Generate a random temperature",
      "outputs": {
        "temperature": {
          "description": "Generated temperature value",
          "defaultValue": "20.0"
        }
      }
    },
    {
      "id": "analyze-temp",
      "command": "python process_data.py",
      "description": "Analyze the temperature",
      "inputs": {
        "temp": {
          "stepId": "generate-temp",
          "output": "temperature",
          "as": "INPUT_TEMP",
          "defaultValue": "15.0"
        }
      },
      "precedents": ["generate-temp"]
    }
  ]
}

Let's break down the key elements:

The generate-temp step:

  • Defines an output named "temperature"
  • Uses the log_data function to output the value
  • Provides a default value as fallback

The analyze-temp step:

  • Defines an input that connects to the previous step's output
  • Maps the input to an environment variable named "INPUT_TEMP"
  • Lists "generate-temp" as a precedent to ensure correct ordering
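Conceptually, the wiring above works like a producer/consumer pipe: the orchestrator scans a step's stdout for output markers, stores the captured values, and injects them into downstream steps as environment variables. The following is a self-contained sketch of that mechanism, not the orchestrator's actual internals:

```python
import os
import subprocess
import sys

# Producer step: emits a value using the marker format the orchestrator scans for.
producer = subprocess.run(
    [sys.executable, "-c", 'print("CSM-OUTPUT-DATA:temperature:23.50")'],
    capture_output=True, text=True,
)

# Capture phase: collect every marked output from the producer's stdout.
outputs = {}
for line in producer.stdout.splitlines():
    if line.startswith("CSM-OUTPUT-DATA:"):
        _, name, value = line.split(":", 2)
        outputs[name] = value

# Transfer phase: expose the captured value to the consumer as INPUT_TEMP.
consumer_env = {**os.environ, "INPUT_TEMP": outputs["temperature"]}
consumer = subprocess.run(
    [sys.executable, "-c", 'import os; print(os.environ["INPUT_TEMP"])'],
    capture_output=True, text=True, env=consumer_env,
)
print(consumer.stdout.strip())  # → 23.50
```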

Running the Example

You can run this example with:

csm-orc run temperature_analysis.json

The output will show:

  • The generated temperature
  • The analysis result
  • Debug logs showing the data transfer

Alternative Output Methods

There are two ways to output data from a step:

The Python logger (recommended for Python scripts):

from cosmotech.orchestrator.utils.logger import log_data
log_data("name", "value")

The direct output format (good for shell commands):

echo "CSM-OUTPUT-DATA:name:value"

Both methods achieve the same result, but the logger provides a cleaner interface for Python code.
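Assuming log_data simply writes the same marker to stdout (the real implementation may differ), a minimal stand-in makes the equivalence concrete:

```python
# Hypothetical stand-in for cosmotech.orchestrator.utils.logger.log_data,
# assuming it emits the CSM-OUTPUT-DATA marker on stdout.
def log_data(name: str, value: str) -> None:
    print(f"CSM-OUTPUT-DATA:{name}:{value}")

# Produces the same line as: echo "CSM-OUTPUT-DATA:temperature:23.50"
log_data("temperature", "23.50")
```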

Advanced Features

The data transfer system supports several advanced features:

Optional outputs and inputs:

{
  "outputs": {
    "debug_data": {
      "description": "Optional debug information",
      "optional": true
    }
  }
}
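A step can emit an optional output only when it has something to report; the orchestrator should not treat its absence as an error. A sketch using the direct output format (the DEBUG_MODE flag is a hypothetical name for this illustration):

```python
import os

# Emit the optional "debug_data" output only when debugging is enabled.
# DEBUG_MODE is a hypothetical flag, not part of the orchestrator.
if os.environ.get("DEBUG_MODE") == "1":
    print("CSM-OUTPUT-DATA:debug_data:cache hits=42")
else:
    print("No debug output emitted")
```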

Default values as fallbacks:

{
  "inputs": {
    "data": {
      "stepId": "previous-step",
      "output": "result",
      "as": "INPUT_DATA",
      "defaultValue": "fallback-value"
    }
  }
}
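On the consuming side, the orchestrator injects either the transferred value or the declared defaultValue into the environment. A script can still read the variable defensively, which lets it run outside the orchestrator too:

```python
import os

# INPUT_DATA is set by the orchestrator to either the upstream output or
# the declared defaultValue; the script-level fallback only matters when
# the script is run on its own.
data = os.environ.get("INPUT_DATA", "fallback-value")
print(f"Received: {data}")
```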

Debug logging of transfers:
You can enable detailed logging of data transfers by setting the LOG_LEVEL environment variable:

# Using environment variable
LOG_LEVEL=debug csm-orc run temperature_analysis.json

# Or using command line flag
csm-orc --log-level debug run temperature_analysis.json

This will show:

  • When data is captured from a step's output
  • When data is transferred between steps
  • Default value usage
  • Missing value warnings

Best Practices

  • Always provide default values for critical inputs
  • Use descriptive names for outputs and inputs
  • Document the expected format of data
  • Use the Python logger in Python scripts
  • Test with debug logging enabled during development