ETL¶
ETL processes for data transformation and loading in your solution.
Overview¶
ETL (Extract, Transform, Load) run templates handle data processing workflows. They differ from simulation templates by focusing on data ingestion, transformation, and output to various storage systems rather than running simulations.
What are ETL Templates?¶
ETL templates are run templates specifically designed for:
- Extract: Getting data from external sources
- Transform: Converting data formats and structures
- Load: Storing processed data in target systems
Key Differences from Simulation Templates¶
Compared with simulation templates, ETL templates:
- Use the Runner API (newer) instead of the Scenario API (legacy)
- Are marked with "tags": ["data-source"] for web app integration (see the sketch after this list)
- Focus on data processing rather than simulation execution
- Often create datasets for later use in simulations
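A minimal sketch of how a run template might carry this marker in the solution description; the "tags": ["data-source"] entry comes from the list above, while the surrounding field names and values are illustrative:

```json
{
  "runTemplates": [
    {
      "id": "etl_from_excel",
      "name": "ETL - Excel to dataset",
      "tags": ["data-source"]
    }
  ]
}
```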
API Versions and Compatibility¶
Scenario API vs Runner API¶
- Scenario API: Legacy approach (API v3.x, WebApp v6.x)
- Runner API: Modern approach (API v4.x+, WebApp v7.x+)
- Compatibility Code: Some solutions support both APIs during the transition
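Since this page only names the two ID variables, here is a minimal version-switching sketch built on those alone; everything else (messages, structure) is illustrative:

```python
import os

# v4.x+ platforms inject CSM_RUNNER_ID; legacy v3.x platforms inject
# CSM_SCENARIO_ID instead (see the Environment Variables section below).
runner_id = os.environ.get("CSM_RUNNER_ID")
scenario_id = os.environ.get("CSM_SCENARIO_ID")

if runner_id:
    print(f"Runner API mode (modern), runner: {runner_id}")
    # ... use Runner API endpoints here
elif scenario_id:
    print(f"Scenario API mode (legacy), scenario: {scenario_id}")
    # ... use Scenario API endpoints here
else:
    raise RuntimeError("Neither CSM_RUNNER_ID nor CSM_SCENARIO_ID is set")
```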
Environment Variables¶
Different variables are provided:
- Legacy: CSM_SCENARIO_ID, scenario-based variables
- Modern: CSM_RUNNER_ID, runner-based variables
ETL Workflow Pattern¶
Typical ETL workflow:
1. Download Data: Use csm-data api run-load-data or custom fetch
2. Transform: Convert Excel/CSV to standardized format
3. Validate: Check data quality and structure
4. Load: Send to Twin Data Layer, PostgreSQL, or other storage
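One way to express the four steps as a JSON step sequence; the schema shown here (step ids, commands, precedents) is an illustrative sketch, not a guaranteed platform format:

```json
{
  "steps": [
    { "id": "download",  "command": "csm-data", "arguments": ["api", "run-load-data"] },
    { "id": "transform", "command": "python", "arguments": ["transform.py"], "precedents": ["download"] },
    { "id": "validate",  "command": "python", "arguments": ["validate.py"], "precedents": ["transform"] },
    { "id": "load",      "command": "python", "arguments": ["load.py"], "precedents": ["validate"] }
  ]
}
```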
Data Storage Options¶
Available options include:
- Twin Data Layer (TDL): Graph-based storage (being deprecated)
- Workspace Files: Direct file upload to the platform
- PostgreSQL: Relational database storage
- Azure Data Explorer (ADX): Analytics database
- Custom Storage: S3, Azure Blob, etc.
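For instance, a Load step targeting PostgreSQL can stay very small. A sketch assuming the psycopg2 driver, a reachable database, and a pre-created table; the connection string, table, and columns are illustrative:

```python
import csv

import psycopg2  # assumed driver; any PostgreSQL client library works

conn = psycopg2.connect("postgresql://etl_user:secret@localhost:5432/etl_db")
# The connection context manager commits the transaction on success
with conn, conn.cursor() as cur, open("dataset/csv/customers.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        cur.execute("INSERT INTO customers (id, name) VALUES (%s, %s)", row)
conn.close()
```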
Using CoAL for ETL¶
The CosmoTech Acceleration Library provides:
- csm-data commands for common operations
- Data store (SQLite) for intermediate processing
- Standard connectors for various services
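Only csm-data api run-load-data is referenced on this page, and the available subcommands and options vary across CoAL versions, so this sketch sticks to that command and to --help for discovering the rest:

```bash
# List the command groups available in the installed CoAL version
csm-data --help

# Inspect the download command used in the workflow above
csm-data api run-load-data --help
```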
Web App Integration¶
ETL templates appear in the web app:
- Data Source Templates: Tagged for dataset creation
- Create Dataset: Runs ETL to generate new datasets
- Dataset Selection: Results can be used in simulations
Common ETL Patterns¶
Excel to CSV Transformation¶
1. Upload Excel file via web app
2. ETL reads and processes Excel data
3. Output standardized CSV files
4. Create dataset for simulation use
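A minimal sketch of the transform step, assuming pandas (plus openpyxl for .xlsx files) is available in the run environment; the file names and sheet layout are illustrative:

```python
from pathlib import Path

import pandas as pd  # reading .xlsx also requires the openpyxl package

# Illustrative paths: an uploaded workbook in, standardized CSVs out
source = Path("dataset/input.xlsx")
out_dir = Path("dataset/csv")
out_dir.mkdir(parents=True, exist_ok=True)

# Write one CSV per sheet so downstream steps get flat tables
for sheet_name, frame in pd.read_excel(source, sheet_name=None).items():
    frame.to_csv(out_dir / f"{sheet_name}.csv", index=False)
```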
API Data Integration¶
1. Fetch data from external APIs
2. Transform to required format
3. Store in platform-compatible format
4. Make available for scenarios
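A sketch of the fetch-and-store steps, assuming the requests package is installed; the endpoint URL and record shape are made up for illustration:

```python
import csv

import requests  # assumed available in the run environment

# Hypothetical external endpoint returning a JSON array of flat records
response = requests.get("https://example.com/api/records", timeout=30)
response.raise_for_status()
records = response.json()

# Store in a platform-compatible flat format (CSV); assumes homogeneous
# records, i.e. every record has the same keys as the first one
if records:
    with open("dataset/records.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```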
Local Development¶
Test ETL templates locally:
- Use the same environment variables as production (see the sketch below)
- Test with sample data files
- Validate output formats
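A sketch of a local test run, assuming a Unix shell; the values and the run_etl.py entry point are placeholders, and CSM_ORGANIZATION_ID / CSM_WORKSPACE_ID are shown only as examples of other platform-injected variables (check your workspace for the full set):

```bash
# Mirror the variables the platform would inject (values are placeholders)
export CSM_RUNNER_ID="r-local-test"
export CSM_ORGANIZATION_ID="o-local-test"
export CSM_WORKSPACE_ID="w-local-test"

# Run the ETL entry point against sample data (script name is illustrative)
python run_etl.py
```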
Best Practices¶
- Design for data validation at each step
- Handle errors gracefully
- Use intermediate storage (data store) for complex transformations
- Test with various data sizes
- Implement proper logging for debugging
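To make the logging and per-step validation points concrete, a small sketch using only the standard library; the column check is a placeholder for whatever rules your data actually requires:

```python
import csv
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("etl")

def validate_csv(path: str, required_columns: set[str]) -> None:
    """Fail fast, with a useful log line, when a step produced bad data."""
    with open(path, newline="") as f:
        header = set(next(csv.reader(f), []))
    missing = required_columns - header
    if missing:
        log.error("%s is missing columns: %s", path, sorted(missing))
        raise ValueError(f"invalid file: {path}")
    log.info("%s passed validation", path)

# Illustrative call: file path and expected columns are placeholders
validate_csv("dataset/csv/customers.csv", {"id", "name"})
```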
Migration Considerations¶
When moving from Scenario to Runner API:
- Update environment variable references
- Change API endpoints in custom code
- Test compatibility with both web app versions