Contributing to CoAL¶
Objective
- Set up your development environment with Black and pre-commit hooks
- Understand the CoAL architecture and contribution workflow
- Learn how to implement a new feature with a practical example
- Master the process of writing unit tests and documentation
- Successfully submit a pull request
Introduction¶
Contributing to the CosmoTech Acceleration Library (CoAL) is a great way to enhance the platform's capabilities and share your expertise with the community. This tutorial will guide you through the entire process of contributing a new feature to CoAL, from setting up your development environment to submitting a pull request.
We'll use a practical example throughout this tutorial: implementing a new store write functionality for MongoDB and creating a corresponding csm-data command. This example will demonstrate all the key aspects of the contribution process, including:
- Setting up your development environment
- Understanding the CoAL architecture
- Implementing new functionality
- Creating CLI commands
- Writing unit tests
- Documenting your work
- Submitting a pull request
By the end of this tutorial, you'll have a solid understanding of how to contribute to CoAL and be ready to implement your own features.
Setting Up Your Development Environment¶
Before you start contributing, you need to set up your development environment. This includes forking and cloning the repository, installing dependencies, and configuring code formatting tools.
Forking and Cloning the Repository¶
- Fork the CosmoTech-Acceleration-Library repository on GitHub
- Clone your fork locally:
git clone https://github.com/your-username/CosmoTech-Acceleration-Library.git
cd CosmoTech-Acceleration-Library
- Add the upstream repository as a remote:
git remote add upstream https://github.com/Cosmo-Tech/CosmoTech-Acceleration-Library.git
Installing Dependencies¶
Install the package in development mode along with all development dependencies:
pip install -e ".[dev]"
This installs the package in editable mode, so changes to the code take effect without reinstalling. It also installs all the development dependencies specified in the pyproject.toml file.
Setting Up Black for Code Formatting¶
CoAL uses Black for code formatting to ensure consistent code style across the codebase. Black is configured in the pyproject.toml file with specific settings for line length, target Python version, and file exclusions.
To manually run Black on your codebase:
# Format all Python files in the project
python -m black .
# Format a specific directory
python -m black cosmotech/coal/
# Check if files would be reformatted without actually changing them
python -m black --check .
# Show diff of changes without writing files
python -m black --diff .
Configuring Pre-commit Hooks¶
CoAL uses pre-commit hooks to automatically run checks before each commit, including Black formatting, trailing whitespace removal, and test coverage verification.
To install pre-commit and register the hooks in your local clone:
pip install pre-commit
pre-commit install
Now, when you commit changes, the pre-commit hooks will automatically run and check your code. If any issues are found, the commit will be aborted, and you'll need to fix the issues before committing again.
The pre-commit configuration includes:
- Trailing whitespace removal
- End-of-file fixer
- YAML syntax checking
- Black code formatting
- Pytest checks with coverage requirements
- Verification that all functions have tests
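This tutorial does not show how the "all functions have tests" hook is implemented. As a rough sketch only, a check of this kind could parse a module and its test file with the standard `ast` module and report functions that have no matching test. The helper names and the `test_<function name>` naming convention below are assumptions made for illustration, not the project's actual rules.

```python
import ast


def public_functions(source: str) -> set[str]:
    """Return the names of top-level public functions defined in `source`."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_")
    }


def collect_test_names(source: str) -> set[str]:
    """Return every function or method name in `source` that starts with 'test_'."""
    return {
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
    }


def untested(module_source: str, test_source: str) -> set[str]:
    """Return public functions with no test named 'test_<function name>...'."""
    tests = collect_test_names(test_source)
    return {
        name
        for name in public_functions(module_source)
        if not any(test.startswith(f"test_{name}") for test in tests)
    }


# Hypothetical module and test sources, inlined as strings for the example
module_src = "def send(data):\n    return len(data)\n\ndef dump(path):\n    return path\n"
test_src = "class TestSend:\n    def test_send_empty(self):\n        assert True\n"

print(untested(module_src, test_src))  # {'dump'}
```

Running a check like this locally before committing can save a round trip with the hook, whatever its exact matching rules turn out to be.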
Understanding the CoAL Architecture¶
Before implementing a new feature, it's important to understand the architecture of CoAL and how its components interact.
Core Modules¶
CoAL is organized into several key modules:
- coal: The core library functionality
- store: Data storage and retrieval
- cosmotech_api: Interaction with the CosmoTech API
- aws: AWS integration
- azure: Azure integration
- postgresql: PostgreSQL integration
- utils: Utility functions
- csm_data: CLI commands for data operations
- orchestrator_plugins: Plugins for csm-orc
- translation: Translation resources
Store Module Architecture¶
The store module provides a unified interface for data storage and retrieval. It's built around the Store class, which provides methods for:
- Adding and retrieving tables
- Executing SQL queries
- Listing available tables
- Resetting the store
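To make these four responsibilities concrete, here is a minimal sketch of a table store built on the standard-library `sqlite3` module. `MiniStore` and its method signatures are hypothetical stand-ins chosen for this illustration; the real `Store` class may differ in both API and storage backend.

```python
import sqlite3


class MiniStore:
    """Hypothetical stand-in for a table store; not the CoAL Store API."""

    def __init__(self, location: str = ":memory:"):
        self._conn = sqlite3.connect(location)

    def add_table(self, name: str, rows: list[dict]) -> None:
        """Create (or replace) a table from a non-empty list of row dicts."""
        columns = list(rows[0])
        quoted = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        self._conn.execute(f'DROP TABLE IF EXISTS "{name}"')
        self._conn.execute(f'CREATE TABLE "{name}" ({quoted})')
        self._conn.executemany(
            f'INSERT INTO "{name}" ({quoted}) VALUES ({marks})',
            [tuple(row[c] for c in columns) for row in rows],
        )
        self._conn.commit()

    def list_tables(self) -> list[str]:
        """Return the names of all tables, sorted alphabetically."""
        cursor = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in cursor]

    def execute_query(self, sql: str) -> list[tuple]:
        """Run a SQL query and return all result rows."""
        return self._conn.execute(sql).fetchall()

    def reset(self) -> None:
        """Drop every table in the store."""
        for name in self.list_tables():
            self._conn.execute(f'DROP TABLE "{name}"')
        self._conn.commit()


store = MiniStore()
store.add_table("people", [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])
print(store.list_tables())                                         # ['people']
print(store.execute_query("SELECT name FROM people WHERE id = 2"))  # [('Bob',)]
store.reset()
print(store.list_tables())                                         # []
```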
The store module also includes adapters for different data formats:
- native_python: Python dictionaries and lists
- csv: CSV files
- pandas: Pandas DataFrames
- pyarrow: PyArrow Tables
External storage systems are implemented as separate modules that interact with the core Store class:
- postgresql: PostgreSQL integration
- singlestore: SingleStore integration
CLI Command Structure¶
The csm_data CLI is organized into command groups, each focused on specific types of operations:
- api: Commands for interacting with the CosmoTech API
- store: Commands for working with the CoAL datastore
- s3-bucket-*: Commands for S3 bucket operations
- adx-send-scenariodata: Command for sending scenario data to Azure Data Explorer
- az-storage-upload: Command for uploading to Azure Storage
Each command is implemented as a separate Python file in the appropriate directory, using the Click library for command-line interface creation.
Implementing a New Store Feature¶
Now that we understand the architecture, let's implement a new store feature: MongoDB integration. This will allow users to write data from the CoAL datastore to MongoDB.
Creating the Module Structure¶
First, we'll create a new module for MongoDB integration:
mkdir -p cosmotech/coal/mongodb
touch cosmotech/coal/mongodb/__init__.py
touch cosmotech/coal/mongodb/store.py
Implementing the Core Functionality¶
Now, let's implement the core functionality in cosmotech/coal/mongodb/store.py:
# Copyright (C) - 2023 - 2025 - Cosmo Tech
# This document and all information contained herein is the exclusive property -
# including all intellectual property rights pertaining thereto - of Cosmo Tech.
# Any use, reproduction, translation, broadcasting, transmission, distribution,
# etc., to any person is prohibited unless it has been previously and
# specifically authorized by written means by Cosmo Tech.

"""
MongoDB store operations module.

This module provides functions for interacting with MongoDB databases
for store operations.
"""

from time import perf_counter

import pyarrow
import pymongo

from cosmotech.coal.store.store import Store
from cosmotech.coal.utils.logger import LOGGER
from cosmotech.orchestrator.utils.translate import T


def send_pyarrow_table_to_mongodb(
    data: pyarrow.Table,
    collection_name: str,
    mongodb_uri: str,
    mongodb_db: str,
    replace: bool = True,
) -> int:
    """
    Send a PyArrow table to MongoDB.

    Args:
        data: PyArrow table to send
        collection_name: MongoDB collection name
        mongodb_uri: MongoDB connection URI
        mongodb_db: MongoDB database name
        replace: Whether to replace existing collection

    Returns:
        Number of documents inserted
    """
    # Convert PyArrow table to list of dictionaries
    records = data.to_pylist()

    # Connect to MongoDB
    client = pymongo.MongoClient(mongodb_uri)
    db = client[mongodb_db]

    # Drop collection if replace is True and collection exists
    if replace and collection_name in db.list_collection_names():
        db[collection_name].drop()

    # Insert records
    if records:
        result = db[collection_name].insert_many(records)
        return len(result.inserted_ids)

    return 0


def dump_store_to_mongodb(
    store_folder: str,
    mongodb_uri: str,
    mongodb_db: str,
    collection_prefix: str = "Cosmotech_",
    replace: bool = True,
) -> None:
    """
    Dump Store data to a MongoDB database.

    Args:
        store_folder: Folder containing the Store
        mongodb_uri: MongoDB connection URI
        mongodb_db: MongoDB database name
        collection_prefix: Collection prefix
        replace: Whether to replace existing collections
    """
    _s = Store(store_location=store_folder)

    tables = list(_s.list_tables())
    if len(tables):
        LOGGER.info(T("coal.logs.database.sending_data").format(table=mongodb_db))
        total_rows = 0
        _process_start = perf_counter()
        for table_name in tables:
            _s_time = perf_counter()
            target_collection_name = f"{collection_prefix}{table_name}"
            LOGGER.info(T("coal.logs.database.table_entry").format(table=target_collection_name))
            data = _s.get_table(table_name)
            if not len(data):
                LOGGER.info(T("coal.logs.database.no_rows"))
                continue
            _dl_time = perf_counter()
            rows = send_pyarrow_table_to_mongodb(
                data,
                target_collection_name,
                mongodb_uri,
                mongodb_db,
                replace,
            )
            total_rows += rows
            _up_time = perf_counter()
            LOGGER.info(T("coal.logs.database.row_count").format(count=rows))
            LOGGER.debug(
                T("coal.logs.progress.operation_timing").format(
                    operation="Load from datastore", time=f"{_dl_time - _s_time:0.3}"
                )
            )
            LOGGER.debug(
                T("coal.logs.progress.operation_timing").format(
                    operation="Send to MongoDB", time=f"{_up_time - _dl_time:0.3}"
                )
            )
        _process_end = perf_counter()
        LOGGER.info(
            T("coal.logs.database.rows_fetched").format(
                table="all tables",
                count=total_rows,
                time=f"{_process_end - _process_start:0.3}",
            )
        )
    else:
        LOGGER.info(T("coal.logs.database.store_empty"))
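One detail of the timing logs above is worth knowing: the format spec `0.3` has no type code, so Python formats the value to three significant digits and switches to exponent notation once the value reaches 100. The short comparison below shows this next to the fixed-point spec `.3f`. The helper names are illustrative only, and whether fixed-point output is preferable is a project decision, not something this tutorial prescribes.

```python
def format_general(seconds: float) -> str:
    """Format as in the code above: no type code, 3 significant digits."""
    return f"{seconds:0.3}"


def format_fixed(seconds: float) -> str:
    """Alternative: fixed-point notation with 3 digits after the decimal point."""
    return f"{seconds:.3f}"


for seconds in (0.123456, 12.3456, 123.456):
    print(f"{format_general(seconds):>10}  {format_fixed(seconds):>10}")

# Output:
#      0.123       0.123
#       12.3      12.346
#   1.23e+02     123.456
```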
Updating the Package Initialization¶
Next, we need to update the cosmotech/coal/mongodb/__init__.py file to expose our new function:
# Copyright (C) - 2023 - 2025 - Cosmo Tech
# This document and all information contained herein is the exclusive property -
# including all intellectual property rights pertaining thereto - of Cosmo Tech.
# Any use, reproduction, translation, broadcasting, transmission, distribution,
# etc., to any person is prohibited unless it has been previously and
# specifically authorized by written means by Cosmo Tech.
from cosmotech.coal.mongodb.store import dump_store_to_mongodb
__all__ = ["dump_store_to_mongodb"]
Adding Dependencies¶
We need to add pymongo as a dependency. Update the pyproject.toml file to include pymongo in the optional dependencies:
[project.optional-dependencies]
mongodb = ["pymongo>=4.3.3"]
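Because pymongo is declared as an optional dependency, users who have not installed the `mongodb` extra will hit an `ImportError` when the MongoDB module is imported. A minimal sketch of one way to turn that into a more helpful message is shown below. The helper `require_optional` is hypothetical and not part of CoAL, and the install hint simply mirrors the editable-install command used earlier in this tutorial.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional top-level module or raise an ImportError naming the extra."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed; "
            f'install it with: pip install -e ".[{extra}]"'
        )
    return importlib.import_module(module_name)


# An installed module is returned as usual ("json" stands in for pymongo here)
json_module = require_optional("json", "dev")
print(json_module.dumps({"ok": True}))

# A missing module produces an actionable error message
try:
    require_optional("surely_missing_module_xyz", "mongodb")
except ImportError as error:
    print(error)
```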
Creating a New csm-data Command¶
Now that we have implemented the core functionality, let's create a new csm-data command to expose this functionality to users.
Creating the Command File¶
Create a new file for the command:
touch cosmotech/csm_data/commands/store/dump_to_mongodb.py
Implementing the Command¶
Now, let's implement the command:
# Copyright (C) - 2023 - 2025 - Cosmo Tech
# This document and all information contained herein is the exclusive property -
# including all intellectual property rights pertaining thereto - of Cosmo Tech.
# Any use, reproduction, translation, broadcasting, transmission, distribution,
# etc., to any person is prohibited unless it has been previously and
# specifically authorized by written means by Cosmo Tech.

from cosmotech.csm_data.utils.click import click
from cosmotech.csm_data.utils.decorators import web_help, translate_help
from cosmotech.orchestrator.utils.translate import T


@click.command()
@web_help("csm-data/store/dump-to-mongodb")
@translate_help("csm-data.commands.store.dump_to_mongodb.description")
@click.option(
    "--store-folder",
    envvar="CSM_PARAMETERS_ABSOLUTE_PATH",
    help=T("csm-data.commands.store.dump_to_mongodb.parameters.store_folder"),
    metavar="PATH",
    type=str,
    show_envvar=True,
    required=True,
)
@click.option(
    "--collection-prefix",
    help=T("csm-data.commands.store.dump_to_mongodb.parameters.collection_prefix"),
    metavar="PREFIX",
    type=str,
    default="Cosmotech_",
)
@click.option(
    "--mongodb-uri",
    help=T("csm-data.commands.store.dump_to_mongodb.parameters.mongodb_uri"),
    envvar="MONGODB_URI",
    show_envvar=True,
    required=True,
)
@click.option(
    "--mongodb-db",
    help=T("csm-data.commands.store.dump_to_mongodb.parameters.mongodb_db"),
    envvar="MONGODB_DB_NAME",
    show_envvar=True,
    required=True,
)
@click.option(
    "--replace/--append",
    "replace",
    help=T("csm-data.commands.store.dump_to_mongodb.parameters.replace"),
    default=True,
    is_flag=True,
    show_default=True,
)
def dump_to_mongodb(
    store_folder,
    collection_prefix: str,
    mongodb_uri,
    mongodb_db,
    replace: bool,
):
    # Import the function at the start of the command
    from cosmotech.coal.mongodb import dump_store_to_mongodb

    dump_store_to_mongodb(
        store_folder=store_folder,
        collection_prefix=collection_prefix,
        mongodb_uri=mongodb_uri,
        mongodb_db=mongodb_db,
        replace=replace,
    )
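The `--replace/--append` flag is passed through to `send_pyarrow_table_to_mongodb`, where it decides whether an existing collection is dropped before inserting. The sketch below mimics that behaviour with a plain dictionary of lists standing in for a MongoDB database, so it can be run without a server. `write_collection` is a hypothetical helper written for this illustration only.

```python
def write_collection(database: dict, name: str, records: list[dict], replace: bool = True) -> int:
    """Mimic the replace/append logic on a dict of lists (no MongoDB involved)."""
    if replace and name in database:
        del database[name]  # plays the role of dropping the collection
    if not records:
        return 0  # nothing to insert
    database.setdefault(name, []).extend(dict(record) for record in records)
    return len(records)


db = {"Cosmotech_people": [{"id": 0}]}

# --append: existing documents are kept and new ones are added
write_collection(db, "Cosmotech_people", [{"id": 1}, {"id": 2}], replace=False)
print(len(db["Cosmotech_people"]))  # 3

# --replace: the collection is dropped first, then repopulated
write_collection(db, "Cosmotech_people", [{"id": 3}], replace=True)
print(db["Cosmotech_people"])  # [{'id': 3}]
```

With append mode, running the command twice on the same store duplicates the documents, which is worth keeping in mind when choosing the flag.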
Registering the Command¶
Update the cosmotech/csm_data/commands/store/__init__.py file to register the new command:
# Copyright (C) - 2023 - 2025 - Cosmo Tech
# This document and all information contained herein is the exclusive property -
# including all intellectual property rights pertaining thereto - of Cosmo Tech.
# Any use, reproduction, translation, broadcasting, transmission, distribution,
# etc., to any person is prohibited unless it has been previously and
# specifically authorized by written means by Cosmo Tech.
from cosmotech.csm_data.commands.store.dump_to_azure import dump_to_azure
from cosmotech.csm_data.commands.store.dump_to_postgresql import dump_to_postgresql
from cosmotech.csm_data.commands.store.dump_to_s3 import dump_to_s3
from cosmotech.csm_data.commands.store.dump_to_mongodb import dump_to_mongodb # Add this line
from cosmotech.csm_data.commands.store.list_tables import list_tables
from cosmotech.csm_data.commands.store.load_csv_folder import load_csv_folder
from cosmotech.csm_data.commands.store.load_from_singlestore import load_from_singlestore
from cosmotech.csm_data.commands.store.reset import reset
from cosmotech.csm_data.commands.store.store import store
__all__ = [
    "dump_to_azure",
    "dump_to_postgresql",
    "dump_to_s3",
    "dump_to_mongodb",  # Add this line
    "list_tables",
    "load_csv_folder",
    "load_from_singlestore",
    "reset",
    "store",
]
Adding Translation Strings¶
Create translation files for the new command:
- For English (en-US):
touch cosmotech/translation/csm_data/en-US/commands/store/dump_to_mongodb.yml
commands:
  store:
    dump_to_mongodb:
      description: |
        Dump store data to MongoDB.
      parameters:
        store_folder: Folder containing the store
        collection_prefix: Prefix for MongoDB collections
        mongodb_uri: MongoDB connection URI
        mongodb_db: MongoDB database name
        replace: Replace existing collections
- For French (fr-FR):
touch cosmotech/translation/csm_data/fr-FR/commands/store/dump_to_mongodb.yml
commands:
  store:
    dump_to_mongodb:
      description: |
        Exporter les données du store vers MongoDB.
      parameters:
        store_folder: Dossier contenant le store
        collection_prefix: Préfixe pour les collections MongoDB
        mongodb_uri: URI de connexion MongoDB
        mongodb_db: Nom de la base de données MongoDB
        replace: Remplacer les collections existantes
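When maintaining two locale files by hand, it is easy to add a key to one and forget the other. As a small sketch, the helper below flattens nested mappings into dotted keys and reports keys present in one locale but missing from the other. The dictionaries stand in for the parsed YAML files, and `flatten` and `missing_keys` are hypothetical helpers, not part of CoAL's translation tooling.

```python
def flatten(mapping: dict, prefix: str = "") -> set[str]:
    """Flatten nested dicts into a set of dotted keys."""
    keys = set()
    for key, value in mapping.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            keys |= flatten(value, dotted)
        else:
            keys.add(dotted)
    return keys


def missing_keys(reference: dict, other: dict) -> set[str]:
    """Return dotted keys defined in `reference` but absent from `other`."""
    return flatten(reference) - flatten(other)


# Stand-ins for the parsed en-US and fr-FR files (fr-FR lacks "replace")
en_us = {"commands": {"store": {"dump_to_mongodb": {
    "description": "Dump store data to MongoDB.",
    "parameters": {"mongodb_uri": "MongoDB connection URI",
                   "replace": "Replace existing collections"},
}}}}
fr_fr = {"commands": {"store": {"dump_to_mongodb": {
    "description": "Exporter les données du store vers MongoDB.",
    "parameters": {"mongodb_uri": "URI de connexion MongoDB"},
}}}}

print(missing_keys(en_us, fr_fr))
# {'commands.store.dump_to_mongodb.parameters.replace'}
```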
Writing Unit Tests¶
Testing is a critical part of the contribution process. All new functionality must be thoroughly tested to ensure it works as expected and to prevent regressions.
Creating Test Files¶
Create test files for the new functionality:
mkdir -p tests/unit/coal/mongodb
touch tests/unit/coal/mongodb/__init__.py
touch tests/unit/coal/mongodb/test_store.py
Implementing Unit Tests¶
Now, let's implement the unit tests for the MongoDB store functionality:
# Copyright (C) - 2023 - 2025 - Cosmo Tech
# This document and all information contained herein is the exclusive property -
# including all intellectual property rights pertaining thereto - of Cosmo Tech.
# Any use, reproduction, translation, broadcasting, transmission, distribution,
# etc., to any person is prohibited unless it has been previously and
# specifically authorized by written means by Cosmo Tech.

import os
import tempfile
from unittest.mock import patch, MagicMock

import pyarrow
import pytest

from cosmotech.coal.mongodb.store import send_pyarrow_table_to_mongodb, dump_store_to_mongodb
from cosmotech.coal.store.store import Store


@pytest.fixture
def sample_table():
    """Create a sample PyArrow table for testing."""
    data = {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "age": [30, 25, 35],
    }
    return pyarrow.Table.from_pydict(data)


@pytest.fixture
def temp_store():
    """Create a temporary store for testing."""
    with tempfile.TemporaryDirectory() as temp_dir:
        store = Store(store_location=temp_dir)
        yield store, temp_dir
class TestSendPyarrowTableToMongoDB:
    @patch("pymongo.MongoClient")
    def test_send_pyarrow_table_to_mongodb(self, mock_client, sample_table):
        # Set up mocks
        mock_db = MagicMock()
        mock_collection = MagicMock()
        mock_client.return_value.__getitem__.return_value = mock_db
        mock_db.__getitem__.return_value = mock_collection
        mock_db.list_collection_names.return_value = []
        mock_collection.insert_many.return_value.inserted_ids = ["id1", "id2", "id3"]

        # Call the function
        result = send_pyarrow_table_to_mongodb(
            sample_table,
            "test_collection",
            "mongodb://localhost:27017",
            "test_db",
            True,
        )

        # Verify the result
        assert result == 3
        mock_client.assert_called_once_with("mongodb://localhost:27017")
        mock_client.return_value.__getitem__.assert_called_once_with("test_db")
        mock_db.list_collection_names.assert_called_once()
        mock_collection.insert_many.assert_called_once()

    @patch("pymongo.MongoClient")
    def test_send_pyarrow_table_to_mongodb_replace(self, mock_client, sample_table):
        # Set up mocks
        mock_db = MagicMock()
        mock_collection = MagicMock()
        mock_client.return_value.__getitem__.return_value = mock_db
        mock_db.__getitem__.return_value = mock_collection
        mock_db.list_collection_names.return_value = ["test_collection"]
        mock_collection.insert_many.return_value.inserted_ids = ["id1", "id2", "id3"]

        # Call the function
        result = send_pyarrow_table_to_mongodb(
            sample_table,
            "test_collection",
            "mongodb://localhost:27017",
            "test_db",
            True,
        )

        # Verify the result
        assert result == 3
        mock_client.assert_called_once_with("mongodb://localhost:27017")
        mock_client.return_value.__getitem__.assert_called_once_with("test_db")
        mock_db.list_collection_names.assert_called_once()
        mock_collection.drop.assert_called_once()
        mock_collection.insert_many.assert_called_once()

    @patch("pymongo.MongoClient")
    def test_send_pyarrow_table_to_mongodb_append(self, mock_client, sample_table):
        # Set up mocks
        mock_db = MagicMock()
        mock_collection = MagicMock()
        mock_client.return_value.__getitem__.return_value = mock_db
        mock_db.__getitem__.return_value = mock_collection
        mock_db.list_collection_names.return_value = ["test_collection"]
        mock_collection.insert_many.return_value.inserted_ids = ["id1", "id2", "id3"]

        # Call the function
        result = send_pyarrow_table_to_mongodb(
            sample_table,
            "test_collection",
            "mongodb://localhost:27017",
            "test_db",
            False,
        )

        # Verify the result
        assert result == 3
        mock_client.assert_called_once_with("mongodb://localhost:27017")
        mock_client.return_value.__getitem__.assert_called_once_with("test_db")
        mock_db.list_collection_names.assert_called_once()
        mock_collection.drop.assert_not_called()
        mock_collection.insert_many.assert_called_once()

    @patch("pymongo.MongoClient")
    def test_send_pyarrow_table_to_mongodb_empty(self, mock_client):
        # Set up mocks
        mock_db = MagicMock()
        mock_collection = MagicMock()
        mock_client.return_value.__getitem__.return_value = mock_db
        mock_db.__getitem__.return_value = mock_collection
        mock_db.list_collection_names.return_value = []

        # Create an empty table
        empty_table = pyarrow.Table.from_pydict({})

        # Call the function
        result = send_pyarrow_table_to_mongodb(
            empty_table,
            "test_collection",
            "mongodb://localhost:27017",
            "test_db",
            True,
        )

        # Verify the result
        assert result == 0
        mock_client.assert_called_once_with("mongodb://localhost:27017")
        mock_client.return_value.__getitem__.assert_called_once_with("test_db")
        mock_db.list_collection_names.assert_called_once()
        mock_collection.insert_many.assert_not_called()
class TestDumpStoreToMongoDB:
    @patch("cosmotech.coal.mongodb.store.send_pyarrow_table_to_mongodb")
    def test_dump_store_to_mongodb(self, mock_send, temp_store):
        store, temp_dir = temp_store

        # Add a table to the store
        sample_data = pyarrow.Table.from_pydict(
            {
                "id": [1, 2, 3],
                "name": ["Alice", "Bob", "Charlie"],
                "age": [30, 25, 35],
            }
        )
        store.add_table("test_table", sample_data)

        # Set up mock
        mock_send.return_value = 3

        # Call the function
        dump_store_to_mongodb(
            temp_dir,
            "mongodb://localhost:27017",
            "test_db",
            "Cosmotech_",
            True,
        )

        # Verify the mock was called correctly.
        # dump_store_to_mongodb passes its arguments positionally as
        # (data, collection_name, mongodb_uri, mongodb_db, replace),
        # so they are found in args rather than kwargs.
        mock_send.assert_called_once()
        args, kwargs = mock_send.call_args
        assert args[1] == "Cosmotech_test_table"
        assert args[2] == "mongodb://localhost:27017"
        assert args[3] == "test_db"
        assert args[4] is True

    @patch("cosmotech.coal.mongodb.store.send_pyarrow_table_to_mongodb")
    def test_dump_store_to_mongodb_empty(self, mock_send, temp_store):
        _, temp_dir = temp_store

        # Call the function with an empty store
        dump_store_to_mongodb(
            temp_dir,
            "mongodb://localhost:27017",
            "test_db",
            "Cosmotech_",
            True,
        )

        # Verify the mock was not called
        mock_send.assert_not_called()

    @patch("cosmotech.coal.mongodb.store.send_pyarrow_table_to_mongodb")
    def test_dump_store_to_mongodb_multiple_tables(self, mock_send, temp_store):
        store, temp_dir = temp_store

        # Add multiple tables to the store
        table1 = pyarrow.Table.from_pydict(
            {
                "id": [1, 2, 3],
                "name": ["Alice", "Bob", "Charlie"],
            }
        )
        table2 = pyarrow.Table.from_pydict(
            {
                "id": [4, 5],
                "name": ["Dave", "Eve"],
            }
        )
        store.add_table("table1", table1)
        store.add_table("table2", table2)

        # Set up mock
        mock_send.side_effect = [3, 2]

        # Call the function
        dump_store_to_mongodb(
            temp_dir,
            "mongodb://localhost:27017",
            "test_db",
            "Cosmotech_",
            True,
        )

        # Verify the mock was called once per table
        assert mock_send.call_count == 2

        # The collection name is the second positional argument of each call;
        # the order in which tables are processed is not assumed.
        collection_names = {call.args[1] for call in mock_send.call_args_list}
        assert collection_names == {"Cosmotech_table1", "Cosmotech_table2"}
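When a test inspects `call_args` on a mock, the arguments appear exactly as the caller passed them: values passed positionally land in `args`, and values passed by keyword land in `kwargs`. The standalone sketch below illustrates the distinction with a bare `MagicMock`. The name `send` and the argument values are made up for the example.

```python
from unittest.mock import MagicMock

send = MagicMock(return_value=3)

# Two positional arguments and one keyword argument
send("table", "Cosmotech_people", replace=True)

args, kwargs = send.call_args
print(args)    # ('table', 'Cosmotech_people')
print(kwargs)  # {'replace': True}

# A positionally passed value is not visible under its parameter name
print("collection_name" in kwargs)  # False

# assert_called_once_with compares positional and keyword arguments as passed
send.assert_called_once_with("table", "Cosmotech_people", replace=True)
```

Looking up a positionally passed argument by name in `kwargs` raises a `KeyError`, so assertions need to match the calling style used by the code under test.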
Running the Tests¶
To run the tests, use pytest:
# Run all tests
pytest tests/unit/coal/mongodb/
# Run with coverage
pytest tests/unit/coal/mongodb/ --cov=cosmotech.coal.mongodb --cov-report=term-missing
Make sure all tests pass and that you have adequate code coverage (at least 80%).
Documentation¶
Documentation is a critical part of the contribution process. All new features must be documented to ensure users can understand and use them effectively.
Updating CLI Documentation¶
Let's add csm-data documentation for our new functionality. Create a new file:
touch docs/csm-data/store/dump-to-mongodb.md
Add the following content:
---
hide:
  - toc
description: "Command help: `csm-data store dump-to-mongodb`"
---
# dump-to-mongodb

!!! info "Help command"
    ```text
    ```
The documentation build system generates the command's help text and inserts it into this file, which gives you minimal documentation for free. You can then add more detail as needed.
Pull Request Checklist¶
Before submitting your pull request, make sure you've completed all the necessary steps:
- Code Quality
- Code follows the project's style guidelines (Black formatting)
- All linting checks pass
- Code is well-documented with docstrings
- Code is efficient and follows best practices
- No unnecessary dependencies are added
- Testing
- All unit tests pass
- Test coverage meets or exceeds 80%
- All functions have at least one test
- Edge cases and error conditions are tested
- Mocks are used for external services
- Documentation
- API documentation is updated
- Command help text is clear and comprehensive
- Translation strings are added for all user-facing text
- Usage examples are provided
- Any necessary tutorials are created or updated
- Integration
- New functionality integrates well with existing code
- No breaking changes to existing APIs
- Dependencies are properly specified in pyproject.toml
- Command is registered in the appropriate __init__.py file
- Pull Request Description
- Clear description of the changes
- Explanation of why the changes are needed
- Any potential issues or limitations
- References to related issues or discussions
Conclusion¶
Congratulations! You've now learned how to contribute to CoAL by implementing a new feature, creating a new csm-data command, writing unit tests, and documenting your work.
By following this tutorial, you've gained practical experience with:
- Setting up your development environment with Black and pre-commit hooks
- Understanding the CoAL architecture
- Implementing new functionality
- Creating CLI commands
- Writing unit tests
- Documenting your work
- Preparing for a pull request
You're now ready to contribute your own features to CoAL and help improve the platform for everyone.
Remember that the CoAL community is here to help. If you have any questions or need assistance, don't hesitate to reach out through GitHub issues or discussions.
Happy contributing!