Raw test framework

The unit tests for the raw layer verify validation and quarantine behavior. To support this, the enriched test framework has been adapted to focus on inserts and validation. As with the changes made to the enriched layer, all IO is done in the run_table.py file. This simplifies the arguments passed to the run files and keeps the raw layer consistent with the enriched layer.

Raw unit test framework

The steps will be familiar if you know the enriched framework. The main change is that, instead of testing business logic for transformations, we test validation logic: whether our validator functions behave as intended on our data. Valid rows pass and are stored in the raw table, while invalid rows are rejected and stored in the quarantine table. To support this, the mock data lake was changed. In the enriched framework it captures the single DataFrame that gets upserted. Because the raw layer often produces two distinct tables (one for valid rows and one for quarantined rows), the raw mock captures both DataFrames.
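To make the idea of capturing two outputs concrete, here is a minimal sketch of a mock that records every write per table. It is not the project's mock data lake: `CapturingDataLake`, `upsert`, `last_write`, and the table names are hypothetical, and plain lists stand in for DataFrames so the example runs without Spark.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CapturingDataLake:
    """Hypothetical stand-in for a mock data lake: records writes per table."""

    writes: dict[str, list[Any]] = field(default_factory=dict)

    def upsert(self, table_name: str, df: Any) -> None:
        # Keep the object in memory instead of writing it anywhere.
        self.writes.setdefault(table_name, []).append(df)

    def last_write(self, table_name: str) -> Any:
        # Return None when the table was never written to.
        captured = self.writes.get(table_name, [])
        return captured[-1] if captured else None


lake = CapturingDataLake()
# Assumed table names, for illustration only.
lake.upsert("amplicons", [{"id": "A-1"}])
lake.upsert("amplicons_quarantine", [{"id": "A-2", "rejection_reason": "bad marker"}])

raw_rows = lake.last_write("amplicons")
quarantine_rows = lake.last_write("amplicons_quarantine")
```

Keeping the captured objects keyed by table name is what lets a test assert on the valid and the quarantined output independently.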

Step 1: Create Test Data

# biocloudcore/raw/nanopore/tests/data/test_amplicons_data.py
import datetime

FIXED_TEST_TIMESTAMP = datetime.datetime(1989, 11, 9)

AMPLICONS_LANDING_ZONE = [
    {
        "id": "A-01d2fa566f49ee0b",
        "marker": "COI-5P",
        "name": "RMNH.5238070",
        # ... all source fields
    },
]

EXPECTED_OUTPUT = [
    {
        "id": "A-01d2fa566f49ee0b",
        "marker": "COI-5P",
        "name": "RMNH.5238070",
        # Source fields + timestamps
        "inserted_ts_utc": FIXED_TEST_TIMESTAMP,
        "updated_ts_utc": FIXED_TEST_TIMESTAMP,
    },
]
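Because the expected output is the source record plus two audit timestamps, pinning the timestamp makes the expected rows fully deterministic. The sketch below illustrates that relationship only. `with_audit_timestamps` is a hypothetical helper, not part of the framework, and the record is trimmed to the three fields shown above.

```python
import datetime
from typing import Any

FIXED_TEST_TIMESTAMP = datetime.datetime(1989, 11, 9)


def with_audit_timestamps(record: dict[str, Any], ts: datetime.datetime) -> dict[str, Any]:
    """Hypothetical helper: copy a source record and add the two audit columns."""
    return {**record, "inserted_ts_utc": ts, "updated_ts_utc": ts}


landing_zone_record = {
    "id": "A-01d2fa566f49ee0b",
    "marker": "COI-5P",
    "name": "RMNH.5238070",
}

# The source record is left untouched; the expected row is a new dict.
expected_record = with_audit_timestamps(landing_zone_record, FIXED_TEST_TIMESTAMP)
```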

Step 2: Write Tests

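# biocloudcore/raw/nanopore/tests/test_amplicons.py
import pytest

from biocloudcore.raw.nanopore.tests.data import test_amplicons_data as data

# Amplicons, assert_test_output and override are imported from the project;
# their import paths are not shown in this guide.
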
@pytest.mark.unit
@pytest.mark.patch_module("biocloudcore.raw.nanopore.amplicons")
@pytest.mark.fixed_timestamp(data.FIXED_TEST_TIMESTAMP)
def test_amplicons_happy_path(raw_test_mocks, spark, mock_data_lake, test_data):
    """Test normal amplicons processing with valid input data."""
    Amplicons(...).run(landing_zone_amplicons=test_data.landing_zone_amplicons.df)

    assert_test_output(
        spark=spark,
        actual=raw_test_mocks.raw_df,  # Valid records
        expected=data.EXPECTED_OUTPUT,
        schema=test_data.amplicons.schema,
    )


@pytest.mark.unit
@pytest.mark.patch_module("biocloudcore.raw.nanopore.amplicons")
@pytest.mark.fixed_timestamp(data.FIXED_TEST_TIMESTAMP)
def test_amplicons_with_invalid_registration_number(raw_test_mocks, spark, mock_data_lake, test_data):
    """Test amplicons quarantine processing with invalid input data."""
    invalid_record = override(
        data.AMPLICONS_LANDING_ZONE[0],
        marker="INVALID_REGISTRATION_NUMBER",  # Not in the allowed list
    )
    invalid_df = spark.createDataFrame([invalid_record], test_data.landing_zone_amplicons.schema)

    Amplicons(...).run(landing_zone_amplicons=invalid_df)

    # Verify quarantine behavior
    assert raw_test_mocks.raw_df.count() == 0, "Expected 0 valid records in raw"
    assert raw_test_mocks.quarantine_df.count() == 1, "Expected 1 invalid record in quarantine"

    quarantined = raw_test_mocks.quarantine_df.collect()[0]
    assert quarantined.rejection_reason == "Invalid registration number", "Unexpected rejection reason"
    assert quarantined.is_resolved is False, "Should be unresolved"


@pytest.mark.unit
@pytest.mark.patch_module("biocloudcore.raw.nanopore.amplicons")
@pytest.mark.fixed_timestamp(data.FIXED_TEST_TIMESTAMP)
def test_amplicons_with_empty_input(raw_test_mocks, spark, mock_data_lake, test_data):
    """Test handling of completely empty raw input.

    Verifies that nothing is upserted for empty input, because the validator rejects it.
    """
    empty_landing_zone = spark.createDataFrame([], test_data.landing_zone_amplicons.schema)

    Amplicons(
        spark=spark,
        data_lake=mock_data_lake,
        data_contract=test_data.amplicons.contract,
        source="nanopore",
        table_name="amplicons",
        layer="raw",
    ).run(landing_zone_amplicons=empty_landing_zone)

    assert raw_test_mocks.mock_upsert.call_count == 0, "No upsert should occur for empty input"
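The second test builds its invalid record by overriding one field of a known-good record. The project's `override` helper is not shown in this guide, so the following is only a sketch of how such a helper could behave, under the assumption that it returns a modified copy. `override_record` is a hypothetical name, and the check for unknown fields is an added safeguard, not documented behavior.

```python
from typing import Any


def override_record(record: dict[str, Any], **changes: Any) -> dict[str, Any]:
    """Hypothetical helper: return a copy of `record` with some fields replaced."""
    unknown = set(changes) - set(record)
    if unknown:
        # Catch typos in field names early instead of silently adding new keys.
        raise KeyError(f"Unknown field(s): {sorted(unknown)}")
    return {**record, **changes}


base = {"id": "A-01d2fa566f49ee0b", "marker": "COI-5P", "name": "RMNH.5238070"}
invalid = override_record(base, marker="INVALID_REGISTRATION_NUMBER")
```

Returning a copy keeps the shared test data unchanged, so one test's invalid record cannot leak into another test.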

What raw tests verify:

  • ✅ Valid data passes through to raw table
  • ✅ Invalid data goes to quarantine table
  • ✅ Rejection reasons are set correctly
  • ✅ Insert happens as expected
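The behavior these tests check can be pictured as splitting incoming rows into a valid set and a quarantined set. The sketch below shows that split in plain Python rather than Spark. It is not the raw layer's implementation: `split_records`, `validate_marker`, `ALLOWED_MARKERS`, and the reason text are hypothetical, and only the `rejection_reason` and `is_resolved` column names come from the tests above.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]
# A validator returns a rejection reason, or None when the record is valid.
Validator = Callable[[Record], Optional[str]]

ALLOWED_MARKERS = {"COI-5P"}  # assumed allow-list, for illustration only


def validate_marker(record: Record) -> Optional[str]:
    if record.get("marker") not in ALLOWED_MARKERS:
        return "Invalid marker"
    return None


def split_records(
    records: list[Record], validators: list[Validator]
) -> tuple[list[Record], list[Record]]:
    """Route each record to the valid list or the quarantine list."""
    valid: list[Record] = []
    quarantined: list[Record] = []
    for record in records:
        reasons = [reason for check in validators if (reason := check(record)) is not None]
        if reasons:
            # Quarantined rows keep the source fields plus the rejection metadata.
            quarantined.append(
                {**record, "rejection_reason": "; ".join(reasons), "is_resolved": False}
            )
        else:
            valid.append(record)
    return valid, quarantined


rows = [
    {"id": "A-1", "marker": "COI-5P"},
    {"id": "A-2", "marker": "UNKNOWN"},
]
valid_rows, quarantined_rows = split_records(rows, [validate_marker])
```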

Key differences from enriched:

  • Uses raw_test_mocks instead of enriched_test_mocks
  • Can also access raw_df and quarantine_df properties
  • Focuses on insert and validation rules, not transformations or lookups
  • No primary key generation (uses source IDs)