
Common Functions in the Enriched Layer

The common functions in the enriched layer are used for data model manipulation and column generation.

biocloudcore.enriched.common_datamodel_functions

This file contains common datamodel functions, mostly used for generating new columns.

generate_enriched_dataframes_with_golden_ids(spark, df, logical_keys, table_name, existing_id_table)

Generates a dict with two DataFrames:

  • the enriched data, including the correct golden_ids, which can be traced back through the ID table
  • the ID table, which contains the golden_ids and the logical keys needed to trace each golden_id back to the source dataset.

A Golden ID is a unique, system-generated identifier assigned to each record so that every record is uniquely identifiable within our Data Lake.

For more info on Golden IDs, see the wiki page: https://gitlab.com/groups/arise-biodiversity/biocloud/-/wikis/4.-Data-model/4.2-Golden-records-&-ids

Steps:

  1. Join the existing ID table with the filtered new data based on the logical keys and source columns.
  2. Fill in the Golden IDs for rows with the same logical keys by checking if the previous row has the same logical keys.
  3. Assign new Golden IDs to rows with NULL Golden IDs by adding the maximum existing Golden ID to a row number generated for each row.
  4. Select the data in the correct schema to return as the ID table and the enriched table.

Note: If the existing_id_table is empty, the Golden IDs will start from 1.
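The steps above can be sketched in plain Python, with lists of dicts standing in for Spark DataFrames. This is a minimal illustration of the described logic, not the biocloudcore implementation: the function name sketch_golden_ids, the golden_id column name, and the sample values are assumptions made for the example.

```python
def sketch_golden_ids(rows, logical_keys, table_name, existing_ids):
    """Pure-Python stand-in for the documented steps (hypothetical helper)."""
    # Step 1: index the existing ID table by its logical keys (plays the role of the join).
    known = {tuple(r[k] for k in logical_keys): r["golden_id"] for r in existing_ids}
    # With an empty ID table, max() falls back to 0, so numbering starts at 1.
    next_id = max(known.values(), default=0) + 1

    enriched = []
    for row in rows:
        key = tuple(row[k] for k in logical_keys)
        # Steps 2 and 3: reuse the ID of a known key, otherwise hand out the next one.
        if key not in known:
            known[key] = next_id
            next_id += 1
        enriched.append({**row, "golden_id": known[key]})

    # Step 4: return both tables under the documented dict keys.
    id_table = [dict(zip(logical_keys, key), golden_id=gid) for key, gid in known.items()]
    return {table_name: enriched, f"{table_name}_id": id_table}


# Made-up sample data: one key already has Golden ID 7, the other is new.
existing = [{"source_id": "a1", "source": "src_x", "golden_id": 7}]
new_rows = [
    {"source_id": "a1", "source": "src_x", "name": "first"},
    {"source_id": "b2", "source": "src_x", "name": "second"},
]
result = sketch_golden_ids(new_rows, ["source_id", "source"], "taxon", existing)
golden_ids = [r["golden_id"] for r in result["taxon"]]  # [7, 8]
```

In this sketch the known key keeps its Golden ID and the unseen key receives the maximum existing ID plus one. The ID table in the result holds one row per logical-key combination.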

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| spark | SparkSession | The Spark session | required |
| df | DataFrame | The input DataFrame to which Golden IDs will be added | required |
| logical_keys | list | A list of column names that constitute the logical keys for deduplication | required |
| table_name | str | The name of the table for which Golden IDs are being generated | required |
| existing_id_table | DataFrame | DataFrame with existing ID table data (can be empty for initial load) | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict[str, DataFrame] | A dict containing the enriched DataFrame and the ID DataFrame as {table_name} and {table_name}_id |

generate_primary_keys(df, logical_keys, primary_key_column, existing_data)

Generates primary keys for a given DataFrame based on logical keys.

This function is an alternative to generate_enriched_dataframes_with_golden_ids for cases where primary keys are needed instead of golden ids. This is the case, for example, for relationship tables, transactional data tables, and other tables that have no golden id because they must use the original primary keys from the source data to keep rows unique.

This function ensures that each record in the DataFrame has a unique primary key. It first fills in the primary keys for rows with the same logical keys by checking whether the previous row has the same logical keys. Rows that still have a NULL primary key are then assigned a new one: the maximum existing primary key plus a row number generated for each row.
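The previous-row check described above can be illustrated in plain Python by sorting the rows on their logical keys and walking them in order. This is a hedged sketch under stated assumptions, not the library code: sketch_primary_keys, the link_id column, and the sample rows are hypothetical, lists of dicts stand in for DataFrames, and new rows are assumed to arrive without a primary key.

```python
def sketch_primary_keys(rows, logical_keys, primary_key_column, existing_rows):
    """Previous-row comparison over sorted rows (hypothetical helper)."""
    def sort_key(row):
        # Group rows by logical keys; rows that already have a key sort first in a group.
        logical = tuple(str(row[k]) for k in logical_keys)
        return logical + (row.get(primary_key_column) is None,)

    max_key = max((r[primary_key_column] for r in existing_rows), default=0)
    tagged = [(r, False) for r in existing_rows] + [(r, True) for r in rows]
    tagged.sort(key=lambda pair: sort_key(pair[0]))

    result, prev_logical, prev_pk, new_count = [], None, None, 0
    for row, is_new in tagged:
        logical = tuple(row[k] for k in logical_keys)
        pk = row.get(primary_key_column)
        if pk is None and logical == prev_logical:
            pk = prev_pk                 # same logical keys as the previous row
        elif pk is None:
            new_count += 1               # row number among the unseen keys
            pk = max_key + new_count
        if is_new:
            result.append({**row, primary_key_column: pk})
        prev_logical, prev_pk = logical, pk
    return result


# Made-up sample data: key a1 already exists with primary key 3.
existing_rows = [{"source_id": "a1", "source": "src_x", "link_id": 3}]
incoming = [
    {"source_id": "b2", "source": "src_x"},
    {"source_id": "a1", "source": "src_x"},
    {"source_id": "b2", "source": "src_x"},
]
keyed = sketch_primary_keys(incoming, ["source_id", "source"], "link_id", existing_rows)
primary_keys = [r["link_id"] for r in keyed]  # [3, 4, 4], in sorted order
```

The sketch returns only the incoming rows, ordered by logical key rather than by input order. Whether the real function returns the rows in a particular order is not stated in this reference.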

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The input DataFrame to which primary keys will be added. | required |
| logical_keys | list | A list of column names that constitute the logical keys for deduplication. This would typically be the source_id and the name of the source. | required |
| primary_key_column | str | The name of the column where the primary keys will be stored. | required |
| existing_data | DataFrame | A DataFrame (usually a Delta table) that contains the existing data, to which new or updated primary keys are upserted. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | A DataFrame with the primary keys added to it. |

generate_unique_ids(df, logical_keys, id_column)

Generates unique IDs for a given DataFrame based on logical keys.

This function ensures that each record in the DataFrame has a unique Golden ID. It first fills in the Golden IDs for rows with the same logical keys by checking whether the previous row has the same logical keys. Rows that still have a NULL Golden ID are then assigned a new one: the maximum existing Golden ID plus a row number generated for each row.
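The "maximum plus row number" rule can be checked with a few lines of arithmetic. The helper offset_new_ids below is hypothetical and only illustrates the offset. Starting from 1 when there are no existing IDs is an assumption carried over from the note on generate_enriched_dataframes_with_golden_ids.

```python
def offset_new_ids(existing_ids, new_row_count):
    """Return max(existing_ids) + row_number for row numbers 1..new_row_count."""
    base = max(existing_ids, default=0)  # assumption: no existing IDs means numbering starts at 1
    return [base + row_number for row_number in range(1, new_row_count + 1)]


new_ids = offset_new_ids([12, 40, 41], 3)  # [42, 43, 44]
```

Because every new ID is strictly greater than the current maximum, a new ID cannot collide with an existing one, even when the existing IDs have gaps.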

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The input DataFrame to which Golden IDs will be added. | required |
| logical_keys | list | A list of column names that constitute the logical keys for deduplication. | required |
| id_column | str | The name of the column where the Golden IDs will be stored. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | A DataFrame with the Golden IDs filled in. |