Common Functions in the Enriched Layer
The common functions in the enriched layer are used for data model manipulation and column generation.
biocloudcore.enriched.common_datamodel_functions
This file contains common datamodel functions, mostly used for generating new columns.
generate_enriched_dataframes_with_golden_ids(spark, df, logical_keys, table_name, existing_id_table)
Generates a dict with two DataFrames:
- one with the enriched data, including the correct Golden IDs, which can be traced back in the ID table;
- one with the ID table, which contains the Golden IDs and the logical keys needed to trace each Golden ID back to the source dataset.
A Golden ID is a unique, system-generated identifier assigned to each record so that every record is uniquely identifiable within our Data Lake.
For more info on Golden IDs, see the wiki page: https://gitlab.com/groups/arise-biodiversity/biocloud/-/wikis/4.-Data-model/4.2-Golden-records-&-ids
Steps:
1. Join the existing ID table with the filtered new data based on the logical keys and source columns.
2. Fill in the Golden IDs for rows with the same logical keys by checking if the previous row has the same logical keys.
3. Assign new Golden IDs to rows with NULL Golden IDs by adding the maximum existing Golden ID to a row number generated for each row.
4. Select the data in the correct schema to return as the ID table and the enriched table.

Note: If the existing_id_table is empty, the Golden IDs will start from 1.
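The steps above can be sketched in plain Python, without Spark, to show how the IDs are carried over and extended. This is only an illustration under stated assumptions: the function name `sketch_golden_ids`, the `golden_id` column name, and the sample rows are hypothetical and are not taken from the biocloudcore implementation, which operates on Spark DataFrames.

```python
def sketch_golden_ids(rows, logical_keys, table_name, existing_ids):
    """Plain-Python illustration of the four steps (hypothetical, no Spark)."""
    # Step 1: index the existing ID table by logical key (stands in for the join).
    known = {tuple(r[k] for k in logical_keys): r["golden_id"] for r in existing_ids}
    # With an empty ID table the maximum defaults to 0, so numbering starts from 1.
    next_id = max(known.values(), default=0) + 1

    enriched = []
    for row in rows:
        key = tuple(row[k] for k in logical_keys)
        # Step 2: rows that share logical keys reuse the same Golden ID.
        if key not in known:
            # Step 3: unseen keys get the next ID above the existing maximum.
            known[key] = next_id
            next_id += 1
        enriched.append({**row, "golden_id": known[key]})

    # Step 4: return both tables, keyed as {table_name} and {table_name}_id.
    id_table = [dict(zip(logical_keys, key), golden_id=gid) for key, gid in known.items()]
    return {table_name: enriched, f"{table_name}_id": id_table}


# Hypothetical sample data: one known record and two new rows with the same keys.
existing = [{"source_id": "a1", "source": "gbif", "golden_id": 1}]
new_rows = [
    {"source_id": "a1", "source": "gbif", "name": "Parus major"},
    {"source_id": "b7", "source": "gbif", "name": "Pica pica"},
    {"source_id": "b7", "source": "gbif", "name": "Pica pica"},
]
result = sketch_golden_ids(new_rows, ["source_id", "source"], "taxon", existing)
```

In this sketch the known record keeps Golden ID 1, and both rows with the new logical key receive the same new ID, so each Golden ID in the enriched table can be looked up in the ID table.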
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| spark | SparkSession | Spark session | required |
| df | DataFrame | The input DataFrame to which Golden IDs will be added | required |
| logical_keys | list | A list of column names that constitute the logical keys for deduplication | required |
| table_name | str | The name of the table for which Golden IDs are being generated | required |
| existing_id_table | DataFrame | DataFrame with existing ID table data (can be empty for initial load) | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict[str, DataFrame] | A dict containing the enriched DataFrame and the id DataFrame as {table_name} and {table_name}_id |
generate_primary_keys(df, logical_keys, primary_key_column, existing_data)
Generates primary keys for a given DataFrame based on logical keys.
This function is an alternative to the generate_enriched_dataframes_with_golden_ids function for cases where primary keys are needed instead of Golden IDs. This is the case, for example, for relationship tables, transactional data tables, or other tables that do not have a Golden ID because they need the original primary keys from the source data to keep rows unique.
This function ensures that each record in the DataFrame has a unique primary key. It first fills in the primary key for rows whose previous row has the same logical keys. Rows whose primary key is still NULL after that are assigned a new primary key by adding the maximum existing primary key to a row number generated for each row.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame to which primary keys will be added. | required |
| logical_keys | list | A list of column names that constitute the logical keys for deduplication. This would typically be the source_id and the name of the source. | required |
| primary_key_column | str | The name of the column where the primary keys will be stored. | required |
| existing_data | DataFrame | A DataFrame (usually a Delta table) that contains the existing data, to which new or updated primary keys will be upserted. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | A DataFrame with the primary keys added to it. |
generate_unique_ids(df, logical_keys, id_column)
Generates unique IDs for a given DataFrame based on logical keys.
This function ensures that each record in the DataFrame has a unique Golden ID. It first fills in the Golden ID for rows whose previous row has the same logical keys. Rows whose Golden ID is still NULL after that are assigned a new Golden ID by adding the maximum existing Golden ID to a row number generated for each row.
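The "previous row" check and the "maximum plus row number" rule can be illustrated with a small plain-Python sketch. It is a simplification under assumptions, not the library's Spark code: the name `fill_unique_ids`, the ordering (rows with an ID first within each group of equal logical keys), and the per-group counter standing in for the row number are all hypothetical choices made for this example.

```python
def fill_unique_ids(rows, logical_keys, id_column):
    """Plain-Python illustration of the fill-then-assign logic (hypothetical)."""
    def key_of(row):
        return tuple(row[k] for k in logical_keys)

    # Order rows so equal logical keys are adjacent, with rows that already
    # carry an ID placed first in each group (assumed ordering).
    ordered = sorted(rows, key=lambda r: (key_of(r), r.get(id_column) is None))
    # Highest ID already present; 0 when no row has an ID yet.
    max_id = max((r[id_column] for r in rows if r.get(id_column) is not None), default=0)

    result, prev, counter = [], None, 0
    for row in ordered:
        row = dict(row)  # copy so the input rows are left untouched
        if row.get(id_column) is None:
            if prev is not None and key_of(prev) == key_of(row):
                # Same logical keys as the previous row: reuse its ID.
                row[id_column] = prev[id_column]
            else:
                # New logical key: maximum existing ID plus a running number.
                counter += 1
                row[id_column] = max_id + counter
        result.append(row)
        prev = row
    return result


# Hypothetical sample data: one row already has an ID, the others do not.
sample = [
    {"source_id": "b7", "source": "gbif", "id": None},
    {"source_id": "a1", "source": "gbif", "id": 5},
    {"source_id": "a1", "source": "gbif", "id": None},
    {"source_id": "b7", "source": "gbif", "id": None},
    {"source_id": "c3", "source": "gbif", "id": None},
]
filled = fill_unique_ids(sample, ["source_id", "source"], "id")
```

Here the row without an ID for `a1` inherits 5 from its neighbour, while `b7` and `c3` receive new IDs above the existing maximum.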
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame to which Golden IDs will be added. | required |
| logical_keys | list | A list of column names that constitute the logical keys for deduplication. | required |
| id_column | str | The name of the column where the Golden IDs will be stored. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | A DataFrame with the Golden IDs filled in. |