Common Functions in the Curated Layer

The common functions in the curated layer are used for shorthand querying, formatting, writing, and post-write cleanup.

biocloudcore.curated.common_dna_functions

This file contains common DNA functions used for performing recurring queries on DNA data.

add_best_blast_match_level(df)

Fill the 'blast_identification_match' column with data according to the following rules:

- If there is no corresponding identification_dna record for a given material entity, we have not yet blasted this entity and will write null.
- If, for at least one identification_dna record, identification_dna.verbatim_identification = identification_morphology.verbatim_identification for a given material entity: "species match".
- If identification_dna.taxonomically_validated = TRUE for a given material entity: "family match".
- Else: "no match".

As we have multiple blasts per consensus_sequence_id + date_blasted and each can have a different match (e.g. one blast has a family match and one has a species match), we need to give each blast a blast_identification_match number. Then, per unique blast, we create a window that keeps the highest match, with species match being the highest and no match the lowest. The number is then converted to the string described above, e.g. 3 being a species match. So if a unique blast has multiple family matches and one species match, we save that it had a species match, as this is the highest and best match.
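The ranking rule above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the library's implementation: the real function works on a Spark dataframe with a window, while this version uses dictionaries. The record keys, the helper names `rank_record` and `best_match_per_blast`, and the ranks 2 and 1 are assumptions made for illustration; the source only states that 3 means a species match.

```python
from collections import defaultdict

# Numeric ranks. Only "3 = species match" comes from the description above;
# the values 2 and 1 are assumptions made for this sketch.
RANK_TO_LABEL = {3: "species match", 2: "family match", 1: "no match"}


def rank_record(rec):
    """Rank one identification_dna record (hypothetical dict layout)."""
    dna_name = rec["dna_verbatim_identification"]
    morphology_name = rec["morphology_verbatim_identification"]
    if dna_name is not None and dna_name == morphology_name:
        return 3
    if rec["taxonomically_validated"]:
        return 2
    return 1


def best_match_per_blast(records):
    """Keep the highest rank per (consensus_sequence_id, date_blasted)."""
    best = defaultdict(int)
    for rec in records:
        key = (rec["consensus_sequence_id"], rec["date_blasted"])
        best[key] = max(best[key], rank_record(rec))
    # Convert the winning number back to its label.
    return {key: RANK_TO_LABEL[rank] for key, rank in best.items()}


# Illustrative records; all values are made up.
records = [
    {"consensus_sequence_id": "seq-1", "date_blasted": "2024-01-10",
     "dna_verbatim_identification": "Apis",
     "morphology_verbatim_identification": "Apis mellifera",
     "taxonomically_validated": True},
    {"consensus_sequence_id": "seq-1", "date_blasted": "2024-01-10",
     "dna_verbatim_identification": "Apis mellifera",
     "morphology_verbatim_identification": "Apis mellifera",
     "taxonomically_validated": True},
    {"consensus_sequence_id": "seq-2", "date_blasted": "2024-01-10",
     "dna_verbatim_identification": "Bombus",
     "morphology_verbatim_identification": "Apis mellifera",
     "taxonomically_validated": False},
]
print(best_match_per_blast(records))
```

Material entities without any identification_dna record never appear in `records`, so they would receive null after a left join back to the material entities, in line with the first rule.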

The returned dataframe selects the columns used throughout the curated layer. Because this query is reused in several places in the curated layer, this common function can be used to shorten such logic.

Parameters:

Name Type Description Default
df DataFrame

Dataframe with blast DNA identification data joined with the taxon and sequence

required

Returns:

Name Type Description
DataFrame DataFrame

Dataframe with blast DNA identification data and the best match level (no match, family match or species match)

add_blast_identifications(spark, df)

Selects blast DNA identifications and joins them with their taxon and sequence metadata.

The returned dataframe selects the columns used throughout the curated layer. Because this query is reused in several places in the curated layer, this common function can be used to shorten such logic.

Parameters:

Name Type Description Default
spark SparkSession

Spark session

required
df DataFrame

Dataframe with taxon and sequence metadata

required

Returns:

Name Type Description
DataFrame DataFrame

Dataframe with DNA identification data joined with the taxon and sequence

get_taxon_with_sequence(spark)

Selects material entities and joins them with their associated DNA data.

The returned dataframe selects the columns used throughout the curated layer. Because this query is reused in several places in the curated layer, this common function can be used to shorten such logic.

Parameters:

Name Type Description Default
spark SparkSession

Spark session

required

Returns:

Name Type Description
DataFrame DataFrame

Dataframe with all columns from material entity and its associated DNA data

biocloudcore.curated.dsi.common_dsi_functions

This file contains common DSI functions used for interacting with DSI data.

cleanup_curated_folders(dsi_curated_root_path, data_lake, run_date, amount_of_data_days)

Cleans up folders in a specified S3 path by deleting those older than a given number of days.

The function checks folders based on their naming convention, assuming that the folder names are in a date format (e.g., 'YYYY-MM-DD').

Folders that are older than the 'run_date' minus 'amount_of_data_days' are deleted along with their contents.

Parameters:

Name Type Description Default
dsi_curated_root_path str

The root S3 path where the folders are located.

required
data_lake DataLake

The DataLake object that provides access to the S3 client.

required
run_date str

The date from which to calculate the cutoff date for folder deletion. Folders older than (run_date - amount_of_data_days) will be deleted.

required
amount_of_data_days int

The number of days of data to keep. Folders older than this threshold will be deleted.

required
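The cutoff rule for cleanup_curated_folders can be sketched in plain Python. This is a minimal illustration, not the actual implementation: it only decides which folder names qualify for deletion and does not touch S3. The helper name `folders_to_delete`, the 'YYYY-MM-DD' format for run_date, the strict "older than" comparison, and the choice to skip folder names that are not dates are all assumptions.

```python
from datetime import date, timedelta


def folders_to_delete(folder_names, run_date, amount_of_data_days):
    """Return the folder names dated before run_date - amount_of_data_days."""
    # Assumption: run_date uses the same 'YYYY-MM-DD' format as the folders.
    cutoff = date.fromisoformat(run_date) - timedelta(days=amount_of_data_days)
    expired = []
    for name in folder_names:
        try:
            folder_date = date.fromisoformat(name)
        except ValueError:
            # Assumption: names that are not dates are left untouched.
            continue
        if folder_date < cutoff:
            expired.append(name)
    return expired


# Illustrative folder names; with a 2-day window the cutoff is 2024-01-08.
names = ["2024-01-01", "2024-01-08", "2024-01-10", "tmp"]
print(folders_to_delete(names, "2024-01-10", 2))
```

Whether a folder dated exactly on the cutoff is kept or deleted depends on the real implementation; this sketch keeps it.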

write_manifest(relative_path, data_lake, run_date)

Creates a manifest JSON file for S3 directory contents.

Creates a manifest JSON file that contains information about the files in a specified S3 directory and uploads it to the same directory with the name 'manifest.json'.

The manifest includes a 'write_date' key that specifies the date the manifest was created and a 'files' key that lists the files in the specified S3 path. This lets the DSI team verify that we are not writing data while they are reading it. By checking the write_date of this manifest file, DSI can ingest our data with confidence.

Parameters:

Name Type Description Default
relative_path str

The relative S3 path where the files are located and where the manifest will be uploaded.

required
data_lake DataLake

The DataLake object that provides access to the S3 client.

required
run_date str

The date the manifest is being written. This date will be included in the manifest.

required
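The manifest described for write_manifest could look roughly like the output of the sketch below. It only builds the JSON document; listing the S3 directory and uploading 'manifest.json' are left out. The helper name `build_manifest`, the sorted order, and the representation of each file as a plain key string are assumptions, since the source only names the 'write_date' and 'files' keys.

```python
import json


def build_manifest(file_keys, run_date):
    """Build the manifest JSON text from a list of file keys."""
    manifest = {
        "write_date": run_date,
        # Assumption: each file is represented by its key, in sorted order.
        "files": sorted(file_keys),
    }
    return json.dumps(manifest, indent=2)


# Illustrative file names; the real entries come from the S3 listing.
print(build_manifest(["part-0001.parquet", "part-0000.parquet"], "2024-01-10"))
```

A consumer can then parse the file and compare 'write_date' with the date it expects before ingesting the listed files.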

biocloudcore.curated.crs_harvest_monitor.common_harvest_monitor_functions

This file contains common functions used for interacting with harvest monitor data.

format_harvest_monitor_data(df)

Formats the raw harvest monitor data according to the mapping defined by the collection managers. In essence, it applies that mapping and formats the data to fit the requirements of the dashboard.

TODO: should this be done in an enriched layer?

Parameters:

Name Type Description Default
df DataFrame

Spark dataframe to be transformed and formatted

required

Returns:

Type Description
DataFrame

DataFrame