Common Functions in the Curated Layer
The common functions in the curated layer are used for shorthand querying, formatting, writing, and post-write cleanup.
biocloudcore.curated.common_dna_functions
This file contains common DNA functions used for performing recurring queries on DNA data.
add_best_blast_match_level(df)
Fill the 'blast_identification_match' column according to the following rules:
- If there is no corresponding identification_dna record for a given material entity, the entity has not been blasted yet and the value is null.
- If, for at least one identification_dna record, identification_dna.verbatim_identification = identification_morphology.verbatim_identification for a given material entity: "species match".
- If identification_dna.taxonomically_validated = TRUE for a given material entity: "family match".
- Otherwise: "no match".

There can be multiple blasts per consensus_sequence_id + date_blasted, and each can have a different match (e.g. one blast has a family match and another has a species match), so each blast is first given a numeric blast_identification_match value. A window per unique blast then keeps the highest value, with species match ranked highest and no match lowest. The number is then converted back to the string described above, e.g. 3 becomes a species match. So if a unique blast has multiple family matches and one species match, it is recorded as a species match, since that is the best match.
The returned dataframe contains the columns used throughout the curated layer. Because this query recurs in the curated layer, this common function can be used to shorten that logic.
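The ranking described above can be illustrated with a small plain-Python sketch. It is not the function's implementation, which operates on a Spark DataFrame with a window; it only mirrors the "number each match, keep the highest per unique blast, convert back to a string" idea. The numeric values for family match and no match, the `match` field name, and the sample records are assumptions made for illustration; the text only states that 3 is a species match.

```python
from collections import defaultdict

# Assumed numeric ranks: the docs only state that 3 is a species match;
# 2 and 1 are illustrative guesses for the other two levels.
MATCH_RANK = {"no match": 1, "family match": 2, "species match": 3}
RANK_TO_MATCH = {rank: label for label, rank in MATCH_RANK.items()}


def best_match_per_blast(records):
    """Return the best match label per (consensus_sequence_id, date_blasted)."""
    best = defaultdict(int)
    for rec in records:
        key = (rec["consensus_sequence_id"], rec["date_blasted"])
        # Keep the highest rank seen so far for this unique blast.
        best[key] = max(best[key], MATCH_RANK[rec["match"]])
    # Convert the winning number back to its string label.
    return {key: RANK_TO_MATCH[rank] for key, rank in best.items()}


# Hypothetical sample data: one blast with mixed matches, one with no match.
records = [
    {"consensus_sequence_id": "seq-1", "date_blasted": "2024-01-01", "match": "family match"},
    {"consensus_sequence_id": "seq-1", "date_blasted": "2024-01-01", "match": "species match"},
    {"consensus_sequence_id": "seq-1", "date_blasted": "2024-01-01", "match": "family match"},
    {"consensus_sequence_id": "seq-2", "date_blasted": "2024-01-01", "match": "no match"},
]

result = best_match_per_blast(records)
```

Here the blast for `seq-1` has two family matches and one species match, so it ends up labelled as a species match, in line with the rule above.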
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Dataframe with blast DNA identification data joined with the taxon and sequence | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | Dataframe with blast DNA identification data and the best match level (no match, family match or species match) |
add_blast_identifications(spark, df)
Selects blast DNA identifications and joins them with their taxon and sequence metadata.
The returned dataframe contains the columns used throughout the curated layer. Because this query recurs in the curated layer, this common function can be used to shorten that logic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| spark | SparkSession | Spark session | required |
| df | DataFrame | Dataframe with taxon and sequence metadata | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | Dataframe with DNA identification data joined with the taxon and sequence |
get_taxon_with_sequence(spark)
Selects material entities and their associated DNA data by joining them.
The returned dataframe contains the columns used throughout the curated layer. Because this query recurs in the curated layer, this common function can be used to shorten that logic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| spark | SparkSession | Spark session | required |
Returns:
| Name | Type | Description |
|---|---|---|
| DataFrame | DataFrame | Dataframe with all columns from material entity and its associated DNA data |
biocloudcore.curated.dsi.common_dsi_functions
This file contains common DSI functions used for interacting with DSI data.
cleanup_curated_folders(dsi_curated_root_path, data_lake, run_date, amount_of_data_days)
Cleans up folders in a specified S3 path by deleting those older than a given number of days.
The function checks folders based on their naming convention, assuming that the folder names are in a date format (e.g., 'YYYY-MM-DD').
Folders that are older than the 'run_date' minus 'amount_of_data_days' are deleted along with their contents.
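The cutoff rule can be sketched in plain Python, leaving out the S3 calls. The helper name `folders_to_delete` is hypothetical, and two details are assumptions rather than documented behaviour: a folder dated exactly on the cutoff is kept, and folder names that do not parse as 'YYYY-MM-DD' are skipped.

```python
from datetime import date, timedelta


def folders_to_delete(folder_names, run_date, amount_of_data_days):
    """Return the date-named folders that fall before the cutoff date."""
    cutoff = date.fromisoformat(run_date) - timedelta(days=amount_of_data_days)
    expired = []
    for name in folder_names:
        try:
            folder_date = date.fromisoformat(name)
        except ValueError:
            # Assumption: names that are not dates are left untouched.
            continue
        # Assumption: "older than" is strict, so the cutoff day itself is kept.
        if folder_date < cutoff:
            expired.append(name)
    return expired


# Hypothetical folder listing under the curated root path.
folders = ["2024-01-01", "2024-01-05", "2024-01-10", "not-a-date"]
expired = folders_to_delete(folders, run_date="2024-01-10", amount_of_data_days=5)
```

With a run date of 2024-01-10 and five days of data to keep, the cutoff is 2024-01-05, so only the 2024-01-01 folder is selected for deletion in this sketch.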
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dsi_curated_root_path | str | The root S3 path where the folders are located. | required |
| data_lake | DataLake | The DataLake object that provides access to the S3 client. | required |
| run_date | str | The date from which to calculate the cutoff date for folder deletion. Folders older than (run_date - amount_of_data_days) will be deleted. | required |
| amount_of_data_days | int | The number of days of data to keep. Folders older than this threshold will be deleted. | required |
write_manifest(relative_path, data_lake, run_date)
Creates a manifest JSON file for S3 directory contents.
Creates a manifest JSON file that contains information about the files in a specified S3 directory and uploads it to the same directory with the name 'manifest.json'.
The manifest includes a 'write_date' key that specifies the date the manifest was created and a 'files' key that lists the files in the specified S3 path. This lets the DSI team verify that we are not writing while they are reading. By checking the write_date of this manifest file, DSI can ingest our data with confidence.
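As a rough illustration of the manifest contents, the sketch below builds the JSON body in plain Python without touching S3. Only the 'write_date' and 'files' keys come from the description above; the helper name `build_manifest`, the representation of each file as a plain key string, and the sample file names are assumptions.

```python
import json


def build_manifest(file_keys, run_date):
    """Build the manifest JSON body for a list of file keys."""
    manifest = {
        "write_date": run_date,
        # Assumption: files are listed as plain key strings, sorted for stability.
        "files": sorted(file_keys),
    }
    return json.dumps(manifest, indent=2)


# Hypothetical file keys found under the relative S3 path.
body = build_manifest(["part-0001.parquet", "part-0000.parquet"], "2024-01-10")
parsed = json.loads(body)
```

A consumer could then compare `parsed["write_date"]` with the date it expects before reading the listed files.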
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| relative_path | str | The relative S3 path where the files are located and where the manifest will be uploaded. | required |
| data_lake | DataLake | The DataLake object that provides access to the S3 client. | required |
| run_date | str | The date the manifest is being written. This date will be included in the manifest. | required |
biocloudcore.curated.crs_harvest_monitor.common_harvest_monitor_functions
This file contains common functions used for interacting with harvest monitor data.
format_harvest_monitor_data(df)
This function formats the raw harvest monitor data according to the mapping requested by the collection managers. In essence, it applies the defined mapping and formats the data to fit the requirements of the dashboard.
TODO: should this be done in an enriched layer?
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Spark dataframe to be transformed and formatted | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Dataframe with the formatted harvest monitor data |