Common Functions used Codebase-wide

The codebase-wide common functions are used for DataFrame manipulation, Slack notifications, and general utilities.

biocloudcore.common_dataframe_functions

This file contains common DataFrame functions for interacting with PySpark DataFrames.

add_timestamp_columns(df)

Adds two timestamp columns to the given DataFrame: inserted_ts_utc (timestamp) and updated_ts_utc (timestamp).

Parameters:

Name | Type | Description | Default
df | DataFrame | Input DataFrame | required

Returns:

Name | Type | Description
DataFrame | DataFrame | DataFrame with added timestamp columns
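
As an illustration, a function with this behavior could be sketched as follows. This is not the biocloudcore implementation: the name add_timestamp_columns_sketch is hypothetical, and filling both columns with Spark's current_timestamp() is an assumption. Whether the value reads as UTC depends on the session time zone (spark.sql.session.timeZone), which this sketch assumes is set to UTC.

```python
from typing import Any

# Column names taken from the description above.
TIMESTAMP_COLUMNS = ("inserted_ts_utc", "updated_ts_utc")


def add_timestamp_columns_sketch(df: Any) -> Any:
    """Return a new PySpark DataFrame with both timestamp columns added."""
    # Imported here so the sketch can be loaded on machines without Spark.
    from pyspark.sql import functions as F

    for column_name in TIMESTAMP_COLUMNS:
        # Assumption: both columns start out as the current timestamp.
        df = df.withColumn(column_name, F.current_timestamp())
    return df
```

Calling it on an existing DataFrame leaves the original columns untouched and appends the two timestamp columns at the end.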

column_values_to_list(df, column_name)

Returns all values from a dataframe column as a list.

Parameters:

Name | Type | Description | Default
df | DataFrame | Input DataFrame | required
column_name | str | Name of the column to extract | required

Returns:

Name | Type | Description
list | list | List of column values
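
A minimal sketch of how such a helper might work, assuming df is a PySpark DataFrame. The name column_values_to_list_sketch is hypothetical and the actual biocloudcore code may differ; the sketch relies only on the DataFrame methods select() and collect().

```python
from typing import Any, List


def column_values_to_list_sketch(df: Any, column_name: str) -> List[Any]:
    """Collect one column of a PySpark DataFrame into a Python list."""
    # select() narrows the DataFrame to one column; collect() pulls every
    # row to the driver, so this only suits columns that fit in memory.
    rows = df.select(column_name).collect()
    # PySpark Row objects support lookup by field name.
    return [row[column_name] for row in rows]
```

Because collect() moves all rows to the driver, a helper like this is best kept for small lookup columns rather than large datasets.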

biocloudcore.common_slack_functions

This file contains common Slack functions that allow us to communicate errors to Slack.

send_slack_message(webhook_url, message)

Posts a message to a Slack channel through an incoming webhook.

Parameters:

Name | Type | Description | Default
webhook_url | str | URL that Slack needs to post to a channel. See https://api.slack.com/apps/A08U5CFQMDG/incoming-webhooks. | required
message | str | Message that will be posted to Slack. | required
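
For orientation, posting to a Slack incoming webhook comes down to an HTTP POST with a JSON body containing a text field. The sketch below shows that using only the standard library. The names build_slack_request and send_slack_message_sketch and the timeout parameter are hypothetical, and the real biocloudcore function may use a different HTTP client or error handling.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook (hypothetical helper)."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message_sketch(webhook_url: str, message: str, timeout: float = 10.0) -> int:
    """Post the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, message)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the payload easy to inspect without touching the network.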

biocloudcore.common_utility_functions

This file contains common utility functions that may be used throughout the code base. Please put only functions here that are too abstract to be put in any of the other common functions files.

find_file_in_repo(file_to_find)

Returns the absolute path of the file we are trying to find. This is useful when the same file lives in different locations depending on the mode of compute. For example, a data contract has different locations on local and cluster compute:

On local compute (example from a local machine): $HOME/naturalis/biocloud/biocloud-core/biocloudcore/raw/dienekes/contract.yaml
On the cluster: /Workspace/Repos/.internal/4b660e87e8_commits/35f92533494205f053f2df37/biocloudcore/raw/dienekes/contract.yaml

This allows us to find files both on local compute and on the cluster (development and production). Relative paths can also fail locally if PYTHONPATH has not been set correctly or needs different defaults than the ones PyCharm sets.

Parameters:

Name | Type | Description | Default
file_to_find | str | File we are looking for, preferably with a preceding path like 'raw/dienekes/contract.yaml'. | required

Returns:

Type | Description
str | Absolute path of the file we were looking for.

Raises an exception if nothing is found.
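
One way to implement this kind of lookup is to search a root directory for a file whose trailing path components match the requested path. The sketch below is an assumption about the approach, not the actual biocloudcore code: the name find_file_in_repo_sketch, the explicit repo_root parameter (the documented function takes only file_to_find), and the choice of FileNotFoundError are all hypothetical.

```python
from pathlib import Path


def find_file_in_repo_sketch(file_to_find: str, repo_root: Path) -> str:
    """Return the absolute path of file_to_find somewhere below repo_root."""
    wanted = Path(file_to_find)
    for candidate in sorted(repo_root.resolve().rglob(wanted.name)):
        # Compare trailing path components so 'raw/dienekes/contract.yaml'
        # does not match a contract.yaml in some other directory.
        if candidate.is_file() and candidate.parts[-len(wanted.parts):] == wanted.parts:
            return str(candidate)
    raise FileNotFoundError(f"{file_to_find!r} not found below {repo_root}")
```

Passing a preceding path such as 'raw/dienekes/contract.yaml' rather than a bare file name reduces the chance of matching a same-named file elsewhere in the tree.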

nba_with_nsr_is_up()

Checks whether the NBA is up and running. Sends a prepared statement that checks whether we can query the NSR for family names. The check looks for sunflowers (Helianthus), which we know are in the database. If the result set is empty, we know that the NBA is not functioning properly and we can stop the blasting process, as no family names will be resolved.

Returns:

Name | Type | Description
Bool | bool | Whether the NBA is up or not.
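
The underlying pattern is a health check against a record that is known to exist: an empty result means the service cannot be trusted. A minimal sketch of that logic, with the actual NBA query left out, might look like this. The name known_record_check and the injected run_query callable are hypothetical, and treating a raised exception as "down" is an assumption rather than documented behavior.

```python
from typing import Callable, Sequence


def known_record_check(run_query: Callable[[], Sequence]) -> bool:
    """Return True when a query for a known record yields at least one result."""
    try:
        results = run_query()
    except Exception:
        # Assumption: a failing query counts as "service down".
        return False
    return len(results) > 0
```

A caller would run this once before starting the blasting process and stop early when it returns False.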

split_list_in_equal_batches(lst, n)

Split a list into n approximately equal chunks.

Parameters:

Name | Type | Description | Default
lst | list | List to be split | required
n | int | Number of splits | required

Returns:

Type | Description
list[list] | List of chunks
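
To make "approximately equal" concrete, here is one common way to do such a split: every chunk gets len(lst) // n items and the first len(lst) % n chunks get one extra. This is a sketch under that assumption. The name split_list_in_equal_batches_sketch is hypothetical, and the real function may distribute the remainder or handle edge cases (such as n larger than the list length) differently.

```python
from typing import Any, List


def split_list_in_equal_batches_sketch(lst: List[Any], n: int) -> List[List[Any]]:
    """Split lst into n contiguous chunks whose sizes differ by at most one."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    size, remainder = divmod(len(lst), n)
    batches = []
    start = 0
    for i in range(n):
        # The first `remainder` chunks take one extra item.
        end = start + size + (1 if i < remainder else 0)
        batches.append(lst[start:end])
        start = end
    return batches


print(split_list_in_equal_batches_sketch(list(range(10)), 3))
# prints [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With this approach, asking for more chunks than there are items yields trailing empty lists, so callers may want to filter those out.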