Common Functions Used Codebase-wide
The codebase-wide common functions are used for DataFrame manipulation, Slack notifications, and general utilities.
biocloudcore.common_dataframe_functions
This file contains common DataFrame functions for interacting with PySpark DataFrames.
add_timestamp_columns(df)
Adds two timestamp columns to the given DataFrame: `inserted_ts_utc` (timestamp) and `updated_ts_utc` (timestamp).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `DataFrame` | `DataFrame` | DataFrame with added timestamp columns |
column_values_to_list(df, column_name)
Returns all values from a dataframe column as a list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame | required |
| `column_name` | `str` | Name of the column to extract | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `list` | `list` | List of column values |
biocloudcore.common_slack_functions
This file contains common Slack functions that let us report errors to Slack.
send_slack_message(webhook_url, message)
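Posts a message to a Slack channel through an incoming webhook. See https://api.slack.com/apps/A08U5CFQMDG/incoming-webhooks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `webhook_url` | `str` | URL that Slack needs in order to post to a channel. | required |
| `message` | `str` | Message that will be posted to Slack. | required |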
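As a rough illustration of how a webhook post like this can work, the sketch below sends a JSON body with a `text` field, which is the payload shape Slack incoming webhooks accept. It uses only the standard library. The names `build_slack_request` and `send_slack_message_sketch` and the `timeout` parameter are hypothetical and are not taken from `biocloudcore`; the actual implementation may use a different HTTP client and error handling.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the message as JSON.

    Hypothetical helper, not part of biocloudcore.
    """
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message_sketch(webhook_url: str, message: str, timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code.

    Assumption: network failures and HTTP errors propagate to the caller
    as urllib exceptions.
    """
    request = build_slack_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the payload logic testable without network access.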
biocloudcore.common_utility_functions
This file contains common utility functions that may be used throughout the code base. Please put only functions here that are too abstract to be put in any of the other common functions files.
find_file_in_repo(file_to_find)
Returns the absolute path of the file we are trying to find. This is useful when the same file lives in different places depending on the mode of compute. For example, a data contract has different locations on local and cluster compute.
On local compute (example from a local machine): $HOME/naturalis/biocloud/biocloud-core/biocloudcore/raw/dienekes/contract.yaml.
On the cluster: /Workspace/Repos/.internal/4b660e87e8_commits/35f92533494205f053f2df37/biocloudcore/raw/dienekes/contract.yaml.
This allows us to find files both on local compute and on the cluster (development and production). In addition, relative paths might fail locally if `PYTHONPATH` has not been set correctly or needs different defaults than PyCharm.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_to_find` | `str` | File we are looking for, preferably with a preceding path such as 'raw/dienekes/contract.yaml'. | required |
Returns:
| Type | Description |
|---|---|
| `str` | Absolute path of the file we were looking for. |

Raises an exception if nothing is found.
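One way to resolve a repo-relative path independently of the working directory is to walk upwards through the parent directories until the relative path exists. The sketch below illustrates that idea only; it is not the `biocloudcore` implementation, and the name `find_file_sketch`, the `start` parameter, and the choice of `FileNotFoundError` are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def find_file_sketch(file_to_find: str, start: Optional[Path] = None) -> str:
    """Return the absolute path of file_to_find, searching upwards from start.

    Hypothetical helper: checks the start directory and each of its
    parents for a file at the given relative path.
    """
    start_dir = (start or Path.cwd()).resolve()
    for root in [start_dir, *start_dir.parents]:
        candidate = root / file_to_find
        if candidate.is_file():
            return str(candidate)
    # Assumption: the source only says "an exception"; the type is a guess.
    raise FileNotFoundError(f"Could not find {file_to_find!r} above {start_dir}")
```

Passing a path with a preceding directory, such as 'raw/dienekes/contract.yaml', reduces the chance of matching an unrelated file with the same name higher up the tree.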
nba_with_nsr_is_up()
Checks whether the NBA is up and running. Sends a prepared statement that checks whether we can query the NSR for family names. We check for the sunflower family, the Helianthus, which we know is in the database. If the resultSet is empty, we know that the NBA is not functioning properly and we can stop the blasting process, as no family names will be resolved.
Returns:
| Name | Type | Description |
|---|---|---|
| `Bool` | `bool` | Whether the NBA is up or not. |
split_list_in_equal_batches(lst, n)
Split a list into n approximately equal chunks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `lst` | `list` | List to be split | required |
| `n` | `int` | Number of splits | required |
Returns:
| Type | Description |
|---|---|
| `list[list]` | List of chunks |
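To make "approximately equal" concrete, here is a minimal sketch of one common way to split a list into n chunks whose lengths differ by at most one. It is an illustration under assumptions, not the `biocloudcore` implementation: the name `split_into_batches_sketch` is hypothetical, and the real function may order or size the chunks differently (for instance when `n` exceeds the list length, where this sketch returns empty chunks).

```python
def split_into_batches_sketch(lst: list, n: int) -> list:
    """Split lst into n consecutive chunks whose sizes differ by at most one."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Every chunk gets `base` items; the first `extra` chunks get one more.
    base, extra = divmod(len(lst), n)
    chunks = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(lst[start:start + size])
        start += size
    return chunks


print(split_into_batches_sketch(list(range(10)), 3))
# → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Using `divmod` spreads the remainder over the leading chunks, so no chunk is more than one element larger than any other.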