Title: | Provides a 'PySpark' Back-End for the 'sparklyr' Package |
---|---|
Description: | It enables 'sparklyr' to integrate with 'Spark Connect' and 'Databricks Connect' by providing a wrapper over the 'PySpark' 'Python' library. |
Authors: | Edgar Ruiz [aut, cre], Posit Software, PBC [cph, fnd] |
Maintainer: | Edgar Ruiz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.5.9002 |
Built: | 2024-10-31 05:36:09 UTC |
Source: | https://github.com/mlverse/pysparklyr |
This is a convenience function meant to make it easier for you to publish your Databricks-backed content to a publishing server. It is primarily meant to be used with Posit Connect.
deploy_databricks(
  appDir = NULL,
  python = NULL,
  account = NULL,
  server = NULL,
  lint = FALSE,
  forceGeneratePythonEnvironment = TRUE,
  version = NULL,
  cluster_id = NULL,
  host = NULL,
  token = NULL,
  confirm = interactive(),
  ...
)
appDir | A directory containing an application (e.g. a Shiny app or plumber API). Defaults to NULL. If left NULL, and if called within RStudio, it will attempt to use the folder of the currently opened document within the IDE. If there are no opened documents, or when not working in the RStudio IDE, it will use the current working directory. |
python | Full path to a Python binary for use by 'reticulate'. |
account | The name of the account to use to publish. |
server | The name of the target server to publish to. |
lint | Lint the project before initiating deployment? Defaults to FALSE. Linting has been causing issues for this type of content. |
forceGeneratePythonEnvironment | If an existing requirements.txt file is found, it will be overwritten when this argument is TRUE. |
version | The Databricks Runtime (DBR) version. Use if cluster_id is not available. |
cluster_id | The Databricks cluster ID. Use if version is not available. |
host | The Databricks host URL. Defaults to NULL. If left NULL, it will use the DATABRICKS_HOST environment variable. |
token | The Databricks authentication token. Defaults to NULL. If left NULL, it will use the DATABRICKS_TOKEN environment variable. |
confirm | Should the user be prompted to confirm that the correct information is being used for deployment? Defaults to interactive(). |
... | Additional named arguments passed to rsconnect::deployApp(). |
No value is returned to R; the function only produces output to the console.
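For illustration, a minimal sketch of a deployment call; the content folder and cluster ID below are hypothetical placeholders, and the call assumes a Posit Connect account already configured through 'rsconnect':

library(pysparklyr)

# Deploy the content in a hypothetical folder to Posit Connect, pointing it
# at a hypothetical Databricks cluster. Credentials are read from the
# DATABRICKS_HOST / DATABRICKS_TOKEN environment variables.
deploy_databricks(
  appDir = "quarto-report",              # hypothetical content folder
  cluster_id = "1026-175310-7cpsh3g8",   # hypothetical cluster ID
  host = Sys.getenv("DATABRICKS_HOST"),
  token = Sys.getenv("DATABRICKS_TOKEN"),
  confirm = FALSE                        # skip the interactive confirmation prompt
)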
Installs PySpark and Python dependencies
Installs Databricks Connect and Python dependencies
install_pyspark(
  version = NULL,
  envname = NULL,
  python_version = NULL,
  new_env = TRUE,
  method = c("auto", "virtualenv", "conda"),
  as_job = TRUE,
  install_ml = FALSE,
  ...
)

install_databricks(
  version = NULL,
  cluster_id = NULL,
  envname = NULL,
  python_version = NULL,
  new_env = TRUE,
  method = c("auto", "virtualenv", "conda"),
  as_job = TRUE,
  install_ml = FALSE,
  ...
)
version | Version of 'databricks.connect' to install. Defaults to NULL. |
envname | The name of the Python environment to use to install the Python libraries. Defaults to NULL. |
python_version | The minimum required version of Python to use to create the Python environment. Defaults to NULL. |
new_env | If TRUE, any existing Python virtual environment and/or Conda environment specified by envname is deleted first. |
method | The installation method to use: "auto", "virtualenv", or "conda". Defaults to "auto", which picks an appropriate method for the local setup. |
as_job | Runs the installation as a background job when called from within the RStudio IDE. Defaults to TRUE. |
install_ml | Installs ML related Python libraries. Defaults to FALSE. This is mainly for machines with limited storage, to avoid installing the rather large 'torch' library if the ML features are not going to be used. This applies to any environment backed by 'Spark' version 3.5 or above. |
... | Passed on to the 'reticulate' installation functions. |
cluster_id | The ID of the target Databricks cluster. If provided, this value is used to extract the cluster's Databricks Runtime version. |
It returns no value to the R session. This function's purpose is to create the 'Python' environment and install the appropriate set of 'Python' libraries inside the new environment. During runtime, it sends messages to the console describing the steps it is taking, for example, whether it is retrieving the latest version of the Python library from 'PyPI.org' and the result of that query.
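As a rough sketch of typical calls; the cluster ID and version numbers below are hypothetical placeholders:

library(pysparklyr)

# Install 'databricks.connect' to match a hypothetical cluster's runtime version
install_databricks(cluster_id = "1026-175310-7cpsh3g8")

# Or target a specific (hypothetical) version directly
install_databricks(version = "14.1")

# Install 'pyspark' for Spark Connect, without running it as an RStudio job
install_pyspark(version = "3.5", as_job = FALSE)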
Lists installed Python libraries
installed_components(list_all = FALSE)
list_all | Flag that indicates whether to display all of the installed Python packages or only the top two. |
Returns no value; it only sends information to the console. The information includes the current versions of 'sparklyr' and 'pysparklyr', as well as the 'Python' environment currently loaded.
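A quick sketch of both forms of the call:

library(pysparklyr)

# Show the versions of sparklyr, pysparklyr, and the loaded Python environment
installed_components()

# Also list every Python package installed in that environment
installed_components(list_all = TRUE)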
Creates the 'label' and 'features' columns
ml_prepare_dataset(
  x,
  formula = NULL,
  label = NULL,
  features = NULL,
  label_col = "label",
  features_col = "features",
  keep_original = TRUE,
  ...
)
x | A tbl_pyspark object. |
formula | Used when label and features are not provided; a standard R formula from which the label (left-hand side) and the features (right-hand side) are derived. |
label | The name of the label column. |
features | The name(s) of the feature columns as a character vector. |
label_col | Label column name, as a length-one character vector. |
features_col | Features column name, as a length-one character vector. |
keep_original | Boolean flag that indicates whether the output will also contain the original columns from x. |
... | Added for backwards compatibility; not currently in use. |
At this time, 'Spark ML Connect' does not include a Vector Assembler transformer. The main thing this function does is create a 'PySpark' array column. Pipelines require a 'label' and a 'features' column. Even though it is a single column in the dataset, the 'features' column will contain all of the predictors inside an array. This function also creates a new 'label' column that copies the outcome variable. This makes it a lot easier to remove the 'label' and outcome columns afterwards.
A tbl_pyspark, with either the original columns from x plus the 'label' and 'features' columns, or the 'label' and 'features' columns only.
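As a sketch, assuming an existing Spark Connect connection created with 'sparklyr'; the connection details and the chosen label/feature columns below are illustrative:

library(sparklyr)
library(pysparklyr)

# Hypothetical local Spark Connect connection
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect",
  version = "3.5"
)

tbl_mtcars <- copy_to(sc, mtcars)

# Build the 'label' and 'features' columns: 'am' becomes the label, and the
# selected predictors are packed into a single 'features' array column
prepped <- ml_prepare_dataset(
  tbl_mtcars,
  label = "am",
  features = c("mpg", "wt", "hp"),
  keep_original = FALSE
)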
This is a helper function meant to be used when deploying a document or application. By default, deploy_databricks() runs this function the first time you use that function to deploy content to Posit Connect.
requirements_write(
  envname = NULL,
  destfile = "requirements.txt",
  overwrite = FALSE,
  ...
)
envname | The name of, or path to, a Python virtual environment. |
destfile | Target path for the requirements file. Defaults to 'requirements.txt'. |
overwrite | Replace the contents of the file if it already exists? |
... | Additional optional arguments. |
No value is returned to R. The output is a text file with the list of Python libraries.
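A small sketch of a manual call; the environment name below is a hypothetical placeholder:

library(pysparklyr)

# Write the Python packages from a hypothetical environment to requirements.txt,
# replacing the file if it already exists
requirements_write(
  envname = "r-sparklyr-databricks-14.1",
  overwrite = TRUE
)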
Starts and stops Spark Connect locally
spark_connect_service_start(
  version = "3.5",
  scala_version = "2.12",
  include_args = TRUE,
  ...
)

spark_connect_service_stop(version = "3.5", ...)
version | Spark version to use (3.4 or above). |
scala_version | Acceptable Scala version of packages to be loaded. |
include_args | Flag that indicates whether to add the additional arguments to the command that starts the service. At this time, only the 'packages' argument is submitted. |
... | Optional arguments; currently unused. |
It returns messages to the console with the status of starting and stopping the local Spark Connect service.
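An illustrative start/connect/stop sequence; the master URL assumes the default local address used by Spark Connect:

library(sparklyr)
library(pysparklyr)

# Start the local Spark Connect service for Spark 3.5
spark_connect_service_start(version = "3.5")

# Connect to the local service with sparklyr
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect",
  version = "3.5"
)

# ... work with Spark ...

spark_disconnect(sc)

# Stop the local service when finished
spark_connect_service_stop(version = "3.5")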