Package 'pysparklyr'

Title: Provides a 'PySpark' Back-End for the 'sparklyr' Package
Description: Enables 'sparklyr' to integrate with 'Spark Connect' and 'Databricks Connect' by providing a wrapper over the 'PySpark' 'Python' library.
Authors: Edgar Ruiz [aut, cre], Posit Software, PBC [cph, fnd]
Maintainer: Edgar Ruiz <[email protected]>
License: MIT + file LICENSE
Version: 0.1.5.9002
Built: 2024-10-31 05:36:09 UTC
Source: https://github.com/mlverse/pysparklyr

Help Index


Deploys Databricks-backed content to a publishing server

Description

This is a convenience function meant to make it easier to publish your Databricks-backed content to a publishing server. It is primarily meant to be used with Posit Connect.

Usage

deploy_databricks(
  appDir = NULL,
  python = NULL,
  account = NULL,
  server = NULL,
  lint = FALSE,
  forceGeneratePythonEnvironment = TRUE,
  version = NULL,
  cluster_id = NULL,
  host = NULL,
  token = NULL,
  confirm = interactive(),
  ...
)

Arguments

appDir

A directory containing an application (e.g., a Shiny app or plumber API). Defaults to NULL. If left NULL, and if called within RStudio, it will attempt to use the folder of the currently opened document within the IDE. If there are no open documents, or if not working in the RStudio IDE, it will use getwd() as the default value.

python

Full path to a Python binary for use by reticulate. Defaults to NULL. If left NULL, this function will attempt to find a viable local Python environment to replicate, using the following hierarchy:

  1. version - The cluster's DBR version

  2. cluster_id - Query the cluster to obtain its DBR version

  3. If a Python environment is already loaded in the current R session, verify that it is suitable to be used

account

The name of the account to use for publishing

server

The name of the target server to publish to

lint

Lint the project before initiating deployment? Defaults to FALSE. Linting has been causing issues for this type of content.

forceGeneratePythonEnvironment

If an existing requirements.txt file is found, it will be overwritten when this argument is TRUE.

version

The Databricks Runtime (DBR) version. Use if python is NULL.

cluster_id

The Databricks cluster ID. Use if python and version are NULL.

host

The Databricks host URL. Defaults to NULL. If left NULL, it will use the environment variable DATABRICKS_HOST

token

The Databricks authentication token. Defaults to NULL. If left NULL, it will use the environment variable DATABRICKS_TOKEN

confirm

Should the user be prompted to confirm that the correct information is being used for deployment? Defaults to interactive()

...

Additional named arguments passed to the rsconnect::deployApp() function

Value

No value is returned to R. Output is sent to the console only.
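
Examples

A minimal sketch of a typical deployment call. The folder, server name, account, and cluster ID below are placeholders; the Databricks host and token are expected to come from the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and the target server should already be registered with rsconnect:

# Deploy the content in "my_app", resolving the Python environment
# from the cluster's DBR version (placeholder values shown)
deploy_databricks(
  appDir = "my_app",
  server = "my-connect-server",
  account = "my-account",
  cluster_id = "1234-567890-abcde123"
)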


Installs PySpark and Python dependencies

Description

Installs PySpark and Python dependencies

Installs Databricks Connect and Python dependencies

Usage

install_pyspark(
  version = NULL,
  envname = NULL,
  python_version = NULL,
  new_env = TRUE,
  method = c("auto", "virtualenv", "conda"),
  as_job = TRUE,
  install_ml = FALSE,
  ...
)

install_databricks(
  version = NULL,
  cluster_id = NULL,
  envname = NULL,
  python_version = NULL,
  new_env = TRUE,
  method = c("auto", "virtualenv", "conda"),
  as_job = TRUE,
  install_ml = FALSE,
  ...
)

Arguments

version

Version of 'databricks.connect' to install. Defaults to NULL. If NULL, it will check against PyPI to get the current library version.

envname

The name of the Python Environment to use to install the Python libraries. Defaults to NULL. If NULL, a name will automatically be assigned based on the version that will be installed

python_version

The minimum required version of Python to use to create the Python environment. Defaults to NULL. If NULL, it will check against PyPI to get the minimum required Python version.

new_env

If TRUE, any existing Python virtual environment and/or Conda environment specified by envname is deleted first.

method

The installation method to use. If creating a new environment, "auto" (the default) is equivalent to "virtualenv". Otherwise "auto" infers the installation method based on the type of Python environment specified by envname.

as_job

Runs the installation as a background job when this function is used within the RStudio IDE. Defaults to TRUE.

install_ml

Installs ML-related Python libraries. Defaults to FALSE, matching the usage above. The default avoids installing the rather large 'torch' library on machines with limited storage when the ML features are not going to be used. This applies to any environment backed by 'Spark' version 3.5 or above.

...

Passed on to reticulate::py_install()

cluster_id

The ID of the target Databricks cluster. If provided, this value will be used to extract the cluster's DBR version.

Value

It returns no value to the R session. This function's purpose is to create the 'Python' environment and install the appropriate set of 'Python' libraries inside the new environment. During runtime, this function will send messages to the console describing the steps it is taking. For example, it will let the user know if it is getting the latest version of the Python library from 'PyPI.org', and the result of that query.
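
Examples

A minimal sketch of common installation calls. The version number and cluster ID are illustrative placeholders; when the arguments are left as NULL, the functions will look up the appropriate versions on their own:

# Install the latest 'pyspark' available on PyPI into a new environment
install_pyspark()

# Install 'databricks.connect' to match a specific DBR version
install_databricks(version = "14.1")

# Or let the function query a cluster to determine the matching version
install_databricks(cluster_id = "1234-567890-abcde123")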


Lists installed Python libraries

Description

Lists installed Python libraries

Usage

installed_components(list_all = FALSE)

Arguments

list_all

Flag that indicates whether to display all of the installed packages, or only the top two, namely pyspark and databricks.connect

Value

Returns no value; it only sends information to the console. The information includes the current versions of 'sparklyr' and 'pysparklyr', as well as the 'Python' environment currently loaded.
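
Examples

A short sketch showing both ways of calling this function:

# Show the versions of sparklyr, pysparklyr, the loaded Python environment,
# and the top Python libraries
installed_components()

# List every Python package installed in the loaded environment
installed_components(list_all = TRUE)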


Creates the 'label' and 'features' columns

Description

Creates the 'label' and 'features' columns

Usage

ml_prepare_dataset(
  x,
  formula = NULL,
  label = NULL,
  features = NULL,
  label_col = "label",
  features_col = "features",
  keep_original = TRUE,
  ...
)

Arguments

x

A tbl_pyspark object

formula

An R formula. Used when x is a tbl_spark.

label

The name of the label column.

features

The name(s) of the feature columns as a character vector.

label_col

Label column name, as a length-one character vector.

features_col

Features column name, as a length-one character vector.

keep_original

Boolean flag that indicates whether the output will contain the original columns from x. Defaults to TRUE.

...

Added for backwards compatibility. Not in use today.

Details

At this time, 'Spark ML Connect' does not include a Vector Assembler transformer. The main thing this function does is create a 'PySpark' array column. Pipelines require 'label' and 'features' columns. Even though it is a single column in the dataset, the 'features' column will contain all of the predictors inside an array. This function also creates a new 'label' column that copies the outcome variable. This makes it a lot easier to remove the 'label' and outcome columns.

Value

A tbl_pyspark with either the original columns from x plus the 'label' and 'features' columns, or the 'label' and 'features' columns only.
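
Examples

A minimal sketch, assuming an active 'Spark Connect' or 'Databricks Connect' connection named sc; copying mtcars into the session is shown only to make the example self-contained:

library(sparklyr)

tbl_mtcars <- copy_to(sc, mtcars)

# Using an R formula: 'mpg' becomes the 'label' column, while 'wt' and 'cyl'
# are combined into the 'features' array column
ml_prepare_dataset(tbl_mtcars, formula = mpg ~ wt + cyl)

# Naming the columns directly, and dropping the original columns
ml_prepare_dataset(
  tbl_mtcars,
  label = "mpg",
  features = c("wt", "cyl"),
  keep_original = FALSE
)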


Writes the 'requirements.txt' file, containing the needed Python libraries

Description

This is a helper function meant to be used for deployments of a document or application. By default, deploy_databricks() will run this function the first time you use that function to deploy content to Posit Connect.

Usage

requirements_write(
  envname = NULL,
  destfile = "requirements.txt",
  overwrite = FALSE,
  ...
)

Arguments

envname

The name of, or path to, a Python virtual environment.

destfile

Target path for the requirements file. Defaults to 'requirements.txt'.

overwrite

Replace the contents of the file if it already exists?

...

Additional arguments passed to reticulate::py_list_packages()

Value

No value is returned to R. The output is a text file with the list of Python libraries.
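
Examples

A minimal sketch; the environment name shown is a placeholder for one created by install_databricks() or install_pyspark():

# Write 'requirements.txt' based on the Python environment currently
# loaded by reticulate
requirements_write()

# Target a specific environment and replace an existing file
requirements_write(
  envname = "r-sparklyr-databricks-14.1",
  destfile = "requirements.txt",
  overwrite = TRUE
)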


Starts and stops Spark Connect locally

Description

Starts and stops Spark Connect locally

Usage

spark_connect_service_start(
  version = "3.5",
  scala_version = "2.12",
  include_args = TRUE,
  ...
)

spark_connect_service_stop(version = "3.5", ...)

Arguments

version

Spark version to use (3.4 or above)

scala_version

Acceptable Scala version of packages to be loaded

include_args

Flag that indicates whether to add the additional arguments to the command that starts the service. At this time, only the 'packages' argument is submitted.

...

Optional arguments; currently unused

Value

It returns messages to the console with the status of starting and stopping the local Spark Connect service.
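
Examples

A minimal sketch of a local 'Spark Connect' session. It assumes that install_pyspark() has already been run, and that sparklyr connects to the service with method = "spark_connect":

library(sparklyr)

# Start the local Spark Connect service
spark_connect_service_start(version = "3.5")

# Connect to the running service
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect",
  version = "3.5"
)

# ... interact with the connection ...

spark_disconnect(sc)

# Stop the local service
spark_connect_service_stop(version = "3.5")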