| Title: | Fit 'TabNet' Models for Classification and Regression |
|---|---|
| Description: | Implements the 'TabNet' model by Sercan O. Arik et al. (2019) <doi:10.48550/arXiv.1908.07442> with 'Coherent Hierarchical Multi-label Classification Networks' by Giunchiglia et al. <doi:10.48550/arXiv.2010.10151> and provides a consistent interface for fitting and creating predictions. It's also fully compatible with the 'tidymodels' ecosystem. |
| Authors: | Daniel Falbel [aut], RStudio [cph], Christophe Regouby [cre, ctb], Egill Fridgeirsson [ctb], Philipp Haarmeyer [ctb], Sven Verweij [ctb] (ORCID: <https://orcid.org/0000-0002-5573-3952>) |
| Maintainer: | Christophe Regouby <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.8.0.9000 |
| Built: | 2026-06-07 18:09:01 UTC |
| Source: | https://github.com/mlverse/tabnet |
Parameters for the tabnet model
attention_width(range = c(8L, 64L), trans = NULL)
decision_width(range = c(8L, 64L), trans = NULL)
feature_reusage(range = c(1, 2), trans = NULL)
momentum(range = c(0.01, 0.4), trans = NULL)
mask_type(values = c("sparsemax", "entmax"))
num_independent(range = c(1L, 5L), trans = NULL)
num_shared(range = c(1L, 5L), trans = NULL)
num_steps(range = c(3L, 10L), trans = NULL)
range |
the default range for the parameter value |
trans |
whether to apply a transformation to the parameter |
values |
possible values for factor parameters These functions are used with |
A dials parameter to be used when tuning TabNet models.
model <- tabnet(attention_width = tune(), feature_reusage = tune(), momentum = tune(), penalty = tune(), rate_step_size = tune()) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("torch")
Plot tabnet_explain mask importance heatmap
autoplot.tabnet_explain(object, type = c("mask_agg", "steps"), quantile = 1, ...)
object |
A |
type |
a character value. Either |
quantile |
numerical value between 0 and 1. Provides quantile clipping of the mask values |
... |
not used. |
Plot the tabnet_explain object mask importance per variable along the predicted dataset.
type="mask_agg" outputs a single heatmap of the aggregated mask values;
type="steps" provides a plot faceted along the n_steps masks present in the model.
quantile=.995 may be used for strong outlier clipping, in order to better highlight
low values. quantile=1, the default, does not clip any values.
A ggplot object.
## Not run:
library(ggplot2)
data("attrition", package = "modeldata")
## Single-outcome binary classification of `Attrition` in `attrition` dataset
attrition_fit <- tabnet_fit(Attrition ~. , data=attrition, epochs=11)
attrition_explain <- tabnet_explain(attrition_fit, attrition)
# Plot the model aggregated mask interpretation heatmap
autoplot(attrition_explain)
## Multi-outcome regression on `Sale_Price` and `Pool_Area` in `ames` dataset,
data("ames", package = "modeldata")
x <- ames[,-which(names(ames) %in% c("Sale_Price", "Pool_Area"))]
y <- ames[, c("Sale_Price", "Pool_Area")]
ames_fit <- tabnet_fit(x, y, epochs = 1, verbose=TRUE)
ames_explain <- tabnet_explain(ames_fit, x)
autoplot(ames_explain, quantile = 0.99)
## End(Not run)
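The effect of the quantile argument described above can be illustrated without fitting a model. The following is a minimal base R sketch of quantile clipping on a made-up mask matrix; `clip_at_quantile()` is a hypothetical helper written for this illustration and is not part of the package, whose internal clipping may differ in detail.

```r
# Hypothetical mask matrix: rows are observations, columns are variables.
set.seed(42)
mask <- matrix(runif(20), nrow = 5,
               dimnames = list(NULL, c("a", "b", "c", "d")))
mask[1, 1] <- 50  # one strong outlier that would dominate the colour scale

# clip_at_quantile() is a hypothetical helper, not a tabnet function.
clip_at_quantile <- function(m, q = 1) {
  stopifnot(q > 0, q <= 1)
  upper <- stats::quantile(m, probs = q, names = FALSE)
  pmin(m, upper)  # pmin() keeps the dimensions of its first argument
}

clipped   <- clip_at_quantile(mask, q = 0.9)  # outlier is capped
unclipped <- clip_at_quantile(mask, q = 1)    # q = 1 leaves values unchanged

range(mask)
range(clipped)
```

With the outlier capped, the remaining low values occupy a larger share of the colour scale, which is the motivation given for quantile values below 1.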
Plot tabnet_fit model loss along epochs
autoplot.tabnet_fit(object, ...)
autoplot.tabnet_pretrain(object, ...)
object |
A |
... |
not used. |
Plot the training loss along epochs, and the validation loss along epochs if any.
A dot is added on each epoch where a model snapshot is available, which helps
in choosing the from_epoch value when resuming training later.
A ggplot object.
## Not run:
library(ggplot2)
data("attrition", package = "modeldata")
attrition_fit <- tabnet_fit(Attrition ~. , data=attrition, valid_split=0.2, epochs=11)
# Plot the model loss over epochs
autoplot(attrition_fit)
## End(Not run)
Extracts class names from the outcome tibble (factor levels) and builds the ancestor matrix only for classes that actually appear in the data.
build_ancestor_matrix_from_outcomes(x, outcomes, device = "cpu")
x |
A |
outcomes |
A tibble with factor columns (one per hierarchy level),
as returned by |
device |
Torch device ("cpu" or "cuda"). |
A torch_tensor of shape (1, n_classes, n_classes).
Non-tunable parameters for the tabnet model
cat_emb_dim(range = NULL, trans = NULL)
checkpoint_epochs(range = NULL, trans = NULL)
drop_last(range = NULL, trans = NULL)
encoder_activation(range = NULL, trans = NULL)
lr_scheduler(range = NULL, trans = NULL)
mlp_activation(range = NULL, trans = NULL)
mlp_hidden_multiplier(range = NULL, trans = NULL)
num_independent_decoder(range = NULL, trans = NULL)
num_shared_decoder(range = NULL, trans = NULL)
optimizer(range = NULL, trans = NULL)
penalty(range = NULL, trans = NULL)
verbose(range = NULL, trans = NULL)
virtual_batch_size(range = NULL, trans = NULL)
range |
unused |
trans |
unused |
Check that Node object names are compliant
check_compliant_node(node)
node |
the Node object, or a dataframe ready to be parsed by |
node if it is compliant, else an Error with the column names to fix
library(dplyr)
library(data.tree)
data(starwars)
starwars_tree <- starwars %>%
  mutate(pathString = paste("tree", species, homeworld, `name`, sep = "/"))
# pre as.Node() check
try(check_compliant_node(starwars_tree))
# post as.Node() check
check_compliant_node(as.Node(starwars_tree))
With alpha = 1.5 and normalizing sparse transform (a la softmax).
entmax(dim = -1)
entmax15(dim = -1L, k = NULL)
dim |
The dimension along which to apply 1.5-entmax. |
k |
The number of largest elements to partial-sort input over. For optimal
performance, should be slightly bigger than the expected number of
non-zeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule. If |
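Solves the optimization problem:
$$\arg\max_{p} \; \langle x, p \rangle + H_{1.5}(p) \quad \text{s.t.} \quad p \ge 0,\ \sum_i p_i = 1$$
where $H_{1.5}(p)$ is the Tsallis alpha-entropy with $\alpha = 1.5$.
The projection result P of the same shape as input, such that
$\sum_{dim} P = 1$ elementwise.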
## Not run:
input <- torch::torch_randn(10,5, requires_grad = TRUE)
# create a top3 alpha=1.5 entmax on last input dimension
nn_entmax <- entmax15(dim=-1L, k = 3)
result <- nn_entmax(input)
## End(Not run)
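To make the 1.5-entmax projection and its threshold more concrete, here is a base R sketch for a single numeric vector, using the exact sort-based threshold search. `entmax15_vec()` is a hypothetical helper written for this illustration; it is not the package's torch module, it ignores the k partial-sort shortcut, and it works on one vector rather than along a tensor dimension.

```r
# entmax15_vec() is a hypothetical base R helper, not the tabnet module.
entmax15_vec <- function(x) {
  # Shift by the max for numerical stability, then scale by (alpha - 1) = 0.5.
  x <- (x - max(x)) / 2
  x_sorted <- sort(x, decreasing = TRUE)
  k <- seq_along(x_sorted)

  # Running mean and running mean of squares over the k largest entries.
  mean_k    <- cumsum(x_sorted) / k
  mean_sq_k <- cumsum(x_sorted^2) / k
  ss        <- k * (mean_sq_k - mean_k^2)
  delta     <- pmax((1 - ss) / k, 0)

  # Candidate threshold for each possible support size k.
  tau_k <- mean_k - sqrt(delta)

  # The support size is the number of candidates not exceeding the sorted value.
  support <- sum(tau_k <= x_sorted)
  tau <- tau_k[support]

  # Entries below the threshold become exactly zero.
  pmax(x - tau, 0)^2
}

p <- entmax15_vec(c(1, 0.5, -3))
p
sum(p)
```

Unlike softmax, the smallest entry receives exactly zero probability here, while the result still sums to 1.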
Given neural network outputs x and ancestor matrix R, enforces that
if a class is predicted positive, all its ancestors must also be positive.
Implements: final_out[i] = max{x[j] : R[i,j] = 1}
get_constr_output(x, R)
x |
A |
R |
A |
A torch_tensor of shape (batch_size, n_classes) with constrained outputs.
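The rule final_out[i] = max{x[j] : R[i,j] = 1} can be traced by hand on a toy hierarchy. The sketch below uses plain base R matrices instead of torch tensors; the three-class hierarchy, the 2-D shape of R, and the `constrain_output()` helper are all assumptions made for illustration, not the package implementation.

```r
# Hypothetical 3-class hierarchy: class 1 is the parent of classes 2 and 3.
# R[i, j] = 1 when class j is class i itself or one of its descendants.
R <- rbind(
  c(1, 1, 1),
  c(0, 1, 0),
  c(0, 0, 1)
)

# constrain_output() is a hypothetical base R stand-in for the torch version.
constrain_output <- function(x, R) {
  out <- matrix(0, nrow = nrow(x), ncol = ncol(x))
  for (i in seq_len(ncol(x))) {
    members <- which(R[i, ] == 1)
    # Row-wise max over the scores of class i and its descendants.
    out[, i] <- apply(x[, members, drop = FALSE], 1, max)
  }
  out
}

# Two observations with raw scores for the three classes.
x <- rbind(c(0.2, 0.9, 0.1),
           c(0.7, 0.3, 0.4))
constrained <- constrain_output(x, R)
constrained
```

In the first row the parent score is lifted from 0.2 to 0.9 because its child scores 0.9, so a positive child can never coexist with a negative ancestor.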
Optimal threshold (tau) computation for 1.5-entmax
get_tau(input, dim = -1L, k = NULL)
input |
The input tensor to compute thresholds over. |
dim |
The dimension along which to apply 1.5-entmax. Default is -1. |
k |
The number of largest elements to partial-sort over. For optimal
performance, should be slightly bigger than the expected number of
non-zeros in the solution. If the solution is more than k-sparse,
this function is recursively called with a 2*k schedule. If |
The threshold value for each vector, with all but the dim
dimension intact.
Creates a criterion that measures the Area under the Min(FPR, FNR) (AUM) between each
element in the input and the target.
nn_aum_loss()
This is used for measuring the error of a binary reconstruction within a highly unbalanced dataset,
where the goal is to optimize the ROC curve. Note that the targets should be the factor
levels of the binary outcome, i.e. with values 1L and 2L.
loss <- nn_aum_loss()
input <- torch::torch_randn(4, 6, requires_grad = TRUE)
target <- input > 1.5
output <- loss(input, target)
output$backward()
Module wrapper for nnf_mc_loss() with configurable parameters.
Stores the ancestor matrix R and evaluation mask for reuse across batches.
nn_mc_loss(
  R,
  to_eval = NULL,
  criterion = torch::nnf_binary_cross_entropy_with_logits,
  reduction = "mean"
)
R |
Ancestor matrix tensor of shape |
to_eval |
Optional logical tensor of shape |
criterion |
Loss function module or functional to apply after constraint
propagation. Default: |
reduction |
(string, optional): Reduction method: |
Input output: $(N, C)$ where N = batch size, C = number of classes
Input target: $(N, C)$, same shape as output, binary values
Output: scalar by default. If reduction = "none", then $(N, C')$
where C' is the number of evaluated classes
nnf_mc_loss(), build_ancestor_matrix_from_outcomes(), get_constr_output()
## Not run:
# Build ancestor matrix from hierarchy
R <- build_ancestor_matrix_from_outcomes(my_tree, processed$outcomes, device = "cuda")
# Create loss module
loss_fn <- nn_mc_loss(R = R, reduction = "mean")
# Forward pass
output <- model(x) # (batch, n_classes)
loss <- loss_fn(output, labels)
loss$backward()
## End(Not run)
Prune head_size last layers of a tabnet network in order to
use the pruned module as a sequential embedding module.
## S3 method for class 'tabnet_fit'
nn_prune_head(x, head_size)
## S3 method for class 'tabnet_pretrain'
nn_prune_head(x, head_size)
x |
nn_network to prune |
head_size |
number of nn_layers to prune, should be less than 2 |
a tabnet network with the top nn_layer removed
data("ames", package = "modeldata")
x <- ames[,-which(names(ames) == "Sale_Price")]
y <- ames$Sale_Price
# pretrain a tabnet model on ames dataset
ames_pretrain <- tabnet_pretrain(x, y, epochs = 2, checkpoint_epochs = 1)
# prune classification head to get an embedding model
pruned_pretrain <- torch::nn_prune_head(ames_pretrain, 1)
Computes the hierarchy-constrained loss for multi-label classification. Enforces that if a class is predicted positive, all its ancestors must also be positive, using the ancestor matrix R.
nnf_mc_loss(
  output,
  target,
  R,
  to_eval = NULL,
  criterion = nnf_binary_cross_entropy_with_logits
)
output |
A |
target |
Binary target labels, shape |
R |
Ancestor matrix tensor of shape |
to_eval |
Optional logical tensor of shape |
criterion |
Loss function to apply after constraint propagation.
Default: |
The loss combines constrained outputs differently for positive and negative labels:
For positive labels: uses constrained output of label-weighted predictions
For negative labels: uses constrained raw predictions (penalizes ancestor violations)
A scalar torch_tensor containing the computed loss, or a tensor
of shape (batch_size, n_classes) if reduction = "none".
nn_mc_loss(), get_constr_output()
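The way constrained outputs are combined for positive and negative labels can be sketched in base R. This is an illustration under several assumptions, not the package implementation: it works on probabilities in [0, 1] rather than logits, uses a plain mean binary cross-entropy instead of the configurable criterion, ignores to_eval, and takes R as a 2-D matrix. `constrain_output()` and `mc_loss_sketch()` are hypothetical helpers.

```r
# Hypothetical hierarchy: class 1 is the parent of classes 2 and 3.
R <- rbind(c(1, 1, 1),
           c(0, 1, 0),
           c(0, 0, 1))

# Hypothetical helper: per class, max over the class and its descendants.
constrain_output <- function(x, R) {
  t(apply(x, 1, function(row) {
    apply(R, 1, function(r) max(row[r == 1]))
  }))
}

# Hypothetical sketch of the combination described above.
mc_loss_sketch <- function(prob, target, R) {
  constr_raw <- constrain_output(prob, R)           # used where target == 0
  constr_pos <- constrain_output(prob * target, R)  # used where target == 1
  combined <- (1 - target) * constr_raw + target * constr_pos

  # Plain mean binary cross-entropy on the combined probabilities.
  eps <- 1e-7
  combined <- pmin(pmax(combined, eps), 1 - eps)
  -mean(target * log(combined) + (1 - target) * log(1 - combined))
}

prob   <- rbind(c(0.2, 0.9, 0.1))  # one observation, three classes
target <- rbind(c(1,   1,   0))    # parent and first child are positive
loss <- mc_loss_sketch(prob, target, R)
loss
```

For the positive parent label, the score is taken from its positive child (0.9) rather than from its own raw value (0.2), so the loss does not penalize the parent head when a labelled descendant already supports it.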
Transforms a tensor of class indices (one column per hierarchy level) into a binary tensor where each column corresponds to a class.
nnf_multilabel_one_hot(y, outcomes, device = "cpu")
y |
A |
outcomes |
A tibble with factor columns (as from |
device |
Torch device. |
A torch_tensor of shape (batch_size, n_classes) with binary values.
Turn a Node object into predictor and outcome.
node_to_df(x, drop_last_level = TRUE)
x |
Node object |
drop_last_level |
TRUE unused |
a named list of x and y, being respectively the predictor data-frame and the outcomes data-frame,
as expected inputs for hardhat::mold() function.
library(dplyr)
library(data.tree)
data(starwars)
starwars_tree <- starwars %>%
  mutate(pathString = paste("tree", species, homeworld, `name`, sep = "/")) %>%
  as.Node()
node_to_df(starwars_tree)$x %>% head()
node_to_df(starwars_tree)$y %>% head()
Normalizing sparse transform (a la softmax).
sparsemax(dim = -1L)
sparsemax15(dim = -1L, k = NULL)
dim |
The dimension along which to apply sparsemax. |
k |
The number of largest elements to partial-sort input over. For optimal
performance, |
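Solves the projection:
$$\min_{p} \lVert x - p \rVert_2 \quad \text{s.t.} \quad p \ge 0,\ \sum_i p_i = 1$$
The projection result, such that $\sum_{dim} P = 1$ elementwise.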
## Not run:
input <- torch::torch_randn(10, 5, requires_grad = TRUE)
# create a top3 alpha=1.5 sparsemax on last input dimension
nn_sparsemax <- sparsemax15(dim=1, k=3)
result <- nn_sparsemax(input)
print(result)
## End(Not run)
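The sparsemax projection onto the probability simplex has a closed-form solution based on sorting, which the following base R sketch illustrates for a single numeric vector. `sparsemax_vec()` is a hypothetical helper written for this illustration; it does not use torch, has no k partial-sort shortcut, and is not the package's module.

```r
# sparsemax_vec() is a hypothetical base R helper, not the tabnet module.
sparsemax_vec <- function(x) {
  z   <- sort(x, decreasing = TRUE)
  k   <- seq_along(z)
  css <- cumsum(z)

  # Largest k such that the k-th sorted value stays in the support.
  support <- max(which(1 + k * z > css))

  # Threshold that makes the retained entries sum to 1.
  tau <- (css[support] - 1) / support

  # Entries below the threshold are set exactly to zero.
  pmax(x - tau, 0)
}

p <- sparsemax_vec(c(1, 0.5, -1))
p
```

For this input the threshold is 0.25, so the result is 0.75, 0.25 and 0: a valid probability vector in which the weakest entry is dropped entirely.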
Parsnip compatible tabnet model
tabnet(
  mode = "unknown", cat_emb_dim = NULL, decision_width = NULL,
  attention_width = NULL, num_steps = NULL, mask_type = NULL,
  mask_topk = NULL, num_independent = NULL, num_shared = NULL,
  num_independent_decoder = NULL, num_shared_decoder = NULL, penalty = NULL,
  feature_reusage = NULL, momentum = NULL, epochs = NULL,
  batch_size = NULL, virtual_batch_size = NULL, learn_rate = NULL,
  optimizer = NULL, loss = NULL, clip_value = NULL,
  drop_last = NULL, lr_scheduler = NULL, rate_decay = NULL,
  rate_step_size = NULL, checkpoint_epochs = NULL, verbose = NULL,
  importance_sample_size = NULL, early_stopping_monitor = NULL,
  early_stopping_tolerance = NULL, early_stopping_patience = NULL,
  skip_importance = NULL, tabnet_model = NULL, from_epoch = NULL
)
mode |
A single character string for the type of model. Possible values for this model are "unknown", "regression", or "classification". |
cat_emb_dim |
Size of the embedding of categorical features. If int, all categorical features will have same embedding size, if list of int, every corresponding feature will have specific embedding size. |
decision_width |
(int) Width of the decision prediction layer. Bigger values give more capacity to the model, at the risk of overfitting. Values typically range from 8 to 64. |
attention_width |
(int) Width of the attention embedding for each mask. According to the paper n_d = n_a is usually a good choice. (default=8) |
num_steps |
(int) Number of steps in the architecture (usually between 3 and 10) |
mask_type |
(character) Final layer of feature selector in the attentive_transformer
block, either |
mask_topk |
(int) mask sparsity top-k for |
num_independent |
Number of independent Gated Linear Units layers at each step of the encoder. Usual values range from 1 to 5. |
num_shared |
Number of shared Gated Linear Units at each step of the encoder. Usual values range from 1 to 5. |
num_independent_decoder |
For pretraining, number of independent Gated Linear Units layers at each step of the decoder. Usual values range from 1 to 5. |
num_shared_decoder |
For pretraining, number of shared Gated Linear Units at each step of the decoder. Usual values range from 1 to 5. |
penalty |
This is the extra sparsity loss coefficient as proposed in the original paper. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help (default 1e-3). |
feature_reusage |
(num) This is the coefficient for feature reusage in the masks. A value close to 1 will make mask selection least correlated between layers. Values range from 1 to 2. |
momentum |
Momentum for batch normalization, typically ranges from 0.01 to 0.4 (default=0.02) |
epochs |
(int) Number of training epochs. |
batch_size |
(int) Number of examples per batch, large batch sizes are recommended. (default: 1024^2) |
virtual_batch_size |
(int) Size of the mini batches used for "Ghost Batch Normalization" (default=256^2) |
learn_rate |
initial learning rate for the optimizer. |
optimizer |
the optimization method. currently only |
loss |
(character or function) Loss function for training (default to mse for regression and cross entropy for classification) |
clip_value |
If a num is given this will clip the gradient at
clip_value. Pass |
drop_last |
(logical) Whether to drop last batch if not complete during training |
lr_scheduler |
if |
rate_decay |
multiplies the initial learning rate by |
rate_step_size |
the learning rate scheduler step size. Unused if
|
checkpoint_epochs |
checkpoint model weights and architecture every
|
verbose |
(logical) Whether to print progress and loss values during training. |
importance_sample_size |
sample of the dataset to compute importance metrics. If the dataset is larger than 1e5 obs we will use a sample of size 1e5 and display a warning. |
early_stopping_monitor |
Metric to monitor for early_stopping. One of "valid_loss", "train_loss" or "auto" (defaults to "auto"). |
early_stopping_tolerance |
Minimum relative improvement to reset the patience counter. 0.01 for 1% tolerance (default 0) |
early_stopping_patience |
Number of epochs without improving until stopping training. (default=5) |
skip_importance |
if feature importance calculation should be skipped (default: |
tabnet_model |
A previously fitted |
from_epoch |
When a |
A TabNet parsnip instance. It can be used to fit tabnet models using
parsnip machinery.
TabNet uses torch as its backend for computation and torch uses all
available threads by default.
You can control the number of threads used by torch with:
torch::torch_set_num_threads(1)
torch::torch_set_num_interop_threads(1)
tabnet_fit
library(parsnip)
data("ames", package = "modeldata")
model <- tabnet() %>%
  set_mode("regression") %>%
  set_engine("torch")
model %>%
  fit(Sale_Price ~ ., data = ames)
Configuration for TabNet models
tabnet_config(
  batch_size = 1024^2, penalty = 0.001, clip_value = NULL,
  loss = "auto", epochs = 5, drop_last = FALSE,
  decision_width = NULL, attention_width = NULL, num_steps = 3,
  feature_reusage = 1.3, mask_type = "sparsemax", mask_topk = NULL,
  virtual_batch_size = 256^2, valid_split = 0, learn_rate = 0.02,
  optimizer = "adam", lr_scheduler = NULL, lr_decay = 0.1,
  step_size = 30, checkpoint_epochs = 10, cat_emb_dim = 1,
  num_independent = 2, num_shared = 2, num_independent_decoder = 1,
  num_shared_decoder = 1, momentum = 0.02, pretraining_ratio = 0.5,
  verbose = FALSE, device = "auto", importance_sample_size = NULL,
  early_stopping_monitor = "auto", early_stopping_tolerance = 0,
  early_stopping_patience = 0L, num_workers = 0L, skip_importance = FALSE
)
batch_size |
(int) Number of examples per batch, large batch sizes are recommended. (default: 1024^2) |
penalty |
This is the extra sparsity loss coefficient as proposed in the original paper. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help (default 1e-3). |
clip_value |
If a num is given this will clip the gradient at
clip_value. Pass |
loss |
(character or function) Loss function for training (default to mse for regression and cross entropy for classification) |
epochs |
(int) Number of training epochs. |
drop_last |
(logical) Whether to drop last batch if not complete during training |
decision_width |
(int) Width of the decision prediction layer. Bigger values give more capacity to the model, at the risk of overfitting. Values typically range from 8 to 64. |
attention_width |
(int) Width of the attention embedding for each mask. According to the paper n_d = n_a is usually a good choice. (default=8) |
num_steps |
(int) Number of steps in the architecture (usually between 3 and 10) |
feature_reusage |
(num) This is the coefficient for feature reusage in the masks. A value close to 1 will make mask selection least correlated between layers. Values range from 1 to 2. |
mask_type |
(character) Final layer of feature selector in the attentive_transformer
block, either |
mask_topk |
(int) mask sparsity top-k for |
virtual_batch_size |
(int) Size of the mini batches used for "Ghost Batch Normalization" (default=256^2) |
valid_split |
In [0, 1). The fraction of the dataset used for validation. (default = 0 means no split) |
learn_rate |
initial learning rate for the optimizer. |
optimizer |
the optimization method. currently only |
lr_scheduler |
if |
lr_decay |
multiplies the initial learning rate by |
step_size |
the learning rate scheduler step size. Unused if
|
checkpoint_epochs |
checkpoint model weights and architecture every
|
cat_emb_dim |
Size of the embedding of categorical features. If int, all categorical features will have same embedding size, if list of int, every corresponding feature will have specific embedding size. |
num_independent |
Number of independent Gated Linear Units layers at each step of the encoder. Usual values range from 1 to 5. |
num_shared |
Number of shared Gated Linear Units at each step of the encoder. Usual values range from 1 to 5. |
num_independent_decoder |
For pretraining, number of independent Gated Linear Units layers at each step of the decoder. Usual values range from 1 to 5. |
num_shared_decoder |
For pretraining, number of shared Gated Linear Units at each step of the decoder. Usual values range from 1 to 5. |
momentum |
Momentum for batch normalization, typically ranges from 0.01 to 0.4 (default=0.02) |
pretraining_ratio |
Ratio of features to mask for reconstruction during pretraining. Ranges from 0 to 1 (default=0.5) |
verbose |
(logical) Whether to print progress and loss values during training. |
device |
the device to use for training. "cpu" or "cuda". The default ("auto") uses "cuda" if it's available, otherwise uses "cpu". |
importance_sample_size |
sample of the dataset to compute importance metrics. If the dataset is larger than 1e5 obs we will use a sample of size 1e5 and display a warning. |
early_stopping_monitor |
Metric to monitor for early_stopping. One of "valid_loss", "train_loss" or "auto" (defaults to "auto"). |
early_stopping_tolerance |
Minimum relative improvement to reset the patience counter. 0.01 for 1% tolerance (default 0) |
early_stopping_patience |
Number of epochs without improving until stopping training. (default=0L, per the usage above) |
num_workers |
(int, optional): how many subprocesses to use for data
loading. 0 means that the data will be loaded in the main process.
(default: |
skip_importance |
if feature importance calculation should be skipped (default: |
A named list with all hyperparameters of the TabNet implementation.
data("ames", package = "modeldata")
# change the model config for a faster ignite optimizer
config <- tabnet_config(optimizer = torch::optim_ignite_adamw)
## Single-outcome regression using formula specification
fit <- tabnet_fit(Sale_Price ~ ., data = ames, epochs = 1, config = config)
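The interplay between early_stopping_tolerance and early_stopping_patience can be simulated on a plain vector of losses. The sketch below encodes one plausible reading of the argument descriptions above: the patience counter resets whenever the relative improvement over the best loss exceeds the tolerance, and training stops once the counter reaches the patience. `stop_epoch()` is a hypothetical helper and the package's exact rule may differ.

```r
# stop_epoch() is a hypothetical helper; the package's own rule may differ.
stop_epoch <- function(losses, tolerance = 0, patience = 5L) {
  best <- losses[1]
  counter <- 0L
  for (epoch in seq_along(losses)[-1]) {
    # Relative improvement of this epoch over the best loss seen so far.
    improvement <- (best - losses[epoch]) / abs(best)
    if (improvement > tolerance) {
      best <- losses[epoch]
      counter <- 0L           # enough improvement: reset the patience counter
    } else {
      counter <- counter + 1L # not enough improvement: use up patience
    }
    if (counter >= patience) return(epoch)
  }
  NA_integer_                 # training ran to the end without stopping early
}

# Made-up monitored loss per epoch.
losses <- c(1.00, 0.80, 0.79, 0.795, 0.79, 0.60)

strict  <- stop_epoch(losses, tolerance = 0.05, patience = 2L)
lenient <- stop_epoch(losses, tolerance = 0,    patience = 2L)
c(strict = strict, lenient = lenient)
```

With a 5% tolerance the small gains at epochs 3 and 4 do not count as improvement, so training stops at epoch 4; with zero tolerance the gain at epoch 3 resets the counter and the stop moves to epoch 5.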
Interpretation metrics from a TabNet model
tabnet_explain(object, new_data)
## Default S3 method:
tabnet_explain(object, new_data)
## S3 method for class 'tabnet_fit'
tabnet_explain(object, new_data)
## S3 method for class 'tabnet_pretrain'
tabnet_explain(object, new_data)
## S3 method for class 'model_fit'
tabnet_explain(object, new_data)
object |
a TabNet fit object |
new_data |
a data.frame to obtain interpretation metrics. |
Returns a list with:
M_explain: the aggregated feature importance masks as detailed in TabNet's paper.
masks: a list containing the masks for each step.
set.seed(2021)
n <- 256
x <- data.frame(
  x = rnorm(n),
  y = rnorm(n),
  z = rnorm(n)
)
y <- x$x
fit <- tabnet_fit(x, y, epochs = 10,
                  num_steps = 1,
                  batch_size = 512,
                  attention_width = 1,
                  num_shared = 1,
                  num_independent = 1)
ex <- tabnet_explain(fit, x)
Fits the TabNet: Attentive Interpretable Tabular Learning model
tabnet_fit(x, ...)
## Default S3 method:
tabnet_fit(x, ...)
## S3 method for class 'data.frame'
tabnet_fit(
  x,
  y,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL,
  weights = NULL
)
## S3 method for class 'formula'
tabnet_fit(
  formula,
  data,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL,
  weights = NULL
)
## S3 method for class 'recipe'
tabnet_fit(
  x,
  data,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL,
  weights = NULL
)
## S3 method for class 'Node'
tabnet_fit(
  x,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL
)
x |
Depending on the context:
The predictor data should be standardized (e.g. centered or scaled). The model handles categorical predictors and missing values internally, so neither requires any preprocessing. |
... |
Model hyperparameters.
Any hyperparameters set here will update those set by the config argument.
See |
y |
When
|
tabnet_model |
A previously fitted |
config |
A set of hyperparameters created using the |
from_epoch |
When a |
weights |
Unused. Placeholder for hardhat::importance_weight() variables. |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a recipe or formula is used,
|
A TabNet model object. It can be used for serialization, predictions, or further fitting.
When providing a parent tabnet_model parameter, the model fitting resumes from that model's weights
at the following epoch:
the last fitted epoch for a model already in torch context
the last model checkpoint epoch for a model loaded from file
the epoch related to a checkpoint matching or preceding the from_epoch value if provided
The model fitting metrics are appended to the parent metrics in the returned TabNet model.
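The third rule, picking the checkpoint that matches or precedes from_epoch, amounts to a simple lookup. The base R sketch below illustrates it with made-up checkpoint epochs; `resume_epoch()` is a hypothetical helper and the error raised when no checkpoint qualifies is an assumption, not documented package behavior.

```r
# resume_epoch() is a hypothetical helper illustrating the from_epoch rule.
resume_epoch <- function(checkpoint_epochs, from_epoch) {
  eligible <- checkpoint_epochs[checkpoint_epochs <= from_epoch]
  if (length(eligible) == 0) {
    stop("no checkpoint at or before from_epoch")  # assumed behavior
  }
  max(eligible)  # latest checkpoint not after from_epoch
}

# Made-up example: a checkpoint was saved every 10 epochs over 50 epochs.
checkpoints <- seq(10, 50, by = 10)

exact    <- resume_epoch(checkpoints, from_epoch = 30)  # matching checkpoint
previous <- resume_epoch(checkpoints, from_epoch = 37)  # nearest preceding one
c(exact = exact, previous = previous)
```

Both calls resolve to epoch 30, which shows why the dots on the loss plot of a fitted model are useful when choosing from_epoch.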
TabNet allows multi-outcome prediction, usually called multi-label classification when the outcomes are categorical, or multi-output regression when the outcomes are numerical. Multi-outcome prediction currently expects the outcomes to be either all numeric or all categorical.
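The constraint on multi-outcome data can be checked before fitting. The following base-R sketch is an assumption-laden illustration, not part of the package: `outcome_kind()` is a hypothetical helper, and it treats only factor columns as categorical.

```r
# Hypothetical helper: classify a set of outcome columns, and fail
# when numeric and categorical outcomes are mixed.
outcome_kind <- function(data, outcomes) {
  cols <- data[outcomes]
  if (all(vapply(cols, is.numeric, logical(1)))) {
    return("regression")
  }
  if (all(vapply(cols, is.factor, logical(1)))) {
    return("classification")
  }
  stop("outcomes must be either all numeric or all categorical")
}

toy <- data.frame(
  price = c(10.5, 12.0, 9.8),
  area  = c(50, 62, 47),
  grade = factor(c("a", "b", "a")),
  left  = factor(c("yes", "no", "no"))
)

kind_numeric <- outcome_kind(toy, c("price", "area"))  # multi-output regression
kind_factor  <- outcome_kind(toy, c("grade", "left"))  # multi-label classification
```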
TabNet uses torch as its backend for computation and torch uses all
available threads by default.
You can control the number of threads used by torch with:
torch::torch_set_num_threads(1)
torch::torch_set_num_interop_threads(1)
## Not run: 
data("ames", package = "modeldata")
data("attrition", package = "modeldata")

## Single-outcome regression using formula specification
fit <- tabnet_fit(Sale_Price ~ ., data = ames, epochs = 4)

## Single-outcome classification using data-frame specification
attrition_x <- attrition[, -which(names(attrition) == "Attrition")]
fit <- tabnet_fit(attrition_x, attrition$Attrition, epochs = 4, verbose = TRUE)

## Multi-outcome regression on `Sale_Price` and `Pool_Area` in `ames` dataset using formula
ames_fit <- tabnet_fit(Sale_Price + Pool_Area ~ ., data = ames, epochs = 4, valid_split = 0.2)

## Multi-label classification on `Attrition` and `JobSatisfaction` in
## `attrition` dataset using recipe
library(recipes)
rec <- recipe(Attrition + JobSatisfaction ~ ., data = attrition) %>%
  step_normalize(all_numeric(), -all_outcomes())
attrition_fit <- tabnet_fit(rec, data = attrition, epochs = 4, valid_split = 0.2)

## Hierarchical classification on `acme`
data(acme, package = "data.tree")
acme_fit <- tabnet_fit(acme, epochs = 4, verbose = TRUE)

# Note: The number of epochs should be increased for publication-level results.
## End(Not run)
This is an nn_module representing the TabNet architecture from
TabNet: Attentive Interpretable Tabular Learning.
tabnet_nn(
  input_dim,
  output_dim,
  n_d = 8,
  n_a = 8,
  n_steps = 3,
  gamma = 1.3,
  cat_idxs = c(),
  cat_dims = c(),
  cat_emb_dim = 1,
  n_independent = 2,
  n_shared = 2,
  epsilon = 1e-15,
  virtual_batch_size = 128,
  momentum = 0.02,
  mask_type = "sparsemax",
  mask_topk = NULL
)
input_dim |
Initial number of features. |
output_dim |
Dimension of the network output, for example 1 for regression or 2 for binary classification. A vector of such dimensions in the multi-output case. |
n_d |
Dimension of the prediction layer (usually between 4 and 64). |
n_a |
Dimension of the attention layer (usually between 4 and 64). |
n_steps |
Number of successive steps in the network (usually between 3 and 10). |
gamma |
Scaling factor for attention updates (usually between 1 and 2). |
cat_idxs |
Index of each categorical column in the dataset. |
cat_dims |
Number of categories in each categorical column. |
cat_emb_dim |
Size of the embedding of categorical features. If a single integer, all categorical features share the same embedding size. If a list of integers, each corresponding feature gets its own embedding size. |
n_independent |
Number of independent GLU layers in each GLU block of the encoder. |
n_shared |
Number of shared GLU layers in each GLU block of the encoder. |
epsilon |
Small constant used to avoid log(0); it should be kept very low. |
virtual_batch_size |
Batch size for Ghost Batch Normalization. |
momentum |
Numerical value between 0 and 1 used as the momentum in all batch normalization layers. |
mask_type |
Either "sparsemax", "entmax" or "entmax15": the sparse masking function to use. |
mask_topk |
the mask top-k value for k-sparsity selection in the mask for |
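The dimension arguments above can be derived from a data frame of predictors and an outcome. The sketch below is a base-R illustration under stated assumptions: the variable names are made up for the example, column indices are taken to be 1-based positions among the predictors, and how the package computes these values internally is not shown here.

```r
# Toy predictors: three numeric columns and one two-level factor.
predictors <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
predictors$Size <- factor(ifelse(iris$Petal.Width > 1, "large", "small"))
outcome <- iris$Species

is_cat <- vapply(predictors, is.factor, logical(1))

input_dim <- ncol(predictors)          # initial number of features
cat_idxs  <- unname(which(is_cat))     # position of each categorical column
cat_dims  <- unname(vapply(predictors[is_cat], nlevels, integer(1)))  # categories per column

# One output for a numeric outcome, one per class for a factor outcome.
output_dim <- if (is.factor(outcome)) nlevels(outcome) else 1L
```

For a multi-output model, `output_dim` would be a vector with one such entry per outcome.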
Pretrain the TabNet: Attentive Interpretable Tabular Learning model on the predictor data exclusively (unsupervised training).
tabnet_pretrain(x, ...)

## Default S3 method:
tabnet_pretrain(x, ...)

## S3 method for class 'data.frame'
tabnet_pretrain(
  x,
  y = NULL,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL
)

## S3 method for class 'formula'
tabnet_pretrain(
  formula,
  data,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL
)

## S3 method for class 'recipe'
tabnet_pretrain(
  x,
  data,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL
)

## S3 method for class 'Node'
tabnet_pretrain(
  x,
  tabnet_model = NULL,
  config = tabnet_config(),
  ...,
  from_epoch = NULL
)
x |
Depending on the context:
The predictor data should be standardized (e.g. centered or scaled). The model handles categorical predictors internally, so they need no preprocessing. The model also handles missing values internally, so they need no preprocessing either. |
... |
Model hyperparameters.
Any hyperparameters set here will update those set by the config argument.
See |
y |
(optional) When |
tabnet_model |
A pretrained |
config |
A set of hyperparameters created using the |
from_epoch |
When a |
formula |
A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data |
When a recipe or formula is used,
|
A TabNet model object. It can be used for serialization, predictions, or further fitting.
Outcome values are accepted here only for syntactic consistency with tabnet_fit;
by design, the outcome, if present, is ignored during pretraining.
When a parent tabnet_model is provided, model pretraining resumes from that model's weights
at one of the following epochs:
the last pretrained epoch, for a model already in the torch context
the last model checkpoint epoch, for a model loaded from file
the epoch of the checkpoint matching or preceding the from_epoch value, if provided
The metrics of the new pretraining run are appended to the parent model's metrics in the returned TabNet model.
TabNet uses torch as its backend for computation and torch uses all
available threads by default.
You can control the number of threads used by torch with:
torch::torch_set_num_threads(1)
torch::torch_set_num_interop_threads(1)
data("ames", package = "modeldata")
pretrained <- tabnet_pretrain(Sale_Price ~ ., data = ames, epochs = 1)
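A pretrained model is typically used as the starting point of a supervised fit. The wrapper below is a minimal sketch of that workflow, assuming that `tabnet_fit()` accepts a pretrained model through its `tabnet_model` argument; `pretrain_then_fit()` and its epoch arguments are hypothetical names introduced for illustration. Defining the function does not require torch; calling it does.

```r
# Hypothetical wrapper: unsupervised pretraining followed by supervised fitting.
pretrain_then_fit <- function(formula, data, pretrain_epochs = 1, fit_epochs = 1) {
  # Step 1: pretrain on the predictors only (the outcome is ignored).
  pretrained <- tabnet::tabnet_pretrain(formula, data = data, epochs = pretrain_epochs)
  # Step 2: continue with supervised fitting from the pretrained weights
  # (assumes tabnet_model accepts a pretrained model).
  tabnet::tabnet_fit(formula, data = data, tabnet_model = pretrained, epochs = fit_epochs)
}

# Usage, with the `ames` data loaded as in the example above:
# model <- pretrain_then_fit(Sale_Price ~ ., data = ames)
```

As with the other examples, the number of epochs here is kept small for speed and should be increased for real use.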