Below is a minimal (yet complete) example of a machine learning pipeline that uses R's tidymodels framework and the Palmer Penguins dataset.
Note that the goal here isn't to fit the best possible model or to showcase every feature of the framework; rather, it's to walk through a basic tidymodels workflow.
library(tidymodels)
library(tidyverse)

set.seed(0408)

# there's a package for this, but let's just grab the csv
penguins <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv")

# drop rows missing body mass
penguins_complete <- penguins |>
  filter(!is.na(body_mass_g))

# split the complete data into training and validation
penguins_split <- initial_split(penguins_complete, prop = .8)
trn <- training(penguins_split)
val <- testing(penguins_split)

# define a recipe for preprocessing
penguins_rec <- recipe(body_mass_g ~ ., data = trn) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# define a model specification
lm_spec <- linear_reg() |>
  set_engine("lm")

# define a workflow with our preprocessor and our model
wf <- workflow(penguins_rec, lm_spec)

# fit the workflow
wf_fit <- wf |>
  fit(data = trn)

# predict the validation data
y_hat <- unlist(predict(wf_fit, new_data = val))

# estimate performance
eval_tbl <- tibble(
  truth = val$body_mass_g,
  estimate = y_hat
)

rmse(eval_tbl, truth, estimate)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        315.
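
As a follow-up, here's a small sketch of an alternative way to handle the last step. It isn't part of the original example and assumes the wf_fit and val objects created above: augment() attaches a .pred column of predictions to the new data, and metric_set() bundles several yardstick metrics into a single function.

# a sketch, not part of the original example: assumes wf_fit and val from above
# attach predictions to the validation data as a .pred column
val_preds <- augment(wf_fit, new_data = val)

# bundle a few regression metrics into one function
penguin_metrics <- metric_set(rmse, rsq, mae)

# compute rmse, r-squared, and mae against the observed body mass
penguin_metrics(val_preds, truth = body_mass_g, estimate = .pred)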