SET AND GET HYPERPARAMETERS IN SCIKIT-LEARN

The process of learning a predictive model is driven by a set of internal
parameters and a set of training data. These internal parameters are called
hyperparameters and are specific to each family of models. In addition, the
optimal set of hyperparameters is specific to each dataset, and thus they need
to be optimized.

Note

In this notebook we will use the words β€œhyperparameters” and β€œparameters”
interchangeably.

This notebook shows how one can get and set the value of a hyperparameter in a
scikit-learn estimator. Recall that hyperparameters are the parameters that
control the learning process.

They should not be confused with the fitted parameters that result from
training. Fitted parameters are recognizable in scikit-learn because their
names end with a trailing underscore _, for instance model.coef_.
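
To make the distinction concrete, here is a minimal sketch on a toy dataset
(the four-point dataset and the name toy_model are illustrative, not part of
this lesson): C is a hyperparameter chosen before fitting, while coef_ only
exists after fitting.

from sklearn.linear_model import LogisticRegression

# Toy data: a single feature and a binary target.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

toy_model = LogisticRegression(C=1.0)  # hyperparameter, set by the user
toy_model.fit(X, y)
print(toy_model.C)      # hyperparameter, unchanged by fit
print(toy_model.coef_)  # fitted parameter, learned from the data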

We will start by loading the adult census dataset and only use the numerical
features.

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

target = adult_census[target_name]
data = adult_census[numerical_columns]



Our data is only numerical.

data.head()



   age  capital-gain  capital-loss  hours-per-week
0   25             0             0              40
1   38             0             0              50
2   28             0             0              40
3   44          7688             0              40
4   18             0             0              30

Let’s create a simple predictive model made of a scaler followed by a logistic
regression classifier.

As mentioned in previous notebooks, many models, including linear ones, work
better if all features have a similar scaling. For this purpose, we use a
StandardScaler, which rescales each feature to have zero mean and unit
standard deviation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)



We can evaluate the generalization performance of the model via
cross-validation.

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(
    "Accuracy score via cross-validation:\n"
    f"{scores.mean():.3f} Β± {scores.std():.3f}"
)



Accuracy score via cross-validation:
0.800 Β± 0.003



We created a model with the default C value, which is equal to 1. If we wanted
to use a different C value, we could have set it when creating the
LogisticRegression object, with something like LogisticRegression(C=1e-3).
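
For instance, a sketch of the same pipeline built with a non-default C from
the start (model_small_C is just an illustrative name):

model_small_C = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression(C=1e-3)),
    ]
)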

Note

For more information on the model hyperparameter C, refer to the documentation.
Be aware that we will focus on linear models in an upcoming module.

We can also change the parameters of a model after it has been created with the
set_params method, which is available for all scikit-learn estimators. For
example, we can set C=1e-3, then fit and evaluate the model:

model.set_params(classifier__C=1e-3)
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(
    "Accuracy score via cross-validation:\n"
    f"{scores.mean():.3f} Β± {scores.std():.3f}"
)



Accuracy score via cross-validation:
0.787 Β± 0.002



When the model of interest is a Pipeline, the parameter names are of the form
<step_name>__<parameter_name> (note the double underscore in the middle). In
our case, classifier is the step name from the Pipeline definition and C is the
parameter name of LogisticRegression.
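
A minimal sketch of this convention, building the full parameter name from the
step name (the variable names below are illustrative):

step_name = "classifier"  # the step name from Pipeline(steps=[...])
inner_param = "C"  # a parameter of LogisticRegression
full_name = f"{step_name}__{inner_param}"
print(full_name)  # classifier__C
# Equivalent to model.set_params(classifier__C=1e-3):
model.set_params(**{full_name: 1e-3})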

In general, you can use the get_params method on scikit-learn models to list all
the parameters with their values. For example, if you want to get all the
parameter names, you can use:

for parameter in model.get_params():
    print(parameter)



memory
steps
verbose
preprocessor
classifier
preprocessor__copy
preprocessor__with_mean
preprocessor__with_std
classifier__C
classifier__class_weight
classifier__dual
classifier__fit_intercept
classifier__intercept_scaling
classifier__l1_ratio
classifier__max_iter
classifier__multi_class
classifier__n_jobs
classifier__penalty
classifier__random_state
classifier__solver
classifier__tol
classifier__verbose
classifier__warm_start



.get_params() returns a dict whose keys are the parameter names and whose values
are the parameter values. If you want to get the value of a single parameter,
for example classifier__C, you can use:

model.get_params()["classifier__C"]



0.001



We can systematically vary the value of C to see if there is an optimal value.

for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data, target)
    scores = cv_results["test_score"]
    print(
        f"Accuracy score via cross-validation with C={C}:\n"
        f"{scores.mean():.3f} Β± {scores.std():.3f}"
    )


Accuracy score via cross-validation with C=0.001:
0.787 ± 0.002
Accuracy score via cross-validation with C=0.01:
0.799 ± 0.003
Accuracy score via cross-validation with C=0.1:
0.800 ± 0.003
Accuracy score via cross-validation with C=1:
0.800 ± 0.003
Accuracy score via cross-validation with C=10:
0.800 ± 0.003

We can see that as long as C is high enough, the model seems to perform well.

What we did here is entirely manual: it involves scanning the values for C and
picking the best one by hand. In the next lesson, we will see how to do this
automatically.
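
Even the manual scan can be made a bit more systematic. Here is a minimal
sketch (reusing model, data and target from above; scores_per_C and best_C are
illustrative names) that records each mean score and picks the best C
programmatically:

scores_per_C = {}
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data, target)
    scores_per_C[C] = cv_results["test_score"].mean()

# Pick the C with the highest mean cross-validated accuracy.
best_C = max(scores_per_C, key=scores_per_C.get)
print(f"Best C: {best_C} (mean accuracy {scores_per_C[best_C]:.3f})")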

Warning

When we evaluate a family of models on test data and pick the best performer,
we cannot trust the corresponding prediction accuracy: the test data was used
to select the model, and it is thus no longer independent from this model. To
get an unbiased estimate, we need to apply the selected model to new data.
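
A minimal sketch of this principle (a common pattern, not part of this
lesson's code; random_state=42 is arbitrary): keep a held-out test set that
plays no role in choosing C, then evaluate the selected model once on it.

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

# Select C by cross-validation on the training part only...
scores_per_C = {}
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data_train, target_train)
    scores_per_C[C] = cv_results["test_score"].mean()
best_C = max(scores_per_C, key=scores_per_C.get)

# ...then evaluate the selected model once on data never used for selection.
model.set_params(classifier__C=best_C)
model.fit(data_train, target_train)
print(
    f"Held-out accuracy with C={best_C}: "
    f"{model.score(data_test, target_test):.3f}"
)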

In this notebook we have seen:

 * how to use get_params and set_params to respectively get and set the
   hyperparameters of a model.

By scikit-learn developers. © Copyright 2022.

Brought to you under a CC-BY License by Inria Learning Lab, scikit-learn @ La
Fondation Inria, Inria Academy, with many thanks to the scikit-learn community
as a whole!