


FEATURE SELECTION IN PYTHON WITH SCIKIT-LEARN

By Jason Brownlee on June 4, 2020 in Python Machine Learning

Not all data attributes are created equal. More is not always better when it
comes to attributes or columns in your dataset.

In this post you will discover how to select attributes in your data before
creating a machine learning model using the scikit-learn library.

Kick-start your project with my new book Machine Learning Mastery With Python,
including step-by-step tutorials and the Python source code files for all
examples.

Let’s get started.

Update: For a more recent tutorial on feature selection in Python see the post:

 * Feature Selection For Machine Learning in Python

Cut Down on Your Options with Feature Selection
Photo by Josh Friedman, some rights reserved


SELECT FEATURES

Feature selection is a process where you automatically select those features in
your data that contribute most to the prediction variable or output in which you
are interested.

Having too many irrelevant features in your data can decrease the accuracy of
the models. Three benefits of performing feature selection before modeling your
data are:

 * Reduces Overfitting: Less redundant data means less opportunity to make
   decisions based on noise.
 * Improves Accuracy: Less misleading data means modeling accuracy improves.
 * Reduces Training Time: Less data means that algorithms train faster.

Two different feature selection methods provided by the scikit-learn Python
library are Recursive Feature Elimination and feature importance ranking.




RECURSIVE FEATURE ELIMINATION

The Recursive Feature Elimination (RFE) method is a feature selection approach.
It works by recursively removing attributes and building a model on those
attributes that remain. It uses the model accuracy to identify which attributes
(and combination of attributes) contribute the most to predicting the target
attribute.

This recipe shows the use of RFE on the iris flowers dataset to select 3
attributes.

# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load the iris dataset
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
# (max_iter raised so the solver converges on the iris data)
model = LogisticRegression(max_iter=200)
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

For a more extensive tutorial on RFE for classification and regression, see the
tutorial:

 * Recursive Feature Elimination (RFE) for Feature Selection in Python




FEATURE IMPORTANCE

Methods that use ensembles of decision trees (like Random Forest or Extra Trees)
can also compute the relative importance of each attribute. These importance
values can be used to inform a feature selection process.

This recipe shows the construction of an Extra Trees ensemble on the iris
flowers dataset and the display of the relative feature importances.

# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
# load the iris dataset
dataset = datasets.load_iris()
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(dataset.data, dataset.target)
# display the relative importance of each attribute
print(model.feature_importances_)
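
To act on these importances, here is a minimal sketch (not part of the
original recipe) using SelectFromModel, which by default keeps features
whose importance is above the mean importance:

# filter the dataset down to the more important features
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(model, prefit=True)
reduced = selector.transform(dataset.data)
print(reduced.shape)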

For a more extensive tutorial on feature importance with a range of algorithms,
see the tutorial:

 * How to Calculate Feature Importance With Python




SUMMARY

Feature selection methods can give you useful information on the relative
importance or relevance of features for a given problem. You can use this
information to create filtered versions of your dataset and increase the
accuracy of your models.
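
For example, a minimal sketch of creating such a filtered dataset, assuming
the fitted rfe object from the RFE recipe above:

# keep only the columns selected by RFE
filtered = rfe.transform(dataset.data)
print(filtered.shape)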

In this post you discovered two feature selection methods you can apply in
Python using the scikit-learn library.






ABOUT JASON BROWNLEE

Jason Brownlee, PhD is a machine learning specialist who teaches developers how
to get results with modern machine learning methods via hands-on tutorials.
View all posts by Jason Brownlee →





115 RESPONSES TO FEATURE SELECTION IN PYTHON WITH SCIKIT-LEARN

 1.  Harsh October 9, 2014 at 4:51 pm #
     
     Nice post. How are RFE and feature selection methods like chi2 different?
     I mean, they ultimately achieve the same goal, right?
     
     Reply
     * jasonb October 10, 2014 at 6:52 am #
       
       Both seek to reduce the number of features, but they do so using
       different methods. chi squared is a univariate statistical measure that
       can be used to rank features, whereas RFE tests different subsets of
       features.
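
        For illustration, a minimal sketch of the difference on the iris data
        (not from the original discussion; scores and subsets vary by problem):

        from sklearn import datasets
        from sklearn.feature_selection import SelectKBest, chi2, RFE
        from sklearn.linear_model import LogisticRegression

        dataset = datasets.load_iris()

        # univariate: chi-squared scores each feature in isolation
        kbest = SelectKBest(score_func=chi2, k=3).fit(dataset.data, dataset.target)
        print(kbest.scores_)

        # wrapper: RFE searches over subsets of features using a model
        rfe = RFE(LogisticRegression(max_iter=200), n_features_to_select=3)
        rfe.fit(dataset.data, dataset.target)
        print(rfe.support_)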
       
       Reply
       * Enny November 29, 2018 at 8:04 am #
         
          Are there any benchmarks, for example, p-value, F-score, or R-squared,
          that can be used to score the importance of features?
         
         Reply
         * Jason Brownlee November 29, 2018 at 2:33 pm #
           
           No, the scores are relative and specific to a given problem.
           
           Reply
     * mitillo September 2, 2017 at 7:27 pm #
       
       Hello,
       
        I read and view a lot about machine learning, but you are amazing.
        You are able to explain everything in a simple way and write code that
        everyone can understand and ‘play’ with, and you give good resources
        for anyone who wants to go deeper into the topic.

        You are a good teacher.
       
       Thank you for your work
       
       Reply
       * Jason Brownlee September 3, 2017 at 5:41 am #
         
         Thanks mitillo.
         
         Reply
 2.  Bozhidar June 26, 2015 at 11:04 pm #
     
     Hello,
     
     Can you tell me which feature selection methods you suggest for time-series
     data?
     
     Reply
     * Alex January 19, 2017 at 8:55 am #
       
        Please see tsfresh – it’s a new approach for feature selection designed
        for time series.
       
       Reply
 3.  Max January 30, 2016 at 7:22 pm #
     
     Great site Jason!
     
     Reply
 4.  Alan February 24, 2016 at 9:48 am #
     
     Thanks for that good post. Just wondering whether RFE is also usable for
     linear regression? How is the model accuracy measured?
     
     Reply
 5.  Carmen January 4, 2017 at 1:31 am #
     
     Jason, quick question that may help someone else stumbling across this
     post.
     
     The example above does RFE using an untuned model. When would/wouldn't it
     make sense to find some optimised hyperparameters of the model using grid
     search *first*, and THEN do RFE? In your experience, is this a good
     idea/helpful thing to do? If not, then why?
     
     Reply
     * Jason Brownlee January 4, 2017 at 8:58 am #
       
       Hi Carmen, nice catch.
       
       Short answer: we are interested in relative difference of feature
       subsets, not absolute best performance.
       
        Generally, it is a good idea to use a robust method for feature selection –
        that is, a method that performs well on most problems with little or no
        tuning. This provides a baseline, and a wrapper method like RFE can focus
        on the relative difference in the feature subsets rather than on the
        optimized best performance of each subset.

        There are those cases where your general method (say a random forest)
        falls down. In those cases, you may want to try RFE with a suite of 3-5
        different wrapped methods and see what falls out. I expect that this is
        overkill on most problems.
       
       Does that help?
       
       Reply
 6.  Carmen January 6, 2017 at 7:58 pm #
     
     Thanks that helps. The only reason I’d mentioned tuning a model first
     (light tuning) is that as you mentioned in your “spot checking” post, you
     want to give algorithms a chance to put their best step forward. If that
     applies there, I don’t see why it shouldn’t apply to RFE.
     
     So I figured light tuning (only on the most common hyperparameter with the
     most common grid values) may help here. But I see your point. Once I’ve got
     my code all sorted out I may try both and report back 🙂
     
     Reply
     * Jason Brownlee January 7, 2017 at 8:30 am #
       
       You’re absolutely right Carmen.
       
       There is a cost/benefit here and ultimately it will come down to
       experience and the “taste” of the practitioner.
       
       In fact, much of industrial machine learning comes down to taste 🙂
       Most top methods perform just as well say at the 90-95% effort-result
        level. The really hard work is trying to get above that; Kaggle comps are
        a good case in point.
       
       Reply
 7.  akram June 13, 2017 at 3:38 am #
     
     thanks so much for your post Jason
     
     I'm a beginner in scikit-learn and I've a little problem when using the
     feature selection module VarianceThreshold. The problem is when I set the
     variance Var[X] = .8*(1-.8),

     it is supposed to remove all features (that have the same value in all
     samples) which have the probability p > 0.8.
     In my case the fifth column should be removed, p = 8/10 > (threshold = 0.7).
     
     #####################################
     
     from sklearn.feature_selection import VarianceThreshold
     X=[[0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
     [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
     [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
     [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
     [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00],
     [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.01,0.00,0.00,0.00,0.00,0.00],
     [0,1,2,1,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0.00,0.00,0.00,0.00,0.50,1.00,0.00,10,3,0.30,0.30,0.30,0.00,0.00,0.00,0.00,0.00],
     [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
     [0,1,1,1,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00],
     [0,2,3,1,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,71,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00]]
     sel=VarianceThreshold(threshold=(.7*(1-.7)))
     
     and this is what i get when running the script
     
     >>> sel.fit_transform(X)
     
     array([[ 1., 105., 146., 1., 1., 255., 254.],
     [ 1., 105., 146., 1., 1., 255., 254.],
     [ 1., 105., 146., 1., 1., 255., 254.],
     [ 1., 105., 146., 2., 2., 255., 254.],
     [ 1., 105., 146., 2., 2., 255., 254.],
     [ 1., 105., 146., 2., 2., 255., 255.],
     [ 2., 29., 0., 2., 1., 10., 3.],
     [ 1., 105., 146., 1., 1., 255., 253.],
     [ 1., 105., 146., 2., 2., 255., 254.],
     [ 3., 223., 185., 4., 4., 71., 255.]])
     the second column here should not appear.
     thanks ;)
     
     Reply
     * Jason Brownlee June 13, 2017 at 8:24 am #
       
       It is not clear to me what the fault could be. Consider posting to
       stackoverflow or similar?
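
        For what it’s worth, a minimal sketch of what VarianceThreshold does: it
        drops features whose variance falls below the threshold, and the p*(1-p)
        rule of thumb only applies to 0/1 (Bernoulli) features, not to columns
        with large numeric values, whose variance is usually far above it:

        from sklearn.feature_selection import VarianceThreshold
        # first column is constant (variance 0), second is 0/1 with p = 0.5
        X = [[0, 1], [0, 0], [0, 1], [0, 0]]
        sel = VarianceThreshold(threshold=0.7 * (1 - 0.7))  # threshold = 0.21
        print(sel.fit_transform(X))  # only the second column survives (var 0.25)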
       
       Reply
 8.  Ishaan July 4, 2017 at 10:12 pm #
     
     Hi Jason,
     
     I am performing feature selection (on a dataset with 100,000 rows and 32
     features) using multinomial logistic regression in Python. Now, what
     would be the most efficient way to select features in order to build a model
     for a multiclass target variable (1,2,3,4,5,6,7,8,9,10)? I have used RFE for
     feature selection but it gives Rank=1 to all features. Do I consider all
     features for building the model? Is there any other method for this?
     Thanks in advance.
     
     Reply
     * Jason Brownlee July 6, 2017 at 10:15 am #
       
       Try a suite of methods, build models based on the features and compare
       the performance of those models.
       
       Reply
 9.  Hemalatha S November 17, 2017 at 6:50 pm #
     
     can you tell me how to select features for clinical datasets from a csv
     file??
     
     Reply
     * Jason Brownlee November 18, 2017 at 10:13 am #
       
       Try a suite of feature selection methods, build models based on selected
       features, use the set of features + model that results in the best model
       skill.
       
       Reply
 10. Sufian November 26, 2017 at 4:35 am #
     
     Hi Jason, How can I print the feature name and the importance side by side?
     
     Thanks,
     Sufian
     
     Reply
     * Jason Brownlee November 26, 2017 at 7:35 am #
       
        Yes, if you have an array of feature or column names you can use the same
        index into both arrays.
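
        A minimal sketch, assuming the iris recipe from the post:

        from sklearn import datasets
        from sklearn.ensemble import ExtraTreesClassifier

        dataset = datasets.load_iris()
        model = ExtraTreesClassifier().fit(dataset.data, dataset.target)

        # feature names and importances share the same index
        for name, score in zip(dataset.feature_names, model.feature_importances_):
            print(name, score)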
       
       Reply
 11. Hemalatha December 1, 2017 at 2:03 am #
     
     what are the feature selection methods?? and how to build models based on
     the selected features??
     can you help me in this? because I am new to machine learning and python
     
     Reply
     * Jason Brownlee December 1, 2017 at 7:39 am #
       
       Sure, read this post on feature selection:
       https://machinelearningmastery.com/an-introduction-to-feature-selection/
       
       Reply
 12. Praveen January 2, 2018 at 6:42 pm #
     
     I want to remove columns which are highly correlated, like the caret
     package's preprocessing method does in R. How can I remove them using sklearn?
     
     Reply
     * Jason Brownlee January 3, 2018 at 5:32 am #
       
       You might need to implement it yourself – e.g. calculate the correlation
       matrix and remove selected columns.
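
        A minimal sketch of that idea with pandas (the 0.9 threshold is an
        arbitrary illustrative choice):

        import numpy as np
        import pandas as pd

        def drop_correlated(df, threshold=0.9):
            # upper triangle of the absolute correlation matrix
            corr = df.corr().abs()
            upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
            # drop any column too strongly correlated with an earlier column
            to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
            return df.drop(columns=to_drop)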
       
       Reply
 13. Shabnam January 5, 2018 at 8:15 am #
     
     Does Keras have similar functionality to RFE that we can use?

     I am using Keras for my models. I created a model. Then, I wanted to use
     RFE for it. The first line (rfe = RFE(model, 3)) is fine, but as soon as I
     want to fit the data, I get the following error:

     TypeError: Cannot clone object ” (type ): it does not seem to be a
     scikit-learn estimator as it does not implement a ‘get_params’ methods.
     
     Reply
     * Jason Brownlee January 5, 2018 at 11:37 am #
       
       You may be able to use the sklearn wrappers in Keras and then put the
       wrapped model within RFE.
       
       I have posts on using the wrappers on the blog, for example:
       https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/
       
       Reply
       * Shabnam January 6, 2018 at 7:21 am #
         
         That is awesome! I’ll read it. Thanks a lot for your reply and sharing
         the link.
         
         Reply
         * Jason Brownlee January 7, 2018 at 5:01 am #
           
           No problem.
           
           Reply
           * Deep saxena April 12, 2019 at 8:18 pm #
             
              After using your suggestion, the Keras model does not support the
              support_ or ranking_ attribute.
             
             
           * Jason Brownlee April 13, 2019 at 6:27 am #
             
             No it does not.
             
             
           * Deep saxena April 15, 2019 at 5:01 pm #
             
              Then how can we run RFE on a Keras model?
             
             
           * Jason Brownlee April 16, 2019 at 6:46 am #
             
             Perhaps you can use the Keras wrapper for the model, then use it as
             part of RFE?
             
             
           * Deep saxena April 16, 2019 at 9:42 pm #
             
              I did that, but no success. I am pasting the code for reference:
              def create_model():
              # create model
              model = Sequential()
              model.add(Dense(1000, input_dim=v.shape[1], activation='relu'))
              model.add(Dropout(0.2))
              model.add(Dense(3, activation='softmax'))
              model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
              return model
             
             by_name=True)
             seed = 7
             np.random.seed(seed)
             keras_model = KerasClassifier(build_fn=create_model, epochs=10,
             batch_size=10, verbose=1)
             
             rfe = RFE(keras_model, 3)
             rfe = rfe.fit(v, all_label_encoded)
             print(rfe.support_)
             print(rfe)
             
              The model does not support support_ and ranking_. Can you tell me
              exactly how to get the ranking and the support?
             
             
           * Jason Brownlee April 17, 2019 at 7:00 am #
             
             I’m eager to help, but I don’t have the capacity to debug code.
             
             I have some suggestions here:
             https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
             
             
           * Deep saxena April 17, 2019 at 5:59 pm #
             
             Your answer justifies the stuff, thanks for the reply.
             
             
     * Deep saxena April 23, 2019 at 4:29 pm #
       
        @Shubham Just to clarify, the Keras classifier will not work with RFE.
        The answer mentioned by Jason Brownlee will not work.
       
       Reply
       * Jason Brownlee April 24, 2019 at 7:52 am #
         
         Perhaps you can try running a manual search over subsets of features
         with the model?
         
         Perhaps you can run RFE with a sklearn model and use the results to
         motivate a Keras model?
         
         Reply
         * Deep saxena April 25, 2019 at 8:44 pm #
           
           OK
           
           Reply
 14. Smitha January 16, 2018 at 12:33 am #
     
     Hi Jason,
     
     Can Random Forest’s feature importance be considered as a wrapper based
     approach?
     
     Reply
     * Jason Brownlee January 16, 2018 at 7:37 am #
       
       No.
       
       Reply
       * Eva July 4, 2019 at 9:46 pm #
         
         Is it an embedded method?
         
         Reply
         * Jason Brownlee July 5, 2019 at 8:06 am #
           
            No, it is not an embedded method.
           
           Reply
 15. Beytullah January 20, 2018 at 9:40 pm #
     
     Hi Jason,
     
     Do you know how is feature importance calculated?
     
     Reply
     * Jason Brownlee January 21, 2018 at 9:10 am #
       
       It depends on the algorithm.
       
       I cover it in detail for stochastic gradient boosting here:
       https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
       
       Reply
 16. Fawad January 26, 2018 at 4:52 pm #
     
     I feel in recursive feature selection it is more prudent to use cv and let
     the algo decide how many features to retain
     
     Reply
     * Jason Brownlee January 27, 2018 at 5:54 am #
       
       Yes. I often keep all features and use subspaces or ensembles of feature
       selection methods.
       
       Reply
 17. kumar February 26, 2018 at 4:19 pm #
     
     I need to select the best features from my own data set using a feature
     selection wrapper approach; the learning algorithm is ant colony
     optimization and the classifier is SVM. Anyone have any idea?
     
     Reply
 18. Kagne March 23, 2018 at 8:30 pm #
     
     Nice post!
     
     But I still have a question.
     
     I entered the Kaggle competition recently, and I evaluated my dataset by
     using the methods you have posted (the model is RandomForest).

     Then I deleted the worst feature, and my score decreased from 0.79904 to
     0.78947. Then I was confused. Should I build more features? And what
     should I do to get a higher score (change the model? expand features or
     more?), or where can I learn those things?
     
     Thanks a lot.
     
     Reply
     * Jason Brownlee March 24, 2018 at 6:24 am #
       
       Great question. You must try lots of things, this is why ml is hard:
       https://machinelearningmastery.com/applied-machine-learning-is-hard/
       
       It’s a big search problem:
       https://machinelearningmastery.com/applied-machine-learning-as-a-search-problem/
       
       Here is a list of things to try:
       https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
       
       Reply
 19. Rimi March 29, 2018 at 7:38 pm #
     
     Hi Jason,
     
     I wanted to know if there are any existing python library/libraries that
     can be used to rank all the features in a specific dataset based on a
     specific attribute for various methods like Gain Ratio, Infomation Gain,
     Chi2,rank correlation, linear correlation, symmetric uncertainty . If not,
     can you please provide some steps to proceed with the same?
     
     Thanks
     
     Reply
     * Jason Brownlee March 30, 2018 at 6:35 am #
       
       Perhaps?
       
       Each method will have a different “view” on what is important in the
       data. You can test each view to see what is real/useful to developing a
       skilful model.
       
       Reply
 20. Abbas April 11, 2018 at 11:48 pm #
     
     What about the feature importance attribute from the decision tree
     classifier? Could it be used for feature selection?
     http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
     
     Reply
     * Jason Brownlee April 12, 2018 at 8:45 am #
       
       Sure.
       
       Reply
 21. Chris May 13, 2018 at 10:52 pm #
     
     Could this method be used to perform feature subset selection on groups of
     subsets that have to be considered together? For instance, after performing
     a FeatureHasher transformation you have a fixed length hash which takes up
     say 256 columns which have to be considered as a group. Do you have any
     resources for this case?
     
     Reply
     * Jason Brownlee May 14, 2018 at 6:36 am #
       
       Perhaps. Try it. Sorry,I don’t have material on this topic. Try a search
       on scholar.google.com
       
       Reply
 22. Aman May 18, 2018 at 5:15 am #
     
     Regarding the ensemble learning model, I used it to reduce the features.
     But how can I know how many features I need to select?
     
     Reply
     * Jason Brownlee May 18, 2018 at 6:27 am #
       
       Great question, I answer it here:
       https://machinelearningmastery.com/faq/single-faq/what-feature-selection-method-should-i-use
       
       Reply
 23. Jeremy Dohmann July 14, 2018 at 9:12 am #
     
     How large can your feature set be before the efficacy of this algorithm
     breaks down?

     Or, because it uses subsets, does it return a reasonable feature ranking
     even if you fit over a large number of features?
     
     Thanks!
     
     Reply
     * Jason Brownlee July 15, 2018 at 6:03 am #
       
       It depends on the dataset.
       
       Reply
 24. Junaid July 22, 2018 at 12:50 pm #
     
     I am using the tree classifier on my dataset and it gives different values
     each time I run the script. Is this a problem? Or does it differ because of
     the different ways the features are linked by the tree?
     
     Reply
     * Jason Brownlee July 23, 2018 at 6:06 am #
       
       This is to be expected, you can learn more about this here:
       https://machinelearningmastery.com/randomness-in-machine-learning/
       
       Reply
 25. sajid nawaz October 15, 2018 at 2:15 am #
     
     Does anyone have Python feature selection code for classification and
     regression analysis?
     
     Reply
     * Jason Brownlee October 15, 2018 at 7:33 am #
       
       Perhaps start here:
       https://machinelearningmastery.com/an-introduction-to-feature-selection/
       
       Reply
 26. hwanhee October 26, 2018 at 6:53 pm #
     
     Is there a way to find the best number of features for each data set?
     
     Reply
     * Jason Brownlee October 27, 2018 at 5:57 am #
       
       Yes, try a suite of feature selection methods, and a suite of models and
       use the combination of features and model that give the best performance.
       
       Reply
       * hwanhee October 27, 2018 at 12:06 pm #
         
         For example, which algorithm can find the optimal number of features?
         
         Reply
         * Jason Brownlee October 28, 2018 at 6:05 am #
           
           There are many solutions and each with different performance. Machine
           learning is empirical, there’s no idea of ‘best’, just good enough
           given time and resources.
           
           I recommend reading this:
           https://machinelearningmastery.com/faq/single-faq/what-feature-selection-method-should-i-use
           
           Reply
       * hwanhee October 27, 2018 at 12:09 pm #
         
         For example, there are 500 features. Is there any way to know the
         number of features that show the highest classification accuracy when
         performing a feature selection algorithm?
         
         Reply
         * Jason Brownlee October 28, 2018 at 6:06 am #
           
           Test different subsets of features by building a model from them and
           evaluate the performance of the model. The features that lead to a
           model with the best performance are the features that you should use.
           
           Reply
 27. Harshali Patel December 17, 2018 at 9:37 pm #
     
     Hey Jason,
     Again a great post, I have followed several of your posts.
     I want your opinion on the type of machine learning algorithm that I can
     use for my project on supervised learning.
     
     Reply
     * Jason Brownlee December 18, 2018 at 6:02 am #
       
       This is a common question that I answer here:
       https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use
       
       Reply
 28. Vaibhav January 27, 2019 at 4:28 pm #
     
     Hello Jason,
     
     Thank you for all your content. Big fan of all your posts.
     
     I am now stuck in deciding when to use which feature selection method (
     Filter, Wrapper & Embedded ) for my problem.
     
     Can you please help or provide any reference links where I can get the
     required info.
     
     Thanks in advance. !
     
     Vaibhav
     
     Reply
     * Jason Brownlee January 28, 2019 at 7:12 am #
       
       No problem, this is a common question that I answer here:
       https://machinelearningmastery.com/faq/single-faq/what-feature-selection-method-should-i-use
       
       Reply
 29. nandini February 6, 2019 at 4:59 pm #
     
     Hi Jason,
     
     I have a requirement about model predictions for text classification using
     Keras.
     Suppose I enter unrelated text for model prediction, text that the model
     was not trained on; the model should instantly respond that the entered
     query is invalid.

     Please suggest any methods that are available.
     Thanks in advance 🙂
     
     Reply
     * Jason Brownlee February 7, 2019 at 6:37 am #
       
       Sorry, I don’t follow, perhaps you can elaborate?
       
       Reply
 30. ofer February 10, 2019 at 9:27 am #
     
     Hi,
     There are many different methods for feature selection. It depends on the
     algorithm I use. For example, if I use logistic regression for prediction,
     then I cannot use random forest for feature selection (the subset of
     features from random forest can be non-significant in a logistic regression
     model).
     Is the method you suggest suitable for logistic regression?
     
     Reply
     * Jason Brownlee February 10, 2019 at 9:46 am #
       
       Perhaps start with RFE?
       
       Reply
 31. Shreya April 27, 2019 at 7:45 pm #
     
     After using logistic regression for feature selection can we apply
     different models such as knn, decision tree, random forest etc to get the
     accuracy?
     
     Reply
     * Jason Brownlee April 28, 2019 at 6:56 am #
       
       Perhaps your problem is too easy or too hard and all models find the same
       solution?
       
       Reply
 32. Sydney Wu May 2, 2019 at 1:26 pm #
     
     hi, Jason,
     
     Thanks for your post, it’s clear and useful.
     
     But I still have some questions.
     
     1. Should I eliminate collinearity of variables before feature selection?
     Some posts say collinearity is not a problem for nonlinear models, but I am
     afraid that it will affect the result of feature selection.

     2. There are several feature selection methods in scikit-learn; different
     methods may select different subsets. How do I know which subset or method
     is more suitable?
     
     3. When I build a machine learning model, the performance of the model
     seems more related to the number of features. No matter what features I
     use, the accuracy will increase when a certain threshold is reached. How do
     I explain this?
     
     Again, thanks a lot for your patient answer.
     
     Reply
     * Jason Brownlee May 2, 2019 at 2:04 pm #
       
       Perhaps, try it and see for your model + data.
       
       Good question, try them all and see what works best, see this:
       https://machinelearningmastery.com/faq/single-faq/what-feature-selection-method-should-i-use
       
       If the features are relevant to the outcome, the model will figure out
       how to use them. Or most models will.
       
       Reply
 33. Ronak May 9, 2019 at 12:29 pm #
     
     Thanks for the great posts. I have a problem for feature selection and
     parameter tuning.
     Thanks in advance for the help,
     
     I would like to do feature selection with recursive feature elimination and
     cross-validated selection of the best number of features. So I use RFECV:
     
     But I am passing an untuned model, svm.SVC(kernel='linear'), to RFECV() to
     find a subset of the best features. So I have not addressed the tuning of
     hyperparameters within the model.

     Does it make sense to find some optimised hyperparameters of the model
     using grid search first, and THEN do RFE? (However, parameter tuning is
     then performed on the un-optimized feature set.)
     How about doing it vice versa, i.e. first feature selection and then
     parameter tuning? (However, the selected features were chosen based on the
     untuned model.)

     Although both GridSearchCV and RFECV perform feature selection
     independently in each fold of the cross-validation, and I can use different
     splitting criteria for RFECV and GridSearchCV,
     I still suspect that as I have to use the same dataset for parameter tuning
     as well as for RFECV selection, does it cause overfitting?

     Do I have to take out a portion of the training set to do feature selection
     on, and then start model selection on the remaining data in the training set?
     
     Reply
     * Jason Brownlee May 9, 2019 at 2:09 pm #
       
       It might make sense to use standalone rfe within a pipeline with a given
       algorithm.
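
        A minimal sketch of that idea (the estimator and cv settings here are
        illustrative assumptions):

        from sklearn import datasets
        from sklearn.feature_selection import RFE
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import Pipeline
        from sklearn.svm import SVC

        dataset = datasets.load_iris()
        # RFE is re-fit inside each fold, so selection cannot peek at test data
        pipe = Pipeline([
            ('rfe', RFE(SVC(kernel='linear'), n_features_to_select=2)),
            ('svc', SVC(kernel='linear')),
        ])
        print(cross_val_score(pipe, dataset.data, dataset.target, cv=5))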
       
       Reply
 34. Tarun Gangil May 27, 2019 at 7:25 pm #
     
     Hi,
     Will Recursive Feature Elimination work well for categorical input
     datasets too?
     
     Reply
     * Jason Brownlee May 28, 2019 at 8:12 am #
       
       Sure.
       
       Reply
 35. Benjamin June 5, 2019 at 1:49 am #
     
     Hi Jason, thanks for your hard work !
     
     How do you explain the following behavior? Feature importance doesn’t tell
     you to keep the same features as RFE… which one should we trust?
     
     The code :
     
     # Feature Importance
     from sklearn import datasets
     from sklearn.ensemble import RandomForestClassifier
     from sklearn.feature_selection import RFE
     # load the iris dataset
     dataset = datasets.load_iris()
     # fit a Random Forest model to the data
     model = RandomForestClassifier()
     model.fit(dataset.data, dataset.target)
     # display the relative importance of each attribute
     print(model.feature_importances_)
     
     rfe = RFE(model, 1)
     rfe = rfe.fit(dataset.data, dataset.target)
     # summarize the selection of the attributes
     print(rfe.support_)
     print(rfe.ranking_)
     
     Output:
     
     [0.02029219 0.01598919 0.57190818 0.39181044]
     [False False False True]
     [3 4 2 1]
     
     Reply
     * Jason Brownlee June 5, 2019 at 8:47 am #
       
       This is a common question that I answer here:
       https://machinelearningmastery.com/faq/single-faq/what-feature-selection-method-should-i-use
       
       Reply
 36. Kushal Ghimire June 17, 2019 at 6:34 pm #
     
     Great explanation, but I want to extract features from videos for human
     activity recognition (walk, sleep, jump). But I don't know how to load the
     datasets. Any help will be appreciated.
     
     Reply
     * Jason Brownlee June 18, 2019 at 6:36 am #
       
        Sorry, I don’t have a tutorial on loading video.
       
       Reply
 37. Suganya July 26, 2019 at 5:21 pm #
     
     Hello Jason,
     I am trying to select the best features among 80 features in my dataset. My
     dataset contains integer as well as string values. I got an issue while
     trying to select the features using the SelectKBest method. Why did such an
     issue happen? Could you help me understand this?
     
     Reply
     * Jason Brownlee July 27, 2019 at 6:07 am #
       
       Good question, I answer it here:
       https://machinelearningmastery.com/faq/single-faq/what-feature-selection-method-should-i-use
       
       Reply
       * Suganya July 29, 2019 at 9:21 pm #
         
          Thanks Jason. Having another doubt. Will all the feature selection
          techniques such as SelectKBest and Feature Importance prioritize the
          features in the same order? If so, how could we know which particular
          method is best for feature selection?
         
         Reply
         * Jason Brownlee July 30, 2019 at 6:12 am #
           
           Good question, I answer it here:
           https://machinelearningmastery.com/faq/single-faq/what-feature-selection-method-should-i-use
           
           Reply
       * Suganya July 29, 2019 at 9:55 pm #
         
          Each time I execute a feature importance method, it gives different
          features as the best features. Is this possible?
         
         Reply
         * Jason Brownlee July 30, 2019 at 6:12 am #
           
           Yes, each method has a different “idea” of what features to use.
           
           Test a number of different approaches and choose one that results in
           the best performing model.
           
           Reply
           * Suganya July 30, 2019 at 5:14 pm #
             
             Thank you Jason.
             
             
           * Jason Brownlee July 31, 2019 at 6:45 am #
             
             You’re welcome.
             
             
     * Harendra March 31, 2020 at 6:59 am #
       
        Hi Jason,
        Can you provide me Python code for correlation-based feature selection?
       
       Reply
       * Jason Brownlee March 31, 2020 at 8:20 am #
         
         Yes, here:
         https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
         
         Reply
 38. DHILSATH FATHIMA. M August 6, 2019 at 7:30 pm #
     
     What is the role of the p-value in machine learning algorithms? Why use it?
     
     Reply
     * Jason Brownlee August 7, 2019 at 7:46 am #
       
       It is used to interpret the result of a statistical hypothesis test:
       https://machinelearningmastery.com/faq/single-faq/how-do-i-interpret-a-p-value
       
       Reply
 39. Anushka August 22, 2019 at 9:16 pm #
     
     Hello Jason,
     Thank you for the descriptive article.
     I am working with microbiome data analysis and would like to use machine
     learning to pick a set of genera which can classify samples between two
     categories (for example, healthy and diseased).
     I used the following code:
     
     from sklearn.feature_selection import SelectKBest
     from sklearn.feature_selection import chi2
     from sklearn.feature_selection import SelectFpr
     from sklearn.feature_selection import GenericUnivariateSelect
     X = df_n #dataset with 131 columns and 51 rows
     y = list(map(lambda x : x[:2], df_n.index))
     
     bestfeatures = GenericUnivariateSelect(chi2, 'k_best')
     fit = bestfeatures.fit(X,y)
     pvalues = -np.log10(bestfeatures.pvalues_) #convert pvalues into log format
     
     dfscores = pd.DataFrame(fit.scores_)
     dfcolumns = pd.DataFrame(X.columns)
     dfpvalues = pd.DataFrame(pvalues)
     
     #concat two dataframes for better visualization
     featureScores = pd.concat([dfcolumns,dfscores,dfpvalues],axis=1)
     featureScores.columns = ['Specs','Score','pvalues'] #naming the dataframe
     columns
     FS = featureScores.loc[featureScores[‘pvalues’] < 0.05, :]
     
     print(FS.nlargest(10, 'pvalues')) #top 10 features
     Specs Score pvalues
     41 a1 0.206076 0.044749
     22 a2 0.193496 0.042017
     11 a3 0.153464 0.033324
     117 a4 0.143448 0.031149
     20 a5 0.143214 0.031099
     45 a6 0.136450 0.029630
     67 a7 0.132488 0.028769
     0 a8 0.122946 0.026697
     80 a9 0.120120 0.026084
     123 a10 0.118977 0.025836
     
     Now I would like to use this list of features to make a PCoA plot with
     Bray-Curtis because I want to visualize how these features can distinguish
     the 40 samples into two different categories (already known).
     
     Can you help me by guiding in this regard?
     
     Reply
     * Jason Brownlee August 23, 2019 at 6:25 am #
       
       What is a PCoA plot and what is Bray-curtis?
       
       Reply
 40. Prerna April 22, 2020 at 2:56 am #
     
     Hi,
     
     After rfe.fit and getting the rankings of the features, how do we get the
     feature names according to the rankings? Also, which rankings would we
     choose to go ahead and train the model?
     
     Reply
     * Jason Brownlee April 22, 2020 at 6:05 am #
       
       The ranking has the indexes of each feature, you can use these indexes to
       access the column names from an array or from your dataframe.
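
        A minimal sketch, assuming the fitted rfe object and the iris feature
        names:

        from sklearn import datasets

        dataset = datasets.load_iris()
        names = dataset.feature_names

        # support_ is a boolean mask over the original columns
        print([n for n, keep in zip(names, rfe.support_) if keep])

        # ranking_ is 1 for selected features, larger for eliminated ones
        for n, r in zip(names, rfe.ranking_):
            print(n, r)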
       
       Reply
 41. Andrew May 1, 2020 at 5:46 pm #
     
     Hi Jason,
     
     RFE selects the feature set based on the training data.
     Although in general, fewer features tend to prevent overfitting. So how
     does it ensure that the best performing features were not due to overfitted
     training data, since there is no validation set in place?
     
     Also, how does RFE differ from the importance_plot from XGboost or random
     forest or Gradient Boosting which shows the list of features based on gain
     importance?
     
     Reply
     * Jason Brownlee May 2, 2020 at 5:40 am #
       
       RFE cannot help you prevent overfitting.
       
       The are very different. RFE is calculated using any model you like and
       selects features based on how it impacts model performance. Feature
       importance from ensembles of trees is calculated based on how much the
       features are used in the trees.
       
       Reply
 42. Henrique June 3, 2020 at 6:36 pm #
     
     Hi,
     
     thank you for the tutorial.
     
     Something that is not clear to me is whether RFE is only used for
     classification or if it can be used for regression problems as well.
     When adapting the tutorial above to another dataset, it keeps alerting that
     the data is continuous. This is normally associated with classifiers, isn't
     it?
     
     Thank you once more.
     
     Reply
     * Jason Brownlee June 4, 2020 at 6:14 am #
       
       It can be used for classification or regression, see examples here:
       https://machinelearningmastery.com/rfe-feature-selection-in-python/
       
       Reply
 43. Jaime Lannister June 11, 2020 at 1:49 am #
     
     Hey there,
     
     Can we extract feature names from the model only?
     Like you just have a fitted model and now you have to calculate its score,
     but the problem is you don't have the list of features used in it. You just
     have the model and the training dataset.
     If yes, then please help me because I am stuck on this!
     
     Thanks
     
     Reply
     * Jason Brownlee June 11, 2020 at 6:01 am #
       
       It will suggest feature/column indexes, you can then relate these to the
       names of the features in the original dataset directly.
       
       Reply
 44. umesh kumar baburao sherkhane July 19, 2020 at 3:02 pm #
     
     Hi Jason,
     it's a good article.

     I have one doubt: if I don't know the number of features to select, how
     should I go about selecting the optimum number of features required for RFE?
     
     Thanks and regards
     
     Reply
     * Jason Brownlee July 20, 2020 at 6:03 am #
       
       Good question.
       
       You can use a grid search and test each number of features from 1 to the
       total number of features, here is an example:
       https://machinelearningmastery.com/rfe-feature-selection-in-python/
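
        Alternatively, a minimal sketch with RFECV, which picks the number of
        features by cross-validation:

        from sklearn import datasets
        from sklearn.feature_selection import RFECV
        from sklearn.linear_model import LogisticRegression

        dataset = datasets.load_iris()
        rfecv = RFECV(LogisticRegression(max_iter=200), cv=5)
        rfecv.fit(dataset.data, dataset.target)
        print(rfecv.n_features_)  # the chosen number of features
        print(rfecv.support_)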
       
       Reply

