class watex.exlib.gbm.XGBClassifier(*, objective='binary:logistic', use_label_encoder=None, **kwargs)[source]#

Bases: XGBModel, ClassifierMixin

Implementation of the scikit-learn API for XGBoost classification.

Parameters:
  • n_estimators (int) – Number of boosting rounds.

  • max_depth (Optional[int]) – Maximum tree depth for base learners.

  • max_leaves – Maximum number of leaves; 0 indicates no limit.

  • max_bin – If using a histogram-based algorithm, the maximum number of bins per feature.

  • grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with the highest loss change.

  • learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”).

  • verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

  • objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  • booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.

  • tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. With the default setting, XGBoost chooses the most conservative option available. It is recommended to study this option in the tree method section of the parameters document.

  • n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

  • gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.

  • max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.

  • subsample (Optional[float]) – Subsample ratio of the training instance.

  • sampling_method

    Sampling method. Used only by gpu_hist tree method.
    • uniform: select random training instances uniformly.

    • gradient_based: select random training instances with higher probability when the gradient and hessian are larger (cf. CatBoost).

  • colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.

  • colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.

  • colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.

  • reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).

  • reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).

  • scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.

  • base_score (Optional[float]) – The initial prediction score of all instances, global bias.

  • random_state (Optional[Union[numpy.random.RandomState, int]]) –

    Random number seed.

    Note

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

  • missing (float, default np.nan) – Value in the data to be treated as a missing value.

  • num_parallel_tree (Optional[int]) – Used for boosting random forest.

  • monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.

  • interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints representing permitted feature interactions. The constraints must be specified as a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of feature indices that are allowed to interact with each other. See tutorial for more information.

  • importance_type (Optional[str]) –

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

  • gpu_id (Optional[int]) – Device ordinal.

  • validate_parameters (Optional[bool]) – Give warnings for unknown parameters.

  • predictor (Optional[str]) – Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

  • enable_categorical (bool) –

    New in version 1.5.0.

    Note

    This parameter is experimental

    Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.

  • feature_types (FeatureTypes) –

    New in version 1.7.0.

    Used for specifying feature types without constructing a dataframe. See DMatrix for details.

  • max_cat_to_onehot (Optional[int]) –

    New in version 1.6.0.

    Note

    This parameter is experimental

    A threshold for deciding whether XGBoost should use one-hot-encoding-based splits for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories are partitioned into children nodes. enable_categorical must also be set for categorical feature support. See Categorical Data and cat-param for details.

  • max_cat_threshold (Optional[int]) –

    New in version 1.7.0.

    Note

    This parameter is experimental

    Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and cat-param for details.

  • eval_metric (Optional[Union[str, List[str], Callable]]) –

    New in version 1.6.0.

    Metric used for monitoring the training result and for early stopping. It can be a string or a list of strings naming predefined metrics in XGBoost (see doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user-defined metric with the same call signature as those in sklearn.metrics.

    If a custom objective is also provided, the custom metric should implement the corresponding reverse link function.

    Unlike the scoring parameter commonly used in scikit-learn, a callable object is assumed to be a cost function, and by default XGBoost minimizes its result during early stopping.

    For advanced early-stopping usage, such as choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

    See Custom Objective and Evaluation Metric for more.

    Note

    This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.

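    import xgboost as xgb
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    # Load a small regression dataset as NumPy arrays.
    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        # A callable metric is treated as a cost and minimized.
        eval_metric=mean_absolute_error,
    )
    # The metric is computed on every entry of eval_set during training.
    reg.fit(X, y, eval_set=[(X, y)])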
    

  • early_stopping_rounds (Optional[int]) –

    New in version 1.6.0.

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.

    Note

    This parameter replaces early_stopping_rounds in fit() method.

  • callbacks (Optional[List[TrainingCallback]]) –

    List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

    Note

    States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.

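    import xgboost as xgb

    # parameters_grid, custom_rates and Xy (a DMatrix) are assumed to be defined elsewhere.
    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        xgb.train(params, Xy, callbacks=callbacks)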
    

  • kwargs (dict, optional) –

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    Note

    **kwargs unsupported by scikit-learn

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    Note

    Custom objective function

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point.
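As an illustration of the objective(y_true, y_pred) -> grad, hess signature described above, the following is a minimal sketch of a logistic-loss objective. It assumes that y_pred holds raw, untransformed margins and that the labels are 0 or 1. The function name logistic_objective is hypothetical and is not part of XGBoost.

```python
import numpy as np


def logistic_objective(y_true, y_pred):
    """Hypothetical custom objective: gradient and hessian of the log loss.

    Assumes y_pred contains raw margins (before the sigmoid) and
    y_true contains 0/1 labels, both of shape [n_samples].
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    prob = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid link
    grad = prob - y_true                  # first derivative w.r.t. the margin
    hess = prob * (1.0 - prob)            # second derivative w.r.t. the margin
    return grad, hess


# Two samples with a margin of 0, i.e. a predicted probability of 0.5.
grad, hess = logistic_objective(np.array([0.0, 1.0]), np.array([0.0, 0.0]))
```

Under these assumptions, a function of this shape could be supplied through the objective parameter, for example XGBClassifier(objective=logistic_objective).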

fit(X, y, *, sample_weight=None, base_margin=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None, sample_weight_eval_set=None, base_margin_eval_set=None, feature_weights=None, callbacks=None)[source]#

Fit gradient boosting classifier.

Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass xgb_model argument.

Parameters:
  • X – Feature matrix

  • y – Labels

  • sample_weight – instance weights

  • base_margin – global bias for each instance.

  • eval_set – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.

  • eval_metric (str, list of str, or callable, optional) –

    Deprecated since version 1.6.0: Use eval_metric in __init__() or set_params() instead.

  • early_stopping_rounds (int) –

    Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.

  • verbose – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

  • xgb_model – File name of a stored XGBoost model, or a Booster instance, to be loaded before training (allows training continuation).

  • sample_weight_eval_set – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.

  • base_margin_eval_set – A list of the form [M_1, M_2, …, M_n], where each M_i is an array like object storing base margin for the i-th validation set.

  • feature_weights – Weight for each feature; defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is raised.

  • callbacks

    Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
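The early-stopping rule described for early_stopping_rounds can be sketched in plain Python. This is only an illustration of the rule, assuming one validation score per boosting round; the helper early_stop is hypothetical and is not XGBoost's implementation.

```python
def early_stop(scores, early_stopping_rounds, maximize=False):
    """Return (last_iteration, best_iteration) for per-round validation scores.

    Training stops once the score has failed to improve for
    early_stopping_rounds consecutive rounds.
    """
    if not scores:
        raise ValueError("scores must contain at least one value")
    best_score = scores[0]
    best_iteration = 0
    for i, score in enumerate(scores[1:], start=1):
        improved = score > best_score if maximize else score < best_score
        if improved:
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            # No improvement for early_stopping_rounds rounds: stop here.
            return i, best_iteration
    # Early stopping never triggered; all rounds were run.
    return len(scores) - 1, best_iteration


# The score (a cost to minimize) last improves at round 1, so with a
# patience of 3 rounds training stops at round 4.
last_iteration, best_iteration = early_stop(
    [0.9, 0.8, 0.85, 0.84, 0.83, 0.7], early_stopping_rounds=3
)
```

In this sketch the last iteration (4) differs from the best iteration (1), which mirrors the distinction the constructor documentation draws between the returned model and the best one.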

predict(X, output_margin=False, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)[source]#

Predict with X. If the model is trained with early stopping, best_iteration is used automatically. For tree models, when the data is on the GPU (for example a cupy array or cuDF dataframe) and predictor is not specified, prediction runs on the GPU automatically; otherwise it runs on the CPU.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X – Data to predict with.

  • output_margin – Whether to output the raw untransformed margin value.

  • ntree_limit – Deprecated, use iteration_range instead.

  • validate_features – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin – Margin added to prediction.

  • iteration_range

    Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.

    New in version 1.4.0.

Return type:

prediction
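The half-open semantics of iteration_range can be illustrated with a small sketch. It assumes a single-output model in which each boosting round adds num_parallel_tree consecutive trees; that tree layout and the helper trees_in_iteration_range are assumptions made for illustration, not part of the XGBoost API.

```python
def trees_in_iteration_range(iteration_range, num_parallel_tree=1):
    """Return the indices of trees built during rounds [begin, end).

    Assumes each boosting round contributes num_parallel_tree
    consecutive trees (an illustrative layout, see the lead-in).
    """
    begin, end = iteration_range
    if begin < 0 or end <= begin:
        raise ValueError("iteration_range must satisfy 0 <= begin < end")
    # The end bound is exclusive, so round `end` itself is not used.
    return list(range(begin * num_parallel_tree, end * num_parallel_tree))


# Plain boosting: rounds 10 through 19 are used, round 20 is excluded.
boosted = trees_in_iteration_range((10, 20))

# Boosted random forest with 3 trees per round: 10 forests, 30 trees.
forest = trees_in_iteration_range((10, 20), num_parallel_tree=3)
```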

predict_proba(X, ntree_limit=None, validate_features=True, base_margin=None, iteration_range=None)[source]#

Predict the probability of each X example being of a given class.

Note

This function is only thread safe for gbtree and dart.

Parameters:
  • X (array_like) – Feature matrix.

  • ntree_limit (int) – Deprecated, use iteration_range instead.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (array_like) – Margin added to prediction.

  • iteration_range – Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds and iteration_range=(10, 20) is specified, only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.

Returns:

A NumPy array of shape (n_samples, n_classes) with the probability of each data example being of a given class.

Return type:

prediction
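To make the (n_samples, n_classes) return shape concrete for the binary case, here is a minimal sketch that turns raw margins into a two-column probability array. It assumes a binary:logistic objective, so that the positive-class probability is the sigmoid of the margin; the helper binary_proba_from_margin is hypothetical and only mimics the layout of the result.

```python
import numpy as np


def binary_proba_from_margin(margin):
    """Map raw margins of shape (n_samples,) to probabilities of shape (n_samples, 2).

    Column 0 holds the probability of the negative class and
    column 1 the probability of the positive class.
    """
    margin = np.asarray(margin, dtype=float)
    positive = 1.0 / (1.0 + np.exp(-margin))  # sigmoid of the margin
    return np.column_stack((1.0 - positive, positive))


proba = binary_proba_from_margin([-2.0, 0.0, 3.0])
```

Each row of proba sums to 1, and a margin of 0 corresponds to a probability of 0.5 for both classes.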

class watex.exlib.gbm.xgboostdoc#

Bases: object

Extreme Gradient Boosting

XGBoost, which stands for Extreme Gradient Boosting, is an open-source software library that implements optimized, distributed gradient boosting machine learning algorithms under the Gradient Boosting framework.

XGBoost was proposed by researchers at the University of Washington. It is a library written in C++ that optimizes training for gradient boosting [1]. Before understanding XGBoost, we first need to understand trees, especially the decision tree.

A decision tree (DT) is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. A tree can be ‘learned’ by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all samples in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions [2].
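Recursive partitioning as described above can be sketched in a few lines of plain Python. This is a toy illustration, not how XGBoost grows trees: it assumes numeric features, uses the misclassification count as the split criterion, and represents the tree as nested tuples. All function names are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def errors(labels):
    """Number of samples a majority-vote leaf would misclassify."""
    return len(labels) - Counter(labels).most_common(1)[0][1]


def build_tree(rows, labels, max_depth=3):
    """Recursively partition (rows, labels); labels must not be tuples."""
    # Stop when the node is pure or the depth budget is used up.
    if len(set(labels)) == 1 or max_depth == 0:
        return majority(labels)
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            if not left or not right:
                continue
            cost = errors([labels[i] for i in left]) + errors([labels[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, feature, threshold, left, right)
    # Stop when no split improves on a single majority-vote leaf.
    if best is None or best[0] >= errors(labels):
        return majority(labels)
    _, feature, threshold, left, right = best
    return (
        feature,
        threshold,
        build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    )


def predict_one(tree, row):
    """Follow the attribute tests from the root down to a leaf label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree


tree = build_tree([[1.0], [2.0], [3.0], [4.0]], ["a", "a", "b", "b"])
```

Each internal node is a (feature, threshold, left, right) tuple, which corresponds to a test on an attribute, and each leaf is a class label.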

References

[1] https://www.geeksforgeeks.org/xgboost/

[2] https://www.nvidia.com/en-us/glossary/data-science/xgboost/