mlplot is a set of plot templates for visualising and inspecting learning models. It gives users a quick way to visualize and evaluate models with: EvalPlot
- class watex.view.mlplot.EvalPlot(tname=None, encode_labels=False, scale=None, cv=None, objective=None, prefix=None, label_values=None, litteral_classes=None, **kws)
Bases: BasePlot
Metrics, dimensionality and model evaluation plots.
Inherited from BasePlot. Dimensionality reduction and metric plots. The class works only with numerical features.
Discouraged
Using continuous target values to plot classification metrics is discouraged. We encourage users to prepare their dataset before using the EvalPlot methods, so as to keep full control of the expected results. Most of the metric plots implemented here work with supervised methods and deal mainly with classification problems, so the convenient way is to discretize/categorize the target (class labels) before the fit. Otherwise, as in the demonstration examples under each method, the continuous labels must first be categorized. There are two options: either provide the individual class labels as a list of integers using the method EvalPlot._cat_codes_y(), or specify the number of clusters that the target must hold. The latter option is mostly useful for test or academic purposes. On a real dataset it is discouraged, since such a partition of the target is far from reality and can lead to misinterpretation.
- Parameters:
X (ndarray of shape (M, N), with M = m samples and N = n features) – training set; denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M,), with M = m samples) – training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
tname (str) – A target name or label. In supervised learning, the target name is the reference name of the y or label variable.
objective (str, default=None) – The purpose of the dataset, i.e. the problem we intend to solve. Originally the package was designed for flow rate prediction. Thus, if the objective is set to flow, the plot behaves as for flow rate prediction, and in that case some conditions on the target values need to be fulfilled. Furthermore, if the objective is set to flow, the label_values as well as the litteral_classes parameters need to be supplied to correctly encode the target according to the hydraulic system requirements during the campaign for drinking water supply. For any other purpose, keep the objective as None. Default is None.
encode_labels (bool, default=False) – label encoding works with the label_values parameter. If y holds continuous numerical values, the regression can be turned into a classification by setting encode_labels to True. If the value is set to True and the label values are not given, a unique identifier is created, which may not fit the exact needs of the user. So it is recommended to set this parameter in combination with label_values. For instance:
encode_labels=True ; label_values=3
indicates that the target y values should be categorized to hold the integer identifiers [0, 1, 2]. y is split into three subsets where:
classes (c) = [c{0}: y <= y.min(), c{1}: y.min() < y < y.max(), c{2}: y >= y.max()]
This auto-splitting may not fit the exact classification of the target, so it is recommended to set label_values as a list of class labels, for instance label_values=[0, 1, 2].
scale (str, ['StandardScaler'|'MinMaxScaler'], default='StandardScaler') – kind of feature scaling to apply to numerical features. Note that when using PCA, it is recommended to set scale to True and to call fit_transform rather than only fit. Note that the transform method also handles missing (NaN) values in the data; the default filling strategy is most_frequent.
cv (float,) – A cross-validation splitting strategy. It is used in cross-validation based routines. cv is also available in estimators such as multioutput.ClassifierChain or calibration.CalibratedClassifierCV, which use the predictions of one estimator as training data for another, so as not to overfit the training supervision. Possible inputs for cv are usually:
* An integer, specifying the number of folds in K-fold cross-validation. K-fold will be stratified over classes if the estimator is a classifier (determined by base.is_classifier) and the targets may represent a binary or multiclass (but not multioutput) classification problem (determined by utils.multiclass.type_of_target).
* A cross-validation splitter instance. Refer to the User Guide for splitters available within Scikit-learn.
* An iterable yielding train/test splits.
With some exceptions (especially where not using cross-validation at all is an option), the default is 4-fold.
prefix (str, optional) – literal string used to prefix the integer identifier labels.
label_values (list of int, optional) – works with the encode_labels parameter. It indicates the different class labels. Refer to the explanation of encode_labels.
litteral_classes (list or str, optional) – Works when objective is flow. Replaces the integer class names with their literal strings. For instance:
label_values = [0, 1, 3, 6]
litteral_classes = ['rate0', 'rate1', 'rate2', 'rate3']
yp_ls (str, default='-') – Line style of the Predicted label. Can be [ '-' | '.' | ':' ]
yp_lw (str, default=3) – Line weight of the Predicted plot.
yp_lc (str or matplotlib.cm(), default='k') – Line color of the Prediction plot. Default is k.
rs (str, default='--') – Line style of the Recall metric.
rc (str, default=(.6,.6,.6)) – Recall metric colors.
pc (str or matplotlib.cm(), default='k') – Precision colors from Matplotlib colormaps.
yp_marker (str or matplotlib.markers(), default='o') – Style of marker of the Prediction points.
yp_markerfacecolor (str or matplotlib.cm(), default='k') – Facecolor of the Predicted label marker.
yp_markeredgecolor (str or matplotlib.cm(), default='r') – Edgecolor of the Predicted label marker.
yp_markeredgewidth (int, default=2) – Width of the Predicted label marker.
savefig (str, Path-like object) – name of the saved figure. default is None
fig_dpi (float) – dots-per-inch resolution of the figure. default is 300
fig_num (int) – size of figure in inches (width, height). default is [5, 5]
fig_size (Tuple (int, int) or inch) – size of figure in inches (width, height). default is [5, 5]
fig_orientation (str) – figure orientation. default is landscape
fig_tile (str) – figure title. default is None
fs (float) – size of font of axis tick labels; axis labels are fs+2. default is 6
ls (str) – line style, it can be [ '-' | '.' | ':' ]. default is '-'
lc (str, Optional) – line color of the plot. default is k
lw (float, Optional) – line weight of the plot. default is 1.5
alpha (float, 0 < alpha < 1) – transparency number. default is 0.5
font_weight (str, Optional) – weight of the font. default is bold
font_style (str, Optional) – style of the font. default is italic
font_size (float, Optional) – size of font in inches (width, height). default is 3.
ms (float, Optional) – size of marker in points. default is 5
marker (str, Optional) – marker of stations. default is o
marker_style (str, Optional) – facecolor of the marker. default is yellow
marker_edgecolor (str, Optional) – edgecolor of the marker. default is yellow
marker_edgewidth (float, Optional) – width of the marker. default is 3.
xminorticks (float, Optional) – minor ticks according to x-axis size. default is 1.
yminorticks (float, Optional) – minor ticks according to y-axis size. default is 1.
bins – histogram element separation between two bars. default is 10.
xlim (tuple (int, int), Optional) – limit of x-axis in plot.
ylim (tuple (int, int), Optional) – limit of y-axis in plot.
xlabel (str, Optional) – label name of x-axis in plot.
ylabel (str, Optional) – label name of y-axis in plot.
rotate_xlabel (float, Optional) – angle to rotate xlabel in plot.
rotate_ylabel (float, Optional) – angle to rotate ylabel in plot.
leg_kws (dict, Optional) – keyword arguments of legend. default is empty dict
plt_kws (dict, Optional) – keyword arguments of plot. default is empty dict
glc (str, Optional) – line color of the grid plot. default is k
glw (float, Optional) – line weight of the grid plot. default is 2
galpha (float, Optional) – transparency number of grid. default is 0.5
gaxis (str ('x', 'y', 'both')) – type of axis to hold the grid. default is both
gwhich (str, Optional) – kind of grid in the plot. default is major
tp_axis (bool) – axis to apply the ticks params. default is both
tp_labelsize (str, Optional) – labelsize of ticks params. default is italic
tp_bottom (bool) – position at bottom of ticks params. default is True.
tp_labelbottom (bool) – put label on the bottom of the ticks. default is False
tp_labeltop (bool) – put label on the top of the ticks. default is True
cb_orientation (str, ('vertical', 'horizontal')) – orientation of the colorbar. default is vertical
cb_aspect (float, Optional) – aspect of the colorbar. default is 20.
cb_shrink (float, Optional) – shrink size of the colorbar. default is 1.0
cb_pad (float) – pad of the colorbar of plot. default is .05
cb_anchor (tuple (float, float)) – anchor of the colorbar. default is (0.0, 0.5)
cb_panchor (tuple (float, float)) – proportionality anchor of the colorbar. default is (1.0, 0.5)
cb_label (str, Optional) – label of the colorbar.
cb_spacing (str, Optional) – spacing of the colorbar. default is uniform
cb_drawedges (bool) – draw edges inside of the colorbar. default is False
Notes
This module works with numerical data, i.e. the data must contain numerical features only. If categorical values are included in the dataset, they should be removed, and the data is reduced accordingly, during the fit methods.
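The label encoding described for encode_labels and label_values can be pictured with a small standalone sketch. It does not use watex: the helper name categorize and the bound values are hypothetical, and the binning rule (explicit upper bounds) is an assumption chosen for illustration, not necessarily the rule EvalPlot applies internally.

```python
from bisect import bisect_left

def categorize(y, bounds):
    """Map each continuous value in y to an integer class label.

    bounds is a sorted list of upper bounds (hypothetical values).
    A value v gets label i when bounds[i-1] < v <= bounds[i], and
    the last label when v exceeds every bound.
    """
    return [bisect_left(bounds, v) for v in y]

flow = [0.0, 0.5, 1.0, 2.4, 3.0, 7.2]        # continuous target (made-up values)
labels = categorize(flow, bounds=[0.0, 1.0, 3.0])
print(labels)  # [0, 1, 1, 2, 2, 3]
```

Choosing the bounds explicitly, rather than asking for a number of clusters, keeps the classes tied to thresholds that mean something for the problem at hand.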
- fit(X=None, y=None, **fit_params)
Fit data and populate the attributes for plotting purposes.
There is no conventional procedure for checking whether an object is fitted. However, a class that is not fitted should raise watex.exceptions.NotFittedError when a method is called.
- Parameters:
X (ndarray, M x N matrix where M = m samples and N = n features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M,), M = m samples) – training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
data (filepath or DataFrame of shape (M, N)) – pandas.DataFrame containing M samples and N features.
fit_params (dict) – Additional keyword arguments passed to watex.utils.coreutils._is_readable.
- Returns:
self – returns self for easy method chaining.
- Return type:
EvalPlot instance
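The fit contract described above (populate attributes, return self, raise when a method is used before fitting) can be sketched without watex. The class TinyPlot, its summary method, and the local NotFittedError are hypothetical stand-ins written for illustration; they are not the EvalPlot implementation.

```python
class NotFittedError(Exception):
    """Local stand-in for watex.exceptions.NotFittedError (hypothetical copy)."""

class TinyPlot:
    """Hypothetical plotter illustrating the fit-then-use contract."""

    def fit(self, X, y=None):
        # Store the data as attributes; their presence marks the object as fitted.
        self.X = X
        self.y = y
        return self  # returning self enables method chaining

    def summary(self):
        # Guard: refuse to work before fit() has populated the attributes.
        if not hasattr(self, "X"):
            raise NotFittedError("Call 'fit' before using this method.")
        return len(self.X)

n_rows = TinyPlot().fit([[1.0, 2.0], [3.0, 4.0]]).summary()
print(n_rows)  # 2
```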
- fit_transform(X, y=None, **fit_params)
Fit and transform at once.
- Parameters:
X (ndarray, M x N matrix where M = m samples and N = n features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
X – The transformed array or dataframe with numerical features
- Return type:
NDArray | DataFrame, shape (M x N)
- property inspect
Inspect data and trigger plot after checking the data entry. Raises NotFittedError if EvalPlot is not fitted yet.
- plotConfusionMatrix(clf, *, kind=None, labels=None, matshow_kws=None, **conf_mx_kws)
Plot confusion matrix for error evaluation.
A representation of the confusion matrix for error visualization. If kind is set to map, the plot gives the number of confused instances/items. When kind is set to error, the number of confused items is expressed as a percentage.
- Parameters:
clf (callable, always as a function, classifier estimator) – A supervised predictor with a finite set of discrete possible output values. A classifier must support modeling at least binary targets. It must store a classes_ attribute after fitting.
labels (int or list of int, optional) – Specific class to evaluate the tradeoff of precision and recall. The label needs to be specified and must be a value within the target.
kind (str) – can be map or error to visualize the matshow of predictions and errors respectively.
matshow_kws (dict) – matplotlib additional keyword arguments.
conf_mx_kws (dict) – Additional confusion matrix keyword arguments.
ylabel (list) – list of label names to hold the name of each category.
Examples
>>> from watex.datasets import fetch_data
>>> from watex.utils.mlutils import cattarget
>>> from watex.exlib.sklearn import SVC
>>> from watex.view.mlplot import EvalPlot
>>> X, y = fetch_data('bagoue', return_X_y=True, as_frame=True)
>>> # partition the target into 4 clusters -> just for demo
>>> b = EvalPlot(scale=True, label_values=4)
>>> b.fit_transform(X, y)
>>> # prepare our estimator
>>> svc_clf = SVC(C=100, gamma=1e-2, kernel='rbf', random_state=42)
>>> matshow_kwargs = {
...     'aspect': 'auto',  # 'auto' or 'equal'
...     'interpolation': None,
...     'cmap': 'jet',
... }
>>> plot_kws = {'lw': 3,
...             'lc': (.9, 0, .8),
...             'font_size': 15.,
...             'cb_format': None,
...             'xlabel': 'Predicted classes',
...             'ylabel': 'Actual classes',
...             'font_weight': None,
...             'tp_labelbottom': False,
...             'tp_labeltop': True,
...             'tp_bottom': False,
...             }
>>> b.plotConfusionMatrix(clf=svc_clf, matshow_kws=matshow_kwargs, **plot_kws)
>>> svc_clf = SVC(C=100, gamma=1e-2, kernel='rbf', random_state=42)
>>> # replace the integer identifiers with literal strings
>>> b.litteral_classes = ['FR0', 'FR1', 'FR2', 'FR3']
>>> b.plotConfusionMatrix(svc_clf, matshow_kws=matshow_kwargs, kind='error', **plot_kws)
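The difference between the map and error kinds described for plotConfusionMatrix can be illustrated with plain Python. This is a minimal sketch under an assumption: the error view is taken here to be the common convention of dividing each row by its total and blanking the diagonal, which may differ in detail from what watex computes. The function names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are actual classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def error_rates(cm):
    """Divide each row by its total and zero the diagonal, so only errors remain."""
    out = []
    for i, row in enumerate(cm):
        total = sum(row) or 1  # avoid division by zero for an empty class
        out.append([0.0 if i == j else v / total for j, v in enumerate(row)])
    return out

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]   # made-up actual classes
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 1]   # made-up predictions
cm = confusion_matrix(y_true, y_pred, 3)  # counts, as in the 'map' view
err = error_rates(cm)                     # rates, as in the 'error' view
print(cm)   # [[1, 1, 0], [0, 2, 1], [1, 1, 3]]
print(err)
```

Normalizing by row makes classes of different sizes comparable, which is why the rate view is usually easier to read than raw counts on an imbalanced target.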
- plotPCA(n_components=None, *, n_axes=None, biplot=False, pc1_label='Axis 1', pc2_label='Axis 2', plot_dict=None, **pca_kws)
Plot PCA component analysis using decomposition.
PCA identifies the axis that accounts for the largest amount of variance in the train set X. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance.
- Parameters:
n_components – Number of dimensions to preserve. If n_components is a float between 0. and 1., it indicates the variance ratio to preserve. If None (the default), the variance ratio to preserve is 95%.
n_axes – Number of important components for which to retrieve the variance ratio. Default is 2, i.e. the first two components with the highest variance ratio.
biplot (bool) – biplot plots the PCA feature importance (pc1 and pc2) and visualizes the level of variance and the direction of the components for the different variables. Refer to Serafeim Loukas.
pc1_label (str, default='Axis 1') – the first component, with the most variance, held in 'Axis 1'. Can be changed to any other axis, for instance 'Axis 3', to replace the component in 'Axis 1' with the one in Axis 3, and so on. This allows visualizing the position of each level of variance for each variable.
pc2_label (str, default='Axis 2') – the second component, with the most remaining variance, held in 'Axis 2'. Can be changed to any other axis, for instance 'Axis 6', to replace the component in 'Axis 2' with the one in Axis 6, and so on.
plot_dict (dict) – dictionary of font and marker properties for each sample corresponding to the label_values.
pca_kws (dict) – additional keyword arguments passed to watex.analysis.dimensionality.nPCA
- Returns:
self – self for easy method chaining.
- Return type:
EvalPlot instance
Notes
By default, the nPCA method plots the first two principal components, named by pc1_label for axis 1 and pc2_label for axis 2. To plot the first component pc1 against the third component, set pc2_label to Axis 3 and set n_components to 3, the maximum number of reduced columns to retrieve; otherwise a user warning is displayed. The algorithm detects the digit 3 in the literal label that includes Axis (e.g. 'Axis 3') and treats it as the third component pc3. The same applies to the other axes.
Examples
>>> from watex.datasets import load_bagoue
>>> from watex.view.mlplot import EvalPlot
>>> X, y = load_bagoue(as_frame=True)
>>> b = EvalPlot(tname='flow', encode_labels=True, scale=True)
>>> b.fit_transform(X, y)
>>> b.plotPCA(n_components=2)
...
>>> # pc1 and pc2 labels > n_components -> raises user warnings
>>> b.plotPCA(n_components=2, biplot=False, pc1_label='Axis 3', pc2_label='axis 4')
... UserWarning: Number of components and axes might be consistent; '2'and '4 are given; default two components are used.
>>> # works fine since n_components is greater than the number of axes
>>> b.plotPCA(n_components=8, biplot=False, pc1_label='Axis3', pc2_label='axis4')
EvalPlot(tname= None, objective= None, scale= True, ... , sns_height= 4.0, sns_aspect= 0.7, verbose= 0)
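The rule that a fractional n_components stands for a variance ratio to preserve (95% by default) can be sketched as a selection over per-component variances. This is an illustrative helper only: the name n_components_for and the variance values are hypothetical, and the sketch assumes the variances are already sorted in descending order, as PCA returns them.

```python
def n_components_for(variances, ratio=0.95):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches `ratio`."""
    total = sum(variances)
    cumulative = 0.0
    for k, var in enumerate(variances, start=1):
        cumulative += var / total
        if cumulative >= ratio - 1e-12:  # small tolerance for float round-off
            return k
    return len(variances)

variances = [6.0, 2.0, 1.0, 0.7, 0.3]  # made-up values, sorted descending
k = n_components_for(variances)
print(k)  # 4, since the first three components only reach 90%
```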
- plotPR(clf, label, kind=None, method=None, cvp_kws=None, **prt_kws)
Precision/recall (PR) and tradeoff plots.
PR computes a score based on the decision function and plots the result as a score vs threshold.
- Parameters:
clf (callable, always as a function, classifier estimator) – A supervised predictor with a finite set of discrete possible output values. A classifier must support modeling at least binary targets. It must store a classes_ attribute after fitting.
label (int) – Specific class to evaluate the tradeoff of precision and recall. The label needs to be specified and must be a value within the target.
kind (str, ['threshold'|'recall'], default='threshold') – kind of PR plot. If kind is 'recall', the method plots the precision vs the recall scores; otherwise the PR tradeoff is plotted against the threshold.
method (str) – Method to get scores from each instance in the training set. Can be decision_function or predict_proba. A scikit-learn classifier generally has one of these methods. Default is decision_function.
cvp_kws (dict, optional) – Additional keyword arguments of sklearn.model_selection.cross_val_predict().
prt_kws (dict) – Additional keyword arguments passed to watex.exlib.sklearn.precision_recall_tradeoff.
Examples
>>> from watex.exlib.sklearn import SGDClassifier
>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils import cattarget
>>> from watex.view.mlplot import EvalPlot
>>> X, y = load_bagoue(as_frame=True)
>>> sgd_clf = SGDClassifier(random_state=42)  # our estimator
>>> b = EvalPlot(scale=True, encode_labels=True)
>>> b.fit_transform(X, y)
>>> # binarize the label b.y
>>> ybin = cattarget(b.y, labels=2)  # can also use labels=[0, 1]
>>> b.y = ybin
>>> # plot the precision-recall tradeoff
>>> b.plotPR(sgd_clf, label=1)  # class=1
EvalPlot(tname= None, objective= None, scale= True, ... , sns_height= 4.0, sns_aspect= 0.7, verbose= 0)
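The tradeoff that plotPR draws against the threshold can be reproduced by hand on a few scores. The sketch below is independent of watex and scikit-learn: the function name, the scores, and the labels are made up, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(scores, y_true, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pairs = list(zip(scores, y_true))
    tp = sum(1 for s, t in pairs if s >= threshold and t == 1)
    fp = sum(1 for s, t in pairs if s >= threshold and t == 0)
    fn = sum(1 for s, t in pairs if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

scores = [-2.0, -1.0, -0.5, 0.3, 0.8, 1.5, 2.2]  # decision scores (made up)
y_true = [0, 0, 1, 0, 1, 1, 1]                   # binary target (made up)
for thr in (-1.0, 0.0, 1.0):
    p, r = precision_recall_at(scores, y_true, thr)
    print(f"threshold={thr:+.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold from -1.0 to 1.0 lifts precision from 0.67 to 1.00 while recall drops from 1.00 to 0.50, which is the tradeoff the plot makes visible.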
- plotROC(clfs, label, method=None, cvp_kws=None, **roc_kws)
Plot the receiver operating characteristic (ROC) of classifiers.
Can plot multiple classifiers at once. If multiple classifiers are given, each classifier must be a tuple of (<name>, <classifier>, <method>). For instance, to plot both the sklearn.ensemble.RandomForestClassifier and sklearn.linear_model.SGDClassifier classifiers, they must be arranged as follows:
clfs = [('sgd', SGDClassifier(), "decision_function"),
        ('forest', RandomForestClassifier(), "predict_proba")]
It is important to check that the chosen method (e.g. 'predict_proba') is valid for the scikit-learn classifier whose ROC curve we want to plot.
- Parameters:
clfs (callables, always as a function, classifier estimators) – A supervised predictor with a finite set of discrete possible output values. A classifier must support modeling at least binary targets. It must store a classes_ attribute after fitting.
label (int) – Specific class to evaluate the tradeoff of precision and recall. The label needs to be specified and must be a value within the target.
kind (str, ['threshold'|'recall'], default='threshold') – kind of PR plot. If kind is 'recall', the method plots the precision vs the recall scores; otherwise the PR tradeoff is plotted against the threshold.
method (str) – Method to get scores from each instance in the training set. Can be decision_function or predict_proba. A scikit-learn classifier generally has one of these methods. Default is decision_function.
cvp_kws (dict, optional) – Additional keyword arguments of sklearn.model_selection.cross_val_predict().
prt_kws (dict) – Additional keyword arguments passed to watex.exlib.sklearn.precision_recall_tradeoff.
roc_kws (dict) – roc_curve additional keyword arguments.
- Returns:
self – self for easy method chaining.
- Return type:
EvalPlot instance
Examples
(1)-> Plot ROC for a single classifier
>>> from watex.exlib.sklearn import (SGDClassifier, RandomForestClassifier)
>>> from watex.datasets.dload import load_bagoue
>>> from watex.utils import cattarget
>>> from watex.view.mlplot import EvalPlot
>>> X, y = load_bagoue(as_frame=True)
>>> sgd_clf = SGDClassifier(random_state=42)  # our estimator
>>> b = EvalPlot(scale=True, encode_labels=True)
>>> b.fit_transform(X, y)
>>> # binarize the label b.y
>>> ybin = cattarget(b.y, labels=2)  # can also use labels=[0, 1]
>>> b.y = ybin
>>> # plot the ROC
>>> b.plotROC(sgd_clf, label=1)  # class=1
EvalPlot(tname= None, objective= None, scale= True, ... , sns_height= 4.0, sns_aspect= 0.7, verbose= 0)
(2)-> Plot ROC for multiple classifiers
>>> b = EvalPlot(scale=True, encode_labels=True, lw=3., lc=(.9, 0, .8), font_size=7)
>>> sgd_clf = SGDClassifier(random_state=42)
>>> forest_clf = RandomForestClassifier(random_state=42)
>>> b.fit_transform(X, y)
>>> # binarize the label b.y
>>> ybin = cattarget(b.y, labels=2)  # can also use labels=[0, 1]
>>> b.y = ybin
>>> clfs = [('sgd', sgd_clf, "decision_function"),
...         ('forest', forest_clf, "predict_proba")]
>>> b.plotROC(clfs=clfs, label=1)
EvalPlot(tname= None, objective= None, scale= True, ... , sns_height= 4.0, sns_aspect= 0.7, verbose= 0)
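What a ROC curve contains, whichever scoring method feeds it, can be shown with a short standalone sketch. It does not call watex or scikit-learn: the function names and data are hypothetical, and the sketch assumes distinct scores (ties would need to be grouped before emitting a point).

```python
def roc_points(scores, y_true):
    """(FPR, TPR) pairs obtained by lowering the threshold one score at a time."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit instances from the highest score to the lowest.
    for _, t in sorted(zip(scores, y_true), reverse=True):
        if t == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [-2.0, -1.0, -0.5, 0.3, 0.8, 1.5, 2.2]  # made-up scores
y_true = [0, 0, 1, 0, 1, 1, 1]                   # made-up binary target
pts = roc_points(scores, y_true)
area = auc(pts)
print(pts[-1], round(area, 4))  # (1.0, 1.0) 0.9167
```

The area equals the fraction of (positive, negative) pairs ranked in the right order: here 11 of the 12 pairs, since only the positive scored -0.5 falls below the negative scored 0.3.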
- transform(X, **t_params)
Transform the data and impute the numerical features.
It is not convenient to use transform if the user wants to keep categorical values in the array.
- Parameters:
X (ndarray, M x N matrix where M = m samples and N = n features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
t_params (dict) – Keyword arguments passed to sklearn.impute.SimpleImputer for imputing the missing data (default strategy is 'most_frequent'), or keyword arguments passed to watex.utils.funcutils.to_numeric_dtypes.
- Returns:
X – The transformed array or dataframe with numerical features
- Return type:
NDArray | DataFrame, shape (M x N)
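The 'most_frequent' filling strategy mentioned for transform can be pictured on a single column. This is a plain-Python sketch of the idea, not the SimpleImputer implementation: the function name is hypothetical, missing entries are represented by None rather than NaN, and ties are resolved by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter

def impute_most_frequent(column):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to learn from; leave the column unchanged
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

col = [3.0, None, 1.0, 3.0, None, 2.0]   # made-up feature column
filled = impute_most_frequent(col)
print(filled)  # [3.0, 3.0, 1.0, 3.0, 3.0, 2.0]
```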
- watex.view.mlplot.biPlot(self, Xr, components, y, classes=None, markers=None, colors=None)
The biplot is the best way to visualize everything at once following a PCA analysis.
There is an implementation in R but there is no standard implementation in Python.
- Parameters:
self (watex.property.BasePlot) – Matplotlib properties from a BasePlot instance. A default BasePlot instance is provided as pobj and can be loaded for plotting purposes as:
>>> from watex.view import pobj
To change some default plot properties such as line width or style, set them before running the script as follows:
>>> pobj.lw = 2. ; pobj.ls = ':'  # and so on
Xr (NDArray of transformed X) – the PCA projected data scores on the n given components. The reduced dimension of the train set X, with maximum ratio as sorted eigenvectors from the first to the last component.
components (NDArray, shape (n_components, n_eigenvectors)) – the eigenvectors of the PCA. The shape must match the number of components computed using PCA. If shape 1 of Xr equals shape 0 of the components matrix, the latter is transposed to fit shape 1 of Xr.
y (array-like) – the target composing the class labels.
classes (list or int) – class categories or class labels.
markers (str) – Matplotlib list of markers for plotting classes.
colors (str) – Matplotlib list of colors to customize plots.
Examples
>>> from watex.analysis import nPCA
>>> from watex.datasets import fetch_data
>>> from watex.view import biPlot, pobj  # pobj is a BasePlot instance
>>> X, y = fetch_data('bagoue pca')  # fetch pca data
>>> pca = nPCA(X, n_components=2, return_X=False)  # return PCA object
>>> components = pca.components_[:2, :]  # for two components
>>> biPlot(pobj, pca.X, components, y)  # pca.X is the reduced dim X
>>> # to change for instance line width (lw) or style (ls)
>>> # just use the BasePlot object (pobj)
References
Originally written by Serafeim Loukas, serafeim.loukas@epfl.ch and was edited to fit the watex package API.
- watex.view.mlplot.plot2d(ar, y=None, x=None, distance=50.0, stnlist=None, prefix='S', how='py', to_log10=False, plot_contours=False, top_label='', **baseplot_kws)
Two-dimensional template for visualizing matrices.
It is a wrapper that can plot any matrix by customizing the positions x and y. By default, x is taken as the stations and y as the resistivity log data.
- Parameters:
ar (array-like 2D, shape (M, N)) – 2D array for plotting. For instance, it can be a 2D resistivity collected at all stations (N) and all frequencies (M).
y (array-like, default=None) – Y-coordinates. It should have the length M, the number of rows of ar.
x (array-like, default=None) – X-coordinates. It should have the length N, the number of columns of ar. Note that if x is given, distance is not needed.
distance (float) – The step between two stations. If given, it creates an array of positions for plotting purposes. Default value is 50 meters.
stnlist (list of str) – List of station names. If given, it should have the same length as the number of columns N of ar.
prefix (str) – string value to add as a prefix of the given id. The prefix can be the site name. Default is S.
how (str) – Mode to index the stations. Default is 'Python indexing', i.e. the counting of stations starts at 0. Any other mode starts the counting at 1.
to_log10 (bool, default=False) – Recompute ar in logarithm base 10 values. Note that when True, y should also be in log10.
plot_contours (bool, default=False) – Plot the contour map. It is available only if the plot_style is set to pcolormesh.
top_label (str) – Name of the top label.
baseplot_kws (dict) – All the keyword arguments passed to the watex.property.BasePlot class.
- Returns:
axe
- Return type:
<AxesSubplot> object
Examples
>>> import numpy as np
>>> import watex
>>> np.random.seed(42)
>>> data = np.random.randn(15, 20)
>>> data_nan = data.copy()
>>> data_nan[2, 1] = np.nan; data_nan[4, 2] = np.nan; data_nan[6, 3] = np.nan
>>> watex.view.mlplot.plot2d(data)
<AxesSubplot:xlabel='Distance(m)', ylabel='log10(Frequency)[Hz]'>
>>> watex.view.mlplot.plot2d(data_nan, plt_style='imshow', fig_size=(10, 4))
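How distance, prefix and how could combine into station positions and names is sketched below. The helper station_axis is hypothetical and is not part of watex; in particular the two-digit zero padding of the names is an assumption made for the illustration.

```python
def station_axis(n_stations, distance=50.0, prefix="S", how="py"):
    """Return (positions, names) for equally spaced stations."""
    # Positions start at 0 and increase by one step per station.
    positions = [i * distance for i in range(n_stations)]
    # 'py' counts stations from 0; any other mode counts from 1.
    start = 0 if how == "py" else 1
    names = [f"{prefix}{i + start:02d}" for i in range(n_stations)]
    return positions, names

positions, names = station_axis(3)
print(positions)  # [0.0, 50.0, 100.0]
print(names)      # ['S00', 'S01', 'S02']
```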
- watex.view.mlplot.plotDendrogram(df, columns=None, labels=None, metric='euclidean', method='complete', kind=None, return_r=False, verbose=False, **kwd)
Visualizes the linkage matrix as a dendrogram.
Note that categorical features, if they exist in the dataframe, are automatically discarded.
- Parameters:
df (dataframe or NDArray of (n_samples, n_features)) – dataframe or ndarray. If an array is given, the column names must be specified to match the array shape 1.
columns (list) – list of labels to name each column of the array of (n_samples, n_features). If a dataframe is given, there is no need to specify the columns.
kind (str, ['squareform'|'condense'|'design'], default is {'design'}) – kind of approach for building the linkage matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix. This is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. Using the squareform distance matrix instead could yield different distance values than expected. The design approach uses the complete input example matrix, also called the 'design matrix', and leads to a correct linkage matrix similar to squareform and condense.
metric (str or callable, default is {'euclidean'}) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use "precomputed" as the metric. Precomputed distance matrices must have 0 along the diagonal.
method (str, optional, default is {'complete'}) – The linkage algorithm to use. See the Linkage Methods section in watex.utils.exmath.linkage_matrix() for full descriptions.
labels (ndarray, optional) – By default, labels is None, so the index of the original observation is used to label the leaf nodes. Otherwise, this is an \(n\)-sized sequence, with n == Z.shape[0] + 1. The labels[i] value is the text to put under the \(i\)-th leaf node only if it corresponds to an original observation and not a non-singleton cluster.
return_r (bool, default='False') – return the r dictionary if set to 'True', otherwise return nothing.
verbose (int, bool, default='False') – If True, output a message with the names of the categorical features dropped.
kwd (dict) – additional keyword arguments passed to scipy.cluster.hierarchy.dendrogram()
- Returns:
r – A dictionary of data structures computed to render the dendrogram. It has the following keys:
'color_list' – A list of color names. The k'th element represents the color of the k'th link.
'icoord' and 'dcoord' – Each of them is a list of lists. Let icoord = [I1, I2, ..., Ip] where Ik = [xk1, xk2, xk3, xk4] and dcoord = [D1, D2, ..., Dp] where Dk = [yk1, yk2, yk3, yk4], then the k'th link painted is (xk1, yk1)-(xk2, yk2)-(xk3, yk3)-(xk4, yk4).
'ivl' – A list of labels corresponding to the leaf nodes.
'leaves' – For each i, H[i] == j, cluster node j appears in position i in the left-to-right traversal of the leaves, where \(j < 2n-1\) and \(i < n\). If j is less than n, the i-th leaf node corresponds to an original observation. Otherwise, it corresponds to a non-singleton cluster.
'leaves_color_list' – A list of color names. The k'th element represents the color of the k'th leaf.
- Return type:
dict
Examples
>>> from watex.datasets import load_iris
>>> from watex.view import plotDendrogram
>>> data = load_iris()
>>> X = data.data[:, :2]
>>> plotDendrogram(X, columns=['X1', 'X2'])
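The condensed and square forms of a distance matrix mentioned under kind can be compared on three points. The sketch uses only the standard library: the function names are hypothetical, the metric is fixed to Euclidean, and the pair ordering (upper triangle, row by row) is the one the text attributes to pdist.

```python
from itertools import combinations
from math import dist

def condensed_distances(X):
    """Flat list of pairwise Euclidean distances: the upper triangle, row by row."""
    return [dist(a, b) for a, b in combinations(X, 2)]

def to_square(condensed, m):
    """Expand a condensed list into a symmetric m x m matrix with a zero diagonal."""
    square = [[0.0] * m for _ in range(m)]
    for (i, j), d in zip(combinations(range(m), 2), condensed):
        square[i][j] = square[j][i] = d
    return square

X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # three made-up observations
c = condensed_distances(X)
sq = to_square(c, len(X))
print(c)   # [5.0, 10.0, 5.0]
print(sq)  # [[0.0, 5.0, 10.0], [5.0, 0.0, 5.0], [10.0, 5.0, 0.0]]
```

For m observations the condensed form holds m(m-1)/2 values, so it carries the same information as the square form in roughly half the space.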
- watex.view.mlplot.plotDendroheat(df, columns=None, labels=None, metric='euclidean', method='complete', kind='design', cmap='hot_r', fig_size=(8, 8), facecolor='white', **kwd)[source]#
Attaches dendrogram to a heat map.
Hierachical dendrogram are often used in combination with a heat map which allows us to represent the individual value in data array or matrix containing our training examples with a color code.
- Parameters:
df (dataframe or NDArray of (n_samples, n_features)) – Dataframe or ndarray. If an array is given, the column names must be specified and must match the array shape along axis 1.
columns (list) – List of labels to name each column of an array of shape (n_samples, n_features). If a dataframe is given, there is no need to specify the columns.
kind (str, ['squareform'|'condense'|'design'], default is {'design'}) – Kind of approach used to build the linkage matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix. This is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. Alternatively, the squareform distance matrix could be used, which yields different distance values than expected. The design approach uses the complete input example matrix, also called the 'design matrix', to yield a correct linkage matrix, similar to squareform and condense.
metric (str or callable, default is {'euclidean'}) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
method (str, optional, default is {'complete'}) – The linkage algorithm to use. See the Linkage Methods section in watex.utils.exmath.linkage_matrix() for full descriptions.
labels (ndarray, optional) – By default, labels is None so the index of the original observation is used to label the leaf nodes. Otherwise, this is an \(n\)-sized sequence, with n == Z.shape[0] + 1. The labels[i] value is the text to put under the \(i\)-th leaf node only if it corresponds to an original observation and not a non-singleton cluster.
cmap (str, default is {'hot_r'}) – Matplotlib color map.
fig_size (str, Tuple, default is {(8, 8)}) – The size of the figure.
facecolor (str, default is {"white"}) – Matplotlib facecolor.
kwd (dict) – Additional keyword arguments passed to scipy.cluster.hierarchy.dendrogram().
Examples
>>> # (1) -> Use random data
>>> import numpy as np
>>> import pandas as pd
>>> from watex.view.mlplot import plotDendroheat
>>> np.random.seed(123)
>>> variables = ['X', 'Y', 'Z']
>>> labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']
>>> X = np.random.random_sample([5, 3]) * 10
>>> df = pd.DataFrame(X, columns=variables, index=labels)
>>> plotDendroheat(df)
>>> # (2) -> Use Bagoue data
>>> from watex.datasets import load_bagoue
>>> X, y = load_bagoue(as_frame=True)
>>> X = X[['magnitude', 'power', 'sfi']].astype(float)  # convert to float
>>> plotDendroheat(X)
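The difference between the condensed form, the square form and the design matrix mentioned under kind can be seen with SciPy alone. This is a minimal sketch on random data; how watex maps each kind onto these calls internally is not shown here and is not assumed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(123)
X = rng.random((5, 3)) * 10                 # design matrix: m=5 samples, n=3 features

condensed = pdist(X, metric="euclidean")    # flat upper triangle: m*(m-1)/2 values
square = squareform(condensed)              # symmetric m x m matrix, zero diagonal

# Linkage from the condensed distances and from the design matrix itself.
Z_condensed = linkage(condensed, method="complete")
Z_design = linkage(X, method="complete", metric="euclidean")

print(condensed.shape, square.shape)        # (10,) (5, 5)
print(np.allclose(Z_condensed, Z_design))   # True
```

Passing the condensed distances or the design matrix gives the same linkage matrix because SciPy computes pdist internally for a 2-D input. Passing the square matrix to linkage() would instead treat each row as an observation, which is why it produces different distance values.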
- watex.view.mlplot.plotLearningInspection(model, X, y, axes=None, ylim=None, cv=5, n_jobs=None, train_sizes=None, display_legend=True, title=None)[source]#
Inspect model from its learning curve.
Generate 3 plots: the test and training learning curve, the training samples vs fit times curve, the fit times vs score curve.
- Parameters:
model (estimator instance) – An estimator instance implementing fit and predict methods which will be cloned for each validation.
title (str) – Title for the chart.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples) or (n_samples, n_features)) – Target relative to X for classification or regression; None for unsupervised learning.
axes (array-like of shape (3,), default=None) – Axes to use for plotting the curves.
ylim (tuple of shape (2,), default=None) – Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).
cv (int, cross-validation generator or an iterable, default=5) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 5-fold cross-validation,
integer, to specify the number of folds.
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used. If the estimator is not a classifier or if y is neither binary nor multiclass, KFold is used. Refer to the User Guide for the various cross-validators that can be used here.
n_jobs (int or None, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
train_sizes (array-like of shape (n_ticks,)) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class. (default: np.linspace(0.1, 1.0, 5))
display_legend (bool, default=True) – Display the legend.
- Returns:
axes
- Return type:
Matplotlib axes
Examples
>>> from watex.datasets import fetch_data
>>> from watex.models import p
>>> from watex.view.mlplot import plotLearningInspection
>>> # import sparse matrix from Bagoue datasets
>>> X, y = fetch_data('bagoue prepared')
>>> # import the pretrained Radial Basis Function (RBF) from SVM
>>> plotLearningInspection(p.SVM.rbf.best_estimator_, X, y)
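The quantities behind the three plots can be obtained with scikit-learn's learning_curve(). The sketch below is an assumption about what such an inspection computes, not the watex implementation; the synthetic dataset and the estimator are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for a prepared dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    SVC(kernel="rbf"), X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    return_times=True,          # also return fit and score times
)

# Average over the cv folds (axis 1): one value per training-set size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
fit_mean = fit_times.mean(axis=1)

# Plot 1: sizes vs train_mean and test_mean (learning curve)
# Plot 2: sizes vs fit_mean (scalability)
# Plot 3: fit_mean vs test_mean (performance against fit time)
print(sizes)
print(train_scores.shape, fit_times.shape)
```

Each returned array has one row per training-set size and one column per cross-validation fold.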
- watex.view.mlplot.plotLearningInspections(models, X, y, fig_size=(22, 18), cv=None, savefig=None, titles=None, subplot_kws=None, **kws)[source]#
Inspect multiple models from their learning curves.
Multiple inspection plots that generate, for each model, the test and training learning curve, the training samples vs fit times curve, and the fit times vs score curve.
- Parameters:
models (list of estimator instances) – Each estimator instance implements fit and predict methods which will be cloned for each validation.
X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples) or (n_samples, n_features)) – Target relative to X for classification or regression; None for unsupervised learning.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 5-fold cross-validation,
integer, to specify the number of folds.
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used. If the estimator is not a classifier or if y is neither binary nor multiclass, KFold is used. Refer to the scikit-learn User Guide for the various cross-validators that can be used here.
savefig (str, default=None) – The path to save the figures. Argument is passed to matplotlib.Figure class.
titles (str, list) – List of model names if changes are needed. If None, model names are used by default.
kws (dict) – Additional keyword arguments passed to plotLearningInspection().
- Returns:
axes
- Return type:
Matplotlib axes
See also
plotLearningInspection : Inspect a single model
Examples
>>> from watex.datasets import fetch_data
>>> from watex.models.premodels import p
>>> from watex.view.mlplot import plotLearningInspections
>>> # import sparse matrix from Bagoue dataset
>>> X, y = fetch_data('bagoue prepared')
>>> # import the two pretrained models from SVM
>>> models = [p.SVM.rbf.best_estimator_, p.SVM.poly.best_estimator_]
>>> plotLearningInspections(models, X, y, ylim=(0.7, 1.01))
- watex.view.mlplot.plotModel(yt, ypred=None, *, clf=None, Xt=None, predict=False, prefix=None, index=None, fill_between=False, labels=None, return_ypred=False, **baseplot_kws)[source]#
Plot model ‘y’ (true labels) versus ‘ypred’ (predicted) from test data.
The plot shows where the estimator/classifier fails to predict the target correctly.
- Parameters:
- yt:array-like, shape (M, ) ``M=m-samples``,
test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- ypred:array-like, shape (M, ) ``M=m-samples``
Array of the predicted labels. It has the same number of samples as the test data ‘Xt’
- clf :callable, always as a function, classifier estimator
A supervised predictor with a finite set of discrete possible output values. A classifier must support modeling some kind of target, for example binary targets. It must store a classes_ attribute after fitting.
- Xt: Ndarray ( M x N matrix where ``M=m-samples``, & ``N=n-features``)
Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
- prefix: str, optional
literal string used to prefix the samples/examples shown as tick labels on the abscissa. For instance:
index =[0, 2, 4, 7] prefix ='b' --> index =['b0', 'b2', 'b4', 'b7']
- predict: bool, default=False,
Expected to be ‘True’ when the user wants to predict the array ‘ypred’ and plot at the same time. Otherwise, it can be set to ‘False’ to use ‘ypred’ data already predicted. Note that if ‘True’, an estimator/classifier must be provided as well as the test data ‘Xt’, otherwise an error will occur.
- index: array_like, optional
list of integer or string values expected to be the index of ‘Xt’ and ‘yt’ once turned into a pandas dataframe and series respectively. Note that if one of them already has an index and a new index is given, the latter must be consistent. This is useful when data are provided as ndarray rather than a dataframe.
- fill_between: bool
Fill a line between the actual classes i.e the true labels.
- labels: list of str or int, Optional
list of labels names to hold the name of each category.
- return_ypred: bool,
return the predicted ‘ypred’ if ‘True’, else nothing.
- baseplot_kws: dict,
All the keyword arguments passed to the
watex.property.BasePlot class.
(2) -> Prepare the demo estimator and plot the model predictions
>>> svc_clf = SVC(C=100, gamma=1e-2, kernel='rbf', random_state=42)
>>> base_plot_params = {
...     'lw': 3.,  # line width
...     'lc': (.9, 0, .8),
...     'ms': 7.,
...     'yp_marker': 'o',
...     'fig_size': (12, 8),
...     'font_size': 15.,
...     'xlabel': 'Test examples',
...     'ylabel': 'Flow categories',
...     'marker': 'o',
...     'markeredgecolor': 'k',
...     'markerfacecolor': 'b',
...     'markeredgewidth': 3,
...     'yp_markerfacecolor': 'k',
...     'yp_markeredgecolor': 'r',
...     'alpha': 1.,
...     'yp_markeredgewidth': 2.,
...     'show_grid': True,
...     'galpha': 0.2,
...     'glw': .5,
...     'rotate_xlabel': 90.,
...     'fs': 3.,
...     's': 20,
... }
>>> plotModel(yt=ytest,
...           Xt=Xtest,
...           predict=True,  # predict the result (estimator fit)
...           clf=svc_clf,
...           fill_between=False,
...           prefix='b',
...           labels=['FR0', 'FR1', 'FR2', 'FR3'],  # replace 'y' labels
...           **base_plot_params)
>>> # the plot shows where the model failed to predict the target 'yt'
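The effect of prefix on the tick labels, and the idea of locating the samples the model gets wrong, can be sketched with NumPy only. The helper prefix_index below is hypothetical and written for illustration; it is not part of watex, and the label arrays are made-up values.

```python
import numpy as np

def prefix_index(index, prefix=None):
    """Hypothetical helper: build tick labels such as 'b0', 'b2' from an index."""
    if prefix is None:
        return list(index)
    return [f"{prefix}{i}" for i in index]

yt = np.array([0, 1, 2, 1])        # true labels of four test samples (made up)
ypred = np.array([0, 2, 2, 1])     # labels predicted by some classifier (made up)

ticks = prefix_index([0, 2, 4, 7], prefix="b")
missed = np.flatnonzero(yt != ypred)      # positions where the prediction fails

print(ticks)                              # ['b0', 'b2', 'b4', 'b7']
print([ticks[i] for i in missed])         # ['b2']
```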
- watex.view.mlplot.plotProjection(X, Xt=None, *, columns=None, test_kws=None, **baseplot_kws)[source]#
Visualize train and test dataset based on the geographical coordinates.
Since the data contain geographical information (latitude/longitude or easting/northing), it is a good idea to create a scatter plot of all instances to visualize them.
- Parameters:
X (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample.
Xt (Ndarray ( M x N matrix where M=m-samples, & N=n-features)) – Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.
columns (list of str or index, optional) – columns is useful when a dataframe is given with a dimension size greater than 2. If such data is passed to X or Xt, columns must hold the names to be considered as ‘easting’, ‘northing’ when UTM coordinates are given or ‘latitude’, ‘longitude’ when latlon are given. If the dimension size is greater than 2 and columns is None, an error is raised to prompt the user to provide the index for retrieving the ‘y’ and ‘x’ coordinates.
test_kws (dict) – Keyword arguments passed to matplotlib.plot.scatter() as test location font and color properties.
baseplot_kws (dict) – All the keyword arguments passed to the watex.property.BasePlot class.
Examples
>>> from watex.datasets import fetch_data
>>> from watex.view.mlplot import plotProjection
>>> # Discard all the non-numeric data
>>> # then impute numerical data
>>> from watex.utils import to_numeric_dtypes, naive_imputer
>>> X, Xt, *_ = fetch_data('bagoue', split_X_y=True, as_frame=True)
>>> X = to_numeric_dtypes(X, pop_cat_features=True)
>>> X = naive_imputer(X)
>>> Xt = to_numeric_dtypes(Xt, pop_cat_features=True)
>>> Xt = naive_imputer(Xt)
>>> plot_kws = dict(fig_size=(8, 12),
...                 lc='k',
...                 marker='o',
...                 lw=3.,
...                 font_size=15.,
...                 xlabel='easting (m) ',
...                 ylabel='northing (m)',
...                 markerfacecolor='k',
...                 markeredgecolor='r',
...                 alpha=1.,
...                 markeredgewidth=2.,
...                 show_grid=True,
...                 galpha=0.2,
...                 glw=.5,
...                 rotate_xlabel=90.,
...                 fs=3.,
...                 s=None)
>>> plotProjection(X, Xt, columns=['east', 'north'],
...                trainlabel='train location',
...                testlabel='test location',
...                **plot_kws)
- watex.view.mlplot.plotSilhouette(X, labels=None, prefit=True, n_clusters=3, n_init=10, max_iter=300, random_state=None, tol=10000.0, metric='euclidean', **kwd)[source]#
Quantifies the quality of clustering samples.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
labels (array-like 1d of shape (n_samples,)) – Label values for each sample.
n_clusters (int, default=3) – The number of clusters to form as well as the number of centroids to generate.
prefit (bool, default=True) – Whether prefit labels are expected to be passed into the function directly or not. If True, labels must be the predicted values of an already fitted model. If False, labels is fitted and updated from X by calling the fit_predict method. Any other value passed to labels is discarded.
n_init (int, default=10) – Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
max_iter (int, default=300) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, default=10000.0) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
verbose (int, default=0) – Verbosity mode.
random_state (int, RandomState instance or None, default=None) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
metric (str or callable, default='euclidean') – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.
**kwds (optional keyword parameters) – Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.
Note
The silhouette coefficient is bounded between -1 and 1.
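The bound stated in the note can be checked with scikit-learn directly. The following is a sketch under the assumption that the labels come from a k-means fit_predict call, as the prefit description suggests; it uses synthetic blobs rather than a watex dataset and does not reproduce the plot itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Cluster, then compute one silhouette coefficient per sample.
km = KMeans(n_clusters=3, n_init=10, max_iter=300, tol=1e-4, random_state=0)
labels = km.fit_predict(X)
s = silhouette_samples(X, labels, metric="euclidean")

print(s.shape)                                    # one coefficient per sample
print(bool(s.min() >= -1.0 and s.max() <= 1.0))   # coefficients stay in [-1, 1]
# The overall silhouette score is the mean of the per-sample coefficients.
print(bool(np.isclose(s.mean(), silhouette_score(X, labels))))
```

Coefficients close to 1 indicate samples that sit well inside their cluster, while values near 0 or below suggest overlapping or misassigned samples.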
- watex.view.mlplot.plot_matshow(arr, /, labelx=None, labely=None, matshow_kws=None, **baseplot_kws)[source]#
Quick matrix visualization using matplotlib.pyplot.matshow.
- Parameters:
arr (2D ndarray) – Matrix of n rows and m columns.
matshow_kws (dict) – Additional keyword arguments for matplotlib.axes.matshow().
labelx (list of str, optional) – List of label names for each category on the x-axis. It must be consistent with the number of columns of arr.
labely (list of str, optional) – List of label names for each category on the y-axis. It must be consistent with the number of rows of arr.
Examples
>>> import numpy as np
>>> from watex.view.mlplot import plot_matshow
>>> matshow_kwargs = {
...     'aspect': 'auto',
...     'interpolation': None,
...     'cmap': 'copper_r',
... }
>>> baseplot_kws = {
...     'lw': 3,
...     'lc': (.9, 0, .8),
...     'font_size': 15.,
...     'cb_format': None,
...     # 'cb_label': 'Rate of prediction',
...     'xlabel': 'Predicted flow classes',
...     'ylabel': 'Geological rocks',
...     'font_weight': None,
...     'tp_labelbottom': False,
...     'tp_labeltop': True,
...     'tp_bottom': False,
... }
>>> labelx = ['FR0', 'FR1', 'FR2', 'FR3', 'Rates']
>>> labely = ['VOLCANO-SEDIM. SCHISTS', 'GEOSYN. GRANITES',
...           'GRANITES', '1.0', 'Rates']
>>> array2d = np.array([(1., .5, 1., 1., .9286),
...                     (.5, .8, 1., .667, .7692),
...                     (.7, .81, .7, .5, .7442),
...                     (.667, .75, 1., .75, .82),
...                     (.9091, 0.8064, .7, .8667, .7931)])
>>> plot_matshow(array2d, labelx, labely, matshow_kwargs, **baseplot_kws)
- watex.view.mlplot.plot_model_scores(models, scores=None, cv_size=None, **baseplot_kws)[source]#
Uses cross-validation to estimate how well the models generalize.
It visualizes the fine-tuned scores of the models against the cross-validation folds.
- Parameters:
models (list of callables, always as functions) –
List of estimator names; it can also be a list of (estimator, validation scores) pairs. For instance, estimators and scores can be arranged as:
models =[('SVM', scores_svm), ('LogRegress', scores_logregress), ...]
If that arrangement is passed to the models parameter, there is no need to pass the score values of each estimator in scores. Note that a model is an object which manages the estimation and decoding. The model is estimated as a deterministic function of:
- parameters provided in object construction or with set_params;
- the global numpy.random random state if the estimator’s random_state parameter is set to None; and
- any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.
scores (array like) –
list of scores on different validation sets. If scores are given, pass only the names of the estimators to models, like:
models =['SVM', 'LogRegress', ...] scores=[scores_svm, scores_logregress, ...]
cv_size (float or int) – The number of folds used for validation. If different models have different cross-validation sizes, the minimum cross-validation size is used and the scores of each model are resized to match that minimum size.
baseplot_kws (dict) – All the keyword arguments passed to the watex.property.BasePlot class.
Examples
(1) -> Score is appended to the model
>>> from watex.exlib.sklearn import SVC
>>> from watex.view.mlplot import plot_model_scores
>>> import numpy as np
>>> svc_model = SVC()
>>> fake_scores = np.random.permutation(np.arange(0, 1, .05))
>>> plot_model_scores([(svc_model, fake_scores)])
…
(2) -> Use model and score separately
>>> plot_model_scores([svc_model], scores=[fake_scores])
>>> # customize plot by passing keywords properties
>>> base_plot_params = {
...     'lw': 3.,
...     'lc': (.9, 0, .8),
...     'ms': 7.,
...     'fig_size': (12, 8),
...     'font_size': 15.,
...     'xlabel': 'samples',
...     'ylabel': 'scores',
...     'marker': 'o',
...     'alpha': 1.,
...     'yp_markeredgewidth': 2.,
...     'show_grid': True,
...     'galpha': 0.2,
...     'glw': .5,
...     'rotate_xlabel': 90.,
...     'fs': 3.,
...     's': 20,
...     'sns_style': 'darkgrid',
... }
>>> plot_model_scores([svc_model], scores=[fake_scores], **base_plot_params)
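The two accepted arrangements of models and scores, and the resizing to the smallest cross-validation size, can be illustrated as follows. The helper split_models_scores is hypothetical, and treating "resized" as truncation to the shortest score array is an assumption, not confirmed watex behaviour; the score values are made up.

```python
import numpy as np

def split_models_scores(models, scores=None):
    """Hypothetical helper: normalise both input arrangements.

    Assumption: scores are truncated to the shortest run of folds.
    """
    if scores is None:
        # Arrangement 1: list of (name, scores) pairs.
        names, scores = zip(*models)
    else:
        # Arrangement 2: names and scores given separately.
        names = models
    cv_size = min(len(s) for s in scores)
    trimmed = [np.asarray(s, dtype=float)[:cv_size] for s in scores]
    return list(names), trimmed, cv_size

scores_svm = [0.81, 0.79, 0.84, 0.80, 0.83]      # 5 folds (made-up values)
scores_logregress = [0.75, 0.78, 0.77]           # 3 folds (made-up values)

names_a, trimmed_a, size_a = split_models_scores(
    [("SVM", scores_svm), ("LogRegress", scores_logregress)])
names_b, trimmed_b, size_b = split_models_scores(
    ["SVM", "LogRegress"], scores=[scores_svm, scores_logregress])

print(names_a, size_a)                 # ['SVM', 'LogRegress'] 3
print([len(s) for s in trimmed_b])     # [3, 3]
```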
- watex.view.mlplot.plot_reg_scoring(reg, X, y, test_size=None, random_state=42, scoring='mse', return_errors=False, **baseplot_kws)[source]#
Plot regressor learning curves using mean squared error (MSE) or root-mean-squared error (RMSE) scoring.
Use the hold-out cross-validation technique for score evaluation [1].
- Parameters:
reg (callable, always as a function) – A regression estimator; Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. The core functionality of some estimators may also be available as a function.
X (Ndarray of shape ( M x N), \(M=m-samples\) & \(N=n-features\)) – training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M, ), \(M=m-samples\)) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
scoring (str, ['mse'|'rmse'], default='mse') – Kind of error to visualize on the regression learning curve.
test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
random_state (int, RandomState instance or None, default=42) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
return_errors (bool, default=False) – Return the training errors and validation errors.
baseplot_kws (dict) – All the keyword arguments passed to the watex.property.BasePlot class.
- Returns:
(train_errors, val_errors) – Training scores and validation scores if return_errors is set to True; otherwise returns nothing.
- Return type:
Tuple
Examples
>>> from watex.datasets import fetch_data
>>> from watex.view.mlplot import plot_reg_scoring
>>> # Note that for the demo, we import SVC rather than LinearSVR since the
>>> # problem of the Bagoue dataset is a classification rather than a regression.
>>> # If a regressor is used instead, a convergence problem will occur.
>>> from watex.exlib.sklearn import SVC
>>> X, y = fetch_data('bagoue analysed')  # get the preprocessed and imputed data
>>> svm = SVC()
>>> t_errors, v_errors = plot_reg_scoring(svm, X, y, return_errors=True)
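A hold-out learning curve of this kind can be sketched with scikit-learn alone: the regressor is refitted on growing slices of the training split and scored on both the slice and the held-out split. This is an assumption about the general technique, not the watex implementation; the synthetic data and LinearRegression are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=120, n_features=3, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

reg = LinearRegression()
train_errors, val_errors = [], []
for m in range(5, len(X_train) + 1):
    # Fit on the first m training samples only.
    reg.fit(X_train[:m], y_train[:m])
    train_errors.append(mean_squared_error(y_train[:m], reg.predict(X_train[:m])))
    val_errors.append(mean_squared_error(y_val, reg.predict(X_val)))

# scoring='rmse' corresponds to the square root of the MSE values.
train_rmse = np.sqrt(train_errors)
val_rmse = np.sqrt(val_errors)
print(len(train_errors), len(val_errors))
```

Plotting train_errors and val_errors against the number of training samples gives the two curves; a persistent gap between them points to overfitting.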
Notes
The hold-out technique is the classic and most popular approach for estimating the generalization performance of machine learning models. The dataset is split into training and test sets. The former is used for model training whereas the latter is used for model performance evaluation. However, in typical machine learning applications we are also interested in tuning and comparing different parameter settings to further improve performance. This process is called model selection, the name referring to a given classification or regression problem for which we want to select the optimal values of the tuning parameters (hyperparameters). Reusing the same test dataset over and over again during model selection is not recommended, since it then becomes part of the training data and the model is more likely to overfit. Because of this issue, hold-out validation with a single train/test split is not a good learning practice. A better way to use the hold-out method is to separate the data into three parts: the training set, the validation set and the test set. See more in [2].
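The three-way split described above can be obtained by applying scikit-learn's train_test_split() twice. The proportions below (60/20/20) and the toy arrays are illustrative choices, not values prescribed by watex.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)     # toy features: 100 samples
y = np.arange(100)                     # toy targets, one unique value per sample

# First split: hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```

The validation set is used for model selection and hyperparameter tuning, and the test set is touched only once for the final performance estimate.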
References
[1] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al. (2011) Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
[2] Raschka, S. & Mirjalili, V. (2019) Python Machine Learning. (J. Malysiak, S. Jain, J. Lovell, C. Nelson, S. D’silva & R. Atitkar, Eds.), 3rd ed., Packt.