- class watex.base.AdalineGradientDescent(eta=0.01, n_iter=50, random_state=42)[source]#
Bases:
_Base
Adaptive Linear Neuron classifier
ADAptive LInear NEuron (Adaline) was published by Bernard Widrow and his doctoral student Ted Hoff only a few years after Rosenblatt’s perceptron algorithm. It can be considered an improvement on the latter (Widrow et al., 1960 [1]).
Adaline illustrates the key concepts of defining and minimizing a continuous cost function. This lays the groundwork for understanding more advanced machine learning algorithms for classification, such as Logistic Regression, Support Vector Machines, and Regression models.
The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and Rosenblatt’s perceptron is that the weights are updated based on a linear activation function rather than a unit step function as in the perceptron. In Adaline, this linear activation function \(\phi(z)\) is simply the identity function of the net input, so that:
\[\phi(w^T x) = w^T x\]
While the linear activation function is used for learning the weights, the class label is still predicted with a unit step function (see predict).
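The learning rule described above can be illustrated with a short NumPy sketch. It is not the watex implementation: the function names `adaline_gd` and `predict`, the bias stored in `w[0]`, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def adaline_gd(X, y, eta=0.01, n_iter=50, random_state=42):
    """Batch gradient descent for Adaline (illustrative sketch only)."""
    rng = np.random.RandomState(random_state)
    # Assumption: w[0] holds the bias unit, w[1:] the feature weights.
    w = rng.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
    cost = []
    for _ in range(n_iter):
        net_input = X @ w[1:] + w[0]
        output = net_input                 # identity activation: phi(z) = z
        errors = y - output
        w[1:] += eta * X.T @ errors        # update from all samples at once
        w[0] += eta * errors.sum()
        cost.append((errors ** 2).sum() / 2.0)  # sum-of-squares cost
    return w, cost

def predict(X, w):
    """Unit step applied to the net input: class label 1 or -1."""
    return np.where(X @ w[1:] + w[0] >= 0.0, 1, -1)

# Toy, linearly separable data (hypothetical).
X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([-1, -1, 1, 1])
w_demo, cost_demo = adaline_gd(X_demo, y_demo, eta=0.05, n_iter=30)
print(predict(X_demo, w_demo), cost_demo[0], cost_demo[-1])
```

The cost recorded per epoch plays the role of the `cost_` attribute documented below: it should decrease when the learning rate is small enough.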
- Parameters:
eta (float) – Learning rate, between 0.0 and 1.0.
n_iter (int) – Number of passes over the training set.
random_state (int, default is 42) – Random number generator seed for random weight initialization.
- w_#
Weights after fitting
- Type:
array-like
- cost_#
Value of the sum-of-squares cost function in each epoch
- Type:
list
References
[1] Widrow, B. et al., 1960. An Adaptive “Adaline” Neuron Using Chemical “Memistors”, Technical Report Number 1553-2, Stanford Electronics Labs, Stanford, CA, October 1960.
- activation(X)[source]#
Compute the linear activation
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
X
- Return type:
activate NDArray
- fit(X, y)[source]#
Fit the training data
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M,), where M is the number of samples) – Training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns:
self – returns self for easy method chaining.
- Return type:
Perceptron instance
- property inspect#
Check whether the object is fitted or not
- net_input(X)[source]#
Compute the net input X
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
weighted net input
- predict(X)[source]#
Predict the class label after unit step
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
ypred
- Return type:
predicted class label after the unit step (1, or -1)
- class watex.base.AdalineStochasticGradientDescent(eta=0.01, n_iter=50, shuffle=True, random_state=42)[source]#
Bases:
_Base
Adaptive Linear Neuron classifier with stochastic gradient descent
Stochastic gradient descent is a popular alternative to batch gradient descent and is sometimes also called iterative or online gradient descent [1]. Instead of updating the weights based on the sum of accumulated errors over all training examples \(x^{(i)}\):
\[\Delta w = \eta \sum_{i} (y^{(i)} - \phi(z^{(i)})) x^{(i)}\]
the weights are updated incrementally for each training example:
\[\Delta w = \eta (y^{(i)} - \phi(z^{(i)})) x^{(i)}\]
- Parameters:
eta (float) – Learning rate, between 0.0 and 1.0.
n_iter (int) – Number of passes over the training set.
shuffle (bool) – If True, shuffle the training data every epoch to prevent cycles.
random_state (int, default is 42) – Random number generator seed for random weight initialization.
- w_#
Weights after fitting
- Type:
array-like
- cost_#
Value of the sum-of-squares cost function in each epoch
- Type:
list
See also
AdalineGradientDescent
References
[1] Widrow, B. et al., 1960. An Adaptive “Adaline” Neuron Using Chemical “Memistors”, Technical Report Number 1553-2, Stanford Electronics Labs, Stanford, CA, October 1960.
- activation(X)[source]#
Compute the linear activation
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
X
- Return type:
activate NDArray
- fit(X, y)[source]#
Fit the training data
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M,), where M is the number of samples) – Training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns:
self – returns self for easy method chaining.
- Return type:
Perceptron instance
- property inspect#
Check whether the object is fitted or not
- net_input(X)[source]#
Compute the net input X
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Return type:
weighted net input
- partial_fit(X, y)[source]#
Fit training data without reinitialising the weights
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M,), where M is the number of samples) – Training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns:
self – returns self for easy method chaining.
- Return type:
Perceptron instance
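Fitting without reinitialising the weights can be sketched with a small self-contained class. This is only an illustration of per-example (online) updates under stated assumptions: `OnlineAdaline`, its `_update` helper, and the toy data are hypothetical and do not reproduce the watex source.

```python
import numpy as np

class OnlineAdaline:
    """Hypothetical minimal Adaline trained one example at a time."""

    def __init__(self, eta=0.01, random_state=42):
        self.eta = eta
        self.rng = np.random.RandomState(random_state)
        self.w_ = None  # weights are created lazily on the first call

    def _update(self, xi, target):
        # Per-example rule: delta_w = eta * (y - phi(z)) * x, with phi(z) = z
        error = target - (xi @ self.w_[1:] + self.w_[0])
        self.w_[1:] += self.eta * error * xi
        self.w_[0] += self.eta * error
        return 0.5 * error ** 2

    def partial_fit(self, X, y):
        # Initialise only once, so later calls continue from the current weights.
        if self.w_ is None:
            self.w_ = self.rng.normal(scale=0.01, size=1 + X.shape[1])
        for i in self.rng.permutation(len(y)):  # shuffle to prevent cycles
            self._update(X[i], y[i])
        return self

X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([-1.0, -1.0, 1.0, 1.0])
model = OnlineAdaline(eta=0.05)
for _ in range(20):                          # 20 passes over the data
    model.partial_fit(X_demo, y_demo)
model.partial_fit(X_demo[:2], y_demo[:2])    # new data arrives later
labels = np.where(X_demo @ model.w_[1:] + model.w_[0] >= 0.0, 1, -1)
print(labels)
```

Because the weights are kept between calls, the model can be refined as new examples stream in, which is the main appeal of the online variant.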
- predict(X)[source]#
Predict the class label after unit step
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
ypred
- Return type:
predicted class label after the unit step (1, or -1)
- class watex.base.Data(verbose=0)[source]#
Bases:
object
Data base class
Typically, we train a model with a matrix of data. Note that the pandas DataFrame is the most commonly used because it is convenient to have column labels, even though NumPy arrays work as well.
For supervised learning, such as regression or classification, our intent is to have a function that transforms features into a label. Written as an algebraic formula, it would look like:
\[y = f(X)\]
X is a matrix. Each row represents a sample of data or information about an individual. Every column in X is a feature. The output of our function, y, is a vector that contains labels (for classification) or values (for regression).
In Python, by convention, we use the variable name X to hold the sample data even though the capitalization of the variable is a violation of the standard naming convention (see PEP8).
- Parameters:
data (str, filepath_or_buffer or pandas.core.DataFrame) – Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.
columns (str or list of str) – Columns to replace which contain the missing data. Can use the axis equals to ‘1’.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’: Drop rows which contain missing values.
- 1, or ‘columns’: Drop columns which contain missing values.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
sample (int, Optional) – Number of rows to visualize, or the limit on the number of samples needed to be able to see the patterns. This is useful when the data is composed of many rows. Shrinking the data to keep only some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kind (str, Optional) – Type of visualization. Can be ``dendrogramm``, ``mbar``, ``bar`` or ``corr`` plot for dendrogram, msno bar, plt and msno correlation visualization respectively:
- ``bar`` plot counts the nonmissing data using pandas.
- ``mbar`` uses the msno package to count the number of nonmissing data.
- ``dendrogram`` shows the clustering of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms are used to indicate how different the clusters are. Short arms mean that the branches are similar.
- ``corr`` creates a heat map showing if there are correlations where the data is missing. In this case, it does look like the locations where data is missing are correlated.
- None is the default visualization. It is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
verbose (int, default is 0) – Control the level of verbosity. Higher values lead to more messages.
- Returns:
self – returns self for easy method chaining.
- Return type:
Baseclass instance
Examples
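A minimal pandas sketch of the X and y convention described above. The column names and values are hypothetical and only serve to show the split between the feature matrix and the target vector; it does not call the watex API.

```python
import pandas as pd

# Hypothetical dataset: each row is a sample, each column a feature,
# and "flow" is the value we want to predict.
df = pd.DataFrame({
    "depth": [10.0, 25.0, 40.0],
    "resistivity": [120.0, 80.0, 45.0],
    "flow": [0, 1, 1],
})

X = df.drop(columns="flow")  # feature matrix of shape (M, N)
y = df["flow"]               # target vector of shape (M,)
print(X.shape, y.shape)
```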
- property data#
Return the verified data
- property describe#
Get summary statistics as well as the count of non-null data. The default behaviour of the method is to report only on numeric columns. To have full control, compute the statistics manually.
- drop(labels=None, columns=None, inplace=False, axis=0, **kws)[source]#
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
- Parameters:
labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels)
kws (dict) – Additional keyword arguments passed to pd.DataFrame.drop().
- Returns:
DataFrame without the removed index or column labels, or None if inplace equals True.
- Return type:
DataFrame or None
- fit(data=None)[source]#
Read, assert and fit the data.
- Parameters:
data (DataFrame of shape (M, N), from pandas.DataFrame) – DataFrame containing M samples and N features.
- Returns:
Returns self for easy method chaining.
- Return type:
Data instance
- property inspect#
Inspect data and trigger plot after checking the data entry. Raises NotFittedError if ExPlot is not fitted yet.
- merge()[source]#
Merge two series, whatever their types, with the operator &&.
When a series has object dtype because it holds non-numeric values, its dtype should be changed to object.
- profilingReport(data=None, **kwd)[source]#
Generate a report in a notebook.
It will summarize the types of the columns and allow you to view details of quantile statistics, a histogram, common values and extreme values.
- Parameters:
data (DataFrame of shape (M, N), from pandas.DataFrame) – DataFrame containing M samples and N features.
- Returns:
Returns self for easy method chaining.
- Return type:
Data instance
Examples
>>> from watex.base import Data
>>> Data().fit(data).profilingReport()
- rename(data=None, columns=None, pattern=None)[source]#
Rename the columns of the DataFrame, with column names in lowercase and spaces replaced by underscores.
- Parameters:
data (DataFrame of shape (M, N), from pandas.DataFrame) – DataFrame containing M samples and N features.
columns (str or list of str, Optional) – The specific columns in the DataFrame to rename. However, all columns are put in lowercase. If a column is not in the DataFrame, an error is raised.
pattern (str, Optional) – Regular expression pattern to strip the data. By default, the pattern is '[ -@*#&+/]'.
- Returns:
self – returns self for easy method chaining.
- Return type:
Data instance
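The renaming convention (lowercase names, spaces replaced by underscores) can be sketched in plain pandas. The helper `lower_snake_columns` and the column names are hypothetical; the sketch does not apply the `pattern` argument and is not the watex implementation.

```python
import pandas as pd

def lower_snake_columns(df):
    """Return a copy with lowercase column names and spaces turned into underscores."""
    new_names = {c: str(c).strip().lower().replace(" ", "_") for c in df.columns}
    return df.rename(columns=new_names)

df = pd.DataFrame({"Flow Rate": [1.2], "East Coord": [3.4]})
renamed = lower_snake_columns(df)
print(list(renamed.columns))
```

`DataFrame.rename` returns a new frame here, so the original column names are left untouched.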
- shrunk(columns, data=None, **kwd)[source]#
Reduce the data to its important features
- Parameters:
data (DataFrame of shape (M, N), from pandas.DataFrame) – DataFrame containing M samples and N features.
columns (str or list of str) – Columns or features to keep in the dataset.
kwd (dict) – Additional keyword arguments passed to watex.utils.mlutils.selectfeatures().
- Returns:
Returns self for easy method chaining.
- Return type:
Data instance
- class watex.base.GreedyPerceptron(eta=0.01, n_iter=50, random_state=42)[source]#
Bases:
_Base
Perceptron classifier
Inspired by Rosenblatt's concept of perceptron rules. Rosenblatt published the first concept of the perceptron learning rule [1], based on the MCP (McCulloch-Pitts) neuron model [2]. With the perceptron rule, Rosenblatt proposed an algorithm that would automatically learn the optimal weight coefficients that would then be multiplied by the input features in order to decide whether a neuron fires (transmits a signal) or not. In the context of supervised learning and classification, such an algorithm could then be used to predict whether a new data point belongs to one class or the other.
Rosenblatt's initial perceptron rule and the perceptron algorithm can be summarized by the following steps:
- Initialize the weights to 0 or small random numbers.
- For each training example \(x^{(i)}\):
  - Compute the output value \(\hat{y}\).
  - Update the weights.
The update of each weight \(w_j\) in the weight vector \(w\) can be formally written as:
\[w_j := w_j + \Delta w_j\]
where, for the perceptron rule, \(\Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)}) x_j^{(i)}\).
- Parameters:
eta (float) – Learning rate, between 0.0 and 1.0.
n_iter (int) – Number of passes over the training set.
random_state (int, default is 42) – Random number generator seed for random weight initialization.
- w_#
Weights after fitting
- Type:
array-like
- errors_#
Number of misclassifications (updates) in each epoch
- Type:
list
References
[1] Rosenblatt, F., 1957. The Perceptron: A Perceiving and Recognizing Automaton. Cornell Aeronautical Laboratory.
[2] McCulloch, W.S. and Pitts, W., 1943. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5(4): 115-133.
- fit(X, y)[source]#
Fit the training data
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M,), where M is the number of samples) – Training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns:
self – returns self for easy method chaining.
- Return type:
Perceptron instance
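The perceptron learning rule can be sketched in a few lines of NumPy. This is a hedged illustration rather than the watex source: the function name `perceptron_fit`, the bias stored in `w[0]`, and the toy data are assumptions.

```python
import numpy as np

def perceptron_fit(X, y, eta=0.01, n_iter=50, random_state=42):
    """Rosenblatt's perceptron rule (illustrative sketch only)."""
    rng = np.random.RandomState(random_state)
    # Assumption: small random initial weights, bias stored in w[0].
    w = rng.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
    errors_per_epoch = []
    for _ in range(n_iter):
        n_errors = 0
        for xi, target in zip(X, y):
            # Unit step on the net input gives the predicted label.
            y_hat = 1 if xi @ w[1:] + w[0] >= 0.0 else -1
            # delta_w_j = eta * (y - y_hat) * x_j; zero when the prediction is right.
            delta = eta * (target - y_hat)
            w[1:] += delta * xi
            w[0] += delta
            n_errors += int(delta != 0.0)
        errors_per_epoch.append(n_errors)
    return w, errors_per_epoch

# Toy, linearly separable data (hypothetical).
X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([-1, -1, 1, 1])
w_demo, errs_demo = perceptron_fit(X_demo, y_demo, eta=0.1, n_iter=20)
pred_demo = np.where(X_demo @ w_demo[1:] + w_demo[0] >= 0.0, 1, -1)
print(pred_demo, errs_demo[-1])
```

On linearly separable data the number of updates per epoch eventually reaches zero, which is what an attribute such as `errors_` lets you monitor.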
- predict(X)[source]#
Predict the class label after unit step
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
ypred – predicted class label after the unit step (1 or -1)
- class watex.base.MajorityVoteClassifier(clfs, weights=None, vote='classlabel')[source]#
Bases:
BaseEstimator, ClassifierMixin
A majority vote ensemble classifier
Combine different classification algorithms, each associated with an individual weight for confidence. The goal is to build a stronger meta-classifier that balances out the weaknesses of the individual classifiers on a particular dataset. In more precise mathematical terms, the weighted majority vote can be expressed as follows:
\[\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \chi_A(C_j(x) = i)\]
where \(w_j\) is a weight associated with a base classifier \(C_j\); \(\hat{y}\) is the predicted class label of the ensemble; \(A\) is the set of unique class labels; \(\chi_A\) is the characteristic function or indicator function, which returns 1 if the predicted class of the jth classifier matches \(i\), that is \(C_j(x) = i\). For equal weights, the equation simplifies to:
\[\hat{y} = \operatorname{mode}\{C_1(x), C_2(x), \ldots, C_m(x)\}\]
- Parameters:
clfs ({array_like}, shape (n_classifiers)) – Different classifiers of the ensemble.
vote (str, ['classlabel', 'probability'], default is {'classlabel'}) – If ‘classlabel’, the prediction is based on the argmax of the class labels. Otherwise, if ‘probability’, the argmax of the sum of the probabilities is used to predict the class label. Note that ‘probability’ is recommended for calibrated classifiers.
weights ({array-like}, shape (n_classifiers, ), Optional, default=None) – If a list of int or float values is provided, the classifiers are weighted by importance; uniform weights are used if ‘weights’ is None.
- classes_#
array of classifiers with encoded class labels
- Type:
array_like, shape (n_classifiers)
- classifiers_#
list of fitted classifiers
- Type:
list,
Examples
>>> import numpy as np
>>> from watex.exlib.sklearn import (
...     LogisticRegression, DecisionTreeClassifier, KNeighborsClassifier,
...     Pipeline, cross_val_score, train_test_split, StandardScaler,
...     SimpleImputer)
>>> from watex.datasets import fetch_data
>>> from watex.base import MajorityVoteClassifier
>>> from watex.base import selectfeatures
>>> data = fetch_data('bagoue original').get('data=dfy1')
>>> X0 = data.iloc[:, :-1]; y0 = data['flow'].values
>>> # exclude the categorical values for demonstration
>>> # binarize the target y
>>> y = np.asarray(list(map(lambda x: 0 if x <= 1 else 1, y0)))
>>> X = selectfeatures(X0, include='number')
>>> X = SimpleImputer().fit_transform(X)
>>> X, Xt, y, yt = train_test_split(X, y)
>>> clf1 = LogisticRegression(penalty='l2', solver='lbfgs')
>>> clf2 = DecisionTreeClassifier(max_depth=1)
>>> clf3 = KNeighborsClassifier(p=2, n_neighbors=1)
>>> pipe1 = Pipeline([('sc', StandardScaler()), ('clf', clf1)])
>>> pipe3 = Pipeline([('sc', StandardScaler()), ('clf', clf3)])
-> Test each classifier individually
>>> clf_labels = ['Logit', 'DTC', 'KNN']
>>> # test the results without using the MajorityVoteClassifier
>>> for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
...     scores = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')
...     print("ROC AUC: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
...
ROC AUC: 0.91 (+/- 0.05) [Logit]
ROC AUC: 0.73 (+/- 0.07) [DTC]
ROC AUC: 0.77 (+/- 0.09) [KNN]
-> Implement the MajorityVoteClassifier
>>> # test the results with majority vote
>>> mv_clf = MajorityVoteClassifier(clfs=[pipe1, clf2, pipe3])
>>> clf_labels += ['Majority voting']
>>> all_clfs = [pipe1, clf2, pipe3, mv_clf]
>>> for clf, label in zip(all_clfs, clf_labels):
...     scores = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')
...     print("ROC AUC: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
...
ROC AUC: 0.91 (+/- 0.05) [Logit]
ROC AUC: 0.73 (+/- 0.07) [DTC]
ROC AUC: 0.77 (+/- 0.09) [KNN]
ROC AUC: 0.92 (+/- 0.06) [Majority voting]
The majority vote gives a good score with fewer errors.
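The weighted vote itself can be illustrated independently of any estimator. The sketch below assumes integer-encoded class labels and uses a hypothetical helper, `weighted_majority_vote`; it is one common way to compute the argmax of the weighted counts, not necessarily how watex does it.

```python
import numpy as np

def weighted_majority_vote(predictions, weights=None):
    """predictions: integer labels of shape (n_classifiers, n_examples)."""
    predictions = np.asarray(predictions)
    # For each example (column), sum the weights per class and take the argmax.
    return np.apply_along_axis(
        lambda votes: np.argmax(np.bincount(votes, weights=weights)),
        axis=0,
        arr=predictions,
    )

# Three classifiers (rows) voting on three examples (columns).
preds = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 1]])
unweighted = weighted_majority_vote(preds)
weighted = weighted_majority_vote(preds, weights=[0.2, 0.2, 0.6])
print(unweighted)  # [0 0 1]
print(weighted)    # [1 0 1]
```

With equal weights the result is the mode of the votes; giving the third classifier a weight of 0.6 lets it overrule the other two on the first example.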
- fit(X, y)[source]#
Fit classifiers
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like of shape (M,), where M is the number of samples) – Training target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns:
self – returns self for easy method chaining.
- Return type:
MajorityVoteClassifier instance
- get_params(deep=True)[source]#
Override get_params from the _Base class to expose the parameters of the individual classifiers for grid search.
- property inspect#
Check whether the object is fitted or not
- predict(X)[source]#
Predict the class label of X
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
maj_vote – Predicted class label array
- Return type:
{array_like}, shape (n_examples, )
- predict_proba(X)[source]#
Predict the class probabilities and return the average probabilities, which is useful when computing the receiver operating characteristic area under the curve (ROC AUC).
- Parameters:
X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set; denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
avg_proba – weighted average probabilities for each class per example.
- Return type:
{array_like }, shape (n_examples, n_classes)
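Averaging class probabilities across classifiers can be sketched with `numpy.average`. The probability values and weights below are made up for illustration, and the array layout (classifier, example, class) is an assumption.

```python
import numpy as np

# Hypothetical predicted probabilities with shape
# (n_classifiers, n_examples, n_classes).
probas = np.array([
    [[0.9, 0.1], [0.4, 0.6]],   # classifier 1
    [[0.8, 0.2], [0.3, 0.7]],   # classifier 2
    [[0.4, 0.6], [0.2, 0.8]],   # classifier 3
])
weights = [0.2, 0.2, 0.6]

# Weighted average over the classifier axis: shape (n_examples, n_classes).
avg_proba = np.average(probas, axis=0, weights=weights)
# The predicted label is the class with the highest averaged probability.
labels = np.argmax(avg_proba, axis=1)
print(avg_proba, labels)
```

Each row of `avg_proba` still sums to 1 because the weights are normalized by `numpy.average`.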
- set_score_request(*, sample_weight='$UNCHANGED$')#
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- class watex.base.Missing(in_percent=False, sample=None, kind=None, drop_columns=None, **kws)[source]#
Bases:
Data
Deal with missing values in data
Most algorithms will not work with missing data. Notable exceptions are the recent boosting libraries such as XGBoost (watex.documentation.xgboost.__doc__), CatBoost and LightGBM. As with many things in machine learning, there are no hard answers for how to treat missing data. Also, missing data could represent different situations. There are various ways to handle missing data:
- Remove any row with missing data
- Remove any column with missing data
- Impute missing values
- Create an indicator column to signal that data was missing
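The strategies listed above can be sketched with plain pandas. The DataFrame, its column names, and the choice of the median for imputation are assumptions made for illustration; none of this is the watex API.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps.
df = pd.DataFrame({
    "depth": [10.0, np.nan, 40.0, 55.0],
    "resistivity": [120.0, 80.0, np.nan, 30.0],
    "flow": [0.0, 1.0, 1.0, 2.0],
})

rows_dropped = df.dropna(axis=0)                        # remove rows with missing data
cols_dropped = df.dropna(axis=1)                        # remove columns with missing data
imputed = df.fillna(df.median(numeric_only=True))       # impute with the column median
with_flag = df.assign(depth_missing=df["depth"].isna()) # indicator column
print(rows_dropped.shape, list(cols_dropped.columns))
```

Which strategy is appropriate depends on how much data is missing and why; dropping is the simplest but discards information.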
- Parameters:
in_percent (bool) – Give the statistics of missing data in percentage if set to True.
sample (int, Optional) – Number of rows to visualize, or the limit on the number of samples needed to be able to see the patterns. This is useful when the data is composed of many rows. Shrinking the data to keep only some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kind (str, Optional) – Type of visualization. Can be ``dendrogramm``, ``mbar``, ``bar`` or ``corr`` plot for dendrogram, msno bar, plt and msno correlation visualization respectively:
- ``bar`` plot counts the nonmissing data using pandas.
- ``mbar`` uses the msno package to count the number of nonmissing data.
- ``dendrogram`` shows the clustering of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms are used to indicate how different the clusters are. Short arms mean that the branches are similar.
- ``corr`` creates a heat map showing if there are correlations where the data is missing. In this case, it does look like the locations where data is missing are correlated.
- None is the default visualization. It is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.
Examples
>>> from watex.base import Missing
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> ms = Missing().fit(data)
>>> ms.plot_.fig_size = (12, 4)
>>> ms.plot()
- drop(data=None, columns=None, inplace=False, axis=1, **kwd)[source]#
Remove missing data
- Parameters:
data (DataFrame of shape (M, N), from pandas.DataFrame) – DataFrame containing M samples and N features.
columns (str or list of str) – Columns to drop which contain the missing data. Can use the axis equals to ‘1’.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’: Drop rows which contain missing values.
- 1, or ‘columns’: Drop columns which contain missing values.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
’any’: If any NA values are present, drop that row or column.
’all’ : If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
subset (column label or sequence of labels, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
- Returns:
self – returns self for easy method chaining.
- Return type:
Missing instance
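The `how` and `thresh` options described above mirror those of `pandas.DataFrame.dropna`. The following sketch shows their effect on a small, hypothetical DataFrame using pandas directly rather than the `Missing` class.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

any_dropped = df.dropna(axis=0, how="any")   # drop rows with at least one NA
all_dropped = df.dropna(axis=0, how="all")   # drop rows where every value is NA
at_least_two = df.dropna(axis=0, thresh=2)   # keep rows with >= 2 non-NA values
dense_cols = df.dropna(axis=1, thresh=2)     # keep columns with >= 2 non-NA values
print(len(any_dropped), len(all_dropped), len(at_least_two), list(dense_cols.columns))
```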
- property get_missing_columns#
Return the columns with NaN values
- property isnull#
Check the mean of missing values in the data, in percentage
- plot(figsize=None, **kwd)[source]#
Visualize patterns in the missing data.
- Parameters:
data (DataFrame of shape (M, N), from pandas.DataFrame) – DataFrame containing M samples and N features.
kind (str, Optional) – Kind of visualization. Can be ``dendrogramm``, ``mbar`` or ``bar`` plot for dendrogram, msno bar and plt visualization respectively:
- ``bar`` plot counts the nonmissing data using pandas.
- ``mbar`` uses the msno package to count the number of nonmissing data.
- ``dendrogram`` shows the clustering of where the data is missing. Leaves that are at the same level predict one another's presence (empty or filled). The vertical arms are used to indicate how different the clusters are. Short arms mean that the branches are similar.
- ``corr`` creates a heat map showing if there are correlations where the data is missing. In this case, it does look like the locations where data is missing are correlated.
- None is the default visualization. It is useful for viewing contiguous areas of missing data, which would indicate that the missing data is not random. The matrix function includes a sparkline along the right side. Patterns here would also indicate non-random missing data. It is recommended to limit the number of samples to be able to see the patterns.
Any other value will raise an error.
sample (int, Optional) – Number of rows to visualize. This is useful when the data is composed of many rows. Shrinking the data to keep only some samples for visualization is recommended. None plots all the samples (or examples) in the data.
kws (dict) – Additional keyword arguments of the msno.matrix plot.
- Returns:
self – returns self for easy method chaining.
- Return type:
Missing instance
Examples
>>> from watex.base import Missing
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> ms = Missing().fit(data)
>>> ms.plot(figsize=(12, 4))
- replace(data=None, columns=None, fill_value=None, new_column_name=None, return_non_null=False, **kwd)[source]#
Replace the missing values in the columns to consider.
Use the coalesce function of pyjanitor. It takes a dataframe and a list of columns to consider. This is similar to functionality found in Excel and SQL databases. It returns the first non-null value of each row.
- Parameters:
data (Dataframe of shape (M, N) from pandas.DataFrame) – Dataframe containing samples M and features N
columns (str or list of str) – columns to replace which contain the missing data. Can use the axis equals to ‘1’.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing values.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
- Returns:
``self`` – returns ``self`` for easy method chaining.
- Return type:
Missing instance
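The "first non-null value of each row" behaviour can be sketched with plain pandas. The helper `coalesce_columns` below is hypothetical and is only a stand-in for the idea, not the pyjanitor function or the `replace` method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [np.nan, 2.0, np.nan],
    "c": [9.0, 8.0, 7.0],
})


def coalesce_columns(frame, columns, new_column_name):
    """Hypothetical helper: first non-null value per row across `columns`."""
    out = frame.copy()
    # Back-fill along each row so the first column ends up holding the
    # first non-null value found when reading the row from left to right.
    out[new_column_name] = frame[columns].bfill(axis=1).iloc[:, 0]
    return out


result = coalesce_columns(df, ["a", "b", "c"], "first_valid")
print(result)
```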
- property sanity_check#
Ensure that we have dealt with all missing values. It returns a single boolean that tells whether any cell is missing in a DataFrame.
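A single-boolean check of this kind is commonly written with two chained `any()` calls in pandas. The snippet is a sketch of the idea only; whether `sanity_check` reports `True` for "still has missing cells" or for "clean" is not stated here, so the variable names below are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# True if at least one cell of the dataframe is missing.
has_missing_before = bool(df.isna().any().any())

# After filling, no cell should be missing any more.
filled = df.fillna(0.0)
has_missing_after = bool(filled.isna().any().any())

print(has_missing_before, has_missing_after)
```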
- class watex.base.SequentialBackwardSelection(estimator=None, k_features=1, scoring='accuracy', test_size=0.25, random_state=42)[source]#
Bases:
_Base
Sequential Backward Selection (SBS) is a feature selection algorithm which aims to reduce the dimensionality of the initial feature subspace with a minimum decay in the performance of the classifier, in order to improve computational efficiency. In certain cases, SBS can even improve the predictive power of the model if the model suffers from overfitting.
The idea behind SBS is simple: it sequentially removes features from the full feature subset until the new feature subspace contains the desired number of features. In order to determine which feature is to be removed at each stage, a criterion function \(J\) is needed for minimization [1]. The criterion calculated by the criterion function can simply be the difference in performance of the classifier before and after the removal of a particular feature. The feature to be removed at each stage can then be defined as the feature that maximizes this criterion; in simpler terms, at each stage the feature that causes the least performance loss after removal is eliminated. Based on the preceding definition of SBS, the algorithm can be outlined in a few steps:
1. Initialize the algorithm with \(k=d\), where \(d\) is the dimensionality of the full feature space, \(X_d\).
2. Determine the feature \(x^{-}\) that maximizes the criterion: \(x^{-}= argmax J(X_k-x)\), where \(x\in X_k\).
3. Remove the feature \(x^{-}\) from the feature set: \(X_{k-1}= X_k -x^{-}; k=k-1\).
4. Terminate if \(k\) equals the number of desired features; otherwise go to step 2. [2]
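The steps above can be sketched in a few lines of pure Python. The function name, the `criterion` callable and the toy usefulness scores are assumptions made for illustration. In practice \(J\) would be a validation score of the classifier trained on the candidate subset, and this is not the class's own implementation.

```python
def sequential_backward_selection(features, criterion, k_features):
    """Drop one feature at a time until `k_features` remain.

    `criterion` is any callable J(subset) -> float where higher is better.
    Feature names are assumed to be unique.
    """
    if not 1 <= k_features <= len(features):
        raise ValueError("k_features must be between 1 and len(features)")

    current = tuple(features)          # step 1: start with k = d
    subsets = [current]
    scores = [criterion(current)]

    while len(current) > k_features:   # step 4: stop when k is reached
        # Step 2: evaluate every subset obtained by removing one feature.
        candidates = [
            tuple(f for f in current if f != removed) for removed in current
        ]
        # Step 3: keep the best candidate, i.e. remove the feature whose
        # absence costs the least.
        current = max(candidates, key=criterion)
        subsets.append(current)
        scores.append(criterion(current))

    return current, subsets, scores


# Toy criterion (an assumption): summed usefulness minus a size penalty.
usefulness = {"a": 0.50, "b": 0.05, "c": 0.30, "d": 0.01}


def toy_criterion(subset):
    return sum(usefulness[f] for f in subset) - 0.02 * len(subset)


selected, subsets, scores = sequential_backward_selection(
    ["a", "b", "c", "d"], toy_criterion, k_features=2
)
print(selected)
```

Each pass evaluates as many candidate subsets as there are remaining features, so the cost grows with the number of features removed.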
- Parameters:
estimator (callable or instantiated object) – callable or instance object that has a fit method.
k_features (int, default=1) – the number of features to keep at the end of the selection. It must be less than the number of features in the training set, otherwise the selection does not make sense.
scoring (callable or str, default='accuracy') – metric for scoring. Available metrics are ‘precision’, ‘recall’, ‘roc_auc’ or ‘accuracy’. Any other metric will raise an error.
test_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
random_state (int, RandomState instance or None, default=42) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
References
[1] Raschka, S., Mirjalili, V., 2019. Python Machine Learning, 3rd ed. Packt.
[2] Ferri F., Pudil P., Hatef M., and Kittler J., Comparative study of the techniques for Large-scale feature selection, pages 403-413, 1994.
- feature_names_in_#
Names of features seen during fit. Defined only when X has feature names that are all strings.
- Type:
ndarray of shape (n_features_in_,)
- indices_#
Indices of the feature subset of the best validated models.
- Type:
tuple of dimensionality X
- subsets_#
list of indices_
- Type:
list
- scores_#
Collection of the scores of the best models obtained during cross-validation.
- Type:
list
- k_score_#
The score of the desired feature subset.
- Type:
float
Examples
>>> from watex.exlib.sklearn import KNeighborsClassifier, train_test_split
>>> from watex.datasets import fetch_data
>>> from watex.base import SequentialBackwardSelection
>>> X, y = fetch_data('bagoue analysed') # data already standardized
>>> Xtrain, Xt, ytrain, yt = train_test_split(X, y)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> sbs = SequentialBackwardSelection(knn)
>>> sbs.fit(Xtrain, ytrain)
- fit(X, y)[source]#
Fit the training data
Note that SBS splits the dataset into test and training sets inside the fit function. \(X\) is still fed to the algorithm. Indeed, SBS then creates new subsets for testing (validation) and training, which is why this test set is also called the validation dataset. This approach is necessary to prevent our original test set from becoming part of the training data.
- Parameters:
X (Ndarray (M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M, ), M=m-samples) – train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
- Returns:
self – returns self for easy method chaining.
- Return type:
SequentialBackwardSelection instance
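The internal split described in the note can be pictured with a small NumPy sketch. It uses a plain random permutation as a stand-in for a library splitter, and the array names are assumptions. The point is only that the validation rows are carved out of the data passed to fit, so an external test set stays untouched.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical training data handed to fit: 20 samples, 4 features.
X = rng.rand(20, 4)
y = rng.randint(0, 2, size=20)

test_size = 0.25

# Shuffle the row indices, then reserve the first block for validation.
order = rng.permutation(len(X))
n_val = int(np.ceil(test_size * len(X)))
val_idx, train_idx = order[:n_val], order[n_val:]

X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]

print(X_train.shape, X_val.shape)
```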
- transform(X)[source]#
Transform the training set
- Parameters:
X (Ndarray (M x N matrix where M=m-samples, & N=n-features)) – Training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
- Returns:
X – New transformed training set with selected features columns
- Return type:
Ndarray (M x N matrix where M=m-samples, & N=n-features)
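Keeping only the selected feature columns amounts to column indexing. The sketch below assumes the selected indices are available as a tuple, in the spirit of the `indices_` attribute; the values themselves are made up.

```python
import numpy as np

# 3 samples, 4 features.
X = np.arange(12).reshape(3, 4)

# Hypothetical indices of the features retained by the selection.
indices = (0, 2)

# Keep every row and only the selected columns.
X_selected = X[:, list(indices)]

print(X_selected)
```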
- watex.base.existfeatures(df, features, error='raise')[source]#
Check whether the features exist or not.
- Parameters:
df – a dataframe for feature selection
features – list of features to select. The list of features must be in the dataframe, otherwise an error occurs.
error – str - raise if the features don’t exist in the dataframe. Default is raise, and ignore otherwise.
- Returns:
bool, asserts whether the features exist
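A check of this kind could look like the sketch below. `features_exist` is a hypothetical stand-in written for illustration; the exception type and message are assumptions, not the behaviour of `existfeatures`.

```python
import pandas as pd


def features_exist(df, features, error="raise"):
    """Hypothetical stand-in for the check described above."""
    if isinstance(features, str):
        features = [features]
    # Collect the requested names that are not dataframe columns.
    missing = [f for f in features if f not in df.columns]
    if missing and error == "raise":
        raise ValueError(f"Features not found in the dataframe: {missing}")
    return not missing


df = pd.DataFrame({"power": [1, 2], "magnitude": [3, 4]})

print(features_exist(df, ["power", "magnitude"]))
print(features_exist(df, ["power", "unknown"], error="ignore"))
```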
- watex.base.get_params(obj)[source]#
Get object parameters.
The object can be a callable or an instance.
- Parameters:
obj – object, can be a callable or an instance
- Returns:
dict of parameter values
- Examples:
>>> from sklearn.svm import SVC
>>> from watex.base import get_params
>>> sigmoid = SVC(**{'C': 512.0, 'coef0': 0, 'degree': 1, 'gamma': 0.001953125, 'kernel': 'sigmoid', 'tol': 1.0})
>>> pvalues = get_params(sigmoid)
>>> pvalues
{'decision_function_shape': 'ovr', 'break_ties': False, 'kernel': 'sigmoid', 'degree': 1, 'gamma': 0.001953125, 'coef0': 0, 'tol': 1.0, 'C': 512.0, 'nu': 0.0, 'epsilon': 0.0, 'shrinking': True, 'probability': False, 'cache_size': 200, 'class_weight': None, 'verbose': False, 'max_iter': -1, 'random_state': None}
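One way to collect parameters from either a callable or an instance is to inspect the signature, as in the sketch below. `collect_params` and the `Toy` class are hypothetical; this only illustrates the general idea and makes no claim about how `get_params` is implemented.

```python
import inspect


def collect_params(obj):
    """Hypothetical helper: map parameter names to their values."""
    if inspect.isclass(obj) or inspect.isfunction(obj):
        # Callable: report the declared default values.
        sig = inspect.signature(obj)
        return {
            name: p.default
            for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty
        }
    # Instance: read the constructor argument names and look up the
    # attributes of the same name on the object.
    sig = inspect.signature(type(obj).__init__)
    names = [n for n in sig.parameters if n != "self"]
    return {n: getattr(obj, n) for n in names if hasattr(obj, n)}


class Toy:
    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter


print(collect_params(Toy))
print(collect_params(Toy(eta=0.1)))
```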
- watex.base.selectfeatures(df, features=None, include=None, exclude=None, coerce=False, **kwd)[source]#
Select features and return new dataframe.
- Parameters:
df – a dataframe for features selections
features – list of features to select. The list of features must be in the dataframe, otherwise an error occurs.
include – the type of data to retrieve from the dataframe df. Can be number.
exclude – the type of data to exclude from the dataframe df. Can be number, i.e. only non-numeric data will be kept in the returned data.
coerce – return the whole dataframe after converting the numeric columns. Be aware that no selection is done and no error is raised in that case. Default is False.
kwd – additional keyword arguments from the pd.astype function
- Ref:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
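The selection by name and by data type described above maps onto standard pandas operations. The sketch below uses a made-up dataframe and plain pandas calls as an analogy; it does not call `selectfeatures` itself.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "name": ["b1", "b2"],
    "power": [50.0, 60.0],
    "shape": ["V", "W"],
})

# Selecting by feature names.
by_name = df[["name", "power"]]

# Keeping only numeric columns, comparable to include='number'.
numeric_only = df.select_dtypes(include="number")

# Dropping numeric columns, comparable to exclude='number'.
non_numeric = df.select_dtypes(exclude="number")

print(list(numeric_only.columns), list(non_numeric.columns))
```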