watex.analysis.linear_discriminant_analysis
- watex.analysis.linear_discriminant_analysis(X, y, n_components=2, view=False, verbose=0, return_X=True)
Linear Discriminant Analysis (LDA).
LDA is a feature-extraction technique that increases computational efficiency and reduces overfitting caused by the curse of dimensionality in non-regularized models. The general concept behind LDA is very similar to principal component analysis (PCA), but whereas PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal of LDA is to find the feature subspace that optimizes class separability. The main steps required to perform LDA are summarized below:
Standardize the d-dimensional dataset (d is the number of features).
For each class \(i\), compute the d-dimensional mean vector \(m_i\). Each entry of \(m_i\) is the mean feature value \(\mu_m\) with respect to the examples of class \(i\):
\[m_i = \frac{1}{n_i} \sum_{x\in D_i} x\]
Construct the between-class scatter matrix, \(S_B\), and the within-class scatter matrix, \(S_W\). The individual scatter matrices \(S_i\) are scaled by the class size \(n_i\) before they are summed into the within-class scatter matrix \(S_W\):
\[ \begin{align}\begin{aligned}\Sigma_i = \frac{1}{n_i}S_i\\\Sigma_i = \frac{1}{n_i} \sum_{x\in D_i} (x-m_i)(x-m_i)^T\end{aligned}\end{align} \]
The scaled within-class scatter matrix \(\Sigma_i\) is the covariance matrix of class \(i\). The between-class scatter matrix \(S_B\) is then computed as:
\[S_B= \sum_{i=1}^{c} n_i (m_i-m) (m_i-m)^T\]
where \(c\) is the number of classes and \(m\) is the overall mean, computed over the examples from all classes.
Compute the eigenvectors and corresponding eigenvalues of the matrix \(S_W^{-1}S_B\).
Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.
Choose the \(k\) eigenvectors that correspond to the \(k\) largest eigenvalues to construct a \(d \times k\)-dimensional transformation matrix, \(W\); the eigenvectors are the columns of this matrix.
Project the examples onto the new feature subspace using the transformation matrix \(W\).
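The steps above can be sketched in plain NumPy. This is an illustrative sketch only, not the watex implementation: the function name `lda_sketch`, the synthetic three-class data, and the choice to return both the projected data and \(W\) are assumptions made for the example. It uses the standard definition of \(S_B\), in which each class term is weighted by the class size \(n_i\).

```python
import numpy as np


def lda_sketch(X, y, n_components=2):
    """Illustrative LDA following the listed steps (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Standardize each feature to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    n_features = X_std.shape[1]
    overall_mean = X_std.mean(axis=0)

    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_i = X_std[y == label]
        n_i = X_i.shape[0]
        m_i = X_i.mean(axis=0)  # class mean vector
        centered = X_i - m_i
        # Scaled class scatter matrix (class covariance), summed into S_W.
        S_W += centered.T @ centered / n_i
        # Between-class contribution, weighted by the class size.
        diff = (m_i - overall_mean).reshape(-1, 1)
        S_B += n_i * (diff @ diff.T)

    # Eigendecomposition of S_W^{-1} S_B. The product is not symmetric,
    # so tiny imaginary parts can appear numerically; keep the real part.
    eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    eig_vals, eig_vecs = eig_vals.real, eig_vecs.real

    # Rank eigenvectors by decreasing eigenvalue and keep the top k as W.
    order = np.argsort(eig_vals)[::-1]
    W = eig_vecs[:, order[:n_components]]

    # Project the standardized examples onto the new subspace.
    return X_std @ W, W


# Synthetic data: three classes, four features (made up for the example).
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X_demo = np.vstack([rng.normal(c, 1.0, size=(30, 4)) for c in centers])
y_demo = np.repeat([0, 1, 2], 30)

X_lda, W = lda_sketch(X_demo, y_demo, n_components=2)
print(X_lda.shape, W.shape)  # (90, 2) (4, 2)
```

With \(c\) classes, \(S_B\) has rank at most \(c - 1\), so at most \(c - 1\) eigenvalues are non-zero. For the three-class data above, two components already capture all the discriminative directions.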
- Parameters:
X (ndarray, shape (M, N), where M is the number of samples and N the number of features) – Training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.
y (array-like, shape (M,), where M is the number of samples) – Training target. Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.
n_components (int, default=2) – Number of components to keep, i.e. the number of most discriminative eigenvectors.
return_X (bool, default=True) – Return the transformed training set built from n_components.
view (bool, default=False) – Visualize the LDA plot. If set to True, the plot is triggered.
- Returns:
X or W – The transformed training set (X) or the transformation matrix (W) whose columns are the most discriminative eigenvectors.
- Return type:
ndarray (n_samples, 2 )
Examples
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import SimpleImputer, LogisticRegression
>>> from watex.analysis.decomposition import linear_discriminant_analysis
>>> # selectfeatures is assumed to be importable from watex.utils
>>> from watex.utils import selectfeatures
>>> data = fetch_data("bagoue original").get('data=dfy1')  # encoded flow
>>> y = data.flow; X = data.drop(columns='flow')
>>> # select the numerical features
>>> X = selectfeatures(X, include='number')
>>> # impute the missing data
>>> X = SimpleImputer().fit_transform(X)
>>> Xtr = linear_discriminant_analysis(X, y, view=True)