watex.analysis.linear_discriminant_analysis

watex.analysis.linear_discriminant_analysis(X, y, n_components=2, view=False, verbose=0, return_X=True)

Linear Discriminant Analysis (LDA).

LDA is a feature-extraction technique that can increase computational efficiency and reduce overfitting caused by the curse of dimensionality in non-regularized models. The general concept behind LDA is very similar to principal component analysis (PCA), but whereas PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal of LDA is to find the feature subspace that optimizes class separability. The main steps required to perform LDA are summarized below:

  • Standardize the d-dimensional dataset (d is the number of features).

  • For each class, compute the d-dimensional mean vector. Each mean vector \(m_i\) stores the mean value of every feature over the examples \(D_i\) of class \(i\), where \(n_i\) is the number of examples in that class:

    \[m_i = \frac{1}{n_i} \sum_{x\in D_i} x\]
  • Construct the between-class scatter matrix, \(S_B\), and the within-class scatter matrix, \(S_W\). Each individual class scatter matrix \(S_i\) is scaled by the class size \(n_i\) before the results are summed over the \(c\) classes to form \(S_W\):

    \[ \begin{align}\begin{aligned}\Sigma_i = \frac{1}{n_i}S_i = \frac{1}{n_i} \sum_{x\in D_i} (x-m_i)(x-m_i)^T\\S_W = \sum_{i=1}^{c} \Sigma_i\end{aligned}\end{align} \]

    Each scaled class scatter matrix is the covariance matrix of that class. The between-class scatter matrix \(S_B\) is then computed as:

    \[S_B= \sum_{i=1}^{c} n_i (m_i-m) (m_i-m)^T\]

    where \(c\) is the number of classes, \(n_i\) is the number of examples in class \(i\), and \(m\) is the overall mean computed from the examples of all classes.

  • Compute the eigenvectors and corresponding eigenvalues of the matrix \(S_W^{-1}S_B\).

  • Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.

  • Choose the \(k\) eigenvectors that correspond to the \(k\) largest eigenvalues to construct a \(d \times k\)-dimensional transformation matrix, \(W\); the eigenvectors are the columns of this matrix.

  • Project the examples onto the new feature subspace using the transformation matrix \(W\).
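The steps above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not the watex implementation: the function name `lda_sketch` and the synthetic three-class data are hypothetical, every feature is assumed to have non-zero variance, and \(S_W\) is assumed to be invertible.

```python
import numpy as np


def lda_sketch(X, y, n_components=2):
    """Illustrative LDA projection following the listed steps (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Step 1: standardize each feature (assumes no constant feature).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    d = X_std.shape[1]
    overall_mean = X_std.mean(axis=0)

    S_W = np.zeros((d, d))  # within-class scatter (sum of class covariances)
    S_B = np.zeros((d, d))  # between-class scatter
    for label in np.unique(y):
        X_c = X_std[y == label]
        n_c = X_c.shape[0]
        # Step 2: class mean vector m_i.
        mean_c = X_c.mean(axis=0)
        # Step 3a: scaled class scatter, i.e. the class covariance (1/n_i) * S_i.
        centered = X_c - mean_c
        S_W += centered.T @ centered / n_c
        # Step 3b: between-class contribution n_i (m_i - m)(m_i - m)^T.
        diff = (mean_c - overall_mean).reshape(d, 1)
        S_B += n_c * (diff @ diff.T)

    # Step 4: eigen-decomposition of S_W^{-1} S_B (solve avoids an explicit inverse).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))

    # Steps 5-6: rank by eigenvalue and keep the k leading eigenvectors as columns of W.
    # The matrix is not symmetric, so tiny imaginary parts are discarded.
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:n_components]].real

    # Step 7: project the standardized examples onto the new subspace.
    return X_std @ W, W


# Synthetic demo data: three Gaussian classes in four dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [4, 4, 0, 0], [0, 4, 4, 0]], dtype=float)
X_demo = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in centers])
y_demo = np.repeat([0, 1, 2], 50)

X_lda, W = lda_sketch(X_demo, y_demo, n_components=2)
print(X_lda.shape, W.shape)  # (150, 2) (4, 2)
```

With \(c\) classes, \(S_B\) has rank at most \(c - 1\), so at most \(c - 1\) eigenvalues are non-zero; asking for more components than that adds directions with no discriminative information.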

Parameters:
  • X (ndarray of shape (M, N), where M is the number of samples and N the number of features) – Training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

  • y (array-like of shape (M,), where M is the number of samples) – Training target. Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

  • n_components (int, default=2) – Number of components to keep, i.e. the number of most discriminative eigenvectors.

  • return_X (bool, default=True) – Return the transformed training set computed from the n_components selected eigenvectors.

  • view (bool, default=False) – If True, plot the LDA result.

Returns:

X or W – The transformed training set (X) or the transformation matrix (W) whose columns are the most discriminative eigenvectors.

Return type:

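ndarray of shape (n_samples, n_components) for the transformed training set, i.e. (n_samples, 2) with the default n_components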

Examples

>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import SimpleImputer
>>> from watex.utils import selectfeatures  # assumed import path; the original example omits it
>>> from watex.analysis.decomposition import linear_discriminant_analysis
>>> data = fetch_data("bagoue original").get('data=dfy1')  # encoded flow
>>> y = data.flow; X = data.drop(columns='flow')
>>> # select the numerical features
>>> X = selectfeatures(X, include='number')
>>> # impute the missing data
>>> X = SimpleImputer().fit_transform(X)
>>> Xtr = linear_discriminant_analysis(X, y, view=True)