Model selection with Probabilistic PCA and Factor Analysis (FA)

Probabilistic PCA and Factor Analysis are probabilistic models. As a consequence, the likelihood of new data can be used for model selection and covariance estimation. Here we compare PCA and FA with cross-validation on low rank data corrupted with homoscedastic noise (noise variance is the same for each feature) or heteroscedastic noise (noise variance is different for each feature). In a second step we compare the model likelihood to the likelihoods obtained from shrinkage covariance estimators.

One can observe that with homoscedastic noise both FA and PCA succeed in recovering the size of the low rank subspace. The likelihood with PCA is higher than with FA in this case. However, PCA fails and overestimates the rank when heteroscedastic noise is present. Under appropriate circumstances the low rank models are more likely than shrinkage models.

The automatic estimation from "Automatic Choice of Dimensionality for PCA" (Thomas P. Minka, NIPS 2000: 598-604) is also compared.
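As an illustration of the automatic estimation mentioned above, the sketch below uses scikit-learn's `PCA(n_components="mle")`, which applies Minka's method, on low rank data with homoscedastic noise. It is a minimal example under assumed settings (1000 samples, 50 features, rank 10, unit noise), not the code used by this module, and the variable names are chosen for illustration only.

```python
import numpy as np
from scipy import linalg
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
n_samples, n_features, rank, sigma = 1000, 50, 10, 1.0  # assumed settings

# Low rank signal plus homoscedastic noise (same variance for every feature).
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = rng.randn(n_samples, rank) @ U[:, :rank].T
X_homo = X + sigma * rng.randn(n_samples, n_features)

# n_components="mle" selects the dimensionality with Minka's method;
# it is used here with the full SVD solver.
pca = PCA(n_components="mle", svd_solver="full").fit(X_homo)
n_components_mle = int(pca.n_components_)
print("Dimensionality chosen by PCA MLE:", n_components_mle)
```

The estimate can then be compared with the rank used to generate the data and with the value selected by cross-validation.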

Authors: Alexandre Gramfort & Denis A. Engemann. License: BSD 3 clause. Edited by LKouadio on Tue Oct 11 16:54:26 2022.

watex.analysis.factor.LW_score(X, store_precision=True, assume_centered=False, **kws)

Model score from the Ledoit-Wolf covariance estimator.

Parameters:

store_precision (bool, default=True) – Specify if the estimated precision is stored.
assume_centered (bool, default=False) – If True, data will not be centered before computation. Useful when working with data whose mean is almost, but not exactly zero. If False (default), data will be centered before computation.
block_size (int, default=1000) – Size of blocks into which the covariance matrix will be split during its Ledoit-Wolf estimation. This is purely a memory optimization and does not affect results.

Notes

The regularised covariance is:

\[(1 - \text{shrinkage}) * \text{cov} + \text{shrinkage} * \mu * \text{np.identity(n\_features)}\]

where \(\mu = \text{trace(cov)} / n_{features}\) and shrinkage is given by the Ledoit and Wolf formula.

References

“A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices”, Ledoit and Wolf, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.
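The regularised covariance given in the Notes can be checked numerically. The sketch below fits scikit-learn's `LedoitWolf` estimator on random data (an assumption made purely for illustration) and rebuilds the shrunk covariance from the empirical covariance, the fitted `shrinkage_` and \(\mu\). It does not call `LW_score` itself.

```python
import numpy as np
from sklearn.covariance import LedoitWolf, empirical_covariance

rng = np.random.RandomState(0)
X = rng.randn(200, 20)  # illustrative data: 200 samples, 20 features

lw = LedoitWolf(store_precision=True, assume_centered=False).fit(X)

# Rebuild the regularised covariance by hand from the formula in the Notes.
emp_cov = empirical_covariance(X, assume_centered=False)
n_features = X.shape[1]
mu = np.trace(emp_cov) / n_features
manual_cov = (1 - lw.shrinkage_) * emp_cov + lw.shrinkage_ * mu * np.identity(n_features)

max_abs_diff = np.max(np.abs(manual_cov - lw.covariance_))
print("shrinkage:", lw.shrinkage_, "max abs difference:", max_abs_diff)
```

The difference should be at the level of floating-point round-off, which confirms that shrinkage pulls the empirical covariance toward a scaled identity matrix.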

watex.analysis.factor.compute_scores(X, n_features, n_components=5)

Compute PCA score and Factor Analysis scores from training X.

Parameters:

X (ndarray of shape (M, N), with M samples and N features) – Training set: the data observed at training and prediction time, used as independent variables in learning. The uppercase notation denotes that it is ordinarily a matrix. When it is a matrix, each sample may be represented by a feature vector or by a vector of precomputed (dis)similarities with each training sample. X may also not be a matrix, in which case a feature extractor or a pairwise metric may be required to turn it into one before learning a model.
n_features (int) – Number of features that compose X.
n_components (int, default=5) – Number of components to retrieve.

Returns:

Scores from PCA and FA from transformed X

Return type:

Tuple (pca_scores, fa_scores)
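A minimal sketch of how such scores can be obtained with scikit-learn is given below. It assumes the scores are mean cross-validated log-likelihoods, as in the scikit-learn example this module is based on. The helper `cv_scores_sketch`, the candidate grid and the demo data are hypothetical and are not part of the watex API.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

def cv_scores_sketch(X, n_components_grid):
    """Mean cross-validated log-likelihood of PCA and FA per candidate size."""
    pca_scores, fa_scores = [], []
    for n in n_components_grid:
        pca = PCA(n_components=n, svd_solver="full")
        fa = FactorAnalysis(n_components=n)
        # Without y, cross_val_score uses each estimator's own score(X),
        # which is the average log-likelihood of the held-out samples.
        pca_scores.append(np.mean(cross_val_score(pca, X)))
        fa_scores.append(np.mean(cross_val_score(fa, X)))
    return pca_scores, fa_scores

# Small illustrative data set: rank 3 signal in 8 features plus noise.
rng = np.random.RandomState(42)
X_demo = rng.randn(200, 3) @ rng.randn(3, 8) + 0.5 * rng.randn(200, 8)
grid = [1, 3, 5]
pca_scores, fa_scores = cv_scores_sketch(X_demo, grid)
print("PCA:", pca_scores)
print("FA: ", fa_scores)
```

The candidate size with the highest score is the one the corresponding model selects.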

watex.analysis.factor.make_scedastic_data(n_samples=1000, n_features=50, rank=10, sigma=1.0, random_state=42)

Generate sample data for comparing probabilistic PCA and Factor Analysis models.

By default: n_samples = 1000, n_features = 50, rank = 10.

Returns:

* X (sampling data)
* X_homo (sampling data with homoscedastic noise)
* X_hetero (sampling data with heteroscedastic noise)
* n_components (number of components for the 50 features)
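The kind of data described above can be sketched as follows. This follows the recipe of the scikit-learn example (a low rank signal, then noise with constant or per-feature variance). The function `make_scedastic_sketch` is a hypothetical stand-in, and the actual `make_scedastic_data` may differ in its details and also returns `n_components`.

```python
import numpy as np
from scipy import linalg

def make_scedastic_sketch(n_samples=1000, n_features=50, rank=10,
                          sigma=1.0, random_state=42):
    """Return low rank data and two noisy versions of it."""
    rng = np.random.RandomState(random_state)
    # Project `rank` latent variables onto orthonormal directions.
    U, _, _ = linalg.svd(rng.randn(n_features, n_features))
    X = rng.randn(n_samples, rank) @ U[:, :rank].T
    # Homoscedastic noise: one noise level shared by all features.
    X_homo = X + sigma * rng.randn(n_samples, n_features)
    # Heteroscedastic noise: a different noise level for each feature.
    sigmas = sigma * rng.rand(n_features) + sigma / 2.0
    X_hetero = X + rng.randn(n_samples, n_features) * sigmas
    return X, X_homo, X_hetero

X, X_homo, X_hetero = make_scedastic_sketch()
print(X.shape, X_homo.shape, X_hetero.shape)
```

Only the noise-free `X` has exact rank `rank`; both noisy versions are full rank, which is why a probabilistic model is needed to recover the underlying dimensionality.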

watex.analysis.factor.pcavsfa(X, rank=10, sigma=1.0, n_components=5, random_state=42, verbose=0, view=False)

Compute PCA score and Factor Analysis scores from training X and compare probabilistic PCA and Factor Analysis models.

Parameters:

X (ndarray of shape (M, N), with M samples and N features) – Training set: the data observed at training and prediction time, used as independent variables in learning. The uppercase notation denotes that it is ordinarily a matrix. When it is a matrix, each sample may be represented by a feature vector or by a vector of precomputed (dis)similarities with each training sample. X may also not be a matrix, in which case a feature extractor or a pairwise metric may be required to turn it into one before learning a model.
n_features (int) – Number of features that compose X.
n_components (int, default=5) – Number of components to retrieve.
rank (int, default=10) – Bounding value for the rank.
sigma (float, default=1.0) – Perturbation ratio applied to the data when adding heteroscedastic noise.
random_state (int, default=42) – Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.
verbose (int, default=0) – Controls the level of verbosity. Higher values lead to more messages.

Returns:

Scores from PCA and FA from transformed X

Return type:

Tuple (pca_scores, fa_scores)
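To make the comparison concrete, the sketch below fits both models on heteroscedastic data and evaluates the average log-likelihood on a held-out split. It uses scikit-learn directly with assumed settings (rank 10, 50 features, an 80/20 split) and is not the body of `pcavsfa`.

```python
import numpy as np
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
n_samples, n_features, rank, sigma = 1000, 50, 10, 1.0  # assumed settings

# Low rank signal with a different noise level for each feature.
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = rng.randn(n_samples, rank) @ U[:, :rank].T
sigmas = sigma * rng.rand(n_features) + sigma / 2.0
X_hetero = X + rng.randn(n_samples, n_features) * sigmas

X_train, X_test = train_test_split(X_hetero, test_size=0.2, random_state=0)

# score() returns the average log-likelihood of the held-out samples.
pca_ll = PCA(n_components=rank, svd_solver="full").fit(X_train).score(X_test)
fa_ll = FactorAnalysis(n_components=rank).fit(X_train).score(X_test)
print("PCA held-out log-likelihood:", pca_ll)
print("FA held-out log-likelihood: ", fa_ll)
```

Because FA estimates one noise variance per feature while probabilistic PCA assumes a single shared variance, FA would be expected to score higher on data of this kind.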

watex.analysis.factor.shrunk_cov_score(X)

Compute the score of the shrunk covariance estimator.

Parameters:

X (ndarray of shape (M, N), with M samples and N features) – Training set: the data observed at training and prediction time, used as independent variables in learning. The uppercase notation denotes that it is ordinarily a matrix. When it is a matrix, each sample may be represented by a feature vector or by a vector of precomputed (dis)similarities with each training sample. X may also not be a matrix, in which case a feature extractor or a pairwise metric may be required to turn it into one before learning a model.

Returns:

score – Score of the best covariance estimator with shrinkage.
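One way to obtain such a score, following the scikit-learn example, is to grid-search the shrinkage coefficient of `ShrunkCovariance` and then cross-validate the best estimator. The sketch below assumes a log-spaced grid of 30 shrinkage values and random demo data. The name `shrunk_cov_score_sketch` is hypothetical, and the watex function may be implemented differently.

```python
import numpy as np
from sklearn.covariance import ShrunkCovariance
from sklearn.model_selection import GridSearchCV, cross_val_score

def shrunk_cov_score_sketch(X):
    """Mean cross-validated log-likelihood of the best shrunk covariance."""
    shrinkages = np.logspace(-2, 0, 30)  # assumed search grid
    search = GridSearchCV(ShrunkCovariance(), {"shrinkage": shrinkages})
    best = search.fit(X).best_estimator_
    return np.mean(cross_val_score(best, X))

rng = np.random.RandomState(0)
X_demo = rng.randn(150, 10)  # illustrative data
score = shrunk_cov_score_sketch(X_demo)
print("Shrunk covariance score:", score)
```

The returned value is on the same scale (average log-likelihood per sample) as the PCA and FA scores, so the three can be compared directly.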