watex.utils.linkage_matrix

watex.utils.linkage_matrix(df, columns=None, kind='design', metric='euclidean', method='complete', as_frame=False, optimal_ordering=False)

Compute the linkage matrix of a hierarchical clustering from the distances between samples.

Parameters:
  • df (DataFrame or ndarray of shape (n_samples, n_features)) – Input data as a DataFrame or a NumPy array. If an array is given, columns must be specified so that the names match the array's second dimension (shape[1]).

  • columns (list) – List of labels naming each column of the (n_samples, n_features) array. Not needed if a DataFrame is given.

  • kind (str, ['squareform'|'condense'|'design'], default is {'design'}) – The form of the input passed to the linkage computation. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix; this is the form that pdist returns, and all of its elements must be finite (no NaNs or infs). Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. Using the squareform distance matrix can yield different distance values than expected. The design approach uses the complete input sample matrix, also called the ‘design matrix’, and leads to a correct linkage matrix, consistent with the condense approach.

  • metric (str or callable, default is {'euclidean'}) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances(). If X is the distance array itself, use “precomputed” as the metric. Precomputed distance matrices must have 0 along the diagonal.

  • method (str, optional, default is {'complete'}) – The linkage algorithm to use. See the Linkage Methods section of scipy.cluster.hierarchy.linkage() for full descriptions.

  • optimal_ordering (bool, optional) – If True, the linkage matrix is reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. Defaults to False because this algorithm can be slow, particularly on large datasets. See also scipy.cluster.hierarchy.linkage().
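The difference between the three kind options can be sketched with SciPy alone. This is an illustration, not the watex implementation: it assumes that each option corresponds to passing a different input form to scipy.cluster.hierarchy.linkage(), and the data and variable names below are hypothetical.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((5, 3))  # hypothetical design matrix: 5 samples, 3 features

# 'condense' style: flat upper-triangular distance vector, as returned by pdist.
condensed = pdist(X, metric="euclidean")
from_condensed = linkage(condensed, method="complete")

# 'design' style: the raw (n_samples, n_features) matrix; linkage computes
# the pairwise distances itself.
from_design = linkage(X, method="complete", metric="euclidean")

# 'squareform' style: a 2-D input is read as observation vectors, so the
# 5 x 5 distance matrix is treated as 5 samples in 5 dimensions and the
# distances are recomputed on its rows.
square = squareform(condensed)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # SciPy warns about this likely mistake
    from_square = linkage(square, method="complete", metric="euclidean")

same_as_design = np.allclose(from_condensed, from_design)
square_differs = not np.allclose(from_square[:, 2], from_design[:, 2])
print(same_as_design, square_differs)
```

In this sketch the condensed and design inputs give the same merges, while the squareform input gives different merge distances, which is why the squareform result should be interpreted with care.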

Returns:

row_clusters – Consists of several rows, where each row represents one merge. The first and second columns denote the most dissimilar members of each cluster, and the third column reports the distance between those members.

Return type:

linkage matrix
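A minimal sketch of how a linkage matrix can be read and cut into flat clusters. It uses SciPy directly and assumes that row_clusters follows SciPy's layout (one row per merge, with the merge distance in the third column and the size of the merged cluster in the fourth); the sample data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

row_clusters = linkage(X, method="complete", metric="euclidean")

n_samples = X.shape[0]
n_merges = row_clusters.shape[0]               # one row per merge: n_samples - 1
final_cluster_size = int(row_clusters[-1, 3])  # last merge contains every sample

# Cut the tree into at most two flat clusters.
labels = fcluster(row_clusters, t=2, criterion="maxclust")
print(n_merges, final_cluster_size, labels)
```

The same matrix can be passed to scipy.cluster.hierarchy.dendrogram() to draw the tree.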