watex.utils.reduce_samples
- watex.utils.reduce_samples(*data, sname, zname=None, kname=None, section_indexes=None, error='raise', strategy='average', verify_integrity=False, ignore_index=False, **kws)
Create a new dataframe by squeezing (compressing) the invalid data.
The m-samples reduction is necessary for datasets with many missing k-values. Shrinking the number of k0-values (missing k-values) is a practical remedy. It consists of compressing the rows with missing k-values, from the top (depth equal to 0) down to the upper section of the first aquifer (the one at the lowest depth), into a single vector of dimension \(1 \times n\), i.e. one row that contains the n features.
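The compression described above can be pictured with a small pandas sketch. It is not the watex implementation: the helper names `compress_rows` and `reduce_above_section`, the toy column names and the choice to average numeric columns while keeping the first value of the others are all assumptions made for illustration.

```python
import pandas as pd


def compress_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse all rows of `df` into one row (hypothetical helper).

    Numeric columns are averaged; other columns keep their first value.
    """
    numeric = df.select_dtypes(include="number").columns
    row = {
        col: df[col].mean() if col in numeric else df[col].iloc[0]
        for col in df.columns
    }
    return pd.DataFrame([row], columns=df.columns)


def reduce_above_section(df: pd.DataFrame, upper: int) -> pd.DataFrame:
    """Replace the rows located before position `upper` with one row."""
    head = compress_rows(df.iloc[:upper])
    return pd.concat([head, df.iloc[upper:]])


# Toy log: the first three samples have no k-value (assumed data).
logs = pd.DataFrame(
    {
        "strata_name": ["mudstone", "siltstone", "mudstone", "sandstone", "coal"],
        "depth": [0.0, 10.0, 20.0, 30.0, 40.0],
        "k": [float("nan"), float("nan"), float("nan"), 0.5, 0.7],
    }
)

reduced = reduce_above_section(logs, upper=3)
print(reduced)  # three rows: one compressed row, then the two aquifer rows
```

The compressed row gets the label 0 and the aquifer rows keep their original labels, which resembles the index layout shown in the example at the end of this page.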
- Parameters
data (list of dataframes) – Data that mainly contains the aquifer values. It must contain the depth values, referred to by the column name passed to zname, and the permeability coefficient k, referred to by the column name passed to kname. Both arguments need to be supplied when dataframes are passed as positional arguments.
sname (str) – Name of the column in the dataframe that contains the strata values. Do not confuse 'sname' with 'stratum', which is the name of the valid layer/rock in the array/Series of strata.
zname (str, int) – Name of the depth column. zname is used to retrieve the depth column from a dataframe. If an integer is passed, it is taken as the position of the depth column in the dataframe. An integer value must not exceed the dataframe size along axis 1. Commonly, zname needs to be supplied when a dataframe is passed as a function argument.
kname (str, int) – Name of the permeability coefficient column. kname is used to retrieve the permeability coefficient 'k' from a specific dataframe. If an integer is passed, it is taken as the position of the 'k' column in the dataframe. Note that an integer value must not exceed the dataframe size along axis 1. Commonly, kname needs to be supplied when a dataframe is passed as a positional or keyword argument.
z (array-like 1d, pandas.Series) – Array of depths or a pandas Series that contains the depth values. Arrays with two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.
strategy (str, default='average' or 'mean') – Strategy used to select or compute the numerical data into a single series. It can also be 'naive'; in that case, a single series is randomly picked from the base strata data.
section_indexes (tuple or list of int) – A pair of integers, given as a tuple or a list, holding the upper and lower indexes of the valid section of the aquifer. If the depth range z_range and zname are supplied, section_indexes can be None. Note that the last index is treated as the last position, i.e. the bottom of the section, so its value is included in the data.
error (str, default='raise') – Raise an error if a problem occurs when computing the section of each aquifer. If 'ignore', a UserWarning is issued when invalid data is found. Any other value is treated as 'raise'.
verify_integrity (bool, default=False) –
Check the new index for duplicates. Otherwise, defer the check until necessary. Setting this to False improves the performance of this method. If True, duplicate rows are removed from the DataFrame.
subset: By default, rows are considered duplicates if they have the same values in all columns. This parameter specifies the only columns to consider when identifying duplicates.
keep: Determines which duplicates (if any) to keep. 'first' drops duplicates except for the first occurrence (this is the default behavior); 'last' drops duplicates except for the last occurrence; False drops all duplicates.
inplace: Boolean flag (default False) that specifies whether to return a new DataFrame or update the existing one.
ignore_index (bool, default=False) – Boolean flag indicating whether the row index should be reset after dropping duplicate rows. False keeps the original row index; True resets the index, so the resulting rows are labeled 0, 1, …, n – 1.
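The subset, keep, inplace and ignore_index descriptions above match the parameters of pandas.DataFrame.drop_duplicates. The sketch below shows what those options do on a toy frame; whether reduce_samples forwards them unchanged to pandas is an assumption, and the frame itself is made up for illustration.

```python
import pandas as pd

# Toy frame with a repeated index label and a fully duplicated row.
frame = pd.DataFrame(
    {
        "strata_name": ["mudstone", "mudstone", "siltstone", "mudstone"],
        "k": [0.1, 0.1, 0.3, 0.2],
    },
    index=[0, 0, 16, 17],
)

# Default: rows are duplicates only if every column matches;
# the first occurrence is kept and the original labels survive.
all_cols = frame.drop_duplicates()

# subset + keep='last': compare strata_name only, keep the last occurrence.
by_name_last = frame.drop_duplicates(subset=["strata_name"], keep="last")

# keep=False: drop every row that belongs to a duplicated group.
no_dups = frame.drop_duplicates(subset=["strata_name"], keep=False)

# ignore_index=True: relabel the remaining rows 0, 1, ..., n - 1.
reset = frame.drop_duplicates(ignore_index=True)

print(list(all_cols.index))
print(list(by_name_last.index))
print(list(no_dups.index))
print(list(reset.index))
```

Leaving inplace at its default of False returns a new DataFrame each time, so the toy frame stays unchanged between calls.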
- Returns
df_new – New dataframes with reduced samples.
- Return type
list of pandas.DataFrame
Example
>>> from watex.datasets import load_hlogs
>>> from watex.utils.hydroutils import reduce_samples
>>> data = load_hlogs().frame  # get the frames
>>> # add explicitly the aquifer section indices
>>> dfnew = reduce_samples(data.copy(), sname='strata_name',
...                        section_indexes=(16, 29))
>>> dfnew[0]
    hole_number               strata_name    rock_name  ...      r     rp  remark
0          H502                  mudstone          J2z  ...    NaN    NaN     NaN
16         H502                 siltstone          NaN  ...  35.74  59.23     NaN
17         H502    fine-grained sandstone          NaN  ...  35.74  59.23     NaN
18         H502                 siltstone          NaN  ...  35.74  59.23     NaN
19         H502    fine-grained sandstone          NaN  ...  35.74  59.23     NaN
20         H502                  mudstone          NaN  ...  35.74  59.23     NaN
21         H502                 siltstone          NaN  ...  35.74  59.23     NaN
22         H502    fine-grained sandstone          NaN  ...  59.61  59.23     NaN
23         H502                 siltstone          NaN  ...  59.61  59.23     NaN
24         H502    fine-grained sandstone          NaN  ...  59.61  59.23     NaN
25         H502  Coarse-grained sandstone          NaN  ...  59.61  59.23     NaN
26         H502                  mudstone          NaN  ...  82.33  59.23     NaN
27         H502    fine-grained sandstone          NaN  ...  82.33  59.23     NaN
28         H502  Coarse-grained sandstone          J2z  ...  82.33  59.23     NaN
29         H502                      coal  (J2y) 2coal  ...  82.33  59.23     NaN
0          H502                 siltstone          NaN  ...    NaN    NaN     NaN
[16 rows x 23 columns]
>>> # specify zname and kname instead of the section indexes
>>> dfnew = reduce_samples(
...     data.copy(), sname='strata_name', zname='depth', kname='k',
...     ignore_index=True)
>>> dfnew[0].index  # index is reset
RangeIndex(start=0, stop=16, step=1)