Glossary#

This glossary sets out the conventions applied in watex and its API, and serves as a reference for users. The terms are grouped into the following sections: Geosciences concepts, Computer concepts, Machine Learning concepts, Estimators and assessors methods, Core parameters, Plotting parameters, Miscellaneous, and Resource-links.

Geosciences concepts#

Audio-frequency magnetotelluric#
Audio-magnetotelluric#
AMT#

AMT is a high-resolution, non-seismic geophysical technique that measures variations in the Earth’s natural electromagnetic fields to detect electrical resistivity variations in the subsurface at shallow to intermediate depths.

Aquifer#

In hydrology, an aquifer is defined as a rock layer that contains water and releases it in appreciable amounts. The rock contains water-filled pore spaces and, when the spaces are connected, the water is able to flow through the matrix of the rock.

Borehole#

A deep hole made in the ground when looking for oil, gas, or water, as in “We must sink a borehole so that people will have water.” It can also be defined as a narrow shaft bored in the ground, either vertically or horizontally. A borehole may be constructed for many different purposes, including the extraction of water (drilled water well and tube well), other liquids (such as petroleum), or gases (such as natural gas). In our case, the objective is water.

Conductivity#

Conductivity is often represented by \(\sigma\). It is the inverse of resistivity \(\rho\): \(\sigma = 1/\rho\). Conductivity is given in units of siemens per metre (S/m). Millisiemens per metre (\(mS/m\)) are often used; \(1000\ mS/m = 1\ S/m\). So a conductivity of \(1\ mS/m\) corresponds to a resistivity of \(1000\ \Omega.m\).
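The unit relationships in this entry can be checked numerically. The following is a minimal sketch, assuming only that resistivity is the reciprocal of conductivity; the function names are illustrative and are not part of watex.

```python
def ms_per_m_to_s_per_m(sigma_ms_per_m: float) -> float:
    """Convert a conductivity from millisiemens per metre to siemens per metre."""
    return sigma_ms_per_m / 1000.0


def conductivity_to_resistivity(sigma_s_per_m: float) -> float:
    """Return the resistivity in ohm-metres for a conductivity given in S/m."""
    if sigma_s_per_m <= 0:
        raise ValueError("conductivity must be strictly positive")
    return 1.0 / sigma_s_per_m


# A conductivity of 1 mS/m expressed as a resistivity in ohm-metres.
rho = conductivity_to_resistivity(ms_per_m_to_s_per_m(1.0))
print(rho)
```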

Controlled-source audio-frequency magnetotelluric#
Controlled-source audio-magnetotelluric#
CSAMT#

CSAMT stands for controlled-source audio-magnetotellurics. It is a frequency-domain electromagnetic sounding technique which uses a fixed grounded dipole or horizontal loop as an artificial signal source. Mostly, it involves transmitting a current at various frequencies in one location, and measuring resistivity differences between electrodes spaced along a receiver line several kilometers from the transmitter. CSAMT has a low environmental impact and is used extensively in mineral, geothermal, and groundwater exploration, and in some hydrocarbon applications.

Controlled source electromagnetic#
CSEM#

It stands for Controlled Source Electromagnetics (CSEM), commonly known as Marine CSEM. It is the collective term for techniques that can be used to investigate the geological subsurface using electromagnetic signals generated by artificial and controllable source systems operated in a marine setting. This is possible because the various subsurface strata are made up of materials with different electromagnetic properties in terms of their resistivity/conductivity and chargeability. The differences in resistivity between different materials enable us to use electromagnetic signals to map geological formations in the subsurface. When an electromagnetic field propagates through the different formations, it becomes successively influenced and modified by the resistivities of the different strata it encounters.

Deadband#
Dead-band#
Attenuation band#
Neutral zone#

It is defined as a band of input values in the domain of a transfer function in a control system or signal processing system where the output is zero (the output is ‘dead’: no action occurs). In short-period EM, the attenuation band centered at 2000 Hz is a potential problem when collecting NSAMT data. In this band natural source signals are generally absent. Although it is commonly called the attenuation band, the actual problem is that the atmospheric waveguide channeling natural source electromagnetic signals is transparent to these frequencies, so the natural electromagnetic (EM) energy is dissipated. Often, cultural electromagnetic noise fills the gap. Recovering the missing or weak signal in that bandwidth is a real challenge.

DC-resistivity#
DC#

It stands for direct current resistivity methods. It entails injecting a steady state electrical current into the ground and observing the resulting distribution of potentials (voltages) at the surface or within boreholes. Like all geophysical processes, DC surveys can be described in terms of input energy, the earth’s physical properties, and signals or data that are measured.

Drilling#
Drill#

Drilling is a material-removing or cutting process in which a drill bit is used to cut a hole of circular cross-section in solid materials. It is the most common machining process: one estimate is that 75% of all metal-cutting material removed comes from drilling operations.

Drinking water supply campaign#
Drinking water supply project#
Campaign for drinking water supply#
DWSC#
CDWS#

It is a project heavily financed by global organizations (UNICEF, UNESCO) in collaboration with state governments to supply populations in rural and urban areas with potable water. It is mostly registered under Sustainable Development Goal 6. In developing countries, some DWSCs tend to cover 1,000 or more villages per year for the welfare of the population.

Electrical Data Interchange#
EDI#

EDI is a data format for electrical geophysical data. Its purpose is to facilitate data exchange from contractors to clients and among clients. It also serves a very important function as a standard format for archiving data. The format has been designed to accommodate a wide variety of different field measurement configurations and processing requirements which might arise from special applications or new developments. Although very similar in appearance to some existing formats, it has been slightly modified to be more consistent, unambiguous, extensible, and realizable. A number of refinements have been incorporated as a result of actually implementing the standard. The format can accommodate any currently used data acquisition configuration, and has the flexibility to handle any reasonable future configuration. Supported configurations include single and multi-station MT, telluric-magnetotelluric sounding, EMAP profiling, and combinations of the above. Provisions have been made for interchange of data at various levels of acquisition and processing, including time series, power spectra, impedance, and computed parameters including non-standard parameters. The data interchange file is always an ASCII file. However, a provision has been made to store actual data values in a parallel binary data file and reference them through the (ASCII) data interchange file. See the EDI file format construction in the document of Wight and Drive, 1988.

Electrical resistivity profiling#
DC-Profiling#
ERP#

It is an easy-to-use electrical geophysical method. It is used to select the best conductive zone (anomaly) where the drilling operations are expected to be performed. Because it is cheap, it is mostly preferred by local companies and drilling ventures, who use it to carry out their surveys while staying on the project timeline and saving money. Moreover, it is used first, before the DC-sounding, to select the conductive zone based on the resistivity values that characterize a fracture expected to contain groundwater.

Electromagnetic#
EM#

It means relating to the science that deals with the relationship between electricity and magnetism.

Electromagnetic array profiling#
EMAP#

A variant of the magnetotelluric method (q.v.) where electric field measurements along a profile are made in a continuous fashion. Data is then processed to remove the effects of near-surface inhomogeneities.

Flow-rate#
Flow rate#
flow#
FR#

Flow describes liquids, gases, or electricity moving in one direction, especially continuously. For example:

  • lava from the volcano flowing down the hillside;

  • many short rivers flowing into the Pacific Ocean;

  • a river flowing through three counties before reaching the sea.

Here, when we talk about the flow rate, we are mostly focused on the underground water flow (groundwater) circulating in the fractures, obtained during or after the drilling operations. It can be defined as the mass or volume of water that passes per unit of time. The unit is mostly denoted \(m^3/hr\).

Geology#
Geological#

Geology is a branch of natural science concerned with Earth and other astronomical objects, the features or rocks of which it is composed, and the processes by which they change over time.

Geophysic#
Geophysical#

Geophysics is a subject of natural science concerned with the physical processes and physical properties of the Earth and its surrounding space environment, and the use of quantitative methods for their analysis. The term geophysics classically refers to solid earth applications.

Geosciences#

Geosciences is the study of the Earth: its oceans, atmosphere, rivers and lakes, ice sheets and glaciers, soils, its complex surface, rocky interior, and metallic core. This includes many aspects of how living things, including humans, interact with the Earth.

Groundwater#
Underground water#

It is the water present beneath Earth’s surface in rock and soil pore spaces and in the fractures of rock formations.

Groundwater exploration#

Groundwater exploration is the investigation of underground formations to understand the hydrologic cycle, know the groundwater quality, and identify the nature, number and type of aquifers. There are different groundwater exploration methods, among them the surface geophysical methods. One of the surface geophysical methods is the vertical electrical sounding (VES) method. VES provides valuable information regarding the vertical succession of subsurface geo-materials in terms of their individual thicknesses and corresponding resistivity values. It is rapid and effective in estimating the aquifer thickness of an area, and is a cost-effective technique for groundwater study.

GWE#

Shorthand for groundwater exploration.

Hydrogeology#

Hydrogeology (hydro- meaning water, and -geology meaning the study of the Earth) is the area of geology that deals with the distribution and movement of groundwater in the soil and rocks of the Earth’s crust (commonly in aquifers). The terms groundwater hydrology, geohydrology, and hydrogeology are often used interchangeably. It can also be defined as a study of the laws governing the movement of subterranean water, the mechanical, chemical, and thermal interaction of this water with the porous solid, and the transport of energy, chemical constituents, and particulate matter by flow.

Hydrogeophysics#
Hydro-geophysics#
hydrogeophysical#
hydro-geophysical#

Hydrogeophysics is a cross-disciplinary area of research that uses geophysics to determine parameters (characteristics; measurements of limitations or boundaries) and monitor processes for hydrological studies of matters such as water resources, contamination, and ecological studies. The field draws on knowledge and researchers from geology, hydrology, physics, geophysics, engineering, statistics, and rock physics. It uses geophysics to provide quantitative information about hydrogeological parameters, using minimally invasive methods. Hydrogeophysics differs from geophysics in its specific uses and methods. Although geophysical knowledge and methods have existed and grown over the last half century for applications in the mining and petroleum industries, hydrogeological study sites have different subsurface conditions from those industries.

Hydrology#
Hydrologic#

Hydrology is an important field of geosciences, dealing with water as one of the most valuable resources on Earth. All aspects of the Earth’s available water are studied by experts from many disciplines, from geologists to engineers, to obtain the information needed to manage this vital resource.

Magnetotelluric#
MT#

It is an electromagnetic geophysical method for inferring the earth’s subsurface electrical conductivity from measurements of natural geomagnetic and geoelectric field variation at the Earth’s surface. Investigation depth ranges from 300 m below ground by recording higher frequencies down to 10,000 m or deeper with long-period soundings.

Natural source audio-frequency magnetotelluric#
Natural source audio magnetotelluric#
NSAMT#

It is an EM method that involves the measurement of electromagnetic fields that are generated by natural electromagnetic activity above the earth’s surface. Natural source signals are generated in the atmosphere and magnetosphere. MT signals (low frequencies < 1 Hz) are generated by the interaction between the earth’s magnetosphere and the solar wind, sunspot activity and auroras. High frequency sources (> 1 Hz) in the audio range (AMT) are generated by worldwide thunderstorms and lightning. These time-varying electric and magnetic fields induce currents in the earth and oceans.

ohmic-area#
ohmS#

It is also called the pseudo-area of the fractured zone. The ohmS is a fully data-driven parameter used to evaluate a pseudo-area of the fracture zone from the depth where the basement rock is supposed to start. Indeed, when exploring deeper using the VES, we are usually looking for groundwater in the fractured rock that lies beyond the reach of anthropic pollution. Since the VES is an indirect method, we cannot ascertain whether the presumed fractured rock contains water. However, we assume that the fracture zone could exist and should contain groundwater. Mathematically, the computed area constitutes the expected “fracture zone” and is expressed in \(\Omega.m^2\). Thus, \(ohmS \neq 0\) confirms the existence of the fracture zone, while \(ohmS = 0\) raises doubts.

PNAEP#
National drinking water supply program#
Programme national d’adduction en eau potable#

PNAEP is a large program funded by the Cote d’Ivoire government in partnership with international organizations such as the World Bank and UNICEF to bring drinking water to rural and urban populations in several villages and towns within the country. The program ran in two phases, from 2012 to 2014 and from 2015 to 2016. During this program, the local geophysical companies were assisted by the West-Africa International Drilling Company (FORACO-CI) for the development of drilling after geophysical exploration.

PPU#
Emergency presidential program#
Programme presidentiel d’urgence#

The PPU is a Cote d’Ivoire government initiative launched a few months after the inauguration of Dr Alassane Ouattara as president of Cote d’Ivoire. The project was opened just after the politico-military crisis. It ran from 2012 to 2013 and was relaunched from 2013 to 2014. The objective was the same as that of the PNAEP, but it was financed by the state government, with a small part of the funds collected from the NGO Children of Africa led by Dominique Ouattara, First Lady of Côte d’Ivoire. See some other resources in Mel et al., 2018.

pseudo-fracturing index#
IF#
sfi#

It is a data-driven parameter computed from DC-sounding. It does not indicate the degree of rock fracturing in the underground, but it is used to speculate about the apparent resistivity dispersion ratio around the cumulative sum of the resistivity values of the selected anomaly. It uses a similar approach to the fracturing ratio (IF) parameter proposed by Dieng et al. (2004).

permeability coefficient k#
permeability coefficient#
coefficient of permeability#
k#

Simply, it is the rate of flow of water under laminar flow conditions through a unit cross-sectional area of a porous medium under a unit hydraulic gradient and a standard temperature, usually 20 degrees C. In physics, it is defined as the volume of an incompressible fluid that will flow in unit time through a unit cube of a porous substance across which a unit pressure difference is maintained. Consult the documentation for further details.

Required-flow rate#
Required flow rate#
Required FR#
Flow rate required#
RFR#

RFR is defined as the flow rate that is requested or required by contractors or funders to supply a specific locality with potable water. It is strongly tied to the number of inhabitants of that locality. For instance, the FR recommended to supply a village with an estimated population of 2,000 inhabitants should naturally be different from the FR for a locality of 20,000 inhabitants, i.e. 10 times the population. The objective of the funders when they require this FR is to address the issue of water scarcity in the long term by minimizing the impact of climate change during the dry season (especially for populations living in Africa).

Resistance#

It is simply the measured voltage over the measured (known) current (which is Ohm’s Law). Resistance will change if the measurement geometry or the volume of material changes. Therefore, it is NOT a physical property.

Resistivity#

It is basically the resistance per unit volume. It is defined as the voltage measured across a unit cube’s length (volts per metre, or V/m) divided by the current flowing through the unit cube’s cross-sectional area (amps per metre squared, or \(A/m^2\)). This results in units of \(\Omega.m^2/m\) or \(\Omega.m\). The Greek symbol \(\rho\) is often used to represent resistivity.
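To make the distinction between resistance and resistivity concrete, here is a minimal sketch for a uniform sample of length L and cross-sectional area A. It assumes the standard relation between the two quantities, resistivity = R × A / L; the function names and the numeric values are illustrative only.

```python
def resistance(voltage_v: float, current_a: float) -> float:
    """Ohm's law: resistance in ohms from a voltage (V) and a current (A)."""
    return voltage_v / current_a


def resistivity(resistance_ohm: float, area_m2: float, length_m: float) -> float:
    """Resistivity in ohm-metres of a uniform sample: rho = R * A / L."""
    return resistance_ohm * area_m2 / length_m


# Two samples of the same material, the second one twice as long.
r_short = resistance(voltage_v=10.0, current_a=0.5)
r_long = resistance(voltage_v=20.0, current_a=0.5)

rho_short = resistivity(r_short, area_m2=2.0, length_m=1.0)
rho_long = resistivity(r_long, area_m2=2.0, length_m=2.0)

print(r_short, r_long)      # the resistance depends on the geometry
print(rho_short, rho_long)  # the resistivity does not
```

The resistance doubles when the sample length doubles, while the resistivity stays the same, which is why only the latter is treated as a physical property of the material.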

Society of Exploration Geophysicists#
SEG#

SEG is a learned society dedicated to promoting the science and education of exploration geophysics in particular and geophysics in general. The Society fosters the expert and ethical practice of geophysics in the exploration and development of natural resources, in characterizing the near-surface, and in mitigating earth hazards. See more in SEG.ORG.

Shape#

The shape parameter is mostly used in the basement medium to depict the better conductive zone for the drilling location. Various anomaly shapes can be described, such as “W”, “U”, “V”, “H”, “C”, “K”, and “M”. The choice of the anomaly shape is sometimes influenced by external project requirements. Commonly, this situation arises during the campaigns for VH/IVH systems. One of these constraints is minimizing the distance between the center of the village and the future borehole (expected drilling location). In such circumstances, the order of priority of the anomaly shape is not always followed. For instance, the shape W found at 1000 m from the center of the village should be replaced by the shape V found at 100 m if there is no other shape W or U that satisfies the project conditions. One more constraint is the existence of anthropic waste sites in the vicinity of the ERP line. Thus, an anomaly of higher priority (W or U) found close to polluted waste sites is systematically rejected in favor of the shape with the next lower priority. In CDWS, especially in rural areas, the external project constraints constitute an important factor that influences the choice of the anomaly shape and must be taken into consideration.

Signal-to-noise ratio#
Signal-to-Noise Ratio#
S/N#
SNR#

SNR is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. SNR is defined as the ratio of signal power to noise power, often expressed in decibels. A ratio higher than 1:1 (greater than 0 dB) indicates more signal than noise. The signal-to-noise ratio is difficult to quantify accurately because it is difficult to completely separate signal from noise. It also depends on how noise is defined. For further details, consult the related documentation.
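The decibel form of the ratio can be sketched as follows. This is a minimal illustration using the usual power-ratio definition, 10·log10(P_signal / P_noise); the function name is hypothetical and not taken from watex.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two powers in the same unit."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be strictly positive")
    return 10.0 * math.log10(signal_power / noise_power)


print(snr_db(1.0, 1.0))    # equal powers: the 1:1 ratio
print(snr_db(100.0, 1.0))  # signal power 100 times the noise power
print(snr_db(1.0, 10.0))   # more noise than signal gives a negative value
```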

tensor#
EM-tensor#
EM field tensor#

In electromagnetism, the tensor or EM field tensor (sometimes called the field strength tensor, Faraday tensor or Maxwell bivector) is a mathematical object that describes the EM field in spacetime. In EM surveys, different array types are considered for acquiring the most beneficial data for a particular area. The most common are scalar and tensor arrays. Typically, scalar data are collected in the traverse direction with multiple Ex measurements and one Hy. Tensor NSAMT data measure additional components (Ey and Hx) which provide information about directionality. Ex and Ey refer to the directions of the array used to measure the electric field. Hx and Hy refer to the directions used to measure the magnetic field. Every Ex has an associated orthogonal Hy measurement, and every Ey has an associated Hx. xx, xy, yx and yy are called the tensor components. Data acquired according to xy and yx mostly refer to the transverse electric (TE) and transverse magnetic (TM) modes respectively. Download the following book for further details about EM methods.

Type#
type_#

The type parameter comes after the shape as a second criterion. For instance, two anomalies with the same shape W will differ in the order of priority of their types. The type depends on the lateral resistivity distribution in the underground (resulting from the pace of the apparent resistivity curve) along the whole ERP line. According to the African Hydraulic Study Committee report (CIEH), four types of anomalies were emphasized:

  • EC: The extensive or large conductive zone (EC): when the width of the conductive area \(W_{cz}\) is greater than four times the dipole length everywhere in the ERP line;

  • CB2P: The contact between two planes (CB2P): defines the contact between two planes with different levels of the lateral distribution of the apparent resistivity values;

  • NC: The narrow conductive zone (NC): when the width of the conductive zone ranges between the dipole length (\(d_l\)) and four times the dipole length everywhere in the ERP line;

  • CP: The conductive plane zone (CP): shows a large and stretched conductive zone, and is evaluated along the whole ERP line.
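The two width-based rules in the list above (EC and NC) can be sketched as a simple comparison against the dipole length. This is only an illustration of those two rules for a single width value: the function name is hypothetical, the inclusive bounds for NC are an assumption, and CB2P and CP are not covered because they depend on the shape of the whole resistivity curve rather than on one width.

```python
from typing import Optional


def width_based_type(width_m: float, dipole_length_m: float) -> Optional[str]:
    """Return 'EC' or 'NC' from the width rules, or None if neither applies."""
    if dipole_length_m <= 0:
        raise ValueError("dipole length must be strictly positive")
    if width_m > 4 * dipole_length_m:
        return "EC"  # extensive conductive zone: wider than 4 dipole lengths
    if dipole_length_m <= width_m <= 4 * dipole_length_m:
        return "NC"  # narrow conductive zone (bounds assumed inclusive)
    return None      # narrower than one dipole length: not decided here


print(width_based_type(50.0, 10.0))
print(width_based_type(30.0, 10.0))
print(width_based_type(5.0, 10.0))
```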

Vertical electrical sounding#
DC-Sounding#
VES#

It is a simple and fast geophysical method. It is used mostly to speculate about the existence of a fracture zone and the layer thicknesses. Mostly, it comes after the DC-Profiling (ERP).

Water#
H2O#

It is an inorganic compound with the chemical formula \(H_2O\). It is a transparent, tasteless, odorless, and nearly colorless chemical substance, which is the main constituent of Earth’s hydrosphere and the fluids of all known living organisms (in which it acts as a solvent). See its different properties as parameters in the watex.property.Water class.

Well#

A well is an excavation or structure created in the ground by digging, driving, or drilling to access liquid resources, usually water. It should not be confused with the borehole. Mostly, the well depth should not be greater than 100 m and a well only refers to water, whereas a borehole needs more technical and advanced technologies, especially when it comes to seismic investigation.

Computer concepts#

These are a few concepts to help understand conventional terms in computer science. Note that mathematical concepts are also included. The list is not exhaustive.

1d#
1d array#

One-dimensional array. A NumPy array whose .shape has length 1. A vector.

2d#
2d array#

Two-dimensional array. A NumPy array whose .shape has length 2. Often represents a matrix.
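The difference between the two can be seen directly from the length of `.shape`. A minimal sketch with made-up values:

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])                     # 1d array
matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # 2d array

print(vector.shape, len(vector.shape))
print(matrix.shape, len(matrix.shape))
```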

API#

Refers to both the specific interfaces for estimators implemented in watex and the generalized conventions across types of estimators as described in this glossary and overviewed in the contributor documentation.

The specific interfaces that constitute watex’s public API are largely documented in API Reference. However, we less formally consider anything as public API if none of the identifiers required to access it begins with _. We generally try to maintain backwards compatibility for all objects in the public API.

Private API, including functions, modules and methods beginning _ are not assured to be stable.

array-like#

The most common data format for input to watex estimators and functions, array-like is any type of object for which numpy.asarray() will produce an array of appropriate shape (usually 1 or 2-dimensional) of appropriate dtype (usually numeric).

This includes:

  • a numpy array

  • a list of numbers

  • a list of length-k lists of numbers for some fixed length k

  • a pandas.DataFrame with all columns numeric

  • a numeric pandas.Series

It excludes:

Note that output from scikit-learn estimators and functions (e.g. predictions) should generally be arrays or sparse matrices, or lists thereof (as in multi-output tree.DecisionTreeClassifier’s predict_proba). An estimator where predict() returns a list or a pandas.Series is not valid.
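As a quick illustration of the array-like definition, the sketch below passes a few of the accepted inputs through numpy.asarray() and prints the resulting shape and dtype. Only NumPy inputs and plain lists are shown; the dictionary keys are just labels for this example.

```python
import numpy as np

inputs = {
    "numpy array": np.array([1.0, 2.0, 3.0]),
    "list of numbers": [1, 2, 3],
    "list of length-2 lists": [[1, 2], [3, 4], [5, 6]],
}

# numpy.asarray returns the input unchanged when it is already an array
# and builds a new array otherwise.
converted = {name: np.asarray(obj) for name, obj in inputs.items()}

for name, arr in converted.items():
    print(name, arr.shape, arr.dtype)
```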

attribute#
attributes#

We mostly use attribute to refer to how model information is stored on an estimator during fitting. Any public attribute stored on an estimator instance is required to begin with an alphabetic character and end in a single underscore if it is set in fit or partial_fit. These are what is documented under an estimator’s Attributes documentation. The information stored in attributes is usually either: sufficient statistics used for prediction or transformation; transductive outputs such as labels_ or embedding_; or diagnostic data, such as feature_importances_. Common attributes are listed below.

A public attribute may have the same name as a constructor parameter, with a _ appended. This is used to store a validated or estimated version of the user’s input. For example, decomposition.PCA is constructed with an n_components parameter. From this, together with other parameters and the data, PCA estimates the attribute n_components_.

Further private attributes used in prediction/transformation/etc. may also be set when fitting. These begin with a single underscore and are not assured to be stable for public access.

A public attribute on an estimator instance that does not end in an underscore should be the stored, unmodified value of an __init__ parameter of the same name. Because of this equivalence, these are documented under an estimator’s Parameters documentation.
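The naming conventions described in this entry can be summarized with a toy estimator. `MeanCenterer` is a hypothetical class written for this illustration; it is not part of watex or scikit-learn.

```python
class MeanCenterer:
    """Toy estimator illustrating the attribute naming conventions."""

    def __init__(self, with_mean=True):
        # Constructor parameter: stored unmodified under the same name.
        self.with_mean = with_mean

    def fit(self, X):
        # Private attribute: single leading underscore, no stability promise.
        self._n_seen = len(X)
        # Public fitted attribute: trailing underscore, set only during fit.
        self.mean_ = sum(X) / len(X) if self.with_mean else 0.0
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


est = MeanCenterer().fit([1.0, 2.0, 3.0])
print(est.with_mean, est.mean_, est.transform([2.0, 4.0]))
```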

backwards compatibility#

We generally try to maintain backward compatibility (i.e. interfaces and behaviors may be extended but not changed or removed) from release to release but this comes with some exceptions:

Public API only

The behavior of objects accessed through private identifiers (those beginning _) may be changed arbitrarily between versions.

As documented

We will generally assume that the users have adhered to the documented parameter types and ranges. If the documentation asks for a list and the user gives a tuple, we do not assure consistent behavior from version to version.

Deprecation

Behaviors may change following a deprecation period (usually two releases long). Warnings are issued using Python’s warnings module.

Keyword arguments

We may sometimes assume that all optional parameters (other than X and y to fit and similar methods) are passed as keyword arguments only and may be positionally reordered.

Bug fixes and enhancements

Bug fixes and – less often – enhancements may change the behavior of estimators, including the predictions of an estimator trained on the same data and random_state. When this happens, we attempt to note it clearly in the changelog.

Serialization

We make no assurances that pickling an estimator in one version will allow it to be unpickled to an equivalent model in the subsequent version.

Despite this informal contract with our users, the software is provided as is, as stated in the license. When a release inadvertently introduces changes that are not backward compatible, these are known as software regressions.

callable#

A function, class or an object which implements the __call__ method; anything for which callable() returns True when passed as its argument.
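A short sketch of what the built-in callable() reports for each kind of object mentioned above. The names `Scaler` and `double` are made up for the example.

```python
class Scaler:
    """Object whose instances are callable because __call__ is defined."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        return self.factor * value


def double(value):
    return 2 * value


checks = {
    "function": callable(double),
    "class": callable(Scaler),
    "instance with __call__": callable(Scaler(3)),
    "plain integer": callable(42),
}
print(checks)
```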

categorical feature#

A categorical or nominal feature is one that has a finite set of discrete values across the population of data. These are commonly represented as columns of integers or strings. Strings will be rejected by most scikit-learn estimators, and integers will be treated as ordinal or count-valued. For the use with most estimators, categorical variables should be one-hot encoded. Notable exceptions include tree-based models such as random forests and gradient boosting models that often work better and faster with integer-coded categorical variables.
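One-hot encoding can be sketched in a few lines of NumPy. This is a hand-written illustration with made-up category names, not the encoder used by watex; in practice a library encoder such as scikit-learn's `OneHotEncoder` would normally be preferred.

```python
import numpy as np


def one_hot(values):
    """Return the sorted categories and a 0/1 matrix with one column each."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return categories, encoded


categories, encoded = one_hot(["granite", "schist", "granite", "gneiss"])
print(categories)
print(encoded)
```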

Cython#

Cython is a programming language, a (non-strict, at least for Python 3) superset of the Python programming language (with slightly different semantics), designed to give C-like performance with code that is written mostly in Python with optional additional C-inspired syntax. See the Cython documentation for more details.

deprecation#

We use deprecation to slowly violate our backwards compatibility assurances, usually to:

  • change the default value of a parameter; or

  • remove a parameter, attribute, method, class, etc.

We will ordinarily issue a warning when a deprecated element is used, although there may be limitations to this. For instance, we will raise a warning when someone sets a parameter that has been deprecated, but may not when they access that parameter’s attribute on the estimator instance.
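A deprecation warning of this kind is typically raised with Python's warnings module. The sketch below is one possible pattern, assuming a hypothetical function `compute` whose `normalise` parameter has been renamed to `normalize`; neither name comes from watex.

```python
import warnings


def compute(values, normalise=None, normalize=False):
    """Hypothetical function whose `normalise` parameter was renamed."""
    if normalise is not None:
        warnings.warn(
            "'normalise' is deprecated; use 'normalize' instead.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        normalize = normalise
    total = sum(values)
    return [v / total for v in values] if normalize else list(values)


# Capture the warning to show that the old spelling still works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = compute([1, 3], normalise=True)

print(result, [str(w.message) for w in caught])
```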

dimensionality#

May be used to refer to the number of features (i.e. n_features), or columns in a 2d feature matrix. Dimensions are, however, also used to refer to the length of a NumPy array’s shape, distinguishing a 1d array from a 2d matrix.
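The two meanings can be told apart on a small array. In this minimal sketch the array contents are arbitrary zeros:

```python
import numpy as np

X = np.zeros((5, 3))           # 5 samples, 3 features
n_samples, n_features = X.shape

print(n_features)    # dimensionality in the sense of number of features
print(X.ndim)        # dimensionality in the sense of len(X.shape)
print(X[:, 0].ndim)  # a single column is a 1d array
```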

docstring#

The embedded documentation for a module, class, function, etc., usually in code as a string at the beginning of the object’s definition, and accessible as the object’s __doc__ attribute.

We try to adhere to PEP257, and follow NumpyDoc conventions.
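For illustration, here is what a short NumpyDoc-style docstring might look like and how it is reached through `__doc__`. The function `rectangle_area` is a made-up example, not a watex function.

```python
def rectangle_area(width, height):
    """Compute the area of a rectangle.

    Parameters
    ----------
    width : float
        Width of the rectangle.
    height : float
        Height of the rectangle.

    Returns
    -------
    float
        The product of `width` and `height`.
    """
    return width * height


# The docstring is available at runtime as the __doc__ attribute.
summary = rectangle_area.__doc__.splitlines()[0]
print(summary)
```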

double underscore#
double underscore notation#

When specifying parameter names for nested estimators, __ may be used to separate between parent and child in some contexts. The most common use is when setting parameters through a meta-estimator with set_params and hence in specifying a search grid in parameter search. See parameter. It is also used in pipeline.Pipeline.fit() for passing sample properties to the fit methods of estimators in the pipeline.
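The way a `parent__child` name breaks down can be sketched with plain string handling. This is only an illustration of the naming scheme, not the code used by scikit-learn or watex; `split_nested_params` and the parameter names are hypothetical.

```python
def split_nested_params(params):
    """Separate top-level parameters from 'step__name' nested parameters."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper levels
        # (e.g. 'a__b__c') stay attached to the child name.
        step, sep, name = key.partition("__")
        if sep:
            nested.setdefault(step, {})[name] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"verbose": True, "scaler__with_mean": False, "clf__C": 10.0, "clf__kernel": "rbf"}
)
print(own)
print(nested)
```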

dtype#
data type#

NumPy arrays assume a homogeneous data type throughout, available in the .dtype attribute of an array (or sparse matrix). We generally assume simple data types for scikit-learn data: float or integer. We may support object or string data types for arrays before encoding or vectorizing. Our estimators do not work with struct arrays, for instance.

Our documentation can sometimes give information about the dtype precision, e.g. np.int32, np.int64, etc. When the precision is provided, it refers to the NumPy dtype. If an arbitrary precision is used, the documentation will refer to dtype integer or floating. Note that in this case, the precision can be platform dependent. The numeric dtype refers to accepting both integer and floating.
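A short sketch of how the dtype of an array can be inspected, and how the generic integer and floating families can be tested with `numpy.issubdtype`. The array contents are arbitrary.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)  # explicit precision
floats = np.array([1.0, 2.5])               # default floating precision
mixed = np.array([1, 2.5])                  # integers are upcast to floating

print(ints.dtype, floats.dtype, mixed.dtype)
print(np.issubdtype(ints.dtype, np.integer))
print(np.issubdtype(floats.dtype, np.floating))
```

When no dtype is given for a list of Python integers, the resulting integer precision can depend on the platform, which is the situation the paragraph above refers to.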

early stopping#

This consists in stopping an iterative optimization method before the convergence of the training loss, to avoid over-fitting. This is generally done by monitoring the generalization score on a validation set.
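One common way to implement this monitoring is a patience counter: training stops once the validation score has failed to improve for a given number of consecutive epochs. The sketch below assumes that rule and uses made-up scores; the function name and the `patience` parameter are illustrative and do not describe a specific watex estimator.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return the epoch index at which training would stop."""
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores) - 1  # never triggered: ran to the last epoch


scores = [0.60, 0.70, 0.75, 0.74, 0.73, 0.72]
stop = early_stopping_epoch(scores, patience=2)
print(stop)
```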

estimator instance#

We sometimes use this terminology to distinguish an estimator class from a constructed instance. For example, in the following, cls is an estimator class, while est1 and est2 are instances:

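from sklearn.ensemble import RandomForestClassifier

cls = RandomForestClassifier     # the estimator class itself
est1 = cls()                     # an instance built from the class
est2 = RandomForestClassifier()  # another instance

examples#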

We try to give examples of basic usage for most functions and classes in the API:

  • as doctests in their docstrings (i.e. within the watex/ package code itself).

  • as examples in the example gallery rendered (using sphinx-gallery) from scripts in the examples/ directory, exemplifying key features or parameters of the estimator/function. These should also be referenced from the User Guide.

  • sometimes in the User Guide (built from doc/) alongside a technical description of the estimator.

experimental#

An experimental tool is already usable but its public API, such as default parameter values or fitted attributes, is still subject to change in future versions without the usual deprecation warning policy.

evaluation metric#
evaluation metrics#

Evaluation metrics give a measure of how well a model performs. We may use this term specifically to refer to the functions in metrics (disregarding metrics.pairwise), as distinct from the score method and the scoring API used in cross validation. See model_evaluation.

These functions usually accept a ground truth (or the raw data where the metric evaluates clustering without a ground truth) and a prediction, be it the output of predict (y_pred), of predict_proba (y_proba), or of an arbitrary score function including decision_function (y_score). Functions are usually named to end with _score if a greater score indicates a better model, and _loss if a lesser score indicates a better model. This diversity of interface motivates the scoring API.

Note that some estimators can calculate metrics that are not included in metrics and are estimator-specific, notably model likelihoods.

feature#
features#
feature vector#

In the abstract, a feature is a function (in its mathematical sense) mapping a sampled object to a numeric or categorical quantity. “Feature” is also commonly used to refer to these quantities, being the individual elements of a vector representing a sample. In a data matrix, features are represented as columns: each column contains the result of applying a feature function to a set of samples.

Elsewhere features are known as attributes, predictors, regressors, or independent variables.

Nearly all estimators in scikit-learn assume that features are numeric, finite and not missing, even when they have semantically distinct domains and distributions (categorical, ordinal, count-valued, real-valued, interval). See also categorical feature and missing values.

n_features indicates the number of features in a dataset.

fitting#

Calling fit (or fit_transform, fit_predict, etc.) on an estimator.

fitted#

The state of an estimator after fitting.

There is no conventional procedure for checking if an estimator is fitted. However, an estimator that is not fitted:

  • should raise exceptions.NotFittedError when a prediction method (predict, transform, etc.) is called. (utils.validation.check_is_fitted() is used internally for this purpose.)

  • should not have any attributes beginning with an alphabetic character and ending with an underscore. (Note that a descriptor for the attribute may still be present on the class, but hasattr should return False)
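The two conventions above can be illustrated with a toy estimator. Everything in this sketch is hypothetical: MeanEstimator, looks_fitted and the local NotFittedError class are stand-ins written for the example, not the watex or scikit-learn implementations.

```python
class NotFittedError(ValueError, AttributeError):
    """Local stand-in for exceptions.NotFittedError (assumption for this sketch)."""


class MeanEstimator:
    """Toy estimator that predicts the training mean."""

    def __init__(self, shift=0.0):
        self.shift = shift  # constructor parameter, stored unmodified

    def fit(self, X):
        # Fitted attributes end with an underscore.
        self.mean_ = sum(X) / len(X) + self.shift
        return self

    def predict(self, X):
        if not hasattr(self, "mean_"):
            raise NotFittedError("This MeanEstimator instance is not fitted yet.")
        return [self.mean_ for _ in X]


def looks_fitted(est):
    """True if any instance attribute starts with a letter and ends with '_'."""
    return any(name[0].isalpha() and name.endswith("_") for name in vars(est))


print(looks_fitted(MeanEstimator()), looks_fitted(MeanEstimator().fit([1.0, 2.0, 3.0])))
```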

function#

We provide ad hoc function interfaces for many algorithms, while estimator classes provide a more consistent interface.

In particular, watex may provide a function interface that fits a model to some data and returns the learnt model parameters, as in linear_model.enet_path(). For transductive models, this also returns the embedding or cluster labels, as in manifold.spectral_embedding() or cluster.dbscan(). Many preprocessing transformers also provide a function interface, akin to calling fit_transform, as in preprocessing.maxabs_scale(). Users should be careful to avoid data leakage when making use of these fit_transform-equivalent functions.

We do not have a strict policy about when to or when not to provide function forms of estimators, but maintainers should consider consistency with existing interfaces, and whether providing a function would lead users astray from best practices (as regards data leakage, etc.)

See examples.

hyperparameter#
hyper-parameter#

See parameter.

impute#
imputation#

Most machine learning algorithms require that their inputs have no missing values, and will not work if this requirement is violated. Algorithms that attempt to fill in (or impute) missing values are referred to as imputation algorithms.

indexable#

An array-like, sparse matrix, pandas DataFrame or sequence (usually a list).

induction#
inductive#

Inductive (contrasted with transductive) machine learning builds a model of some data that can then be applied to new instances. Most estimators in watex are inductive, having predict and/or transform methods.

joblib#

A Python library (https://joblib.readthedocs.io) used in watex to facilitate simple parallelism and caching. Joblib is oriented towards efficiently working with numpy arrays, such as through use of memory mapping. See parallelism for more information.

kernel#

Specifies the kernel function to be used by kernel method algorithms. For example, the estimator watex.exlib.SVC has a kernel parameter that takes either the name of the kernel to use, as a string, or a callable kernel function used to compute the kernel matrix.

label indicator matrix#
multilabel indicator matrix#
multilabel indicator matrices#

The format used to represent multilabel data, where each row of a 2d array or sparse matrix corresponds to a sample, each column corresponds to a class, and each element is 1 if the sample is labeled with the class and 0 if not.
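A minimal sketch of this format, built with plain Python lists. The label names are invented for the illustration.

```python
# Each sample carries zero or more labels.
samples = [["sand", "clay"], ["granite"], []]

# One column per class, in sorted order.
classes = sorted({label for labels in samples for label in labels})

# One row per sample: 1 if the sample has the class, 0 otherwise.
indicator = [[1 if c in labels else 0 for c in classes] for labels in samples]

print(classes)
print(indicator)
```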

leakage#
data leakage#

A problem in cross validation where generalization performance can be over-estimated since knowledge of the test data was inadvertently included in training a model. This is a risk, for instance, when applying a transformer to the entirety of a dataset rather than each training portion in a cross validation split.

We aim to provide interfaces (such as pipeline and model_selection) that shield the user from data leakage.
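To make the risk concrete, here is a minimal NumPy sketch of standardizing features without leakage: the statistics are estimated on the training portion only and then reused on the test portion. The random data and the 80/20 split are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = X[:80], X[80:]

# Estimate the scaling statistics from the training portion only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
# Reuse the training statistics; the test data never influences them.
X_test_scaled = (X_test - mean) / std
```

Computing mean and std on the full X instead would let information from the test rows flow into the transformation, which is the leakage described above.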

max_iter#

For estimators involving iterative optimization, this determines the maximum number of iterations to be performed in fit. If max_iter iterations are run without convergence, an exceptions.ConvergenceWarning is raised.

memory#

Some estimators make use of joblib.Memory to store partial solutions during fitting. Thus when fit is called again, those partial solutions have been memoized and can be reused.

memmapping#
memory map#
memory mapping#

A memory efficiency strategy that keeps data on disk rather than copying it into main memory. Memory maps can be created for arrays that can be read, written, or both, using numpy.memmap. When using joblib to parallelize operations in watex, it may automatically memmap large arrays to reduce memory duplication overhead in multiprocessing.
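A short sketch of numpy.memmap usage follows. The temporary file location, array shape and values are arbitrary choices for the illustration.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.dat")

# Create a writable memory map backed by a file on disk.
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=(4, 3))
writer[:] = np.arange(12, dtype=np.float64).reshape(4, 3)
writer.flush()  # make sure the values reach the file

# Reopen the same file read-only; rows are loaded on access.
reader = np.memmap(path, dtype=np.float64, mode="r", shape=(4, 3))
row_sum = float(reader[1].sum())
print(row_sum)
```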

missing values#

Most watex estimators do not work with missing values. When they do (e.g. in impute.SimpleImputer), NaN is the preferred representation of missing values in float arrays. If the array has integer dtype, NaN cannot be represented. For this reason, we support specifying another missing_values value when imputation or learning can be performed in integer space. Unlabeled data is a special case of missing values in the target.
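As an illustration of NaN as the missing-value marker in float arrays, the sketch below fills each missing entry with its column mean. This is only one simple imputation strategy, written with plain NumPy for the example; it is not the implementation of any particular imputer.

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Column means computed while ignoring NaN entries.
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its column.
X_imputed = np.where(np.isnan(X), col_means, X)
print(X_imputed)
```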

n_features#

The number of features.

n_outputs#

The number of outputs in the target.

n_samples#

The number of samples.

n_targets#

Synonym for n_outputs.

narrative docs#
narrative documentation#

An alias for User Guide, i.e. documentation written in doc/modules/. Unlike the API reference provided through docstrings, the User Guide aims to:

  • group tools provided by watex together thematically or in terms of usage;

  • motivate why someone would use each particular tool, often through comparison;

  • provide both intuitive and technical descriptions of tools;

  • provide or link to examples of using key features of a tool.

np#

A shorthand for NumPy due to the conventional import statement:

import numpy as np

online learning#

Where a model is iteratively updated by receiving each batch of ground truth targets soon after making predictions on the corresponding batch of data. Intrinsically, the model must be usable for prediction after each batch. See partial_fit.

out-of-core#

An efficiency strategy where not all the data is stored in main memory at once, usually by performing learning on batches of data. See partial_fit.

outputs#

Individual scalar/categorical variables per sample in the target. For example, in multilabel classification each possible label corresponds to a binary output. Also called responses, tasks or targets. See multiclass multioutput and continuous multioutput.

pair#

A tuple of length two.

parameter#
parameters#
param#
params#

In the statistical sense, parameters are values that specify a model and can be estimated from data. However, here, what we call parameters might be what statisticians call hyperparameters to the model: aspects for configuring model structure that are often not directly learnt from data. We mostly use parameter to refer to the aspects of an estimator that can be specified in its construction. For example, max_depth and random_state are parameters of RandomForestClassifier. Parameters to an estimator’s constructor are stored unmodified as attributes on the estimator instance, and conventionally start with an alphabetic character and end with an alphanumeric character. The constructor parameters of each estimator or assessor are described in its docstring.

The list of parameters and their current values can be retrieved from an estimator instance using its get_params method.

Between construction and fitting, parameters may be modified using set_params. To enable this, parameters are not ordinarily validated or altered when the estimator is constructed, or when each parameter is set. Parameter validation is performed when fit is called. Common parameters are listed below.
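The behaviour described above can be tried with scikit-learn's RandomForestClassifier, the class used as an example in this entry. This sketch assumes scikit-learn is installed; the parameter values are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier

est = RandomForestClassifier(max_depth=3, random_state=0)

# Constructor parameters are stored unmodified and listed by get_params.
params = est.get_params()

# Parameters may be changed between construction and fitting.
est.set_params(max_depth=5)

# No validation happens at construction time: an invalid value is only
# rejected later, when fit is called.
unvalidated = RandomForestClassifier(max_depth=-1)

print(params["max_depth"], est.max_depth, unvalidated.max_depth)
```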

pairwise metric#
pairwise metrics#

In its broad sense, a pairwise metric defines a function for measuring similarity or dissimilarity between two samples (with each ordinarily represented as a feature vector). We particularly provide implementations of distance metrics (as well as improper metrics like Cosine Distance) through metrics.pairwise_distances(), and of kernel functions (a constrained class of similarity functions) in metrics.pairwise_kernels(). These can compute pairwise distance matrices that are symmetric and hence store data redundantly.

See also precomputed and metric.

Note that for most distance metrics, we rely on implementations from scipy.spatial.distance, but may reimplement for efficiency in our context. The metrics.DistanceMetric interface is used to implement distance metrics for integration with efficient neighbors search.

pd#

A shorthand for Pandas due to the conventional import statement:

import pandas as pd

precomputed#

Where algorithms rely on pairwise metrics, and can be computed from pairwise metrics alone, we often allow the user to specify that the X provided is already in the pairwise (dis)similarity space, rather than in a feature space. That is, when passed to fit, it is a square, symmetric matrix, with each vector indicating (dis)similarity to every sample, and when passed to prediction/transformation methods, each row corresponds to a testing sample and each column to a training sample.

Use of precomputed X is usually indicated by setting a metric, affinity or kernel parameter to the string ‘precomputed’. If this is the case, then the estimator should set the pairwise estimator tag as True.
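The two shapes described above can be seen in a small NumPy sketch. The helper euclidean_distances is written here for the illustration only, and the sample coordinates are invented.

```python
import numpy as np


def euclidean_distances(A, B):
    """Pairwise Euclidean distances between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


X_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
X_test = np.array([[0.0, 0.0]])

# Passed to fit: square and symmetric, shape (n_train, n_train).
D_train = euclidean_distances(X_train, X_train)

# Passed to predict/transform: one row per test sample, one column per
# training sample, shape (n_test, n_train).
D_test = euclidean_distances(X_test, X_train)

print(D_train)
print(D_test)
```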

Python#
python-based#

Python is a high-level, general-purpose, interpreted, object-oriented programming language. It is used in everything from machine learning to building websites and software testing, by developers and non-developers alike. As a general-purpose language, it is designed for a range of applications, including data science, software and web development, and automation. Like Perl, Python is popular among experienced C++ and Java programmers. Get further details here.

rectangular#

Data that can be represented as a matrix with samples on the first axis and a fixed, finite set of features on the second is called rectangular.

This term excludes samples with non-vectorial structures, such as text, an image of arbitrary size, a time series of arbitrary length, a set of vectors, etc. The purpose of a vectorizer is to produce rectangular forms of such data.

sample#
samples#

We usually use this term as a noun to indicate a single feature vector. Elsewhere a sample is called an instance, data point, or observation. n_samples indicates the number of samples in a dataset, being the number of rows in a data array X.

sample property#
sample properties#

A sample property is data for each sample (e.g. an array of length n_samples) passed to an estimator method or a similar function, alongside but distinct from the features (X) and target (y).

As of version 0.19 we do not have a consistent approach to handling sample properties and their routing in meta-estimators, though a fit_params parameter is often used.

sparse matrix#
sparse graph#

A representation of two-dimensional numeric data that is more memory efficient than the corresponding dense numpy array where almost all elements are zero. We use the scipy.sparse framework, which provides several underlying sparse data representations, or formats. Some formats are more efficient than others for particular tasks, and when a particular format provides especial benefit, we try to document this fact in watex parameter descriptions.

Some sparse matrix formats (notably CSR, CSC, COO and LIL) distinguish between implicit and explicit zeros. Explicit zeros are stored (i.e. they consume memory in a data array) in the data structure, while implicit zeros correspond to every element not otherwise defined in explicit storage.

Two semantics for sparse matrices are used in watex:

matrix semantics

The sparse matrix is interpreted as an array with implicit and explicit zeros being interpreted as the number 0. This is the interpretation most often adopted, e.g. when sparse matrices are used for feature matrices or multilabel indicator matrices.

graph semantics

As with scipy.sparse.csgraph, explicit zeros are interpreted as the number 0, but implicit zeros indicate a masked or absent value, such as the absence of an edge between two vertices of a graph, where an explicit value indicates an edge’s weight. This interpretation is adopted to represent connectivity in clustering, in representations of nearest neighborhoods (e.g. neighbors.kneighbors_graph()), and for precomputed distance representation where only distances in the neighborhood of each point are required.

When working with sparse matrices, we assume that it is sparse for a good reason, and avoid writing code that densifies a user-provided sparse matrix, instead maintaining sparsity or raising an error if not possible (i.e. if an estimator does not / cannot support sparse matrices).

target#
targets#

The dependent variable in supervised (and semisupervised) learning, passed as y to an estimator’s fit method. Also known as dependent variable, outcome variable, response variable, ground truth or label. watex works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers.

transduction#
transductive#

A transductive (contrasted with inductive) machine learning method is designed to model a specific dataset, but not to apply that model to unseen data.

Machine Learning concepts#

These are basic concepts about Machine Learning. For more in-depth coverage, refer to the Scikit-learn glossary.

Artificial intelligence#
AI#
Intelligence artificielle#
IA#

It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable. See more here.

binary classification#

A 1-dimensional array, where values strictly greater than zero indicate the positive class (i.e. the last class in classes_).

classifier#

An array of shape (n_samples,) or (n_samples, n_outputs). Multilabel data may be represented as a sparse matrix if a sparse matrix was used in fitting. Each element should be one of the values in the classifier’s classes_ attribute.

clusterer#

An array of shape (n_samples,) where each value is from 0 to n_clusters - 1 if the corresponding sample is clustered, and -1 if the sample is not clustered.

continuous#

A regression problem where each sample’s target is a finite floating point number represented as a 1-dimensional array of floats (or sometimes ints).

cv#

Determines a cross validation splitting strategy, as used in cross-validation based routines.

n_components#

The number of features which a transformer should transform the input into. See components_ for the special case of affine projection.

n_iter_no_change#

Number of iterations with no improvement to wait before stopping the iterative procedure. This is also known as a patience parameter. It is typically used with early stopping to avoid stopping too early.

n_jobs#

This parameter is used to specify how many concurrent processes or threads should be used for routines that are parallelized with joblib.

metric#

As a parameter, this is the scheme for determining the distance between two data points. See metrics.pairwise_distances(). In practice, for some algorithms, an improper distance metric (one that does not obey the triangle inequality, such as Cosine Distance) may be used.

Machine Learning#
Machine learning#
Machine-learning#
Learning machine#
ML#

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. See more here.

multiclass classification#

A 2-dimensional array, where the row-wise arg-maximum is the predicted class. Columns are ordered according to classes_.
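The conventions for binary and multiclass outputs can be sketched with NumPy. The class names and score values are invented for the example.

```python
import numpy as np

# Multiclass: one column per class, ordered as in classes_.
classes_ = np.array(["clay", "granite", "sand"])
multiclass_scores = np.array([[0.2, 1.5, -0.3],
                              [2.0, 0.1, 0.4]])
multiclass_pred = classes_[multiclass_scores.argmax(axis=1)]

# Binary: a 1-dimensional array; values strictly greater than zero
# indicate the positive class, i.e. the last entry of classes_.
binary_classes_ = np.array(["dry", "wet"])
binary_scores = np.array([-0.7, 0.3, 0.0])
binary_pred = binary_classes_[(binary_scores > 0).astype(int)]

print(multiclass_pred, binary_pred)
```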

multilabel classification#

Scikit-learn is inconsistent in its representation of multilabel decision functions. It may be represented one of two ways:

  • List of 2d arrays, each array of shape (n_samples, 2), like in multiclass multioutput. The list is of length n_labels.

  • Single 2d array of shape (n_samples, n_labels), with each ‘column’ in the array corresponding to the individual binary classification decisions. This is identical to the multiclass classification format, though its semantics differ: it should be interpreted, like in the binary case, by thresholding at 0.

outlier detector#
outlier detectors#

An unsupervised binary predictor which models the distinction between core and outlying samples.

predictor#
predictors#

An estimator supporting predict and/or fit_predict. This encompasses classifier, regressor, outlier detector and clusterer.

In statistics, “predictors” refers to features.

regressor#
regressors#

A supervised (or semi-supervised) predictor with continuous output values.

transformer#
transformers#

An estimator supporting transform and/or fit_transform. A purely transductive transformer, such as manifold.TSNE, may not implement transform.

semi-supervised#
semi-supervised learning#
semisupervised#

Learning where the expected prediction (label or ground truth) is only available for some samples provided as training data when fitting the model.

supervised#
supervised learning#

Learning where the expected prediction (label or ground truth) is available for each sample. This is the approach taken in a classifier or regressor among other estimators.

target#
targets#

The dependent variable in supervised (and semisupervised) learning, passed as y to an estimator’s fit method. Also known as dependent variable, outcome variable, response variable, ground truth or label. Scikit-learn works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers. See glossary_target_types.

unsupervised#
unsupervised learning#

Learning where the expected prediction (label or ground truth) is not available for each sample.

pos_label#

Value with which positive labels must be encoded in binary classification problems in which the positive class is not assumed. This value is typically required to compute asymmetric evaluation metrics such as precision and recall.

random_state#

Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.

The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means()). random_state’s value may be:

None (default)

Use the global random state instance from numpy.random. Calling the function multiple times will reuse the same instance, and will produce different results.

An integer

Use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile checking that your results are stable across a number of different distinct random seeds. Popular integer random seeds are 0 and 42. Integer values must be in the range [0, 2**32 - 1].

A numpy.random.RandomState instance

Use the provided random state, only affecting other users of that same random state instance. Calling the function multiple times will reuse the same instance, and will produce different results.
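The difference between an integer seed and a shared RandomState instance can be seen in the sketch below. The helper shuffle_indices is hypothetical and only mimics how a function might consume random_state; note that numpy.random.RandomState(None) draws a fresh seed from the operating system rather than using the global instance, so the None case is not reproduced exactly here.

```python
import numpy as np


def shuffle_indices(n, random_state=None):
    """Return a random permutation of range(n), controlled by random_state."""
    if isinstance(random_state, np.random.RandomState):
        rng = random_state  # reuse the caller's generator
    else:
        rng = np.random.RandomState(random_state)  # new generator from int/None
    return rng.permutation(n)


# An integer seed: every call builds a new generator, so results repeat.
a = shuffle_indices(20, random_state=42)
b = shuffle_indices(20, random_state=42)

# A shared instance: successive calls advance the same generator.
shared = np.random.RandomState(42)
c = shuffle_indices(20, random_state=shared)
d = shuffle_indices(20, random_state=shared)

print(np.array_equal(a, b), np.array_equal(c, d))
```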

scoring#

Specifies the score function to be maximized (usually by cross validation), or – in some cases – multiple score functions to be reported. The score function can be a string accepted by watex.metrics.get_scorers().

Estimators and assessors methods#

decision_function#

In a fitted classifier or outlier detector, predicts a “soft” score for each sample in relation to each class, rather than the “hard” categorical prediction produced by predict. Its input is usually only some observed data, X.

fit#

The fit method is provided on every estimator. It usually takes some samples X, targets y if the model is supervised, and potentially other sample properties such as sample_weight. It should:

  • clear any prior attributes stored on the estimator, unless warm_start is used;

  • validate and interpret any parameters, ideally raising an error if invalid;

  • validate the input data;

  • estimate and store model attributes from the estimated parameters and provided data; and

  • return the now fitted estimator to facilitate method chaining.

glossary_target_types describes possible formats for y.

fit_predict#

Used especially for unsupervised, transductive estimators, this fits the model and returns the predictions (similar to predict) on the training data. In clusterers, these predictions are also stored in the labels_ attribute, and the output of .fit_predict(X) is usually equivalent to .fit(X).predict(X). The parameters to fit_predict are the same as those to fit.

fit_transform#

A method on transformers which fits the estimator and returns the transformed training data. It takes parameters as in fit and its output should have the same shape as calling .fit(X, ...).transform(X). There are nonetheless rare cases where .fit_transform(X, ...) and .fit(X, ...).transform(X) do not return the same value, wherein training data needs to be handled differently (due to model blending in stacked ensembles, for instance; such cases should be clearly documented). Transductive transformers may also provide fit_transform but not transform.

One reason to implement fit_transform is that performing fit and transform separately would be less efficient than together. base.TransformerMixin provides a default implementation, providing a consistent interface across transformers where fit_transform is or is not specialized.

In inductive learning – where the goal is to learn a generalized model that can be applied to new data – users should be careful not to apply fit_transform to the entirety of a dataset (i.e. training and test data together) before further modelling, as this results in data leakage.

get_feature_names_out#

Primarily for feature extractors, but also used for other transformers to provide string names for each column in the output of the estimator’s transform method. It outputs an array of strings and may take an array-like of strings as input, corresponding to the names of input columns from which output column names can be generated. If input_features is not passed in, then the feature_names_in_ attribute will be used. If the feature_names_in_ attribute is not defined, then the input names are named [x0, x1, …, x(n_features_in_ - 1)].

get_n_splits#

On a CV splitter (not an estimator), returns the number of elements one would get if iterating through the return value of split given the same parameters. Takes the same parameters as split.

get_params#

Gets all parameters, and their values, that can be set using set_params. A parameter deep can be used: when set to False, only those parameters not including __ are returned, i.e. those not due to indirection via contained estimators.

Most estimators adopt the definition from base.BaseEstimator, which simply adopts the parameters defined for __init__. pipeline.Pipeline, among others, reimplements get_params to declare the estimators named in its steps parameters as themselves being parameters.

partial_fit#

Facilitates fitting an estimator in an online fashion. Unlike fit, repeatedly calling partial_fit does not clear the model, but updates it with the data provided. The portion of data provided to partial_fit may be called a mini-batch. Each mini-batch must be of consistent shape, etc. In iterative estimators, partial_fit often only performs a single iteration.
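The update-without-reset behaviour can be illustrated with a toy class that keeps a running mean across mini-batches. RunningMean is a hypothetical example class, not part of watex.

```python
class RunningMean:
    """Toy model whose state is updated, not cleared, by each mini-batch."""

    def partial_fit(self, batch):
        # Start from the previous state if there is one.
        n_old = getattr(self, "n_samples_seen_", 0)
        mean_old = getattr(self, "mean_", 0.0)

        n_new = n_old + len(batch)
        self.mean_ = (mean_old * n_old + sum(batch)) / n_new
        self.n_samples_seen_ = n_new
        return self


model = RunningMean()
model.partial_fit([1.0, 2.0])       # first mini-batch
model.partial_fit([3.0, 4.0, 5.0])  # second mini-batch updates the state
print(model.mean_, model.n_samples_seen_)
```

After both calls the model holds the mean of all five values seen so far, which is what a single pass over the full data would give.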

predict#

Makes a prediction for each sample, usually only taking X as input (but see under regressor output conventions below). In a classifier or regressor, this prediction is in the same target space used in fitting (e.g. one of {‘red’, ‘amber’, ‘green’} if the y in fitting consisted of these strings). Despite this, even when y passed to fit is a list or other array-like, the output of predict should always be an array or sparse matrix. In a clusterer or outlier detector the prediction is an integer.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

predict_log_proba#

The natural logarithm of the output of predict_proba, provided to facilitate numerical stability.

predict_proba#

A method in classifiers and clusterers that can return probability estimates for each class/cluster. Its input is usually only some observed data, X.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Output conventions are like those for decision_function except in the binary classification case, where one column is output for each class (while decision_function outputs a 1d array). For binary and multiclass predictions, each row should add to 1.
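The output shape and the row-sum property can be checked on a small example. The softmax mapping used here is only one common way to turn scores into probabilities and is an assumption of this sketch, not a statement about how any given estimator computes predict_proba.

```python
import numpy as np


def softmax(scores):
    """Map each row of scores to probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])

proba = softmax(scores)    # shape (n_samples, n_classes)
log_proba = np.log(proba)  # what predict_log_proba would correspond to

print(proba.sum(axis=1))
```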

Like other methods, predict_proba should only be present when the estimator can make probabilistic predictions (see duck typing). This means that the presence of the method may depend on estimator parameters (e.g. in linear_model.SGDClassifier) or training data (e.g. in model_selection.GridSearchCV) and may only appear after fitting.

regressor

A numeric array of shape (n_samples,), usually float64. Some regressors have extra options in their predict method, allowing them to return standard deviation (return_std=True) or covariance (return_cov=True) relative to the predicted value. In this case, the return value is a tuple of arrays corresponding to (prediction mean, std, cov) as required.

score#

A method on an estimator, usually a predictor, which evaluates its predictions on a given dataset, and returns a single numerical score. A greater return value should indicate better predictions; accuracy is used for classifiers and R^2 for regressors by default.
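The two default scores named above can be written out in plain Python. The functions accuracy and r2 are small illustrative helpers, not the library implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
perfect = r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicting the mean
print(acc, perfect, baseline)
```

In both cases a greater value indicates better predictions, consistent with the convention stated above.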

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Some estimators implement a custom, estimator-specific score function, often the likelihood of the data under the model.

score_samples#

A method that returns a score for each given sample. The exact definition of score varies from one class to another. In the case of density estimation, it can be the log density model on the data, and in the case of outlier detection, it can be the opposite of the outlier factor of the data.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

set_params#

Available in any estimator, takes keyword arguments corresponding to keys in get_params. Each is provided a new value to assign such that calling get_params after set_params will reflect the changed parameters. Most estimators use the implementation in base.BaseEstimator, which handles nested parameters and otherwise sets the parameter as an attribute on the estimator. The method is overridden in pipeline.Pipeline and related estimators.

split#

On a CV splitter (not an estimator), this method accepts parameters (X, y, groups), where all may be optional, and returns an iterator over (train_idx, test_idx) pairs.
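A minimal splitter following this interface might look like the sketch below. SimpleKFold is a hypothetical class written for the illustration; it performs no shuffling and ignores y and groups.

```python
class SimpleKFold:
    """Toy CV splitter producing contiguous, non-shuffled folds."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        # Number of (train_idx, test_idx) pairs that split will yield.
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n = len(X)
        indices = list(range(n))
        # Spread the remainder over the first folds.
        sizes = [n // self.n_splits + (1 if i < n % self.n_splits else 0)
                 for i in range(self.n_splits)]
        start = 0
        for size in sizes:
            test_idx = indices[start:start + size]
            train_idx = indices[:start] + indices[start + size:]
            yield train_idx, test_idx
            start += size


cv = SimpleKFold(n_splits=3)
folds = list(cv.split(list(range(7))))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```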

transform#

In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.

If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.

Core parameters#

These parameter names are commonly used in estimators, assessors and common functions.

as_frame#

Transform the data into a pandas DataFrame, including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If as_frame is False, a Boxspace dictionary-like object is returned, with the following attributes:

  • data: {ndarray, dataframe}

    The data matrix. If as_frame=True, data will be a pandas DataFrame.

  • resistivity: {array-like} of shape (shape[0],)

    The resistivity of the sounding point.

  • MN: {array-like} of shape (shape[0],)

    The step value of potential electrodes increasing in meters.

  • AB: {array-like} of shape (shape[0],)

    The step value of current electrodes increasing in meters.

  • feature_names: list

    The names of the dataset columns.

  • DESCR: str

    The full description of the dataset.

  • filename: str

    The path to the location of the data.

data#

str, filepath_or_buffer or pandas.core.DataFrame. Path-like object or DataFrame. If data is given as a path-like object, the data is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via the builtin open function) or StringIO.

index_rhoa#

int, index of the resistivity column to retrieve. This is useful when many sounding values are collected in the same survey area. index_rhoa=0 fetches the first sounding values in the collection of all values.

tag#

str, name of the dataset to fetch. The tag can also be a data set processing stage. See datasets for details.

X#

Ndarray of shape (\(M \times N\)), where \(M = m_{samples}\) and \(N = n_{features}\); training set. Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. X may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model.

y#

array-like of shape (\(M,\) ), \(M=m_{samples}\) train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

Xt#

Ndarray (\(M \times N\) matrix where \(M=m_{samples}\) & \(N=n_{features}\)). Shorthand for “test set”; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix.

yt#

array-like of shape (\(M,\)), \(M=m_{samples}\), test target; denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction.

tname#

str, A target name or label. In supervised learning the target name is considered as the reference name of y or label variable.

z#

array-like 1d, pandas.Series. Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when z is given as a dataframe and zname is not supplied, an error is raised, since zname is used to fetch z from the dataframe and overwrite it.

zname#

str, int. Name of the depth column. zname allows retrieving the depth column in a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe. The integer value must not exceed the dataframe size along axis 1. Commonly, zname needs to be supplied when a dataframe is passed as a function argument.
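
The name-or-index lookup described for zname (and, in the same way, for kname) can be sketched as follows. `resolve_column` is a hypothetical helper written for this illustration, not a watex function, and the column names and values are invented.

```python
import pandas as pd


def resolve_column(df, name):
    """Hypothetical helper: fetch a column by label (str) or by
    position along axis 1 (int)."""
    if isinstance(name, int):
        # The integer must stay within the dataframe size along axis 1.
        if not 0 <= name < df.shape[1]:
            raise IndexError(
                f"index {name} is out of range for {df.shape[1]} columns"
            )
        return df.iloc[:, name]
    if name not in df.columns:
        raise KeyError(f"column {name!r} not found")
    return df[name]


# Invented sample values, for illustration only.
df = pd.DataFrame({"depth": [0.0, 5.0, 10.0], "k": [0.1, 0.3, 0.2]})

z = resolve_column(df, "depth")  # by name
k = resolve_column(df, 1)        # by position

print(z.name, k.name)  # depth k
```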

kname#

str, int. Name of the permeability coefficient column. kname allows retrieving the permeability coefficient k in a specific dataframe. If an integer is passed, it is taken as the index of the k column in the dataframe. The integer value must not exceed the dataframe size along axis 1. Commonly, kname needs to be supplied when a dataframe is passed as a positional or keyword argument.

k#

array-like 1d, pandas.Series. Array of permeability coefficient ‘k’ or a pandas series that contains the ‘k’ values. Arrays of two or more dimensions are not allowed. However, when k is passed as a dataframe and kname is not supplied, an error is raised, since kname is used to retrieve the k values from the dataframe and overwrite k.

target#

Array-like or pd.Series. The dependent variable in supervised (and semisupervised) learning, passed as y to an estimator’s fit method. Also known as dependent variable, outcome variable, response variable, ground truth or label. watex works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers. Refer to target types. Note that throughout this library, a target is considered as a pd.Series whose name is tname and whose values are y, i.e. target = tname + y.

model#

callable, always as a function. A model estimator: an object which manages the estimation and decoding of a model. The model is estimated as a deterministic function of:

  • parameters provided in object construction or with set_params;

  • the global numpy.random random state if the estimator’s random_state parameter is set to None; and

  • any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit.

The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The core functionality of some estimators may also be available as a function.
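
To make the estimator contract concrete, here is a minimal sketch in plain Python. `MeanRegressor` and its `offset` parameter are invented for this illustration; the class does not inherit from base.BaseEstimator and its get_params is simplified, so it only mirrors the method names described above.

```python
class MeanRegressor:
    """Toy estimator that predicts the mean of the training target."""

    def __init__(self, offset=0.0):
        # Constructor parameters are stored unchanged.
        self.offset = offset

    def get_params(self):
        return {"offset": self.offset}

    def set_params(self, **params):
        for key, value in params.items():
            if key not in self.get_params():
                raise ValueError(f"Invalid parameter {key!r}")
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # The estimated model is stored as an attribute on the instance.
        self.mean_ = sum(y) / len(y)
        return self  # returning self enables method chaining

    def predict(self, X):
        return [self.mean_ + self.offset for _ in X]


X = [[1.0], [2.0], [3.0]]
y = [10.0, 20.0, 30.0]

predictions = MeanRegressor().set_params(offset=1.0).fit(X, y).predict(X)
print(predictions)  # [21.0, 21.0, 21.0]
```

Because fit and set_params both return the instance, configuration, fitting and prediction can be chained in one expression.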

clf#

callable, always as a function; classifier estimator. A supervised (or semi-supervised) predictor with a finite set of discrete possible output values. A classifier supports modeling some of binary, multiclass, multilabel, or multiclass multioutput targets. Within scikit-learn, all classifiers support multi-class classification, defaulting to using a one-vs-rest strategy over the binary classification problem. Classifiers must store a classes_ attribute after fitting, and usually inherit from base.ClassifierMixin, which sets their _estimator_type attribute. A classifier can be distinguished from other estimators with is_classifier. It must implement:

  • fit

  • predict

  • score

It may also be appropriate to implement decision_function, predict_proba and predict_log_proba.

reg#

callable, always as a function. A regression estimator. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator. The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. The core functionality of some estimators may also be available as a function.

cv#

int, cross-validation splitter or iterable. A cross-validation splitting strategy, used in cross-validation based routines. cv is also available in estimators such as multioutput.ClassifierChain or calibration.CalibratedClassifierCV, which use the predictions of one estimator as training data for another, so as not to overfit the training supervision. Possible inputs for cv are usually:

  • An integer, specifying the number of folds in K-fold cross validation.

    K-fold will be stratified over classes if the estimator is a classifier (determined by base.is_classifier) and the targets may represent a binary or multiclass (but not multioutput) classification problem (determined by utils.multiclass.type_of_target).

  • A cross-validation splitter instance. Refer to the User Guide for splitters available within watex.

  • An iterable yielding train/test splits.

With some exceptions (especially where not using cross validation at all is an option), the default is 4-fold. See also the Scikit-learn glossary: https://scikit-learn.org/stable/glossary.html#glossary
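
The third kind of input, an iterable yielding train/test splits, can be sketched in plain Python. `kfold_splits` is a hypothetical generator written for this illustration: it builds contiguous, unshuffled and unstratified folds, and it is not the splitter used by watex or scikit-learn.

```python
def kfold_splits(n_samples, n_splits=4):
    """Yield (train_indices, test_indices) pairs for contiguous K folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(kfold_splits(10, n_splits=4))
for train, test in splits:
    print(test)
# [0, 1, 2]
# [3, 4, 5]
# [6, 7]
# [8, 9]
```

Each sample appears in exactly one test fold, and the matching train list holds all the remaining indices.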

scoring#

str, Specifies the score function to be maximized (usually by cross validation), or – in some cases – multiple score functions to be reported.

random_state#

int, RandomState instance or None, default=None. Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

test_size#

float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
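
The rules above can be read as a small decision procedure. The sketch below is an illustration only: `resolve_test_size` is a hypothetical helper, and rounding a fractional test size up (and a fractional train size down) is an assumption of this sketch.

```python
import math


def resolve_test_size(n_samples, test_size=None, train_size=None):
    """Hypothetical helper: number of test samples implied by the
    test_size / train_size rules."""
    if test_size is None and train_size is None:
        test_size = 0.25  # documented fallback
    if test_size is None:
        # Complement of the train size.
        if isinstance(train_size, float):
            n_train = math.floor(train_size * n_samples)
        else:
            n_train = train_size
        return n_samples - n_train
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        return math.ceil(test_size * n_samples)  # assumption: round up
    return int(test_size)  # absolute number of test samples


print(resolve_test_size(100))                 # 25
print(resolve_test_size(100, test_size=0.5))  # 50
print(resolve_test_size(100, test_size=30))   # 30
print(resolve_test_size(100, train_size=70))  # 30
```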

n_jobs#

int, specifies how many concurrent processes or threads should be used for routines that are parallelized with joblib. It sets the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For instance:

  • For n_jobs below -1, (n_cpus + 1 + n_jobs) workers are used.

  • For n_jobs=-2, all CPUs but one are used.

  • n_jobs is None by default, which means unset; it will generally be interpreted as n_jobs=1 unless the current joblib.Parallel backend context specifies otherwise.

Note that even if n_jobs=1, low-level parallelism (via Numpy and OpenMP) might be used in some configurations.
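
The arithmetic behind negative values can be checked with a short sketch. `resolve_n_jobs` is a hypothetical helper that takes the CPU count as an explicit argument. It ignores any active joblib.Parallel backend context, so it only illustrates the formula quoted above.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Hypothetical helper: number of workers implied by n_jobs."""
    if n_jobs is None:
        return 1  # unset is generally interpreted as n_jobs=1
    if n_jobs == 0:
        raise ValueError("n_jobs=0 has no meaning")
    if n_jobs < 0:
        # n_cpus + 1 + n_jobs, never fewer than one worker
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for value in (None, 1, -1, -2, -3):
    print(value, resolve_n_jobs(value, n_cpus=8))
# None 1
# 1 1
# -1 8
# -2 7
# -3 6
```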

verbose#

int, default is 0. Controls the level of verbosity. Higher values lead to more messages. Any True value should enable some logging, but larger integers (e.g. above 7) may be needed for full verbosity.

self#

Base class instance; methods return self for easy method chaining.

Plotting parameters#

These are the parameters for plotting purposes. Each plotting class that inherits from watex.property.BasePlot (such as ExPlot, TPlot, QuickPlot or EvalPlot) accepts the following parameters for customizing plots. See also the property parameters BasePlot.

savefig#

str, Path-like object, name of the figure to save. default is None

fig_dpi#

float, dots-per-inch resolution of the figure. default is 300

fig_num#

int, number (identifier) of the figure.

fig_size#

tuple (int, int), size of the figure in inches (width, height). default is [5, 5]

fig_orientation#

str, figure orientation. default is landscape

fig_tile#

str, figure title. default is None

fs#

float, font size of the axis tick labels; axis labels use fs+2. default is 6

ls#

str, line style; it can be [ ‘-’ | ‘.’ | ‘:’ ]. default is ‘-’

lc#

str, Optional, line color of the plot, default is k

lw#

float, Optional, line weight of the plot, default is 1.5

alpha#

float with 0 < alpha < 1, transparency number. default is 0.5

font_weight#

str, Optional weight of the font. default is bold

font_style#

str, Optional style of the font. default is italic

font_size#

float, Optional size of the font. default is 3.

ms#

float, Optional size of marker in points. default is 5

marker#

str, Optional marker of stations. default is o

marker_style#

str, Optional facecolor of the marker. default is yellow

marker_edgecolor#

str, Optional edge color of the marker. default is yellow

marker_edgewidth#

float, Optional width of the marker edge. default is 3.

xminorticks#

float, Optional minor ticks according to the x-axis size. default is 1.

yminorticks#

float, Optional minor ticks according to the y-axis size. default is 1.

bins#

histogram element separation between two bars. default is 10.

xlim#

tuple (int, int), Optional limit of x-axis in plot.

ylim#

tuple (int, int), Optional limit of y-axis in plot.

xlabel#

str, Optional, label name of x-axis in plot.

ylabel#

str, Optional, label name of y-axis in plot.

rotate_xlabel#

float, Optional angle to rotate xlabel in plot.

rotate_ylabel#

float, Optional angle to rotate ylabel in plot.

leg_kws#

dict, Optional keyword arguments of legend. default is empty dict

plt_kws#

dict, Optional keyword arguments of plot. default is empty dict

glc#

str, Optional line color of the grid plot, default is k

glw#

float, Optional line weight of the grid plot, default is 2

galpha#

float, Optional, transparency number of grid, default is 0.5

gaxis#

str (‘x’, ‘y’, ‘both’) type of axis to hold the grid, default is both

gwhich#

str, Optional kind of grid in the plot. default is major

tp_axis#

str, axis on which to apply the ticks params. default is both

tp_labelsize#

str, Optional labelsize of ticks params. default is italic

tp_bottom#

bool, position at bottom of ticks params. default is True.

tp_labelbottom#

bool, put label on the bottom of the ticks. default is False

tp_labeltop#

bool, put label on the top of the ticks. default is True

cb_orientation#

str (‘vertical’, ‘horizontal’), orientation of the colorbar, default is vertical

cb_aspect#

float, Optional aspect of the colorbar. default is 20.

cb_shrink#

float, Optional shrink size of the colorbar. default is 1.0

cb_pad#

float, pad of the colorbar of plot. default is .05

cb_anchor#

tuple (float, float) anchor of the colorbar. default is (0.0, 0.5)

cb_panchor#

tuple (float, float) proportionality anchor of the colorbar. default is (1.0, 0.5)

cb_label#

str, Optional label of the colorbar.

cb_spacing#

str, Optional spacing of the colorbar. default is uniform

cb_drawedges#

bool, draw edges inside of the colorbar. default is False

ax#

matplotlib.axes.Axes The matplotlib axes containing the plot.
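
As a rough illustration of what these parameters control, the sketch below passes a few of the documented defaults to plain matplotlib calls. The mapping from watex parameter names to matplotlib keyword arguments is an assumption of this sketch and does not show how BasePlot applies them internally. The data points and axis labels are invented.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt

# Documented defaults, collected in a plain dict for this illustration.
params = dict(
    fig_num=1, fig_size=(5, 5), fig_dpi=300,
    ls="-", lc="k", lw=1.5, alpha=0.5, marker="o", ms=5, fs=6,
    glc="k", glw=2, galpha=0.5, gaxis="both", gwhich="major",
    xlabel="AB/2 (m)", ylabel="Resistivity (ohm.m)", savefig=None,
)

fig, ax = plt.subplots(
    num=params["fig_num"], figsize=params["fig_size"], dpi=params["fig_dpi"]
)
(line,) = ax.plot(
    [1, 2, 3], [943.0, 1179.0, 1310.0],  # invented values
    linestyle=params["ls"], color=params["lc"], linewidth=params["lw"],
    alpha=params["alpha"], marker=params["marker"], markersize=params["ms"],
)
ax.grid(
    True, which=params["gwhich"], axis=params["gaxis"],
    color=params["glc"], linewidth=params["glw"], alpha=params["galpha"],
)
ax.tick_params(axis="both", labelsize=params["fs"])
# Axis labels are drawn two points larger than the tick labels (fs+2).
ax.set_xlabel(params["xlabel"], fontsize=params["fs"] + 2)
ax.set_ylabel(params["ylabel"], fontsize=params["fs"] + 2)

if params["savefig"] is not None:
    fig.savefig(params["savefig"], dpi=params["fig_dpi"])
```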

Miscellaneous#

These are plotting functions from seaborn that are used for visualization in some cases.

scatterplot#

Plot data using points.

lineplot#

Plot data using lines.

displot#

Figure-level interface to distribution plot functions.

histplot#

Plot a histogram of binned counts with optional normalization or smoothing.

kdeplot#

Plot univariate or bivariate distributions using kernel density estimation.

violinplot#

Draw an enhanced boxplot using kernel density estimation.

pointplot#

Plot point estimates and CIs using markers and lines.

boxplot#

Draw an enhanced boxplot.

jointplot#

Draw a bivariate plot with univariate marginal distributions.

pairplot#

Draw multiple bivariate plots with univariate marginal distributions.

JointGrid#

Set up a figure with joint and marginal views on bivariate data.

PairGrid#

Set up a figure with joint and marginal views on multiple variables.

facetgrid#

FacetGrid An object managing one or more subplots that correspond to conditional data subsets with convenient methods for batch-setting of axes attributes.

jointgrid#

JointGrid An object managing multiple subplots that correspond to joint and marginal axes for plotting a bivariate relationship or distribution.

pairgrid#

PairGrid An object managing multiple subplots that correspond to joint and marginal axes for pairwise combinations of multiple variables in a dataset.