.. _glossary:

==================
Glossary
==================

This glossary sets out the conventions applied in watex and its API, and
serves as a reference for users. The concepts are grouped into the related
sections :ref:`glossary_geosciences_concepts`,
:ref:`glossary_computer_concepts`, :ref:`glossary_ml_concepts`,
:ref:`glossary_methods`, :ref:`glossary_parameters`,
:ref:`glossary_plot_parameters`, :ref:`glossary_miscellaneous`, and
:ref:`glossary_reference_links`.

.. _glossary_geosciences_concepts:

Geosciences concepts
=====================

.. glossary::

    Audio-frequency magnetotelluric
    Audio-magnetotelluric
    AMT
        AMT is an important high-resolution, non-seismic geophysical
        technique that measures variations in the Earth's natural
        electromagnetic fields to detect electrical resistivity variations
        in the subsurface at shallow to intermediate depths.

    Aquifer
        In :term:`hydrology`, an aquifer is a rock layer that contains
        :term:`water` and releases it in appreciable amounts. The rock
        contains water-filled pore spaces and, when the spaces are
        connected, the water is able to flow through the matrix of the rock.

    Borehole
        A deep hole made in the ground when looking for oil, gas, or water,
        as in "We must sink a borehole so that people will have water." It
        can also be defined as a narrow shaft bored in the ground, either
        vertically or horizontally. A borehole may be constructed for many
        different purposes, including the extraction of water (drilled water
        well and tube well), other liquids (such as petroleum), or gases
        (such as natural gas). In our case, the objective is :term:`water`.

    Conductivity
        It is often represented using :math:`\sigma`. It is the inverse of
        resistivity: :math:`\sigma = 1/\rho`. Conductivity is given in
        siemens per metre (S/m). Millisiemens per metre (:math:`mS/m`) are
        often used; :math:`1000\ mS/m = 1\ S/m`. A conductivity of
        :math:`1\ mS/m` therefore corresponds to a resistivity of
        :math:`1000\ \Omega.m`.
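As a hedged illustration of the *Conductivity* entry, the sketch below converts a conductivity given in mS/m into a resistivity in ohm-metres by taking the inverse. The function name ``resistivity_from_conductivity`` is hypothetical and is not part of the watex API; it only spells out the unit arithmetic stated in the entry.

```python
def resistivity_from_conductivity(sigma_ms_per_m):
    """Convert a conductivity in mS/m to a resistivity in ohm-metres.

    Hypothetical helper for illustration only; not a watex function.
    """
    if sigma_ms_per_m <= 0:
        raise ValueError("conductivity must be strictly positive")
    # Resistivity is the inverse of conductivity expressed in S/m.
    # Since 1000 mS/m = 1 S/m, rho [ohm.m] = 1000 / sigma [mS/m].
    return 1000.0 / sigma_ms_per_m


rho_a = resistivity_from_conductivity(1.0)   # conductivity of 1 mS/m
rho_b = resistivity_from_conductivity(50.0)  # conductivity of 50 mS/m
print(rho_a, rho_b)
```

A conductivity of 1 mS/m maps to 1000 ohm-metres, which matches the equivalence given in the entry.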

    Controlled-source audio-frequency magnetotelluric
    Controlled-source audio-magnetotelluric
    CSAMT
        Controlled Source Audio-Magnetotellurics is a frequency-domain
        electromagnetic sounding technique that uses a fixed grounded dipole
        or horizontal loop as an artificial signal source. It mostly
        involves transmitting a current at various frequencies in one
        location and measuring resistivity differences between electrodes
        spaced along a receiver line several kilometers from the
        transmitter. CSAMT has a low environmental impact and is used
        extensively in mineral, geothermal, and groundwater exploration, as
        well as in some hydrocarbon applications.

    Controlled source electromagnetic
    CSEM
        It stands for Controlled Source Electromagnetics (`CSEM `__),
        commonly known as Marine CSEM. It is the collective term for
        techniques that investigate the geological subsurface using
        electromagnetic signals generated by artificial, controllable source
        systems operated in a marine setting. This is possible because the
        various subsurface strata are made up of materials with different
        electromagnetic properties in terms of their
        resistivity/conductivity and chargeability. The differences in
        resistivity between materials enable us to use electromagnetic
        signals to map geological formations in the subsurface. When an
        electromagnetic field propagates through the different formations,
        it is successively influenced and modified by the resistivities of
        the strata it encounters.

    Deadband
    Dead-band
    Attenuation band
    Neutral zone
        It is defined as a band of input values in the domain of a transfer
        function in a control system or signal processing system where the
        output is zero (the output is 'dead': no action occurs). In
        short-period :term:`EM`, the attenuation band, centered at 2000 Hz,
        is one potential problem in collecting :term:`NSAMT` data. In this
        band natural source signals are generally absent.
        While commonly called the attenuation band, the actual problem is
        that the atmospheric `wave guide` channeling natural source
        electromagnetic signals is transparent to these frequencies, so
        natural electromagnetic (EM) energy is dissipated. Often, cultural
        electromagnetic noise fills the gap. Recovering the missing or weak
        signal in that bandwidth is a real challenge.

    DC-resistivity
    DC
        It stands for direct current resistivity methods. It entails
        injecting a steady-state electrical current into the ground and
        observing the resulting distribution of potentials (voltages) at the
        surface or within boreholes. Like all geophysical processes,
        `DC surveys `__ can be described in terms of input energy, the
        earth's physical properties, and the signals or data that are
        measured.

    Drilling
    Drill
        Drilling is a material-removing or cutting process in which the tool
        uses a drill bit to cut a hole of circular cross-section in solid
        materials. It is the most common machining process; one estimate is
        that 75% of all material removed in metal cutting comes from
        drilling operations.

    Drinking water supply campaign
    Drinking water supply project
    Campaign for drinking water supply
    DWSC
    CDWS
        It is a project largely financed by global organizations (UNICEF,
        UNESCO) in collaboration with state governments to supply potable
        water to rural and urban populations. It is mostly registered under
        the `Sustainable Development Goals Number 6 `_. In developing
        countries, some DWSC tend to cover 1,000 or more villages per year
        for the welfare of the population.

    Electrical Data Interchange
    EDI
        EDI is a data format for electrical geophysical data. Its purpose is
        to facilitate data exchange from contractors to clients and among
        clients. It also serves a very important function as a standard
        format for archiving data.
        The format has been designed to accommodate a wide variety of field
        measurement configurations and processing requirements which might
        arise from special applications or new developments. Although very
        similar in appearance to some existing formats, it has been slightly
        modified to be more consistent, unambiguous, extensible, and
        realizable. A number of refinements have been incorporated as a
        result of actually implementing the standard. The format can
        accommodate any currently used data acquisition configuration, and
        has the flexibility to handle any reasonable future configuration.
        Supported configurations include single and multi-station
        :term:`MT`, telluric-magnetotelluric sounding, :term:`EMAP`
        profiling, and combinations of the above. Provisions have been made
        for interchange of data at various levels of acquisition and
        processing, including time series, power spectra, impedance, and
        computed parameters including non-standard parameters.

        The data interchange file is always an ASCII file. However, a
        provision has been made to store actual data values in a parallel
        binary data file and reference them through the (ASCII) data
        interchange file. See the EDI file format construction in the
        document of `Wight and Drive,1988 `_.

    Electrical resistivity profiling
    DC-Profiling
    ERP
        It is an electrical geophysical method that is easy to use. It is
        used to select the best conductive zone (anomaly) where the drilling
        operations are expected to be performed. Because it is cheap, local
        companies and drilling ventures mostly prefer it for their surveys,
        which helps them stay on the project timeline and save money.
        Moreover, it is used first, before the DC-sounding, to select the
        conductive zone based on the resistivity values that characterize
        the fracture expected to contain groundwater.

    Electromagnetic
    EM
        It means relating to the science that deals with the relationship
        between electricity and magnetism.

    Electromagnetic array profiling
    EMAP
        A variant of the magnetotelluric method (q.v.) where electric field
        measurements along a profile are made in a continuous fashion. Data
        are then processed to remove the effects of near-surface
        inhomogeneities.

    Flow-rate
    Flow rate
    flow
    FR
        Flow is the movement of liquids, gases, or electricity in one
        direction, especially continuously. For example:

        - lava from the volcano that was flowing down the hillside.
        - many short rivers flow into the Pacific Ocean.
        - a river flows through three counties before flowing into the sea
          just south of here.
        - more ...

        Here, when we talk about the ``flow rate``, we mostly mean the flow
        of underground water (:term:`groundwater`) circulating in the
        fractures, obtained during or after the :term:`drilling` operations.
        It can be defined as the mass or volume of :term:`water` that passes
        per unit of time. The unit is mostly expressed in :math:`m^3/hr`.

    Geology
    Geological
        Geology is a branch of natural science concerned with Earth and
        other astronomical objects, the features or rocks of which they are
        composed, and the processes by which they change over time.

    Geophysics
    Geophysical
        Geophysics is a subject of natural science concerned with the
        physical processes and physical properties of the Earth and its
        surrounding space environment, and the use of quantitative methods
        for their analysis. The term geophysics classically refers to solid
        earth applications.

    Geosciences
        The study of the Earth: its oceans, atmosphere, rivers and lakes,
        ice sheets and glaciers, soils, its complex surface, rocky interior,
        and metallic core. This includes many aspects of how living things,
        including humans, interact with the Earth.

    Groundwater
    Underground water
        It is the water present beneath Earth's surface in rock and soil
        pore spaces and in the fractures of rock formations.

    Groundwater exploration
        Groundwater exploration is the investigation of underground
        formations to understand the hydrologic cycle, assess the
        :term:`groundwater` quality, and identify the nature, number and
        type of aquifers. There are different groundwater exploration
        methods, among them the surface :term:`geophysical` methods. One of
        the surface geophysical methods is the
        :term:`vertical electrical sounding` (VES) method. :term:`VES`
        provides valuable information about the vertical succession of
        subsurface geo-materials in terms of their individual thicknesses
        and corresponding resistivity values. It is rapid and effective in
        estimating the aquifer thickness of an area, and is a cheap
        technique for :term:`groundwater` studies.

    GWE
        Shorthand for :term:`groundwater exploration`.

    Hydrogeology
        Hydrogeology (hydro- meaning water, and -geology meaning the study
        of the Earth) is the area of geology that deals with the
        distribution and movement of groundwater in the soil and rocks of
        the Earth's crust (commonly in :term:`aquifers <aquifer>`). The
        terms :term:`groundwater` hydrology, geohydrology, and hydrogeology
        are often used interchangeably. It can also be defined as the study
        of the laws governing the movement of subterranean water, the
        mechanical, chemical, and thermal interaction of this water with the
        porous solid, and the transport of energy, chemical constituents,
        and particulate matter by flow.

    Hydrogeophysics
    Hydro-geophysics
    hydrogeophysical
    hydro-geophysical
        Hydrogeophysics is a cross-disciplinary area of research that uses
        geophysics to determine parameters (characteristics; measurements of
        limitations or boundaries) and monitor processes for hydrological
        studies of matters such as water resources, contamination, and
        ecological studies. The field draws on knowledge and researchers
        from geology, hydrology, physics, geophysics, engineering,
        statistics, and rock physics.
        It uses geophysics to provide quantitative information about
        hydrogeological parameters, using minimally invasive methods.
        Hydrogeophysics differs from geophysics in its specific uses and
        methods. Although geophysical knowledge and methods have existed and
        grown over the last half century for applications in the mining and
        petroleum industries, hydrogeological study sites have different
        subsurface conditions from those industries.

    Hydrology
    Hydrologic
        Hydrology is an important field of :term:`geosciences` study,
        dealing with :term:`water` as one of the most valuable resources on
        Earth. All aspects of the Earth's available water are studied by
        experts from many disciplines, from geologists to engineers, to
        obtain the information needed to manage this vital resource.

    Magnetotelluric
    MT
        Magnetotelluric (MT) is an electromagnetic geophysical method for
        inferring the earth's subsurface electrical conductivity from
        measurements of natural geomagnetic and geoelectric field variation
        at the Earth's surface. Investigation depth ranges from 300 m below
        ground, by recording higher frequencies, down to 10,000 m or deeper
        with long-period soundings.

    Natural source audio-frequency magnetotelluric
    Natural source audio magnetotelluric
    NSAMT
        It is an :term:`EM` method that involves the measurement of
        electromagnetic fields that are generated by natural electromagnetic
        activity above the earth's surface. Natural source signals are
        generated in the atmosphere and magnetosphere. :term:`MT` signals
        (low frequencies < 1 Hz) are generated by the interaction between
        the earth's magnetosphere and the solar wind, sunspot activity and
        auroras. High-frequency sources (> 1 Hz) in the audio range (AMT)
        are generated by worldwide thunderstorms and lightning. These
        time-varying electric and magnetic fields induce currents in the
        earth and oceans.

    ohmic-area
    ohmS
        It is also called the pseudo-area of the fractured zone.
        The ohmS is a fully data-driven parameter and is used to evaluate a
        pseudo-area of the fracture zone from the depth where the basement
        rock is supposed to start. Indeed, when exploring deeper using the
        VES, we are usually looking for groundwater in the fractured rock
        that lies beyond the reach of anthropic pollution. Since the VES is
        an indirect method, we cannot ascertain whether the presumed
        fractured rock contains water. However, we assume that the fracture
        zone could exist and should contain groundwater. Mathematically, the
        computed area constitutes the expected "fracture zone" and is
        expressed in :math:`\Omega.m^2`. Thus, :math:`ohmS \neq 0` confirms
        the existence of the fracture zone, while :math:`ohmS = 0` raises
        doubts.

    PNAEP
    National drinking water supply program
    Programme national d'adduction en eau potable
        PNAEP is a large program funded by the :term:`Cote d'Ivoire`
        `government `__ in partnership with international organizations such
        as the `World Bank `__ and `UNICEF `__ to bring drinking water to
        rural and urban populations in several villages and towns within the
        country. The program ran in phases, from 2012 to 2014 and from 2015
        to 2016. During this program the local geophysical companies were
        assisted by and West-Africa International Drilling Company
        (`FORACO-CI `__) for the development of :term:`drilling` after
        :term:`geophysics` exploration.

    PPU
    Emergency presidential program
    Programme presidentiel d'urgence
        The PPU is a :term:`Cote d'Ivoire` `government `__ initiative
        launched a few months after the inauguration of
        `Dr Alassane Ouattara `__ as president of Cote d'Ivoire. The project
        was opened in 2012, just after the politico-military crisis. It ran
        from 2012 to 2013 and was relaunched from 2013 to 2014. The
        objective was the same as that of the :term:`PNAEP`, but the project
        was financed by the state government, with a small part of the funds
        collected from the `NGO children of Africa `__ led by
        `Dominique Ouattara `__, First Lady of Côte d'Ivoire.
        See also `Mel et al., 2018 `__.

    pseudo-fracturing index
    IF
    sfi
        It is a data-driven parameter computed from :term:`DC-sounding`. It
        does not indicate the degree of rock fracturing in the underground;
        it is used to speculate about the apparent resistivity dispersion
        ratio around the cumulated sum of the resistivity values of the
        selected anomaly. It uses a similar approach to the fracturing ratio
        (`IF `__) parameter proposed by `Dieng et al. (2004) `__.

    permeability coefficient k
    permeability coefficient
    coefficient of permeability
    k
        Simply, it is the rate of flow of water under laminar flow
        conditions through a unit cross-sectional area of a porous medium
        under a unit hydraulic gradient and a standard temperature, usually
        20 degrees C. In physics, it is defined as the volume of an
        incompressible fluid that will flow in unit time through a unit cube
        of a porous substance across which a unit pressure difference is
        maintained. Consult the `documentation `__ for further details.

    Required-flow rate
    Required flow rate
    Recommended FR
    Required FR
    Flow rate required
    Flow rate recommended
    RFR
        RFR is defined as the :term:`flow rate` that is requested or
        required by contractors and funders to supply a specific locality
        with potable water. It is strongly tied to the number of inhabitants
        of that locality. For instance, the :term:`FR` recommended to supply
        a village with an estimated population of 2,000 inhabitants should
        naturally differ from the :term:`FR` for 20,000 inhabitants, i.e. a
        population 10 times larger. The objective of the funders when they
        require this :term:`FR` is to address the issue of water scarcity in
        the long term by minimizing the impact of climate change during the
        dry season (especially for populations living in Africa).

    Resistance
        It is simply the measured voltage over the measured (known) current
        (which is Ohm's Law).
        Resistance will change if the measurement geometry or the volume of
        material changes. Therefore, it is NOT a physical property.

    Resistivity
        It is basically the resistance per unit volume. It is defined as the
        voltage measured across a unit cube's length (volts per metre, or
        V/m) divided by the current flowing through the unit cube's
        cross-sectional area (amps per metre squared, or :math:`A/m^2`).
        This results in units of :math:`\Omega.m^2/m` or :math:`\Omega.m`.
        The Greek symbol :math:`\rho` is often used to represent
        resistivity.

    Society of Exploration Geophysicists
    SEG
        SEG is a learned society dedicated to promoting the science and
        education of exploration geophysics in particular and geophysics in
        general. The Society fosters the expert and ethical practice of
        geophysics in the exploration and development of natural resources,
        in characterizing the near-surface, and in mitigating earth hazards.
        See more at `SEG.ORG `__.

    Shape
        The shape parameter is mostly used in the basement medium to
        identify the best conductive zone for the drilling location. Various
        anomaly shapes can be described, such as `"W"`, `"U"`, `"V"`,
        `"H"`, `"C"`, `"K"`, and `"M"`. The choice of the anomaly shape is
        sometimes influenced by external project requirements. Commonly,
        this situation arises during campaigns for VH/IVH systems. One of
        these constraints is minimizing the distance between the center of
        the village and the future borehole (expected drilling location). In
        such circumstances, the order of priority of the anomaly shapes is
        not always followed. For instance, a shape `W` found at 1000 m from
        the center of the village should be replaced by a shape `V` found at
        100 m if no other shape `W` or `U` satisfies the project conditions.
        One more constraint is the existence of anthropic waste sites in the
        vicinity of the ERP line.
        Thus, an anomaly of higher priority (`W` or `U`) found close to
        waste sites is systematically rejected in favor of the shape with
        the next lower priority. In :term:`CDWS`, especially in rural areas,
        the external project constraints are an important factor that
        influences the choice of the anomaly shape and must be taken into
        consideration.

    Signal-to-noise ratio
    Signal-to-Noise Ratio
    S/N
    SNR
        SNR is a measure used in science and engineering that compares the
        level of a desired signal to the level of background noise. SNR is
        defined as the ratio of signal power to noise power, often expressed
        in decibels. A ratio higher than 1:1 (greater than 0 dB) indicates
        more signal than noise. The signal-to-noise ratio is difficult to
        quantify accurately because it is difficult to completely separate
        signal from noise. It also depends on how noise is defined. For
        further details, see the following `documentation `__.

    tensor
    EM-tensor
    EM field tensor
        In electromagnetism, the tensor or :term:`EM` field tensor
        (sometimes called the field strength tensor, `Faraday `__ tensor or
        `Maxwell `__ bivector) is a mathematical object that describes the
        :term:`EM` field in spacetime. In an :term:`EM` survey, different
        array types are considered for acquiring the most beneficial data
        for a particular area. The best known are scalar and tensor arrays.
        Typically, scalar data are collected in the traverse direction with
        multiple Ex measurements and one Hy. Tensor :term:`NSAMT` data
        measure additional components (Ey and Hx), which provide information
        about directionality. Ex and Ey refer to the directions of the array
        used to measure the electric field. Hx and Hy refer to the
        directions used to measure the magnetic field. Every Ex has an
        associated orthogonal Hy measurement, and every Ey has an associated
        Hx. ``xx``, ``xy``, ``yx`` and ``yy`` are called the tensor
        components.
        Data are mostly acquired according to ``xy`` and ``yx``, which refer
        to the transverse electric (TE) and transverse magnetic (TM) modes
        respectively. Download the following `book `__ for further details
        about :term:`EM` methods.

    Type
    type_
        The type parameter comes after the shape as a second criterion. For
        instance, two anomalies with the same :term:`shape` `W` will differ
        in the order of priority of their types. The type depends on the
        lateral resistivity distribution in the underground (resulting from
        the pace of the apparent resistivity curve) along the whole
        :term:`ERP`. According to the African Hydraulic Study Committee
        report (CIEH), four types of anomalies are emphasized:

        * EC: The extensive or large conductive zone (`EC`); when the width
          of the conductive area :math:`W_{cz}` is greater than four times
          the dipole length everywhere in the ERP line;
        * CB2P: The contact between two planes (`CB2P`): defines the contact
          between two planes with different levels of the lateral
          distribution of the apparent resistivity values;
        * NC: The narrow conductive zone (`NC`): when the width of the
          conductive zone ranges between the dipole length (:math:`d_l`)
          and four times the dipole length everywhere in the ERP line;
        * CP: The conductive plane zone (`CP`) shows a large and stretched
          conductive zone, and is evaluated along the whole :term:`ERP`
          line.

    Vertical electrical sounding
    DC-Sounding
    VES
        It is a simple and fast geophysical method. It is used mostly to
        speculate about the existence of a fracture zone and the layer
        thicknesses. It mostly comes after the :term:`DC-Profiling` (ERP).

    Water
    H2O
        It is an inorganic compound with the chemical formula
        :math:`H_2O`. It is a transparent, tasteless, odorless, and nearly
        colorless chemical substance, which is the main constituent of
        Earth's hydrosphere and the fluids of all known living organisms (in
        which it acts as a solvent). See its different properties as
        parameters in the :class:`watex.property.Water` class.
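The two width-based rules given under *Type* (``EC`` and ``NC``) can be sketched as a small helper. This is only an illustration under stated assumptions: the function name ``classify_width_type`` is hypothetical and not a watex function, the ``NC`` bounds are assumed to be inclusive, and ``CB2P`` and ``CP`` are left undecided because they depend on the whole resistivity profile rather than on a single width.

```python
def classify_width_type(width, dipole_length):
    """Return 'EC', 'NC', or None from the width of a conductive zone.

    Hypothetical helper: only the two width-based types are covered.
    """
    if width <= 0 or dipole_length <= 0:
        raise ValueError("width and dipole_length must be positive")
    if width > 4 * dipole_length:
        return "EC"  # extensive conductive zone: wider than 4 dipoles
    if width >= dipole_length:
        return "NC"  # narrow conductive zone: between 1 and 4 dipoles
    return None  # narrower than one dipole: not covered by these rules


# Widths in metres, tested against a 10 m dipole length.
labels = [classify_width_type(w, 10.0) for w in (5.0, 10.0, 40.0, 50.0)]
print(labels)
```

Returning ``None`` instead of guessing keeps the sketch honest about the cases the glossary rules do not settle.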

    Well
        A well is an excavation or structure created in the ground by
        digging, driving, or :term:`drilling` to access liquid resources,
        usually :term:`water`. It should not be confused with the
        :term:`borehole`. A well is mostly no deeper than 100 m and refers
        only to water, whereas a borehole needs more technical and advanced
        technologies, especially when it comes to seismic investigation.

.. _glossary_computer_concepts:

Computer concepts
==================

These are a few concepts that help to understand conventional terms in
computer science. Some mathematical concepts are also included. The list is
not exhaustive.

.. glossary::

    1d
    1d array
        One-dimensional array. A NumPy array whose ``.shape`` has length 1.
        A vector.

    2d
    2d array
        Two-dimensional array. A NumPy array whose ``.shape`` has length 2.
        Often represents a matrix.

    API
        Refers to both the *specific* interfaces for estimators implemented
        in watex and the *generalized* conventions across types of
        estimators as described in this glossary and
        :ref:`overviewed in the contributor documentation `.

        The specific interfaces that constitute watex's public API are
        largely documented in :ref:`api_ref`. However, we less formally
        consider anything as public API if none of the identifiers required
        to access it begins with ``_``. We generally try to maintain
        :term:`backwards compatibility` for all objects in the public API.

        Private API, including functions, modules and methods beginning
        with ``_``, is not assured to be stable.

    array-like
        The most common data format for *input* to watex estimators and
        functions, array-like is any type of object for which
        :func:`numpy.asarray` will produce an array of appropriate shape
        (usually 1 or 2-dimensional) and of appropriate dtype (usually
        numeric).

        This includes:

        * a numpy array
        * a list of numbers
        * a list of length-k lists of numbers for some fixed length k
        * a :class:`pandas.DataFrame` with all columns numeric
        * a numeric :class:`pandas.Series`

        It excludes:

        * a :term:`sparse matrix`
        * an iterator
        * a generator

        Note that *output* from scikit-learn estimators and functions (e.g.
        predictions) should generally be arrays or sparse matrices, or lists
        thereof (as in multi-output :class:`tree.DecisionTreeClassifier`'s
        ``predict_proba``). An estimator where ``predict()`` returns a list
        or a `pandas.Series` is not valid.

    attribute
    attributes
        We mostly use attribute to refer to how model information is stored
        on an estimator during fitting. Any public attribute stored on an
        estimator instance is required to begin with an alphabetic character
        and end in a single underscore if it is set in :term:`fit` or
        :term:`partial_fit`. These are what is documented under an
        estimator's *Attributes* documentation. The information stored in
        attributes is usually either: sufficient statistics used for
        prediction or transformation; :term:`transductive` outputs such as
        :term:`labels_` or :term:`embedding_`; or diagnostic data, such as
        :term:`feature_importances_`. Common attributes are listed
        :ref:`below `.

        A public attribute may have the same name as a constructor
        :term:`parameter`, with a ``_`` appended. This is used to store a
        validated or estimated version of the user's input. For example,
        :class:`decomposition.PCA` is constructed with an ``n_components``
        parameter. From this, together with other parameters and the data,
        PCA estimates the attribute ``n_components_``.

        Further private attributes used in prediction/transformation/etc.
        may also be set when fitting. These begin with a single underscore
        and are not assured to be stable for public access.

        A public attribute on an estimator instance that does not end in an
        underscore should be the stored, unmodified value of an ``__init__``
        :term:`parameter` of the same name.
        Because of this equivalence, these are documented under an
        estimator's *Parameters* documentation.

    backwards compatibility
        We generally try to maintain backward compatibility (i.e. interfaces
        and behaviors may be extended but not changed or removed) from
        release to release, but this comes with some exceptions:

        Public API only
            The behavior of objects accessed through private identifiers
            (those beginning with ``_``) may be changed arbitrarily between
            versions.

        As documented
            We will generally assume that the users have adhered to the
            documented parameter types and ranges. If the documentation asks
            for a list and the user gives a tuple, we do not assure
            consistent behavior from version to version.

        Deprecation
            Behaviors may change following a :term:`deprecation` period
            (usually two releases long). Warnings are issued using Python's
            :mod:`warnings` module.

        Keyword arguments
            We may sometimes assume that all optional parameters (other than
            X and y to :term:`fit` and similar methods) are passed as
            keyword arguments only and may be positionally reordered.

        Bug fixes and enhancements
            Bug fixes and, less often, enhancements may change the behavior
            of estimators, including the predictions of an estimator trained
            on the same data and :term:`random_state`. When this happens, we
            attempt to note it clearly in the changelog.

        Serialization
            We make no assurances that pickling an estimator in one version
            will allow it to be unpickled to an equivalent model in the
            subsequent version.

        Despite this informal contract with our users, the software is
        provided as is, as stated in the license. When a release
        inadvertently introduces changes that are not backward compatible,
        these are known as software regressions.

    callable
        A function, class or an object which implements the ``__call__``
        method; anything for which `callable() `_ returns True.

    categorical feature
        A categorical or nominal :term:`feature` is one that has a finite
        set of discrete values across the population of data.
        These are commonly represented as columns of integers or strings.
        Strings will be rejected by most scikit-learn estimators, and
        integers will be treated as ordinal or count-valued. For use with
        most estimators, categorical variables should be one-hot encoded.
        Notable exceptions include tree-based models such as random forests
        and gradient boosting models that often work better and faster with
        integer-coded categorical variables.

    Cython
        Cython is a programming language, a (non-strict, at least for
        :term:`Python` 3) superset of the Python programming language (with
        slightly different semantics), designed to give C-like performance
        with code that is written mostly in Python with optional additional
        C-inspired syntax. See more `here `_.

    deprecation
        We use deprecation to slowly violate our
        :term:`backwards compatibility` assurances, usually to:

        * change the default value of a parameter; or
        * remove a parameter, attribute, method, class, etc.

        We will ordinarily issue a warning when a deprecated element is
        used, although there may be limitations to this. For instance, we
        will raise a warning when someone sets a parameter that has been
        deprecated, but may not when they access that parameter's attribute
        on the estimator instance.

    dimensionality
        May be used to refer to the number of :term:`features` (i.e.
        :term:`n_features`), or columns in a 2d feature matrix. Dimensions
        are, however, also used to refer to the length of a NumPy array's
        shape, distinguishing a 1d array from a 2d matrix.

    docstring
        The embedded documentation for a module, class, function, etc.,
        usually in code as a string at the beginning of the object's
        definition, and accessible as the object's ``__doc__`` attribute.

        We try to adhere to `PEP257 `_, and follow
        `NumpyDoc conventions `_.

    double underscore
    double underscore notation
        When specifying parameter names for nested estimators, ``__`` may be
        used to separate between parent and child in some contexts.
        The most common use is when setting parameters through a
        meta-estimator with :term:`set_params` and hence in specifying a
        search grid in :ref:`parameter search `. See :term:`parameter`. It
        is also used in :meth:`pipeline.Pipeline.fit` for passing
        :term:`sample properties` to the ``fit`` methods of estimators in
        the pipeline.

    dtype
    data type
        NumPy arrays assume a homogeneous data type throughout, available in
        the ``.dtype`` attribute of an array (or sparse matrix). We
        generally assume simple data types for scikit-learn data: float or
        integer. We may support object or string data types for arrays
        before encoding or vectorizing. Our estimators do not work with
        struct arrays, for instance.

        Our documentation can sometimes give information about the dtype
        precision, e.g. `np.int32`, `np.int64`, etc. When the precision is
        provided, it refers to the NumPy dtype. If an arbitrary precision is
        used, the documentation will refer to dtype `integer` or `floating`.
        Note that in this case, the precision can be platform dependent. The
        `numeric` dtype refers to accepting both `integer` and `floating`.

    early stopping
        This consists in stopping an iterative optimization method before
        the convergence of the training loss, to avoid over-fitting. This is
        generally done by monitoring the generalization score on a
        validation set.

    estimator instance
        We sometimes use this terminology to distinguish an
        :term:`estimator` class from a constructed instance. For example, in
        the following, ``cls`` is an estimator class, while ``est1`` and
        ``est2`` are instances::

            from sklearn.ensemble import RandomForestClassifier

            cls = RandomForestClassifier  # the class itself
            est1 = cls()  # an instance built through the alias
            est2 = RandomForestClassifier()  # an instance built directly

    examples
        We try to give examples of basic usage for most functions and
        classes in the API:

        * as doctests in their docstrings (i.e. within the ``watex/``
          package code itself).
        * as examples in the :ref:`example gallery ` rendered (using
          `sphinx-gallery `_) from scripts in the ``examples/`` directory,
          exemplifying key features or parameters of the
          estimator/function.
These should also be referenced from the User Guide. * sometimes in the :ref:`User Guide ` (built from ``doc/``) alongside a technical description of the estimator. experimental An experimental tool is already usable but its public API, such as default parameter values or fitted attributes, is still subject to change in future versions without the usual :term:`deprecation` warning policy. evaluation metric evaluation metrics Evaluation metrics give a measure of how well a model performs. We may use this term specifically to refer to the functions in :mod:`metrics` (disregarding :mod:`metrics.pairwise`), as distinct from the :term:`score` method and the :term:`scoring` API used in cross validation. See :ref:`model_evaluation`. These functions usually accept a ground truth (or the raw data where the metric evaluates clustering without a ground truth) and a prediction, be it the output of :term:`predict` (``y_pred``), of :term:`predict_proba` (``y_proba``), or of an arbitrary score function including :term:`decision_function` (``y_score``). Functions are usually named to end with ``_score`` if a greater score indicates a better model, and ``_loss`` if a lesser score indicates a better model. This diversity of interface motivates the scoring API. Note that some estimators can calculate metrics that are not included in :mod:`metrics` and are estimator-specific, notably model likelihoods. feature features feature vector In the abstract, a feature is a function (in its mathematical sense) mapping a sampled object to a numeric or categorical quantity. "Feature" is also commonly used to refer to these quantities, being the individual elements of a vector representing a sample. In a data matrix, features are represented as columns: each column contains the result of applying a feature function to a set of samples. Elsewhere features are known as attributes, predictors, regressors, or independent variables. 
Nearly all estimators in scikit-learn assume that features are numeric, finite and not missing, even when they have semantically distinct domains and distributions (categorical, ordinal, count-valued, real-valued, interval). See also :term:`categorical feature` and :term:`missing values`. ``n_features`` indicates the number of features in a dataset. fitting Calling :term:`fit` (or :term:`fit_transform`, :term:`fit_predict`, etc.) on an estimator. fitted The state of an estimator after :term:`fitting`. There is no conventional procedure for checking if an estimator is fitted. However, an estimator that is not fitted: * should raise :class:`exceptions.NotFittedError` when a prediction method (:term:`predict`, :term:`transform`, etc.) is called. (:func:`utils.validation.check_is_fitted` is used internally for this purpose.) * should not have any :term:`attributes` beginning with an alphabetic character and ending with an underscore. (Note that a descriptor for the attribute may still be present on the class, but hasattr should return False) function We provide ad hoc function interfaces for many algorithms, while :term:`estimator` classes provide a more consistent interface. In particular, watex may provide a function interface that fits a model to some data and returns the learnt model parameters, as in :func:`linear_model.enet_path`. For transductive models, this also returns the embedding or cluster labels, as in :func:`manifold.spectral_embedding` or :func:`cluster.dbscan`. Many preprocessing transformers also provide a function interface, akin to calling :term:`fit_transform`, as in :func:`preprocessing.maxabs_scale`. Users should be careful to avoid :term:`data leakage` when making use of these ``fit_transform``-equivalent functions. 
We do not have a strict policy about when to or when not to provide function forms of estimators, but maintainers should consider consistency with existing interfaces, and whether providing a function would lead users astray from best practices (as regards data leakage, etc.). gallery See :term:`examples`. hyperparameter hyper-parameter See :term:`parameter`. impute imputation Most machine learning algorithms require that their inputs have no :term:`missing values`, and will not work if this requirement is violated. Algorithms that attempt to fill in (or impute) missing values are referred to as imputation algorithms. indexable An :term:`array-like`, :term:`sparse matrix`, pandas DataFrame or sequence (usually a list). induction inductive Inductive (contrasted with :term:`transductive`) machine learning builds a model of some data that can then be applied to new instances. Most estimators in watex are inductive, having :term:`predict` and/or :term:`transform` methods. joblib A Python library (https://joblib.readthedocs.io) used in watex to facilitate simple parallelism and caching. Joblib is oriented towards efficiently working with NumPy arrays, such as through use of :term:`memory mapping`. See :ref:`parallelism` for more information. kernel Specifies the kernel function to be used by kernel method algorithms. For example, the estimator :class:`watex.exlib.SVC` has a ``kernel`` parameter that takes the name of the kernel to use as a string or a callable kernel function used to compute the kernel matrix. label indicator matrix multilabel indicator matrix multilabel indicator matrices The format used to represent multilabel data, where each row of a 2d array or sparse matrix corresponds to a sample, each column corresponds to a class, and each element is 1 if the sample is labeled with the class and 0 if not.
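The label indicator matrix format described above can be sketched with plain NumPy. This is only an illustration: the class names (``clay``, ``gravel``, ``sand``) and the ``labels`` list are hypothetical and are not taken from any watex dataset.

```python
import numpy as np

# Hypothetical multilabel annotations: one set of class names per sample.
labels = [{"clay", "sand"}, {"gravel"}, set(), {"clay", "gravel", "sand"}]

# Columns follow the sorted class names so that the layout is deterministic.
classes = sorted(set().union(*labels))

# Element (i, j) is 1 if sample i is labeled with class j, and 0 otherwise.
indicator = np.array(
    [[int(name in sample_labels) for name in classes] for sample_labels in labels]
)

print(classes)
print(indicator)
```

Each row corresponds to a sample and each column to a class; the third sample carries no label, so its row contains only zeros.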
leakage data leakage A problem in cross validation where generalization performance can be over-estimated since knowledge of the test data was inadvertently included in training a model. This is a risk, for instance, when applying a :term:`transformer` to the entirety of a dataset rather than each training portion in a cross validation split. We aim to provide interfaces (such as :mod:`pipeline` and :mod:`model_selection`) that shield the user from data leakage. max_iter For estimators involving iterative optimization, this determines the maximum number of iterations to be performed in :term:`fit`. If ``max_iter`` iterations are run without convergence, a :class:`exceptions.ConvergenceWarning` should be raised. memory Some estimators make use of :class:`joblib.Memory` to store partial solutions during fitting. Thus when ``fit`` is called again, those partial solutions have been memoized and can be reused. memmapping memory map memory mapping A memory efficiency strategy that keeps data on disk rather than copying it into main memory. Memory maps can be created for arrays that can be read, written, or both, using :obj:`numpy.memmap`. When using :term:`joblib` to parallelize operations in watex, it may automatically memmap large arrays to reduce memory duplication overhead in multiprocessing. missing values Most watex estimators do not work with missing values. When they do (e.g. in :class:`impute.SimpleImputer`), NaN is the preferred representation of missing values in float arrays. If the array has integer dtype, NaN cannot be represented. For this reason, we support specifying another ``missing_values`` value when :term:`imputation` or learning can be performed in integer space. :term:`Unlabeled data ` is a special case of missing values in the :term:`target`. ``n_features`` The number of :term:`features`. ``n_outputs`` The number of :term:`outputs` in the :term:`target`. ``n_samples`` The number of :term:`samples`. ``n_targets`` Synonym for :term:`n_outputs`.
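As a rough illustration of the ``n_samples``, ``n_features`` and ``n_outputs`` conventions, and of NaN as the missing-value marker in float arrays, the sketch below uses only NumPy. The arrays are made up, and the column-mean fill is a hand-written stand-in for an imputation algorithm, not the watex or scikit-learn implementation.

```python
import numpy as np

# Made-up float data: NaN marks the missing entries.
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])
# Made-up target with two outputs per sample.
y = np.array([[0.5, 1.0],
              [1.5, 0.0],
              [2.5, 1.0],
              [3.5, 0.0]])

n_samples, n_features = X.shape   # rows are samples, columns are features
n_outputs = y.shape[1]            # one column per output in the target

# Simple mean imputation: replace each NaN by the mean of the
# observed values in the same column.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print(n_samples, n_features, n_outputs)
print(X_imputed)
```

An integer array could not hold the NaN markers used here, which is why a separate ``missing_values`` sentinel is needed in integer space.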
narrative docs narrative documentation An alias for :ref:`User Guide `, i.e. documentation written in ``doc/modules/``. Unlike the :ref:`API reference ` provided through docstrings, the User Guide aims to: * group tools provided by watex together thematically or in terms of usage; * motivate why someone would use each particular tool, often through comparison; * provide both intuitive and technical descriptions of tools; * provide or link to :term:`examples` of using key features of a tool. np A shorthand for NumPy due to the conventional import statement:: import numpy as np online learning Where a model is iteratively updated by receiving each batch of ground truth :term:`targets` soon after making predictions on the corresponding batch of data. Intrinsically, the model must be usable for prediction after each batch. See :term:`partial_fit`. out-of-core An efficiency strategy where not all the data is stored in main memory at once, usually by performing learning on batches of data. See :term:`partial_fit`. outputs Individual scalar/categorical variables per sample in the :term:`target`. For example, in multilabel classification each possible label corresponds to a binary output. Also called *responses*, *tasks* or *targets*. See :term:`multiclass multioutput` and :term:`continuous multioutput`. pair A tuple of length two. parameter parameters param params In the statistical sense, parameters are values that specify a model and can be estimated from data. However, here, what we call parameters might be what statisticians call hyperparameters to the model: aspects for configuring model structure that are often not directly learnt from data. We mostly use *parameter* to refer to the aspects of an estimator that can be specified in its construction. For example, ``max_depth`` and ``random_state`` are parameters of :class:`~watex.exlib.RandomForestClassifier`.
Parameters to an estimator's constructor are stored unmodified as attributes on the estimator instance, and conventionally start with an alphabetic character and end with an alphanumeric character. The constructor parameters of each estimator or assessor are described in its docstring. The list of parameters and their current values can be retrieved from an :term:`estimator instance` using its :term:`get_params` method. Between construction and fitting, parameters may be modified using :term:`set_params`. To enable this, parameters are not ordinarily validated or altered when the estimator is constructed, or when each parameter is set. Parameter validation is performed when :term:`fit` is called. Common parameters are listed :ref:`below `. pairwise metric pairwise metrics In its broad sense, a pairwise metric defines a function for measuring similarity or dissimilarity between two samples (with each ordinarily represented as a :term:`feature vector`). We particularly provide implementations of distance metrics (as well as improper metrics like Cosine Distance) through :func:`metrics.pairwise_distances`, and of kernel functions (a constrained class of similarity functions) in :func:`metrics.pairwise_kernels`. These can compute pairwise distance matrices that are symmetric and hence store data redundantly. See also :term:`precomputed` and :term:`metric`. Note that for most distance metrics, we rely on implementations from :mod:`scipy.spatial.distance`, but may reimplement for efficiency in our context. The :class:`metrics.DistanceMetric` interface is used to implement distance metrics for integration with efficient neighbors search.
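To make the pairwise metric entry concrete, here is a minimal NumPy sketch of a Euclidean pairwise distance matrix. It is a hand-written stand-in using broadcasting, not a call to :func:`metrics.pairwise_distances`, and the three sample points are made up.

```python
import numpy as np

# Three made-up samples, each a feature vector of length two.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Broadcasting yields all sample-to-sample differences,
# with shape (n_samples, n_samples, n_features).
diff = X[:, None, :] - X[None, :, :]

# Euclidean distance between every pair of samples.
D = np.sqrt((diff ** 2).sum(axis=-1))

print(D)
```

The result is square and symmetric with a zero diagonal, so the upper and lower triangles store the same values twice, which is the redundancy mentioned in the entry.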
pd A shorthand for `Pandas `_ due to the conventional import statement:: import pandas as pd precomputed Where algorithms rely on :term:`pairwise metrics`, and can be computed from pairwise metrics alone, we often allow the user to specify that the :term:`X` provided is already in the pairwise (dis)similarity space, rather than in a feature space. That is, when passed to :term:`fit`, it is a square, symmetric matrix, with each vector indicating (dis)similarity to every sample, and when passed to prediction/transformation methods, each row corresponds to a testing sample and each column to a training sample. Use of precomputed X is usually indicated by setting a ``metric``, ``affinity`` or ``kernel`` parameter to the string 'precomputed'. If this is the case, then the estimator should set the `pairwise` estimator tag as True. Python python-based Python is a programming language. It is used in everything from :term:`machine learning` to building websites and software testing, by developers and non-developers alike. It is a general-purpose language, which means it is designed to be used in a range of applications, including data science, software and web development, and automation. In computer science, Python is defined as a high-level, general-purpose, interpreted, object-oriented programming language. Similar to PERL, Python is a programming language popular among experienced C++ and Java programmers. Get further details `here `__. rectangular Data that can be represented as a matrix with :term:`samples` on the first axis and a fixed, finite set of :term:`features` on the second is called rectangular. This term excludes samples with non-vectorial structures, such as text, an image of arbitrary size, a time series of arbitrary length, a set of vectors, etc. The purpose of a :term:`vectorizer` is to produce rectangular forms of such data.
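The shapes expected under the ``precomputed`` convention can be sketched as follows. The linear kernel (a plain dot product) and the small arrays are assumptions chosen for illustration; any pairwise similarity or distance would follow the same layout.

```python
import numpy as np

# Made-up rectangular data: 3 training samples and 2 testing samples.
X_train = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
X_test = np.array([[2.0, 0.0],
                   [0.0, 3.0]])

# Matrix passed to fit: square and symmetric, (n_train, n_train).
K_train = X_train @ X_train.T

# Matrix passed to predict/transform: one row per testing sample and
# one column per training sample, (n_test, n_train).
K_test = X_test @ X_train.T

print(K_train.shape, K_test.shape)
```

An estimator configured with ``kernel='precomputed'`` (or the equivalent ``metric`` or ``affinity`` setting) would then receive ``K_train`` in place of ``X`` at fit time and ``K_test`` at prediction time.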
sample samples We usually use this term as a noun to indicate a single feature vector. Elsewhere a sample is called an instance, data point, or observation. ``n_samples`` indicates the number of samples in a dataset, being the number of rows in a data array :term:`X`. sample property sample properties A sample property is data for each sample (e.g. an array of length n_samples) passed to an estimator method or a similar function, alongside but distinct from the :term:`features` (``X``) and :term:`target` (``y``). As of version 0.19 we do not have a consistent approach to handling sample properties and their routing in :term:`meta-estimators`, though a ``fit_params`` parameter is often used. sparse matrix sparse graph A representation of two-dimensional numeric data that is more memory efficient than the corresponding dense NumPy array where almost all elements are zero. We use the :mod:`scipy.sparse` framework, which provides several underlying sparse data representations, or *formats*. Some formats are more efficient than others for particular tasks, and when a particular format provides especial benefit, we try to document this fact in watex parameter descriptions. Some sparse matrix formats (notably CSR, CSC, COO and LIL) distinguish between *implicit* and *explicit* zeros. Explicit zeros are stored (i.e. they consume memory in a ``data`` array) in the data structure, while implicit zeros correspond to every element not otherwise defined in explicit storage. Two semantics for sparse matrices are used in watex: matrix semantics The sparse matrix is interpreted as an array with implicit and explicit zeros being interpreted as the number 0. This is the interpretation most often adopted, e.g. when sparse matrices are used for feature matrices or :term:`multilabel indicator matrices`.
graph semantics As with :mod:`scipy.sparse.csgraph`, explicit zeros are interpreted as the number 0, but implicit zeros indicate a masked or absent value, such as the absence of an edge between two vertices of a graph, where an explicit value indicates an edge's weight. This interpretation is adopted to represent connectivity in clustering, in representations of nearest neighborhoods (e.g. :func:`neighbors.kneighbors_graph`), and for precomputed distance representation where only distances in the neighborhood of each point are required. When working with sparse matrices, we assume that it is sparse for a good reason, and avoid writing code that densifies a user-provided sparse matrix, instead maintaining sparsity or raising an error if not possible (i.e. if an estimator does not / cannot support sparse matrices). target targets The *dependent variable* in :term:`supervised` (and :term:`semisupervised`) learning, passed as :term:`y` to an estimator's :term:`fit` method. Also known as *outcome variable*, *response variable*, *ground truth* or *label*. watex works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers. transduction transductive A transductive (contrasted with :term:`inductive`) machine learning method is designed to model a specific dataset, but not to apply that model to unseen data. .. _glossary_ml_concepts: Machine Learning concepts ========================== These are basic concepts about Machine Learning. For more in-depth coverage, refer to the scikit-learn `glossary `_. .. glossary:: Artificial intelligence AI Intelligence artificielle IA It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable. See more `here `__.
binary classification A 1-dimensional array, where values strictly greater than zero indicate the positive class (i.e. the last class in :term:`classes_`). classifier An array of shape ``(n_samples,)`` or ``(n_samples, n_outputs)``. :term:`Multilabel ` data may be represented as a sparse matrix if a sparse matrix was used in fitting. Each element should be one of the values in the classifier's :term:`classes_` attribute. clusterer An array of shape ``(n_samples,)`` where each value is from 0 to ``n_clusters - 1`` if the corresponding sample is clustered, and -1 if the sample is not clustered. continuous A regression problem where each sample's target is a finite floating point number represented as a 1-dimensional array of floats (or sometimes ints). ``cv`` Determines a cross validation splitting strategy, as used in cross-validation based routines. leakage data leakage A problem in cross validation where generalization performance can be over-estimated since knowledge of the test data was inadvertently included in training a model. This is a risk, for instance, when applying a :term:`transformer` to the entirety of a dataset rather than each training portion in a cross validation split. ``n_components`` The number of features which a :term:`transformer` should transform the input into. See :term:`components_` for the special case of affine projection. ``n_iter_no_change`` Number of iterations with no improvement to wait before stopping the iterative procedure. This is also known as a *patience* parameter. It is typically used with :term:`early stopping` to avoid stopping too early. ``n_jobs`` This parameter is used to specify how many concurrent processes or threads should be used for routines that are parallelized with :term:`joblib`. ``metric`` As a parameter, this is the scheme for determining the distance between two data points. See :func:`metrics.pairwise_distances`.
In practice, for some algorithms, an improper distance metric (one that does not obey the triangle inequality, such as Cosine Distance) may be used. Machine Learning Machine learning Machine-learning Learning machine ML Machine learning is a branch of artificial intelligence (:term:`AI`) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. See more `here `__. multiclass classification A 2-dimensional array, where the row-wise arg-maximum is the predicted class. Columns are ordered according to :term:`classes_`. multilabel classification Scikit-learn is inconsistent in its representation of :term:`multilabel` decision functions. It may be represented one of two ways: - List of 2d arrays, each array of shape: (`n_samples`, 2), like in multiclass multioutput. List is of length `n_labels`. - Single 2d array of shape (`n_samples`, `n_labels`), with each 'column' in the array corresponding to the individual binary classification decisions. This is identical to the multiclass classification format, though its semantics differ: it should be interpreted, like in the binary case, by thresholding at 0. outlier detector outlier detectors An :term:`unsupervised` binary :term:`predictor` which models the distinction between core and outlying samples. predictor predictors An :term:`estimator` supporting :term:`predict` and/or :term:`fit_predict`. This encompasses :term:`classifier`, :term:`regressor`, :term:`outlier detector` and :term:`clusterer`. In statistics, "predictors" refers to :term:`features`. regressor regressors A :term:`supervised` (or :term:`semi-supervised`) :term:`predictor` with :term:`continuous` output values. transformer transformers An estimator supporting :term:`transform` and/or :term:`fit_transform`. A purely :term:`transductive` transformer, such as :class:`manifold.TSNE`, may not implement ``transform``. 
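The transformer interface described above (``fit``, ``transform``, ``fit_transform``) can be sketched with a small hand-written class. ``StandardizeSketch`` is a hypothetical name used only for this illustration; it is not a watex or scikit-learn class and it skips input validation.

```python
import numpy as np


class StandardizeSketch:
    """Hypothetical transformer that centers and scales each column."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned statistics are stored with a trailing underscore.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        # Constant columns get a scale of 1 to avoid division by zero.
        self.scale_[self.scale_ == 0.0] = 1.0
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

    def fit_transform(self, X, y=None):
        # Same result as calling fit and then transform on the same data.
        return self.fit(X, y).transform(X)


X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test = np.array([[3.0, 12.0]])

scaler = StandardizeSketch()
Xt_train = scaler.fit_transform(X_train)  # statistics come from training data only
Xt_test = scaler.transform(X_test)        # test data reuses those statistics

print(Xt_train)
print(Xt_test)
```

Fitting on the training portion and only transforming the test portion is the usage that avoids data leakage.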
semi-supervised semi-supervised learning semisupervised Learning where the expected prediction (label or ground truth) is only available for some samples provided as training data when :term:`fitting` the model. supervised supervised learning Learning where the expected prediction (label or ground truth) is available for each sample. This is the approach taken in a :term:`classifier` or :term:`regressor` among other estimators. target targets The *dependent variable* in :term:`supervised` (and :term:`semisupervised`) learning, passed as :term:`y` to an estimator's :term:`fit` method. Also known as *dependent variable*, *outcome variable*, *response variable*, *ground truth* or *label*. Scikit-learn works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers. See :ref:`glossary_target_types`. unsupervised unsupervised learning Learning where the expected prediction (label or ground truth) is not available for each sample. ``pos_label`` Value with which positive labels must be encoded in binary classification problems in which the positive class is not assumed. This value is typically required to compute asymmetric evaluation metrics such as precision and recall. ``random_state`` Whenever randomization is part of a Scikit-learn algorithm, a ``random_state`` parameter may be provided to control the random number generator used. Note that the mere presence of ``random_state`` doesn't mean that randomization is always used, as it may be dependent on another parameter, e.g. ``shuffle``, being set. The passed value will have an effect on the reproducibility of the results returned by the function (:term:`fit`, :term:`split`, or any other function like :func:`~sklearn.cluster.k_means`). `random_state`'s value may be: None (default) Use the global random state instance from :mod:`numpy.random`. 
Calling the function multiple times will reuse the same instance, and will produce different results. An integer Use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile checking that your results are stable across a number of different distinct random seeds. Popular integer random seeds are 0 and `42 `_. Integer values must be in the range `[0, 2**32 - 1]`. A :class:`numpy.random.RandomState` instance Use the provided random state, only affecting other users of that same random state instance. Calling the function multiple times will reuse the same instance, and will produce different results. ``scoring`` Specifies the score function to be maximized (usually by :ref:`cross validation `), or -- in some cases -- multiple score functions to be reported. The score function can be a string accepted by :func:`watex.metrics.get_scorers`. .. _glossary_methods: Estimators and assessors methods ================================= .. glossary:: ``decision_function`` In a fitted :term:`classifier` or :term:`outlier detector`, predicts a "soft" score for each sample in relation to each class, rather than the "hard" categorical prediction produced by :term:`predict`. Its input is usually only some observed data, :term:`X`. ``fit`` The ``fit`` method is provided on every estimator. It usually takes some :term:`samples` ``X``, :term:`targets` ``y`` if the model is supervised, and potentially other :term:`sample properties` such as :term:`sample_weight`. It should: * clear any prior :term:`attributes` stored on the estimator, unless :term:`warm_start` is used; * validate and interpret any :term:`parameters`, ideally raising an error if invalid; * validate the input data; * estimate and store model attributes from the estimated parameters and provided data; and * return the now :term:`fitted` estimator to facilitate method chaining. 
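The steps listed for ``fit`` can be sketched with a toy estimator. ``MeanRegressorSketch`` and its ``shrinkage`` parameter are hypothetical names invented for this example; the class only mimics the conventions and does not inherit from any watex or scikit-learn base class.

```python
import numpy as np


class MeanRegressorSketch:
    """Hypothetical estimator that always predicts a (shrunken) training mean."""

    def __init__(self, shrinkage=0.0):
        # Constructor parameters are stored unmodified; validation waits for fit.
        self.shrinkage = shrinkage

    def fit(self, X, y):
        # Validate the parameters, raising an error if they are invalid.
        if not 0.0 <= self.shrinkage <= 1.0:
            raise ValueError("shrinkage must be in [0, 1]")
        # Validate the input data.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        if X.ndim != 2 or X.shape[0] != y.shape[0]:
            raise ValueError("X must be 2d with one row per target value")
        # Estimate and store the model attributes; values from any
        # previous call to fit are overwritten here.
        self.n_features_in_ = X.shape[1]
        self.mean_ = (1.0 - self.shrinkage) * y.mean()
        # Return the fitted estimator to allow method chaining.
        return self

    def predict(self, X):
        if not hasattr(self, "mean_"):
            raise RuntimeError("call fit before predict")
        X = np.asarray(X, dtype=float)
        return np.full(X.shape[0], self.mean_)


# Method chaining works because fit returns the estimator itself.
pred = MeanRegressorSketch().fit([[0.0], [1.0], [2.0]], [1.0, 2.0, 3.0]).predict([[5.0], [6.0]])
print(pred)
```

A real estimator would raise :class:`exceptions.NotFittedError` rather than ``RuntimeError``; the built-in exception is used here only to keep the sketch free of dependencies.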
:ref:`glossary_target_types` describes possible formats for ``y``. ``fit_predict`` Used especially for :term:`unsupervised`, :term:`transductive` estimators, this fits the model and returns the predictions (similar to :term:`predict`) on the training data. In clusterers, these predictions are also stored in the :term:`labels_` attribute, and the output of ``.fit_predict(X)`` is usually equivalent to ``.fit(X).predict(X)``. The parameters to ``fit_predict`` are the same as those to ``fit``. ``fit_transform`` A method on :term:`transformers` which fits the estimator and returns the transformed training data. It takes parameters as in :term:`fit` and its output should have the same shape as calling ``.fit(X, ...).transform(X)``. There are nonetheless rare cases where ``.fit_transform(X, ...)`` and ``.fit(X, ...).transform(X)`` do not return the same value, wherein training data needs to be handled differently (due to model blending in stacked ensembles, for instance; such cases should be clearly documented). :term:`Transductive ` transformers may also provide ``fit_transform`` but not :term:`transform`. One reason to implement ``fit_transform`` is that performing ``fit`` and ``transform`` separately would be less efficient than together. :class:`base.TransformerMixin` provides a default implementation, providing a consistent interface across transformers where ``fit_transform`` is or is not specialized. In :term:`inductive` learning -- where the goal is to learn a generalized model that can be applied to new data -- users should be careful not to apply ``fit_transform`` to the entirety of a dataset (i.e. training and test data together) before further modelling, as this results in :term:`data leakage`. ``get_feature_names_out`` Primarily for :term:`feature extractors`, but also used for other transformers to provide string names for each column in the output of the estimator's :term:`transform` method. 
It outputs an array of strings and may take an array-like of strings as input, corresponding to the names of input columns from which output column names can be generated. If `input_features` is not passed in, then the `feature_names_in_` attribute will be used. If the `feature_names_in_` attribute is not defined, then the input names are named `[x0, x1, ..., x(n_features_in_ - 1)]`. ``get_n_splits`` On a :term:`CV splitter` (not an estimator), returns the number of elements one would get if iterating through the return value of :term:`split` given the same parameters. Takes the same parameters as split. ``get_params`` Gets all :term:`parameters`, and their values, that can be set using :term:`set_params`. A parameter ``deep`` can be used, when set to False to only return those parameters not including ``__``, i.e. not due to indirection via contained estimators. Most estimators adopt the definition from :class:`base.BaseEstimator`, which simply adopts the parameters defined for ``__init__``. :class:`pipeline.Pipeline`, among others, reimplements ``get_params`` to declare the estimators named in its ``steps`` parameters as themselves being parameters. ``partial_fit`` Facilitates fitting an estimator in an online fashion. Unlike ``fit``, repeatedly calling ``partial_fit`` does not clear the model, but updates it with the data provided. The portion of data provided to ``partial_fit`` may be called a mini-batch. Each mini-batch must be of consistent shape, etc. In iterative estimators, ``partial_fit`` often only performs a single iteration. ``predict`` Makes a prediction for each sample, usually only taking :term:`X` as input (but see under regressor output conventions below). In a :term:`classifier` or :term:`regressor`, this prediction is in the same target space used in fitting (e.g. one of {'red', 'amber', 'green'} if the ``y`` in fitting consisted of these strings). 
Despite this, even when ``y`` passed to :term:`fit` is a list or other array-like, the output of ``predict`` should always be an array or sparse matrix. In a :term:`clusterer` or :term:`outlier detector` the prediction is an integer. If the estimator was not already :term:`fitted`, calling this method should raise a :class:`exceptions.NotFittedError`. ``predict_log_proba`` The natural logarithm of the output of :term:`predict_proba`, provided to facilitate numerical stability. ``predict_proba`` A method in :term:`classifiers` and :term:`clusterers` that can return probability estimates for each class/cluster. Its input is usually only some observed data, :term:`X`. If the estimator was not already :term:`fitted`, calling this method should raise a :class:`exceptions.NotFittedError`. Output conventions are like those for :term:`decision_function` except in the :term:`binary` classification case, where one column is output for each class (while ``decision_function`` outputs a 1d array). For binary and multiclass predictions, each row should add to 1. Like other methods, ``predict_proba`` should only be present when the estimator can make probabilistic predictions (see :term:`duck typing`). This means that the presence of the method may depend on estimator parameters (e.g. in :class:`linear_model.SGDClassifier`) or training data (e.g. in :class:`model_selection.GridSearchCV`) and may only appear after fitting. regressor A numeric array of shape ``(n_samples,)``, usually float64. Some regressors have extra options in their ``predict`` method, allowing them to return standard deviation (``return_std=True``) or covariance (``return_cov=True``) relative to the predicted value. In this case, the return value is a tuple of arrays corresponding to (prediction mean, std, cov) as required. ``score`` A method on an estimator, usually a :term:`predictor`, which evaluates its predictions on a given dataset, and returns a single numerical score. 
A greater return value should indicate better predictions; accuracy is used for classifiers and R^2 for regressors by default. If the estimator was not already :term:`fitted`, calling this method should raise a :class:`exceptions.NotFittedError`. Some estimators implement a custom, estimator-specific score function, often the likelihood of the data under the model. ``score_samples`` A method that returns a score for each given sample. The exact definition of *score* varies from one class to another. In the case of density estimation, it can be the log density model on the data, and in the case of outlier detection, it can be the opposite of the outlier factor of the data. If the estimator was not already :term:`fitted`, calling this method should raise a :class:`exceptions.NotFittedError`. ``set_params`` Available in any estimator, takes keyword arguments corresponding to keys in :term:`get_params`. Each is provided a new value to assign such that calling ``get_params`` after ``set_params`` will reflect the changed :term:`parameters`. Most estimators use the implementation in :class:`base.BaseEstimator`, which handles nested parameters and otherwise sets the parameter as an attribute on the estimator. The method is overridden in :class:`pipeline.Pipeline` and related estimators. ``split`` On a :term:`CV splitter` (not an estimator), this method accepts parameters (:term:`X`, :term:`y`, :term:`groups`), where all may be optional, and returns an iterator over ``(train_idx, test_idx)`` pairs. ``transform`` In a :term:`transformer`, transforms the input, usually only :term:`X`, into some transformed space (conventionally notated as :term:`Xt`). Output is an array or sparse matrix of length :term:`n_samples` and with the number of columns fixed after :term:`fitting`. If the estimator was not already :term:`fitted`, calling this method should raise a :class:`exceptions.NotFittedError`. .. 
_glossary_parameters: Core parameters ================= These parameter names are commonly used in estimators, assessors, and common functions. .. glossary:: as_frame Transform the data into a pandas DataFrame including columns with appropriate types (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If `as_frame` is False, a :class:`~watex.utils.box.Boxspace` dictionary-like object is returned, with the following attributes: * data : {ndarray, dataframe} The data matrix. If `as_frame=True`, `data` will be a pandas DataFrame. * resistivity: {array-like} of shape (shape[0],) The resistivity of the sounding point. * MN: {array-like} of shape (shape[0],) The step value of potential electrodes increasing in meters. * AB: {array-like} of shape (shape[0],) The step value of current electrodes increasing in meters. * feature_names: list The names of the dataset columns. * DESCR: str The full description of the dataset. * filename: str The path to the location of the data. data str, filepath_or_buffer or :class:`pandas.core.DataFrame` Path-like object or DataFrame. If data is given as a path-like object, it is read, asserted and validated. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be a file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any :code:`os.PathLike`. By file-like object, we refer to objects with a `read()` method, such as a file handle (e.g. via the builtin `open` function) or `StringIO`. index_rhoa int, index of the resistivity columns to retrieve. Note that this is useful when many sounding values are collected in the same survey area. `index_rhoa=0` fetches the first sounding values in the collection of all values. tag str, Name of the dataset to fetch. A tag can also refer to a dataset processing stage. See `datasets ` for consistent details.
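        As a rough illustration of the `data` entry above, the sketch below separates path-like inputs from file-like inputs using only the criteria stated there (a string or :code:`os.PathLike`, versus an object with a `read()` method). The helper name ``describe_data_input`` is hypothetical; it is not part of watex or pandas, and the CSV content is made up.

        .. code-block:: python

            import io
            import os


            def describe_data_input(data):
                """Classify an input as the `data` entry describes it (illustrative only)."""
                if isinstance(data, (str, os.PathLike)):
                    return "path-like"   # string path, URL, or os.PathLike object
                if hasattr(data, "read"):
                    return "file-like"   # e.g. an open file handle or io.StringIO
                return "other"           # e.g. an in-memory DataFrame


            kinds = [
                describe_data_input("file://localhost/path/to/table.csv"),
                describe_data_input(io.StringIO("AB,MN\n1,0.4\n")),  # made-up content
            ]
            print(kinds)

        Checking for a `read()` attribute rather than a concrete class is the usual duck-typing approach, so any handle returned by `open` qualifies.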
X Ndarray of shape (:math:`M \times N`), :math:`M = m_{samples}` & :math:`N=n_{features}` training set; Denotes data that is observed at training and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. When a matrix, each sample may be represented by a feature vector, or a vector of precomputed (dis)similarity with each training sample. :code:`X` may also not be a matrix, and may require a feature extractor or a pairwise metric to turn it into one before learning a model. y array-like of shape (:math:`M,` ), :math:`M=m_{samples}` train target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction. Xt Ndarray (:math:`M \times N` matrix where :math:`M=m_{samples}` & :math:`N=n_{features}`) Shorthand for "test set"; data that is observed at testing and prediction time, used as independent variables in learning. The notation is uppercase to denote that it is ordinarily a matrix. yt array-like, shape (:math:`M`, ) :math:`M=m_{samples}`, test target; Denotes data that may be observed at training time as the dependent variable in learning, but which is unavailable at prediction time, and is usually the target of prediction. tname str, A target name or label. In supervised learning the target name is considered as the reference name of :term:`y` or label variable. z array-like 1d, pandas.Series Array of depth or a pandas series that contains the depth values. Arrays of two or more dimensions are not allowed. However, when `z` is given as a dataframe and :term:`zname` is not supplied, an error is raised, since `zname` is used to fetch `z` from the dataframe and overwrite it. zname str, int Name of the depth column. `zname` allows retrieving the depth column in a dataframe. If an integer is passed, it is taken as the index of the depth column in the dataframe.
The integer value must not exceed the dataframe size along axis 1. Commonly, `zname` needs to be supplied when a dataframe is passed as a function argument. kname str, int Name of the permeability coefficient column. `kname` allows retrieving the permeability coefficient :term:`k` in a specific dataframe. If an integer is passed, it is taken as the index of the :term:`k` column in the dataframe. Note that the integer value must not exceed the dataframe size along axis 1. Commonly, `kname` needs to be supplied when a dataframe is passed as a positional or keyword argument. k array-like 1d, pandas.Series Array of permeability coefficient 'k' or a pandas series that contains the 'k' values. Arrays of two or more dimensions are not allowed. However, when `k` is passed as a dataframe and `kname` is not supplied, an error is raised, since `kname` is used to retrieve the `k` values from the dataframe and overwrite it. target Array-like or :class:`pd.Series` The dependent variable in supervised (and semisupervised) learning, passed as :term:`y` to an estimator's fit method. Also known as dependent variable, outcome variable, response variable, ground truth or label. :term:`watex` works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers. Refer to `target types`_ . Note that throughout this library, a :term:`target` is considered as a `pd.Series` whose name is `tname` and whose values are :term:`y`, i.e. `target = tname + y`. .. _target types: https://scikit-learn.org/stable/glossary.html#glossary-target-types model callable, always as a function, A model estimator. An object which manages the estimation and decoding of a model.
The model is estimated as a deterministic function of: * parameters provided in object construction or with set_params; * the global numpy.random random state if the estimator’s random_state parameter is set to None; and * any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data similarly passed in a sequence of calls to partial_fit. The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. Estimators must provide a fit method, and should provide `set_params` and `get_params`, although these are usually provided by inheritance from `base.BaseEstimator`. The core functionality of some estimators may also be available as a ``function``. clf callable, always as a function, classifier estimator A supervised (or semi-supervised) predictor with a finite set of discrete possible output values. A classifier supports modeling some of binary, multiclass, multilabel, or multiclass multioutput targets. Within scikit-learn, all classifiers support multi-class classification, defaulting to using a one-vs-rest strategy over the binary classification problem. Classifiers must store a classes_ attribute after fitting, and usually inherit from base.ClassifierMixin, which sets their _estimator_type attribute. A classifier can be distinguished from other estimators with is_classifier. It must implement: * fit * predict * score It may also be appropriate to implement decision_function, predict_proba and predict_log_proba. reg callable, always as a function A regression estimator; Estimators must provide a fit method, and should provide `set_params` and `get_params`, although these are usually provided by inheritance from `base.BaseEstimator`. The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding through prediction and transformation methods. 
The core functionality of some estimators may also be available as a ``function``. cv int, cross-validation splitter or iterable, A cross validation splitting strategy. It is used in cross-validation based routines. cv is also available in estimators such as multioutput.ClassifierChain or calibration.CalibratedClassifierCV which use the predictions of one estimator as training data for another, to not overfit the training supervision. Possible inputs for cv are usually: * An integer, specifying the number of folds in K-fold cross validation. K-fold will be stratified over classes if the estimator is a classifier (determined by base.is_classifier) and the targets may represent a binary or multiclass (but not multioutput) classification problem (determined by utils.multiclass.type_of_target). * A cross-validation splitter instance. Refer to the User Guide for splitters available within :code:`watex`. * An iterable yielding train/test splits. With some exceptions (especially where not using cross validation at all is an option), the default is ``4-fold``. .. _Scikit-learn: https://scikit-learn.org/stable/glossary.html#glossary scoring str, Specifies the score function to be maximized (usually by :ref:`cross validation `), or -- in some cases -- multiple score functions to be reported. random_state int, RandomState instance or None, default=None Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. test_size float or int, default=None If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If ``train_size`` is also None, it will be set to 0.25. n_jobs int, is used to specify how many concurrent processes or threads should be used for routines that are parallelized with joblib. It specifies the maximum number of concurrently running workers.
If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For instance: * For `n_jobs` below -1, (n_cpus + 1 + n_jobs) are used. * For `n_jobs=-2`, all CPUs but one are used. * `n_jobs` is None by default, which means unset; it will generally be interpreted as n_jobs=1 unless the current joblib.Parallel backend context specifies otherwise. Note that even if n_jobs=1, low-level parallelism (via Numpy and OpenMP) might be used in some configurations. verbose int, *default* is ``0`` Controls the level of verbosity. Higher values lead to more messages. Any True value should enable some logging, but larger integers (e.g. above 7) may be needed for full verbosity. self: `Baseclass` instance returns ``self`` for easy method chaining. .. _glossary_plot_parameters: Plotting parameters ==================== These are the parameters for plotting purposes. Each plotting class that inherits from :class:`watex.property.BasePlot`, such as :mod:`~watex.view.ExPlot`, :mod:`~watex.view.TPlot`, :mod:`~watex.view.QuickPlot` or :mod:`~watex.view.EvalPlot`, accepts the following parameters for customizing plots. See also the property parameters :class:`~watex.property.BasePlot`. .. glossary:: savefig str, Path-like object, name of the saved figure, *default* is ``None`` fig_dpi float, dots-per-inch resolution of the figure. *default* is 300 fig_num int, size of figure in inches (width, height). *default* is [5, 5] fig_size tuple (int, int), size of the figure in inches (width, height). *default* is [5, 5] fig_orientation str, figure orientation. *default* is ``landscape`` fig_tile str, figure title. *default* is ``None`` fs float, font size of axis tick labels; axis labels are fs+2. *default* is 6 ls str, line style, it can be [ '-' | '.' | ':' ] .
*default* is '-' lc str, Optional, line color of the plot, *default* is ``k`` lw float, Optional, line weight of the plot, *default* is ``1.5`` alpha float between 0 < alpha < 1, transparency number, *default* is ``0.5`` font_weight str, Optional weight of the font, *default* is ``bold``. font_style str, Optional style of the font. *default* is ``italic`` font_size float, Optional size of font. *default* is ``3``. ms float, Optional size of marker in points. *default* is ``5`` marker str, Optional marker of stations. *default* is ``o``. marker_style str, Optional facecolor of the marker. *default* is ``yellow`` marker_edgecolor str, Optional edgecolor of the marker. *default* is ``yellow`` marker_edgewidth float, Optional edge width of the marker. *default* is ``3``. xminorticks float, Optional minor ticks according to x-axis size. *default* is ``1``. yminorticks float, Optional minor ticks according to y-axis size. *default* is ``1``. bins histogram element separation between two bars. *default* is ``10``. xlim tuple (int, int), Optional limit of x-axis in plot. ylim tuple (int, int), Optional limit of y-axis in plot. xlabel str, Optional, label name of x-axis in plot. ylabel str, Optional, label name of y-axis in plot. rotate_xlabel float, Optional angle to rotate `xlabel` in plot. rotate_ylabel float, Optional angle to rotate `ylabel` in plot. leg_kws dict, Optional keyword arguments of legend. *default* is empty ``dict`` plt_kws dict, Optional keyword arguments of plot. *default* is empty ``dict`` glc str, Optional line color of the grid plot, *default* is ``k`` glw float, Optional line weight of the grid plot, *default* is ``2`` galpha float, Optional, transparency number of grid, *default* is ``0.5`` gaxis str ('x', 'y', 'both') type of axis to hold the grid, *default* is ``both`` gwhich str, Optional kind of grid in the plot. *default* is ``major`` tp_axis str, axis to apply the ticks params.
*default* is ``both`` tp_labelsize str, Optional labelsize of ticks params. *default* is ``italic`` tp_bottom bool, position at bottom of ticks params. *default* is ``True``. tp_labelbottom bool, put label on the bottom of the ticks. *default* is ``False`` tp_labeltop bool, put label on the top of the ticks. *default* is ``True`` cb_orientation str, ('vertical', 'horizontal') orientation of the colorbar, *default* is ``vertical`` cb_aspect float, Optional aspect of the colorbar. *default* is ``20``. cb_shrink float, Optional shrink size of the colorbar. *default* is ``1.0`` cb_pad float, pad of the colorbar of plot. *default* is ``.05`` cb_anchor tuple (float, float) anchor of the colorbar. *default* is ``(0.0, 0.5)`` cb_panchor tuple (float, float) proportionality anchor of the colorbar. *default* is ``(1.0, 0.5)`` cb_label str, Optional label of the colorbar. cb_spacing str, Optional spacing of the colorbar. *default* is ``uniform`` cb_drawedges bool, draw edges inside of the colorbar. *default* is ``False`` ax :class:`matplotlib.axes.Axes` The matplotlib axes containing the plot. .. _glossary_miscellaneous: Miscellaneous ============== These are different plotting functions from `seaborn `__ used for visualization. .. glossary:: scatterplot Plot data using points. lineplot Plot data using lines. displot Figure-level interface to distribution plot functions. histplot Plot a histogram of binned counts with optional normalization or smoothing. kdeplot Plot univariate or bivariate distributions using kernel density estimation. violinplot Draw an enhanced boxplot using kernel density estimation. pointplot Plot point estimates and CIs using markers and lines. boxplot Draw an enhanced boxplot. jointplot Draw a bivariate plot with univariate marginal distributions. pairplot Draw multiple bivariate plots with univariate marginal distributions. JointGrid Set up a figure with joint and marginal views on bivariate data.
PairGrid Set up a figure with joint and marginal views on multiple variables. facetgrid :class:`FacetGrid` An object managing one or more subplots that correspond to conditional data subsets with convenient methods for batch-setting of axes attributes. jointgrid :class:`JointGrid` An object managing multiple subplots that correspond to joint and marginal axes for plotting a bivariate relationship or distribution. pairgrid :class:`PairGrid` An object managing multiple subplots that correspond to joint and marginal axes for pairwise combinations of multiple variables in a dataset. .. _glossary_reference_links: Resource-links ================= .. glossary:: Bagoue region `Bagoue region `__ is located in the northern part of `Cote d’Ivoire `__ and lies between longitudes 6° and 7° W and latitudes 9° and 11° N. According to the regional hydraulics report from the Ivorian Hydraulic Ministry, 40.98% of boreholes are unsuccessful after drilling, 33.25% are unsustainable during the dry seasons and 25.77% of the water of the productive boreholes dried up after three years of use (MHCI, 2012). The average flow rate (FR) observed in this area fluctuates between 1 and 3 :math:`m^3/h`. Optimization drilling `Dieng et al `_ papers. Case-history in Bagoue-region `Kouadio et al `_ case-history paper Côte d'Ivoire Cote d'Ivoire Ivory-Coast Ivory coast CIV A West African country known as the world's leading producer of cocoa. For further details, click on the following `link `__. To see more about the country's cultures and other features, click `here `__. Flow rate prediction using SVMs See :term:`Case-history in Bagoue-region` GeekforGeeks matplotlib `Plotting using matplotlib `__ IUPAC nommenclature The `IUPAC `__ system of nomenclature is a universally-recognized method for naming organic chemical compounds. The goal of the system is to provide each organic compound with a unique and unambiguous name based on its chemical formula and structure.
Matplotlib scatter Example of `scattering plots `__ Matplotlib plot Example of `pyplot plotting `__ Matplotlib figure Example of `plotting figure `__ Matplotlib figsuptitle Example of adding `figure suptitle `__ mtpy MTpy Python toolbox for standard Magnetotelluric (:term:`MT`) data analysis. See resources in `MTpy repository `__ pycsamt pyCSAMT Python for audio-frequency magnetotelluric (:term:`AMT`). See more in `pyCSAMT repository `__ Properties of water Resources towards the `water properties `__ documentation. Pandas DataFrame Understand `Pandas data frame `__ Pandas Series Understand `Pandas Series `__ Scipy Optimize Understand the `curve-fit `__ of Scipy. Water Get more resources about the `water `__ definition and concepts. Water triple point Learn about the water `triple point `_ WATex watex Resources towards the `watex repository `_ . It stands for `WAT`-er `Ex`-ploration, a package for solving unsuccessful drillings, unsustainable boreholes, predicting the permeability coefficient :math:`k`, and other hydrogeophysical engineering problems.
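
Returning to the ``n_jobs`` convention listed under the core parameters, the arithmetic can be sketched in plain Python. This is a minimal illustration, not joblib's or watex's implementation: the helper name ``resolve_n_jobs`` is hypothetical, ``None`` is mapped to a single worker as the glossary describes for the unset case, and rejecting ``0`` is an assumption made for this sketch.

.. code-block:: python

    import os


    def resolve_n_jobs(n_jobs, n_cpus=None):
        """Return the number of workers implied by ``n_jobs`` (illustrative only)."""
        if n_cpus is None:
            n_cpus = os.cpu_count() or 1   # fall back to 1 if undetectable
        if n_jobs is None:
            return 1                       # unset: generally read as n_jobs=1
        if n_jobs == 0:
            raise ValueError("n_jobs=0 is not meaningful")  # assumption of this sketch
        if n_jobs < 0:
            # -1 -> all CPUs, -2 -> all but one, i.e. n_cpus + 1 + n_jobs
            return max(n_cpus + 1 + n_jobs, 1)
        return n_jobs


    workers = [resolve_n_jobs(v, n_cpus=8) for v in (None, 1, -1, -2, 3)]
    print(workers)

Passing ``n_cpus`` explicitly keeps the result independent of the machine running the example; when it is omitted the helper falls back to ``os.cpu_count()``.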