3. Case Histories

cases provides ready-to-use functions and modules for solving real engineering problems, such as flow rate prediction and boosting, with base learners, SVC, and ensemble paradigms. The watex.cases.features, watex.cases.processing, watex.cases.modeling and watex.cases.prepare modules implement the basic steps and can be used for processing and analysis to get a quick picture of what the data looks like. That picture helps decide which processing steps to apply next to the problem at hand.

3.1. Features

features gathers the different manipulations that can be performed on the case-history feature data.

3.1.1. GeoFeatures

GeoFeatures expects geological, borehole and DC-electrical resistivity data, and sets the feature values for the different investigation sites. The GeoFeatures class is composed of:

erp, obtained from ERPCollection
geol, obtained from Geology
boreh, obtained from Borehole

>>> from watex.cases.features import GeoFeatures
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> featObj = GeoFeatures().fit(data)
>>> featObj.id_
array(['e0000001', 'e0000002', 'e0000003', 'e0000004', 'e0000005',
       'e0000006', 'e0000007'], dtype='<U8')
>>> featObj.site_names_[:7] # view the site for borehole
array(['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7'], dtype=object)

3.1.2. Feature Inspection

FeatureInspection summarizes the flow features and handles feature categorization. When numerical values are provided, a standard qualitative or quantitative analysis is performed.

>>> from watex.cases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection().fit(data)
>>> fobj.data_.columns
Index(['num', 'name', 'east', 'north', 'power', 'magnitude', 'shape', 'type',
       'sfi', 'ohmS', 'lwi', 'geol', 'flow'],
      dtype='object')

3.2. Prepare

prepare is the base module that automates data preparation in a single pass. It was created for fast data preparation in real engineering case studies. This is a naive approach for quickly reproducing published work, especially for flow rate prediction.

3.2.1. Base data preparation for case studies

The base steps have been used to solve flow rate prediction problems [1]. Following them lets users quickly reach the same goal as the published paper. As an example, a different variant of the Bagoue dataset [2] is prepared using the BaseSteps module.

3.3. Processing

processing provides the basic processing steps needed to obtain results. Here, we implement the processing steps used for flow rate prediction [1].

3.3.1. Preprocessing

Preprocessing implements the preliminary steps for flow rate prediction.

Note

If \(X\) and \(y\) are provided, they are treated as the feature set and the target, respectively. They should be split into a training set and a test set.

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing(drop_features=['lwi', 'num', 'name']).fit(data=data)
>>> # training set (X, y) and test set (Xt, yt)
>>> len(pc.X), len(pc.y), len(pc.Xt), len(pc.yt)
(344, 344, 87, 87)
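
The sizes above are consistent with an 80/20 hold-out split of 431 samples. The sketch below reproduces that arithmetic in plain Python. It assumes, without confirmation from the text, a test fraction of 0.2 and a test size rounded up (the convention used by scikit-learn's train_test_split). holdout_split is a hypothetical helper for illustration and is not part of watex.

```python
import math
import random


def holdout_split(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) index lists.

    Hypothetical helper: the number of test rows is rounded up,
    and the remaining rows form the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    return indices[n_test:], indices[:n_test]


# 431 = 344 + 87, the total implied by the sizes reported above.
train_idx, test_idx = holdout_split(431)
print(len(train_idx), len(test_idx))  # 344 87
```

Every row index lands in exactly one of the two lists, so the training and test sets never overlap.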

A model is assembled from a pipe (the transformers) and an estimator, as in the snippets below. If either is set to None, the default pipe or estimator is used, which might not be the one expected; providing both explicitly is therefore recommended.

The default preprocessor is obtained by simply calling:

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> pc.makeModel()  # use the default model and preprocessor
>>> pc.model_
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numpipe',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('polynomialfeatures',
                                                                   PolynomialFeatures(degree=10,
                                                                                      include_bias=False)),
                                                                  ('selectors',
                                                                   SelectKBest(k=4)),
                                                                  ('scalers',
                                                                   RobustScaler())]),
                                                  ['east', 'north', 'power',
                                                   'magnitude', 'sfi',
                                                   'ohmS']),
                                                 ('catpipe',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['type', 'shape', 'geol'])])),
                ('SVC', SVC(C=100, gamma=0.001, random_state=42))])

Or build your preprocessor object using the example below:

>>> from watex.exlib.sklearn import (
        Pipeline,
        ColumnTransformer,
        SimpleImputer,
        StandardScaler,
        OneHotEncoder,
        LogisticRegression
        )
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing (tname = 'flow', drop_features =['lwi', 'name', 'num'])
>>> numeric_features = ['east', 'north', 'power', 'magnitude', 'sfi', 'ohmS']
>>> numeric_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")),
                   ("scaler", StandardScaler())]
        )
>>> categorical_features = ['shape', 'geol', 'type']
>>> categorical_transformer = OneHotEncoder(handle_unknown="ignore")
>>> preprocessor = ColumnTransformer(
        transformers=[
                ("num", numeric_transformer, numeric_features),
                ("cat", categorical_transformer, categorical_features),
        ])
>>> pc.makeModel(pipe=preprocessor, estimator=LogisticRegression())
>>> pc.model_
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['east', 'north', 'power',
                                                   'magnitude', 'sfi',
                                                   'ohmS']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['shape', 'geol', 'type'])])),
                ('LogisticRegression', LogisticRegression())])
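
To see what such a model does once assembled, here is a minimal scikit-learn-only sketch of the same layout: a ColumnTransformer for preprocessing followed by an estimator, wrapped in one Pipeline. It does not call watex. The data frame is synthetic (random values, not the Bagoue dataset), and the reduced column set is an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; values are random, not taken from the Bagoue dataset.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "power": rng.uniform(10, 100, n),
    "magnitude": rng.uniform(1, 50, n),
    "shape": rng.choice(["V", "W", "U"], n),
    "geol": rng.choice(["granite", "schist"], n),
})
X.loc[X.index[::10], "power"] = np.nan  # a few missing values for the imputer
y = rng.integers(0, 2, n)               # made-up binary target

numeric_features = ["power", "magnitude"]
categorical_features = ["shape", "geol"]

# Numeric columns: impute then scale. Categorical columns: one-hot encode.
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                            ("scaler", StandardScaler())]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Preprocessing and estimator chained into a single model object.
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("LogisticRegression", LogisticRegression())])
model.fit(X, y)
y_pred = model.predict(X)
print(y_pred.shape)  # (60,)
```

Keeping the transformers and the estimator in one Pipeline means the imputer, scaler and encoder are fitted only on the data passed to fit, and the same fitted transformations are reused at prediction time.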

Once a model is created, a dummy baseline can be evaluated from the preprocessing pipeline alone, or a specific model can be evaluated by providing an estimator. Both are done with baseEvaluation(). A code snippet is given below:

>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> # (1) -> default estimator; the returned value is the score
>>> pc.baseEvaluation(eval_metric=True)
0.47126436781609193

>>> # (2) -> multiple estimators
>>> from watex.exlib.sklearn import RandomForestClassifier, SGDClassifier, SimpleImputer
>>> estimators = {'RandomForestClassifier': RandomForestClassifier(
...                   n_estimators=200, random_state=0),
...               'SGDClassifier': SGDClassifier(random_state=0)}
>>> # remove NaN values
>>> pc.X = SimpleImputer().fit_transform(pc.X)
>>> pc.Xt = SimpleImputer().fit_transform(pc.Xt)
>>> pc.baseEvaluation(estimator=estimators, eval_metric=True)
>>> pc.ypred_
{'RandomForestClassifier': array([2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 3, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3,
                1, 2, 1, 2, 0, 2, 2, 3, 2, 2, 1, 1, 3, 3, 0, 2, 3, 3, 2, 1, 0, 2,
                1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 0, 2,
                2, 3, 2, 2, 3, 0, 1, 2, 2, 3, 1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1],
           dtype=int8),
 'SGDClassifier': array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
           dtype=int8)}
>>> pc.base_score_
{'RandomForestClassifier': 0.7816091954022989,
 'SGDClassifier': 0.14942528735632185}

With the RandomForestClassifier estimator, the score rises to 78.16%, whereas it was 47.13% for the dummy prediction; the SGDClassifier reaches only 14.94%.
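
The reported scores can be read as fractions of the 87 test samples. The short check below assumes the score is plain classification accuracy (correct predictions divided by test-set size); the text does not state which metric baseEvaluation uses, so treat this as an interpretation rather than a description of the library.

```python
N_TEST = 87  # number of test samples reported earlier in this section

# Scores copied from the outputs shown above.
reported_scores = {
    "default estimator": 0.47126436781609193,
    "RandomForestClassifier": 0.7816091954022989,
    "SGDClassifier": 0.14942528735632185,
}

# Under the accuracy assumption, score = n_correct / N_TEST.
n_correct = {name: round(score * N_TEST)
             for name, score in reported_scores.items()}

for name, score in reported_scores.items():
    print(f"{name}: {n_correct[name]}/{N_TEST} correct ({score:.2%})")
# default estimator: 41/87 correct (47.13%)
# RandomForestClassifier: 68/87 correct (78.16%)
# SGDClassifier: 13/87 correct (14.94%)
```

Under the same assumption, the SGDClassifier result has a simple reading: it predicts class 3 for every test sample, so its score would just be the share of class 3 in the test set.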

3.3.2. Processing

Processing is dedicated to baseline model evaluation and learning. It also handles validation curves once a few estimator hyperparameters have been tuned.

>>> import numpy as np
>>> from watex.cases.processing import Processing
>>> from watex.datasets import fetch_data
>>> from watex.exlib.sklearn import (
...     StandardScaler, RandomForestClassifier, make_column_selector,
...     PolynomialFeatures, SelectKBest, f_classif)
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> my_own_pipeline = {
...     'num_column_selector_': make_column_selector(dtype_include=np.number),
...     'cat_column_selector_': make_column_selector(dtype_exclude=np.number),
...     'features_engineering_': PolynomialFeatures(3, include_bias=True),
...     'selectors_': SelectKBest(f_classif, k=4),
...     'encodages_': StandardScaler()}
>>> my_estimator = {
...     'RandomForestClassifier': RandomForestClassifier(
...         n_estimators=200, random_state=0)}
>>> processObj = Processing(tname='flow', drop_features=['lwi', 'name', 'num'],
...                         pipeline=my_own_pipeline, estimator=my_estimator)
>>> processObj.fit(data=data)
>>> # the returned score is an ensemble score for both models
>>> processObj.baseEvaluation(eval_metric=True)
0.4942528735632184
>>> processObj.get_validation_curve(
...     switch_plot='on',
...     val_params={'param_name': 'n_estimators',
...                 'param_range': np.arange(1, 20, 5),
...                 'scoring': 'neg_mean_squared_error'})
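
For readers who want to see what a validation curve over n_estimators computes, the sketch below uses scikit-learn's validation_curve directly on a synthetic four-class problem. It does not reproduce get_validation_curve itself: the data comes from make_classification rather than the Bagoue dataset, the scoring is switched to accuracy (a choice made here because it is easier to read for a classifier), and cv=4 is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic 4-class problem standing in for the flow classes.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

param_range = np.arange(1, 20, 5)  # the values 1, 6, 11, 16

# Refit the estimator for each n_estimators value on each CV fold.
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="n_estimators", param_range=param_range,
    scoring="accuracy", cv=4,
)

# One row per parameter value, one column per CV fold.
for n, tr, va in zip(param_range,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n_estimators={n:2d}  train={tr:.3f}  validation={va:.3f}")
```

A large gap between the training and validation columns at a given n_estimators points to overfitting; plotting the two mean curves against param_range gives the usual validation-curve figure.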