3. Case Histories
cases implements ready-to-use functions and modules for solving real engineering
problems, such as flow rate prediction using base learners, SVC, and ensemble (boosting)
paradigms. The watex.cases.features, watex.cases.processing,
watex.cases.modeling and watex.cases.prepare modules provide the base steps for
processing and analysis, and give a quick picture of what the data looks like. This
helps determine the next processing steps for the problem at hand.
3.1. Features
features is a set of manipulations that can be performed on the
case history feature data.
3.1.1. GeoFeatures
GeoFeatures expects the geological, borehole, and DC-electrical
resistivity data. It sets all feature values for the
different investigation sites. The GeoFeatures class is composed of:
- erp, obtained from ERPCollection
- geol, obtained from Geology
- boreh, obtained from Borehole
>>> from watex.cases.features import GeoFeatures
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> featObj = GeoFeatures().fit(data)
>>> featObj.id_
array(['e0000001', 'e0000002', 'e0000003', 'e0000004', 'e0000005',
       'e0000006', 'e0000007'], dtype='<U8')
>>> featObj.site_names_[:7] # view the borehole site names
array(['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7'], dtype=object)
3.1.2. Feature Inspection
FeatureInspection summarizes flow features. It deals with
feature categorization. When numerical values are provided, a standard qualitative or
quantitative analysis is performed.
>>> from watex.cases.features import FeatureInspection
>>> data = 'data/geodata/main.bagciv.data.csv'
>>> fobj = FeatureInspection().fit(data)
>>> fobj.data_.columns
Index(['num', 'name', 'east', 'north', 'power', 'magnitude', 'shape', 'type',
'sfi', 'ohmS', 'lwi', 'geol', 'flow'],
dtype='object')
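To illustrate what categorizing a numerical feature such as flow can look like, here is a minimal sketch. The thresholds (0, 1 and 3 m3/h) and the labels FR0 to FR3 are assumptions chosen for illustration only; they are not taken from the FeatureInspection API, and categorize_flow is a hypothetical helper.

```python
from bisect import bisect_left

# Hypothetical class boundaries (m3/h) and labels, for illustration only.
THRESHOLDS = [0.0, 1.0, 3.0]
LABELS = ["FR0", "FR1", "FR2", "FR3"]


def categorize_flow(rate):
    """Map a numeric flow rate to an ordinal class label.

    Classes: rate == 0 -> FR0, 0 < rate <= 1 -> FR1,
    1 < rate <= 3 -> FR2, rate > 3 -> FR3.
    """
    if rate < 0:
        raise ValueError("flow rate must be non-negative")
    # bisect_left returns the index of the first threshold >= rate.
    return LABELS[bisect_left(THRESHOLDS, rate)]


flows = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0]
print([categorize_flow(f) for f in flows])
```

Encoding the classes as ordered labels keeps the target suitable for a classifier while preserving the low-to-high ordering of the flow rates.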
3.2. Prepare
The prepare base module automates data preparation in a single pass. It was created for fast
data preparation in real engineering case studies. This is a naive approach for quickly reproducing
published work, especially for flow rate prediction.
3.2.1. Base data preparation for case studies
The base steps have been used to solve flow rate prediction problems [1]. The procedure helps users quickly reach the same results as the published paper. For example, a variant of the Bagoue dataset [2] is prepared using the BaseSteps module.
See also
An example of Bagoue dataset preparation in the _p module.
3.3. Processing
processing provides the basic processing steps for achieving results. Here, we implement the
processing steps performed for flow rate prediction [1].
3.3.1. Preprocessing
Preprocessing provides the preliminary steps for flow rate prediction.
Note
If \(X\) and \(y\) are provided, they are treated as the feature set and the target, respectively. They should be split into a training set and a test set.
>>> from watex.cases.processing import Preprocessing
>>> from watex.datasets import fetch_data
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc = Preprocessing(drop_features=['lwi', 'num', 'name']
...                    ).fit(data=data)
>>> len(pc.X), len(pc.y), len(pc.Xt), len(pc.yt)
(344, 344, 87, 87) # trainset (X,y) and testset (Xt, yt)
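The 344/87 sizes above are consistent with a shuffled hold-out split that reserves 20% of 431 samples for testing, with the test share rounded up. The sketch below reproduces that arithmetic with the standard library only. The 20% ratio, the seed and the holdout_split helper are assumptions for illustration, not the documented behavior of Preprocessing.

```python
import math
import random


def holdout_split(n_samples, test_size=0.2, seed=42):
    """Return shuffled (train_indices, test_indices) for a hold-out split."""
    n_test = math.ceil(n_samples * test_size)  # round the test share up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(431)
print(len(train_idx), len(test_idx))  # 344 87
```

Every sample index lands in exactly one of the two sets, so the test set is never seen during training.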
Pipes and an estimator can be assembled into a model, as in the snippets below. The model is composed of transformers followed by an estimator. If either the pipe or the estimator is set to None, the default is used, which may not be the one expected. Providing both a pipe and an estimator is therefore recommended.
The default preprocessor and model are obtained by simply calling:
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> pc.makeModel() # use default model and preprocessor
>>> pc.model_
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numpipe',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('polynomialfeatures',
                                                                   PolynomialFeatures(degree=10,
                                                                                      include_bias=False)),
                                                                  ('selectors',
                                                                   SelectKBest(k=4)),
                                                                  ('scalers',
                                                                   RobustScaler())]),
                                                  ['east', 'north', 'power',
                                                   'magnitude', 'sfi',
                                                   'ohmS']),
                                                 ('catpipe',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['type', 'shape', 'geol'])])),
                ('SVC', SVC(C=100, gamma=0.001, random_state=42))])
Alternatively, build your own preprocessor object as in the example below:
>>> from watex.exlib.sklearn import (
...     Pipeline,
...     ColumnTransformer,
...     SimpleImputer,
...     StandardScaler,
...     OneHotEncoder,
...     LogisticRegression
... )
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> numeric_features = ['east', 'north', 'power', 'magnitude', 'sfi', 'ohmS']
>>> numeric_transformer = Pipeline(
...     steps=[("imputer", SimpleImputer(strategy="median")),
...            ("scaler", StandardScaler())]
... )
>>> categorical_features = ['shape', 'geol', 'type']
>>> categorical_transformer = OneHotEncoder(handle_unknown="ignore")
>>> preprocessor = ColumnTransformer(
...     transformers=[
...         ("num", numeric_transformer, numeric_features),
...         ("cat", categorical_transformer, categorical_features),
...     ])
>>> pc.makeModel(pipe=preprocessor,
...              estimator=LogisticRegression())
>>> # or pc.model_
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['east', 'north', 'power',
                                                   'magnitude', 'sfi',
                                                   'ohmS']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['shape', 'geol', 'type'])])),
                ('LogisticRegression', LogisticRegression())])
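To make the role of the two branches concrete, the following standard-library sketch mimics what a column-wise preprocessor of this kind does: median imputation and standardization for numeric columns, one-hot encoding for categorical ones. The column names come from the example above, but the three rows of values and the helper functions are hypothetical. This is an illustration of the idea, not the scikit-learn implementation.

```python
from statistics import mean, median, pstdev

# Hypothetical rows; None marks a missing numeric value.
rows = [
    {"power": 40.0, "sfi": 1.2, "shape": "V"},
    {"power": None, "sfi": 0.8, "shape": "W"},
    {"power": 60.0, "sfi": 1.0, "shape": "V"},
]
numeric, categorical = ["power", "sfi"], ["shape"]


def impute_and_scale(values):
    """Fill missing values with the median, then standardize the column."""
    fill = median(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)  # population std
    return [(v - mu) / sigma if sigma else 0.0 for v in filled]


def one_hot(values):
    """Encode each value as a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


num_cols = [impute_and_scale([r[c] for r in rows]) for c in numeric]
cat_blocks = [one_hot([r[c] for r in rows]) for c in categorical]

# Concatenate the numeric and categorical outputs row by row.
matrix = []
for i in range(len(rows)):
    row = [col[i] for col in num_cols]
    for block in cat_blocks:
        row.extend(block[i])
    matrix.append(row)

for row in matrix:
    print([round(v, 3) for v in row])
```

Each input row becomes one numeric vector (two scaled columns plus two one-hot columns here), which is the form an estimator such as LogisticRegression expects.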
Once a model is created, a dummy baseline model can be evaluated by chaining the
preprocessing pipeline with an estimator. This is done with
baseEvaluation(). A code snippet is
given below:
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Preprocessing
>>> pc = Preprocessing(tname='flow', drop_features=['lwi', 'name', 'num'])
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> pc.fit(data=data)
>>> # (1) -> default estimator
>>> pc.baseEvaluation(eval_metric=True)
0.47126436781609193 # score
>>> # (2) -> multiple estimators
>>> from watex.exlib.sklearn import RandomForestClassifier, SGDClassifier, SimpleImputer
>>> estimators = {'RandomForestClassifier': RandomForestClassifier(
...                   n_estimators=200, random_state=0),
...               'SDGC': SGDClassifier(random_state=0)}
>>> pc.X = SimpleImputer().fit_transform(pc.X)
>>> pc.Xt = SimpleImputer().fit_transform(pc.Xt) # remove NaN values
>>> pc.baseEvaluation(estimator=estimators, eval_metric=True)
>>> pc.ypred_
{'RandomForestClassifier': array([2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 3, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3,
1, 2, 1, 2, 0, 2, 2, 3, 2, 2, 1, 1, 3, 3, 0, 2, 3, 3, 2, 1, 0, 2,
1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 0, 2,
2, 3, 2, 2, 3, 0, 1, 2, 2, 3, 1, 1, 0, 1, 1, 2, 0, 0, 2, 0, 1],
dtype=int8),
'SGDClassifier': array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
dtype=int8)}
>>> pc.base_score_
{'RandomForestClassifier': 0.7816091954022989,
'SGDClassifier': 0.14942528735632185}
With the RandomForestClassifier estimator, the score rises to about 78.16%, compared with 47.13% for the dummy prediction, while the SGDClassifier reaches only 14.94%.
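The reported scores are all multiples of 1/87, which suggests they are plain accuracies over the 87 test samples. The sketch below makes that reading explicit. Treating the score as accuracy is an assumption and is not confirmed here as the metric baseEvaluation() uses; the accuracy helper is written for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Scores reported above, converted to counts of correct predictions
# under the assumption that each score is an accuracy over 87 samples.
n_test = 87
scores = {
    "default": 0.47126436781609193,
    "RandomForestClassifier": 0.7816091954022989,
    "SGDClassifier": 0.14942528735632185,
}
correct = {name: round(score * n_test) for name, score in scores.items()}
print(correct)

# Toy check of the helper on hypothetical labels: 3 of 4 match.
print(accuracy([0, 1, 2, 2], [0, 2, 2, 2]))  # 0.75
```

Under this reading, the random forest classifies 68 of 87 test sites correctly, against 41 for the default estimator and 13 for the SGD classifier.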
3.3.2. Processing
Processing is dedicated to baseline model evaluation
and learning. It also produces validation curves, which show how the score changes as a few estimator hyperparameters are varied.
>>> import numpy as np
>>> from watex.datasets import fetch_data
>>> from watex.cases.processing import Processing
>>> from watex.exlib.sklearn import (
...     StandardScaler, RandomForestClassifier, make_column_selector,
...     PolynomialFeatures, SelectKBest, f_classif)
>>> data = fetch_data('bagoue original').get('data=dfy2')
>>> my_own_pipeline = {'num_column_selector_':
...                        make_column_selector(dtype_include=np.number),
...                    'cat_column_selector_':
...                        make_column_selector(dtype_exclude=np.number),
...                    'features_engineering_':
...                        PolynomialFeatures(3, include_bias=True),
...                    'selectors_': SelectKBest(f_classif, k=4),
...                    'encodages_': StandardScaler()
...                    }
>>> my_estimator = {
...     'RandomForestClassifier': RandomForestClassifier(
...         n_estimators=200, random_state=0)
... }
>>> processObj = Processing(tname='flow', drop_features=['lwi', 'name', 'num'],
...                         pipeline=my_own_pipeline, estimator=my_estimator)
>>> processObj.fit(data=data)
>>> processObj.baseEvaluation(eval_metric=True)
0.4942528735632184 # score is an ensemble score for both models
>>> processObj.get_validation_curve(
...     switch_plot='on',
...     val_params={'param_name': "n_estimators",
...                 "param_range": np.arange(1, 20, 5),
...                 "scoring": 'neg_mean_squared_error'})
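For readers unfamiliar with validation curves, the sketch below shows the underlying idea with the standard library only: sweep one hyperparameter over a grid and record the training and validation scores for each value. It swaps in a toy 1-D k-nearest-neighbour classifier on synthetic data in place of the random forest and the Bagoue data, so the model, the data and the helper names are all assumptions made for illustration. Only the grid matches the example above, since np.arange(1, 20, 5) expands to 1, 6, 11, 16.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, data, k):
    """Share of points in data classified correctly by k-NN fit on train."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)
# Synthetic data: label is 1 when x > 0.5, with 15% of labels flipped.
points = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.15:
        label = 1 - label
    points.append((x, label))
train, valid = points[:150], points[150:]

param_range = list(range(1, 20, 5))  # same grid as np.arange(1, 20, 5)
curve = {
    k: (accuracy(train, train, k), accuracy(train, valid, k))
    for k in param_range
}
for k, (train_score, valid_score) in curve.items():
    print(f"k={k:2d}  train={train_score:.3f}  valid={valid_score:.3f}")
```

Comparing the two columns is the point of the curve: a training score far above the validation score signals overfitting at that hyperparameter value.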