ML basics: features, training and testing, model evaluation...

Feature extraction

Feature normalization

Goal: put features on the same scale so that the analysis is not biased toward features with large values.

Example:

  • Gradient descent takes longer to converge if one feature has a much bigger scale than the others (the updates are not homogeneous across dimensions).
  • Decision trees do NOT benefit from normalization because they split using information gain (ratio), which is unaffected by feature scale.

Method:

  • Min-max scaling
  • Z-score normalization
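
A minimal sketch of both methods with NumPy (the toy array is made up for illustration):

```python
import numpy as np

# Toy data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: map each feature to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization: zero mean, unit variance per feature
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```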

Categorical feature

Goal: obtain numerical representations for categorical features, which most algorithms require.

Example: Blood type A/ B/ AB/ O

Method

  • Ordinal encoding: 1, 2, 3, 4
  • One-hot encoding: 1000, 0100, 0010, 0001
  • Binary encoding: 001, 010, 011, 100 (binary of index 1, 2, 3, 4)
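
A quick sketch of the three encodings with pandas (the column name and the ordinal mapping are arbitrary illustrative choices):

```python
import pandas as pd

blood = pd.Series(["A", "B", "AB", "O"], name="blood_type")

# Ordinal encoding: map each category to an integer index
ordinal = blood.map({"A": 1, "B": 2, "AB": 3, "O": 4})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(blood, prefix="blood")

# Binary encoding: write the ordinal index in binary (001, 010, 011, 100)
binary = ordinal.apply(lambda i: format(i, "03b"))
```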

Feature combination

Goal: combine multiple complementary features to yield a more powerful feature.

Example: ($m$ languages) × ($n$ music types). Feature combination yields $m\cdot n$ features, but the dimension can be too high for practical use; we can then perform dimension reduction with matrix decomposition. This is useful for recommendation applications.

Method to find effective combinations using decision trees

  1. Use original features to build a decision tree (say 4 leaves)
  2. Treat each paths from the root to leaves as a combination feature (4 paths)
  3. Rewrite every sample's features in terms of the combinations (e.g., sample 1 is re-encoded over (combination 1, combination 2, combination 3, combination 4)); see the sketch below

Note that combination features can share overlapping original features.
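
A rough sketch of these steps with scikit-learn on a made-up toy dataset; `tree.apply()` returns the leaf (i.e., root-to-leaf path) each sample falls into, which is then one-hot encoded into combination features:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: two original features, binary labels
X = np.array([[25, 0], [35, 1], [45, 0], [23, 1], [52, 1], [31, 0]])
y = np.array([0, 1, 1, 0, 1, 0])

# Step 1: build a shallow tree (max_leaf_nodes=4 -> up to 4 leaves)
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

# Step 2: each root-to-leaf path is one combination feature;
# tree.apply() gives the leaf (path) each sample falls into
leaf_ids = tree.apply(X)

# Step 3: rewrite every sample as a one-hot vector over the combinations
leaves = np.unique(leaf_ids)
X_comb = (leaf_ids[:, None] == leaves[None, :]).astype(int)
```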

Lack of samples

Goal: obtain more information from limited samples to avoid overfitting and increase model accuracy.

Method

  • Model: impose special structure or restrictions based on prior information or assumptions, e.g., dropout or L1 regularization to avoid overfitting.
  • Data:
    • augmentation with transformation and noise.
    • new samples from GAN
    • Transfer learning
    • class weights: give a bigger weight to the loss of the minority class so the model pays equal attention to all classes
  • upsampling
    • SMOTE (Synthetic Minority Oversampling Technique): synthetically generate new data points in the proximity of existing minority-class samples using K nearest neighbors.
    • data duplication
  • downsampling
    • Tomek links (T-Links): pair up nearest-neighbor data points from different classes and drop the majority-class sample of each pair, so the count of the dominating label is reduced (see the sketch below)
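
A sketch of both resampling approaches, assuming the third-party imbalanced-learn (`imblearn`) package is available; the dataset parameters are only for illustration:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

# Imbalanced toy dataset: roughly a 90/10 class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Upsampling: SMOTE synthesizes new minority samples near existing ones
X_up, y_up = SMOTE(random_state=0).fit_resample(X, y)

# Downsampling: Tomek links drop majority samples that sit right next to minority ones
X_down, y_down = TomekLinks().fit_resample(X, y)
```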

Evaluate Model Performance

What can go wrong when an ML model fails?

  • wrong evaluation metric
  • overfitting/ underfitting (model size, sample size)
  • bad split of training and testing data
  • mismatched distributions between the training and testing datasets

Limitation of accuracy

When the numbers of samples are extremely different among classes, the overall accuracy can be dominated by the class with the most samples. To avoid this, we can calculate accuracy within each class instead of over the whole dataset.

Trade off between precision and recall

  • Precision: TP / (TP+FP) -> among all the samples the classifier predicted as positive, how many are truly positive
  • Recall: TP / (TP+FN) -> among all actual positives, how many the classifier got right

To increase precision, a classifier needs to give positive predictions only when its confidence is high. To increase recall, however, it needs to be less conservative and give positive predictions more liberally.

This trade-off is important for a search/ranking model, where we want the model to return the top N results that fit our query. To evaluate model performance, we can

  • plot P-R curve with different N values and pick the best model based on it.
  • compute the F1 score: $F1 = 2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$ (see the sketch below)
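
A small example of these metrics with scikit-learn, on made-up labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy predictions

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r)
print(p, r, f1)
```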

Receiver operating characteristic (ROC) curve

  • False positive rate (FPR): FP/(FP+TN) -> the fraction of actual negatives that the classifier wrongly predicts as positive
  • True positive rate (TPR) = recall: TP/(TP+FN) -> true positives among all positives
  • data: (FPR, TPR) points at different classifier thresholds (see the sketch below)
  • area under curve (AUC)
    • values typically range between 0.5 (random guessing) and 1
    • higher values mean better classification
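
A short sketch with scikit-learn, using made-up labels and confidence scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                # toy labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # classifier confidence scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, scores)               # area under the ROC curve
```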

ROC v.s. P-R curve

  • ROC curve is less sensitive to the sample sizes of the positive and negative classes
    • good for comparing models across different datasets -> broader applicability
  • P-R curve
    • good for analyzing performance on a specific dataset (because the curve directly reflects that dataset's class balance)

Limitation of RMSE

RMSE is very sensitive to outliers. Even if a model predicts most samples well, its RMSE can still be bad because of a few outliers. For example, when predicting IG clicks, there are always external factors (like media exposure or an unexpected trend) that create outliers.

Methods

  • treat these outliers as noise and remove them from the dataset
  • build a model that captures the mechanism behind these outliers
  • use MAPE: the per-sample normalization in MAPE limits the effect of outliers (see the formulas below)
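
For reference, with $y_i$ the true values, $\hat{y}_i$ the predictions, and $n$ samples:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

Dividing each error by $y_i$ is the normalization that keeps a few huge absolute errors from dominating the metric.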

Cosine distance

Cosine distance only cares about the angle between two vectors, regardless of their magnitude.

  • good when the data lives in a high-dimensional space -> e.g., two sentences with different lengths but similar meanings
  • cosine similarity ranges from -1 to 1, so the notion of similarity has a clear scale

How to choose between cosine and Euclidean distance?

  • Watching behavior: user A watches type 1 more (0,1) vs. user B watches type 0 more (1,0). The two vectors have a large cosine distance but a small Euclidean distance, so cosine distance better captures the difference in preference.

  • User stickiness: user A watches 1 time per week for 10 minutes each time (1,10) vs. user B watches 10 times per week for 100 minutes each time (10,100). The two vectors have a large Euclidean distance but a small cosine distance, so Euclidean distance better captures the difference in activity.
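
A small NumPy sketch of the two examples above (the helper functions are defined just for illustration):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cos(angle between a and b)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# Watching behavior: large cosine distance, small Euclidean distance
A, B = np.array([0, 1]), np.array([1, 0])
print(cosine_distance(A, B), euclidean_distance(A, B))   # 1.0, ~1.41

# User stickiness: small cosine distance, large Euclidean distance
A, B = np.array([1, 10]), np.array([10, 100])
print(cosine_distance(A, B), euclidean_distance(A, B))   # 0.0, ~90.45
```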

Is cosine distance a true distance?

Definition of a distance metric:

  1. positive definiteness -> dist(A,B)>=0
  2. symmetry -> dist(A,B)=dist(B,A)
  3. triangle inequality -> dist(A,C)<=dist(A,B)+dist(B,C)

Answer: NO (yes, yes, no). Cosine distance satisfies positive definiteness and symmetry but fails the triangle inequality, as the counterexample below shows.
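
A quick counterexample for the triangle inequality (the vectors are chosen only for illustration; B sits halfway between the orthogonal A and C):

```python
import numpy as np

def cos_dist(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A, B, C = np.array([1, 0]), np.array([1, 1]), np.array([0, 1])
print(cos_dist(A, C))                    # 1.0
print(cos_dist(A, B) + cos_dist(B, C))   # ~0.59 < 1.0 -> triangle inequality violated
```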

KL divergence (a measure of discrepancy between distributions) is also not a true distance (yes, no, no).

A/B testing

Why do A/B testing when we have already done offline testing?

  1. offline testing cannot completely detect overfitting, so online evaluation is still needed
  2. the offline environment does not capture all factors of the online environment, such as delay or missing labels
  3. some metrics can only be calculated in the online environment, such as user clicks or page views (these online metrics are broader evaluations, in contrast to offline metrics that focus on the model itself)

How to perform it?

Split users into an experimental group (served by the new model A) and a control group (served by the old model B); a sketch of hash-based bucketing follows.
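
A minimal sketch of deterministic hash-based bucketing; the salt string and the 50/50 split are illustrative assumptions, not a prescribed setup:

```python
import hashlib

def assign_group(user_id: str, salt: str = "exp_2024") -> str:
    """Deterministically bucket a user into the experimental or control group."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "experimental (new model A)" if bucket < 50 else "control (old model B)"

print(assign_group("user_42"))
```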

training / testing data

  • Holdout testing
    • simply split the dataset into training and testing sets according to an arbitrary percentage
  • Cross-validation
    • k-fold: split the data into k groups; each group serves as the test set once -> average the k testing results
  • Bootstrap (when dataset is small)
    • randomly sample n data points with replacement as the training set (points can appear more than once)
    • the points that were never sampled serve as the testing set
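
A sketch of the three splitting strategies with scikit-learn and NumPy on a toy dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(20).reshape(10, 2)   # toy features
y = np.arange(10) % 2              # toy labels

# Holdout: split by an arbitrary percentage (here 70/30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# k-fold cross-validation: fit on each training fold, average the k test results
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# Bootstrap: sample n points with replacement; never-sampled points form the test set
n = len(X)
boot_idx = np.random.default_rng(0).integers(0, n, size=n)
test_idx = np.setdiff1d(np.arange(n), boot_idx)
```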

Hyperparameter optimization

  • Grid search
    • coarse to fine grid
  • random search
  • bayesian optimization
    • Since the objective function f is unknown, the Bayesian strategy treats it as a random function and places a prior over it. The prior captures beliefs about the behavior of the function. After gathering function evaluations, treated as data D, the prior is updated to form the posterior distribution over the objective function, $P(f\mid D) \propto P(D\mid f)\,P(f)$ (posterior $\propto$ likelihood $\times$ prior). The posterior is then used to construct an acquisition function (also called infill sampling criteria) that determines the next query point.
    • gaussian process is a common choice of prior
  • transfer learning (e.g., Google Vizier)
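
A sketch of grid search and random search with scikit-learn (Bayesian optimization usually relies on a separate library such as scikit-optimize and is omitted here); the model and parameter ranges are arbitrary examples:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustive over a coarse grid (then refine around the best point)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5).fit(X, y)

# Random search: sample hyperparameters from distributions instead of a fixed grid
rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2)}, n_iter=10,
                          cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```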

Overfitting and underfitting

Overfitting: good performance on the training set but not on the testing set. Underfitting: bad performance on both the training and testing sets.

Prevent overfitting

  • more training data (reduce the effect of noise)
  • reduce model complexity
  • regularization (penalty on weights)
  • ensemble learning (e.g. bagging)

Prevent underfitting

  • add new features (like combination features or context features)
    • Factorization Machines
    • Deep Crossing
  • increase model complexity
  • reduce penalty on regularization