ML basics: features, training and testing, model evaluation...

Feature extraction

Feature normalization

Goal: put features on the same scale so that the analysis is not biased toward features with large values.

Example:

  • Gradient descent takes longer to converge if one feature has a much bigger scale than the others (the updates are not homogeneous across dimensions).
  • Decision trees do NOT benefit from normalization because they split using information gain (ratio), which is unaffected by feature scale.

Method:

  • Min-max scaling
  • Z-score normalization
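
A minimal sketch of both methods with NumPy (the toy array is made up for illustration):

```python
import numpy as np

# Toy data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: map each feature to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization: zero mean, unit variance per feature
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```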

Categorical feature

Goal: obtain numerical representations for categorical features, which most algorithms require.

Example: Blood type A/ B/ AB/ O

Method

  • Ordinal encoding: 1, 2, 3, 4
  • One-hot encoding: 1000, 0100, 0010, 0001
  • Binary encoding: 001, 010, 011, 100 (binary of index 1, 2, 3, 4)
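
A quick sketch of the three encodings with pandas (the column name and the ordinal mapping are arbitrary illustrative choices):

```python
import pandas as pd

blood = pd.Series(["A", "B", "AB", "O"], name="blood_type")

# Ordinal encoding: map each category to an integer index
ordinal = blood.map({"A": 1, "B": 2, "AB": 3, "O": 4})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(blood, prefix="blood")

# Binary encoding: write the ordinal index in binary (001, 010, 011, 100)
binary = ordinal.apply(lambda i: format(i, "03b"))
```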

Feature combination

Goal: combine multiple complementary features to yield a more powerful feature.

Example: ($m$ languages) × ($n$ music types). Feature combination yields $m\cdot n$ features, but the dimension can be too high for practical use; we can then perform dimension reduction with matrix decomposition. This is useful for recommendation applications.

Method to find effective combinations using decision trees

  1. Use original features to build a decision tree (say 4 leaves)
  2. Treat each paths from the root to leaves as a combination feature (4 paths)
  3. Rewrite every sample's features in terms of the combinations (e.g., sample 1 is re-encoded over (combination 1, combination 2, combination 3, combination 4)); see the sketch below

Note that combination features can share overlapping original features.
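
A rough sketch of these steps with scikit-learn on a made-up toy dataset; `tree.apply()` returns the leaf (i.e., root-to-leaf path) each sample falls into, which is then one-hot encoded into combination features:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: two original features, binary labels
X = np.array([[25, 0], [35, 1], [45, 0], [23, 1], [52, 1], [31, 0]])
y = np.array([0, 1, 1, 0, 1, 0])

# Step 1: build a shallow tree (max_leaf_nodes=4 -> up to 4 leaves)
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

# Step 2: each root-to-leaf path is one combination feature;
# tree.apply() gives the leaf (path) each sample falls into
leaf_ids = tree.apply(X)

# Step 3: rewrite every sample as a one-hot vector over the combinations
leaves = np.unique(leaf_ids)
X_comb = (leaf_ids[:, None] == leaves[None, :]).astype(int)
```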

Lack of samples

Goal: obtain more information from limited samples to avoid overfitting and increase model accuracy.

Method

  • Model: impose special structure or restrictions based on prior information or assumptions, e.g., dropout or L1 regularization to avoid overfitting.
  • Data:
    • augmentation with transformation and noise.
    • new samples from GAN
    • Transfer learning
    • class weights: give a bigger weight to the loss of the minority class so the model pays equal attention to all classes
  • upsampling
    • SMOTE (Synthetic Minority Oversampling Technique): synthetically generate new data points in the proximity of existing minority-class samples using K nearest neighbors.
    • data duplication
  • downsampling
    • Tomek links (T-Links): pair up nearest-neighbor data points from different classes and drop the majority-class sample of each pair, so the count of the dominating label is reduced (see the sketch below)
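
A sketch of both resampling approaches, assuming the third-party imbalanced-learn (`imblearn`) package is available; the dataset parameters are only for illustration:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

# Imbalanced toy dataset: roughly a 90/10 class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Upsampling: SMOTE synthesizes new minority samples near existing ones
X_up, y_up = SMOTE(random_state=0).fit_resample(X, y)

# Downsampling: Tomek links drop majority samples that sit right next to minority ones
X_down, y_down = TomekLinks().fit_resample(X, y)
```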

Evaluate Model Performance

What can go wrong when an ML model fails?

  • wrong evaluation metric
  • overfitting/ underfitting (model size, sample size)
  • bad split of training and testing data
  • mismatched distributions between the training and testing datasets

Limitation of accuracy

When the numbers of samples are extremely different among classes, the overall accuracy can be dominated by the class with the most samples. To avoid this, we can calculate accuracy within each class instead of over the whole dataset.

Trade off between precision and recall

  • Precision: TP / (TP+FP) -> among all the samples the classifier predicted as positive, how many are truly positive
  • Recall: TP / (TP+FN) -> among all actual positives, how many the classifier got right

To increase precision, a classifier needs to give positive predictions only when its confidence is high. To increase recall, however, it needs to be less conservative and give positive predictions more liberally.

This trade-off is important for a search/ranking model, where we want the model to return the top N results that fit our query. To evaluate model performance, we can

  • plot P-R curve with different N values and pick the best model based on it.
  • compute the F1 score: $F1 = 2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$ (see the sketch below)
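
A small example of these metrics with scikit-learn, on made-up labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy predictions

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r)
print(p, r, f1)
```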

Receiver operating characteristic (ROC) curve

  • False positive rate (FPR): FP/(FP+TN) -> the fraction of actual negatives that the classifier wrongly predicts as positive
  • True positive rate (TPR) = recall: TP/(TP+FN) -> true positives among all positives
  • data: (FPR, TPR) points at different classifier thresholds (see the sketch below)
  • area under curve (AUC)
    • values typically range between 0.5 (random guessing) and 1
    • higher values mean better classification
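
A short sketch with scikit-learn, using made-up labels and confidence scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                # toy labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # classifier confidence scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, scores)               # area under the ROC curve
```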

ROC v.s. P-R curve

  • ROC curve is less sensitive to the sample sizes of the positive and negative classes
    • good for comparing models across different datasets -> broader applicability
  • P-R curve
    • good for analyzing performance on a specific dataset (because the curve directly reflects that dataset's class balance)

Limitation of RMSE

RMSE is very sensitive to outliers. Even if a model predicts most samples well, its RMSE can still be bad because of a few outliers. For example, when predicting IG clicks, there are always external factors (like media exposure or an unexpected trend) that create outliers.

Methods

  • treat these outliers as noise and remove them from the dataset
  • build a model that captures the mechanism behind these outliers
  • use MAPE: the per-sample normalization in MAPE limits the effect of outliers (see the formulas below)
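
For reference, with $y_i$ the true values, $\hat{y}_i$ the predictions, and $n$ samples:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

Dividing each error by $y_i$ is the normalization that keeps a few huge absolute errors from dominating the metric.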

Cosine distance

Cosine distance only cares about the angle between two vectors, regardless of their magnitude.

  • good when the data lives in a high-dimensional space -> e.g., two sentences with different lengths but similar meanings
  • cosine similarity ranges from -1 to 1, so the notion of similarity has a clear scale

How to choose between cosine and Euclidean distance?

  • Watching behavior: user A watches type 1 more (0,1) vs. user B watches type 0 more (1,0). The two vectors have a large cosine distance but a small Euclidean distance, so cosine distance better captures the difference in preference.

  • User stickiness: user A watches 1 time per week for 10 minutes each time (1,10) vs. user B watches 10 times per week for 100 minutes each time (10,100). The two vectors have a large Euclidean distance but a small cosine distance, so Euclidean distance better captures the difference in activity.
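
A small NumPy sketch of the two examples above (the helper functions are defined just for illustration):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cos(angle between a and b)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# Watching behavior: large cosine distance, small Euclidean distance
A, B = np.array([0, 1]), np.array([1, 0])
print(cosine_distance(A, B), euclidean_distance(A, B))   # 1.0, ~1.41

# User stickiness: small cosine distance, large Euclidean distance
A, B = np.array([1, 10]), np.array([10, 100])
print(cosine_distance(A, B), euclidean_distance(A, B))   # 0.0, ~90.45
```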

Is cosine distance a true distance?

Definition of a distance metric:

  1. positive definiteness -> dist(A,B)>=0
  2. symmetry -> dist(A,B)=dist(B,A)
  3. triangle inequality -> dist(A,C)<=dist(A,B)+dist(B,C)

Answer: NO (yes, yes, no). Cosine distance satisfies positive definiteness and symmetry but fails the triangle inequality, as the counterexample below shows.
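
A quick counterexample for the triangle inequality (the vectors are chosen only for illustration; B sits halfway between the orthogonal A and C):

```python
import numpy as np

def cos_dist(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A, B, C = np.array([1, 0]), np.array([1, 1]), np.array([0, 1])
print(cos_dist(A, C))                    # 1.0
print(cos_dist(A, B) + cos_dist(B, C))   # ~0.59 < 1.0 -> triangle inequality violated
```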

KL divergence (a measure of discrepancy between distributions) is also not a true distance (yes, no, no).

A/B testing

Why do A/B testing when we have already done offline testing?

  1. offline testing cannot completely detect overfitting, so online evaluation is still needed
  2. the offline environment does not capture all factors of the online environment, such as delay or missing labels
  3. some metrics can only be calculated in the online environment, such as user clicks or page views (these online metrics are broader evaluations, in contrast to offline metrics that focus on the model itself)

How to perform it?

Split users into an experimental group (served by the new model A) and a control group (served by the old model B); a sketch of hash-based bucketing follows.
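
A minimal sketch of deterministic hash-based bucketing; the salt string and the 50/50 split are illustrative assumptions, not a prescribed setup:

```python
import hashlib

def assign_group(user_id: str, salt: str = "exp_2024") -> str:
    """Deterministically bucket a user into the experimental or control group."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "experimental (new model A)" if bucket < 50 else "control (old model B)"

print(assign_group("user_42"))
```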

training / testing data

  • Holdout testing
    • simply split the dataset into training and testing sets according to an arbitrary percentage
  • Cross-validation
    • k-fold: split the data into k groups; each group serves as the test set once -> average the k testing results
  • Bootstrap (when dataset is small)
    • randomly sample n data points with replacement as the training set (points can appear more than once)
    • the points that were never sampled serve as the testing set
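
A sketch of the three splitting strategies with scikit-learn and NumPy on a toy dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(20).reshape(10, 2)   # toy features
y = np.arange(10) % 2              # toy labels

# Holdout: split by an arbitrary percentage (here 70/30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# k-fold cross-validation: fit on each training fold, average the k test results
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# Bootstrap: sample n points with replacement; never-sampled points form the test set
n = len(X)
boot_idx = np.random.default_rng(0).integers(0, n, size=n)
test_idx = np.setdiff1d(np.arange(n), boot_idx)
```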

Hyperparameter optimization

  • Grid search
    • coarse to fine grid
  • random search
  • bayesian optimization
    • Since the objective function f is unknown, the Bayesian strategy treats it as a random function and places a prior over it. The prior captures beliefs about the behavior of the function. After gathering function evaluations, treated as data D, the prior is updated to form the posterior distribution over the objective function, $P(f\mid D) \propto P(D\mid f)\,P(f)$ (posterior $\propto$ likelihood $\times$ prior). The posterior is then used to construct an acquisition function (also called infill sampling criteria) that determines the next query point.
    • gaussian process is a common choice of prior
  • transfer learning (e.g., Google Vizier)
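
A sketch of grid search and random search with scikit-learn (Bayesian optimization usually relies on a separate library such as scikit-optimize and is omitted here); the model and parameter ranges are arbitrary examples:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustive over a coarse grid (then refine around the best point)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5).fit(X, y)

# Random search: sample hyperparameters from distributions instead of a fixed grid
rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2)}, n_iter=10,
                          cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```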

Overfitting and underfitting

Overfitting: good performance on the training set but not on the testing set. Underfitting: bad performance on both the training and testing sets.

Prevent overfitting

  • more training data (reduce the effect of noise)
  • reduce model complexity
  • regularization (penalty on weights)
  • ensemble learning (e.g. bagging)

Prevent underfitting

  • add new features (like combination features or context features)
    • Factorization Machines
    • Deep Crossing
  • increase model complexity
  • reduce penalty on regularization