This is a great concise refresher. It covers many of the same concepts as the machine learning course I took earlier this year - decision trees, linear regression, SVM, neural nets, kNN, expectation maximization, ... - and it's sort of like having a set of high-quality, focused notes.
It also explains some details I've ignored in the past and covers many topics I wasn't familiar with:
- One reason for using logarithms when they're not mathematically necessary is to avoid numerical overflow (see the log-sum-exp sketch after this list)
- Negative cosine similarity as a distance function for e.g. k-nearest neighbors (kNN sketch below)
- Why normalize features: so that the derivatives with respect to each feature will be in similar ranges
- Rules of thumb on normalization vs. standardization of features: standardize when there are outliers, when the data is close to normally distributed, or when doing unsupervised learning (scaler example below)
- Multiple strategies for dealing with missing data
- Regularization
- Convolutional neural nets
- Recurrent neural nets, GRUs in particular
- Kernel regression, a non-parametric way to model nonlinear data
- Bagging; in particular random forest, which tries to produce a bunch of uncorrelated decision trees
- Comparison of boosting and bagging: "boosting reduces the bias ... instead of the variance"
- Sequence-to-Sequence learning: encoder, embedding, decoder
- Active learning: "Once we know the importance score of each unlabeled example, we pick the one with the highest importance score and ask the expert to annotate it."
- Denoising autoencoders: try to produce the same output as your input, after passing through an embedding layer, and despite corruption of the input
- Semi-supervised learning, such as the ladder network (a type of denoising autoencoder), which can perform remarkably well on MNIST with only a tiny fraction of the examples labeled
- One-shot learning with siamese neural networks and triplet loss (triplet-loss sketch below)
- Zero-shot learning, where "we want the model to be able to predict labels that we didn’t have in the training data"
- Stacking: "building a meta-model that takes the output of base models as input" (stacking example below)
- Data augmentation: making additional training data by e.g. (for images) "zooming it slightly, rotating, flipping, darkening" (augmentation snippet below)
- Transfer learning
- Density estimation
- HDBSCAN, a clustering approach the author recommends trying before k-means
- Calculating "prediction strength" to choose number of clusters
- t-SNE and UMAP do dimensionality reduction "specifically for visualization purposes"
- The embedding in the bottleneck layer of an autoencoder can be used for dimensionality reduction
- Learning a distance metric from data
- Ranking problems: pointwise (the most obvious approach to me) and pairwise perform worse than listwise approaches such as LambdaMART
- Factorization machines, an approach to recommendation systems
- One reason you might use genetic algorithms is that the objective function doesn't have to be differentiable
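A few of the bullets above are concrete enough to sketch in code. First, the logarithm point: exponentiating large scores (as in a softmax or a likelihood) overflows in floating point, while the log-space version stays finite. A minimal numpy sketch with made-up numbers:

```python
import numpy as np

scores = np.array([1000.0, 1001.0, 1002.0])

# Naive normalizer: exp(1000) overflows to inf in float64.
naive = np.exp(scores).sum()

# Log-space version (the log-sum-exp trick): subtract the max first,
# so every exponent is <= 0 and nothing overflows.
m = scores.max()
log_norm = m + np.log(np.exp(scores - m).sum())

print(naive)     # inf (numpy also emits an overflow warning)
print(log_norm)  # ~1002.41, computed entirely in log space without overflow
```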
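Next, cosine distance for k-nearest neighbors. Negative cosine similarity and the more common "1 - cosine similarity" differ only by a constant, so they rank neighbors identically; here's a brute-force sketch (the toy data and function names are mine):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; same neighbor ordering as negative cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(X_train, y_train, x, k=3):
    # Brute-force kNN: sort all training points by cosine distance,
    # then take a majority vote among the k closest.
    dists = np.array([cosine_distance(x, row) for row in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.8, 0.2])))  # -> 0
```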
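The normalization vs. standardization distinction is easy to make concrete with scikit-learn's two scalers; the single-feature toy column (with its outlier) is mine:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # one feature with an outlier

# Normalization (min-max scaling): values are mapped into [0, 1];
# the outlier pushes the ordinary values close to 0.
print(MinMaxScaler().fit_transform(X).ravel())
# [0.         0.01010101 0.02020202 1.        ]

# Standardization (z-scores): zero mean and unit variance, with no fixed bounds.
print(StandardScaler().fit_transform(X).ravel())
# [-0.6008 -0.5773 -0.5537  1.7318]
```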
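For one-shot learning with siamese networks, the triplet loss is the piece that's easy to write down: it pushes the anchor-positive distance below the anchor-negative distance by a margin. A tiny numpy sketch (the margin, the squared-Euclidean distance, and the example embeddings are my choices):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Encourage d(anchor, positive) + margin <= d(anchor, negative),
    # using squared Euclidean distance between embeddings.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # embedding of the anchor example
p = np.array([0.1, 0.0])   # embedding of another example of the same class
n = np.array([2.0, 0.0])   # embedding of an example of a different class
print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```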
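Stacking has a direct scikit-learn counterpart, StackingClassifier, which matches the quoted description: base models feed a meta-model. A minimal example on a built-in dataset (the particular base models and meta-model are just my picks):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Base models whose cross-validated predictions become the meta-model's features.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# The meta-model (final_estimator) learns how to combine the base models' outputs.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=5).mean())
```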
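Finally, the listed image augmentations (zooming, rotating, flipping, darkening) map onto standard torchvision transforms; the parameter values below are arbitrary choices of mine, not from the book:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(28, scale=(0.9, 1.0)),  # slight zoom
    transforms.RandomRotation(degrees=10),               # small rotation
    transforms.RandomHorizontalFlip(),                   # flip
    transforms.ColorJitter(brightness=(0.7, 1.0)),       # darken by up to 30%
    transforms.ToTensor(),
])
# Applying `augment` to each training image (a PIL image) yields a slightly
# different tensor every epoch, effectively enlarging the training set.
```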
I'd like to follow up on the concept of Bayesian hyperparameter learning, which was mentioned but not discussed.
This is an intriguing comment that I'd like to understand better:
"...not many supervised learning algorithms can boast that they optimize a metric directly. Optimizing a metric is what we really want, but what we do in a typical supervised learning algorithm is we optimize the cost instead of the metric (because metrics are usually not differentiable). Usually, in supervised learning, as soon as we have found a model that optimizes the cost function, we try to tweak hyperparameters to improve the value of the metric. LambdaMART optimizes the metric directly."