This is a great concise refresher. It covers many of the same concepts as the machine learning course I took earlier this year - decision trees, linear regression, SVM, neural nets, kNN, expectation maximization, ... - and it's sort of like having a set of high-quality, focused notes.
It also explains some details I've ignored in the past and covers many topics I wasn't familiar with:
- One reason for using logarithms when they're not mathematically necessary is to avoid numerical overflow (see the log-sum-exp sketch after this list)
- Negative cosine similarity as a distance function for e.g. k-nearest neighbors (kNN sketch below)
- Why normalize features: so that the derivatives with respect to each feature will be in similar ranges
- Rules of thumb on normalization vs. standardization of features: standardize when there are outliers, when the data is close to normally distributed, or when doing unsupervised learning (scaler example below)
- Multiple strategies for dealing with missing data
- Regularization
- Convolutional neural nets
- Recurrent neural nets, GRUs in particular
- Kernel regression, a non-parametric way to model nonlinear data
- Bagging; in particular random forest, which tries to produce a bunch of uncorrelated decision trees
- Comparison of boosting and bagging: "boosting reduces the bias ... instead of the variance"
- Sequence-to-Sequence learning: encoder, embedding, decoder
- Active learning: "Once we know the importance score of each unlabeled example, we pick the one with the highest importance score and ask the expert to annotate it."
- Denoising autoencoders: try to produce the same output as your input, after passing through an embedding layer, and despite corruption of the input
- Semi-supervised learning, such as the ladder network (a type of denoising autoencoder), which can perform remarkably well on MNIST with only a tiny fraction of the examples labeled
- One-shot learning with siamese neural networks and triplet loss (triplet-loss sketch below)
- Zero-shot learning, where "we want the model to be able to predict labels that we didn’t have in the training data"
- Stacking: "building a meta-model that takes the output of base models as input" (stacking example below)
- Data augmentation: making additional training data by e.g. (for images) "zooming it slightly, rotating, flipping, darkening" (augmentation snippet below)
- Transfer learning
- Density estimation
- HDBSCAN, a clustering approach the author recommends trying before k-means
- Calculating "prediction strength" to choose number of clusters
- t-SNE and UMAP do dimensionality reduction "specifically for visualization purposes"
- The embedding in the bottleneck layer of an autoencoder can be used for dimensionality reduction
- Learning a distance metric from data
- Ranking problems: pointwise (the most obvious approach to me) and pairwise perform worse than listwise approaches such as LambdaMART
- Factorization machines, an approach to recommendation systems
- One reason you might use genetic algorithms is that the objective function doesn't have to be differentiable
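A few of the bullets above are concrete enough to sketch in code. First, the logarithm point: exponentiating large scores (as in a softmax or a likelihood) overflows in floating point, while the log-space version stays finite. A minimal numpy sketch with made-up numbers:

```python
import numpy as np

scores = np.array([1000.0, 1001.0, 1002.0])

# Naive normalizer: exp(1000) overflows to inf in float64.
naive = np.exp(scores).sum()

# Log-space version (the log-sum-exp trick): subtract the max first,
# so every exponent is <= 0 and nothing overflows.
m = scores.max()
log_norm = m + np.log(np.exp(scores - m).sum())

print(naive)     # inf (numpy also emits an overflow warning)
print(log_norm)  # ~1002.41, computed entirely in log space without overflow
```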
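Next, cosine distance for k-nearest neighbors. Negative cosine similarity and the more common "1 - cosine similarity" differ only by a constant, so they rank neighbors identically; here's a brute-force sketch (the toy data and function names are mine):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; same neighbor ordering as negative cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(X_train, y_train, x, k=3):
    # Brute-force kNN: sort all training points by cosine distance,
    # then take a majority vote among the k closest.
    dists = np.array([cosine_distance(x, row) for row in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.8, 0.2])))  # -> 0
```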
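The normalization vs. standardization distinction is easy to make concrete with scikit-learn's two scalers; the single-feature toy column (with its outlier) is mine:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # one feature with an outlier

# Normalization (min-max scaling): values are mapped into [0, 1];
# the outlier pushes the ordinary values close to 0.
print(MinMaxScaler().fit_transform(X).ravel())
# [0.         0.01010101 0.02020202 1.        ]

# Standardization (z-scores): zero mean and unit variance, with no fixed bounds.
print(StandardScaler().fit_transform(X).ravel())
# [-0.6008 -0.5773 -0.5537  1.7318]
```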
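For one-shot learning with siamese networks, the triplet loss is the piece that's easy to write down: it pushes the anchor-positive distance below the anchor-negative distance by a margin. A tiny numpy sketch (the margin, the squared-Euclidean distance, and the example embeddings are my choices):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Encourage d(anchor, positive) + margin <= d(anchor, negative),
    # using squared Euclidean distance between embeddings.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # embedding of the anchor example
p = np.array([0.1, 0.0])   # embedding of another example of the same class
n = np.array([2.0, 0.0])   # embedding of an example of a different class
print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```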
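Stacking has a direct scikit-learn counterpart, StackingClassifier, which matches the quoted description: base models feed a meta-model. A minimal example on a built-in dataset (the particular base models and meta-model are just my picks):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Base models whose cross-validated predictions become the meta-model's features.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# The meta-model (final_estimator) learns how to combine the base models' outputs.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=5).mean())
```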
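Finally, the listed image augmentations (zooming, rotating, flipping, darkening) map onto standard torchvision transforms; the parameter values below are arbitrary choices of mine, not from the book:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(28, scale=(0.9, 1.0)),  # slight zoom
    transforms.RandomRotation(degrees=10),               # small rotation
    transforms.RandomHorizontalFlip(),                   # flip
    transforms.ColorJitter(brightness=(0.7, 1.0)),       # darken by up to 30%
    transforms.ToTensor(),
])
# Applying `augment` to each training image (a PIL image) yields a slightly
# different tensor every epoch, effectively enlarging the training set.
```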
I'd like to follow up on the concept of Bayesian hyperparameter learning, which was mentioned but not discussed.
This is an intriguing comment that I'd like to understand better:
"...not many supervised learning algorithms can boast that they optimize a metric directly. Optimizing a metric is what we really want, but what we do in a typical supervised learning algorithm is we optimize the cost instead of the metric (because metrics are usually not differentiable). Usually, in supervised learning, as soon as we have found a model that optimizes the cost function, we try to tweak hyperparameters to improve the value of the metric. LambdaMART optimizes the metric directly."