Here is a list of quick, memorizable descriptions of various unsupervised algorithms that can be
used for analysis, dimensionality reduction, and data cleaning. I put this list together while
prepping for data science and MLE interviews. Note that there are some sections with **TODO:**
notes that I left for myself; those notes are unfinished, and I hope to come back and update this
post with more details.

## PCA

### Theory

#### The Covariance Matrix

Let $X$ be an $n\times m$ dataset and let $\mu_X$ be the matrix with entries $(\mu_X)_{i,j} = \mu_{X_j} = \frac{1}{n}\sum_kX_{k,j}$ (i.e., each column of $\mu_X$ is the corresponding column mean of $X$, repeated $n$ times). The covariance matrix of $X$ is defined to be

$$\Sigma = \frac{1}{n}(X - \mu_X)^T(X - \mu_X).$$

This is ultimately due to the identity

$$
\begin{aligned}
\Sigma_{i,j} &= \frac{1}{n}\sum_{k}\left(X_{k,i} - \mu_{X_i}\right)\left(X_{k,j} - \mu_{X_j}\right) \\
&= \widehat{\mathbb{E}}\left[\left(X_{\cdot,i} - \mu_{X_i}\right)\left(X_{\cdot,j} - \mu_{X_j}\right)\right] = \widehat{\operatorname{Cov}}\left(X_{\cdot,i},\, X_{\cdot,j}\right),
\end{aligned}
$$

where the estimated expectation in the last line treats the columns of $X$ as samples of univariate random variables (so the product inside the brackets is elementwise, not a dot product, even though evaluating the estimated expectation amounts to taking one, up to a factor of $\frac{1}{n}$). Thus, the entries of the covariance matrix are the pairwise covariances of the column vectors of $X$.
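
As a quick sanity check of this definition, here is a minimal NumPy sketch (the random dataset and variable names are made up for illustration; `bias=True` makes `np.cov` use the same $\frac{1}{n}$ convention as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # n = 100 samples, m = 4 features

# Column means broadcast to the shape of X (the matrix mu_X above)
mu_X = X.mean(axis=0, keepdims=True)

# Covariance matrix as defined above: (1/n) (X - mu_X)^T (X - mu_X)
Sigma = (X - mu_X).T @ (X - mu_X) / X.shape[0]

# Entries are the pairwise covariances of the columns of X
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```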

#### Eigenvalues of the Covariance Matrix

Now, assume that $\mu_X = 0$. All of this is still true otherwise, but the calculations are more annoying. Because $\Sigma$ is symmetric, it is orthogonally diagonalizable (by the spectral theorem). Consider the eigendecomposition

$$\Sigma = Q \Lambda Q^T,$$

where $Q$ is the orthogonal matrix whose columns are the (unit) eigenvectors of $\Sigma$, and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues.

Consider the first principal component. By definition, this is the linear combination of the
features $X_{\cdot, i}$, with unit-norm coefficient vector, that *maximizes* variance.
So if we write

$$Z_1 = \phi_{1,1}X_{\cdot,1} + \phi_{2,1}X_{\cdot,2} + \cdots + \phi_{m,1}X_{\cdot,m} = X\phi,$$

then the vector $\phi = (\phi_{1,1}, \ldots, \phi_{m, 1})^T$ attains the maximum

$$
\begin{aligned}
\max_{\|v\|_2 = 1}\widehat{\operatorname{Var}}(Xv) &= \max_{\|v\|_2 = 1}\frac{1}{n}\|Xv\|_2^2 \\
&= \max_{\|v\|_2 = 1} v^T\Sigma v \\
&= \|\Sigma\|_2.
\end{aligned}
$$

Now, it is a fact that I looked up on the internet that the $L^2$ (operator) norm of a matrix is equal to the largest singular value of that matrix (why?). It is also a fact that the singular values and eigenvalues coincide for a symmetric positive semidefinite matrix like $\Sigma$ (this is a straightforward computation). Thus, the final line in the above equation is equal to the largest eigenvalue of $\Sigma$, and since the corresponding unit eigenvector attains this value (plug it into $v^T\Sigma v$), that eigenvector is in fact the first principal component.

The rest of the eigenvectors of $\Sigma$, ordered by decreasing eigenvalue, are the remaining principal components of $X$.
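
For concreteness, here is a minimal NumPy sketch of this fact on a made-up random dataset (none of the names below come from the text above; the $\frac{1}{n}$ convention matches the definition of $\Sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                 # center so that mu_X = 0

# Covariance matrix and its eigendecomposition Sigma = Q Lambda Q^T
Sigma = X.T @ X / X.shape[0]
eigvals, Q = np.linalg.eigh(Sigma)     # eigh handles symmetric matrices, ascending order

# Sort descending: first principal component <-> largest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

phi = Q[:, 0]                          # first principal component

# The variance of the projection X @ phi equals the top eigenvalue,
# which is the maximum over all unit-norm coefficient vectors.
print(np.var(X @ phi), eigvals[0])     # these agree (up to floating point)
```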

## Collaborative Filtering

### Problem Motivation

Suppose we are building a recommender system for a movie service with an explicit rating system, where users rate movies on a scale of 0 to 5 stars. Imagine that for each movie $i$ we have a $k$-dimensional vector $x^{(i)}$ of genre weights, and that for each user $u$ we know a $k$-dimensional vector $\theta^{(u)}$ of genre preferences (in both cases, assuming the weights sum to 1). Then it would be easy to predict the rating a user would give a movie: the rating $r_{u,i}$ would be approximated by 5 times the dot product of $\theta^{(u)}$ and $x^{(i)}$.
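
For instance, with made-up numbers: two genres, a user with preferences $\theta^{(u)} = (0.8, 0.2)$, and a movie with genre weights $x^{(i)} = (0.9, 0.1)$ gives the predicted rating

$$5\,\left\langle \theta^{(u)}, x^{(i)} \right\rangle = 5\,(0.8\cdot 0.9 + 0.2\cdot 0.1) = 5 \cdot 0.74 = 3.7.$$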

### Alternating Least Squares

This suggests an optimization problem. If we know the user preference vectors $\theta^{(u)}$, then learning the movie genre vectors $x^{(i)}$ is achieved by minimizing the expression

$$\sum_{(u,i)\,:\,r_{u,i}\text{ exists}} \left( \left(\theta^{(u)}\right)^T x^{(i)} - r_{u,i} \right)^2$$

(where the sum ranges only over the $(u,i)$ pairs such that $r_{u,i}$ exists). Similarly, if we know the genre embeddings $x^{(i)}$, then learning the user preferences $\theta^{(u)}$ is achieved by minimizing the same expression,

$$\sum_{(u,i)\,:\,r_{u,i}\text{ exists}} \left( \left(\theta^{(u)}\right)^T x^{(i)} - r_{u,i} \right)^2,$$

this time viewed as a function of the $\theta^{(u)}$ rather than the $x^{(i)}$.

So, to find both the user and the movie genre embeddings, we simply alternate between the two minimizations, starting from random initial guesses. We can combine the above expressions into a single objective:

$$J\left(\{x^{(i)}\},\, \{\theta^{(u)}\}\right) = \sum_{(u,i)\,:\,r_{u,i}\text{ exists}} \left( \left(\theta^{(u)}\right)^T x^{(i)} - r_{u,i} \right)^2.$$

We can also just minimize $J$ directly using gradient descent. I'm not sure if that's advantageous.
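
Here is a minimal NumPy sketch of this alternating scheme. Everything in it (the function name, the boolean `mask` encoding which $r_{u,i}$ exist, and the small ridge term `reg`) is my own illustration scaffolding rather than anything from the notes above; the point is that with one side fixed, each update is just an ordinary least-squares solve.

```python
import numpy as np

def als(R, mask, k=3, n_iters=20, reg=0.1, seed=0):
    """Alternating least squares for the objective J above (minimal sketch).

    R    : (num_users, num_movies) ratings matrix (unobserved entries are ignored)
    mask : boolean matrix, True where r_{u,i} exists
    reg  : small ridge term (my addition) to keep the solves well conditioned
    """
    rng = np.random.default_rng(seed)
    num_users, num_movies = R.shape
    Theta = rng.normal(scale=0.1, size=(num_users, k))   # user preference vectors theta^(u)
    X = rng.normal(scale=0.1, size=(k, num_movies))      # movie genre vectors x^(i)

    for _ in range(n_iters):
        # Fix the theta^(u); solve for each movie's genre vector x^(i)
        for i in range(num_movies):
            users = mask[:, i]
            A, b = Theta[users], R[users, i]
            X[:, i] = np.linalg.solve(A.T @ A + reg * np.eye(k), A.T @ b)
        # Fix the x^(i); solve for each user's preference vector theta^(u)
        for u in range(num_users):
            movies = mask[u]
            A, b = X[:, movies].T, R[u, movies]
            Theta[u] = np.linalg.solve(A.T @ A + reg * np.eye(k), A.T @ b)

    return Theta, X

# Toy usage: 4 users, 5 movies, roughly 60% of the ratings observed
rng = np.random.default_rng(1)
R = rng.integers(0, 6, size=(4, 5)).astype(float)
mask = rng.random((4, 5)) < 0.6
Theta, X = als(R, mask)
print(np.round(Theta @ X, 2))   # predicted ratings, including the unobserved entries
```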

### Low Rank Matrix Factorization

Let $m$ be the number of users, let $n$ be the number of movies, and let $k$
be the number of genres (more generally called *latent factors*). Let
$R\in\mathbb{R}^{m\times n}$ be the matrix of ratings, let
$X\in\mathbb{R}^{k\times n}$ be the matrix of genre weights for the movies, and
let $\Theta\in\mathbb{R}^{m\times k}$ be the matrix of genre preferences for the
users. Then

$$R \approx \Theta X,$$

and the right-hand side has rank at most $k$, which is much smaller than $m$ or $n$.

The fact that the matrix is low rank is actually highly important: it says that we don't need a lot of genres, and that the full ratings matrix has a lot of dependencies (similarities between people and between movies, as well as linear dependencies between people's preferences for various genres). We can rely on this low-rank assumption when designing algorithms for recommender systems.
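
A tiny sketch (with made-up factors) of why the predicted ratings matrix is low rank:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 6, 8, 2                       # 6 users, 8 movies, 2 latent factors

Theta = rng.random((m, k))              # user preferences, m x k
X = rng.random((k, n))                  # movie genre weights, k x n

R_hat = Theta @ X                       # full predicted ratings matrix, m x n

# R_hat is 6 x 8, but its rank is at most k = 2
print(np.linalg.matrix_rank(R_hat))                          # -> 2
print(np.round(np.linalg.svd(R_hat, compute_uv=False), 3))   # only 2 nonzero singular values
```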

## Other models I did not get a chance to write up

**TODO:** Write these up some time

- Singular Value Decomposition
- K Means
- Robust Covariance
- One-Class SVM
- Isolation Forest