# Paper: Wide & Deep Learning for Recommender Systems

There seems to be an interesting new model architecture for ranking & recommendation, developed by Google Research. It uses Logistic Regression & Deep Learning in a single model.

This is different from ensemble models, where a sub-model is trained separately, and it’s score is used as a feature for the parent model. In this paper, the authors learn a wide model (Logistic Regression, which is trying to “memorize”), and a deep model (Deep Neural Network, which is trying to “generalize”), jointly.

The input to the wide network are standard features, while the deep network uses dense embeddings of the document to be scored, as input.

The main benefits as per the authors, are:

1. DNNs can learn to over-generalize, while LR models are limited in how much they can memorize from the training data.

2. Learning the models jointly means that the ‘wide’ and ‘deep’ part are aware of each other, and the ‘wide’ part only needs to augment the ‘deep’ part.

3. Also, training jointly helps reduce the side of the individual models.

They also have a TensorFlow implementation. Also a talk on this topic.

The authors employed this model to recommend apps to be downloaded to the user in Google Play, where they drove up app installs by 3.9% using the Wide & Deep model.

However, the Deep model in itself, drove up installs by 2.9%. It is natural to expect that the ‘wide’ part of the model should help in further improving the metric to be optimized, but it is unclear to me, if the delta could have been achieved by further bulking up the ‘deep’ part (i.e., adding more layers / training bigger dimensional embeddings, which are inputs to the DNNs).

# Birthday Paradox and Hash Functions

The Birthday Problem is concerned with computing the probability that, in a set of $n$ randomly chosen people, at least one pair of people will have the same birthday (let’s call it a ‘collision’).

Intuitively, what would be your guess for the probability of collision to become > 0.5? Given that there are 365 possible days, I would imagine $n$ to be quiet large for this to happen. Let’s compute it by hand.

What is the probability that there isn’t a collision?

• The total number of combinations of birthdays possible for the set of $n$ people is $365^n$.
• However, for the birthdays to not overlap, each one should pick $n$ different birthdays out of the 365 possible ones. This is equal to $_{365}P_{n}$ (pick any $n$ out of the 365 days, and allow permutations).

• $P(\text{no collision}) = \frac{_{365}P_{n}}{365^n}$.

• $P(\text{collision}) = 1 - \frac{_{365}P_{n}}{365^n}$.

Plotting the graph for collision to happen, let’s see where this becomes true.

So it seems that the collision happens with a probability >= 0.5 after $n$ is greater than 23. The paradox in the Birthday Paradox is that we expect $n$ to be quite large, where as it seems you need only $\approx \sqrt{365}$ people.

In general, it has been proven that the if there are $n$ balls, and $b$ bins, then the probability of any bin having > 1 ball is >= 0.5, when $n \approx \sqrt{b}$.

Considering hash functions to be placing balls in bins, if the number of distinct elements that could be fed to the hash function are $n$, to ensure that the probability of collision remains < 0.5, the length of the hash in number of bits required for the hash function, $l$, should be such that $2^l > n^2$.

This means, if you expect $2^{32}$ distinct elements, make sure to use at least a 64 bit hash, if you want the probability of collision to be < 0.5

# Deathbed Formulae

1. $\large{\left(\frac{y}{x}\right)^{x}} \leq \large{y \choose x} \leq \large{\left(\frac{ey}{x}\right)^{x}}$.
2. For a large value of $n$, $\left(1 - \large{\frac{1}{n}}\right)^{n} \approx \large{\frac{1}{e}}$.
3. For a large value of $n$, $\left(1 + \large{\frac{1}{n}}\right)^{n} \approx e$.