Model-Level Parallelism: Works with large models by splitting the model graph itself into several parts. Each part of the model is assigned to a different machine. If there is an edge between two nodes in different parts, the two machines hosting those parts need to communicate. This gets around the problem of fitting a large model on a single GPU.
Downpour SGD: To scale to large datasets, DistBelief also runs several replicas of the model itself. The training data is split into several subsets, and each replica works on a single subset. Each replica sends the updates of its params to a Parameter Server. The parameter server itself is sharded, and each shard is responsible for getting updates for a subset of params.
Whenever a replica starts a new minibatch, it gets the relevant params from the parameter server shards, and then sends its updates when it’s done with its minibatch.
The authors found Adagrad to be useful in the asynchronous SGD setting, since it uses an adaptive learning rate for each parameter, which makes it easy to implement locally per parameter shard.
But seriously, people make a big deal out of ‘Dynamic Programming’, in the context of software engineering interviews. Also, the name sounds fancy but for most problems in such interviews, you can go from a naive recursive solution to an efficient solution, pretty easily.
Any problem that has the following properties can be solved with Dynamic Programming:
You just have to do two things here:
That’s it.
Usually the second part is harder. After that, it is like clockwork, and the steps remain the same almost all the time.
Example
Assume your recursive solution to, say, compute the nth Fibonacci number is:
Step 1: Write this as a recursive solution first
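A minimal Python sketch of such a naive recursion. The name fib and the base cases fib(0) = 0, fib(1) = 1 are assumptions made for illustration:

```python
def fib(n):
    """Return the nth Fibonacci number, with fib(0) = 0 and fib(1) = 1."""
    if n < 2:
        return n
    # Two recursive calls per invocation: this is what makes it exponential.
    return fib(n - 1) + fib(n - 2)
```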

Now, this is an exponential time solution. Most of the inefficiency comes in because we recompute the solutions again and again. Draw the recursion tree as an exercise, and convince yourself that this is true.
Also, when you do this, you at least get a naive solution out of the way. The interviewer at least knows that you can solve the problem (perhaps, not efficiently, yet).
Step 2: Let’s just simply cache everything
Store every value ever computed.
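One way the cached version could look in Python. This is a sketch: the dictionary cache and the optional cache argument are assumptions, only the name fibDP comes from the text.

```python
def fibDP(n, cache=None):
    """Top-down Fibonacci: the same recursion, but every result is stored."""
    if cache is None:
        cache = {}
    if n < 2:
        return n
    if n not in cache:
        # Each distinct n is computed once; later calls are dictionary lookups.
        cache[n] = fibDP(n - 1, cache) + fibDP(n - 2, cache)
    return cache[n]
```

With the cache in place, fibDP(50) returns immediately, while the naive recursion would make billions of calls.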

Let us compute how many unique calls we can make to fibDP. It takes a single parameter, n, and only n unique values of n can be passed to fibDP. Hence, there are only n unique calls.
Now, realize two things:
That’s it. We just optimized the recursive code from an $O(2^n)$ time complexity, $O(n)$ space complexity (recursive call stack space) to an $O(n)$ time, $O(n)$ space (recursive + extra space).
Example with a higher number of parameters
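As an illustration, here is a hypothetical two-parameter function that makes three recursive calls per invocation. The body of foo is an assumption, chosen only so that the analysis below applies to it:

```python
def foo(n, m):
    """Hypothetical three-way recursion over two parameters."""
    if n <= 0 or m <= 0:
        return 1
    # Three branches per call, and each branch shrinks n + m by at least 1.
    return foo(n - 1, m) + foo(n, m - 1) + foo(n - 1, m - 1)
```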

Time complexity: $O(3^{n+m})$ [Work it out on paper, why this would be the complexity, if you are not sure.]
DP Code
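A sketch of what the memoized version might look like, assuming a hypothetical foo that recurses on (n - 1, m), (n, m - 1) and (n - 1, m - 1). The name fooDP and the dictionary cache are assumptions as well. There are at most n * m distinct (n, m) pairs, so the cached version does O(nm) work:

```python
def fooDP(n, m, cache=None):
    """Top-down DP over two parameters, caching on the (n, m) pair."""
    if cache is None:
        cache = {}
    if n <= 0 or m <= 0:
        return 1
    key = (n, m)
    if key not in cache:
        cache[key] = (fooDP(n - 1, m, cache)
                      + fooDP(n, m - 1, cache)
                      + fooDP(n - 1, m - 1, cache))
    return cache[key]
```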

Assume I tweak foo and add $O(n \log m)$ work inside each call; that would just be multiplied into the time complexity, i.e.,
Time complexity = O(unique calls) * O(work per call)
Space complexity = O(unique calls) * O(space per call)
Now just reinforce these ideas with this question
Extra Credit
What we saw is called top-down DP, because we are taking a bigger problem, breaking it down into subproblems and solving them first. This is basically recursion with memoization (we ‘memoize’ (fancy word for caching) the solutions of the subproblems).
When you absolutely, totally nail the recursive solution, some interviewers might want a solution without recursion. Or, they might want to optimize the space complexity even further (which is not often possible in the recursive case). In this case, we want a bottom-up DP, which is slightly more complicated. It starts by solving the smallest problems iteratively, and builds the solution to bigger problems from that.
Go into this area only if you have time. Otherwise, even if you mention to the interviewer that you know there is something called bottom-up DP which can be used to do this iteratively, they should be at least somewhat okay. I did a short blog post on converting a top-down DP to a bottom-up DP, if it sounds interesting.
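For the Fibonacci example, a bottom-up version could look like the sketch below. The function name is an assumption. It builds up from the smallest subproblems and keeps only the last two values, so the extra space drops to $O(1)$:

```python
def fib_bottom_up(n):
    """Iterative Fibonacci: O(n) time, O(1) extra space."""
    prev, curr = 0, 1  # fib(0), fib(1)
    for _ in range(n):
        # Slide the two-value window one step forward.
        prev, curr = curr, prev + curr
    return prev
```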
Examples of such datasets could be clicks on Google.com, friend requests on Facebook.com, etc.
One simple problem that could be posed with such datasets is:
Pick a random element from the given dataset, ensuring that the likelihood of picking any element is the same.
Of course, it is trivial to solve if we know the size of the dataset. We can simply pick a random number in $[0, n-1]$ ($n$ being the size of the dataset), and index to that element in the dataset. That is of course not possible with a stream, where we don’t know $n$.
Reservoir Sampling does this pretty elegantly for a stream $S$:
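A minimal Python sketch of the single-element version. The function name and the optional rng parameter are assumptions. It makes one pass over the stream and never needs to know its length in advance:

```python
import random

def reservoir_sample(stream, rng=random):
    """Return one element of `stream`, each with equal probability."""
    choice = None
    length = 0  # number of elements seen so far (l in the text)
    for item in stream:
        length += 1
        # Replace the current choice with probability 1 / length.
        if rng.random() < 1.0 / length:
            choice = item
    return choice
```

The first element is always picked initially, since rng.random() returns a value in [0, 1) and 1 / length is 1 at that point.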

The idea is simple. For every new element you encounter, replace your current choice with this new element with a probability of $\large\frac{1}{l}$, where $l$ is the total length of the stream (including this element) encountered so far. When the stream ends, you return the element that you had picked at the end.
However, we need to make sure that the probability of each element being picked is the same. Let’s do a rough sketch:
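One way to sketch the argument, assuming the stream ends after $n$ elements: the $i$th element is returned only if it is picked when it arrives, and then survives every later replacement.

```latex
P(\text{element } i \text{ is returned})
  = \frac{1}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right)
  = \frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n}
  = \frac{1}{n}
```

The product telescopes, so every element ends up with the same probability $\frac{1}{n}$.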
There could be a weighted variant of this problem, where each element has an associated weight. At the end, the probability of an item $i$ being selected should be $\large\frac{w_i}{W}$, where $w_i$ is the weight of the $i$th element, and $W$ is the sum of the weights of all the elements.
It is straightforward to extend our algorithm to respect the weights of the elements:
Instead of len, keep W.
Increment W by $w_i$ instead of just incrementing it by 1.
Replace the current choice with the new element with a probability of $\large\frac{w_i}{W}$.
In a manner similar to the proof above, we can show that this algorithm will also respect the condition that we imposed on it. In fact, the previous algorithm is a special case of this one, with all $w_i = 1$.
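A sketch of the weighted variant, assuming the stream yields (item, weight) pairs with positive weights. The function and parameter names are hypothetical:

```python
import random

def weighted_reservoir_sample(stream, rng=random):
    """Return one item from a stream of (item, weight) pairs.

    Item i is returned with probability w_i / W, where W is the total weight.
    """
    choice = None
    total_weight = 0.0  # W in the text
    for item, weight in stream:
        total_weight += weight
        # Replace the current choice with probability w_i / W.
        if rng.random() < weight / total_weight:
            choice = item
    return choice
```

Passing a weight of 1 for every item reduces this to the unweighted algorithm.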
Credits:
[1] Jeff Erickson’s notes on streaming algorithms for the general idea about streaming algorithms.
Essentially the premise is that labeling all the data is expensive, and we should learn as much as we can from as small a dataset as possible. For their data-labeling needs, the industry either relies on Mechanical Turk workers or full-time labelers on contract (Google Search is one example where they have a large team of human raters). Overall, it is costly to build a large labeled dataset. Therefore, if we can minimize our dependence on labeled data, and learn from known / inferred similarity within the dataset, that would be great. That’s where Semi-Supervised Learning helps.
Assume there is a graph structure to our data, where a node is a datum / row which needs to be labeled, and a weighted edge exists between two nodes if they are similar. In this case, Label Propagation is a classic technique which has been used commonly.
I read this paper from Google Research which does a good job of generalizing and summarizing similar work done by Weston et al., 2012, around training Neural Nets augmented by such a graph-based structure.
To summarize the work very quickly, the network tries to do two things:
a. For labeled data, try to predict the correct label (of course),
b. For the entire data set, try to learn a representation of each datum (embedding) in the hidden layers of the neural net.
For nodes which are adjacent to each other, the distance between their respective embeddings should be small, and the importance of keeping this distance small is proportional to the edge weight. That is, if there are two adjacent nodes with a high edge weight, if the neural net doesn’t learn to create embeddings such that these two examples are close to each other, there would be a larger penalty, than if the distance was smaller / the edge weight was lower.
Check the figure above. The blue layer is the hidden layer used for generating the embedding, whose output is represented by $h_{\theta}(X_i)$. $y$ is the final output. If $X_i$ and $X_j$ are close by, the distance between them is represented by $d(h_{\theta}(X_i), h_{\theta}(X_j))$.
The cost function is below. Don’t let this scare you. The first term is just the total loss from the predictions of the network for labeled data. The next three terms are for tweaking the importance of distances between the various (labeled / unlabeled) to (labeled / unlabeled) pairs, weighed by their respective edge weights, $w_{uv}$.
I skimmed through the paper to get the gist, but I wonder whether the core contribution of such a network is to go from a graph-based structure to an embedding. If we were to construct an embedding directly from the graph structure, and train a NN separately using the embedding as the input, my guess is it should fetch similar results with a less complex objective function.
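To make the shape of such an objective concrete, here is a simplified sketch. It is not the paper’s exact cost function: it assumes squared Euclidean distance for $d$, and collapses the three pair-type terms into a single sum with one hypothetical weight alpha. All names are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings (assumed choice of d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def graph_regularized_loss(supervised_loss, embeddings, edges, alpha=1.0):
    """Supervised loss plus an edge-weighted penalty on embedding distances.

    supervised_loss: total prediction loss on the labeled data (first term).
    embeddings: dict mapping node -> embedding h_theta(X) as a list of floats.
    edges: iterable of (u, v, w_uv) tuples.
    alpha: single assumed weight shared by all pair types.
    """
    penalty = sum(w * squared_distance(embeddings[u], embeddings[v])
                  for u, v, w in edges)
    return supervised_loss + alpha * penalty
```

Adjacent nodes with a high $w_{uv}$ and distant embeddings contribute the most to the penalty, which is the behaviour described above.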
To keep things short, I liked it because:
Just add a .numpy() suffix to convert a Tensor to a numpy array.
PyTorch’s website has a 60 min. blitz tutorial, which is laid out pretty well.
Here is the summary to get you started on PyTorch:
torch.Tensor is your np.array (the NumPy array). torch.Tensor(3,4) will create a Tensor of shape (3,4).
torch.rand can be used to generate random Tensors.
.numpy() allows converting a Tensor to a numpy array.
There are Variables, which are similar to placeholder in TF.
This is all it takes to compute the gradient, where x is a variable:
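A small sketch of what this looks like, assuming PyTorch 0.4 or later, where Variable has been merged into Tensor and requires_grad=True is set on the tensor directly. The particular computation is just an example:

```python
import torch

# requires_grad=True asks autograd to track operations on x.
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()   # a scalar

out.backward()   # backprop from the scalar all the way to x
print(x.grad)    # d(out)/dx = 1.5 * (x + 2), i.e. 4.5 for every entry
```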

Doing backprop simply with the backward method call on the scalar out computes gradients all the way to x. This is amazing!
There is nn.Module, which wraps around the boring boilerplate.

As seen, in the __init__ method, we just need to define the various NN layers we are going to be using. Then the forward method just runs through them. The view method is analogous to the NumPy reshape method.
The gradients will be applied after the backward pass, which is auto-computed. The code is self-explanatory and fairly easy to understand.

The criterion object is used to compute your loss function. optim has a bunch of optimization algorithms such as vanilla SGD, Adam, etc. As promised, simply calling the backward method on the loss object allows computing the gradient.
Overall, I could get to 96% accuracy, with the current setup. The complete gist is here.
Essentially, in Linear Regression, we try to estimate a dependent variable $y$, using independent variables $x_1$, $x_2$, $x_3$, $…$, using a linear model.
More formally, $y = b + W_1 x_1 + W_2 x_2 + … + W_n x_n + \epsilon$. Where, $W$ is the weight vector, $b$ is the bias term, and $\epsilon$ is the noise in the data.
It can be used when there is a linear relationship between the input $X$ (the input vector containing all the $x_i$) and $y$.
One example could be: given a vending machine’s sales of different kinds of soda, predict the total profit made. Let’s say there are three kinds of soda, and for each can of that variety sold, the profit is 0.25, 0.15 and 0.20 respectively. Also, we know that there will be a fixed cost in terms of electricity and maintenance for the machine; this will be our bias, and it will be negative. Let’s say it is -$100. Hence, our profit will be:
$y = -100 + 0.25x_1 + 0.15x_2 + 0.20x_3$.
The problem is usually the inverse of the above example. Given the profits made by the vending machine, and sales of different kinds of soda (i.e., several pairs of $(X_i, y_i)$), find the above equation. Which would mean being able to find $b$, and $W$. There is a closed-form solution for Linear Regression, but it is expensive to compute, especially when the number of variables is large (10s of thousands).
Generally in Machine Learning the following approach is taken for similar problems:
Step 1 The first step is fairly easy, we just pick a random $W$ and $b$. Let’s say $\theta = (W, b)$, then $h_\theta(X_i) = b + X_i.W$. Given an $X_i$, our prediction would be $h_\theta(X_i)$.
Step 2 For the second step, one loss function could be, the average absolute difference between the prediction and the real output. This is called the ‘L1 norm’.
$L_1 = \frac{1}{n}\sum_{i=1}^{n} \text{abs}(h_\theta(X_i) - y_i)$
L1 norm is pretty good, but for our case, we will use the average of the squared difference between the prediction and the real output. This is called the ‘L2 norm’, and is usually preferred over L1.
$L_2 = \frac{1}{2n}\sum_{i=1}^{n} (h_\theta(X_i) - y_i)^2$.
Step 3 We have two sets of params $b$ and $W$. Ideally, we want $L_2$ to be 0. But that would depend on the choices of these params. Initially the params are randomly chosen, but we need to tweak them so that we can minimize $L_2$.
For this, we follow the Stochastic Gradient Descent algorithm. We will compute ‘partial derivatives’ / gradient of $L_2$ with respect to each of the parameters. This will tell us the slope of the function, and using this gradient, we can adjust these params to reduce the value of the loss.
Again,
$L_2 = \frac{1}{2n}\sum_{i=1}^{n} (h_\theta(X_i) - y_i)^2$.
Deriving w.r.t. $b$ and applying chain rule,
$\large \frac{\partial L}{\partial b}$ = $2 . \large\frac{1}{2n}$ $\sum_{i=1}^{n} (h_\theta(X_i) - y_i) . 1$ (Since, $\frac{\partial (h_\theta(X_i) - y_i)}{\partial b} = 1$)
$ \implies \large \frac{\partial L}{\partial b}$ $= \large\frac{1}{n}$ $\sum_{i=1}^{n} (h_\theta(X_i) - y_i)$
Similarly, deriving w.r.t. $W_j$ and applying chain rule,
$\large \frac{\partial L}{\partial W_j}$ = $\large\frac{1}{n}$ $\sum_{i=1}^{n} (h_\theta(X_i) - y_i) . X_{ij}$
Hence, at each iteration, the updates we will perform will be,
$b = b - \eta \large\frac{\partial L}{\partial b}$, and, $W_j = W_j - \eta \large\frac{\partial L}{\partial W_j}$.
Where, $\eta$ is what is called the ‘learning rate’, which dictates how big of an update we will make. If we choose this to be small, we would make very small updates. If we set it to be a large value, then we might skip over the local minimum. There are a lot of variants of SGD with different tweaks around how we make the above updates.
Eventually we should converge to a value of $L_2$, where the gradients will be nearly 0.
The complete implementation with dummy data in about 100 lines is here. A short walkthrough is below.
The only two libraries that we use are numpy (for vector operations) and matplotlib (for plotting losses). We generate random data without any noise.

Where num_rows is $n$ as used in the above notation, and num_feats is the number of variables. We define the class LinearRegression, where we initialize W and b randomly. Also, the predict method computes $h_\theta(X)$.

The crux of the code is in the train method, where we compute the gradients.
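A condensed sketch of how these pieces could fit together with numpy. The learning rate, iteration count, seeds and data sizes are assumptions, and plotting is left out. The gradients follow the formulas derived above, computed over all rows in each iteration:

```python
import numpy as np

class LinearRegression:
    def __init__(self, num_feats, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal(num_feats)  # random initial weights
        self.b = rng.standard_normal()           # random initial bias

    def predict(self, X):
        # h_theta(X) = b + X . W
        return self.b + X @ self.W

    def train(self, X, y, lr=0.1, num_iters=2000):
        n = X.shape[0]
        losses = []
        for _ in range(num_iters):
            err = self.predict(X) - y                  # h_theta(X_i) - y_i
            losses.append((err ** 2).sum() / (2 * n))  # L_2 loss
            grad_b = err.sum() / n                     # dL/db
            grad_W = X.T @ err / n                     # dL/dW_j for every j
            self.b -= lr * grad_b
            self.W -= lr * grad_W
        return losses

# Noise-free dummy data generated from known parameters.
rng = np.random.default_rng(42)
num_rows, num_feats = 200, 5
true_W, true_b = rng.standard_normal(num_feats), 0.5
X = rng.standard_normal((num_rows, num_feats))
y = true_b + X @ true_W

model = LinearRegression(num_feats)
losses = model.train(X, y)
```

Since the data has no noise, model.W and model.b should end up very close to true_W and true_b, and the recorded losses should shrink towards 0.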

For the given input, with the fixed seed and five input variables, the solution as per the code is:

This is how the $L_2$ loss converges over number of iterations:
To verify that this is actually correct, I serialized the input to a CSV file and used R to solve this.
The intercept is the same as $b$, and the other five outputs are the $W_i$, and are similar to what my code found.
This is different from ensemble models, where a sub-model is trained separately, and its score is used as a feature for the parent model. In this paper, the authors learn a wide model (Logistic Regression, which is trying to “memorize”), and a deep model (Deep Neural Network, which is trying to “generalize”), jointly.
The input to the wide network are standard features, while the deep network uses dense embeddings of the document to be scored, as input.
The main benefits as per the authors, are:
DNNs can learn to over-generalize, while LR models are limited in how much they can memorize from the training data.
Learning the models jointly means that the ‘wide’ and ‘deep’ parts are aware of each other, and the ‘wide’ part only needs to augment the ‘deep’ part.
Also, training jointly helps reduce the size of the individual models.
They also have a TensorFlow implementation. Also a talk on this topic.
The authors employed this model to recommend apps to be downloaded to the user in Google Play, where they drove up app installs by 3.9% using the Wide & Deep model.
However, the Deep model in itself, drove up installs by 2.9%. It is natural to expect that the ‘wide’ part of the model should help in further improving the metric to be optimized, but it is unclear to me, if the delta could have been achieved by further bulking up the ‘deep’ part (i.e., adding more layers / training bigger dimensional embeddings, which are inputs to the DNNs).
Intuitively, how large would you guess $n$ needs to be for the probability of collision to become > 0.5? Given that there are 365 possible days, I would imagine $n$ to be quite large for this to happen. Let’s compute it by hand.
What is the probability that there isn’t a collision? Each of the $n$ people can have a birthday on any of the 365 days, so there are $365^n$ possible assignments in total.
However, for the birthdays to not overlap, the $n$ people should pick $n$ different birthdays out of the 365 possible ones. The number of ways to do this is $_{365}P_{n}$ (pick any $n$ out of the 365 days, and allow permutations).
$P(\text{no collision}) = \frac{_{365}P_{n}}{365^n}$.
Plotting the graph for collision to happen, let’s see where this becomes true.
So it seems that the collision happens with a probability >= 0.5 once $n$ reaches 23. The paradox in the Birthday Paradox is that we expect $n$ to be quite large, whereas it seems you need only $\approx \sqrt{365}$ people.
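The threshold can be checked numerically with a few lines of Python. The function name is an assumption. It computes $1 - \frac{_{365}P_{n}}{365^n}$ as a running product, to avoid huge intermediate numbers:

```python
def p_collision(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_no_collision = 1.0
    for i in range(n):
        # The (i + 1)th person must avoid the i birthdays already taken.
        p_no_collision *= (days - i) / days
    return 1.0 - p_no_collision

smallest_n = next(n for n in range(1, 366) if p_collision(n) >= 0.5)
print(smallest_n)  # 23
```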
In general, it has been proven that if there are $n$ balls and $b$ bins, then the probability of any bin having > 1 ball is >= 0.5 when $n \approx \sqrt{b}$.
Considering hash functions to be placing balls in bins, if the number of distinct elements that could be fed to the hash function is $n$, then to ensure that the probability of collision remains < 0.5, the length of the hash in number of bits, $l$, should be such that $2^l > n^2$.
This means, if you expect $2^{32}$ distinct elements, make sure to use at least a 64-bit hash, if you want the probability of collision to be < 0.5.
$\large{\left(\frac{y}{x}\right)^{x}} \leq \large{y \choose x} \leq \large{\left(\frac{ey}{x}\right)^{x}}$.
For a large value of $n$, $\left(1 - \large{\frac{1}{n}}\right)^{n} \approx \large{\frac{1}{e}}$.
For a large value of $n$, $\left(1 + \large{\frac{1}{n}}\right)^{n} \approx e$.
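A quick numerical sanity check of these three facts, with arbitrarily chosen values of $n$, $y$ and $x$ (math.comb needs Python 3.8 or later):

```python
import math

n = 10**6
print((1 - 1 / n) ** n, 1 / math.e)  # the two values agree to about 6 decimals
print((1 + 1 / n) ** n, math.e)      # likewise

y, x = 50, 7
lower = (y / x) ** x
upper = (math.e * y / x) ** x
# The binomial coefficient is sandwiched between the two bounds.
assert lower <= math.comb(y, x) <= upper
```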
These were super helpful in the Graduate-level Algorithms and Advanced Algorithms courses, which were heavy on randomized algorithms.
I might post about some interesting bits related to randomized algorithms some time soon, so wanted to share these preemptively.
Similarly, when computing which hash table bucket a particular item goes to, the common way to do it is: $b = h(x)\bmod n$, where $h(x)$ is the hash function output, and $n$ is the number of buckets you have.
In hash functions, one should expect to receive pathological inputs. Assume $n = 8$. What happens if we receive $h(x)$ such that they are all multiples of $4$? That is, $h(x)$ is in $[4, 8, 12, 16, 20, …]$, which in $\bmod 8$ arithmetic will be $[4, 0, 4, 0, 4, …]$. Clearly, only 2 buckets will be used, and the remaining 6 buckets will be empty, if the input follows this pattern. There are several such examples.
As a generalization, if the greatest common factor of $h(x)$ and $n$ is $g$, and the input is going to be of the form $[h(x), 2h(x), 3h(x), …]$, then the number of buckets that will be used is $\large \frac{n}{g}$. This is easily workable on paper.
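This is also easy to check empirically. A small sketch (function and parameter names are assumptions) that counts the distinct buckets hit by inputs of the form step, 2 * step, 3 * step, and so on:

```python
from math import gcd

def buckets_used(step, n, count=1000):
    """Distinct buckets hit by step, 2*step, 3*step, ... modulo n."""
    return len({(k * step) % n for k in range(1, count + 1)})

print(buckets_used(4, 8), 8 // gcd(4, 8))  # 2 2
print(buckets_used(4, 7), 7 // gcd(4, 7))  # 7 7
```

With a modulus of 8 only two buckets are ever touched, while a prime modulus of 7 spreads the same inputs over all seven.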
We ideally want to be able to use all the buckets. Hence, the number of buckets used, $\large \frac{n}{g}$ $= n$, which implies $g = 1$.
This means, the input and the modulus ($n$) should be coprime (i.e., share no common factors). Given, we can’t change the input, we can only change the modulus. So we should choose the modulus such that it is coprime to the input.
For the coprime requirement to hold for all inputs, $n$ has to be a prime. Now it will have no common factors with any input (except its own multiples), and $g$ would be 1.
Therefore, we need the modulus to be prime in such settings.
Let me know if I missed out on something, or my intuition here is incorrect.
I picked TensorFlow rather than Caffe, because of the possibility of being able to run my code on mobile, which I enjoy. Also, the documentation and community around TF seemed slightly more vibrant than Caffe/Caffe2.
What we want to do is:
The hidden state at time step $t$, $h_t$, is a function of the hidden state at the previous timestep, and the current input. Which is: $h_t = f_W(h_{t-1}, x_t)$
$f_W$ can be expanded to: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
$W_{xh}$ is a matrix of weights for the input at that time, $x_t$. $W_{hh}$ is a matrix of weights for the hidden state at the previous timestep, $h_{t-1}$.
Finally, $y_{t}$, the output at timestep $t$ is computed as: $y_t = W_{hy} h_t$
Dimensions:
For those like me who are finicky about dimensions:
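To make the shapes concrete, here is a numpy sketch of a single step of a vanilla RNN, assuming the usual update $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$ and $y_t = W_{hy} h_t$ with no bias terms. The sizes V and H, and the name W_hy, are assumptions made for illustration:

```python
import numpy as np

V, H = 65, 100  # assumed vocabulary size and hidden-state size
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((H, V)) * 0.01  # input -> hidden,  shape (H, V)
W_hh = rng.standard_normal((H, H)) * 0.01  # hidden -> hidden, shape (H, H)
W_hy = rng.standard_normal((V, H)) * 0.01  # hidden -> output, shape (V, H)

def rnn_step(x_t, h_prev):
    """One timestep: x_t is a one-hot vector (V,), h_prev has shape (H,)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # shape (H,)
    y_t = W_hy @ h_t                            # shape (V,), unnormalized scores
    return h_t, y_t

x_t = np.zeros(V)
x_t[3] = 1.0  # one-hot encoding of the character with index 3
h_t, y_t = rnn_step(x_t, np.zeros(H))
```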
This is my implementation of min-char-rnn, which I am going to use for the purpose of the post.
We start with just reading the input data, finding the distinct characters in the vocabulary, and associating an integer value with each character. Pretty standard stuff.

Then we specify our hyperparameters, such as size of the hidden state ($H$). seq_length is the number of steps we will train an RNN per initialization of the hidden state. In other words, this is the maximum context the RNN is expected to retain while training.

We have a method called genEpochData which does nothing fancy, apart from breaking the data into batch_size number of batches, each with a fixed number of examples, where each example has an (x, y) pair, both of which are of seq_length length. x is the input, and y is the output.
In our current setup, we are training the network to predict the next character. So y would be nothing but x shifted right by one character.
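A plain-Python sketch of the two preprocessing ideas: mapping characters to integers, and cutting the text into (x, y) pairs where y is x moved one character ahead. The helper names are hypothetical, and batching is left out for brevity:

```python
def build_vocab(text):
    """Map every distinct character to an integer id."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}

def make_examples(text, char_to_ix, seq_length):
    """Cut the text into (x, y) pairs of length seq_length each."""
    ids = [char_to_ix[ch] for ch in text]
    examples = []
    for start in range(0, len(ids) - seq_length, seq_length):
        x = ids[start:start + seq_length]
        y = ids[start + 1:start + seq_length + 1]  # the next character at each step
        examples.append((x, y))
    return examples

char_to_ix = build_vocab("hello world")
examples = make_examples("hello world", char_to_ix, seq_length=4)
```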
Now that we have got the boilerplate out of the way, comes the fun part.
The way TensorFlow (TF) works is that it creates a computational graph. With numpy, I was used to creating variables which hold actual values. So data and computation went hand-in-hand.
In TF, you define ‘placeholders’, which are where your input will go, such as placeholders for x and y, like so:

Then you can define ‘operations’ on these input placeholders. In the code below, we convert x and y to their respective ‘one hot’ representations (a binary vector of size vocab_size, where if the value of x is i, the ith bit is set).

This is a very simple computation graph, wherein if we set the placeholders correctly, x_oh and y_oh will have the corresponding one-hot representations of x and y. But you can’t print out their values directly, because they don’t contain them. We need to evaluate them through a TF session (coming up later in the post).
One can also define variables, such as when defining the hidden state, we do it this way:
We’ll use the above declared variable and placeholders to compute the next hidden state, and you can compute arbitrarily complex functions this way. For example, the picture below from the TF whitepaper shows how we can represent the output of a Feedforward NN using TF (b and W are variables, and x is the placeholder. Everything else is an operation).
The code below computes $y_t$, given the $x_t$ and $h_{t-1}$.

Now we are ready to complete our computation graph.

As we saw above, we can compute the total loss in the batch pretty easily. This is usually the easier part.
While doing CS231N assignments, I learned the harder part was the backprop, which is based on how off your predictions are from the expected output. You need to compute the gradients at each stage of the computation graph. With large computation graphs, this is tedious and error prone. What a relief it is that TF does it for you automagically (although it is super useful to know how backprop really works).

Evaluating the graph is pretty simple too. You need to initialize a session, and then initialize all the variables. The run method of the Session object does the execution of the graph. The first input is the list of the graph nodes that you want to evaluate. The second argument is the dictionary of all the placeholders.
It returns you the list of values for each of the requested nodes in order.

After this, it is pretty easy to stitch all this together into a proper RNN.
While writing the post, I discovered a couple of implementation issues, which I plan to fix. But nevertheless, training on Shakespeare’s ‘The Tempest’, after a few hundred epochs, the RNN generated this somewhat English-like sample:

Not too bad. It learns that there are characters named Sebastian and Ferdinand. And mind you this was a character level model, so this isn’t super crappy :)
(All training was done on my MBP. No NVidia GPUs were used whatsoever. I have a good ATI Radeon GPU at home, but TF doesn’t support OpenCL yet. It’s coming soonish though.)
Memcache was being used as a cache for serving the FB graph, which is persisted on MySQL. Using Memcache along with MySQL as a look-aside/write-through cache makes it complicated for Product Engineers to write code modifying the graph while taking care of consistency, retries, etc. There has to be glue code to unify this, which can be buggy.
A new abstraction of Objects & Associations was created, which allowed expressing a lot of actions on FB as objects and their associations. Initially there seems to have been a PHP layer which deprecated direct access to MySQL for operations which fit this abstraction, while continuing to use Memcache and MySQL underneath the covers.
This PHP layer for the above model is not ideal, since:
Incremental Updates: For one-to-many associations, such as the association between a page and its fans on FB, any incremental update to the fan list would invalidate the entire list in the cache.
Distributed Control Logic: Control logic resides in fat clients. Which is always problematic.
Expensive Read After Write Consistency: Unclear to me.
TAO is a write-through cache backed by MySQL.
TAO objects have a type ($otype$), along with a 64-bit globally unique id. Associations have a type ($atype$), and a creation timestamp. Two objects can have only one association of the same type. As an example, users can be Objects and their friendship can be represented as an association. TAO also provides the option to add inverse-assocs, when adding an assoc.
The TAO API is simple by design. Most are intuitive to understand.
assoc_add(id1, atype, id2, time, (k→v)*): Add an association of type atype from id1 to id2.
assoc_delete(id1, atype, id2): Delete the association of type atype from id1 to id2.
assoc_get(id1, atype, id2set, high?, low?): Returns assocs of atype between id1 and members of id2set, whose creation time lies between $[high, low]$.
assoc_count(id1, atype): Number of assocs from id1 of type atype.
As per the paper:
TAO enforces a per-atype upper bound (typically 6,000) on the actual limit used for an association query.
This is also probably why the maximum number of friends you can have on FB is 5000.
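To get a feel for the association calls listed in the paper, here is a toy in-memory sketch in Python. It only mirrors the call names and arguments; it is not how TAO is implemented, and the class name and storage layout are assumptions:

```python
from collections import defaultdict

class ToyAssocStore:
    """In-memory toy mirroring the shape of TAO's association calls."""

    def __init__(self):
        # (id1, atype) -> {id2: (time, data)}; keying by id2 keeps at most
        # one association of a given type between two objects.
        self._assocs = defaultdict(dict)

    def assoc_add(self, id1, atype, id2, time, **data):
        self._assocs[(id1, atype)][id2] = (time, data)

    def assoc_delete(self, id1, atype, id2):
        self._assocs[(id1, atype)].pop(id2, None)

    def assoc_get(self, id1, atype, id2set, high=None, low=None):
        found = []
        for id2 in id2set:
            entry = self._assocs[(id1, atype)].get(id2)
            if entry is None:
                continue
            t, data = entry
            # Optional bounds on the creation time.
            if (high is None or t <= high) and (low is None or t >= low):
                found.append((id2, t, data))
        return found

    def assoc_count(self, id1, atype):
        return len(self._assocs[(id1, atype)])
```

For example, after adding two "friend" assocs from object 1, assoc_count(1, "friend") returns 2, and assoc_get can filter them by creation time.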
There are two important factors in the TAO architecture design:
The choice of being okay with multiple roundtrips to build a page, while wanting to ensure a snappy product experience, imposes the requirement that:
The underlying DB is MySQL, and the TAO API is mapped to simple SQL queries. MySQL had been operated at FB for a long time, and internally backups, bulk imports, async replication etc. using MySQL was well understood. Also MySQL provides atomic write transactions, and few latency outliers.
Objects and Associations are in different tables. Data is divided into logical shards. Each shard is served by a database.
Quoting from the paper:
In practice, the number of shards far exceeds the number of servers; we tune the shard to server mapping to balance load across different hosts.
And it seems like the sharding trick we credited to Pinterest might have been used by FB first :)
Each object id contains an embedded shard id that identifies its hosting shard.
The above setup means that your shard id is pre-decided. An assoc is stored in the shard belonging to its id1.
TAO also requires “read-what-you-wrote” consistency semantics for writers, and eventual consistency otherwise.
TAO is set up with multiple regions, and user requests hit the regions closest to them. The diagram below illustrates the caching architecture.
There is one ‘master’ region and several ‘slave’ regions. Each region has a complete copy of the databases. There is an ongoing async replication from the master to the slave(s). In each region, there are groups of machines which are ‘followers’, where each individual group of followers caches and completely serves read requests for the entire domain of the data. Clients are sticky to a specific group of followers.
In each region, there is a group of leaders, where there is one leader for each shard. Read requests are served by the followers, cache misses are forwarded to the leaders, which in turn return the result from either their cache, or query the DB.
Write requests are forwarded to the leader of that region. If the current region is a slave region, the request is forwarded to the leader of that shard in the master region.
The leader sends cache-refill/invalidation messages to its followers, and to the slave leader, if the leader belongs to the master region. These messages are idempotent.
The way this is set up, the reads can never be stale at the leader in the master region. Followers in the master region, the slave leader and, by extension, slave followers might be stale. The authors mention an average replication lag of 1s between master and slave DBs, though they don’t mention whether this is same-coast / cross-country / trans-atlantic replication.
When the leader fails, the reads go directly to the DB. The writes to the failed leader go through a random member in the leader tier.
There are multiple places to read, which increases read availability. If the follower that the client is talking to dies, the client can talk to some other follower in the same region. If all followers are down, you can talk directly to the leader in the region. If that fails too, the client contacts the DB in the current region or other followers / leaders in other regions.
These are some client-side observed latency and hit-rate numbers in the paper.
The authors report a failure rate of $4.9 × 10^{−6}$, which is 5 9s! Though one caveat as mentioned in the paper is, because of the ‘chained’ nature of TAO requests, an initial failed request would imply the dependent requests would not be tried to begin with.
This again is a very readable paper relatively. I could understand most of it in 3 readings. It helped that there is a talk and a blog post about this. Makes the material easier to grasp.
I liked that the system is designed to have a simple API, and focuses on making the calls as fast as they can be. Complex operations have not been built into the API. Eventual consistency is fine for a lot of use cases.
There is no transactional support, so if we have assocs and inverse assocs (for example likes_page and page_liked_by edges), we would ideally want to remove both atomically. However, it is possible that the assoc in one direction was removed, but there was a failure to remove the assoc in the other direction. These dangling pointers are removed by an async job as per the paper. So clients have to ensure that they are fine with this.
From the Q&A after the talk, Nathan Bronson mentions that there exists a flag in the calls, which could be set to force a cache miss / stronger consistency guarantees. This could be specifically useful in certain use cases such as blocking / privacy settings.
Pinterest’s Zen is inspired by TAO and implemented in Java. It powers messaging as well at Pinterest, interestingly (apart from the standard feed / graph based use case), and is built on top of HBase, and a MySQL backend was in development in 2014. I have not gone through the talk, just cursorily seen the slides, but they seem to have been working on Compare-And-Swap style calls as well.
We start by reading the Early Bird paper.
The paper starts by laying out the core design principles. Low latency and high throughput are obvious. The ability to present real-time tweets was the unique requirement for Twitter at the time the paper was written, i.e., new tweets should be immediately searchable. Regarding the real-time requirement, in the past search engines would crawl the web / index their documents periodically, and the indices were built via batch jobs through technologies such as MapReduce.
Since the time the paper was authored (Fall 2011), this has changed. Both Google and Facebook surface real-time results in their search results and feed. But arguably a large fraction of Twitter’s core user intent is real-time content, so they have to get it right.
The core of the paper starts by going over the standard fan-out architecture for distributed systems, with replication & caching for distributing query evaluation and then aggregating results. Then they focus specifically on what goes on in a single node while evaluating the query.
For IR newbies: an inverted index maintains something called ‘posting lists’. Consider them to be something like `map<Term, vector<Document>>` in C++, i.e., a map from a Term to a list of documents. If I am querying for the term `beyonce`, I’ll look up the posting list for this term, and the documents containing that term would be present in the list.
Of course, there can be millions of documents with such a popular term, so there is usually a two-phase querying process. In the first phase, we do a cheap evaluation on these documents. This is usually achieved by pre-computing some sort of quality score such as PageRank (which is independent of the query and the searcher), and keeping the documents in the list sorted in descending order of this quality score. Then at query time, we get the top $N$ candidates from this vector.
Once we have the candidates, we do a second phase, which involves a more expensive ranking of these candidates, to return a set of ranked results according to all the signals (query, searcher and document specific features) that we care about.
EarlyBird is based on Lucene (a Java-based open-source search engine). Each Tweet is assigned a static score on creation, and a resonance score (likes, retweets) which is live-updated. Upon querying, the static score, the resonance score, and a personalization score computed according to the searcher’s social graph are used to rank the tweets.
At the time the paper was written, they state that the latency between a tweet’s creation and its searchability was ~10s. Their posting lists store documents in chronological order, and at query time these documents are traversed in reverse chronological order (most recent tweet first).
For querying, they reuse Lucene’s `and`, `or`, etc. operators. Lucene also supports positional queries (you can ask Lucene to return documents which have terms A and B, where both are at most D positions away from each other in the document).
EarlyBird seems to handle the problem of concurrent reads and writes to an index shard by splitting the shard into ‘segments’. All segments but one are read-only. The mutable segment continues to receive writes until it ‘fills up’, at which time it becomes immutable. This is analogous to the ‘memtable’ architecture of LSM trees. But I wonder if they do any sort of compaction on the segments; this is not clearly explained in the paper.
Layout for Mutable (Unoptimized) Index: Then they discuss the problem of how to add new tweets into posting lists. Their posting lists, at the time, were supposed to return reverse-chronological results. So they don’t use any sort of document score to sort the results. Instead, the tweet timestamp is what they want for ordering.
Appending at the end of posting lists doesn’t gel well with delta-encoding schemes, since those naturally work with forward traversal, whereas here the lists have to be traversed backwards. Prepending at the beginning of the lists using naive methods such as linked lists would be unfriendly to the cache, and would require additional memory for the `next` pointers.
They fall back to using arrays. The posting list is an array of 32-bit integer values, where they reserve 24 bits for the document id, and 8 bits for the position of the term in the document. 24 bits is sufficient because, firstly, they map global tweet ids to a local document id, and secondly, their upper limit on how many tweets can go in a segment is < $2^{23}$. Though, I might want to keep additional metadata about a document, and not just the position of the term, so this is a little too specific to tweets at the time the paper was authored.
They also keep pools of pre-allocated arrays, or slices (of sizes $2^1$, $2^4$, $2^7$ and $2^{11}$), similar to how a buddy allocator works. When a posting list exhausts its allocated slice, they allocate another one from the next pool (8x bigger for the first two steps, 16x for the last), until the slice size reaches $2^{11}$. There is some cleverness in linking together these slices. If you can get this linking to work, you do not have to do an $O(N)$ copy of your data as you outgrow your currently allocated slice.
Layout for Immutable (Optimized) Index: The approach of pools is obviously not always space-efficient. We can end up wasting ~ 50% of the space, if the number of documents for a term is pathologically chosen. In the optimized index, the authors have a concept of long and short posting lists. Short lists are the same as in the unoptimized index.
The long lists consist of blocks of 256 bytes each. The first four bytes hold the first posting, uncompressed. The remaining bytes are used to store the document id delta from the previous entry, and the position of the term, in a compressed form. I wonder why they don’t apply this compression to the entire list, and why they have compressed blocks instead. My guess is that compressing the entire list would prohibit random access.
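As a rough illustration of the 24 + 8 bit split, the sketch below packs a document id and a term position into one 32-bit posting. Placing the document id in the upper bits is an assumption of this sketch (the text above only gives the bit widths), and `packPosting` / `unpackPosting` are hypothetical names.

```go
package main

import "fmt"

const (
	positionBits = 8
	positionMask = 1<<positionBits - 1 // low 8 bits: term position
	maxDocID     = 1<<24 - 1           // 24 bits for the local document id
)

// packPosting stores the document id in the upper 24 bits and the term
// position in the lower 8 bits. Values that do not fit are truncated.
func packPosting(docID, position uint32) uint32 {
	return (docID&maxDocID)<<positionBits | (position & positionMask)
}

// unpackPosting reverses packPosting.
func unpackPosting(p uint32) (docID, position uint32) {
	return p >> positionBits, p & positionMask
}

func main() {
	p := packPosting(12345, 17)
	docID, position := unpackPosting(p)
	fmt.Println(docID, position) // 12345 17
}
```

With this layout a posting list is just a `[]uint32`, which is what keeps it compact and cache-friendly.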
Concurrency:
Most of the heavy lifting for consistency between the readers and the writer of a specific posting list is done by keeping a per-posting-list value of the maximum document id encountered so far (`maxDoc`). Declaring this value as `volatile` in Java introduces a memory barrier, so there is consistency without giving up too much performance.
The paper was very easy to read. I would have liked the authors to describe how the switch from a mutable to an immutable index happens, how the index is persisted to disk, etc., apart from addressing the rigid structure of the metadata in each posting list entry (just the term position).
There are a couple of new posts about improvements on top of EarlyBird.
This blogpost introduces Omnisearch. As I mentioned earlier, EarlyBird is strongly tied to the tweet’s content. In mature search systems, there are several “verticals”, which the infra needs to support. This blogpost describes how they are moving to a generic infrastructure which can be used to index media, tweets, users, moments, periscopes, etc.
Here is the blogpost on this topic. It goes over what is mentioned in the paper before describing their contributions. They mostly work on the optimized index, as described earlier.
If a document has duplicate terms, it occurs that many times in the old posting list format. In the new format, they keep (document, count) pairs, instead of (document, position) pairs. They keep another table for positions. To further optimize, since most counts are 1, they store (document, count - 1) pairs. They achieve a 2% space saving and a 3% latency drop. I’m not entirely convinced why this improves both for a tweet-text-only index.
However, when indexing terms which are not present in the text (such as for user indices, where we want to keep a term for verified users), the position does not make any sense. In that case, a separate position table makes sense, because we can skip the table completely.
SuperRoot is another layer on top of Twitter’s index servers, which exposes a single API to customers, instead of having them query individual indices themselves.
SuperRoot allows them to abstract away lower-level changes, add features like quota limitations, allow query optimization, and allow having thin clients. This is pretty essential when you start having a large number of customers.
Given an array of elements, find the lexicographically next permutation of that array.
As an example, if the array is [1, 2, 2, 3], the lexicographically next permutation would be [1, 2, 3, 2], followed by [1, 3, 2, 2], and so on. There is a really neat article explaining how to solve this. If you don’t know how to do it, I encourage you to try examples with arrays of size 4 and above, and try to see the patterns.
A recent problem I was solving was a variant of this.
Given an unsigned integer, find the lexicographically next unsigned integer, with the same number of bits set.
It’s trivial to observe that we can reduce this problem to the one above, by treating the integer as an array with just two kinds of elements, 0s and 1s.
Alright. Are we done? Not quite. The solution mentioned in the link is an $O(n)$ solution. Technically, an unsigned integer would be 32 or 64 bits, so $n$ would be one of those numbers for the purpose of this problem. It should be easy to repurpose the algorithm in the article mentioned above for our case. But I wanted to see if we can do it with as few operations as possible, because looping when operating on binary numbers is just not cool :)
I chanced upon Sean Anderson’s amazing bitwise hacks page, and I distinctly remember having seen this 7-8 years ago, when I was in Mumbai. Regardless, if you understand his solution: awesome! It took me some time to figure it out, so I wrote an arguably easier-to-comprehend solution, which is 50% slower than his in my microbenchmark, but still faster than the naive looping solution. So here goes.
Let’s pick an example: $10011100$. The next permutation would be $10100011$.
As per the article above, we find the longest increasing suffix from right-to-left (called the longest non-increasing suffix from left-to-right in the article). In this case, it will be $11100$. Thus, the example is made of two parts: $100.11100$ ($.$ for separating).
The first zero before the suffix is the ‘pivot’, further breaking down the example: $10.0.11100$.
We swap the rightmost one in the suffix with this pivot (the rightmost successor in the article). So the example becomes $10.1.11000$. However, the suffix part needs to be sorted, since so far we were permuting with the prefix $10.0.$, but this is the first permutation with the prefix $10.1.$, and therefore the suffix should be its own smallest permutation (basically, it should be sorted).
Hence, we move the zeroes in the suffix to the front of the suffix, resulting in $10.1.00011$. This is the correct next permutation.
We’ll break it down. Trust me, it’s easier to grasp than the terse wisdom of Sean Anderson’s snippet.
To find the pivot in the example $n = 10011100$, we first set all the bits in the suffix to 1.
$n - 1$ will set the trailing zeroes to 1 (and unset the lowest set bit), i.e., $10011011$, and $n \mid (n - 1)$ would result in a value with all the original bits set, along with the trailing zeroes set to 1, i.e., $10011111$. Thus, all the bits in the suffix are now set.
We now set the pivot bit to 1 (and unset the entire suffix) by adding 1, which gives $10100000$. Using `__builtin_ctz` we can then find the number of trailing zeroes, which is the same as the bit number of the pivot (5 here). See the note below for `__builtin` functions.
We then proceed to set the pivot. Since the value of $n$ was $10011100$, $step1 = 10111100$.
Now we need to unset the successor, which we can do with a trick similar to how we found the pivot. `step1 - 1` would unset the least significant set bit (the successor) and set all its trailing zeroes (i.e., $10111011$). `step1 & (step1 - 1)`, i.e., $10111100 \& 10111011$, zeroes out the successor bit and keeps the trailing zeroes as zeroes. Hence, $step2 = 10111000$.
This is fairly straightforward: we build the suffix mask, i.e., a value in which all the bits corresponding to the suffix part are set, and then $\&$ it with the number so far, $10.1.11000$, which gives us the modified suffix, i.e. $11000$.
All we need to do now is to pull the 1s to the right end of the suffix. This is done by right-shifting the suffix by its number of trailing zeroes, so we get $00011$ (the first part of the calculation of `final`). This is the ‘sorted’ suffix we mentioned earlier.
Then we OR it with the number so far, with its suffix part zeroed out, so that the unsorted suffix is replaced by the sorted one, i.e. $00011 \mid 10100000 \implies 10100011$.
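Putting the steps of this walkthrough together, here is a sketch in Go. It is a re-expression of the steps described above and not the original listing: `bits.TrailingZeros32` from `math/bits` stands in for `__builtin_ctz`, the function name `nextBitPermutation` is made up here, and the names `step1`, `step2` and `final` follow the text. It assumes a non-zero input for which a next permutation exists within 32 bits.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextBitPermutation returns the next larger integer with the same number of
// set bits as n. Assumes n != 0 and that the result fits in 32 bits.
func nextBitPermutation(n uint32) uint32 {
	// Set every bit of the suffix, then add 1: the lowest set bit of the
	// result sits exactly at the pivot position.
	pivot := uint(bits.TrailingZeros32((n | (n - 1)) + 1))

	// Set the pivot bit.
	step1 := n | (1 << pivot)

	// Clear the lowest set bit, which is the successor being swapped.
	step2 := step1 & (step1 - 1)

	// Extract the (now unsorted) suffix below the pivot.
	suffixMask := uint32(1)<<pivot - 1
	suffix := step2 & suffixMask

	// Sort the suffix by pulling its 1s to the right end, then splice it
	// back under the prefix. In Go, shifting by 32 or more yields 0, so an
	// all-zero suffix is handled without a special case.
	final := suffix>>uint(bits.TrailingZeros32(suffix)) | (step2 &^ suffixMask)
	return final
}

func main() {
	fmt.Printf("%08b\n", nextBitPermutation(0b10011100)) // 10100011
}
```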
I hope this helped you break down what’s going on, and probably served as a bridge between the naive solution and one-liner bitwise hacks.
Please leave any comments below. I’d be super happy to discuss any alternative solutions!
`__builtin` Functions
Whenever you are working with bitwise operations, gcc provides built-in functions such as `__builtin_popcount` (number of set bits in the argument), `__builtin_ctz` (number of trailing zeroes in the argument), and so on. These functions map to fast hardware instructions where available, and are also concise to write in code, so I use them whenever I can.
A naive way to rebalance traffic is to assign a part of the keyspace to each machine. For an object $o$ with key $k$, the machine serving the object can be found by computing $h(k) \bmod n$, where $h$ is a hash function such as SHA or MD5, and $n$ is the number of machines.
Advantages
Problems
Consistent Hashing is an optimization on the naive solution, in that it avoids the need to copy the entire dataset. Each key is projected as a point on the edge of a unit circle. Every node in the cluster is assigned a point on the edge of the same circle. Then each key is served by the node closest to it on the circle’s edge.
When a machine is added to or removed from the cluster, each machine gives up / takes up a small number of keys. This rebalance happens in such a fashion that only $O(\frac{K}{n})$ transfers happen, where $K$ is the number of keys and $n$ is the number of machines in the cluster.
Advantages
Problems
Note that in both the above methods, when you are adding or removing machines, there is some amount of shutdown involved. In the first case, we need to completely turn off reads and writes, because the cluster is going through a complete rebalance. In the second case, we can turn off reads and writes for just a $\frac{1}{n}$ fraction of the keyspace, as compared to all of it in the first solution.
Pinterest, in its blogpost about MySQL sharding, talks about a setup where they use the key itself as a marker of which shard the key belongs to. When doing a write for the first time on the object $o$, we generate a key for it, in which we keep the higher $x$ bits reserved for the shard the object belongs to. The next time there is a request for a read, we use those $x$ bits to find which shard should be queried.
Each shard is located on a certain machine, and the shard -> machine map is kept in ZooKeeper (a good place to read & write small configs in a consistent fashion). Upon finding the shard, we look up this map to locate the machine to which the request needs to be made.
When new machines are added, we simply create more shards, and start using those new shards when populating the shard bits of new keys. This way, new writes and the reads corresponding to those writes don’t hit the existing machines.
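A sketch of the key layout and lookup, under several assumptions: 64-bit keys, $x = 16$ shard bits (an arbitrary choice for this sketch), and an in-memory map standing in for the shard -> machine map that would live in ZooKeeper. The names `makeKey`, `shardOf` and the host name are hypothetical.

```go
package main

import "fmt"

const (
	shardBits = 16             // x in the text; 16 is an arbitrary choice here
	localBits = 64 - shardBits // remaining bits identify the object within its shard
)

// makeKey embeds the shard id in the upper shardBits bits of the key.
func makeKey(shard uint16, localID uint64) uint64 {
	return uint64(shard)<<localBits | localID&(1<<localBits-1)
}

// shardOf recovers the shard id from a key without any lookup.
func shardOf(key uint64) uint16 {
	return uint16(key >> localBits)
}

func main() {
	// Stand-in for the shard -> machine map kept in ZooKeeper.
	shardToMachine := map[uint16]string{42: "db-host-7"}

	key := makeKey(42, 1001)
	shard := shardOf(key)
	fmt.Println(shard, shardToMachine[shard]) // 42 db-host-7
}
```

Since the shard is decided once at write time and baked into the key, moving a shard only requires updating one entry in the map, and no existing key ever has to be re-hashed.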
I’m going to refer to this as the “Pinterest trick”, because I read it on their blog. Pretty sure, this is not the first time it’s being done though.
Advantages
Disadvantages
Another trick that some setups apply is to have the keyspace sufficiently presharded to begin with. Then these shards are simply moved to other machines, if their current hosts can’t serve them, as traffic increases. For MySQL, each shard is a separate database instance. We used a similar approach when operating HBase at FB, where we expected the need to add more machines in future.
A discussion with Dhruv brought up an interesting point: Why are we sharding a database? Sure, we want to scale horizontally. But which resource are we running out of? CPU, Disk, Network?
The above tricks that we discussed scale for disk. Note that, in the case of the Pinterest trick, new shards don’t proportionately help with serving existing read queries. For most Social Networks, the amount of data being created outpaces consumption, and they are bound on disk, rather than CPU.
If you are bound on CPU, there are several ways to move your shards to not-so-hot machines, depending on which tradeoff you would like to make:
I wrote a lot of this from high-level knowledge, and from discussions with people who have worked on these kinds of things. I might have omitted something, or written something that is plainly incorrect. Moreover, this is an open topic, with no “right answers” that apply to all. If you have any comments about this topic, please feel free to share in the comments section below.
The first day was dedicated to tutorials. Most tutorials were ‘survey’-like in content, in that they did not present anything new, but were good if you want to get an idea about what’s happening in a particular area of IR.
Deep Learning Tutorial
This tutorial was conducted by members of the Huawei Labs, China. It was about the current state of Deep Learning in the industry.
Succinct Data Structures
This tutorial was about data structures related to inverted indices. It went over both theory (minimal and easy to understand) and a hands-on session. I really enjoyed this tutorial.
The second day started with a keynote from Christopher Manning. Some highlights from the talk:
Listing notes from a few talks which were interesting:
Learning to Rank with Selection Bias in Personal Search (Google)
Fast and Compact Hamming Distance Index
Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising
Day 4 had talks from the Industry which talked about IR systems at scale (big / small). I found these to be very interesting. It’s sad that they were not recorded.
Search Is Not a Box - Harad Shemtov (Google)
There was a related paper presented by Ido Guy from Yahoo! Research.
Searching by Talking: Analysis of Voice Queries on Mobile Web Search
When Watson Went to Work - Aya Soffer (IBM Research)
Ask Your TV: Real-Time Question Answering with Recurrent Neural Networks - (Ferhan Ture, Comcast)
Amazon Search: The Joy of Ranking Products (Daria Sorokina)
Learning to Rank Personalized Search Results in Professional Networks (Viet Ha-Thuc)
* LinkedIn Search has different use cases (recruiting, connecting, job seeking, sales, research, etc.)
* They would want to personalize the results for recruiters, job seekers, etc.
* Use LinkedIn “skills” as a way to cluster users, the underlying assumption being that people with similar skills are likely to connect (while also removing unhelpful skills).
* Getting intent estimations for different intents.
* Use intent estimations for their federated search (people result, job result, group result, etc.)
* Slides here
The last day had workshops. I attended the first few talks of the NeuIR (Neural Network IR) workshop, before I had to leave to catch my flight. The keynote was given by Tomas Mikolov from FAIR. His slides are here.
Key points:
Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015
Query to Knowledge: Unsupervised Entity Extraction from Shopping Queries using Adaptor Grammars
Explicit In Situ User Feedback for Search Results
This was my first IR conference, and an academic conference in a long time. These are the key takeaways:
To those who are new to LISP, it is pretty simple to explain. LISP programs are based on something called ‘s-expressions’. An s-expression looks like this: $(\mathrm{operator}\ \mathrm{operand_1}\ \mathrm{operand_2}\ \mathrm{operand_3}\ …)$.
For example:
The operands can themselves be recursively computed too.
For example, $(+$ $1$ $(*$ $2$ $3))$ is a valid expression. First we evaluate the inner $(*$ $2$ $3)$ part, then the original expression resolves to $(+$ $1$ $6)$, which then evaluates to $7$. This can go on recursively.
For someone who wants to design an interpreter, LISP is the ideal reallife language to start with. This is for two reasons:
I could only stay motivated, and bring this project to a closure, because you can pick a very small subset of LISP, and still do a lot of interesting things.
To keep you motivated about reading the article, let’s do a demo first, and then we can go into the details of how I built this.
Here is the GitHub repository for the interpreter, and the app on iTunes (Lambda Lisp). Feel free to file issues / contribute.
If you are still reading: Let’s build an interpreter!
Lexing involves finding lexemes, or syntactical tokens, which can then be combined to interpret a grammatical sentence. In the expression $(+$ $1$ $2)$, the tokens are [$($, $+$, $1$, $2$, $)$]. Sophisticated compilers use lex or flex for finding these tokens, handling whitespace, attaching a token type to each of them, etc.
I did not want to bloat up my simple interpreter by using lex / flex. I found this nifty one-line bare-bones Lexer in Peter Norvig’s article:
Essentially, what this does is handle whitespace (somewhat). It adds spaces around the brackets, and then splits the expression on whitespace.
We need to do the replacement for all operators, but otherwise it works well. This is because LISP is simple enough that attaching types to tokens (and erroring out, if required) can be done at the time of parsing. This is how I did it in Go, just for completeness’ sake.
The recurrent theme in the design of this interpreter, is to be lazy and push the harder problems to the next layer.
Given an expression, we would need to make sure that the expression follows a structure, or a Grammar. This means two things in our case:
At this stage, we are only concerned about the well-formedness of the s-expression. We don’t care if the $+$ operator received incompatible operands, for instance. This means that given an expression like $(+$ $1)$, we would mark this expression as okay at this point, because the expression is well-formed. We will catch the problem of too few operands to $+$ at a later time.
We can start solving the problem of checking the well-formedness of the expression by using an Abstract Syntax Tree (or AST). An AST is a way of representing the syntactic structure of code. In this tree, the leaf nodes are atomic values, and all the non-leaf nodes are operators. Recursion can be naturally expressed using an AST.
This is how we can represent a node of this tree in the code:
To actually verify the wellformedness of the expression and build the AST, we would go about it this way:
You can see how the grammar for the s-expression is hard-coded in the code here. We expect the expression to be either a single value, or something like $(\mathrm{operator}\ \mathrm{o_1}\ \mathrm{o_2}\ …\ )$, where $\mathrm{o_i}$ can be an atomic value, or a nested expression.
Note that we construct the AST slightly differently: the operator is also part of the `children`.
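To illustrate the shape of such a parser, here is a compact recursive-descent sketch that checks well-formedness while building the tree, with the operator kept as the first child as noted above. The `node` type and the `parse` / `parseExpr` names are hypothetical and much smaller than a full implementation would be.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// node is a leaf when it has no children (value holds the token), and a
// list otherwise. For a list, children[0] is the operator.
type node struct {
	value    string
	children []*node
}

// parse builds an AST from tokens, or reports why they are not well-formed.
func parse(tokens []string) (*node, error) {
	n, rest, err := parseExpr(tokens)
	if err != nil {
		return nil, err
	}
	if len(rest) != 0 {
		return nil, errors.New("unexpected tokens after expression")
	}
	return n, nil
}

// parseExpr consumes one expression and returns the unconsumed tokens.
func parseExpr(tokens []string) (*node, []string, error) {
	if len(tokens) == 0 {
		return nil, nil, errors.New("unexpected end of input")
	}
	head, rest := tokens[0], tokens[1:]
	switch head {
	case ")":
		return nil, nil, errors.New("unexpected )")
	case "(":
		list := &node{}
		for len(rest) > 0 && rest[0] != ")" {
			child, remaining, err := parseExpr(rest)
			if err != nil {
				return nil, nil, err
			}
			list.children = append(list.children, child)
			rest = remaining
		}
		if len(rest) == 0 {
			return nil, nil, errors.New("missing )")
		}
		return list, rest[1:], nil // skip the closing bracket
	default:
		return &node{value: head}, rest, nil
	}
}

func main() {
	tokenize := func(src string) []string {
		return strings.Fields(strings.NewReplacer("(", " ( ", ")", " ) ").Replace(src))
	}
	ast, err := parse(tokenize("(+ 1 (* 2 3))"))
	fmt.Println(len(ast.children), err) // 3 <nil>
	_, err = parse(tokenize("(+ 1"))
	fmt.Println(err) // missing )
}
```

Note that `(+ 1)` parses successfully here, which matches the idea that operand checks are deferred to the operator.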
We combine the parsing and evaluation of the AST into one stage. The result of evaluating an AST is an `Atom`, which can either have a `Value` or an error.
Here is a stripped down AST evaluation code:
Basic evaluation is very simple. We have a struct called `LangEnv`, which is the ‘environment’ data structure storing, amongst other things, the defined operators. When evaluating an AST, if it is a single node, the value of the node is the result. Otherwise, we simply look up the operator in the environment using `getOperator`, then resolve the operands recursively, and pass the operands to the operator. The operator deals with making sure that the operands are sane.
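The evaluation loop described above can be sketched as follows. This is a simplification under stated assumptions: values are plain `float64`s instead of `Atom`s, the `env` type and its `ops` map stand in for `LangEnv` and `getOperator`, and the AST is built by hand using hypothetical `leaf` / `list` helpers.

```go
package main

import (
	"fmt"
	"strconv"
)

// node keeps the operator as the first child of a list node.
type node struct {
	value    string
	children []*node
}

// operator is a simplified handler: it receives already-evaluated operands
// and is responsible for checking that they are sane.
type operator func(args []float64) (float64, error)

type env struct {
	ops map[string]operator
}

// eval returns the value of a leaf, or looks up the operator of a list,
// evaluates the operands recursively, and hands them to the operator.
func (e *env) eval(n *node) (float64, error) {
	if len(n.children) == 0 {
		return strconv.ParseFloat(n.value, 64)
	}
	symbol := n.children[0].value
	op, ok := e.ops[symbol]
	if !ok {
		return 0, fmt.Errorf("unknown operator %q", symbol)
	}
	args := make([]float64, 0, len(n.children)-1)
	for _, child := range n.children[1:] {
		v, err := e.eval(child)
		if err != nil {
			return 0, err
		}
		args = append(args, v)
	}
	return op(args)
}

func leaf(s string) *node { return &node{value: s} }

func list(children ...*node) *node { return &node{children: children} }

// newEnv registers + and * as left folds that need at least two operands.
func newEnv() *env {
	fold := func(name string, f func(a, b float64) float64) operator {
		return func(args []float64) (float64, error) {
			if len(args) < 2 {
				return 0, fmt.Errorf("%s needs at least two operands", name)
			}
			acc := args[0]
			for _, v := range args[1:] {
				acc = f(acc, v)
			}
			return acc, nil
		}
	}
	return &env{ops: map[string]operator{
		"+": fold("+", func(a, b float64) float64 { return a + b }),
		"*": fold("*", func(a, b float64) float64 { return a * b }),
	}}
}

func main() {
	e := newEnv()
	// (+ 1 (* 2 3))
	ast := list(leaf("+"), leaf("1"), list(leaf("*"), leaf("2"), leaf("3")))
	fmt.Println(e.eval(ast)) // 7 <nil>
}
```

The too-few-operands case for `(+ 1)` is rejected by the operator itself, not by the parser or by `eval`.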
An operator looks something like this:
As seen, `symbol` is the name of the operator, so for binary addition it can be “+”. `handler` is the function which will actually do the heavy lifting we have been avoiding all this while. It takes in a slice of `Atom`s (and a `LangEnv`, more on that later) and returns an `Atom` as a result.
Now, the fun stuff.
Remember that `Atom` has a `Value` inside? `Value` is an interface, and any type which wants to be a `Value` needs to implement the following methods.
This is enough power to figure out which value is of which type. In `LangEnv` we keep a list of built-in `Value` types, such as `intValue`, `floatValue`, `stringValue`, etc.
To deduce the type of a value, we simply do this:
Now imagine an expression like $(+$ $1.2$ $3)$.
$1.2$ resolves to `floatValue`, and $3$ resolves to `intValue`. How would we implement the `handler` method for the $+$ operator to add these two different types? You might say that this would involve casting $3$ to `floatValue`. But how do we decide which values to cast, and what type we should cast them to?
This is how we do it. In the method `typeCoerce`, we try to find out which is the right value type to cast all our values to. It is declared like:
This is what we do inside `typeCoerce`:
Hence, the $+$ operator could be implemented this way:
Here we basically just call `typeCoerce` on the operands, and if it’s possible to cast them to one single type, we do that casting, and actually perform the addition in the new type.
The $+$ operator can be used to add strings as well. However, we don’t want something like $(+\ 3.14\ \mathrm{“foo”})$ to work. The `typeCoerce` method can be trivially extended to support a list of valid type precedence maps, where all operands need to belong to the same precedence map. In this case, it could be `{ {intType: 1, floatType: 2}, {stringType: 1 } }`. This particular list ensures that we don’t add ints and strings, for example, because they don’t belong to the same precedence map.
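One way the precedence-map idea could look in code is sketched below. The signature is a guess made for illustration and not the declaration used in the interpreter: here `typeCoerce` works on type tags only and returns the tag that every operand should be cast to.

```go
package main

import (
	"errors"
	"fmt"
)

type valueType string

const (
	intType    valueType = "int"
	floatType  valueType = "float"
	stringType valueType = "string"
)

// typeCoerce returns the highest-precedence type among the operands, as long
// as all of them belong to one and the same precedence map.
func typeCoerce(operands []valueType, precedences []map[valueType]int) (valueType, error) {
	if len(operands) == 0 {
		return "", errors.New("no operands to coerce")
	}
	for _, prec := range precedences {
		best, allKnown := operands[0], true
		for _, t := range operands {
			rank, ok := prec[t]
			if !ok {
				allKnown = false // this map does not cover every operand
				break
			}
			if rank > prec[best] {
				best = t
			}
		}
		if allKnown {
			return best, nil
		}
	}
	return "", errors.New("operands cannot be coerced to a common type")
}

func main() {
	precedences := []map[valueType]int{
		{intType: 1, floatType: 2},
		{stringType: 1},
	}
	fmt.Println(typeCoerce([]valueType{intType, floatType}, precedences)) // float <nil>
	_, err := typeCoerce([]valueType{floatType, stringType}, precedences)
	fmt.Println(err != nil) // true
}
```

A handler for $+$ would then cast each operand to the returned type before adding, and surface the error otherwise.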
Note that the entire implementation of the operator is defined in the `Operator` struct’s `handler` method. Whether the operator supports this sort of casting, rolls its own, or does not use it at all, is the prerogative of the operator.
A typical variable definition could look like this: $(\mathrm{defvar}\ \mathrm{x}\ 3.0)$.
Now, `defvar` is an operator too. It expects the first argument to be of `varType` (i.e., it matches the regex for a variable name), and the value can be anything (except `varType`). We just need to check that the type conditions match, and that the variable is not a defined operator. If both are okay, we define the variable in our `LangEnv`’s `varMap`.
We need to change our `evalAST` method to support variable lookup.
Here we can assume we have a helper method called `getVarValue`, which looks up the value of a variable in the `varMap`, or throws an error if required (this is also simple to implement).
Defining methods is even more fun! We use the `defun` operator. Consider this:
The first operand is the name of the method, the second is the list of parameters that you would pass to the method, and the last is the actual method body, which assumes that the variables it needs are already defined. `(circlearea 10)` after this definition should return `314`.
Calculating the area of a rectangle is pretty similar.
We need a couple of things for function calls to work fine:
* A new value type, `astValue`, which can be used for keeping ASTs. So far we were keeping ints, floats and so on.
* A change in `evalAST` to not evaluate the AST in the `defun` arguments. This is because in `circlearea`, the `(* 3.14 r r)` itself is the value (an AST value).
* The `defun` operator needs to add an operator to the `opMap`, with the same name as the method, and define its `handler` method.
* Correct scoping. Consider `(defvar x 3.0)` defining the variable `x`, followed by `(defun foo (x) (+ 1 x))`, which defines a method that uses a param labelled `x`. The interpreter may look at the `varMap` and think that the programmer wants to use the global `x`, which is $3.0$. The actual intention of the programmer is to use the parameter `x`.
For this to work correctly, we would need:
* A new `LangEnv` to be created inside the `handler`.
* First, copy the `varMap` of the parent `LangEnv` passed to the handler.
* Then copy the params passed to the handler. Any duplicates will be overwritten, but all global definitions would be preserved. The variable defined in the inner scope would be visible.
* Inside the handler, we call `evalAST` to evaluate the AST we were provided in the method definition, with the new `LangEnv`.
* We also keep track of the recursion depth in `LangEnv`, and it is incremented every time a recursive call is made. If it exceeds a large value (100000 for now), we can error out, so as to salvage the interpreter at least.
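The scoping steps above can be sketched like this. The `langEnv` type here is a stripped-down stand-in for `LangEnv` (only a `float64` variable map and a depth counter), and `childEnv` is a hypothetical helper name for the copy-then-overlay step.

```go
package main

import "fmt"

const maxDepth = 100000 // recursion limit mentioned in the text

type langEnv struct {
	varMap map[string]float64
	depth  int
}

// childEnv creates the environment for one function call: it copies the
// parent's variables, then overlays the parameters so that they shadow any
// globals with the same name. The parent is left untouched.
func (e *langEnv) childEnv(params []string, args []float64) (*langEnv, error) {
	if len(params) != len(args) {
		return nil, fmt.Errorf("expected %d arguments, got %d", len(params), len(args))
	}
	if e.depth+1 > maxDepth {
		return nil, fmt.Errorf("recursion depth exceeded %d", maxDepth)
	}
	child := &langEnv{
		varMap: make(map[string]float64, len(e.varMap)+len(params)),
		depth:  e.depth + 1,
	}
	for name, v := range e.varMap {
		child.varMap[name] = v
	}
	for i, name := range params {
		child.varMap[name] = args[i]
	}
	return child, nil
}

func main() {
	global := &langEnv{varMap: map[string]float64{"x": 3.0, "y": 5.0}}
	// Calling (foo 10) where foo was defined with the parameter list (x).
	child, err := global.childEnv([]string{"x"}, []float64{10})
	if err != nil {
		panic(err)
	}
	fmt.Println(child.varMap["x"], child.varMap["y"], global.varMap["x"]) // 10 5 3
}
```

Copying the whole map on every call is simple but costs time proportional to the number of variables; chaining environments with a parent pointer is a common alternative.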
This is the only complicated part of the interpreter. Those interested in the code can check it out here.
Apart from making sure that we have some sort of recursion depth limit enforced, recursion does not need any special handling, except for defining some new operators like `cond` (the if-else equivalent), which are required for writing something useful.
Here is the implementation of the factorial function:
`fact(10)` returns `3628800` as expected.
Once I had the interpreter working fine, I wanted to run it in an iOS app. Why? Just for fun. It turns out that Go 1.5 came with a new tool called gomobile. Hana Kim gave a talk about it at GopherCon 2015.
What it does is compile your Go package into a static library, generate Objective-C bindings for it, and wrap them together in a nice iOS-friendly `.framework` package.
There are a few restrictions, such as not being able to return complex types like structs within structs, but apart from that it was fairly easy to use in my bare-bones app. Here is the code for the app, and we have already seen the demo earlier.
(Thanks to Dhruv Matani for reviewing this blogpost.)
Following the theme from the previous post, the first question is: ‘Why do we need it?’. If you are familiar with network programming, or any multithreaded programming which involves blocking IO, you already know the problem at hand. Right from the hardware level to the software level, a common problem is that IO is slower than the CPU. If we have several tasks to finish, and the task currently being executed is waiting for a blocking IO call to finish, we should ideally work on the other tasks, let that IO finish in the background, and check on it later.
When we have several such operations happening in the background, we need a way to figure out when a particular operation (such as a read, a write, or accepting a connection) can be performed without blocking, so that we can quickly do that and return to other things. select(2), poll(2), epoll(4) and kqueue(2) (on *BSD systems) are some of the ways to do it. In essence, you register a set of file descriptors that you are interested in, and then call one of these ‘backends’. They usually block until one of the fds that you are interested in is ready for data to be read from or written to it. If none of them becomes ready, the call blocks for a configured amount of time and then returns.
The problem that libevent solves is that it provides an easy-to-use library for getting notified when an event happens on the file descriptors which you consider interesting. It also hides the real backend (select, epoll, kqueue) being used, which helps you avoid writing platform-dependent code (e.g., kqueue works only on *BSD), and if there were a new and improved backend in the future, your code would not change. It is like the JVM for asynchronous event notification systems.
I only have experience with `select`, so my context is limited. Using `select` is very tedious.
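A minimal sketch of such a select loop, written in Python rather than C: the standard `select` module wraps the same select(2) call, taking lists of sockets where C takes an `fd_set`. The `socket.socketpair()` stands in for real client connections, and `run_select_loop` is a hypothetical name, so treat this as an illustration of the loop's shape rather than the original listing.

```python
import select
import socket


def run_select_loop(watched, iterations, timeout_secs):
    """Run a fixed number of select() iterations and log what happened."""
    log = []
    for _ in range(iterations):
        # The full list of interesting sockets is passed in on every call.
        readable, _, _ = select.select(watched, [], [], timeout_secs)
        if not readable:
            log.append("timeout")  # nothing became ready in time
            continue
        # We have to check each socket we care about ourselves.
        for sock in watched:
            if sock in readable:
                log.append(sock.recv(1024))
    return log


# A connected pair of sockets: writing to `right` makes `left` readable.
left, right = socket.socketpair()
right.sendall(b"ping")

events = run_select_loop([left], iterations=2, timeout_secs=0.2)
print(events)  # first iteration reads the data, second one times out

left.close()
right.close()
```

Even in this small form, the bookkeeping (rebuilding the set, checking every descriptor, handling the timeout) lives in user code.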

In essence, what we do here is to create a set of file descriptors (fd_set). We then run a loop where, every time, we add all the file descriptors we are interested in to that set. Then we call select(), and it either times out or returns with only the ready descriptors left in the set. We then have to check each of the file descriptors ourselves. This makes it a little ugly. Other backends might be more pleasant to use, but libevent is way ahead of select in terms of usability. Here is some sample code:

An event_base represents a single event loop like the one that we saw in the select example. To subscribe to changes in a file descriptor, we first create an event. This can be done using event_new, which takes in the event base we created, the file descriptor, flags signalling when the event is active, the callback method and arguments to the callback method. In this particular example, we ask that the acceptConnCob method be called when the file descriptor is ready for reading (EV_READ), and that the event persist, i.e., even after the event fires, it is automatically added again for the next time (this is what EV_PERSIST is for). Note that we had to add the file descriptors in every iteration of the while loop of the select example, so using the EV_PERSIST flag is a slight convenience here. Once I have created the event, I need to add it to the event_base it should be processed by, along with a timeout, using the event_add method. If the fd doesn’t become active by the specified time, the callback will be called anyway. Finally, to get things running, we ‘dispatch’ the event base, which runs the event loop in the calling thread; the call does not return until the loop stops.
Note that nowhere have I specified which backend to use. I don’t need to. However, there are ways to prefer or avoid certain backends in libevent, using the event_config struct. Refer to the link at the end.
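The register-a-callback-then-dispatch flow described above can be mimicked with Python's standard `selectors` module. This is not libevent's API (libevent is a C library); it is only an analogy under that assumption. `selectors.DefaultSelector` similarly chooses a backend for you, and a registration stays active across calls, which is similar in spirit to EV_PERSIST. The names `on_readable` and `dispatch_once` are hypothetical.

```python
import selectors
import socket


def on_readable(sock, inbox):
    """Callback: runs in the same thread as the loop, so keep it short."""
    inbox.append(sock.recv(1024))


def dispatch_once(selector, inbox, timeout_secs):
    """Wait for ready sockets and invoke the callback stored at registration."""
    for key, _events in selector.select(timeout=timeout_secs):
        callback = key.data
        callback(key.fileobj, inbox)


# DefaultSelector picks the most efficient backend available on this platform.
selector = selectors.DefaultSelector()
left, right = socket.socketpair()
inbox = []

# Register once; the registration stays active until unregister() is called.
selector.register(left, selectors.EVENT_READ, data=on_readable)

for message in (b"first", b"second"):
    right.sendall(message)
    dispatch_once(selector, inbox, timeout_secs=1.0)

print(inbox)  # [b'first', b'second']

selector.unregister(left)
selector.close()
left.close()
right.close()
```

Note that no backend is named anywhere in the sketch, and the socket is never re-registered between the two dispatches.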
I can add multiple events, and there are a lot of options that you can use. For instance, I can create a timer by passing -1 as the file descriptor with the EV_TIMEOUT and EV_PERSIST flags, and the required timeout in event_add. This would call the timeout callback every ‘timeout’ seconds. Example:

I created a simple fortune cookie server (one of my favorite demo examples), where I have a set of messages, and if someone asks me for a fortune cookie, I will give them the current fortune cookie. Every few seconds, I will pick a new fortune cookie to be returned. This is implemented by having two events, one for accepting connections and the other for having a timer. The code is here.
One small thing to keep in mind is that if the callbacks themselves do something heavy, it defeats the purpose of using libevent. This is because the callbacks are called from the same thread as the actual event loop. The longer the callback runs, the longer the event loop is not able to process other events.
libevent allows you to do a lot of customizations. In the above example, I have added callbacks to override the logging mechanism, so that I can use glog (Google’s logging library). There are several other features, such as buffered events and a lot of utility methods (such as creating simple http servers), that you can find in the libevent book.
There are other similar async event notification systems such as libev, and libuv, etc. but I haven’t had the time to figure out the differences. I hope to cover the very interesting wrapper around libevent in folly, Facebook’s open source C++ library, in a future post.
]]>As engineers, I feel we are often excited to work on new and ambitious projects. I am specifically talking about nontrivial projects which break new ground, and/or have a reasonable chance of not succeeding. The latter could be because these projects are often complicated enough that it’s hard to be exact about the benefits. These projects might also touch certain areas of the system which are hazy in general.
‘Hazy’ doesn’t really imply that that particular area / part of the system is naturally hard to understand. It could be just that we don’t know the problem well enough, and how it interacts with those ‘hazy’ areas. I cannot stress enough that it is critical to understand the problem really well beforehand. It seems clichéd, and has been repeated so many times that it will probably not make a good enough impact. So, I will repeat this again in detail, so it stays with you and me a little longer.
As per Prof. Bender, when giving a presentation, making sure that people understand why we did what we did is the most important thing. Extending this backwards, ever wondered if that problem really needs to be solved in the first place? A lot of times, as a new CS graduate working on my first full-time unsupervised big tasks, I would really be in awe of the supposed problem. Looking through rose-tinted glasses, you feel that this is what you had told the recruiter and interviewers that you wanted to do in the job. Excellent, let’s start working on it. And if you do this, and just jump into it directly, you are going to have a bad time.
Often I did not spend enough time understanding why exactly I was doing what I was doing. Do benchmarks show that this is really needed? Do I have a good enough prototype which shows that if I do what I am going to do, it will give us significant benefits? Do people need this? Has this problem been solved before? What is the minimum I can do to solve this reasonably, and move on to other bigger problems?
This proactive research is what I feel is the difference between new and experienced engineers. In fact, I think, in some cases senior engineers write LESS code than the less experienced ones and still get more things done. It’s now clear to me that the actual coding should only take 10% of the time allocated to the project. If I spend enough time doing my due diligence and am ‘lazy’, I can simply prune some potential duds much before they turn into huge time sinks. If I spend some more time on the problems which actually require my time, I can figure out things I can do to reduce the scope of the problem, or cleverly use pre-built solutions to do part/most of the work. All this can only come if we (and I will repeat again) Understand. The. Problem.
(Please let me know if you agree or disagree with me about what I said. I would love to hear back).
[0] Sloth picture courtesy: http://en.wikipedia.org/wiki/File:SlothDWA.jpg
]]>The first question to be asked is, whether we allow permutations? That is, if, $c_1 + c_2 = N$, is one way, then do we count $c_2 + c_1 = N$, as another way? It makes sense to not allow permutations, and count them all as one. For example, if $N$ = 5, and $C$ = {1, 2, 5}, you can make 5 in the following ways: {1, 1, 1, 1, 1}, {1, 1, 1, 2}, {1, 2, 2}, {5}.
We came up with a simple bottom-up DP. I have written a top-down DP here, since it will align better with the next iteration. The idea was, $f(N) = \sum_i f(N - c[i])$ for all valid $c[i]$, i.e., those with $c[i] \le N$. For example, with the coins above, $f(5) = f(5-1) + f(5-2) + f(5-5) = f(4) + f(3) + f(0)$. $f(0)$ is 1, because you can make $0$ in only one way, by not using any coins (there was a debate as to why $f(0)$ is not 0). With memoisation, this algorithm is $O(NC)$ time, with $O(N)$ space.
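A minimal Python sketch of that recurrence with memoisation follows. It assumes positive integer coin values; the name `count_ways_naive` is hypothetical, and this is a reconstruction of the idea, not the original listing.

```python
from functools import lru_cache


def count_ways_naive(amount, coins):
    """Top-down DP for f(N) = sum of f(N - c) over all coins c <= N."""
    coins = tuple(coins)

    @lru_cache(maxsize=None)  # memoise on the remaining amount only
    def f(n):
        if n == 0:
            return 1  # one way to make 0: use no coins
        return sum(f(n - c) for c in coins if c <= n)

    return f(amount)


print(count_ways_naive(5, [1, 2, 5]))
```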

This looks intuitive and correct, but unfortunately it is wrong. Hat tip to Vishwas for pointing out that the answers were wrong, or we would have moved to another problem. See if you can spot the problem before reading ahead.
The problem in the code is, we will count permutations multiple times, for example, for $n = 3$, the result is 3 ({1, 1, 1}, {1, 2} and {2, 1}). {1, 2} and {2, 1} are being treated distinctly. This is not correct. A generic visualization follows.
Assume we start with an amount $n$. We have only two types of coins, worth $1$ and $2$. Now, notice how the recursion tree would form. If we take the coin with denomination $1$ first and the one with denomination $2$ second, we get to a subtree with amount $n - 3$; on the other side, if we take $2$ first and $1$ next, we get a subtree with the same amount. That subtree is counted twice by the above solution, even though the order of the coins does not matter.
After some discussion, we agreed on a top-down DP which keeps track of which coins to use, and avoids duplication. The idea is to always follow a lexicographic sequence when using the coins. It doesn’t matter whether the coins are sorted, as long as you check every coin that you are still allowed to use to see whether it fits. What matters is to always follow the same sequence. For example, say I have three coins {1, 2, 5}. If I have used coin $i$, I can only use coin $i$ and the coins after it from now on. So, if I have used the coin with value $2$, I can only use $2$ and $5$ in the next steps. The moment I use 5, I can’t use 2 any more.
If you follow this, it will only allow sequences of coins in which the coin indices are monotonically non-decreasing, i.e., we won’t encounter a scenario such as {1, 2, 1}. This was done in a top-down DP as follows:
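A minimal Python sketch of this idea, assuming distinct positive coin values. The names `count_ways_ordered` and `first_allowed` are hypothetical, and this is an illustration rather than the original listing. The memo key is now the pair (remaining amount, index of the first coin still allowed).

```python
from functools import lru_cache


def count_ways_ordered(amount, coins):
    """Count coin combinations, never going back to an earlier coin."""
    coins = tuple(coins)

    @lru_cache(maxsize=None)
    def ways(remaining, first_allowed):
        if remaining == 0:
            return 1  # one way to make 0: stop adding coins
        total = 0
        # Only coins at index >= first_allowed may be used from here on.
        for i in range(first_allowed, len(coins)):
            # Check every allowed coin (no early exit), so the coins
            # do not need to be sorted.
            if coins[i] <= remaining:
                total += ways(remaining - coins[i], i)
        return total

    return ways(amount, 0)


print(count_ways_ordered(5, [1, 2, 5]))  # 4
```

The price of the extra index is a larger memo table: one entry per (amount, coin index) pair instead of one per amount.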

Now, this is a fairly standard problem. I decided to check on the interwebs whether my DP skills had become rusty. I found the solution to the same problem on GeeksforGeeks, where they present the solution in bottom-up DP fashion. There is also an $O(N)$ space solution at the end, which is very similar to our first faulty solution, with the key difference that the two loops are exchanged. That is, we loop over the coins in the outer loop and over the amounts in the inner loop.
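A minimal Python sketch of that $O(N)$ space version, with the coins in the outer loop. The name `count_ways` is hypothetical; it is written from the description above, not copied from the GeeksforGeeks listing.

```python
def count_ways(amount, coins):
    """Bottom-up DP over a single table, one coin at a time."""
    table = [0] * (amount + 1)
    table[0] = 1  # one way to make 0: use no coins
    for coin in coins:                       # outer loop: coins
        for n in range(coin, amount + 1):    # inner loop: amounts
            table[n] += table[n - coin]
    return table[amount]


print(count_ways(5, [1, 2, 5]))  # 4
```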

This is almost magical. Changing the order of the loops fixes the problem. I have worked out the table here step by step. Please let me know if there is a mistake.
Step 1: Calculating with 3 coins, up to N = 10. Although we use a one-dimensional array, I have added multiple rows to show how the values change over the iterations.
Step 2: Initialize table[0] = 1.
Step 3: Now, we start with coin 1. Only cell 0 has a value. We start filling in values for $n = 1, 2, 3, …$, since each of these can be made by adding \$1 to the amount one less than it. Thus, the number of ways right now is 1 for every amount, since we are using only the first coin, and the only way to construct an amount is $1 + 1 + 1 + … = n$.
Step 4: Now, we will use coin 2 with denomination \$2. We will start with $n = 2$, since we can’t construct any amount less than \$2 with this coin. Now, the number of ways for making amounts \$2 and \$3 would be $2$ each: one is the existing way using only $1$s, and the other comes from removing the last two $1$s and adding a $2$. Similarly, verify mentally (or manually, on paper) what the answers for the larger amounts would be.
Step 5: We repeat the same for 3. The cells with a dark green bottom are the final values. All others would have been overwritten.
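The rows of the table can be regenerated with a short Python sketch that snapshots the one-dimensional array after each coin. It assumes the three coins have denominations 1, 2 and 3 (the third denomination is a guess based on the steps above), and `table_rows` is a hypothetical helper.

```python
def table_rows(amount, coins):
    """Return the DP table after initialisation and after each coin."""
    table = [0] * (amount + 1)
    table[0] = 1
    rows = [list(table)]          # row 0: only table[0] is set
    for coin in coins:
        for n in range(coin, amount + 1):
            table[n] += table[n - coin]
        rows.append(list(table))  # snapshot once this coin is done
    return rows


labels = ["init", "coin 1", "coin 2", "coin 3"]
for label, row in zip(labels, table_rows(10, [1, 2, 3])):
    print(f"{label:>6}: {row}")
# Printed rows, in order:
#   init: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# coin 1: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# coin 2: [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
# coin 3: [1, 1, 2, 3, 4, 5, 7, 8, 10, 12, 14]
```

The last row holds the final values; every earlier row is overwritten in place in the real one-dimensional array.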
I was looking into where exactly, in this solution, we maintain the monotonic ordering of coins that we wanted in the top-down DP. It is very subtle, and can be understood if you verify step 4 on paper for $4, 5, 6, …$ and see the chains that they form.
In the faulty solution, the amounts are in the outer loop, so by the time we reach amount $n$, we have computed all previous amounts using all possible coins. If we now calculate the count for $n$ using coin $i$ and the result for $n - cval[i]$, it is possible that the result for $n - cval[i]$ includes ways that use coins > $i$. This is undesirable.
However, when we compute the other way round, we process one coin at a time, in that same lexicographical order. So, if we are using the result for $n - cval[i]$, we are sure that it does not include any ways that use coins > $i$: those are only added after the pass for coin $i$ is complete.
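To see the effect of the loop order in isolation, here is a small Python sketch of two bottom-up variants that differ only in which loop is outermost. Both function names are hypothetical, and the amounts-outer variant is a bottom-up rendering of the faulty recurrence, not the original code.

```python
def count_amounts_outer(amount, coins):
    """Faulty order: for each amount, try every coin."""
    table = [1] + [0] * amount
    for n in range(1, amount + 1):
        for coin in coins:
            if coin <= n:
                # table[n - coin] already mixes in every coin.
                table[n] += table[n - coin]
    return table[amount]


def count_coins_outer(amount, coins):
    """Correct order: finish one coin before starting the next."""
    table = [1] + [0] * amount
    for coin in coins:
        for n in range(coin, amount + 1):
            # table[n - coin] only knows coins up to the current one.
            table[n] += table[n - coin]
    return table[amount]


print(count_amounts_outer(3, [1, 2]))  # 3: {1,1,1}, {1,2}, {2,1}
print(count_coins_outer(3, [1, 2]))    # 2: {1,1,1}, {1,2}
```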
As they say, sometimes being simple is the hardest thing to do. This was a simple problem, but it still taught me a lot.
]]>