Simple graph traversal using redis lua UDF

While i was trying to think various approaches on some ‘work’ related projects, this suddenly occurred to me that, For simple graph traversals with edges stored in some DB, a db-server side UDF (User defined function) would be very efficient. In my work place we use aerospike a lot and that made me ask this question in aerospike forums forums (graph-traversal-using-udf). Then I started looking if something like this is possible with redis (which aerospike primarily tries to replace) and they do.

Essentially what we need is the the UDF api being able to make db requests. From what I understand as of now aerospike does not allow that. So I decided to do some basic experiment about this on redis. Theoretically this should bring a significant performance gain as most of the time is spent on the network IO (redis server side latency will be minimal as all the data is stored in memory).

Background on Redis UDFs

Redis supports UDFs in lua language. For detail more details checkout lua website and redis UDF docs. To reproduce my experiment, the only requirements are loading lua scripts to redis server and invoking the UDF from the python client. The below command loads the udf to redis.

redis-cli script load "$(cat ~/find_parent.lua)"
# this loads the lua script in 'find_parent.lua' to redis 
# the output for redis outputs 'SHA1 digest of the script
# 9e4c3550b2961d085916ecdead255cde6b450f6b in my case
# http://redis.io/commands/script-load

Problem Statement

All the edges of a graph are stored, in a redis db. The graph is a forest ( a bunch of trees). Given a node, the goal is to find the root of the tree the input node belongs to.

Generating the graph edges

I wrote the below simple program which populates a forest in redis.

# first arg is to set the group size
# example : if group size = 10
# there will be edges between
# 0-1,1-2,2-3,..8-9 : this tree is rooted at 0
# 10-11, 11-12, ... 18-19 : this tree is rooted at 10
import redis
import sys
tot = 1000
split = int(sys.argv[1])
if tot % split != 0 :
    print "first param should divide 1000 perfectly"
    sys.exit(1)
rows = tot/split
cols = split
rc = redis.StrictRedis(host='127.0.0.1')
for i in xrange(rows):
    rc.delete(i*cols)
    for j in xrange(1, cols):
        rc.set(i*cols + j, i*cols + j - 1)

Standard Approach

The standard approach would be to get parent node in redis until a node is reached which does not have a parent

import redis
import time
import sys
tc = int(sys.argv[1])
rc = redis.StrictRedis(host='127.0.0.1')
def get_root(val):
    limit = 100
    while limit > 0:
        pres = rc.get(val)
        if pres == None : return val
        val = pres
        limit -= 1
    return val
start = time.time()
print "start", start
for i in xrange(tc):
    get_root(str(i%1000))

end = time.time()
print "end  ", end
print end-start

Lua Udf

I wrote the a lua script which does exactly what my get_root function does in the python code in the previous section

--"9e4c3550b2961d085916ecdead255cde6b450f6b"
local ret = KEYS[1]
local limit = 100
while limit ~= 0 do
    local pres = redis.pcall("GET", ret)
    if pres == nil or type(pres) == "boolean" then return ret end
    ret = pres
    limit = limit - 1
end
return ret

This is updated python code to use the UDF

import redis
import time
import sys
tc = int(sys.argv[1])
rc = redis.StrictRedis(host='127.0.0.1')
def get_root(val):
    return rc.evalsha("9e4c3550b2961d085916ecdead255cde6b450f6b", 1, val)
start = time.time()
print "start", start
for i in xrange(tc):
    get_root(str(i%1000))

end = time.time()
print "end  ", end
print end-start

Results

Here are the results in secs. First column has the size of each group (groupsize/2 will be the expected number of traversals). The 2nd and 3rd columns show the time taken for 10K requests (see problem statement) without and with udf

groupsize  without_udf  with_udf
==================================
5          1.39         0.58
10         2.72         0.61
20         5.00         0.75
50         13.64        0.72
100        24.98        0.82
1          0.45         0.57

When every node in itself is a tree (last row), we can observe the extra cost it takes to execute the ‘udf’. But in general when the expected number of traversals keep increasing performance with udf does not degrade while it (obviously) does in the other case, as more traversals means more DB calls (network IO). And this is the case when everything is happening on localhost. When the db is lying on different servers, the udf will be even more beneficial.

Notes/Extra Reading

  1. All these make a lot of sense when the DB is not close to its full load capacity. If full capacity, it gets tricky, but I believe even in that case UDF can only make things better.
  2. Latency Numbers Every Programmer Should Know : https://gist.github.com/jboner/2841832
  3. Source code : https://gist.github.com/dotslash/5238a225072d2e74627c
  4. One downside of this is debugging is really painful and documentation redis has for UDF is pretty poor (IMHO) or its only me (?). But before considering this, keep in mind that ‘Premature optimization is root of all evil’
  5. UPDATE : Why aerospike does not support this
    “The UDF can operate on a single record in an invocation. Once its is invoked for a record, it cannot access other records. The basic reason why this is not allowed is that the record may be on a different node. We don’t want the UDF to do distributed transaction” – https://discuss.aerospike.com/t/graph-traversal-using-udf/2127/2?u=yesteapea.
    It will be interesting to see how well redis handles this when partitioning is enabled

The pains of an average programmer

The last time I attempted a programming contest was on 28th Feb when I was trying to be a purple on codeforces. After a long break I thought of attempting codechef cookoff yesterday and I did.

The first problem was a cake walk. Given co-ordinates of triangles the task is to find how many of them were right-angled.

Based on popularity I chose to solve DIVQUERY next. The task is to find how many numbers in an array (size :10**5) between 2 the given indices (Ex: all values in the array between 2nd and 10th element ) are divisible by a given number K. There can be upto 10**5 such test cases. This looked difficult and almost impossible to solve.

After going through the constraints I observed that the values in the array are less than 10**5. After some thought I understood that this could be solved using some kullu thelivithetalu (street smartness). I estimated that the number of factors any number less than 10**5 can have is less than 100. 2*3*5*7*11*13*17 > 10**5 (miscalculation-1). This means that no number less than 10**5 can have more than 64 (2**(7-1)) factors.

For each number in the array I need to store the index for all its factors. For a given number this store can give me indices of all its multiples in the array. As each number can have a maximum of 64 factors the total space this store requires would be 64* 10**5 (<10**7) which is perfectly fine.

What data-structure to use. I have 2 options

  • An array of vectors: Where the ith vector gives the indices of multiples of ith value in the input array. But initializing 10**5 vectors might be very slow and I never liked vectors and could not convince myself about their amortized constant time inserts. Also as it is not possible to have a good estimate of expected size of each vector I cannot use the reserve space option. Option ruled out (miscalculation-2)
  • Map<long,int> : Where the key is actually a pair .
    num = key%1000000
    index  = key/1000000
    

    So the touple (num,index) is the key and the value is the number of multiples of num in the array upto indexth value in the array. Expected number of inserts is 10**7 and expected time is 2* 10**8 .This is perfectly fine. Map it is

Implementation comes next. At first I need to get all the keys that of the map and store them somewhere. Once this is done the required map can be built by iterating over the keys in ascending order.

Attempt-1:
Store all the keys in the map with some dummy value and give them the correct values during the iteration. Java does not allow modifying a data structure while iterating over it. C++ might be doing the same. So store all the keys in a set and then build the map by iterating over the set.

for (long i = 0; i < n; i += 1){
   scanf("%lld",&x);
   for (long j = 0; j < cnt[x]; j += 1){
      long num = factors[x][j];
      st.insert(i + 1000000l*num);
   }
}
long presNum=0,presValue=0;
set::iterator it;
for(it = st.begin(); it != st.end() ; it++){
   long num = (*it)/1000000l,ind=(*it)%1000000l;
   if(num==presNum){
      presValue++;
   }
   else {
      used_cnt[presNum]=presValue;
      presNum=num;
      presValue=1;
   }
   mp[*it]=presValue;
   used[num][presValue-1]=ind;
}
used_cnt[presNum]=presValue;
 
for (long i_ = 0; i_ < tc; i_ += 1){
   long l,r,k; scanf("%lld %lld %lld",&l,&r,&k);
   l-=2;r--;
   long l1=-1,r1=-1;
 
   for (long i = 0; i < used_cnt[k]; i += 1){       
      long i1 = used[k][i];       
      if(l>=i1 && r>=i1) {
         l1=i1; r1=i1;
      }
      else if(r>=i1) r1=i1;
      else break;
   }
 
   long ans =0;
   if(r1>=0) ans = mp[r1+1000000l*k];
   if(l1>=0) ans = ans-mp[l1+1000000l*k];
   printf("%lld\n",ans);
}

Result: TLE

Attempt-2:
Java is very strict. C++ is very loose and may be it does not do the locking business while iterating. After some digging I discovered that this is the case. So the set is not required.


for (long i = 0; i < n; i += 1){
   scanf("%lld",x);
   for (long j = 0; j &lt; cnt[x]; j += 1){
      long num = factors[x][j];
      mp[i + 1000000l*num]=1;
   }
}

long presNum=0,presValue=0;
map<long,long>::iterator it;

for(it = mp.begin(); it != mp.end() ; it++){
   long num = (it->val1)/1000000l,ind=(it->val1)%1000000l;
   if(num==presNum){
      presValue++;
   }
   else {
      used_cnt[presNum]=presValue;
      presNum=num;
      presValue=1;
   }
   mp[it->val1]=presValue;
   used[num][presValue-1]=ind;
}

Result: TLE

Attempt-3:
May be I’m loosing very narrowly and some minor modifications might result in AC. So I better modify the key to (num<<20) + ind. Getting back num, index from key will be key>>20, key&((1<<20) -1). After lot of pain with bracketting I ended with the code is the best I can do.


for (long i = 0; i < n; i += 1){
   ni(x);
   for (long j = 0; j < cnt[x]; j += 1){
      num = factors[x][j];
      //cout<<num<<"-"<<(i + (num<<20))<<" ";
      mp[i + (num<<20)]=1;
   }

}

long presNum=0,presValue=0,ind,r1;
map<long,long>::iterator it;
long __num = ((1<<20)-1);
for(it = mp.begin(); it != mp.end() ; it++){
   num = (it->val1)>>20,ind= ((it->val1)& __num) ;
   if(num==presNum){
      presValue++;
   }
   else {
      used_cnt[presNum]=presValue;
      presNum=num;
      presValue=1;
   }
   it->val2=presValue;
   used[num][presValue-1]=ind;
}
used_cnt[presNum]=presValue;

Result: TLE

Attempt-4:
What if all the test cases are 1,10**5,1 ? My solution would blow off. Use binary search. I always prefer not to use the upper_bound and lower_bound functions. But now I have to.

l1 = *(upper_bound(used[k],used[k]+used_cnt[k],l)-1);
r1 = *(upper_bound(used[k],used[k]+used_cnt[k],r)-1);

Result: TLE again. Screw this . Its not gonna happen and there is not much I can do as the other questions have to be coded with lot of care and I will definitely mess up as there is less than 20 mins to go . Some fb. Sudden thought : Why not try the vector idea? Nothing to lose.

Attempt-5:
Code code code. I implemented the vetor idea.

for (long i = 0; i < n; i += 1){
   ni(x);
   for (long j = 0; j &lt; cnt[x]; j += 1){
      v[factors[x][j]].pb(i);
   }
}
long l,r,k,l1,r1;
while(tc--){
   scanf("%lld %lld %lld",&l,&r,&k);
   l1=(upper_bound(v[k].begin(),v[k].end(),l-2) -v[k].begin());
   r1=(upper_bound(v[k].begin(),v[k].end(),r-1)-v[k].begin());

   printf("%lld\n",r1-l1);
}

Result: Wtf? Where is the submit button? Contest over! The question is added to practice. Submitted my solution and got a WA. I thought I messed up some where.

I opened the editorial and after a quick glance I saw the number 128. Then I realised my miscalculation-1 and then modified the declaration of factors to factors[100001][150] (from factors[100001][100]) and the result is an AC.

Counting the number of set bits in an integer

Most natural ways to do this is to loop through all bits in the integer increment the count for each set bit.

int NumberOfSetBits(int n){
  int count = 0;
  while(n){
    count += n & 1;
    n >>= 1;
  }
  return count;
}

But we can do better. The following code works perfectly fine for counting the number of bits set.

//for 64 bit numbers
int NumberOfSetBits64(long long i)
{
    i = i - ((i >> 1) & 0x5555555555555555);
    i = (i & 0x3333333333333333) +
        ((i >> 2) & 0x3333333333333333);
    i = ((i + (i >> 4)) & 0x0F0F0F0F0F0F0F0F);
    return (i*(0x0101010101010101))>>56;
}
//for 32 bit integers
int NumberOfSetBits32(int i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    i = ((i + (i >> 4)) & 0x0F0F0F0F);
    return (i*(0x01010101))>>24;
}

The magic numbers might seem very confusing. But they are not random (obviously!).
Let us start with 0X55555555. Binary representation of 5 is 0101. So 0X55555555 has 16 ones, 16 zeros and the ones,zeros take alternate positions. Similarly 0X33333333 has 16 ones, 16 zeros and 2 consecutive ones, 2 consecutive zeros alternate.

F = 0x55555555 = 01010101010101010101010101010101
T = 0x33333333 = 00110011001100110011001100110011
O = 0x0f0f0f0f = 00001111000011110000111100001111

Now split the number into 16 sets of 2 bits
Ex: 9 = 1001 = 00,00,…..00,10,01
First step that algorithm does is to count the number of set bits in each of these 2-bit numbers. As each 2-bit number can have a maximum of 2 set bits. Each of these counts fit into a 2-bit number. That is what the below snippet does.

i = (i & 0x55555555) + ((i >> 1) & 0x55555555);

‘i & F’ preserves only the odd positioned ones. (i>>1) & F preserves only the even positioned ones and gets the even ones to odd positions. Adding them results in each pair of 2-bits having the number of set bits in their corresponding positions.
Similarly the after next 2 steps we end up having a number which can be seen as 4 8-bit numbers whose sum is the required count.

i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
i = (i & 0x0F0F0F0F) + (i >> 4)) & 0x0F0F0F0F);

Multiplying this number with 0X01010101 will result in a number whose most significant 8 bits contain the required count. Finally on summing up all these steps we get the following code.

int NumberOfSetBits32(int i)
{
    i = (i & 0x55555555) + ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    i = (i & 0x0F0F0F0F) + (i >> 4)) & 0x0F0F0F0F);
    return (i*(0x01010101))>>24;
}

Here some more micro optimizations are possible. After giving some thought it can be understood that the 2 lines of code in the below snippet produce the same result.

i = (i & 0x55555555) + ((i >> 1) & 0x55555555);
i = i - ((i >> 1) & 0x55555555);

In general ‘&’ operator is not distributive over ‘+’. But in this case it does. In i and i>>4, each of the eight 4-bit numbers has a maximum value of 4 (fits in 3 bits). And adding these numbers will not cause an overflow in any of the 4-bit slots. So it works in this case. (Think of why it will not work int the previous steps).

i = (i & 0x0F0F0F0F) + (i >> 4)) & 0x0F0F0F0F);
i = (i & + (i >> 4)) & 0x0F0F0F0F;

So finally we arrive at the code for NumberOfSetBits32 as in the 2nd snippet ( NumberOfSetBits64 can be implemented in similar lines).

Resolutions 2013

No Shit: facebook, quora
Last year I got completely addicted to  facebook and quora.  My hands couldn’t be still on seeing status updates of friends in my close circle. I could not resist myself from not sharing(on facebook) anything I find interesting. This is good upto a certain extent , but everything has a limit and I crossed it. Finally I became the guy who always has fb open and scrolls down and down till it takes practically infinite time to load the feed.

Quora says it connects me to everything I want to know about. That is true. But it connects to many other things I don’t need at all. If not facebook in the second half of 2012 it was quora I was browsing all the time. The bad thing about quora is that , it gives me a false feeling that I’m doing some thing useful and that it is not a waste of time. I spent almost 4 hrs/day on quora in week days and even more in weekends.

Twitter is an exception and remains so. I was never addicted to twitter. One good thing about twitter is most of my friends are not active on twitter and nothing inspires me to write a comment (reply). I’m nothing more than a spectator. It is the same with quora as well. But the answers/posts are generally long. It eats a lot of time.

So ‘no shit’ boils down to no posts on facebook, not more than a couple of hours per week on quora. So from now on fb is just a platform for chatting and poking.

Github/BitBucket
Why do I need github/BitBucket

  • I always backed up code on my external hard disk. I never felt the need for an online backup. But after losing my hard disk , everything is gone. Suddenly I realized the need for an online backup.
  • Every company hosts their code on some content tracker. I never cared to know about them properly. Getting familiar with them might avoid some pains  in my professional life.
  • I was such a fool to not know that git is not just about github and gitbub one of the many online services that host git repositories.

From this year all not-shitty, non-proprietary code I write will be on github/bitbcket.

Coursera/Udacity
How should I spend the time I gain by giving up facebook and quora? Coursera/udacity is the answer.  Coursera has some very  nice and interesting courses taught by professors from top universities.

Here is my coursera wish list . The target is not to complete all these courses.  It is difficult to put in continuous,  consistent efforts for them (coursera expects that). Also there is no point in solving assignments on topics with which I’m super familiar. Learning is the only target.

Lose Weight
This needs no explanations!

Python for course assignments

Python is a beautiful programming language . The code looks very neat and verbal. To give it a try I chose to do my course assignments in python. Here are the advantages of using python over other languages for the course assignments.

  1. Not many students use python for assignments. This is a big advantage because if you get stuck in the assignments , you can borrow your friend’s code (c++/java) and blindly covert it line by line to python. You don’t have to worry about instructor’s warning on copying course assignments.
  2. The code looks very clean. It forces code indentation which is a good programming practice.
  3. Every damn thing is already implemented in the language and the documentation is pretty good.
  4. Java libraries can also be used with python via jython.
  5. Python might be very slow compared to C++/java. But speed is never an issue for course assignments. So it should not be a problem!

Happy python’ing !