Essentially what we need is the the UDF api being able to make db requests. From what I understand as of now aerospike does not allow that. So I decided to do some basic experiment about this on redis. Theoretically this should bring a significant performance gain as most of the time is spent on the network IO (redis server side latency will be minimal as all the data is stored in memory).
Redis supports UDFs in lua language. For detail more details checkout lua website and redis UDF docs. To reproduce my experiment, the only requirements are loading lua scripts to redis server and invoking the UDF from the python client. The below command loads the udf to redis.
redis-cli script load "$(cat ~/find_parent.lua)" # this loads the lua script in 'find_parent.lua' to redis # the output for redis outputs 'SHA1 digest of the script # 9e4c3550b2961d085916ecdead255cde6b450f6b in my case # http://redis.io/commands/script-load
All the edges of a graph are stored, in a redis db. The graph is a forest ( a bunch of trees). Given a node, the goal is to find the root of the tree the input node belongs to.
I wrote the below simple program which populates a forest in redis.
# first arg is to set the group size # example : if group size = 10 # there will be edges between # 0-1,1-2,2-3,..8-9 : this tree is rooted at 0 # 10-11, 11-12, ... 18-19 : this tree is rooted at 10 import redis import sys tot = 1000 split = int(sys.argv[1]) if tot % split != 0 : print "first param should divide 1000 perfectly" sys.exit(1) rows = tot/split cols = split rc = redis.StrictRedis(host='127.0.0.1') for i in xrange(rows): rc.delete(i*cols) for j in xrange(1, cols): rc.set(i*cols + j, i*cols + j - 1)
The standard approach would be to get parent node in redis until a node is reached which does not have a parent
import redis import time import sys tc = int(sys.argv[1]) rc = redis.StrictRedis(host='127.0.0.1') def get_root(val): limit = 100 while limit > 0: pres = rc.get(val) if pres == None : return val val = pres limit -= 1 return val start = time.time() print "start", start for i in xrange(tc): get_root(str(i%1000)) end = time.time() print "end ", end print end-start
I wrote the a lua script which does exactly what my get_root function does in the python code in the previous section
--"9e4c3550b2961d085916ecdead255cde6b450f6b" local ret = KEYS[1] local limit = 100 while limit ~= 0 do local pres = redis.pcall("GET", ret) if pres == nil or type(pres) == "boolean" then return ret end ret = pres limit = limit - 1 end return ret
This is updated python code to use the UDF
import redis import time import sys tc = int(sys.argv[1]) rc = redis.StrictRedis(host='127.0.0.1') def get_root(val): return rc.evalsha("9e4c3550b2961d085916ecdead255cde6b450f6b", 1, val) start = time.time() print "start", start for i in xrange(tc): get_root(str(i%1000)) end = time.time() print "end ", end print end-start
Here are the results in secs. First column has the size of each group (groupsize/2 will be the expected number of traversals). The 2nd and 3rd columns show the time taken for 10K requests (see problem statement) without and with udf
groupsize without_udf with_udf ================================== 5 1.39 0.58 10 2.72 0.61 20 5.00 0.75 50 13.64 0.72 100 24.98 0.82 1 0.45 0.57
When every node in itself is a tree (last row), we can observe the extra cost it takes to execute the ‘udf’. But in general when the expected number of traversals keep increasing performance with udf does not degrade while it (obviously) does in the other case, as more traversals means more DB calls (network IO). And this is the case when everything is happening on localhost. When the db is lying on different servers, the udf will be even more beneficial.
The first problem was a cake walk. Given co-ordinates of triangles the task is to find how many of them were right-angled.
Based on popularity I chose to solve DIVQUERY next. The task is to find how many numbers in an array (size :10**5) between 2 the given indices (Ex: all values in the array between 2nd and 10th element ) are divisible by a given number K. There can be upto 10**5 such test cases. This looked difficult and almost impossible to solve.
After going through the constraints I observed that the values in the array are less than 10**5. After some thought I understood that this could be solved using some kullu thelivithetalu (street smartness). I estimated that the number of factors any number less than 10**5 can have is less than 100. 2*3*5*7*11*13*17 > 10**5 (miscalculation-1). This means that no number less than 10**5 can have more than 64 (2**(7-1)) factors.
For each number in the array I need to store the index for all its factors. For a given number this store can give me indices of all its multiples in the array. As each number can have a maximum of 64 factors the total space this store requires would be 64* 10**5 (<10**7) which is perfectly fine.
What data-structure to use. I have 2 options
num = key%1000000 index = key/1000000
So the touple (num,index) is the key and the value is the number of multiples of num in the array upto indexth value in the array. Expected number of inserts is 10**7 and expected time is 2* 10**8 .This is perfectly fine. Map it is
Implementation comes next. At first I need to get all the keys that of the map and store them somewhere. Once this is done the required map can be built by iterating over the keys in ascending order.
Attempt-1:
Store all the keys in the map with some dummy value and give them the correct values during the iteration. Java does not allow modifying a data structure while iterating over it. C++ might be doing the same. So store all the keys in a set and then build the map by iterating over the set.
for (long i = 0; i < n; i += 1){ scanf("%lld",&x); for (long j = 0; j < cnt[x]; j += 1){ long num = factors[x][j]; st.insert(i + 1000000l*num); } } long presNum=0,presValue=0; set::iterator it; for(it = st.begin(); it != st.end() ; it++){ long num = (*it)/1000000l,ind=(*it)%1000000l; if(num==presNum){ presValue++; } else { used_cnt[presNum]=presValue; presNum=num; presValue=1; } mp[*it]=presValue; used[num][presValue-1]=ind; } used_cnt[presNum]=presValue; for (long i_ = 0; i_ < tc; i_ += 1){ long l,r,k; scanf("%lld %lld %lld",&l,&r,&k); l-=2;r--; long l1=-1,r1=-1; for (long i = 0; i < used_cnt[k]; i += 1){ long i1 = used[k][i]; if(l>=i1 && r>=i1) { l1=i1; r1=i1; } else if(r>=i1) r1=i1; else break; } long ans =0; if(r1>=0) ans = mp[r1+1000000l*k]; if(l1>=0) ans = ans-mp[l1+1000000l*k]; printf("%lld\n",ans); }
Result: TLE
Attempt-2:
Java is very strict. C++ is very loose and may be it does not do the locking business while iterating. After some digging I discovered that this is the case. So the set is not required.
for (long i = 0; i < n; i += 1){ scanf("%lld",x); for (long j = 0; j < cnt[x]; j += 1){ long num = factors[x][j]; mp[i + 1000000l*num]=1; } } long presNum=0,presValue=0; map<long,long>::iterator it; for(it = mp.begin(); it != mp.end() ; it++){ long num = (it->val1)/1000000l,ind=(it->val1)%1000000l; if(num==presNum){ presValue++; } else { used_cnt[presNum]=presValue; presNum=num; presValue=1; } mp[it->val1]=presValue; used[num][presValue-1]=ind; }
Result: TLE
Attempt-3:
May be I’m loosing very narrowly and some minor modifications might result in AC. So I better modify the key to (num<<20) + ind. Getting back num, index from key will be key>>20, key&((1<<20) -1). After lot of pain with bracketting I ended with the code is the best I can do.
for (long i = 0; i < n; i += 1){ ni(x); for (long j = 0; j < cnt[x]; j += 1){ num = factors[x][j]; //cout<<num<<"-"<<(i + (num<<20))<<" "; mp[i + (num<<20)]=1; } } long presNum=0,presValue=0,ind,r1; map<long,long>::iterator it; long __num = ((1<<20)-1); for(it = mp.begin(); it != mp.end() ; it++){ num = (it->val1)>>20,ind= ((it->val1)& __num) ; if(num==presNum){ presValue++; } else { used_cnt[presNum]=presValue; presNum=num; presValue=1; } it->val2=presValue; used[num][presValue-1]=ind; } used_cnt[presNum]=presValue;
Result: TLE
Attempt-4:
What if all the test cases are 1,10**5,1 ? My solution would blow off. Use binary search. I always prefer not to use the upper_bound and lower_bound functions. But now I have to.
l1 = *(upper_bound(used[k],used[k]+used_cnt[k],l)-1); r1 = *(upper_bound(used[k],used[k]+used_cnt[k],r)-1);
Result: TLE again. Screw this . Its not gonna happen and there is not much I can do as the other questions have to be coded with lot of care and I will definitely mess up as there is less than 20 mins to go . Some fb. Sudden thought : Why not try the vector idea? Nothing to lose.
Attempt-5:
Code code code. I implemented the vetor idea.
for (long i = 0; i < n; i += 1){ ni(x); for (long j = 0; j < cnt[x]; j += 1){ v[factors[x][j]].pb(i); } } long l,r,k,l1,r1; while(tc--){ scanf("%lld %lld %lld",&l,&r,&k); l1=(upper_bound(v[k].begin(),v[k].end(),l-2) -v[k].begin()); r1=(upper_bound(v[k].begin(),v[k].end(),r-1)-v[k].begin()); printf("%lld\n",r1-l1); }
Result: Wtf? Where is the submit button? Contest over! The question is added to practice. Submitted my solution and got a WA. I thought I messed up some where.
I opened the editorial and after a quick glance I saw the number 128. Then I realised my miscalculation-1 and then modified the declaration of factors to factors[100001][150] (from factors[100001][100]) and the result is an AC.
]]>int NumberOfSetBits(int n){ int count = 0; while(n){ count += n & 1; n >>= 1; } return count; }
But we can do better. The following code works perfectly fine for counting the number of bits set.
//for 64 bit numbers int NumberOfSetBits64(long long i) { i = i - ((i >> 1) & 0x5555555555555555); i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333); i = ((i + (i >> 4)) & 0x0F0F0F0F0F0F0F0F); return (i*(0x0101010101010101))>>56; } //for 32 bit integers int NumberOfSetBits32(int i) { i = i - ((i >> 1) & 0x55555555); i = (i & 0x33333333) + ((i >> 2) & 0x33333333); i = ((i + (i >> 4)) & 0x0F0F0F0F); return (i*(0x01010101))>>24; }
The magic numbers might seem very confusing. But they are not random (obviously!).
Let us start with 0X55555555. Binary representation of 5 is 0101. So 0X55555555 has 16 ones, 16 zeros and the ones,zeros take alternate positions. Similarly 0X33333333 has 16 ones, 16 zeros and 2 consecutive ones, 2 consecutive zeros alternate.
F = 0x55555555 = 01010101010101010101010101010101 T = 0x33333333 = 00110011001100110011001100110011 O = 0x0f0f0f0f = 00001111000011110000111100001111
Now split the number into 16 sets of 2 bits
Ex: 9 = 1001 = 00,00,…..00,10,01
First step that algorithm does is to count the number of set bits in each of these 2-bit numbers. As each 2-bit number can have a maximum of 2 set bits. Each of these counts fit into a 2-bit number. That is what the below snippet does.
i = (i & 0x55555555) + ((i >> 1) & 0x55555555);
‘i & F’ preserves only the odd positioned ones. (i>>1) & F preserves only the even positioned ones and gets the even ones to odd positions. Adding them results in each pair of 2-bits having the number of set bits in their corresponding positions.
Similarly the after next 2 steps we end up having a number which can be seen as 4 8-bit numbers whose sum is the required count.
i = (i & 0x33333333) + ((i >> 2) & 0x33333333); i = (i & 0x0F0F0F0F) + (i >> 4)) & 0x0F0F0F0F);
Multiplying this number with 0X01010101 will result in a number whose most significant 8 bits contain the required count. Finally on summing up all these steps we get the following code.
int NumberOfSetBits32(int i) { i = (i & 0x55555555) + ((i >> 1) & 0x55555555); i = (i & 0x33333333) + ((i >> 2) & 0x33333333); i = (i & 0x0F0F0F0F) + (i >> 4)) & 0x0F0F0F0F); return (i*(0x01010101))>>24; }
Here some more micro optimizations are possible. After giving some thought it can be understood that the 2 lines of code in the below snippet produce the same result.
i = (i & 0x55555555) + ((i >> 1) & 0x55555555); i = i - ((i >> 1) & 0x55555555);
In general ‘&’ operator is not distributive over ‘+’. But in this case it does. In i and i>>4, each of the eight 4-bit numbers has a maximum value of 4 (fits in 3 bits). And adding these numbers will not cause an overflow in any of the 4-bit slots. So it works in this case. (Think of why it will not work int the previous steps).
i = (i & 0x0F0F0F0F) + (i >> 4)) & 0x0F0F0F0F); i = (i & + (i >> 4)) & 0x0F0F0F0F;
So finally we arrive at the code for NumberOfSetBits32 as in the 2nd snippet ( NumberOfSetBits64 can be implemented in similar lines).
]]>Quora says it connects me to everything I want to know about. That is true. But it connects to many other things I don’t need at all. If not facebook in the second half of 2012 it was quora I was browsing all the time. The bad thing about quora is that , it gives me a false feeling that I’m doing some thing useful and that it is not a waste of time. I spent almost 4 hrs/day on quora in week days and even more in weekends.
Twitter is an exception and remains so. I was never addicted to twitter. One good thing about twitter is most of my friends are not active on twitter and nothing inspires me to write a comment (reply). I’m nothing more than a spectator. It is the same with quora as well. But the answers/posts are generally long. It eats a lot of time.
So ‘no shit’ boils down to no posts on facebook, not more than a couple of hours per week on quora. So from now on fb is just a platform for chatting and poking.
Github/BitBucket
Why do I need github/BitBucket
From this year all not-shitty, non-proprietary code I write will be on github/bitbcket.
Coursera/Udacity
How should I spend the time I gain by giving up facebook and quora? Coursera/udacity is the answer. Coursera has some very nice and interesting courses taught by professors from top universities.
Here is my coursera wish list . The target is not to complete all these courses. It is difficult to put in continuous, consistent efforts for them (coursera expects that). Also there is no point in solving assignments on topics with which I’m super familiar. Learning is the only target.
Lose Weight
This needs no explanations!
Happy python’ing !
]]>