Thursday, February 27, 2014

Python in TopCoder 2

The other day someone linked my topcoder preliminary python vs. c++ comparison in reddit (Could you please not *ever* link my blog to reddit? ) Anyway, it reminds me that now that it has been a while since Python was enabled in topcoder SRMs it is a good time to talk more about it.

Poor adoption

It appears that the number of python topcoders stayed as low or lower than when it was released. I would love to have some actual stats, but anecdotally it appears to me C# still has more users than python (And C#'s user share is very low in comparison to Java). Which is a shame, really.

The following is a screenshot of the top 50 in the last SRM. Yellow scores are c++, Green scores are Java, Orange scores are python and White scores are C#:

So yep, it is not statistics, but this is a scenario that tends to repeat itself. The python solutions stand out because they are rare. I was expecting a bit more.

So why is this?

A contest system designed around Java

Let's go to the point, the contest system is just not python-friendly. This all begins because it has been clearly designed with Java solutions in mind. The obvious indication of this is that it requires you to wrap the solution in a class. In Java, this is natural. In C++, it is slightly annoying, but useful because it allows you easy access to the heap. But in python, it is really just a bunch of verbosity that is not helpful at all.

Using classes in python is a bit obnoxious and it gets in the way unless you really need the OOP syntactic sugar. In python, the more declarative/functional approaches tend to be better. This means using functions. But if you put your functions inside the class, that means having to use self. syntax and include self in all your function. Ultimately it means having the self argument in places where it is not really needed. I just wish there was an alternative to this.

What if we wanted to include actual functional languages? You won't be able to force a class in Haskell. And it would be quite cool to have Haskell or even LISP as an option. Actually, I wish we had more languages in general.

But the real aspect of TopCoder that reveals that it is all about Java is the execution time limit. There is a 2.0 time limit for each test case. When problem setting, writers have to provide a Java solution that runs in less than 1.0 seconds. So time constraints are guaranteed for Java and work great for c++ because it is faster. But any language slower than Java is going to have issues and it shows.

The list of Topcoder problems that are impossible in python keeps growing. Even if a problem is possible, if it is not one of the easy problem categories, it will be usually be much harder to come up with a passing python solution than a passing c++ or Java solution. Problems like EllysCandyGame where you can easily pass in c++ with brute force, but need an exponential-time dynamic programming to pass in python.

Poor information

How many people out there know that python is allowed in TC? It seems to still be news to plenty of people. Maybe we need to tell more people about this?

If only

There are a couple of problems that I'd like to mention out of how frustratingly bad they are for python.

EllysPairing an egregious example. This problem has a 16MB memory limit. This means python cannot even run without failing the test case. There is no way around this issue, python simply cannot do anything about it.

What is ironic is that the solution for the problem would be very nicely implemented in python. The problem uses a list of integers that are generated randomly . The memory limit makes you unable to save the list of integers in a list/array/etc. The algorithm that solves it is linear-time majority vote which requires you to visit the sequence twice. Since you cannot save the numbers in a list, this means that you would need to generate the list twice. This makes it very easy to have messy code that repeats the random generation process in two code places. But in python, this wouldn't be a problem, because it has yield.

class EllysPairing:
 def getMax(self, M, count, first, mult, add):
    # generate all the random numbers, return as generator
    def names():
        for i in xrange( len(count) ):
            x = first[i]
            for j in xrange(count[i]):
                yield x
                x = (x * mult[i] + add[i]) & (M - 1);
    p = -1
    q = 0
    for x in names(): #use them
        if (p != x):
            q -= 1
            if (q <= 0):
                p = x
                q = 1
        else:
            q += 1
    
    n = sum(count)
    c = sum( x == p for x in names() ) #use them again
    
    if (c > n / 2):
        return n / 2 - (c - n/2 - n%2)
    else:
        return n / 2

The important part is the for x in names() , that is all that is needed to iterate through all the random numbers in the list without rewriting the code. The top c++ solution by aceofdiamonds repeats the random number generation code.

This cute solution cannot run. Even without the cruel 16 MB limit, it will be too slow in comparison to the Java solution and would time out, taking around 6 seconds in TC's servers. That's too bad.

I also want to talk about a more recent problem: MiningGoldEasy. This problem has a simple dynamic programming / memoization solution. Just notice that if , in each move you limit yourself to rows and columns that contain at least one "event", it should be optimal, so we have an `O(|event|^3)` solution if we remember the current number of step, and the current position. In python, I can code this:

class MiningGoldEasy:
 def GetMaximumGold(self, N, M, event_i, event_j):
    @MemoizedFunction
    def f(x,y,s):
        if s == len(event_i) :
            return 0
        else:
            bx = max( f(nx,y,s+1) for nx in event_i )
            by = max( f(x,ny,s+1) for ny in event_j )
            return max(bx,by) + N + M - abs(event_i[s] - x) - abs(event_j[s] - y)
     
    return max( max( f(x,y,0) for x in event_i) for y in event_j )

The MemoizedFunction is a decorator. A syntactic sugar that means I do f = MemoizedFunction(f) after declaring the function. MemoizedFunction() is a function that takes a function and instead returns a function that behaves exactly the same but uses memoization. It is clever:

def MemoizedFunction(f):
    mem = dict()
    def g(*args):
        if args not in mem:
            mem[args] = f(*args)
        return mem[args]
    return g

Since 90% of Topcoder problems are dynamic programming ones, this idea to automatically turn a function into memoization is very cool. Just take a look to the GetMaximumGold() code up there, it is almost pseudocode. A c++ equivalent looks like this:

int GetMaximumGold(int N, int M, vector<int> event_i, vector<int> event_j)
{
    int t = event_i.size();
    
    int mem[50][50][51];
    memset(mem, -1, sizeof(mem));        
    function<int(int,int,int)> f = [&](int a, int b, int s) {
        int & res = mem[a][b][s];
        if (res == -1) {
            if (s == t) {
                res = 0;
            } else {
                int x = N + M 
                      - abs( event_i[a] - event_i[s])
                      - abs( event_j[b] - event_j[s]);
                res = 0;
                // change row:
                for (int na = 0; na < event_i.size(); na++) {
                    res = std::max(res, x + f(na,b,s + 1) ); 
                }
                // change column:
                for (int nb = 0; nb < event_j.size(); nb++) {
                    res = std::max(res, x + f(a,nb,s + 1) ); 
                }
            }
        }
        return res;
    };
    int mx = 0;
    for (int a = 0; a < event_i.size(); a++) {
        for (int b = 0; b < event_j.size(); b++) {
            mx = std::max(mx, f(a,b,0) );
        }
    }
    return mx;
}

Unfortunately, the python solution above fails system tests whilst the c++ one passes in 0.1 seconds worst-case. Of course, it is likely that the python solution can be optimized to pass. Probably not using dictionaries or adding crops. But this means going lower level and thus losing the advantages python provides.

Google's Code Jam

Again, this all contrasts with the Google Code Jam, in which python has almost as many users as Java. Since in google jam there is no memory limit (as long as it runs in your computer) and the time limit is very generous (4~8 minutes for around 100 test cases as opposed to 2 seconds for each test case). Then python has a more healthy following in Code Jam. Shall TC make changes to become more language agnostic? It is possible for them to do that without losing what makes them unique? I have no idea.

SRM 611-620 Challenge

With all this in mind, let's do something fun! In the next 10 SRMs, I will use only python during each match. Repeat, I will be forbidden from submitting anything in c++. This will be the first time I use a language other than c++ during a match. Let's see what happens... I will try to make detailed blog posts about how it is effecting (helping or being an obstacle) my performance.

If anyone want to join the challenge, just post a comment here or make a blog post of your own.

6 comments :

vexorian said...

I managed to create a MemoizedFunction decorator that uses a single list instead of dictionary if you provide it the bounds for the arguments. This requires the decorator to be a decorator with argument. Which means it is a function that returns a function that takes a function that returns a function, but it works.

http://pastebin.com/6VZFWfuB

This makes the code for that problem twice as fast, but still not fast enough.

Alessandro Stamatto said...

It was me who linked the previous post you're referencing.

I posted because I really liked it. And I wanted to prompt some discussion on making python better on algorithm contests (maybe pypy could smooth a bit the slowness).

Sorry! (Also, I was not expecting for it to bite redditors. For me it was clear that it was in the context of algorithm contests).

There was a presentation on PyCon last year on using PyPy in Google CodeJam:
Slides https://ep2013.europython.eu/media/conference/slides/coding-competitions-with-pypy-python-for-the-win.pdf
Presentation http://www.youtube.com/watch?v=jjXWVhVMR9k

For me C++ still is the best language for those contests, but maybe python can scratch some problems.

Cheers, and keep the good work! (Waiting for your review on using it on 10 algorithm matches).

vexorian said...

Well don't worry. I am always tongue in cheek, remember, so the request wasn't 100% serious. I just didn't like having to remove a couple of insulting comments that appeared.

pypy is fast, but using it in codejam is going to make google feel hurt because their V8 can't defeat it.

Paramdeep Singh said...

Sir after reading your editorial on 'MiningGoldEasy', I came up with an alternate solution which to some testcases is responding wrongly-
https://ideone.com/poHxiw

vexorian said...

https://ideone.com/DLRZqU

Diaa said...

on SRM452 DIV II prob. 500

it seems there is a problem on python

once tray to run this testcase (1000,997) it said there is problem on return

it's ok on ideone and also and on my machine
http://ideone.com/PCJ1pf