Monday, October 30, 2017

How to make your code 80 times faster

I often hear people who are happy because PyPy makes their code 2 times faster or so. Here is a short personal story which shows PyPy can go well beyond that.

DISCLAIMER: this is not a silver bullet or a general recipe: it worked in this particular case, but it might not work so well in other cases. Still, I think it is an interesting technique. Moreover, the various steps and implementations are shown in the same order as I tried them during development, so it is a real-life example of how to proceed when optimizing for PyPy.

Some months ago I played a bit with evolutionary algorithms: the ambitious plan was to automatically evolve a logic which could control a (simulated) quadcopter, i.e. a PID controller (spoiler: it doesn't fly).

The idea is to have an initial population of random creatures: at each generation, the ones with the best fitness survive and reproduce with small, random variations.

However, for the scope of this post, the actual task at hand is not so important, so let's jump straight to the code. To drive the quadcopter, a Creature has a run_step method which runs at each delta_t (full code):
class Creature(object):
    INPUTS = 2  # z_setpoint, current z position
    OUTPUTS = 1 # PWM for all 4 motors
    STATE_VARS = 1
    ...

    def run_step(self, inputs):
        # state: [state_vars ... inputs]
        # out_values: [state_vars, ... outputs]
        self.state[self.STATE_VARS:] = inputs
        out_values = np.dot(self.matrix, self.state) + self.constant
        self.state[:self.STATE_VARS] = out_values[:self.STATE_VARS]
        outputs = out_values[self.STATE_VARS:]
        return outputs
  • inputs is a numpy array containing the desired setpoint and the current position on the Z axis;
  • outputs is a numpy array containing the thrust to give to the motors. To start easy, all 4 motors are constrained to have the same thrust, so that the quadcopter only travels up and down the Z axis;
  • self.state contains arbitrary values of unknown size which are passed from one step to the next;
  • self.matrix and self.constant contain the actual logic. By putting the "right" values there, in theory we could get a perfectly tuned PID controller. These are randomly mutated between generations (a small sketch of the shapes involved follows this list).
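To make the shapes concrete, here is a minimal sketch of what a single run_step call computes (the numbers are made up, not taken from the actual repo): with 1 state variable and 2 inputs, the state vector has length 3, self.matrix has shape (2, 3), and self.constant has length 2.

import numpy as np

matrix = np.array([[0.1, 0.2, 0.3],    # shape (2, 3): maps the 3 state values...
                   [0.4, 0.5, 0.6]])   # ...to 1 new state variable + 1 output
constant = np.array([0.0, 1.0])
state = np.array([0.0, 5.0, 4.2])      # [state_var, z_setpoint, current z]

out_values = np.dot(matrix, state) + constant
new_state_var, thrust = out_values     # carried-over state var, motor output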
run_step is called at 100Hz (in the virtual time frame of the simulation). At each generation, we test 500 creatures for a total of 12 virtual seconds each. So, we have a total of 600,000 executions of run_step at each generation.
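As a quick sanity check of that number, this is where the 600,000 comes from (just arithmetic, not the actual driver code in ev.main):

POPULATION = 500                 # creatures per generation
HZ = 100                         # run_step calls per virtual second
SECONDS = 12                     # virtual seconds per creature

calls_per_generation = POPULATION * HZ * SECONDS
assert calls_per_generation == 600000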

At first, I simply tried to run this code on CPython; here is the result:
$ python -m ev.main
Generation   1: ... [population = 500]  [12.06 secs]
Generation   2: ... [population = 500]  [6.13 secs]
Generation   3: ... [population = 500]  [6.11 secs]
Generation   4: ... [population = 500]  [6.09 secs]
Generation   5: ... [population = 500]  [6.18 secs]
Generation   6: ... [population = 500]  [6.26 secs]
Which means ~6.15 seconds/generation, excluding the first.

Then I tried with PyPy 5.9:
$ pypy -m ev.main
Generation   1: ... [population = 500]  [63.90 secs]
Generation   2: ... [population = 500]  [33.92 secs]
Generation   3: ... [population = 500]  [34.21 secs]
Generation   4: ... [population = 500]  [33.75 secs]
Ouch! We are ~5.5x slower than CPython. This was kind of expected: numpy is based on cpyext, which is infamously slow. (Actually, we are working on that, and on the cpyext-avoid-roundtrip branch we are already faster than CPython, but this will be the subject of another blog post.)

So, let's try to avoid cpyext. The first obvious step is to use numpypy instead of numpy (actually, there is a hack to use just the micronumpy part). The switch can be as simple as something along these lines (a hedged sketch: the exact module layout of PyPy's built-in micronumpy is an assumption here, and the actual hack may differ):
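try:
    # On PyPy, use the built-in micronumpy directly (module name assumed)
    from _numpypy import multiarray as np
except ImportError:
    # On CPython, or on PyPy going through cpyext, use the regular numpy
    import numpy as np

With that in place, let's see if the speed improves: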
$ pypy -m ev.main   # using numpypy
Generation   1: ... [population = 500]  [5.60 secs]
Generation   2: ... [population = 500]  [2.90 secs]
Generation   3: ... [population = 500]  [2.78 secs]
Generation   4: ... [population = 500]  [2.69 secs]
Generation   5: ... [population = 500]  [2.72 secs]
Generation   6: ... [population = 500]  [2.73 secs]
So, ~2.7 seconds on average: this is 12x faster than PyPy+numpy, and more than 2x faster than the original CPython. At this point, most people would be happy and go tweeting how PyPy is great.

In general, when talking of CPython vs PyPy, I am rarely satisfied with a 2x speedup: I know that PyPy can do much better than this, especially if you write code which is specifically optimized for the JIT. For a real-life example, have a look at capnpy benchmarks, in which the PyPy version is ~15x faster than the heavily optimized CPython+Cython version (both have been written by me, and I tried hard to write the fastest code for both implementations).

So, let's try to do better. As usual, the first thing to do is to profile and see where we spend most of the time. Here is the vmprof profile. We spend a lot of time inside the internals of numpypy, and allocating tons of temporary arrays to store the results of the various operations.
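If you want to collect a similar profile yourself, an invocation along these lines should do the trick (hedged: the exact vmprof command-line options may vary between versions, and you might need to adapt how the module is launched):

$ pip install vmprof
$ pypy -m vmprof --web ev/main.py    # upload the profile for browsing, e.g. on vmprof.com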

Also, let's look at the jit traces and search for the function run: this is the loop in which we spend most of the time, and it is composed of 1796 operations. The operations emitted for the line np.dot(...) + self.constant are listed between lines 1217 and 1456. In the excerpt which calls np.dot(...), most of the ops are cheap, but at line 1232 we see a call to the RPython function descr_dot; by looking at the implementation we see that it creates a new W_NDimArray to store the result, which means it has to do a malloc().

The implementation of the + self.constant part is also interesting: contrary to the former, the call to W_NDimArray.descr_add has been inlined by the JIT, so we have a better picture of what's happening; in particular, we can see the call to __0_alloc_with_del____ which allocates the W_NDimArray for the result, and the raw_malloc which allocates the actual array. Then we have a long list of 149 simple operations which set the fields of the resulting array, construct an iterator, and finally do a call_assembler: this is the actual logic to do the addition, which was JITted independently; call_assembler is one of the operations used to do JIT-to-JIT calls.

All of this is very suboptimal: in this particular case, we know that the shape of self.matrix is always (2, 3): so, we are doing an incredible amount of work, including calling malloc() twice for the temporary arrays, just to call two functions which ultimately do a total of 6 multiplications and 6 additions. Note also that this is not a fault of the JIT: CPython+numpy has to do the same amount of work, just hidden inside C calls.

One possible solution to this nonsense is a well known compiler optimization: loop unrolling. From the compiler point of view, unrolling the loop is always risky, because if the matrix is too big you might end up emitting a huge blob of code, possibly useless if the shapes of the matrices change frequently: this is the main reason why the PyPy JIT does not even try to do it in this case.

However, we know that the matrix is small, and always of the same shape. So, let's unroll the loop manually:
class SpecializedCreature(Creature):

    def __init__(self, *args, **kwargs):
        Creature.__init__(self, *args, **kwargs)
        # store the data in a plain Python list
        self.data = list(self.matrix.ravel()) + list(self.constant)
        self.data_state = [0.0]
        assert self.matrix.shape == (2, 3)
        assert len(self.data) == 8

    def run_step(self, inputs):
        # state: [state_vars ... inputs]
        # out_values: [state_vars, ... outputs]
        k0, k1, k2, q0, q1, q2, c0, c1 = self.data
        s0 = self.data_state[0]
        z_sp, z = inputs
        #
        # compute the output
        out0 = s0*k0 + z_sp*k1 + z*k2 + c0
        out1 = s0*q0 + z_sp*q1 + z*q2 + c1
        #
        self.data_state[0] = out0
        outputs = [out1]
        return outputs
In the actual code there is also a sanity check which asserts that the computed output is the very same as the one returned by Creature.run_step.
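Such a check might look roughly like the following (a hypothetical helper for illustration; the actual code in the repo may be organized differently):

import numpy as np

def checked_run_step(specialized, inputs):
    # Compute the output with the hand-unrolled logic and double-check it
    # against the original numpy implementation on the very same creature.
    # (Hypothetical helper, not the actual check in the repo.)
    got = specialized.run_step(inputs)                  # unrolled version
    expected = Creature.run_step(specialized, inputs)   # slow numpy version
    assert np.allclose(got, expected)
    return got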

So, let's try to see how it performs. First, with CPython:
$ python -m ev.main
Generation   1: ... [population = 500]  [7.61 secs]
Generation   2: ... [population = 500]  [3.96 secs]
Generation   3: ... [population = 500]  [3.79 secs]
Generation   4: ... [population = 500]  [3.74 secs]
Generation   5: ... [population = 500]  [3.84 secs]
Generation   6: ... [population = 500]  [3.69 secs]
This looks good: 60% faster than the original CPython+numpy implementation. Let's try on PyPy:
Generation   1: ... [population = 500]  [0.39 secs]
Generation   2: ... [population = 500]  [0.10 secs]
Generation   3: ... [population = 500]  [0.11 secs]
Generation   4: ... [population = 500]  [0.09 secs]
Generation   5: ... [population = 500]  [0.08 secs]
Generation   6: ... [population = 500]  [0.12 secs]
Generation   7: ... [population = 500]  [0.09 secs]
Generation   8: ... [population = 500]  [0.08 secs]
Generation   9: ... [population = 500]  [0.08 secs]
Generation  10: ... [population = 500]  [0.08 secs]
Generation  11: ... [population = 500]  [0.08 secs]
Generation  12: ... [population = 500]  [0.07 secs]
Generation  13: ... [population = 500]  [0.07 secs]
Generation  14: ... [population = 500]  [0.08 secs]
Generation  15: ... [population = 500]  [0.07 secs]
Yes, it's not an error. After a couple of generations, it stabilizes at around ~0.07-0.08 seconds per generation. This is around 80 (eighty) times faster than the original CPython+numpy implementation, and around 35-40x faster than the naive PyPy+numpypy one.

Let's look at the trace again: it no longer contains expensive calls, and certainly no more temporary malloc()s. The core of the logic is between lines 386-416, where we can see that it does fast C-level multiplications and additions: float_mul and float_add are translated straight into mulsd and addsd x86 instructions.
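For reference, the JIT traces themselves can be dumped to a file by setting the PYPYLOG environment variable, roughly like this (the exact set of log categories to enable may vary):

$ PYPYLOG=jit-log-opt,jit-summary:traces.log pypy -m ev.main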

As I said before, this is a very particular example, and the techniques described here do not always apply: it is not realistic to expect an 80x speedup on arbitrary code, unfortunately. However, it clearly shows the potential of PyPy when it comes to high-speed computing. And most importantly, it's not a toy benchmark which was designed specifically to have good performance on PyPy: it's a real world example, albeit small.

You might also be interested in the talk I gave at the last EuroPython, in which I discuss a similar topic: "The Joy of PyPy JIT: abstractions for free" (abstract, slides and video).

How to reproduce the results

$ git clone https://github.com/antocuni/evolvingcopter
$ cd evolvingcopter
$ {python,pypy} -m ev.main --no-specialized --no-numpypy
$ {python,pypy} -m ev.main --no-specialized
$ {python,pypy} -m ev.main

Wednesday, October 18, 2017

(Cape of) Good Hope for PyPy


Hello from the other side of the world (for most of you)!

Using PyCon ZA as an excuse, during the last two weeks Armin, Ronan, Antonio and sometimes Maciek had a very nice and productive sprint in Cape Town, as the pictures show :). We would like to say a big thank you to Kiwi.com, which sponsored part of the travel costs via its awesome Sourcelift program to help Open Source projects.

Armin, Anto and Ronan at Cape Point

Armin, Ronan and Anto spent most of the time hacking at cpyext, our CPython C-API compatibility layer: during the last years, the focus was on making it work and be compatible with CPython, in order to run existing libraries such as numpy and pandas. However, we never paid too much attention to performance, so the net result is that with the latest released version of PyPy, C extensions generally work, but their speed ranges from "slow" to "horribly slow".

For example, these very simple microbenchmarks measure the speed of calling (empty) C functions, i.e. the time you spend to "cross the border" between RPython and C. (Note: this includes the time spent doing the loop in regular Python code.) These are the results on CPython, on PyPy 5.8, and on our newest in-progress version:

$ python bench.py     # CPython
noargs      : 0.41 secs
onearg(None): 0.44 secs
onearg(i)   : 0.44 secs
varargs     : 0.58 secs

$ pypy-5.8 bench.py   # PyPy 5.8
noargs      : 1.01 secs
onearg(None): 1.31 secs
onearg(i)   : 2.57 secs
varargs     : 2.79 secs

$ pypy bench.py       # cpyext-refactor-methodobject branch
noargs      : 0.17 secs
onearg(None): 0.21 secs
onearg(i)   : 0.22 secs
varargs     : 0.47 secs
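For reference, the general shape of such a microbenchmark is roughly the following (a hedged sketch, not the actual bench.py: the real benchmark calls empty METH_NOARGS, METH_O and METH_VARARGS functions defined in a small C extension module):

import time

def bench(fn, label, n=10000000):
    # Time a tight loop of n calls: when the callee is an (empty) C-level
    # function, most of the time is spent crossing the border between the
    # interpreter and C.
    start = time.time()
    for _ in range(n):
        fn()
    print('%-12s: %.2f secs' % (label, time.time() - start))

# As a stand-in for the empty C functions, any C-level builtin will do:
bench(time.time, 'noargs')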



So yes: before the sprint, we were ~2-6x slower than CPython. Now, we are faster than it!
To reach this result, we did various improvements, such as:

  1. teach the JIT how to look (a bit) inside the cpyext module;
  2. write specialized code for calling METH_NOARGS, METH_O and METH_VARARGS functions; previously, we always used a very general and slow logic;
  3. implement freelists to allocate the cpyext versions of int and tuple objects, as CPython does;
  4. the cpyext-avoid-roundtrip branch: crossing the RPython/C border is slowish, but the real problem was (and still is, in many cases) that we often cross it many times for no good reason. So, depending on the actual API call, you might end up in C land, which calls back into RPython land, which goes back to C, etc. etc. (ad libitum).
The branch tries to fix such nonsense: so far, we fixed only some cases, which are enough to speed up the benchmarks shown above. But most importantly, we now have a clear path and an actual plan to improve cpyext more and more. Ideally, we would like to reach a point in which cpyext-intensive programs run, at worst, at the same speed as CPython.

The other big topic of the sprint was Armin and Maciej doing a lot of work on the unicode-utf8 branch: the goal of the branch is to always use UTF-8 as the internal representation of unicode strings. The advantages are various:
  • decoding a UTF-8 stream is super fast, as you just need to check that the stream is valid;
  • encoding to UTF-8 is almost a no-op;
  • UTF-8 is always a more compact representation than the currently used UCS-4. It's also almost always more compact than CPython 3.5's latin1/UCS2/UCS4 combo;
  • smaller representation means everything becomes quite a bit faster due to lower cache pressure.
Before you ask: yes, this branch contains special logic to ensure that random access of single unicode chars is still O(1), as it is on both CPython and the current PyPy.
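To give a feeling of how this can work, here is a toy sketch of one possible scheme (only an illustration of the idea, not the data structure actually used by the unicode-utf8 branch): record the byte offset of every Kth codepoint, so that accessing an arbitrary character only requires skipping at most K-1 variable-length codepoints after a direct table lookup.

K = 64

def utf8_char_len(first_byte):
    # Length in bytes of a UTF-8 sequence, deduced from its first byte
    if first_byte < 0x80:
        return 1
    if first_byte < 0xE0:
        return 2
    if first_byte < 0xF0:
        return 3
    return 4

def build_index(b):
    # Byte offset of every Kth codepoint in the UTF-8 bytestring b
    offsets, pos, i = [], 0, 0
    while pos < len(b):
        if i % K == 0:
            offsets.append(pos)
        pos += utf8_char_len(b[pos])
        i += 1
    return offsets

def char_at(b, offsets, i):
    # O(1) for fixed K: jump to the nearest recorded offset, then skip at
    # most K-1 codepoints and decode the one we landed on
    pos = offsets[i // K]
    for _ in range(i % K):
        pos += utf8_char_len(b[pos])
    return b[pos:pos + utf8_char_len(b[pos])].decode('utf-8')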
We also plan to improve the speed of decoding even more by using modern processor features, like SSE and AVX. Preliminary results show that decoding can be done 100x faster than the current setup.

In summary, this was a long and profitable sprint, in which we achieved lots of interesting results. However, what we liked even more was the privilege of doing commits from awesome places such as the top of Table Mountain:


The panorama we looked at instead of staring at cpyext code

Thursday, October 5, 2017

PyPy v5.9 Released, Now Supports Pandas, NumPy

The PyPy team is proud to release both PyPy3.5 v5.9 (a beta-quality interpreter for Python 3.5 syntax) and PyPy2.7 v5.9 (an interpreter supporting Python 2.7 syntax).

  • NumPy and Pandas now work on PyPy2.7 (together with Cython 0.27.1). Many other modules based on C-API extensions work on PyPy as well.

  • Cython 0.27.1 (released very recently) supports more projects with PyPy, both on PyPy2.7 and PyPy3.5 beta. Note version 0.27.1 is now the minimum version that supports this version of PyPy, due to some interactions with updated C-API interface code.

  • We optimized the JSON parser for recurring string keys, which should decrease memory use by up to 50% and increase parsing speed by up to 15% for large JSON files with many repeating dictionary keys (which is quite common); a toy sketch of the idea follows this list.

  • CFFI, which is part of the PyPy release, has been updated to 1.11.1, improving an already great package for interfacing with C. CFFI now supports complex arguments in API mode, as well as char16_t and char32_t and has improved support for callbacks.

  • Issues in the C-API compatibility layer that appeared as excessive memory use were cleared up, and other incompatibilities were resolved. The C-API compatibility layer does slow down code which crosses the Python/C interface often. Fixes for some of the performance issues are in the pipeline, and we still recommend using pure Python on PyPy or interfacing via CFFI.
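To illustrate the idea behind the JSON optimization mentioned above (a toy sketch, not PyPy's actual implementation): cache the decoded key strings, so that files with many repeating dictionary keys reuse a single string object instead of allocating a fresh one for every occurrence.

_key_cache = {}

def decode_key(raw):
    # Return a shared string object for a raw (encoded) key, decoding it
    # only the first time it is seen.
    key = _key_cache.get(raw)
    if key is None:
        key = raw.decode('utf-8')
        _key_cache[raw] = key
    return key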

Please let us know if your use case is slow, we have ideas how to make things faster but need real-world examples (not micro-benchmarks) of problematic code.

Work sponsored by a Mozilla grant continues on PyPy3.5; we continue on the path to the goal of a complete python 3.5 implementation. Of course the bug fixes and performance enhancements mentioned above are part of both PyPy2.7 and PyPy3.5 beta.

As always, this release fixed many other issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.

You can download the v5.9 releases here (note that we provide PyPy3.5 binaries for only Linux 64bit for now):

We would like to thank our donors and contributors, and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 (stdlib version 2.7.13), and CPython 3.5 (stdlib version 3.5.3). It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

The PyPy 2.7 release supports:
  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS X 64 bits, Windows 32 bits, OpenBSD, FreeBSD)
  • newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux,
  • big- and little-endian variants of PPC64 running Linux,
  • s390x running Linux

What else is new?

PyPy 5.8 was released in June, 2017.
There are many incremental improvements to RPython and PyPy, the complete listing is here.
 
Please update, and continue to help us make PyPy better.

Cheers, The PyPy team