Some Basic Python Optimisations
WARNING!
For CPython, if you want performance you should prefer algorithmic optimizations (i.e. find better algorithms for what you're doing) or, if that's not possible, rewrite parts of your code in Cython.

You should generally aim for readability and avoid micro-optimizations - except in cases where readability and speed coincide, like comprehension expressions: they are both more readable and faster (they are special-cased and emit fewer instructions than loops, because there is no possibility of break, continue or return - see compile.c, Ctrl+F, "List and set comprehensions").

You should always profile your application before optimizing, and you should profile with real (or realistic) data and in as complete a state as possible - concentrate on optimizing stuff that actually matters, not stuff you think matters.
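A quick, unscientific way to check the comprehension claim yourself (the function names below are mine, and the exact numbers will vary by machine and Python version - only the ordering tends to be stable):

```python
from timeit import timeit

def squares_loop():
    # Explicit loop: one CALL to result.append per iteration.
    result = []
    for n in range(1000):
        result.append(n * n)
    return result

def squares_comprehension():
    # Comprehension: special-cased bytecode, no attribute lookup or call.
    return [n * n for n in range(1000)]

print("loop:         ", timeit(squares_loop, number=10_000))
print("comprehension:", timeit(squares_comprehension, number=10_000))
```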
You have been warned. On to the fun part.
>>> from dis import dis
>>> def fn(a):
...     b = a
>>> dis(fn)
2 0 LOAD_FAST 0 (a)
3 STORE_FAST 1 (b)
6 LOAD_CONST 0 (None)
9 RETURN_VALUE
>>> dis(compile("b = a", "wtf", "exec"))
1 0 LOAD_NAME 0 (a)
3 STORE_NAME 1 (b)
6 LOAD_CONST 0 (None)
9 RETURN_VALUE
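If you want to see what that bytecode difference costs in time rather than in instructions, here is a rough timeit sketch (function names are mine; module-level access uses the dictionary-based lookup, and binding to a local first switches the hot path to the fast variant):

```python
from timeit import timeit

x = 1  # module-level name: looked up by dictionary on each access

def use_global():
    total = 0
    for _ in range(1000):
        total += x      # dictionary-based lookup every iteration
    return total

def use_local():
    local_x = x         # bind once; later accesses use LOAD_FAST
    total = 0
    for _ in range(1000):
        total += local_x
    return total

print("global:", timeit(use_global, number=10_000))
print("local: ", timeit(use_local, number=10_000))
```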
As you can probably guess, LOAD_FAST and its counterpart STORE_FAST are faster than LOAD_NAME and STORE_NAME. The reason is that local variable (and parameter) access in functions is optimized - instead of using full dictionary lookups, it uses integer offsets; during compilation all local variables must be known for this offset table to work, which is why you can't dynamically add local variables to functions.

About why a = a + 1 is faster than a += 1:
>>> def fn(n):
...     n = n + 1
>>> dis(fn)
2 0 LOAD_FAST 0 (n)
3 LOAD_CONST 1 (1)
6 BINARY_ADD
7 STORE_FAST 0 (n)
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
>>> def fn(n):
...     n += 1
>>> dis(fn)
2 0 LOAD_FAST 0 (n)
3 LOAD_CONST 1 (1)
6 INPLACE_ADD
7 STORE_FAST 0 (n)
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
So the only difference is INPLACE_ADD versus BINARY_ADD (this has nothing to do with binary numbers, it just means "add for two arguments").

These instructions are defined in ceval.c - to spare you the reading, they only differ in that for numbers one calls PyNumber_Add() while the other calls PyNumber_InPlaceAdd(). Those two are defined in abstract.c, where the main difference is that they call either binary_op1 or binary_iop1, defined in the same file.

binary_iop1 will check whether the object supports in-place operations (i.e. whether the existing object can be modified instead of a new copy being created and returned), but the answer in the case of integers (intobject.c) and floats (floatobject.c) is "no" (nb_inplace_add is zero - a new object is always created, except that integers with values between -5 and 256 inclusive are reused from a pool, but this is transparent to this code).

However, this is only determined after a couple of pointer dereferences and conditional branches, before execution falls back to the non-in-place version. That is the overhead you are seeing (less than 5% on my computer; sometimes I even get the same results).
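You can observe the difference between "no in-place support" (ints) and genuine in-place operations (lists) from Python, without reading any C - just watch the object identities:

```python
# Ints are immutable: += must produce a new object.
a = 1000                 # outside the small-integer pool
old_id = id(a)
a += 1
print(id(a) == old_id)   # False - a now names a freshly created int

# Lists do implement in-place addition: the same object is mutated.
lst = [1, 2]
old_id = id(lst)
lst += [3]
print(id(lst) == old_id) # True - same list, extended in place
print(lst)               # [1, 2, 3]
```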
Using a similar workflow, you can probably answer the rest of your questions yourself - just use the dis module to disassemble the Python code, then check how the instructions are handled in the source code of CPython on GitHub.

It's getting late, so I'll just leave this here and go to bed:
static long
int_hash(PyIntObject *v)
{
    /* XXX If this is changed, you also need to change the way
       Python's long, float and complex types are hashed. */
    long x = v->ob_ival;
    if (x == -1)
        x = -2;
    return x;
}
>>> [hash(n) for n in range(-4, 5)]
[-4, -3, -2, -2, 0, 1, 2, 3, 4]
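The special case exists because, at the C level, a hash function returning -1 signals an error to the interpreter - so -1 is never a valid hash value, and -2 is substituted, which makes -1 and -2 collide:

```python
# -1 is reserved as the C-level error return of hash functions,
# so hash(-1) is quietly mapped to -2 and collides with hash(-2).
print(hash(-1))              # -2
print(hash(-2))              # -2
print(hash(-1) == hash(-2))  # True
```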
Guest Author