[Feature Request] Profile-guided Brute-force Optimization
This feature request assumes that issue #5 (closed) (Show Us the Bottlenecks) has been resolved. Do that first.
By "brute-force optimization," we mean optimization confined to a single CPU rather than optimization scaling to multiple CPUs and/or GPUs. Both are critical, of course. But the latter requires considerably different techniques than the former and is thus deferred to a subsequent feature request.
Teach a Coder How to Optimize BETSE and You Satisfy Him for a Femtosecond
What kind of a subheading is that? I really don't know. Onward, valiant readers!
Numerous third-party frameworks for brute-force Python optimization exist. Some of them are even exciting. These include:
- Numba, a high-performance Python JIT widely known to have C-like time efficiency. Since Numba is produced by the producers of Anaconda, Anaconda ships Numba out-of-the-box. Ergo, installation is no concern. Feast thy enthusiastic eyes on this brute-force benchmark of LU decomposition of square random matrices of increasing sizes for pure Python, Numba-optimized Python, and C:
- Cython, a static compiler for Python also widely known to have C-like time efficiency. That's right. Programmatically compile pure-Python applications into executable machine code without changing a single line of Python code. Not too many lines, anyway. Actually, that's a seductive lie. Many lines would probably need to change. Hear me now; believe me later. The disadvantage with respect to Numba is that Numba is fine-grained (...that is good) whereas Cython is coarse-grained (...that is bad). Specifically:
- Numba can be applied at the callable level for an application; that is, rather than attempting to JIT your entire application in one fell stroke with Numba (that usually fails), you only JIT those callables in your application that are known to play ball with Numba. And you are never required to completely commit to Numba. If something better comes along, ripping Numba out after-the-fact is trivial.
- Cython, however, cannot be applied at the callable level. Its unit of compilation is the module: you either compile an entire module with Cython or you don't. Additionally, Cython is technically a distinct language from Python. Once you Cythonize your Python application by injecting Cython-specific expressions and types into it, you can't go back to Python. Ever. You're stuck forever in Cython Land™. Unless you sequester those expressions and types either (A) into separate .pxd files or (B) as runtime calls against functions imported from the cython package. This is referred to as pure-Python mode. (A) is unmaintainable, so no one does that. (B) gets real verbose real quick and probably also dramatically reduces the efficiency of pure-Python mode, effectively mandating use of Cython, effectively defeating the purpose of pure-Python mode. Thus, sort-of sucky. Nonetheless, Cython is fast. It's just not sufficiently faster than Numba to warrant the extreme decrease in flexibility. Feast thy still-enthusiastic eyes on a similar benchmark for pure Python, Cython-optimized Python, and C:
- And many, many similar frameworks – including Nuitka, Pyjion, PyPy, Pyston, and the innumerable list goes on and on. Most of these come with significant gotchas in the unreadable fine print, however. For example:
- Pyston only supports Python 2.x. Thanks fer jack, Dropbox.
- Pyjion only supports Windows. Thanks fer jack, Microsoft. Microsoft. Always Microsoft.
- PyPy is fundamentally incompatible with C extensions. Oohpsie. Most BETSE dependencies (e.g., Matplotlib, NumPy, SciPy) are C extensions. This softcore facepalm renders PyPy useless for most real-world applications. Including ours. Why did they even bother?
Raw low-level hard-hittin' architecture-specific optimization comin' atcha.
Wat We Goin Do?
As a first-pass stab at the inky darkness of ignorance, profile-guided Numba optimization seems plausible. Numba provides a @jit decorator that may be conditionally applied to each callable to be compiled to machine code in a Just-in-Time (JIT) manner: e.g.,

    from numba import jit

    # This function will be JITted with Numba. Supposedly.
    @jit
    def get_muh_list_len(das_list: list) -> int:
        return len(das_list)
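The "profile-guided" half of that plan can be sketched with nothing but the standard library. The following is a minimal sketch, not BETSE code: slow_sum_of_squares(), run_simulation(), and top_callables() are hypothetical names invented for illustration, and Stats.get_stats_profile() assumes Python 3.9 or newer. It profiles an entry point and ranks callables by the time spent in their own bodies, which is roughly the shortlist of candidates for @jit.

```python
import cProfile
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately slow pure-Python loop standing in for a real bottleneck."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_simulation() -> int:
    """Hypothetical entry point that hits the bottleneck repeatedly."""
    return sum(slow_sum_of_squares(50_000) for _ in range(20))


def top_callables(func, limit: int = 5) -> list:
    """Profile func() and return the names of the costliest callables,
    ordered by time spent in their own bodies (most expensive first)."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    # Maps each function name to a record with tottime, cumtime, and so on.
    profiles = pstats.Stats(profiler).get_stats_profile().func_profiles
    ranked = sorted(
        profiles.items(), key=lambda item: item[1].tottime, reverse=True)
    return [name for name, _ in ranked[:limit]]


if __name__ == '__main__':
    # The first name printed is the most promising JIT candidate.
    print(top_callables(run_simulation))
```

Ranking by tottime rather than cumtime keeps thin wrappers like run_simulation() from outranking the loop that actually burns the cycles.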
LLVM: Bane of My JIT
The principal disadvantage of Numba, as I see it, is its use of LLVM. LLVM is awesome when amortized over long runtimes on large datasets. Nonetheless, LLVM incurs a decidedly not awesome up-front fixed time cost: importing Numba is slow, and each JITted callable is by default compiled the first time it is called. This cost is sufficiently bloated that, in the worst case, it could swamp any benefits of using Numba in the first place. This is the principal reason why Julia fails versus Python in numerous benchmarks.
This suggests that the @jit decorator should only be applied where benchmarks demonstrate tangible benefit to doing so.
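One way such a benchmark might look is sketched below. It is only a sketch: time_first_and_steady(), jit_pays_off(), and sum_of_squares() are hypothetical helpers, not part of BETSE or Numba. It times the first call (which, for a JITted callable, includes compilation) separately from later calls, then checks whether the JITted version wins over an assumed number of calls. The Numba part is skipped if Numba is not installed.

```python
import time


def time_first_and_steady(func, *args, repeat: int = 5):
    """Return (first_call_seconds, best_later_call_seconds) for func(*args).

    For a JITted callable, the first call includes compilation, so a large
    gap between the two numbers is the up-front cost in question.
    """
    start = time.perf_counter()
    func(*args)
    first = time.perf_counter() - start

    later = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        later.append(time.perf_counter() - start)
    return first, min(later)


def jit_pays_off(plain, jitted, *args, expected_calls: int) -> bool:
    """True if one slow first call plus the remaining fast calls of the
    JITted version cost less than expected_calls plain calls."""
    _, plain_steady = time_first_and_steady(plain, *args)
    jit_first, jit_steady = time_first_and_steady(jitted, *args)
    jit_total = jit_first + (expected_calls - 1) * jit_steady
    return jit_total < expected_calls * plain_steady


def sum_of_squares(n):
    """Toy numeric loop; the kind of code Numba tends to handle well."""
    total = 0
    for i in range(n):
        total += i * i
    return total


try:
    from numba import jit
except ImportError:
    jit = None  # Numba is optional; without it, nothing is compared.

if jit is not None:
    jitted = jit(nopython=True)(sum_of_squares)
    print(jit_pays_off(sum_of_squares, jitted, 100_000, expected_calls=1000))
```

The expected_calls parameter is the interesting knob: a callable invoked twice per run may never recoup its compilation cost, while one invoked every time step probably will.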
Optimizations Must Be Optional
Let's lay the Lawful Good down: optimization-specific third-party packages must never be imported as mandatory dependencies. This includes the numba package, the cython package, or whatever other packages are eventually imported to resolve this issue.
Optimization-specific third-party packages must only ever be conditionally imported as optional dependencies. BETSE should still work in the absence of these packages. In the case of Numba, this is probably achievable by defining a BETSE-specific @optimize decorator internally importing and applying Numba's @jit decorator if available and otherwise reducing to a noop: e.g.,

    from betse.util.py import modules
    # CallableTypes is assumed to be imported from BETSE's type utilities.

    # If Numba is available, define @optimize to JIT the decorated function.
    if modules.is_module('numba'):
        from numba import jit

        def optimize(func: CallableTypes) -> CallableTypes:
            return jit(func)  # it's really that simple, folks

    # Else, reduce @optimize to a noop.
    else:
        def optimize(func: CallableTypes) -> CallableTypes:
            return func
This machinery has the additional advantage of permitting additional optimizations to be transparently applied to bottleneck callables without modifying any code beyond the body of the @optimize decorator. This pleases the bearded mongoloid in me.