Python as a Declarative Programming Language
Ben Frederickson
Blog
GitHub
RSS
Blog
Python as a Declarative Programming Language
If you look at the programming languages benchmarks game, Python is one of the slowest commonly used<br>programming languages out<br>there. Typical programs<br>written in pure Python average around 40 times slower than the equivalent program written in C or<br>C++.
Despite the performance penalty, Python is still probably the most popular language choice out there for doing Data<br>Analysis and Machine Learning. Most of the recent Deep Learning frameworks target Python for<br>development: TensorFlow, Theano, and Keras all use Python. Torch originally was written for Lua, which<br>is substantially faster than Python when using LuaJIT - but Torch failed to gain<br>traction until switching to Python with the release of PyTorch.
The reason for this is that the performance penalty in writing programs in Python isn’t as large<br>as the programming language benchmarks game would suggest: Most of the best Python Data libraries<br>have their core routines written as native extensions.
This all means that to get the most out of these libraries, you need to treat Python as a<br>Declarative Language - and push as much control flow as possible down to a native layer, and just<br>let the Python program describe what needs done.
Declarative Programming Languages
Declarative Programming Languages focus<br>on on describing what should be computed - and avoid<br>mentioning how that computation should be performed. In practice this means avoiding expressions of control<br>flow: loops and conditional statements are removed and replaced with higher level constructs<br>that describe the logic of what needs to be computed.
The usual example of a declarative programming language is SQL. It lets you define what data you<br>want computed - and translates that efficiently onto the database schema. This lets you avoid<br>having to specify details of how to execute the query, and instead lets the query optimizer figure out the best index and<br>query plan on a case by case basis.
Python isn’t a pure Declarative Language - but the same flexibility that contributes to its<br>sluggish speed can be<br>be leveraged to create Domain Specific API’s that use the same principles. I thought it would<br>be kind of interesting to look at a couple specific examples of how this plays out.
NumPy
The design of NumPy has a couple neat declarative programming features,<br>that lead to code that is not only cleaner and easier to understand - but also substantially<br>faster.
A simple example of this might be to apply TF-IDF weighting to a sparse matrix.<br>I originally needed to do this in order to add some basic nearest neighbour recommendation<br>code as a baseline for my implicit recommendation<br>library, and I thought it would provide a good example of<br>what I’m talking about here.
In an imperative style, TF-IDF weighting on a sparse matrix can be written like:
def tfidf_imperative(m):<br># count up unique occurrences of each column of the sparse matrix<br>X = coo_matrix(m)<br>df = bincount(X.col)
# calculate inverse-document-frequency = log(N/df)<br>N = float(X.shape[0])<br>idf = zeros(X.shape[1])<br>for i in range(X.shape[1]):<br>idf[i] = log(N / (1.0 + df[i]))
# adjust data by TF-IDF weighting<br>X.data = X.data.copy()<br>for i in range(X.nnz):<br>X.data[i] = sqrt(X.data[i]) * idf[X.col[i]]
return X
There are 2 different for loops in that code - and each of which can be replaced by different<br>NumPy language constructs.
The first for loop calculates the IDF itself. Numpy lets you do almost all operations on arrays of values as well as on single values<br>and its much faster to use the vectorized form: idf = log(N / (1.0 +<br>bincount(X.col))).
This tells NumPy to loop over the array returned from the bincount(X.col) function, and create a new array with the<br>IDF value transformed appropriately. In effect we’re telling NumPy what result we want and letting<br>it figure out how to calculate it best itself.
The final TF-IDF weighting can be done in a similar fashion using NumPy’s array<br>indexing feature. This feature lets<br>you use one array as an index into another array. Going idf[X.col] looks up each column value from X in IDF and returns an<br>array with the IDF weight for that column.
Putting it all together leads to code like:
def tfidf_declarative(m):<br>X = coo_matrix(m)
# calculate IDF<br>N = float(X.shape[0])<br>idf = log(N / (1 + bincount(X.col)))
# apply TF-IDF adjustment<br>X.data = sqrt(X.data) * idf[X.col]<br>return X
Not only is the declarative version shorter and more readable, its also substantially faster. By removing the for loops, the iteration can happen<br>in vectorized C calls and makes this code run 75 times faster than the imperative version<br>on my laptop. By avoiding writing any loops, we’ve declared the operations that need to happen<br>and let the NumPy translate that to control flow.
TensorFlow
One frequently asked question is why Deep Learning frameworks like TensorFlow are written...