Python as a Declarative Programming Language (2017)

Python as a Declarative Programming Language

Ben Frederickson

Blog

GitHub

Twitter

RSS

Blog

Python as a Declarative Programming Language

If you look at the programming languages benchmarks game, Python is one of the slowest commonly used programming languages out there. Typical programs written in pure Python average around 40 times slower than the equivalent program written in C or C++.

Despite the performance penalty, Python is still probably the most popular language choice out there for doing Data Analysis and Machine Learning. Most of the recent Deep Learning frameworks target Python for development: TensorFlow, Theano, and Keras all use Python. Torch originally was written for Lua, which is substantially faster than Python when using LuaJIT - but Torch failed to gain traction until switching to Python with the release of PyTorch.

The reason for this is that the performance penalty in writing programs in Python isn’t as large as the programming language benchmarks game would suggest: Most of the best Python Data libraries have their core routines written as native extensions.

This all means that to get the most out of these libraries, you need to treat Python as a Declarative Language - and push as much control flow as possible down to a native layer, and just let the Python program describe what needs done.

Declarative Programming Languages

Declarative Programming Languages focus on on describing what should be computed - and avoid mentioning how that computation should be performed. In practice this means avoiding expressions of control flow: loops and conditional statements are removed and replaced with higher level constructs that describe the logic of what needs to be computed.

The usual example of a declarative programming language is SQL. It lets you define what data you want computed - and translates that efficiently onto the database schema. This lets you avoid having to specify details of how to execute the query, and instead lets the query optimizer figure out the best index and query plan on a case by case basis.

Python isn’t a pure Declarative Language - but the same flexibility that contributes to its sluggish speed can be be leveraged to create Domain Specific API’s that use the same principles. I thought it would be kind of interesting to look at a couple specific examples of how this plays out.

NumPy

The design of NumPy has a couple neat declarative programming features, that lead to code that is not only cleaner and easier to understand - but also substantially faster.

A simple example of this might be to apply TF-IDF weighting to a sparse matrix. I originally needed to do this in order to add some basic nearest neighbour recommendation code as a baseline for my implicit recommendation library, and I thought it would provide a good example of what I’m talking about here.

In an imperative style, TF-IDF weighting on a sparse matrix can be written like:

def tfidf_imperative(m): # count up unique occurrences of each column of the sparse matrix X = coo_matrix(m) df = bincount(X.col)

# calculate inverse-document-frequency = log(N/df) N = float(X.shape[0]) idf = zeros(X.shape[1]) for i in range(X.shape[1]): idf[i] = log(N / (1.0 + df[i]))

# adjust data by TF-IDF weighting X.data = X.data.copy() for i in range(X.nnz): X.data[i] = sqrt(X.data[i]) * idf[X.col[i]]

return X

There are 2 different for loops in that code - and each of which can be replaced by different NumPy language constructs.

The first for loop calculates the IDF itself. Numpy lets you do almost all operations on arrays of values as well as on single values and its much faster to use the vectorized form: idf = log(N / (1.0 + bincount(X.col))).

This tells NumPy to loop over the array returned from the bincount(X.col) function, and create a new array with the IDF value transformed appropriately. In effect we’re telling NumPy what result we want and letting it figure out how to calculate it best itself.

The final TF-IDF weighting can be done in a similar fashion using NumPy’s array indexing feature. This feature lets you use one array as an index into another array. Going idf[X.col] looks up each column value from X in IDF and returns an array with the IDF weight for that column.

Putting it all together leads to code like:

def tfidf_declarative(m): X = coo_matrix(m)

# calculate IDF N = float(X.shape[0]) idf = log(N / (1 + bincount(X.col)))

# apply TF-IDF adjustment X.data = sqrt(X.data) * idf[X.col] return X

Not only is the declarative version shorter and more readable, its also substantially faster. By removing the for loops, the iteration can happen in vectorized C calls and makes this code run 75 times faster than the imperative version on my laptop. By avoiding writing any loops, we’ve declared the operations that need to happen and let the NumPy translate that to control flow.

TensorFlow

One frequently asked question is why Deep Learning frameworks like TensorFlow are written...

Python as a Declarative Programming Language (2017)

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine