Beating C with Dyalog APL: wc

Beating C with Dyalog APL: wc | wc.apl

wc.apl

wc utility in Dyalog APL

View on GitHub

Background

A recent blog post that bubbled up on r/programming entitled Beating C With 80 Lines Of Haskell: Wc inspired me to see how effective APL would be in solving this problem.

But Why APL?

Why not? I was told by older students during my university years to avoid APL and any internship or co-op opportunities that used it. Turns out that was bad advice, at least for me–I can see how it might not be other folks’ cup of tea, but it is definitely something I like. A few years back I had the opportunity to finally play with an APL derivative K (I got a job at a small K shop). I liked K, but had to move on job wise. I recently downloaded a free personal copy of Dyalog and have been playing around with it while reading through Mastering Dyalog APL by Bernard Legrand; I’m only a few chapters in so far. I find APL friendlier to use than K and enjoy it a great deal, so this seemed like a good excuse to figure out how to do file I/O while searching through the book’s PDF and flexing google-fu.

On with the code

Just like Chris Penner’s original article, I’m benchmarking against the OSX version of wc that shipped with my machine. Just like in the original article, I admit that there are likely faster versions of wc–I’m just comparing what I got.

Counting Words

Counting characters is trivial. Counting lines is almost as trivial, and becomes as trivial if we count CRLF as two lines–I chose this solution out of laziness. CRLF could be counted as one end of line with an additional three lines (warning: guesstimate).

Just as in the original post, the meat of the problem is in accurately counting words. The first step is counting the number of words in a single string. First, let’s make two definitions:

nl←⎕UCS¨10 11 12 13 ⍝ Newline characters<br>sp←nl,⎕UCS¨9 32 ⍝ Space characters

The above doesn’t look like “normal” programming languages because APL doesn’t use the standard ASCII character set for its symbols/terms. This means none of its names are stolen from you–if you name a variable new or this or function it won’t clash with anything. Note that it does use ASCII symbols like + or = but there are no keywords using ASCII numbers/letters.

An explanation of the above:

⍝ denotes a // style comment, it’s called lamp and it illuminates.

nl and sp are variables with nl being newline characters and sp whitespace characters. In APL, ← is assignment. Here we assign everything to the right of nl or sp to the variables nl and sp.

¨ will create a derived function which will map the function immediately to the left of ¨ across the value to the right. By immediately to the left I mean you read as little as possible to the left to figure out what the lefthand argument–that is, take the simple, immediate value to the left but do not compute anything to the left unless it’s an expression in parentheses–but the righthand argument is the result of evaluating the entire expression to the right, and this is how APL works in general–take as little as possible from the left but everything from the right.

⎕UCS is a function in Dyalog for getting unicode characters from numeric values.

The ravel function , will catenate its lefthand and righthand arguments, so sp contains the newlines in nl along with tab and space.

Now that we know what constitutes a newline or a space we can figure out how to get wc results. Assuming the input is stored in the variable s, we can even give it a concrete value for now: s←'I am the model of a modern major general'. To find where the spaces are we use the membership function ∊ with s∊sp, which yields

0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0

telling us the exact position of any whitespace characters. We can negate the vector (turn 1s into 0s and vice versa) and use the partition function ⊆ to collect the words into an array of strings with (~s∊sp)⊆s (~ is the negation function) and we can count the result using ≢(~s∊sp)⊆s. The way the partition function works is it uses the boolean vector to strip out parts of its righthand argument based on runs of 1 in its lefthand argument (thanks to u/olzd for pointing out ⊆ when I posted this article originally on r/programming).

We only have one piece of input in the above calculation and so it is easy to turn it into a function. In APL we can make a direct function by using braces (I won’t go into the differences in the two function types here but know that the book referenced above goes into some detail). So we can store this calculation in a variable with:

words←{≢(~⍵∊sp)⊆⍵}

In APL ⍺ (alpha) and ⍵ (omega) are the names of the left and right arguments to a direct function. We apply a function just like we do the built-in primitive functions; namely, we give it parameters: words s. One thing we can do is make words a bit more general and have it take any boolean vector and count the partition...

Beating C with Dyalog APL: wc

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits