# Machine Learning with Unix Pipes


*by [@bwasti](https://twitter.com/bwasti)*

****

I contribute to an open source programmer-focused machine learning library
called [Shumai](https://github.com/facebookresearch/shumai).
Recently, we added some basic `/dev/stdin` handling,
which makes it possible to compose standard Unix
utilities with machine learning on the command-line.

### Pipes `|` Refresher

<center>
<img src="https://i.imgur.com/5KgEoYL.gif" style="display:inline;width:480px;max-width:80%;"/>
</center>

[Unix pipes](https://en.wikipedia.org/wiki/Pipeline_(Unix)) are a surprisingly flexible and easy-to-use mechanism for inter-process communication.
For example, we can count how many three-letter words start with "a":

```bash
% cat /usr/share/dict/words | grep -e '^a..$' | wc -l
90
```
90 words!  But what's actually happening here?

- The **`|`** operator creates a pipe|ine, chaining the output of each command into
the input of the next
- **`cat`** (comes from ["con**cat**enate"](https://en.wikipedia.org/wiki/Cat_(Unix))) prints files out
and `/usr/share/dict/words` is a standard wordlist
- **`grep`** filters text line by line and "`^a..$`" is our [regex](https://en.wikipedia.org/wiki/Regular_expression)
for matching three-letter words that start with "a"
- **`wc`** is word count and the `-l` flag counts lines instead of words


And that's it!
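
As an aside, some of these stages can be fused: `grep` can read the file and count matching lines itself, so the same number falls out of a single command:

```bash
% grep -c '^a..$' /usr/share/dict/words
90
```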

Another thing to note is that pipes can be extremely quick.
They've been around for decades and have been a key tool in the toolbox of *many* people,
so they're pretty damn optimized.
A good way to measure the performance of a pipe is to use the `pv` utility:

```bash
% yes "hi" | pv > /dev/null
11.3GiB 0:00:02 [5.69GiB/s] [          <=>         ]
```
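
`pv` works as a pass-through meter anywhere in a pipeline. If you're curious how the number changes with a different writer, here's a variation to try; output omitted since it varies wildly by machine:

```bash
# write 4 GiB of zeros through the pipe in 1 MiB blocks (use bs=1m on macOS/BSD dd)
% dd if=/dev/zero bs=1M count=4096 2>/dev/null | pv > /dev/null
```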

A raw Unix pipe can hit nearly 6 GiB/s!
Flexibility, performance, and decades of documentation
are super compelling reasons to add
support for them in Shumai.

### Part 1: Multiply-Add

Now let's use these pipes for some machine learning!
To start, we'll "learn" something very simple:

$$
f(x) = m\cdot x + b
$$

To do so, we first express the function in code.
Parameters like `m` and `b` are typically randomly initialized,
but for now we'll hardcode them to 7 and 3 respectively.

```javascript
// learn.ts
import * as sm from '@shumai/shumai'

const m = sm.scalar(7).requireGrad().checkpoint()
const b = sm.scalar(3).requireGrad().checkpoint()	

export default function f(x) {
  return m.mul(x).add(b)
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
```

This is all functional code, but `checkpoint()` does some fancy
things to ensure that repeated invocations of this script
will cache the values learned.
As a result,
you'll see random files like `tensor_2304823423.fl` generated.
Delete them to reset training.
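
A one-liner for that, assuming the checkpoints all follow the `tensor_*.fl` naming pattern above:

```bash
# remove cached checkpoints so the next run re-initializes m and b
% rm -f tensor_*.fl
```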

To test this model out, we pipe a value into an invocation of `shumai infer`:

```bash
% echo 7 | shumai infer learn.ts
52:Float32
```

$7 \times 7 + 3 = 52$. Great!
Since we know the values of $m$ and $b$, this is expected.

What we just did is often called *inference* (which is an overloaded term,
apologies to Bayesians): predict a result based on an input.

We also want to train this model to work for some data we've collected.
Let's say we know an input of `4` should be `19` and `6` should be `25`.
Our current parameters don't capture that at all.

Training involves the creation of input/output pairs
and this can be done on the command-line too.
All we have to do is place a `|` separator between them
and use `shumai train`.

```bash
% echo '4 | 19' | shumai train learn.ts
143.99899291992188
% echo '6 | 25' | shumai train learn.ts
376.3590393066406
```

The [*loss*](https://en.wikipedia.org/wiki/Loss_function) is printed out after each run.
Loss is basically the cost of getting things wrong.
Since we're using mean squared error to measure loss,
huge numbers like this mean we haven't learned much yet.
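
You can check the first number by hand: with $m = 7$ and $b = 3$, the model predicts $f(4) = 7 \cdot 4 + 3 = 31$, and the squared error against the target of $19$ is $(31 - 19)^2 = 144$, which is essentially the loss printed above. The second loss comes out below a naive $(7 \cdot 6 + 3 - 25)^2 = 400$ because the first run already nudged $m$ and $b$ through the checkpoint files.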

So let's train it for longer!
First we make a dataset:

```bash
% echo '4 | 19' >> data.txt
% echo '6 | 25' >> data.txt
% cat data.txt
4 | 19
6 | 25
```

And then we train with it for 100,000 steps:

```bash
% yes "$(cat data.txt)" | head -n 100000 | shumai train learn.ts
0.11650670319795609
```

- The **`yes`** command repeats the input infinitely.
It was a cheeky program used to quickly (and dirtily) accept
the terms of installation scripts, which often prompted
things like `[Y/n]?`
- **`$(...)`** runs the command inside it and substitutes its output in place.
- **`head -n 100000`** will chop the infinite output of `yes` to
be only 100k lines long.

So, by piping this into `shumai train learn.ts`,
we've trained a model 100k steps.

To test it out, we use `shumai infer` yet again:

```bash
% echo 4 | shumai infer learn.ts
18.49370574951172:Float32
```

Woo!  Pretty close.

### Part 2: Actually Useful Stuff (Heterogeneous Load Balancing)


When would machine learning on the command-line 
of all places actually be useful?
I use the command-line to get things done, not play with
images and text generation!

Here's an example: we have a program that uses different amounts of
CPU based on its input.
We've got two machines, a slow one (cheap) and a fast one (expensive).
Our task is to run the program on the
fast machine only when it makes sense.


*For the sake
of this writeup, we'll pretend
the input is a single integer.
More realistically, the input could be a much longer
array of bytes, and the ideas shown below would still apply.
For ingesting arbitrary binary, here's how I might
convert it to a Shumai-readable Tensor:*

```bash
% od -td1 -An [file] | tr -ds '\n' ' ' | sed -e 's/^ */[/g' -e 's/ *$/]/g' | tr ' ' ','
```

- **`od -td1`** dumps the file's bytes as decimal numbers, one byte at a time, and `-An` removes the byte offsets.
- **`tr -ds '\n' ' '`** deletes the newlines and squashes the repeated spaces
- **`sed -e 's/^ */[/g' -e 's/ *$/]/g'`** removes the leading and trailing spaces as well as
wraps up the output with `[` and `]` (how Shumai takes in Tensors)
- The final **`tr ' ' ','`** comma separates everything
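
As a quick sanity check of that pipeline, the bytes of the string `hi` are 104 and 105:

```bash
% printf 'hi' > /tmp/example.bin
% od -td1 -An /tmp/example.bin | tr -ds '\n' ' ' | sed -e 's/^ */[/g' -e 's/ *$/]/g' | tr ' ' ','
[104,105]
```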

Back to our example.
Let's say our program `./prog` prints both the input and the amount
of CPU it used.  In a real-world setting we can filter for this information
with `grep` and `awk`, but let's assume it's the *only* thing printed
for now:

```bash
% ./prog 5
5 0.1234
```
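
(In the chattier real-world case, the `grep`/`awk` filtering mentioned above might look something like the sketch below; the `CPU:` log format here is entirely made up.)

```bash
# pull the input and CPU fields out of a hypothetical verbose line like
#   "run 5 finished, CPU: 0.1234"
% ./prog 5 | grep 'CPU:' | awk '{print $2, $NF}'
```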

Ideally, *before* we run the program,
we figure out which machine to
run `./prog` on.  It's not too bad if we get it wrong every so often, but it helps
if we consistently get it right.
This is the typical trade-off you'll face when applying machine learning
in the context of programming.  It's not magic sauce; it's just a convenient
way to avoid spending too much time on heuristics.

So, how do we make this happen?
We do what we did with the other example: feed
training data into a model of our own design.

Before building our model, we'll collect a number of datapoints and save them into a file.

```bash
% seq -f '%1.0f' 10 64 | xargs -L1 ./prog | sort -R > data.txt
```

- **`seq`** above prints out the numbers from 10 to 64,
and `-f '%1.0f'` formats them as floating point with no decimals.
- **`xargs -L1`** converts every line of the `seq` output into
an invocation of `./prog` with that line as the argument.
These two commands combined are really useful for collecting data
over sweeps of numeric inputs.
- **`sort -R`** randomly shuffles the data.
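
If the `seq` flags look unfamiliar, here's a tiny slice of what that first stage produces on its own:

```bash
% seq -f '%1.0f' 10 12
10
11
12
```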

Now that we've got our dataset `data.txt`, we should define a model.
In this case, we'll use a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP):

```javascript
// learn.ts
import * as sm from '@shumai/shumai'

const l0 = sm.module.linear(1, 64)
const l1 = sm.module.linear(64, 64)
const l2 = sm.module.linear(64, 1)


export default function f(x) {
  x = l0(x).relu()
  x = l1(x).relu()
  return l2(x).sigmoid()
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
```
So, what's going on in the code above?

We multiply our input by a learned weight into 64 hidden "neurons"
and then clip out all the values less than zero (setting them to zero).
We then do this again.
The process of clamping negative values to zero (the ReLU you see as `.relu()` in the code)
is extremely important, as it is a non-linear operation
that gives the neural network the [ability to learn
arbitrary functions](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (given the right number of neurons).
I like to think of it as giving the network
an ability to learn if-conditions.
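
Written out, ReLU is simply:

$$
\mathrm{ReLU}(x) = \max(0, x)
$$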

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/ReLU_and_GELU.svg/1920px-ReLU_and_GELU.svg.png" style="display:inline;width:480px;max-width:80%;"/>
</center>


Then, we add up all the values (again by a learned weight)
and smoothly clamp them between 0 and 1 (using [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function)).
We'll be predicting if the CPU used for a given input is
over (1) or under (0) a certain threshold, indicating which machine
we should use.
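
For reference, the sigmoid function is:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$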

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png" style="display:inline;width:480px;max-width:80%;"/>
</center>

And now, with our model and data in hand, we train it:

```bash
% yes "$(cat data.txt)" | head -n 10000 | awk '{print $1 "|" $2}'  | shumai train learn.ts
```
You might notice that this is nearly the same command we used to train the other model;
the only addition is an `awk` stage to reshape each line into the `input | output` format.
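
Note that this trains the model's sigmoid output directly against the raw CPU numbers from `data.txt`. If you'd rather train against a hard 0/1 label (matching the over/under-a-threshold framing above), one option is to fold the thresholding into the `awk` step; a sketch, assuming a cutoff of 0.5:

```bash
% yes "$(cat data.txt)" | head -n 10000 | awk '{print $1 "|" ($2 > 0.5 ? 1 : 0)}' | shumai train learn.ts
```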

Querying the model will look similar as well!

```bash
% echo 4 | shumai infer learn.ts
0.9044327:Float32
```
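
To close the loop, here's roughly what the routing itself could look like; a minimal sketch, where the host names, the input value, and the 0.5 cutoff are all assumptions:

```bash
#!/bin/bash
# Hypothetical routing script: host names, the input value, and the cutoff
# are made up for illustration.
input=42

# strip the ":Float32" suffix from shumai's output to get the raw score
score=$(echo "$input" | shumai infer learn.ts | cut -d: -f1)

# send the job to the expensive machine only when heavy CPU usage is predicted
if awk -v s="$score" 'BEGIN { exit !(s > 0.5) }'; then
  ssh fast-box "./prog $input"
else
  ssh cheap-box "./prog $input"
fi
```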
And we're done. 😁 
A full model trained on the command-line that can be used anywhere `stdin` is supported!
There are other ways to use (and even train) this model (including HTTP and soon WebSockets), but those
are out of scope for this writeup (hint: `shumai serve learn.ts` and then check out `127.0.0.1:3000/{forward,backward}`).

### Going Forward

The `/dev/stdin` API is currently a work in progress.  I haven't really documented it much and I'm mostly
looking for feedback on the idea itself.

More generally, the [Shumai](https://github.com/facebookresearch/shumai) project
is about two and a half months old and still experimenting with APIs and ideas.
A primary focus of the project is hackability.
By ditching conventional Python and using a JIT-compiled language with native async programming (JavaScript/TypeScript),
Shumai hopes to open up doors for ideas that can be implemented quickly and directly in the host language
rather than built out as a C extension.


If you'd like to learn more or get quick help, please check out our Discord: https://discord.gg/kXxWyMFQ
Documentation on the operators and API can be found here: https://facebookresearch.github.io/shumai/