The Problem: All build systems suck.

Some may suck less, though. I'll point out some of the alternative build systems, and what they do that gets my goat.

Sucky mini-languages, and far too many of them.

But really, all of them have the same problem: I have to learn their sucky little language, which is often Turing-complete but almost always horribly hard to do anything useful with. And really, all I want to do is build something nicely. Why should I have to learn a language for that?

Why not just create a list of what commands to run? If I need programmability, why can't I create that list with a real language?

Well, there's always the easy route: have the build system just be a set of library calls in a real language, or otherwise embed some other language in the build tool. Popular tools that do this include cons with Perl, and scons with Python.

This does work. But I don't like it. The big problem is that these languages are too powerful. Didn't I just say that I wanted a real language? I do. But I want it sandboxed, and I want it easily checkable and auditable -- machine-checkable, even. I don't want to dig through this code to make sure that building something doesn't do something malicious. The current make-like languages are ugly, difficult to use, and, because they are Turing-complete, able to hide nasty arbitrary executions. Switching to a real language just makes them easier to use and less ugly. Well, except for Perl, which keeps it ugly. But if all we're doing is generating a list of commands to run, then we have something that is more like data than code.

Having to specify dependencies and targets.

Wouldn't it be nice if you could just feed it a set of commands, and have those commands be run in the proper order? No fooling around defining how to get the various files, defining templates, massaging file names, etc?

So, if we don't want to tell the build system what the dependencies are, and what generates what, it needs some way of figuring that out for itself. There are two basic methods it can use:

  1. ptrace -- use the kernel's debugging interface to have it tell us whenever a sub-process makes a system call.
  2. Shared library shim -- use LD_PRELOAD (or the platform's equivalent) to interpose on the C library calls that touch files.

Each has its advantages and disadvantages. The shared library shim will be much faster (and easier to port), but unable to handle statically linked programs, or those that are partially statically linked. Neither can handle set-uid programs.
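
To make the shim half concrete, here is a minimal sketch in C of what intercepting open() might look like, assuming a Linux-ish system with LD_PRELOAD and dlsym(RTLD_NEXT); the DEP_LOG_FD variable is made up here as the channel back to the build tool:

    /* depshim.c -- build with: gcc -shared -fPIC -o depshim.so depshim.c -ldl
     * Run as:  LD_PRELOAD=$PWD/depshim.so DEP_LOG_FD=3 cc -c foo.c 3>deps.log */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Log every file the command opens: reads become dependencies,
     * writes become targets. */
    int open(const char *path, int flags, ...)
    {
        static int (*real_open)(const char *, int, ...);
        int mode = 0;

        if (!real_open)
            real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");
        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, int);
            va_end(ap);
        }
        const char *fd = getenv("DEP_LOG_FD");   /* made-up plumbing */
        if (fd)
            dprintf(atoi(fd), "%s %s\n",
                    (flags & (O_WRONLY | O_RDWR)) ? "out" : "in", path);
        return real_open(path, flags, mode);
    }

A real version would also have to catch fopen(), stat(), the exec family, and friends, but the pattern is the same.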

There are also crazier methods like interpreting the binary.

Then what?

So, obviously, you run the program once, and it builds. You modify a source file, and it reruns everything that needs to be run (a breadth-first traversal of the dependency DAG).

Each time it runs a command, it squirrels the dependencies away somewhere. If a command fails (yes, this will break if commands lie about their exit status), it gets postponed, and any other commands that haven't run yet, or whose dependencies have changed, get run instead.

Isn't this too simple?

Probably. That's why I also want these features:

Multiple Build Configurations.

This would include adding VPATH type capabilities and putting all derived objects in a separate directory. Various switches to tweak the semantics might be necessary until I figure out what the right thing is. I don't know whether separate configurations would necessarily be separate files, or whether something like environment variable substitution would be acceptable. If it is, it must be taken into account in the next feature.

Lines in the build file are dependencies of the command to be run.

Pretty obvious. Change the command, and it needs to be rerun. Tagging by line number probably isn't quite good enough, so we'll really want to save full text somewhere.
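
So, as a sketch: save the full text (so you can show the user what changed), and keep a cheap hash of it for the did-this-line-change check. FNV-1a here, chosen only for brevity; the helper name is mine:

    #include <stdint.h>

    /* Hash the full command text; if the stored hash differs from the
     * current one, the command changed and must be rerun. */
    uint64_t command_fingerprint(const char *cmd)
    {
        uint64_t h = 1469598103934665603ULL;    /* FNV-1a offset basis */
        for (; *cmd; cmd++) {
            h ^= (unsigned char)*cmd;
            h *= 1099511628211ULL;              /* FNV-1a prime */
        }
        return h;
    }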

The build file can be topologically sorted after one successful run.

Really only useful if the build file isn't autogenerated. Or could it be made part of autogeneration? Hmm. This saves time when rebuilding after a "make clean" equivalent.

Heuristics for when to rerun commands.

It might need to try different orders the first time, possibly giving O(n^2) performance in the worst case as it looks for the next completable command. Heuristics are possible, of course: "hmm, line 5 failed after not being able to open 'foo/bar', and line 8 just created 'foo/bar'. Perhaps it is time to run line 5 again."
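
A sketch of that first-run sweep, assuming a hypothetical struct cmd with a done flag and a run_and_record() helper that runs a command, logs its dependencies, and returns nonzero on success:

    struct cmd { const char *text; int done; };

    int run_and_record(struct cmd *c);   /* hypothetical: run c->text, log deps */

    /* Keep sweeping the command list until a pass completes nothing
     * new. Worst case is n passes over n commands: O(n^2). */
    int build_first_run(struct cmd *cmds, int n)
    {
        int progress = 1;
        while (progress) {
            progress = 0;
            for (int i = 0; i < n; i++) {
                if (cmds[i].done)
                    continue;
                if (run_and_record(&cmds[i])) {
                    cmds[i].done = 1;
                    progress = 1;   /* may have unblocked a failed command */
                }
            }
        }
        for (int i = 0; i < n; i++)
            if (!cmds[i].done)
                return -1;   /* some command never became runnable */
        return 0;
    }

The 'foo/bar' heuristic above would replace the blind sweep with a check for which failed command's missing file just appeared.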

(Optional?) Fingerprints, to detect modification of files outside of the system.

Fingerprints can probably be as simple as (modtime, size), though adding a hash would probably be better.
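
The simple version is a couple of lines around stat(2); something like:

    #include <sys/stat.h>

    struct fingerprint {
        time_t mtime;
        off_t  size;
        /* a content hash here would also catch edits that happen to
         * preserve both size and mtime */
    };

    int fingerprint_file(const char *path, struct fingerprint *fp)
    {
        struct stat st;
        if (stat(path, &st) != 0)
            return -1;
        fp->mtime = st.st_mtime;
        fp->size  = st.st_size;
        return 0;
    }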

Being able to have multiple sub-build-files in sub-directories.

This lets the build system match the directory structure. It'd be nice if the paths could be automatically corrected.

Export script

It should be easy to take this list of commands and create a shell-script that runs everything in the right order. See also the tsorting feature. This means you can see what will be built.
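
Given the commands stored in the order of a successful run, the export is nearly trivial; a sketch:

    #include <stdio.h>

    /* Dump the recorded commands, in run order, as a shell script.
     * "set -e" stops the script at the first failing command, roughly
     * matching what the build tool itself would do. */
    void export_script(FILE *out, char *const *cmds, int n)
    {
        fprintf(out, "#!/bin/sh\nset -e\n");
        for (int i = 0; i < n; i++)
            fprintf(out, "%s\n", cmds[i]);
    }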

It would be nice if we could leverage this so that people without this build software could still build your code. Making this usable is going to be difficult though -- we really want to support configuration and build-environment checks.

Parallelization

It would be nice if it could figure out what can be run in parallel, though locking may need to be declarable.

Tricky Bits, unclear implementation details.

Piping and redirection are handleable, but will be tricky.

Both the ptrace and shared-library shims will have to follow forks.
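
On Linux, at least, the ptrace side is mostly a matter of asking; a fragment, assuming the child is already attached and stopped:

    #include <sys/ptrace.h>
    #include <sys/types.h>

    /* Ask the kernel to auto-attach to any children the traced process
     * creates via fork, vfork, or clone, so their file accesses get
     * traced too. */
    static long trace_children(pid_t child)
    {
        return ptrace(PTRACE_SETOPTIONS, child, 0,
                      PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK |
                      PTRACE_O_TRACECLONE);
    }

The shim gets forks for free, since fork() children inherit the loaded library; its real work is keeping the environment intact across exec() (see below).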

The output should also depend on the executable itself, and on any shared libraries it loads.

Running on a machine with a different set of binaries should be autodetected. Which means squirreling away fingerprints of the binaries when running them.

The shared-library shim runs in the client process, so it needs a way to talk back to the server. One idea is to pass the number of an open FD in the environment (command line?), protect it against close() or dup2(), and send messages back over it. The shim shouldn't need any instructions that change at runtime, so maybe those can go in the environment as well. Note: putenv() and friends also need to be overridden, to preserve the environment across exec().
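
Protecting the fd inside the shim might look like this (DEP_LOG_FD is the same made-up variable as in the earlier sketch):

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int log_fd(void)
    {
        const char *s = getenv("DEP_LOG_FD");
        return s ? atoi(s) : -1;
    }

    /* Don't let the traced program close our channel... */
    int close(int fd)
    {
        static int (*real_close)(int);
        if (!real_close)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
        if (fd == log_fd())
            return 0;               /* lie: pretend it worked */
        return real_close(fd);
    }

    /* ...or clobber it with dup2(). (Quietly moving our fd out of the
     * way first would be friendlier than failing.) */
    int dup2(int oldfd, int newfd)
    {
        static int (*real_dup2)(int, int);
        if (!real_dup2)
            real_dup2 = (int (*)(int, int))dlsym(RTLD_NEXT, "dup2");
        if (newfd == log_fd()) {
            errno = EBADF;
            return -1;
        }
        return real_dup2(oldfd, newfd);
    }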

Commands may decide what to do based on which source files they can see, so readdir() needs to be checked as a dependency as well. (It also needs to be overridden for VPATH functionality.)
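
The override itself is the same RTLD_NEXT trick again; the reporting helper here is hypothetical:

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <dlfcn.h>

    void log_dir_dependency(int dfd);   /* hypothetical: report via our fd */

    /* Record that the command listed this directory: if files are later
     * added to or removed from it, the command might behave differently
     * and must be rerun. */
    struct dirent *readdir(DIR *dirp)
    {
        static struct dirent *(*real_readdir)(DIR *);
        if (!real_readdir)
            real_readdir = (struct dirent *(*)(DIR *))
                           dlsym(RTLD_NEXT, "readdir");
        log_dir_dependency(dirfd(dirp));
        return real_readdir(dirp);
    }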

Environment variables can influence search paths, etc., and possibly quite a bit more. Perhaps start with an _empty_ environment, and save whatever gets put into it in the build list?
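
E.g., hand each command only the variables the build file explicitly grants, plus the shim's own plumbing; a sketch (the variable values here are made up):

    #include <unistd.h>

    /* Run a command under a minimal, explicit environment, so that any
     * variable the build depends on has to appear in the build file.
     * Call after fork(); execve() replaces the current process. */
    void run_clean(const char *cmd)
    {
        char *const argv[] = { "/bin/sh", "-c", (char *)cmd, NULL };
        char *const envp[] = {
            "PATH=/usr/bin:/bin",         /* granted by the build file */
            "LD_PRELOAD=./depshim.so",    /* the shim's own plumbing */
            "DEP_LOG_FD=3",
            NULL,
        };
        execve("/bin/sh", argv, envp);
    }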

This whole idea will probably interact poorly with programs such as hmake and javac (or its descendants like ant or javamake) that do their own dependency analysis.

Old yaccs will be a problem too: their output file name is always the same. The traditional workaround is to "mv" the output to the proper name, but then the dependencies should not be on the intermediate y.tab.{c,h}, but on the renamed files. Following rename() for newly created files works, I think, but this could rapidly get complicated. In any case, tools that try to be atomic will have the same issue.

This also brings up the question of locking. I want to be able to run in parallel, but multiple commands may use some resource (such as certain file-system names) that should be single-threaded.

How and where do I store state?

This still needs a good name.

Proposed plan of attack:

Start off with a wrapper that actually gives VPATH functionality to programs, plus another variable that says where to dump output files. This is a good way to start, IMO, because it provides a fairly small but useful part, and it tests out the different ways (shared library, ptrace, etc.) of modifying how a program interacts with the system.
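
The heart of that wrapper might be an open() override that, on ENOENT for a relative path, retries under each directory in a VPATH-style variable (BUILD_VPATH is made up; O_CREAT mode handling is omitted for brevity):

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* If open() fails with ENOENT on a relative path, retry it under
     * each colon-separated directory in BUILD_VPATH. */
    int open(const char *path, int flags, ...)
    {
        static int (*real_open)(const char *, int, ...);
        if (!real_open)
            real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

        int fd = real_open(path, flags, 0);
        if (fd >= 0 || errno != ENOENT || path[0] == '/')
            return fd;

        const char *vpath = getenv("BUILD_VPATH");
        if (!vpath)
            return fd;
        char *dirs = strdup(vpath);
        for (char *d = strtok(dirs, ":"); d; d = strtok(NULL, ":")) {
            char buf[4096];
            snprintf(buf, sizeof buf, "%s/%s", d, path);
            if ((fd = real_open(buf, flags, 0)) >= 0)
                break;
        }
        free(dirs);
        if (fd < 0)
            errno = ENOENT;
        return fd;
    }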

Follow with dependency/output dumper.

Build file parser that calls the others.

Squirrel away each line, and condition rerunning it on whether any ingredients are newer than the files it produces, whether it's a new line, or whether what it produces doesn't exist.

Then add DAG following, and other stuff.

I'll probably never actually get around to this though.

Other Software with automatic dependency analysis

Other Build Software

Ant's use of XML is an abomination. While it can technically compile things besides Java, it is so Java-oriented that doing so is a serious pain.

Cook is okay. The points the author of cook makes about recursive make are quite true, and he has thought a fair bit about supporting builds nicely. It's nice that cook puts the C include scanner in a separate tool, and thought about ways to have that information read in separately. This means the same basic strategy can be used for other languages (though nothing does includes quite like C) and for compilers with different options.

I haven't used jam, nor looked at it much. I dislike the fact that scanning for includes is built in, rather than put into a separate tool that interfaces nicely, like cook's use of c_incl in its default recipes.

"cons" is obviously out, because I don't want to write perl.

"scons" is the same thing, only python. I like python, but...

"tup" so far is the closest to what I want, and the author has thoroughly thought about the domain.

redo is inspired by some notes of djb, and essentially turns the build process inside-out. It's something like cons/scons for the shell.