This document has the original motivation and some general-purpose arm-waving. The intent is that while the other documentation has specific “how to do this” or “how does this work” information, this is a place for bigger-picture thoughts about design: more general problems, along with solutions which seem to fit them well, and those which don’t.

motivation

I was motivated to create a music sequencer because of what I saw as a gap in the capabilities of existing software.

As I see it, there are basically three categories of software systems for music writing: the mainstream MIDI sequencer or DAW (examples are Cubase, Logic, Ardour, Ableton Live), staff notation programs (Finale, Sibelius, Lilypond), and language-oriented systems (csound, CLM, Supercollider). The DAWs and staff notation programs are standardized, monolithic, and more or less equivalent to each other (lilypond is the exception, also being in the language camp), but the language systems are much more variable. Some focus on sound synthesis, some on score representation, and some on real-time synthesis or note production. This last category in particular (“livecoding”) has become popular recently; examples are Tidal, Extempore, and Overtone.

For a long time, I’ve been involved with Javanese and Balinese music, and none of the above categories seemed suitable. The mainstream DAWs only really support 12 tone equal tempered tuning, and anything outside that is distinctly second class. MIDI synthesizers can be retuned, but it’s awkward, not standardized, and the sequencer doesn’t understand it. Beyond that, these kinds of music (as, it seems, are most kinds of music around the world) are based on a few core melodies from which other parts are derived with greater or lesser freedom. A DAW of course doesn’t understand any of this, and since it is monolithic and not programmable, it can’t be made to understand any of it.

The notation programs are also out, primarily because I’m hoping to generate sound in addition to (or instead of) scores, but also because staff notation is just not suitable for the kind of music I want to write. It can be forced, but it’s awkward at best, omitting or obscuring important details and cluttering the page with irrelevant or misleading ones. The point of this kind of notation is to give it to a musician who will understand it, so the further you deviate from standard practice the less useful it becomes. You could even make a case that staff notation is no longer that suitable for a lot of European-style music of the last hundred years or so, let alone other kinds, but the difficulty of spelling reform shows how strong and valuable a widely recognized convention is.

The various music programming languages could theoretically support the scales and proper relationships of core melody to derived parts. In practice they only support what the original authors wrote libraries for, which is generally not even notes and scales, let alone the higher level stuff. For instance, some will claim that they support “all scales”, when they really mean you can give pitch names arbitrary frequencies and do some modular arithmetic on them. Real scales are more complicated than that, and reach deep into the music theory of whatever tradition they’re embedded in. But the main issue I have with the languages is that they’re text-only, and generally expect you to edit with a standard text editor, which of course has no support for writing music as distinguished from any other kind of text. What this means is entering note lists by hand, and writing times and durations as numbers. It’s hard to write and read that way, especially if there are multiple parts which are aligned in time. To be sure it can be done, and you can look at a large lilypond score to see how it can work, but there’s a reason that almost every kind of historical notation represents time spatially. So it’s also no coincidence that the various languages tend to be used for algorithmic music or sound design, neither of which need note lists.

Later I found that every other kind of music I encountered was not expressible in existing software, for all the same reasons. But even for equal tempered European-style music, mainstream sequencers are not very good. When I was writing for small ensembles using physical models driven by the Cubase sequencer, even the very modest amount of breath pressure data would quickly overwhelm Cubase’s capabilities, leading to laggy editing. The amount of data is tiny compared to audio, so the problem is merely that the program was not designed to expect any significant amount of control data. Aside from the performance problems, the editing is primitive. You can draw curves by hand (if your hand is steady), or enter basic shapes, and align them by zooming in and squinting. Editing pitches and durations is also low level and imprecise. The design clearly expects you to record a live performance, with minimal editing afterwards. So if you have a MIDI keyboard and are a skilled keyboardist, you can record keyboard music, or if you have a wind controller and can play it well, you can record woodwind music. Add to that the fact that quite a lot of European-style music is not actually equal tempered, and the situation seems hopeless.1

Of course what people do in practice is that they mostly don’t try to do those things. They develop new kinds of music for computers and electronics, which make use of their strengths and de-emphasize the weaknesses above. This is the same sort of thing that has happened with every other instrument throughout history, so it’s completely reasonable. But, in my eyes, it doesn’t live up to the promise that computers should be able to fulfill: that they should be vehicles for the imagination, rather than another mold into which you can pour your creativity. I would like to write the music in my head, and that music is expressive and high level. I don’t see why electronic instruments can’t be as expressive as a skilled soloist, and I hope they can be some day, because I don’t have enough time to commit the 20 or so years of practice necessary to get acceptable expression from an instrument, multiplied by every instrument I’d like to use.

relationship with Nyquist

At the lowest level, Karya implements some ideas from the Nyquist language as a Haskell library. Nyquist operates at the sound level, in that its “notes” are functions that emit streams of samples. Karya transposes that up a level, in that its notes are functions that emit streams of Note records, but aside from that, the ideas are similar. Nyquist is described in detail at https://www.cs.cmu.edu/~music/nyquist/ and in its articles in CMJ vol. 21 #3.

The Karya-specific Note (with a capital N to refer to that specific data type) is in essence a triple of a start time, a duration, and a mapping from symbolic control names to continuous functions. I use “note” with a small “n” to refer to the functions that return Notes (or sounds, when discussing Nyquist). Sometimes I use the word “notation”, meaning the same thing. The important thing to remember is that these all refer to a plain function which takes an environment structure (described below) to a stream of Notes. When I refer to a “performer” or a “backend”, I mean the next stage in the process, which is something that transforms a stream of those Note records into something else. It might be MIDI and the OS-level MIDI scheduler, or a Lilypond code generator, or a standalone FAUST synthesizer.
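
To make that concrete, here is a rough sketch of the shape of those types in Haskell. The names and details are illustrative, not Karya’s actual definitions:

    import qualified Data.Map as Map

    type Time = Double             -- real time, in seconds
    type Control = String          -- a symbolic control name, e.g. "dyn"
    type Signal = Time -> Double   -- a continuous function of time

    -- The Note record: a start, a duration, and named control signals.
    data Note = Note
        { noteStart    :: Time
        , noteDuration :: Time
        , noteControls :: Map.Map Control Signal
        }

    -- The dynamic environment, elaborated in later sketches.
    data Environ = Environ

    -- A "note" with a small "n": a function from the environment to a
    -- stream of Notes.  A performer or backend then turns [Note] into
    -- MIDI, Lilypond code, and so on.
    type NoteFunction = Environ -> [Note]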

Nyquist has a concept it calls “behavioural abstraction.” In Nyquist, a note is a function that returns a sound. Instead of changing the volume or transposition by modifying the sound, you set values in a dynamically scoped environment. The note function is free to interpret those values as it sees fit. A function that implements an oscillator could map volume to amplitude, one that implements a bowed string model could map it to bow pressure, one that implements a string section could map it to the number of doubled instruments, and one that implements an entire score could leave the interpretation to the various sections, phrases, instruments, notes, and oscillators, all the way down the stack. This concept is also extended to the treatment of time, which is implemented as a function from score time to real time, called the “warp.” So a note is given a starting time and duration by shifting and stretching the warp, and you can repeat the same sound at different times by calling it once with a warp that maps score time 0 to real time 0, and then again with a warp that maps score time 0 to real time 4. Similar to the volume control, the note function can use the warp as it sees fit, and is not required to always start at score time 0. For instance, grace notes may give themselves a constant duration, while a trill may add cycles at a constant speed in order to fill the time given, instead of changing the cycle duration. This works because any bit of code has access to the score→real warp function to place itself, but also to the inverse real→score unwarp function, so it can calculate the score equivalent of 1 second of real time at a given score location.
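
A sketch of the warp idea, continuing the illustrative types above (again, these are not Nyquist’s or Karya’s actual definitions). The key point is that both the warp and its inverse are in scope, so notation can convert in either direction:

    type ScoreTime = Double
    type RealTime = Double

    -- The warp maps score time to real time; the inverse goes back.
    data Warp = Warp
        { warp   :: ScoreTime -> RealTime
        , unwarp :: RealTime -> ScoreTime
        }

    -- Placing a note means evaluating it under a shifted and stretched warp.
    shift :: ScoreTime -> Warp -> Warp
    shift t (Warp w uw) = Warp (\s -> w (s + t)) (\r -> uw r - t)

    stretch :: ScoreTime -> Warp -> Warp
    stretch factor (Warp w uw) = Warp (\s -> w (s * factor)) (\r -> uw r / factor)

    -- A grace note or trill can ask "how much score time is 1 second of
    -- real time at score position s?" by going through the inverse.
    oneRealSecond :: Warp -> ScoreTime -> ScoreTime
    oneRealSecond (Warp w uw) s = uw (w s + 1) - s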

In Karya, I extended Nyquist’s notion of a dynamic environment to the namespace of functions available. This means that an instrument can modify the function namespace to insert its specific kinds of notation, or override existing functions with conventional names, such as a more instrumentally appropriate “tr” trill. In fact, at the score level, that’s all an instrument is: just a thing that brings certain functions into scope. In the same way, a scale is a thing that brings pitch-producing functions into scope. Of course there is also notation that modifies the meaning of notation under it. An example of where this comes up is notation that redefines what a “plain note” is, to change the behaviour of nested notation. But if its new “plain” note itself wants to generate some plain notes, it will have to restore the original definition to avoid infinite recursion.
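
Schematically, the environment carries a map from call names to note functions, and an instrument or scale is just a function that edits that map. The names below are illustrative, refining the Environ placeholder from the earlier sketch:

    import qualified Data.Map as Map

    type CallName = String
    type Scope = Map.Map CallName NoteFunction  -- NoteFunction as sketched above

    data Environ = Environ
        { envScope :: Scope
        -- ... plus controls, attributes, the warp, and so on
        }

    -- An instrument (or a scale) is just a thing that brings calls into scope.
    type Instrument = Environ -> Environ

    violin :: Instrument
    violin env = env { envScope = Map.insert "tr" violinTrill (envScope env) }
      where
        violinTrill :: NoteFunction
        violinTrill _env = []   -- a stand-in for an instrument-specific trill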

Since all of these parameters are passed implicitly in a dynamic environment, even if a bit of notation doesn’t understand a particular value, it will still pass it to its dynamically scoped children, which may. For instance, you can wrap a whole section in a pizz request (implemented by adding a pizz symbol to a set of attributes in scope), and instruments that understand it will change their behaviour… hopefully in a way that the composer will agree is consistent with the word “pizz”!

The only thing special about the “note” functions that make up a score as opposed to any other kind of function is that they all take the same argument type, a dynamic environment record, and all return the same type of result—sounds for Nyquist, or Notes for Karya (actually Karya implements three families: ones that return Notes, ones that return signal samples, and ones that return pitch signal samples, but let’s ignore that for the moment). This uniformity means that they have the closure property. What this means in practice is that a note is syntactically the same as a phrase, which is the same as a whole piece, and they can all be transformed by the same functions. To continue the example above, a pizz marker can be applied to a note or an entire movement, or a triplet can be formed from notes or phrases or other triplets, all nested arbitrarily.
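
In the sketched types, the closure property is simply the fact that every transformer has the same shape, so it applies equally to a single note, a phrase, or a whole piece. A minimal illustration, continuing the sketches above:

    -- Both a single note and a whole piece are NoteFunctions, so one
    -- transformer type covers all of them.
    type Transformer = NoteFunction -> NoteFunction

    pizz :: Transformer
    pizz deriver = \env -> deriver (withAttribute "pizz" env)

    -- Sequencing also works on anything of the same type, so a triplet can
    -- be built from notes, phrases, or other triplets.
    triplet :: NoteFunction -> NoteFunction -> NoteFunction -> NoteFunction
    triplet a b c = \env -> concatMap ($ env) [a, b, c]
        -- a real triplet would also stretch each part via the warp

    withAttribute :: String -> Environ -> Environ
    withAttribute _attr env = env   -- stand-in: a real one would tag the environ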

As implied above with the pizz example, beyond the basic mechanics of calling note functions and merging their results, all of the various ways to control them are essentially symbolic and conventional. That is, there is nothing special about a “pizz”, except that it is a symbol that you might put into the environment, and what anyone does when they see it is just down to what they choose to do. In the same way, the pitch signal is a pitch signal because it’s named “pitch” and all the bits of notation will look for something with that name. I think this lack of a direct link between a name for a behaviour and the behaviour itself is essential for a generalized notation. It ensures that the vocabulary is open (we don’t want to be stuck with a hardcoded list of articulations), and that the result it implies is also open (music and instruments are so variable that we don’t want to be stuck with having to always interpret the same symbol as the same action). The downside is that if you write “pazz” by accident, no one can tell you for sure that it’s nonsense, and if two instruments interpret “pizz” in confusingly different ways, you may get confused by the result. In practice, this latter form of confusion is more likely to arise from the names of functions themselves, e.g. one convention may decide that an “m” annotation means “play muted”, while a drum might already have a stroke named “m”.

code vs data

Since the score is simply an executable program, on the axis between code and data this is a fundamentally code-oriented representation of notation. The essence of this trade-off is that code is opaque: flexible but not easily inspected, while data is transparent: easily inspected and analyzed but generally not extensible. This is a fundamental trade-off, in that the opacity of a function is what gives it its implementation flexibility. For a general purpose notation, I think the flexibility is worth the hassle.2 Haskore, and its descendant Euterpea (http://haskell.cs.yale.edu/euterpea/), is an example of a system on the data side; from the beginning it has been built on a single Music data type. It derives much of its power from the ability to analyze and manipulate Music values, but Music itself is not extensible. For instance, from the beginning pitch has been an integer interpreted as an equal tempered pitch, ornaments are limited to a hardcoded list, and even instruments are limited to the 128 instruments of the General MIDI list. Of course code vs. data is not a binary distinction. Euterpea can and has added various means of extension, such as a CustomInstrument escape hatch, or generalizing Music with a type parameter. But there’s no free lunch, and the power of analysis decreases as the generalization increases.
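
For contrast, here is a caricature of the data-oriented approach, in the spirit of (though not identical to) Haskore’s Music type. The transparency is real, in that analyses like total duration are simple folds, but every dimension (pitch representation, ornaments, instruments) is closed:

    -- A caricature of a data-oriented score type, not Haskore's actual one.
    type Dur = Rational
    type Pitch = Int                               -- an equal tempered pitch number
    data InstrumentName = Piano | Violin | Flute   -- ... a closed list

    data Music
        = Note Pitch Dur
        | Rest Dur
        | Music :+: Music               -- sequential composition
        | Music :=: Music               -- parallel composition
        | Tempo Rational Music
        | Instr InstrumentName Music

    -- Transparent: easy to inspect and analyze...
    duration :: Music -> Dur
    duration m = case m of
        Note _ d   -> d
        Rest d     -> d
        m1 :+: m2  -> duration m1 + duration m2
        m1 :=: m2  -> max (duration m1) (duration m2)
        Tempo t m1 -> duration m1 / t
        Instr _ m1 -> duration m1
    -- ...but closed: a new kind of pitch, ornament, or instrument means
    -- changing Music and every function that consumes it.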

The music-suite library (https://github.com/music-suite) is an example of a system similar in spirit to Haskore but emphasizing generality and extensible data structures. It’s somewhat generalized beyond my ability to understand it, but as far as I can tell, at the very bottom pitch still reduces down to (pitch class, sharps/flats, octave), which is not able to represent the pengumbang / pengisep distinction in Balinese scales, and is ambiguous for Carnatic ragams with differing arohana and avarohana.

Karya itself has data-oriented sublanguages (notably one oriented around rhythmic structures, specialized to notate Carnatic solkattu), but these are for specific kinds of notation, and all render down to the common notation. But even in a restricted domain it’s hard to design a data type that encompasses everything you may want to write!

karya’s implementation

This section assumes knowledge of some terms from functional programming. There’s plenty of material online describing them, but the important distinction is between lazy and eager evaluation. A simplified explanation is that in eager evaluation, a function’s arguments are evaluated before the function is called, while in lazy evaluation the arguments are evaluated only when (or if) they reach something that actually requires them, which winds up being something low level, like printing text to the terminal or handing MIDI to the MIDI driver. Since data structures in lazy languages are also lazy, laziness goes together with incremental evaluation, which suits music insofar as music is something that unfolds over time.
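
For example, with lazy lists an “infinite” score is unproblematic as long as the consumer only demands a finite prefix, which is the essence of incremental evaluation. Reusing the Note sketch from earlier:

    -- An endless stream of metronome ticks; nothing is computed up front.
    metronome :: [Note]
    metronome = [Note (fromIntegral beat) 0.1 Map.empty | beat <- [0 :: Int ..]]

    -- Only the Notes in the first minute are ever evaluated.
    firstMinute :: [Note]
    firstMinute = takeWhile ((< 60) . noteStart) metronome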

Karya implements something like Nyquist’s system, except where Nyquist deals with sounds and samples, Karya deals with Note records, which are then interpreted by a backend to become either MIDI, or Lilypond, or control signals for FAUST instruments, or whatever else.

In many ways, Haskell is a more suitable host language for Nyquist’s ideas than its original XLisp implementation: laziness comes for free, and the dynamically scoped environment maps naturally onto a Reader-style implicit argument.3

orchestra and score

Another thing Karya inherits from Nyquist is that it has a single unified evaluation step, without an orchestra / score phase distinction.

In the discussion below, I’ll talk about an orchestra or score phase, or score time vs. real time, or score vs. player. It’s important to note that this “real time” or “player” phase is not the same as the “performer” I mentioned above. In Karya, the performer is yet another phase, which occurs after all the Notes are produced, and is not part of the discussion in this section.

The “orchestra / score” terminology comes from csound (and perhaps the “Music N” family before that). Csound has two separate languages, one to describe the instruments and one to drive them. More importantly, it also strictly enforces two separate phases, analogous to compile time vs. runtime. This means that score level notation can’t talk about samples, so e.g. while you could talk about reversing notes, you couldn’t reverse the actual sounds that the notes produce. Even something simple like adding reverb requires awkward hacks. Modern csound has various ways to work around this, so it’s not a hard distinction, but the phase division is very much alive not just in csound, but in more modern systems like supercollider or PD, and of course standalone synthesizers and MIDI are yet another manifestation.

For example, in most systems the relatively simple request to put a certain kind of reverb on a single note would require setting up an orchestra level reverb, configuring a special channel for it, then configuring the score so only that one note goes on that channel, and then coordinating the communication so the score turns all the right knobs on the right reverb. Lots of faffing about with plumbing MIDI CC numbers and audio channels and what have you. Wanting to customize the reverb for each note individually would be crazy and no one would do that… and yet, purely score level transformations like transposing each note differently are completely reasonable. So while you can easily express some music with different transpositions, e.g. transpose +1 phrase + transpose +2 phrase, good luck interleaving score level and sample level, e.g. reverse (transpose +1 (reverb phrase)).

Since Karya works entirely at the score level, not sound, it seems silly to talk about it not having an orchestra / score distinction. It’s only really true by analogy, if we lift sound samples up to notes and notes up to high level score notation. And in fact this score / note distinction exists all over again in many scoring systems. For example, Haskore has one language for describing notation, and a separate system of “players” which render the nested score notation down to a flat sequence of notes. Similar to the way a csound score can’t work with sample level synthesis, the note level output of a Haskore player can’t become notation again. So in the same way that csound is a one way notes → sound → speaker pipeline, Haskore is a one way trip: score → notes → sound → speaker. I don’t know if there’s a standard name for this distinction, so I’ll call it a “two phase” system. Of course there are more than just two phases involved in the whole journey from score to sound, but I’m only talking about the top two.

For instance, if you have two separate players that realize staccato in different ways, you would have to put a special key in the score saying which player to use at which point, and then build logic into the players to swap out when they see the key. This is the equivalent of all of that MIDI CC and audio channel plumbing hassle.

Karya, on the other hand, can interleave realization and score, and does so extensively. In Karya, a function transforming notation (which, remember, is itself just a function returning Notes) can call its transformee with an altered environment (which corresponds to passing different arguments to it), or it can evaluate the transformee to a Note stream and transform that stream. The former corresponds to a score transformation in any music language; the latter corresponds to directly transforming the samples produced by the score. A delay implemented with the former technique happens in score time, in that the delayed music conforms to whatever tempo changes are in scope. A delay implemented with the latter technique is “flat”, in that it delays by a certain number of seconds regardless of the tempo. A two-phase system like Haskore would be able to do the same thing, but the flat-time delay would have to be at the player level, so you couldn’t then integrate it into other score-level constructions: it couldn’t be part of a phrase that is then used as an argument to something which builds a larger structure. I call the latter technique a “postproc” in the general case, since it is a kind of post-processing on the note stream, but I will call it a “realize call” for the specific examples below, where it’s a postproc cooperating with a corresponding bit of notation.
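
Continuing the earlier sketches (and assuming, just for illustration, an Environ that carries the warp), the two techniques look something like this:

    -- Assume, for this sketch, an Environ that carries the warp.
    data Environ = Environ { envWarp :: Warp {- , ... -} }

    -- Score-time delay: alter the environment the transformee sees, so the
    -- delay goes through the warp and follows any tempo changes in scope.
    delayScore :: ScoreTime -> (Environ -> [Note]) -> (Environ -> [Note])
    delayScore t deriver env = deriver (env { envWarp = shift t (envWarp env) })

    -- Flat delay, a "postproc": evaluate to Notes first, then move them by
    -- a fixed number of seconds regardless of tempo.
    delayFlat :: RealTime -> (Environ -> [Note]) -> (Environ -> [Note])
    delayFlat secs deriver env =
        [ note { noteStart = noteStart note + secs } | note <- deriver env ]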

It turns out that the Karya implementation for more complicated kinds of notation begins to resemble the two-phase approach. I’ll illustrate with an example:

There is a fundamental tension between nested symbolic score and linear notes, in that some things can only be expressed at one level or the other. For instance, ngoret is a grace note linking two pitches, whose pitch depends on its neighbors, and which also implies a certain damping technique that usually winds up lengthening the previous note. So while the notation exists at the score level, and the timing is possibly defined in score time, the pitch is dependent on both the previous and following pitches. Since the same bit of score can appear in multiple contexts, there may be multiple possible previous or next pitches. Consider the last note before a repeat: you can only know the next note once you have flattened the repeat. So this is a problem where some bits of information are only available at score time, and some are only available at the Note level.

My solution is to split ngoret into two functions. The score level notation will emit a Note with the right timing, along with a flag saying to infer the pitch and to possibly modify the length of the previous note. This in turn relies on a separate realize-ngoret function which post-processes the notes as a stream, at which point it can infer the right pitch, and also modify the length of the previous note. One wrinkle is that once we’re working with a linear stream of notes, the score-level information about what is the previous or next note has been flattened away, which means I have to define a notion of “hand” or “voice” at the score level and annotate the notes accordingly. This is a restriction, but it exposes a fundamental assumption of the instrument that the notation depends on: that each note belongs to a certain hand, and that one hand cannot play two notes simultaneously. In fact, all notation that depends on a previous or next note has this limitation. Anyway, this specific solution for ngoret notation has become a general pattern of (notation, realize) pairs.
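
Schematically, the pattern looks like this, using a simplified Note just for this sketch and ignoring the hand grouping, durations, and flags that a real version needs:

    type Time = Double
    type Pitch = Double

    data Note = Note
        { start :: Time
        , pitch :: Maybe Pitch   -- Nothing means "infer me in the realize pass"
        } deriving (Show)

    -- The score level call can only place the note with the right timing;
    -- it flags the pitch for later inference by leaving it empty.
    ngoret :: Time -> Note
    ngoret t = Note t Nothing

    -- The realize postproc runs over the flat Note stream, where the
    -- previous and next notes are finally known.  A real version would
    -- first group the stream by the score level "hand" annotation.
    realizeNgoret :: [Note] -> [Note]
    realizeNgoret notes =
        zipWith3 realize1 (Nothing : justs) notes (drop 1 justs ++ [Nothing])
      where
        justs = map Just notes
        realize1 prev note next = case pitch note of
            Just _  -> note
            Nothing -> note { pitch = between (prev >>= pitch) (next >>= pitch) }
        -- a stand-in for the actual pitch inference rule
        between (Just p) (Just n) = Just ((p + n) / 2)
        between _ _ = Nothing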

You might be recognizing this as the two-phase split I was decrying above. Indeed it is, accompanied with the same plumbing hassle, and the same kinds of problems. For example, if you forget realize-ngoret, then things won’t work right, or if you mess up the hand annotations the wrong pitches get inferred. More seriously, as I accumulate more kinds of notation, the various realize calls are not necessarily commutative, which is to say you have to put them in the right order. For instance, a realize that cancels extraneous notes must run before the one that infers things based on neighbors, and one that moves notes around for expressive purposes must run after one which relies on looking for score-level simultaneous notes.

So the phase distinction is still alive and well. Some aspects of this seem to be inherent. For example, since I’m expressing an instrumental technique that references neighboring pitches, that means it naturally requires those things to be defined, which imposes restrictions on the notation. A more specialized notation could eliminate the possibility of messing up hand annotations by syntactically allowing only two monophonic lines per instrument, and eliminate the realize order dependency simply by hardcoding them at the top level… and at this point we’ve arrived back at the explicitly two-phase Haskore style. I have yet to locate my long-awaited free lunch.

Since one of my goals is to have a general notation that can be reused across styles, and to implement instrumental techniques that assume as little as possible about the instrument for the same kind of reuse, it seems I’m resigned to a somewhat awkward notation and somewhat error-prone usage. I do implement ad-hoc techniques to mitigate it, such as naming conventions (the realize calls are always named realize-something), pre-composed sets of realize calls to avoid the ordering problem, and a general mechanism for calls to store dynamically-typed arguments in a score event: Derive.Call.Post.make_delayed.4 The conventions and library support make it easier to write these kinds of ornaments, but they don’t eliminate the complexity, they just make it easier to express.

So I think the conclusion is that having an integrated Nyquist-like system does make for more expressive power. It works just fine for simple things like delay. However, as soon as notation gets even a little bit complicated, it acquires aspects that require logic at both score time and performance time, which in turn requires careful interleaving. But even though the interleaving is still present, it’s preferable to express both phases in the same language. As they get more complicated, I think the usual cross-language coordination would become completely untenable.

I think future work won’t be able to eliminate those requirements, but perhaps it could place them in a more disciplined structure. For instance, I could imagine trying to address the non-commutativity of realize functions directly, with a means to explicitly describe their requirements and allow automatic combination. What that would be exactly I’m not sure, but algebraic systems have long dealt with such restrictions, and category theory and set theory have ways to talk about them.

The compiler level analogy for all of this is compile-time and runtime, and in fact compiler optimization phases are notoriously order dependent and tricky. Usually they are precomposed and hardcoded, but for instance the GHC compiler admits limited extensibility via its rewrite rules. They come with a numeric phase system which, as expected, is brittle and awkward. LLVM also exposes a configurable set of optimizations, so they may have some framework to keep them under control, but it seems that the general problem is really hard. There is a lot of exploration of staged compilation, from Lisps with macro systems (see Racket in particular) to ML and its module system (see 1ML in particular). But while it might be possible to get some inspiration by studying analogous compiler and language work, my feeling is that a solution for a music language will have to exploit music-specific attributes to simplify the problem.


  1. As an aside, the fact that Cubase later added a “note expression” feature, which allows per-note curves and provides some editing features, shows that they are aware of the problem, and care enough to put some significant work into a solution. I don’t have experience with that feature, since I gave up on Cubase and started on my own thing before that version came out, but my impression is that it can’t help but be manual fiddling with curves, even if the specific manual tools are more sophisticated, because they would never give mainstream users a programmable system. My feeling is that anything based on manually tweaking curves in a GUI is going to be too low level to scale, or at least will be so much work that the curves will be relegated to ad-hoc special effects, not pervasive techniques privileged at the same level as the pitches and durations of the notes themselves. But enough years have passed since the introduction of the Cubase feature that it should be possible to get a survey of how people are using it.

  2. For an example of where code is a hassle, I sometimes need to find the pitch of neighboring notes. But since pitches are essentially functions, and notes are functions that produce data with pitches inside, I have to evaluate the next note to get its pitch. Not only can this upset requirements about evaluation order (which I have controlled very tightly, but which are still present), it can lead to a circular dependency. The root of the problem is that asking the next note to evaluate itself so I can get the pitch out is much more general than just asking for the pitch… so I have to invent a mechanism to ask for just the pitch. Compare this to a data oriented representation, where you just look at the next note and look at its pitch. However, my code oriented pitches support a lot of fancy things, such as retuning dynamically in time or according to context. While a sufficiently fancy data type could also express that, as it increases in complexity it also becomes harder to work with, taking on the same problems as the code one. The interesting thing is that as the data type increases in complexity it approaches a function, while on the other side, as the function accumulates ways to ask for specific things (like pitch), it increasingly resembles the data type. In the limit in both directions, you wind up with a memo table. There really is no free lunch, or at least if there is one I’d like to hear about it.

  3. There’s plenty of material online about monads and Reader, but in this context, Reader is really just an implicit argument.

  4. The very name hints that this is also yet another facet of the general problem of evaluation order, which turns out to be a huge part of music score evaluation, but that’s another complicated subject.