Unit Symbols and the Chain Rule (2)
In the previous post, we explored symbolic representation and the recursive nature of the Chain Rule. We also looked briefly at the identity .
In this post I discuss in more details these concepts by analyzing the Chain Rule for functions of single input.
Abstraction
Let’s look at a concrete example in the following function:
Using the Chain Rule, we can derive with respect to :
Now consider this: symbolically, is just a placeholder for some constant value that we will plugin later. What if we also apply this concept to the symbol ?
If we think about it, any engineering process starts with a tabulated set of data, which is the result of a series of high-ordered actions of measurement. These actions are eventually abstracted into numerical values of some predefined units. For example, distances can be measured in units of the foot or its multipliers such as the yard and mile (or the inverse multiplier inch). However, what actually happens is that the unit foot has to be defined and standardized, and then someone has to physically walk the distance, mark and count the units. All of these are high-ordered actions that are abstracted under a symbolic . Other units of measurement (SI or Imperial) are defined in a similar manner.
A measurement is a series of high-ordered actions collapsed into a value.
Given this premise, let’s assume that is also a placeholder for some constant value that we will plugin later. Then:
We notice that each unit can be expanded only as needed, one symbol at a time. In other words, a unit can be used as is, or decomposed into its constituents.
Symbolic Unit
Thus, the derivative of some functional can be defined recursively as follows:
Again, remember that each partial derivative is a quantity that can be evaluated locally according to its respective unit. For example, given a first-ordered function of some input , the derivative of is simply: the scaled by the quantity . It is entirely possible that can be decomposed into smaller constituents. However, we don’t care about that since our function is defined in terms of at the base. Therefore, we solidify the chain at and call it the total derivative with respect to .
Example
We will evaluate at the following function and its derivative:
1. Chain Rule: standard form
Every partial derivative has to be expanded and evaluated individually. Also note the two sets of parentheses aligned vertically. They start branching out at .
2. Chain Rule: modified form with eager evaluation
We compute the function’s symbols and its respective partial derivatives at :
3. Observations and Notes
There are a couple things to note here in the modified form:
- we evaluate all partial derivatives as soon as possible.
- notice how it takes fewer steps (starting from then) to evaluate .
Because we eagerly evaluated all partial derivatives, we ended up having to do less work computing the total derivative at the end.
Essentially, in the standard chain rule, the evaluation of the total derivative is deferred to the very end. Note that every line (starting from then) represents an expansion step. We recursively expand each symbol, plugging the input in only at the end to compute the total derivative. Because of this deferment, we ended up having to perform two branches of computations. This can easily be traced with a computation graph (left as an exercise for the reader — you can check out the previous post for example graphs).
On the other hand, in the modified version, rather than deferring, we use each input right away. Whenever we evalute a function , we also evaluate its partial derivative at the respective symbolic unit. Therefore, by the time we get to evaluating the total derivative, we will have already passed through all partial derivative quantities. The total derivative is collected at the final computation node.
Summary
In this post, I have shown that symbolic representation helps us frame the Chain Rule recursively. By folding or unfolding symbols, we can simplify and localize the complexity of a problem. In the process, we also noticed a pattern of computation and optimized it.
I find the concept of recursive to be very useful. It can be applied more generally to other aspects of programming.
For example, we know that a running program is treated as a process by the OS and has resources allocated accordingly. The here is process, and the load on the system is the amount of processes running concurrently. That is one way of framing the measurement. On the other hand, we can also go deeper into the program’s source code and map out the function call graph, each function as sub-unit and its resource allocation. In this case, we may frame our measurements in terms of the more detailed resource allocation.
In the next post, we will continue from the single-node graph. We will work through a multi-node graph — it is larger but still a very simple example. We will use it as a scaffold to bind the abstract concepts together.
Until next time.