Jun 30, 2024

Some Notes on ANSI C

This is going to be a short one.

RTL Syntax

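Take a look at the example below. The first statement is the most popular way to declare immutable objects in C. Both statements are legal and declare exactly the same type, but the second one places const after the type it qualifies, which is the form that reads consistently from right to left.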

//----> <-------------
  const int x = 0x55e;

//<-------------------
  int const y = 0x55f;

The intended flow of syntax in ANSI C is Right-To-Left (RTL).

This is actually self-evident in the language. We read every C function definition beginning with the code block on the right, then parameters in parentheses, then name of the function, and finally terminating with the return type on the very left. Another example is the conditional statements. The conditional logic is rightmost, then the condition in parentheses, and finally the if keyword. And — as we have already seen above — a data object is assigned to some symbol in RTL manner.

Let’s go through a few more examples to see why having a consistent RTL flow is beneficial.

Example 1

long x = 10000;

It’s easier to see the RTL flow when we model it into a graph:

The above statement reads as: the integer value 10000 is assigned to the symbol x, which names an object of the type long. An object is a bounded memory space. A type denotes the unit size of that space in physical memory.
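To make the "unit size" idea concrete, here is a minimal sketch. The helper names unit_size_of_long and stride_of_long_array are hypothetical and exist only for illustration; the actual byte counts depend on the platform, so the code only compares sizes against each other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the unit size, in bytes, that the type long denotes. */
size_t unit_size_of_long(void) {
    long x = 10000;
    return sizeof x; /* identical to sizeof(long) */
}

/* Hypothetical helper: the distance, in bytes, between two neighbouring
   cells of a long array. It should equal one unit of the type. */
size_t stride_of_long_array(void) {
    long cells[2] = { 10000, 20000 };
    /* the whole object is bounded: exactly two units, nothing more */
    assert(sizeof cells == 2 * sizeof(long));
    return (size_t)((char *)&cells[1] - (char *)&cells[0]);
}
```

Calling both helpers on the same machine should give the same number, whatever that number happens to be.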

This is simple enough. However, it gets confusing when we throw in more semantic specifiers.

Example 2

long              x[2] = { 0x55A, 0x55B };
long const *const y    = x;

Again, we model the syntax components:

The first statement reads:
- the data space of two hexadecimal values is wrapped under the symbol x[2], which denotes a 2-cell array of the type long (a 64-bit integer on typical 64-bit Unix-like platforms; the standard only guarantees at least 32 bits).
The second statement reads:
- the data space under the symbol x now has another alias in the symbol y.
- The symbol y comes with a new set of constraints:
  - it is a non-modifiable pointer to a non-modifiable long object.

What does that mean? Why do we need the second statement?

Imagine a situation where we want access to the data space without changing it (viewing or copying). Additionally, we might repeat this action a few more times, so we want to maintain the same access link. This situation calls for a way to limit the scope of these actions on the raw data.

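We accomplished that with these semantic specifiers: through y the data can be read but not written, and y itself cannot be pointed elsewhere. The data remains writable through x. The RTL flow makes interpreting the statement straightforward.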
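The read-only access described above can be sketched as a small function. The name sum_view and its count parameter are hypothetical, chosen only for this illustration; the parameter type is the same one used for y.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums `count` cells through a read-only view.
   Read right to left: `view` is a const pointer to a const long. */
long sum_view(long const *const view, size_t const count) {
    long   total = 0;
    size_t i;
    assert(view != NULL);
    for (i = 0; i < count; i++) {
        total += view[i];  /* reading through the view is allowed */
        /* view[i] = 0;       would not compile: the cells are const through this view */
    }
    /* view = NULL;           would not compile: the pointer itself is const */
    return total;
}
```

The caller keeps full write access through its own symbol, for example sum_view(x, 2) followed by x[0] = 1; only the view is restricted.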

Function Macros

Now, let’s move on to another topic. Consider the following definition of a ring buffer structure:

typedef struct ring_buffer {
	i32 *const buffer;      // actual operational data space
	u32        front;
	u32        back;
	u32  const capacity;
} RingBuffer;

This structure is a thin wrapper around the actual data space buffer, which is an array used for queuing instructions or events.

In general, being a kind of queue, a ring buffer should be reasonably sized so that it can be deployed on the stack. This way it can take advantage of the fast stack performance. For that reason, we constrain the maximum capacity of the ring with const, specifying that we don’t expect its size to change. We also expect to use the same data space as buffer, so we make its pointer const.
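Here is a minimal sketch of what deploying the ring on the stack looks like. It assumes i32 and u32 are aliases for int32_t and uint32_t (the post does not show their definitions), and the function stack_ring_demo is hypothetical. Both the storage and the ring live in the function's stack frame and disappear when it returns.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t  i32; /* assumed definitions of the aliases */
typedef uint32_t u32;

typedef struct ring_buffer {
    i32 *const buffer;
    u32        front;
    u32        back;
    u32  const capacity;
} RingBuffer;

/* Hypothetical demo: stores one value in a stack-allocated ring and reads it back. */
i32 stack_ring_demo(i32 const value) {
    i32 storage[4] = { 0 };
    /* const members receive their values here, in the initializer */
    RingBuffer ring = { .buffer = storage, .front = 0, .back = 0, .capacity = 4 };
    assert(ring.capacity == 4u);

    ring.buffer[ring.back] = value;              /* the cells stay writable */
    ring.back = (ring.back + 1) % ring.capacity; /* front and back are not const */
    /* ring.capacity = 8;    would not compile: capacity is const */
    /* ring.buffer = NULL;   would not compile: the pointer is const */
    return ring.buffer[ring.front];
}
```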

Now, if we need to allocate the ring on the heap for some special reason, we run into trouble. This is exactly because of the constraints we have just defined. A malloc call knows nothing about the structure: it only sets aside at least as many bytes as we ask for, aligned suitably for any object type, and hands them back uninitialized. A const member, however, cannot be assigned to after the fact. The language only lets us set it through an initializer, and storage obtained from malloc has none. By the time malloc is done, the only direct way to fill in buffer and capacity would be an assignment, and the compiler rejects assignment to a const member.

Therefore, setting up a heap-allocated ring means going around these constraints, and we can only do that with some unsafe actions.

For experimenting’s sake, below is the constructor’s code for heap allocation using unsafe actions:

typedef RingBuffer Ring;

Ring *new(u32 size) {
    Ring *ring               = malloc(sizeof(Ring));
    i32  *buffer             = malloc(size * sizeof(i32));
    *(i32 **)&ring->buffer   = buffer; // unsafe
    *(u32 * )&ring->capacity = size;   // unsafe
    return ring;
}

The unsafe actions here involve: taking the address of each const member, casting it to a pointer to the non-const type, and then dereferencing that pointer, in that order, to write the value. Unsafe because we broke the promise of const.

This is a recipe for memory leaks.

The reason is that we have performed two separate memory allocations: one for the ring structure and the other for the buffer space. When we call free on the ring, only the block holding the structure is released. However, the field buffer is a pointer. The pointer itself is released along with the structure, so the address value stored in it is gone, but the object it points to at another location on the heap remains allocated. That object then becomes orphaned and continues to sit in the system’s memory.

This is a tricky situation. We need to free exactly as many times as we malloc.
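The pairing rule can be sketched as a constructor with a matching destructor. This is only an illustration under assumptions: i32 and u32 are taken to be int32_t and uint32_t, the names ring_new and ring_free are hypothetical, and the constructor swaps the casts for a different technique, copying an initialized template into the heap block with memcpy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int32_t  i32; /* assumed definitions of the aliases */
typedef uint32_t u32;

typedef struct ring_buffer {
    i32 *const buffer;
    u32        front;
    u32        back;
    u32  const capacity;
} RingBuffer;
typedef RingBuffer Ring;

/* Hypothetical constructor: two mallocs, no const-discarding casts. */
Ring *ring_new(u32 const size) {
    Ring *ring;
    i32  *buffer;
    assert(size > 0);
    buffer = malloc(size * sizeof *buffer);
    if (buffer == NULL) return NULL;
    ring = malloc(sizeof *ring);
    if (ring == NULL) { free(buffer); return NULL; }
    {
        /* const members are set by an initializer, then copied byte-wise */
        Ring const init = { .buffer = buffer, .front = 0, .back = 0, .capacity = size };
        memcpy(ring, &init, sizeof init);
    }
    return ring;
}

/* Hypothetical destructor: one free per malloc, inner buffer first. */
void ring_free(Ring *const ring) {
    if (ring == NULL) return;
    free(ring->buffer);
    free(ring);
}
```

Every ring_new must be matched by exactly one ring_free; calling plain free on the ring would still leak the buffer.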

There is a way around it: function macros.

Let’s go back to the basics of ring buffer. Recall that it’s supposed to have a reasonable size. So, if we really have to allocate it on the heap, then we should probably consider another data structure. That being said, since we insist on having the operational buffer on the heap, we notice that other than the buffer pointer, all other attributes are plain scalar values and not iterable structures. They are small and fixed-size, so they can stay on the stack. We only need to track the buffer pointer and free it at the end of the lexical scope.

A function macro offers an elegant solution to that.

#define MAKE_RING(name, size, buffer_ptr)   \
    Ring name = {                           \
        .buffer   = (buffer_ptr),           \
        .front    = 0,                      \
        .back     = 0,                      \
        .capacity = (size) }

u32  size = 2;
i32 *buffer = malloc(size * sizeof(i32));
MAKE_RING(ring, size, buffer);
// now the symbol `ring` is available for use without pointer indirection.
printf("capacity=%u\n", (unsigned)ring.capacity);
// end
free(buffer);

Now stack allocation has been generalized into a reusable constructor macro. There are fewer heap objects to manage, and thus less chance for errors to creep in. We also didn’t break the const strictness, so we don’t need the cast-and-dereference trick on the buffer pointer, which is harder to read and more error-prone. Overall, it takes up a smaller cognitive footprint compared to the heap constructor code.

I leave a link to the complete source code with some tests at the end of the blog.

Final Thoughts

The more I learn about ANSI C, the more I like it.

It offers a nice balance between freedom and constraints thanks to its semantic simplicity. Following a consistent RTL syntax flow, interpreting the meaning of the code becomes straightforward. Thanks to its minimalistic data model, we maintain the freedom to access and operate on raw data. At the same time, we retain the right to limit the action scopes as needed.

It’s simple and yet powerful. Obviously, with great powers come great responsibilities. But,

ANSI C, I like what I am seeing baby!

Until next time.