Instruction Set Architecture (ISA) is considered esoteric even among software developers. I feel it’s because the majority of developers never need to look at, let alone write, assembly language (a language that corresponds directly to the ISA). It isn’t their fault either: the compiler does the heavy lifting of converting the higher-level (generally ISA-agnostic) language into its assembly equivalent.

While the compiler does a great job of generating (nearly) optimal assembly code, hand-written assembly is still common in performance-critical routines, time-critical interrupt handlers and code that interacts directly with the hardware. Certain architectures have special DSP/MAC instructions that the compiler is unaware of, so DSP algorithms are also partially developed in assembly. Knowledge of the underlying architecture can, at times, help maximize performance.

Real programmers wrote in machine code [1]

In this post, I will cover certain parts of the Xtensa ISA, specifically windowed registers, the calling convention and the stack layout.

Overview of Xtensa:


Xtensa is a post-RISC ISA [2], i.e. it derives most of its features from RISC but also incorporates certain features where CISC is advantageous. For code compactness, Xtensa uses 24-bit instructions (a few are even 16 bits!) instead of the conventional 32-bit instructions.

Registers:

PC = Program Counter
AR = General purpose registers
SAR = Shift Amount Register

PC holds the address of the instruction that is to be executed.

AR is the set of general purpose registers. There are 64 32-bit registers, however only 16 of them are visible/accessible at a time. This is the windowed register file.

SAR holds the shift amount for shift instructions (logical left, logical right). Xtensa does not provide single-instruction shifts in which the shift amount comes from a general register (AR) operand. The shift amount is first placed in SAR, and the shift instruction then reads it from there.
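To make the two-step pattern concrete, here is a small Python model of a variable shift. It is only a sketch: it assumes the semantics of the SSL/SLL and SSR/SRL instruction pairs as I understand them from the ISA manual (SSL stores 32 minus the amount, SLL shifts left by 32 minus SAR), and the class and method names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


class ShiftUnit:
    """Toy model of SAR-based shifts (illustrative, not cycle-accurate)."""

    def __init__(self):
        self.sar = 0  # Shift Amount Register, holds 0..32

    def ssl(self, amount):
        # Prepare a left shift: SAR = 32 - (amount mod 32)
        self.sar = 32 - (amount & 0x1F)

    def ssr(self, amount):
        # Prepare a right shift: SAR = amount mod 32
        self.sar = amount & 0x1F

    def sll(self, value):
        # Shift left logical by (32 - SAR), keeping 32 bits
        return (value << (32 - self.sar)) & MASK32

    def srl(self, value):
        # Shift right logical by SAR
        return (value & MASK32) >> self.sar


unit = ShiftUnit()
unit.ssl(4)                 # step 1: load the shift amount into SAR
left = unit.sll(0x000000FF) # step 2: the shift itself reads SAR
unit.ssr(4)
right = unit.srl(0x00000FF0)
```

The point of the sketch is only that every variable shift costs two steps: one to set SAR and one to shift.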

Windowed Register:


General purpose registers (GPRs) store data temporarily for the CPU while it performs various operations. These registers are blazing fast but limited in number (typically 8 to 32).

Typically, the number of registers in the register file equals the number of registers directly accessible by the core. The Xtensa core can only access 16 GPRs, namely a0 - a15, so without windowing the register file contains 16 registers.

Xtensa also has a Windowed Register option which, when enabled, extends this register file to 64 registers. Essentially, the register frame (a0 - a15) acts as a window through which only 16 registers are visible, and this window slides over the larger 64-register file. Hence the name: windowed registers.

Which 16 registers are visible is controlled by the WindowBase register, which indicates where the window starts in the register file. The window shifts/rotates in units of 4 registers, so the window starts at physical register (WindowBase x 4) of the register file.

[Figure: Windowed register]
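The mapping from a visible register to its physical location can be sketched in a few lines of Python. This assumes the 64-entry register file used throughout this post; the function name is my own and not something defined by the ISA.

```python
NUM_PHYSICAL = 64   # size of the register file assumed in this post
WINDOW_UNIT = 4     # the window moves in steps of 4 registers


def physical_index(window_base, n):
    """Physical register behind the visible register a<n>."""
    if not 0 <= n <= 15:
        raise ValueError("only a0..a15 are visible through the window")
    # The modulo makes the window wrap around the end of the register file.
    return (window_base * WINDOW_UNIT + n) % NUM_PHYSICAL


# With WindowBase = 4, a0 is physical register 16 and a15 is 31.
first, last = physical_index(4, 0), physical_index(4, 15)
# With WindowBase = 14, the window wraps: a8 lands on physical register 0.
wrapped = physical_index(14, 8)
```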

So…how is this helpful ?

The answer lies in function calls and for that we will have to understand the calling convention.

Calling convention:


Applications are often broken down into functions/subroutines for reusability, maintainability and scalability. Each subroutine can have its own set of arguments, local variables and a return type. In practice, there are many such subroutines and nested subroutines, so how should we pass arguments to a subroutine? What happens to the parameters/context of the calling subroutine? To answer these and other questions, the architecture defines an application binary interface (ABI), which also includes the calling convention.

A calling convention is basically a set of rules that delineates how arguments are passed to a function, how the stack is managed inside the function, how data is returned to the caller, and a few more intricacies.

Xtensa supports two different ABIs.

  1. Windowed register ABI
  2. Call0 ABI

I will cover only Windowed register ABI.

Windowed register calling convention:

The return address is stored in a0 and the stack pointer is stored in a1.

Arguments are passed in both registers and memory (the stack). The first six arguments are passed in registers and the remaining ones go on the stack.

Return values are returned in registers a2 through a5. If more than 4 values need to be returned, the caller passes a pointer, and the callee populates the memory it points to with the return values.

Register    Use
a0          Return address
a1          Stack pointer
a2 - a7     Incoming arguments
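As a quick illustration of the rule above, the following Python sketch lists where each argument ends up. It assumes every argument is a plain 32-bit value (wider or aggregate arguments follow extra rules that I am not covering here); the function name and the "caller sp + offset" labels are only for illustration.

```python
def place_arguments(num_args):
    """Where the callee finds each 32-bit argument (1-based argument index)."""
    placement = {}
    for i in range(1, num_args + 1):
        if i <= 6:
            # Arguments 1..6 arrive in the callee's a2..a7.
            placement[i] = f"a{i + 1}"
        else:
            # Arguments 7 and up live on the stack, 4 bytes apart.
            placement[i] = f"caller sp + {4 * (i - 7)}"
    return placement


eight_args = place_arguments(8)
```

For a function with eight arguments, the first six map to a2 through a7 and the last two map to the first two stack slots.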

In Xtensa, subroutine calls are made using the CALLN and CALLXN instructions, where N specifies the amount by which the register window is rotated for the callee. N can be 0, 4, 8 or 12. (call0/callx0 do not follow the windowed register convention, so the explanation that follows does not apply to N = 0.)

What does “rotation of window for the callee” exactly mean ?

When a subroutine is called using callN/callxN, the WindowBase register is incremented by (N/4), so the registers visible through the window inside the callee differ from those of the caller, because the register frame (a0 - a15) has moved. (Strictly speaking, the call only records N; the callee’s ENTRY instruction, covered later, performs the rotation.)

In general, for a windowed register call callN/callxN

aN of caller will be a0 of callee
a(N+1) of caller will be a1 of callee and so on…

So the caller needs to put the first argument of the callee in a(N+2), the second in a(N+3), and so on.

When returning from the callee, the WindowBase register is decremented by (N/4), which brings the caller’s registers back into view unchanged.
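The renaming between caller and callee registers can be written down as two small helpers. This is a sketch of the rule stated above, nothing more; the function names are hypothetical.

```python
VALID_N = (4, 8, 12)


def callee_view(caller_reg, n):
    """Callee-side number of the caller's a<caller_reg> after callN.

    Returns None when that register is not visible in the callee's window.
    """
    if n not in VALID_N:
        raise ValueError("windowed calls use N = 4, 8 or 12")
    callee_reg = caller_reg - n
    return callee_reg if 0 <= callee_reg <= 15 else None


def outgoing_arg_register(arg_index, n):
    """Caller-side register for the callee's argument arg_index (1-based).

    Returns None when the register would fall outside a0..a15.
    """
    if n not in VALID_N:
        raise ValueError("windowed calls use N = 4, 8 or 12")
    reg = n + 1 + arg_index          # first argument goes in a(N+2)
    return reg if reg <= 15 else None


# call8: the caller writes bar's first two arguments to a10 and a11,
# and bar sees them as a2 and a3.
first_arg = outgoing_arg_register(1, 8)
seen_by_callee = callee_view(first_arg, 8)
```

One consequence that falls out of the arithmetic: with call12 only a14 and a15 are left for arguments, so `outgoing_arg_register(3, 12)` returns None.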

Let’s take an example:

/*
 * void func(a,b)
 * {
 *     ...
 *     foo = bar(x,y);
 *     ...
 * }
 */

func:
    ...
    mov         a10, x    // a10 is bar's a2 (x, y, foo are placeholders for registers)
    mov         a11, y    // a11 is bar's a3
    call8       bar       // window rotates by 8 for bar
    mov         foo, a10  // a10 is bar's a2 (return value)
    ...

Do you see the advantage of using windowed registers? When a function calls another function, it does not have to store its own arguments somewhere else to accommodate the arguments for the callee, since the callee’s arguments live at a different physical location. The callee still uses a2 to access its first argument but, as you can see, a2 of the caller and a2 of the callee are different physical registers. If there were no windowing and exactly 16 physical registers, a2 of the caller and a2 of the callee would be the same register. For each function call, the data in these registers would then have to be saved to memory (the stack) before the call and restored after it returns.

Accessing memory is much slower than accessing a register, so this saving/restoring has a negative impact on performance. The windowed register convention saves us the overhead of such stores/restores and also reduces code size.

With that said, a few questions pop up in my head:

  • What happens when a function calls another function, which calls another, and so on, until the window has rotated so far that it wraps around and points at the first function’s frame?
  • Does it overwrite and corrupt the first function’s data?
  • If not, then where is the first function’s data stored?

When the program tries to write to a register that still holds the data of one of its parent routines, a window overflow exception is generated. It’s the responsibility of the window overflow exception handler to ensure that the data in the overlapped registers is saved on the stack before it gets overwritten.

And where exactly is the data in these registers stored on the stack ?

For that, let’s see how the stack is laid out.

Stack Layout:


As mentioned, the stack pointer resides in the a1 register. It always points to the bottom (lowest address) of the stack!

Usually, a function prologue sets up the stack for a function.

In Xtensa, the ENTRY instruction serves as the function prologue.

The ENTRY instruction primarily does two things:

  1. Allocates the stack frame for the function and sets the stack pointer.
  2. Moves/rotates the register window by n as specified in the calln/callxn instruction.
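The two effects of ENTRY listed above can be modelled in Python by tracking just WindowBase and the stack pointer. This is a deliberately reduced sketch: the `WindowState` type and `call_and_entry` function are my own names, a 64-entry register file (16 window positions) is assumed, and everything else the real instruction does is left out.

```python
from dataclasses import dataclass

NUM_POSITIONS = 16  # 64 physical registers / 4 registers per step


@dataclass
class WindowState:
    window_base: int  # current WindowBase
    sp: int           # value of a1 (stack pointer) in the current window


def call_and_entry(state, n, frame_size):
    """Combined effect of callN plus the callee's ENTRY on the two fields."""
    if n not in (4, 8, 12):
        raise ValueError("windowed calls use N = 4, 8 or 12")
    return WindowState(
        # Rotate the window by N registers, i.e. N/4 WindowBase steps.
        window_base=(state.window_base + n // 4) % NUM_POSITIONS,
        # The stack grows downwards, so allocating a frame lowers sp.
        sp=state.sp - frame_size,
    )


caller = WindowState(window_base=4, sp=0x1000)
callee = call_and_entry(caller, n=8, frame_size=32)
```

Note that `caller` is left untouched, which mirrors the fact that the caller’s a1 still holds its own stack pointer in its own window.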

Stack layout is always better explained through an illustration:

[Figure: Windowed ABI stack layout]

For clarity, I will use sp as stack pointer instead of a1.

As in most architectures, the stack in Xtensa grows downwards. Outgoing arguments beyond the first 6 go at non-negative offsets from sp, i.e. the 7th argument at sp, the 8th at sp + 4 and so on. The local variables of the function are stored above the outgoing arguments.

The region just below the stack pointer, called the Base Save Area, is 16 bytes long and is reserved for saving a0 - a3 of the caller (previous frame) when a window overflow exception occurs. If more of the caller’s registers need to be saved, they are stored in the Extra Save Area at the top of the caller’s (previous) stack frame. The locations used for saving the registers of the caller’s (i-1) frame are highlighted in the image.
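The offsets described above translate into simple address arithmetic. In the sketch below the function names are hypothetical, and the order of a0 to a3 inside the Base Save Area (a0 at the lowest address) is an assumption on my part based on how the overflow handlers are usually written.

```python
def outgoing_arg_address(sp, arg_index):
    """Stack address of outgoing argument arg_index (7th argument onwards)."""
    if arg_index < 7:
        raise ValueError("arguments 1..6 travel in registers, not on the stack")
    return sp + 4 * (arg_index - 7)


def base_save_address(sp, k):
    """Address in the 16-byte Base Save Area below sp for the caller's a<k>."""
    if not 0 <= k <= 3:
        raise ValueError("the Base Save Area only holds a0..a3")
    # Assumed ordering: a0 at sp - 16, a1 at sp - 12, a2 at sp - 8, a3 at sp - 4.
    return sp - 16 + 4 * k


sp = 0x1000
seventh = outgoing_arg_address(sp, 7)   # right at sp
eighth = outgoing_arg_address(sp, 8)    # 4 bytes above
saved_a0 = base_save_address(sp, 0)     # 16 bytes below sp
```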

With all the necessary points covered, let’s take an example and connect all the dots.

Suppose each function call is made using call8 and we start with WindowBase = 4.

Function A calls B, B calls C, C calls D, and so on up to I, i.e.

Functions:  A -> B -> C -> D -> E -> F -> G -> H -> I
WindowBase: 4    6    8    10   12   14   0    2    4

On each function call, the WindowBase will be incremented by 2 because call8 is used.

No. of bits in WindowBase register = log2((No. of registers in register file)/4) = log2(64/4) = 4. Thus the max value of WindowBase is 15, and incrementing past it wraps around to 0, which is why G gets WindowBase = 0 after F’s 14.
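The table above can be reproduced with a short loop, which is a handy way to check the wrap-around. The sketch assumes the same 64-entry register file and call8 for every call; the function name is made up.

```python
NUM_PHYSICAL = 64
WINDOW_UNIT = 4
NUM_POSITIONS = NUM_PHYSICAL // WINDOW_UNIT          # 16 window positions
WINDOW_BASE_BITS = (NUM_POSITIONS - 1).bit_length()  # 4 bits hold 0..15


def window_bases(start, n, names):
    """WindowBase seen by each function in a chain of callN calls."""
    bases = {}
    window_base = start
    for name in names:
        bases[name] = window_base
        # Each callN advances WindowBase by N/4, modulo the number of positions.
        window_base = (window_base + n // WINDOW_UNIT) % NUM_POSITIONS
    return bases


chain = window_bases(start=4, n=8, names="ABCDEFGHI")
```

Here `chain["A"]` and `chain["I"]` both come out as 4, which is exactly the collision discussed next.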

As you may have noticed, on the 8th call (H calling I, the 9th function in the chain) the window wraps around to a point where the frame contains the data of a parent function, i.e. a0, a1, ... of I would land on registers that hold the data of A. It implies that a8, a9, ... of H are a0, a1, ... of A.
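The overlap is easy to confirm by comparing physical register numbers. In this sketch (same 64-register assumption, hypothetical helper name) A owns a0 - a7 of its window, because with call8 its a8 - a15 double as B’s a0 - a7.

```python
NUM_PHYSICAL = 64


def physical_set(window_base, regs):
    """Physical registers behind the given visible registers of one window."""
    return {(window_base * 4 + r) % NUM_PHYSICAL for r in regs}


h_outgoing = physical_set(2, range(8, 16))  # H (WindowBase 2): a8..a15
a_own = physical_set(4, range(0, 8))        # A (WindowBase 4): a0..a7
overlap = h_outgoing & a_own
```

Both sets cover physical registers 16 through 23, so H cannot prepare a call to I without touching A’s registers.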

A window overflow exception will be generated when H tries to modify a8, a9, ..., since these registers still contain the context of A, which must be saved to make room for the arguments of I. At this point, the window overflow exception handler must rotate the register window to frame A (WindowBase = 4). Seen from that window:

  • A’s a0 - a3 are stored in the Base Save Area of B’s stack frame. B’s stack frame is accessible since a9 is a1 of B, which is B’s stack pointer.
  • A’s a4 - a7 are stored in the Extra Save Area of A’s stack frame.

Later, when B returns to A, a window underflow exception will be generated, and the corresponding exception handler must restore these values back into the registers.
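To connect the overflow and underflow halves, here is a toy spill/refill round trip in Python. It is a sketch under stated assumptions, not the real handlers: the register file is a plain list, memory is a dict, the stack addresses 0x2000 and 0x3000 are arbitrary, and the Extra Save Area address is passed in as a parameter because its exact location is not worked out in this post.

```python
NUM_PHYSICAL = 64


def slot_address(k, callee_sp, extra_save_addr):
    """Stack address used for register a<k> (k in 0..7) of the spilled frame."""
    if k < 4:
        return callee_sp - 16 + 4 * k      # Base Save Area below the callee's sp
    return extra_save_addr + 4 * (k - 4)   # Extra Save Area (address assumed)


def spill(regs, memory, window_base, callee_sp, extra_save_addr):
    """Overflow side: copy a0..a7 of the frame at window_base to the stack."""
    for k in range(8):
        phys = (window_base * 4 + k) % NUM_PHYSICAL
        memory[slot_address(k, callee_sp, extra_save_addr)] = regs[phys]


def refill(regs, memory, window_base, callee_sp, extra_save_addr):
    """Underflow side: copy the saved values back into the registers."""
    for k in range(8):
        phys = (window_base * 4 + k) % NUM_PHYSICAL
        regs[phys] = memory[slot_address(k, callee_sp, extra_save_addr)]


regs = list(range(100, 164))   # dummy contents for the 64 physical registers
memory = {}
a_frame_before = regs[16:24]   # A's a0..a7 when WindowBase = 4

spill(regs, memory, 4, callee_sp=0x2000, extra_save_addr=0x3000)
regs[16:24] = [0] * 8          # H and I reuse the same physical registers
refill(regs, memory, 4, callee_sp=0x2000, extra_save_addr=0x3000)
```

After the round trip, `regs[16:24]` matches `a_frame_before` again, which is all that A needs in order to continue as if nothing had happened.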

I hope you now have a basic understanding of the Xtensa ISA, especially the windowed register ABI. I have simplified some aspects of the windowed register ABI, so if you want to delve deeper, refer to the Xtensa ISA reference manual.

Thanks for reading!


  1. The Story of Mel

  2. Xtensa ISA: Reference Manual