TI’s new Cortex-M4 line versus the rest

Recently I received an email from Texas Instruments (TI) in which they announced the availability of their new Cortex-M4 line. I really like many products TI makes. When I spoke to TI people at Nürnberg’s Embedded World fair last March, I was thrilled to hear TI would release a Cortex-M4 part in autumn 2011.

I have used their LM3S9B96 processor and am very satisfied with it: excellent peripherals, a very good DMA controller, lots of memory, and the inclusion of StellarisWare and SafeRTOS in on-chip ROM. I expected TI to extend its Cortex-M3 family with Cortex-M4 based microcontrollers that would be faster and just as well equipped, or better. To my disappointment, that is not the case:

  • The new Cortex-M4 line (Blizzard) is not faster than the Tempest class Cortex-M3.
  • The new Blizzard class has less internal memory (32 KB vs. 96 KB).
  • The new Blizzard class has no external bus interface.
  • The new Blizzard class has no Ethernet MAC/PHY like the Tempest class.

The only things going for it seem to be the internal EEPROM, the DSP additions in the core and the fact that the IC is produced using a 65 nm process.

So, why would I use the new Blizzard class instead of the Tempest class? I see no compelling reason to migrate. In fact, I think that many, if not most, DSP-style calculations can easily be done on the Tempest class using the IQMath library, which maps floating-point calculations onto fixed-point ones.
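For illustration, a generic Q16.16 fixed-point multiply sketches the idea behind such libraries (this is not the actual IQMath API; the names and format are mine):

```cpp
#include <cstdint>

typedef int32_t q16;   /* Q16.16: 16 integer bits, 16 fractional bits */

static inline q16 to_q16(double v)   { return (q16)(v * 65536.0); }
static inline double from_q16(q16 v) { return v / 65536.0; }

static inline q16 q16_mul(q16 a, q16 b)
{
    /* Widen to 64 bits so the intermediate product cannot overflow,
       then shift the extra 16 fractional bits back out. */
    return (q16)(((int64_t)a * b) >> 16);
}
```

On a core without an FPU, a multiply like this costs one integer multiply and a shift, which is why fixed-point libraries remain attractive on Cortex-M3.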

Also, when comparing the Blizzard class to competing offerings from other manufacturers, it doesn’t really impress me:

  • Freescale has been offering its Kinetis controllers for a while. Running at 150 MHz (the K70 type) and equipped with many useful peripherals (among them an LCD controller), the Kinetis seems better equipped than the Blizzard class, and almost twice as fast.
  • NXP will soon release the LPC4000 family, which will run at 180 MHz and which combines a Cortex-M4 with a Cortex-M0 on one chip. Again, in terms of raw performance, this one leaves the Blizzard class in the dust.
  • ST recently released their STM32F4 series. This stunning Cortex-M4 microcontroller runs (from flash) at 168 MHz, more than double the speed of the Blizzard class. It has loads of memory, many peripherals, an external bus and so on. I seriously doubt TI’s Blizzard class will beat this one.

I hope TI will follow suit and release a Cortex-M4 offering with performance similar to or better than the ones above. If not, I think they will not compete well in the Cortex-M4 arena.

Using new and delete in C++ on Cortex-M3

A nice way to precisely time the instantiation of an object in C++ is to use the new and delete operators. Normally, a global object is instantiated at an unspecified moment during startup: the compiler decides this, not the language standard.

Whilst delete would not be of great use in an embedded system without proper memory management (the easiest implementation would be the BGET package), new can be used efficiently for linear instantiation, where the user decides when to instantiate.
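A minimal sketch of that pattern (the class and names are illustrative, not from any real driver library): instead of a global object whose constructor runs at an unspecified moment during startup, keep a pointer and construct the object explicitly once the hardware is ready.

```cpp
class UartDriver {
public:
    explicit UartDriver(unsigned baud) : baud_(baud) {}
    unsigned baud() const { return baud_; }
private:
    unsigned baud_;
};

UartDriver *uart = 0;        /* no constructor runs during startup */

void system_init(void)
{
    /* Clocks, pin muxing etc. are configured here first...          */
    uart = new UartDriver(115200);  /* ...and only then is the object built */
}
```

The construction now happens at a point the programmer chooses, which is exactly what the custom operator new below makes cheap.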

If you want to use C++ without the bloat, you’re probably forced to provide your own new and delete routines. On Cortex-M3, this can be tricky.

Although the core supports unaligned memory access, unaligned accesses through LDM, STM, LDRD and STRD instructions will cause a Usage Fault exception. Using unaligned data also decreases performance. When instantiating a class with many member variables, the allocated memory block could well be unaligned, and the compiler could conceivably generate LDM or LDRD instructions, for example in the constructor of the object. It is therefore imperative that a call to new always returns an aligned block of memory.

A simple way to implement new and delete would be like this:

static void * toewijzing(const size_t size) /* note: size is the number of bytes */
{
    static UInt8 *heap_ptr = (UInt8 *)(__bss_end);
    const UInt8 *base = heap_ptr;       /*  Start of the newly allocated block.  */
    const UInt8 *eindeSRAM = (const UInt8 *)__TOP_STACK;

    if ((heap_ptr + size) < eindeSRAM)
    {
        heap_ptr += size;       /*  Advance the heap pointer by size bytes.  */
    }
    else
    {
        KiwandaAssert(0);       /*  Out of memory: assert, then halt.  */
        while (1);
    }

    return ((void *)base);      /*  Return pointer to the start of the new block.  */
}

void * operator new(size_t size) /* note: size is the number of bytes */
{
    return(toewijzing(size));
} 

void * operator new[](size_t size)
{
    return(toewijzing(size));
} 

void operator delete(void * ptr)
{
    // free(ptr);
} 

void operator delete[](void * ptr)
{
    //free(ptr);
}

The linker will provide the location of the end of the used SRAM space (__bss_end) and the end of the SRAM block (__TOP_STACK). An example is the following linker script, used for the Energy Micro EFM32G880F64 processor:

/************************************************************************
ARM Cortex-M3 standard microcontroller loadscript.
Memory and I/O addresses adapted to the properties
of the Energy Micro EFM32G880F64 microcontroller.

All modifications in this file are
(c) 2008-2011 Kiwanda Embedded Systemen
For all other software the Atmel copyright applies

$Id: efm32g880f64.ld 3041 2011-08-28 20:29:52Z oldenburgh $
************************************************************************/

OUTPUT_FORMAT("elf32-littlearm", "elf32-littlearm", "elf32-littlearm")
OUTPUT_ARCH(arm)

MEMORY
{
  FLASH (rx)   : ORIGIN = 0x00000000, LENGTH = 64K
  DATA (rw)    : ORIGIN = 0x20000000, LENGTH = 16K
}

__TOP_STACK    = ORIGIN(DATA) + LENGTH(DATA);

/* Section Definitions */

SECTIONS
{
  /* first section is .text which is used for code */

  . = ORIGIN(FLASH);
  .text :
  {
        KEEP(*(.cortexm3vectors))      /* core exceptions */
        KEEP(*(.geckovectors))          /* efm32 exceptions */
        . = ALIGN(4);
        KEEP(*(.init))
        CREATE_OBJECT_SYMBOLS
        *(.text .text.*)                   /* C and C++ code */
        *(.gnu.linkonce.t.*)
        *(.glue_7)
        *(.glue_7t)
        *(.vfp11_veneer)
        *(.ARM.extab* .gnu.linkonce.armextab.*)
        *(.gcc_except_table)
        . = ALIGN(4);
   } >FLASH

__exidx_start = .;
.ARM.exidx   : { *(.ARM.exidx* .gnu.linkonce.armexidx.*) }
__exidx_end = .;

   . = ALIGN(4);

        .rodata : ALIGN (4)
        {
                *(.rodata .rodata.* .gnu.linkonce.r.*)

                . = ALIGN(4);
                KEEP(*(.init))

                . = ALIGN(4);
                __preinit_array_start = .;
                KEEP (*(.preinit_array))
                __preinit_array_end = .;

                . = ALIGN(4);
                __init_array_start = .;
                KEEP (*(SORT(.init_array.*)))
                KEEP (*(.init_array))
                __init_array_end = .;

                . = ALIGN(4);
                KEEP(*(.fini))

                . = ALIGN(4);
                __fini_array_start = .;
                KEEP (*(.fini_array))
                KEEP (*(SORT(.fini_array.*)))
                __fini_array_end = .;

                . = ALIGN(0x4);

                __CTOR_LIST__ = .;
                LONG((__CTOR_END__ - __CTOR_LIST__) / 4 - 2)
                *(.ctors)
                LONG(0)
                __CTOR_END__ = .;
                __DTOR_LIST__ = .;
                LONG((__DTOR_END__ - __DTOR_LIST__) / 4 - 2)
                *(.dtors)
                LONG(0)
                __DTOR_END__ = .;

                . = ALIGN(0x4);

                KEEP (*crtbegin.o(.ctors))
                KEEP (*(EXCLUDE_FILE (*crtend.o) .ctors))
                KEEP (*(SORT(.ctors.*)))
                KEEP (*crtend.o(.ctors))

                . = ALIGN(0x4);
                KEEP (*crtbegin.o(.dtors))
                KEEP (*(EXCLUDE_FILE (*crtend.o) .dtors))
                KEEP (*(SORT(.dtors.*)))
                KEEP (*crtend.o(.dtors))

                *(.init .init.*)
                *(.fini .fini.*)

                PROVIDE_HIDDEN (__preinit_array_start = .);
                KEEP (*(.preinit_array))
                PROVIDE_HIDDEN (__preinit_array_end = .);
                PROVIDE_HIDDEN (__init_array_start = .);
                KEEP (*(SORT(.init_array.*)))
                KEEP (*(.init_array))
                PROVIDE_HIDDEN (__init_array_end = .);
                PROVIDE_HIDDEN (__fini_array_start = .);
                KEEP (*(.fini_array))
                KEEP (*(SORT(.fini_array.*)))
                PROVIDE_HIDDEN (__fini_array_end = .);

                . = ALIGN (8);
                *(.rom)
                *(.rom.b)

        } >FLASH

   . = ALIGN(4);
   _etext = . ;
   PROVIDE (etext = .);

  /* .data section which is used for initialized data */

   _bdata = . ;
   PROVIDE (bdata = .);

  .data : AT(_bdata)
  {
    _data = . ;
    *(.data)            /* normal data */
    *(.data.*)        

    SORT(CONSTRUCTORS)
    . = ALIGN(4);
  } >DATA

   _edata = . ;
   PROVIDE (edata = .);

  . = ALIGN(4);

  /* .bss section which is used for uninitialized data */
  .bss(NOLOAD) :
  {
    _bss = . ;
    __bss_start = . ;
    __bss_start__ = . ;
    *(.bss)
    *(.bss.*)
    *(COMMON)
   __bss_end__ = . ;
   __bss_end   = . ;
   PROVIDE (_ebss = .) ;
  } >DATA

  . = ALIGN(4);

   _end = .;
   PROVIDE (end = .);    /* end of internal memory */

  /* Stabs debugging sections.  */
  .stab          0 : { *(.stab) }
  .stabstr       0 : { *(.stabstr) }
  .stab.excl     0 : { *(.stab.excl) }
  .stab.exclstr  0 : { *(.stab.exclstr) }
  .stab.index    0 : { *(.stab.index) }
  .stab.indexstr 0 : { *(.stab.indexstr) }
  .comment       0 : { *(.comment) }
  /* DWARF debug sections.
     Symbols in the DWARF debugging sections are relative to the beginning
     of the section so we begin them at 0.  */
  /* DWARF 1 */
  .debug          0 : { *(.debug) }
  .line           0 : { *(.line) }
  /* GNU DWARF 1 extensions */
  .debug_srcinfo  0 : { *(.debug_srcinfo) }
  .debug_sfnames  0 : { *(.debug_sfnames) }
  /* DWARF 1.1 and DWARF 2 */
  .debug_aranges  0 : { *(.debug_aranges) }
  .debug_pubnames 0 : { *(.debug_pubnames) }
  /* DWARF 2 */
  .debug_info     0 : { *(.debug_info .gnu.linkonce.wi.*) }
  .debug_abbrev   0 : { *(.debug_abbrev) }
  .debug_line     0 : { *(.debug_line) }
  .debug_frame    0 : { *(.debug_frame) }
  .debug_str      0 : { *(.debug_str) }
  .debug_loc      0 : { *(.debug_loc) }
  .debug_macinfo  0 : { *(.debug_macinfo) }
  /* SGI/MIPS DWARF 2 extensions */
  .debug_weaknames 0 : { *(.debug_weaknames) }
  .debug_funcnames 0 : { *(.debug_funcnames) }
  .debug_typenames 0 : { *(.debug_typenames) }
  .debug_varnames  0 : { *(.debug_varnames) }

} /* end of sections */
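The script defines __bss_end and __TOP_STACK; on the C++ side they can be imported like this (a sketch; declaring linker symbols as arrays is one common convention and an assumption here, and the exact style varies per toolchain):

```cpp
#include <cstddef>

/* Declarations for the symbols defined in the linker script.  Only the
   addresses of these symbols are meaningful, never their contents.
   Declared as arrays so the name decays to a pointer to that address. */
extern "C" {
    extern unsigned char __bss_end[];    /* first free byte after .bss */
    extern unsigned char __TOP_STACK[];  /* one past the end of SRAM   */
}

/* The allocator then hands out memory from [__bss_end, __TOP_STACK). */
static inline size_t free_sram(void)
{
    return (size_t)(__TOP_STACK - __bss_end);
}
```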

When the size of an allocated block is not a multiple of 4, the next allocated block will be unaligned. To prevent this, allocations must always be rounded up to a multiple of 4 bytes, as the next example shows:

static void * toewijzing(const size_t size) /* note: size is the number of bytes */
{
    static UInt8 *heap_ptr = (UInt8 *)(__bss_end);
    const UInt8 *base = heap_ptr;       /*  Start of the newly allocated block.  */
    const UInt8 *eindeSRAM = (const UInt8 *)__TOP_STACK;

    if ((heap_ptr + size) < eindeSRAM)
    {
        heap_ptr += size;       /*  Advance the heap pointer by size bytes...  */

        while (((unsigned int)heap_ptr & 0x3) != 0)   /* ...until bits 0 and 1 are zero, i.e. a 4-byte (32-bit) boundary */
            heap_ptr++;
    }
    else
    {
        KiwandaAssert(0);       /*  Out of memory: assert, then halt.  */
        while (1);
    }

    return ((void *)base);      /*  Return pointer to the start of the new block.  */
}

void * operator new(size_t size) /* note: size is the number of bytes */
{
    return(toewijzing(size));
} 

void * operator new[](size_t size)
{
    return(toewijzing(size));
} 

void operator delete(void * ptr)
{
     /* not applicable in embedded application */
} 

void operator delete[](void * ptr)
{
    /* not applicable in embedded application */
}
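The byte-stepping alignment loop in toewijzing can equivalently be written as a constant-time bit mask (a sketch; the helper name is mine):

```cpp
#include <cstddef>

/* Round a block size up to the next multiple of 4 so that consecutive
   blocks stay word-aligned: adding 3 carries past the boundary and the
   mask clears the two low bits again. */
static inline size_t align4(size_t size)
{
    return (size + 3u) & ~(size_t)3u;
}
```

Calling heap_ptr += align4(size) gives the same result as the loop, without the per-byte iterations.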

Advancing heap_ptr to the next aligned address prevents unaligned allocation from taking place. The price paid is the loss of at most 3 bytes between two allocated blocks. Since most systems have 16 KB of SRAM or more, this won’t pose a major problem. During the design of a class, careful placement of the data members will also prevent unaligned variables. Simply stated: group variables of the same size together (for example, all unsigned char or all unsigned short) and, if necessary, pad them at the end with a dummy variable to reach the next 32-bit boundary:

UInt8 var1;
UInt16 var2[3];

/* total = 8 + 3*16 = 56 bits  -> we need 8 bits of padding */
UInt8 padding[1];

This will prevent costly unaligned access.
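If the compiler supports C++11, the same intent can be enforced at compile time with alignas and static_assert (a sketch; the struct name and members are illustrative):

```cpp
#include <cstdint>

/* Force 4-byte alignment; the compiler then also rounds sizeof up to a
   multiple of 4, so consecutive allocations stay word-aligned without
   hand-counted padding members. */
struct alignas(4) Sensor {
    uint8_t  var1;
    uint16_t var2[3];
};

static_assert(sizeof(Sensor) % 4 == 0, "size must be a multiple of 4");
static_assert(alignof(Sensor) == 4,    "must be word-aligned");
```

The asserts cost nothing at run time and fail the build immediately if a later edit breaks the layout.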
