Example POWER 5/5+ Implementation for C/C++ Memory Model

ISO/IEC JTC1 SC22 WG21 N2745 = 08-0255 - 2008-08-22 (REVISED for POWER 5/5+)

Paul E. McKenney, paulmck@linux.vnet.ibm.com

Introduction

This document presents an implementation of the proposed C/C++ memory-order model for the POWER 5/5+ family of computer systems, which require either usage restrictions or special code sequences to implement the proposed C/C++ sequentially consistent atomic operations.

The POWER 5/5+ family of computer systems runs parallel programs containing atomic operations correctly as long as at least one of the following conditions is met:

  1. Traditional synchronization primitives such as locking or read-copy update (RCU) are used instead of the proposed C/C++ sequentially consistent atomic operations. Note that the proposed C/C++ acquire/release atomic operations may use the standard PowerPC code sequences, as shown in the table below.
  2. Simultaneous multi-threading is disabled so that only one hardware thread is active per core (as is often done for computationally intensive numerical workloads).
  3. Operating-system thread-affinity facilities are used so that any given multithreaded application has at most one thread active on any given core.
  4. Each multi-threaded application is confined to a single core (which may have both hardware threads enabled).
  5. The code sequences from the following table are used to implement the C/C++ sequentially consistent atomic operations.
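Condition 3 can be illustrated with a minimal sketch that pins the calling thread to one logical CPU. This assumes Linux and glibc (`sched_getaffinity` and `sched_setaffinity`); the helper names are hypothetical and are not part of this proposal. Which logical CPUs share a core is system-specific, so choosing CPUs that lie on distinct cores is left to the caller.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros (g++ usually defines it)
#endif
#include <sched.h>   // Linux-specific affinity interface
#include <cassert>

// Hypothetical helper: lowest-numbered logical CPU the calling thread is
// currently allowed to run on, or -1 if the mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}

// Hypothetical helper: restrict the calling thread (pid 0) to one logical
// CPU. Returns true on success.
bool pin_calling_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

An application would call `pin_calling_thread` once at the start of each thread, giving each thread a logical CPU on a different core.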

Please note that other members of the POWER family, for example, POWER 6 and POWER 7, need not adhere to any of the above conditions.

Operation                         POWER 5/5+ Implementation
--------------------------------  ----------------------------------------------------------------
Load Relaxed                      ld
Load Consume                      ld
Load Acquire                      ld; cmp; bc; isync
Load Seq Cst (POWER5/5+)          hwsync; larx; cmp; bc; isync
Store Relaxed                     st
Store Release                     lwsync; st
Store Seq Cst                     hwsync; st
Cmpxchg Relaxed,Relaxed (32 bit)  _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg Acquire,Relaxed (32 bit)  _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg Release,Relaxed (32 bit)  lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
Cmpxchg AcqRel,Relaxed (32 bit)   lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Cmpxchg SeqCst,Relaxed (32 bit)   hwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:
Acquire Fence                     lwsync
Release Fence                     lwsync
AcqRel Fence                      lwsync
SeqCst Fence (POWER5/5+)          for (i=0;i<8;i++) { dcbf junk; hwsync; ld junk; }

The variable junk may be any memory location. It is permissible to use junk as the loop control variable, as long as that loop control variable is assigned to a memory location.
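The “SeqCst Fence (POWER5/5+)” row might be written in C++ roughly as follows. This is only a sketch: the function name is hypothetical, the inline assembly uses GCC-style syntax, and the non-POWER fallback to `std::atomic_thread_fence` is an assumption added so that the code compiles on other targets. Here `junk` is passed in by the caller and is separate from the loop counter.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper, not part of this proposal. 'junk' may point to any
// memory location; its value is read but never modified.
inline void power5_seq_cst_fence(volatile long* junk) {
    assert(junk != nullptr);
#if defined(__powerpc__) || defined(__powerpc64__) || defined(__PPC__)
    for (int i = 0; i < 8; ++i) {
        // dcbf junk: flush the cache block containing junk.
        __asm__ __volatile__("dcbf 0,%0" : : "r"(junk) : "memory");
        // hwsync: the heavyweight sync instruction ("sync").
        __asm__ __volatile__("sync" : : : "memory");
        // ld junk: a volatile read forces the compiler to emit the load.
        long tmp = *junk;
        (void)tmp;
    }
#else
    // Assumption: on other targets, defer to the portable fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}
```

A caller can pass the address of any variable that resides in memory, for example a local `long`.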

It is legitimate (but usually unnecessary) to replace sync, lwsync, and eieio instructions with the code sequence shown above for “SeqCst Fence (POWER5/5+)”.
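To relate the rows of the table to source-level operations, the following sketch uses the `std::atomic` interface as it was later standardized in C++11. The comments name the table row that an implementation following this document would use for each operation. Whether a particular compiler emits exactly those sequences is not something the sketch establishes, and all variable and function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// All names below are hypothetical and exist only for illustration.
static int payload = 0;              // ordinary, non-atomic data
static std::atomic<int> ready{0};    // flag that publishes payload
static std::atomic<int> owner{0};    // 0 means unclaimed
static std::atomic<int> x{0}, y{0};  // store-buffering test variables

// Message passing: "Store Release" paired with "Load Acquire".
int message_passing() {
    payload = 0;
    ready.store(0, std::memory_order_relaxed);
    int seen = -1;
    std::thread consumer([&seen] {
        // Table row "Load Acquire": ld; cmp; bc; isync
        while (ready.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the acquire load
    });
    payload = 42;
    // Table row "Store Release": lwsync; st
    ready.store(1, std::memory_order_release);
    consumer.join();
    return seen;
}

// Table row "Cmpxchg Acquire,Relaxed": succeeds only if owner is still 0.
bool try_claim(int id) {
    assert(id != 0);  // 0 is reserved for "unclaimed"
    int expected = 0;
    return owner.compare_exchange_strong(expected, id,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
}

// Store buffering with "Store Seq Cst" and "Load Seq Cst". Returns true if
// both threads read 0, an outcome that sequential consistency forbids.
bool store_buffering_saw_both_zero() {
    x.store(0);
    y.store(0);
    int r0 = -1, r1 = -1;
    std::thread t0([&r0] {
        x.store(1, std::memory_order_seq_cst);
        r0 = y.load(std::memory_order_seq_cst);
    });
    std::thread t1([&r1] {
        y.store(1, std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_seq_cst);
    });
    t0.join();
    t1.join();
    return r0 == 0 && r1 == 0;
}
```

The store-buffering function is the kind of test for which the sequentially consistent rows matter: with only acquire and release ordering, both threads reading 0 would be a permitted outcome.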