Alexy Torres : Etude de la mise en œuvre de support à la cohérence des données dans un système manycore avec caches non cohérents

De Ensiwiki
Aller à : navigation, rechercher
Titre du projet Etude de la mise en œuvre de support à la cohérence des données dans un système manycore avec caches non cohérents
Cadre IRL

Équipe SLS
Encadrants Frédéric Pétrot

Introduction and context

Cache memories have been in every computer since 1980, time at which the speed of the CPU surpassed the speed of the memory. Since then, this speed gap has kept increasing continuously. With the advent of new technologies and new architectures such as many-core architectures, memories hierarchies added to the processors may not be coherent anymore, making shared data access a new challenge in terms of efficiency and production cost. As a result, in many architecture hardware cache coherency protocols as we know them are no longer hardware-supported, leading to the development of software-based methods to maintain cache coherency. The work presented gives a comparison of different methods used in the DNA/OS operating system kernel to maintain cache coherency with a software-based approach.

Cache coherency protocols

Most of the protocols used to ensure cache coherence are based on hardware support or, when non-existent, on software support. Hardware-based protocols can be divided into two main groups: snooping-based protocols and directory-based protocols.

Snooping-based protocols rely on a bus interconnection between different processors. The protocols use the cache controller of each processor connected to the bus. The controllers watch the traffic occurring on the bus and modify their state according to a finite state automaton. These protocols are relatively simple thanks to the use of the shared bus connection that acts as a broadcast mechanism. However the method is not adapted to many-core architectures because of the use of the bus itself.

Directory based protocols maintain coherency using main directories. These memories keep track of the different states (dirty, shared, exclusive, ...) of a cache line. Depending on the protocol, one or multiple directories may be used to create redundancy in order to minimise the access time to these memories. During each memory access (in case of cache-miss), the processor that makes the access has to read the directory to know where the data is stored. Then, depending on the protocol, the processor will have to invalidate a line of cache, modify or add the sate of the cache line in the directory or simply access the data. Directory based protocols ensure a great scalability[1] in the system. Each processor must be linked to the directory but multiple bus can be used, which addresses the snooping base protocols bus bandwidth issue.

Multiple works have been done on software-based cache coherence protocols. Most of the solutions designed are inspired by hardware protocols, taking the hardware finite state machine to the software level in order to manage the cache. The compiler and operating system are usually involved in the protocol. However using software protocol requires special access to the memory (un-cached access) and optimal design to avoid the software processing overhead.

State of the art protocols hierarchy

A more detailed list of the different protocols families is given in the report.

Design protocols for DNA/OS

Kalray architecture constraints

The K1 architecture is composed of 16 clusters. Our work only concerns one cluster. Each cluster is composed of 16 processors with 8KB level 1 data/instruction cache. The level 2 cache is referred as the main memory in which DNA/OS and the user software are injected. This memory has a capacity of 2MB which is the most difficult constraint to overcome when designing the cache coherency protocol. The number of processors in the cluster being relatively high, the protocol should also be scalable.

Lock based protocol

This protocol is based on the entry/release memory consistency model. Each time a lock is acquired, the kernel will invalidate all the L1 data cache lines of the processor that acquired the lock. During the lock release process, all the L1 cache lines will be purged in the main memory. This process is done by purging the write-buffer.

The main limit of this method is that it does not ensure coherence with programs allowing race condition on data manipulated by multiple processors concurrently.

Trace analysis

Based on the work of Cunha~[2], that advocates to simply invalidate a shared line prior to reading it, knowing that the write will eventually reach memory, the code of DNA/OS was modified to maintain cache coherence in the kernel. Each time a potential cache coherence violation is detected in the execution trace of the kernel, a cache line invalidation/purge instruction has to be added to the source code.

Software buffer based cache coherency protocol

Execution instance of the protocol

The protocol we implemented is inspired by the method proposed by J. Cai et A. Shrivastava[3].

To ensure cache coherence, we rely on three elementary operations during program execution: lock acquisition, data writing and that lock release.

  • Lock acquisition: before acquiring the lock, the system will invalidate the cache line that contains data modified by other processors before the lock is actually acquired.
  • Data writing: when a data is modified by the processor, a write-notice is created and inserted in the write-notice buffer. The following process will be further described.
  • Lock release: before releasing the lock, all cache lines containing dirty data will be purged to the main memory.

Two data structures have been created to register dirty data addresses that need to be purged when the lock is released. These structures are also needed during the lock acquisition process in order to check whether a cache line has to be invalidated or not.

  • The write-notice is an entry containing the address of the modified data. The structure also contains an invalidation bit vector. The vector allows the protocol to keep track of which processor has already invalidated the data in its cache since the last modification. This method avoid falses invalidation during lock acquisition process. The write notice structure is a node in a linked list, allowing to store them in the memory.
  • The write notice buffer is contained in the main memory and accessed in an uncached way. This is not an actual entity but a linked list header pointing to the first write notice.


To compare the result obtained with the different protocols we developed two simple programs. The aim of these was to validate the protocols and analyze their execution.

Due to a lack of time only two programs were used to analyze our protocol. Further tests are done to compare the trace analysis approach with the lock based approach.

Write notice based protocols

The two first programs are:

  • Mat_mult, multiplying two 100x100 matrices with up to 16 threads (one per processor).
  • RGB2YUV, converting RBG format images to YUV format with up to 16 threads (one per processor).
Write notice based protocols execution time results

Trace analysis VS Lock based

The following results were obtained with two programs from the Splash2[4] benchmark. Due to the highly constrained environment of the MPPA-256 two other programs (OCEAN from Splash2 and ParallelMJPEG) were to be ported but remain a work in progress.

Both Water-Spatial and Water-Nsquared presented in this section are ran 20 times. The graph represents the mean time of all the executions.

Trace analysis time results

Discussion and further enhancements

These results show the inefficiency of the protocols. These performance issues are explained by the very specific architecture of the MPPA-256 that does not fit the DNA/OS structure well. But the tests also allowed to validate the protocols.

To address the performances issues due to the structure of DNA/OS we modified its code to limit the overhead generated by the kernel. The second set of tests realized with Water-Spatial and Water-Nsquared show better result and even slight amelioration when enabling the caches.

To address the kernel overhead issue further work may rely on new lightweight kernels specially developed for the K1 architecture. This method would allow to analyze the cache protocols efficiency more precisely.


The work presented the work achieved to develop and analyze multiple cache coherence protocols designed for the many-core architecture developed by Kalray[5]. It shows the increasing complexity of developing methods to ensure cache coherency at the cluster level and the incapacity of state of the art methods to produce efficient execution time on general purpose operating systems.


[1] Xiuzhen Lian, Xiaoxi Ning, Mingren Xie, and Farong Yu. Cache coherence protocols in shared-memory multiprocessors. In International Conference on Computational Science and Engineering, ICCSE ’15,2015

[2] Marcos Aurélio Pinto Cunha, Omayma Matoussi, and Frédéric Pétrot. Detecting sw cache coherence violations in mpsoc using traces captured on virtual platforms. ACM Trans. Embedded Comput. Syst., 16(2):30:1–30:21, 2017.

[3] Jian Cai and Aviral Shrivastava. Software coherence management on non-coherent cache multi-cores. In29th International Conference on VLSI Design and 15th International Conference on Embedded Systems, VLSID 2016, Kolkata, India, January 4-8, 2016, pages 397–402. IEEEComputer Society, 2016.

[4] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder PalSingh, and Anoop Gupta. The splash-2 programs: Characterization and methodological considerations. SIGARCH Comput. Archit. News,23(2):24–36, May 1995.

[5] B. D. de Dinechin, R. Ayrignac, P. E. Beaucamps, P. Couvert, B. Ganne,P. G. de Massas, F. Jacquet, S. Jones, N. M. Chaisemartin, F. Riss, andT. Strudel. A clustered manycore processor architecture for embedded and accelerated applications. In High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pages 1–6, Sept 2013.


Fichier:Cache coherency on mppa.pdf Pour le rapport. Fichier:Presentation.pdf Pour la soutenance.