Embedded Multicore Building Blocks

Boost your applications' performance


Unlike most existing libraries, EMB² has been specifically designed for embedded systems. This includes:
  • fine-grained control over the hardware,
  • support for task priorities and affinities, which is important for (soft) real-time applications,
  • predictable memory consumption (no dynamic memory allocation after startup),
  • lock- and wait-free data structures that guarantee, among other things, thread progress and signal / interrupt / termination safety,
  • support for heterogeneous systems (SoCs), and
  • independence of the hardware architecture.
While OpenMP is useful for exploiting data parallelism (loops), it lacks higher-level patterns, e.g. for stream processing, and does not provide any concurrent data structures. Moreover, since OpenMP has its origins in high-performance computing, most implementations do not take into account requirements from the embedded domain.
MTAPI is a standard for task management in embedded multicore systems defined by the Multicore Association. EMB² builds on MTAPI but provides additional features, namely parallel algorithms, patterns for stream processing, and concurrent containers.
The base library, which abstracts from the underlying platform, and MTAPI are implemented in C. For better usability, EMB² also provides convenient C++ wrappers for both. The parallel algorithms, dataflow patterns, and concurrent containers are implemented in C++.


  • Operating systems: EMB² runs on POSIX-compliant platforms as well as on Windows. Since all platform-dependent code is encapsulated in an abstraction layer (base library), the whole library can be ported to other operating systems with moderate effort (or even run bare metal).
  • Compilers and build environment: EMB² uses CMake and requires a C/C++ compiler supporting at least the standards C99 and C++03. We regularly build with GCC, Clang, and MSVC.
  • Processor architectures: With a recent compiler supporting at least C11/C++11, EMB² can be built on most hardware platforms (using the option -DUSE_C11_AND_CXX11=ON). In addition, EMB² provides its own implementation of atomic operations for x86 and ARM so that it can be used with older compilers (C99/C++03).
While Windows is not very common in embedded systems, it is often used for development, server applications, and human-machine interfaces (e.g., panels).
Platform-specific code is contained in the base library and fenced using EMBB_PLATFORM_* defines. To port the code, add appropriate implementations for your platform. Please see CONTRIBUTING.md for more details.
Although EMB² has been designed for embedded systems, it is not restricted to small controllers or the like. You can also use it to get the most out of “big irons”.


You can download the latest release from GitHub. We recommend building from a release file rather than from a repository snapshot in order to get the documentation out of the box.
All you need is CMake (version 2.8.9 or higher) and a C/C++ compiler such as GCC or Microsoft's Visual Studio.
Please see the Get Started page or the README.md file for more detailed information.


The doc folder in the root directory contains a tutorial (doc/tutorial/tutorial.[pdf|html|epub]), the reference manual (doc/reference/index.html, doc/reference/reference.pdf), a number of simple examples (doc/examples), as well as a more complex application (doc/tutorial/application). Note that the documentation is only available in the release files. If you pull from the repository, you have to build it yourself (see the README.md file for more information).
Make sure that you link all necessary libraries in the following order:
  • Windows: embb_dataflow_cpp.lib, embb_algorithms_cpp.lib, embb_containers_cpp.lib, embb_mtapi_cpp.lib, embb_mtapi_c.lib, embb_base_cpp.lib, embb_base_c.lib
  • Linux: libembb_dataflow_cpp.a, libembb_algorithms_cpp.a, libembb_containers_cpp.a, libembb_mtapi_cpp.a, libembb_mtapi_c.a, libembb_base_cpp.a, libembb_base_c.a
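With CMake, which EMB² itself uses, this link order can be expressed in a single target_link_libraries call. The target name my_app is a placeholder, and depending on your platform you may additionally need a system thread library such as pthread:

```cmake
# Placeholder target "my_app"; the library names follow the order given
# above, so that each library precedes the ones it depends on.
target_link_libraries(my_app
  embb_dataflow_cpp
  embb_algorithms_cpp
  embb_containers_cpp
  embb_mtapi_cpp
  embb_mtapi_c
  embb_base_cpp
  embb_base_c
)
```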
To avoid dynamic memory allocation during operation, the number of threads EMB² can deal with is bounded by a predefined but modifiable constant (see functions embb_thread_get_max_count(), embb_thread_set_max_count() and class embb::base::Thread). As usual in task-based programming models, however, explicit thread creation is only recommended in rare cases, e.g., for I/O or graphical user interfaces. For all other purposes, it is most efficient to rely on the implicitly created worker threads of the task scheduler.
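A minimal sketch of adjusting the thread bound before any other EMB² call. The header path and the exact signatures are assumptions based on the function names above; please consult the reference manual:

```c
/* Sketch only: header path and signatures are assumptions. */
#include <embb/base/c/thread.h>

int main(void) {
  /* Must happen before EMB² creates any threads, since the bound
     determines the internal memory pools allocated at startup. */
  embb_thread_set_max_count(8);

  unsigned int max = embb_thread_get_max_count();
  /* ... initialize MTAPI / start tasks; at most `max` threads exist ... */
  (void)max;
  return 0;
}
```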
First of all, make sure that your application initializes the task scheduler explicitly. Otherwise, automatic initialization takes place, which results in significant overhead during the first call of many EMB² functions and thus distorts timing measurements. Secondly, take into account that the speedup is limited by the sequential parts of an application according to Amdahl's Law. For example, even if 75% of your application (in terms of sequential runtime) is parallelized, the theoretical maximum speedup is four. Thirdly, check whether the parallel parts are CPU or memory bound. Typical examples for the latter are simple vector operations, where each arithmetic operation involves a memory access. In such cases, the speedup is limited by the memory bandwidth of the hardware.
An execution policy specifies a task's priority and affinity. The latter can be used to restrict the set of cores on which a task may be executed.
Create an issue on GitHub labelled with 'question' (preferred way if you want to let the community know) or contact us directly.


Please report bugs, feature requests, etc. via GitHub. Alternatively, e.g. in case of vulnerabilities, send an email to [email protected]. Bug fixes, extensions, etc. can be contributed as pull requests via GitHub or as patches via mail. If possible, refer to a current snapshot of the master branch and create pull requests against the development branch. More detailed information can be found in CONTRIBUTING.md.
Due to a complex configuration supporting different compilers and operating systems, specialized tools for verification, various hardware platforms for testing (from small ARM boards to x86-based servers), and CI runtimes of several hours per night, we use a customized Jenkins server running in our internal network for most of the work. Additionally, basic builds and tests are done using Travis CI. In the long run, we would love to 'outsource' all CI jobs, but this will take some time.
Besides static analysis, selected parts of the code have been formally verified using the DIVINE model checker. Moreover, we employ a linearizability checker to verify that our concurrent data structures behave in the same way as their sequential counterparts. However, verifying the complete source code of EMB² is not feasible with the given tools. For an overview of our code quality measures, see also this post.


A Job is a piece of work, e.g. a function, with a unique identifier. An Action is an implementation of a Job and may be hardware- or software-defined. Each Job can be implemented by one or more Actions. A Task represents the execution of a Job, resulting in the invocation of an Action with some data to be processed.
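The relationship can be sketched with the MTAPI C interface. This is a condensed sketch with error handling omitted; DOMAIN_ID, NODE_ID, and JOB_ID are placeholders, and the exact signatures should be checked against the MTAPI specification and the reference manual:

```c
/* Condensed sketch of the Job/Action/Task relationship in MTAPI. */
#include <mtapi.h>

#define DOMAIN_ID 1
#define NODE_ID   1
#define JOB_ID    42  /* unique identifier of the piece of work */

/* An Action: one (software) implementation of the Job. */
static void multiply_by_two(const void* args, mtapi_size_t args_size,
                            void* result, mtapi_size_t result_size,
                            const void* node_local_data,
                            mtapi_size_t node_local_data_size,
                            mtapi_task_context_t* context) {
  *(int*)result = 2 * *(const int*)args;
}

int main(void) {
  mtapi_status_t status;
  mtapi_info_t info;
  mtapi_initialize(DOMAIN_ID, NODE_ID, MTAPI_NULL, &info, &status);

  /* Register the Action as an implementation of the Job ... */
  mtapi_action_create(JOB_ID, multiply_by_two, MTAPI_NULL, 0,
                      MTAPI_DEFAULT_ACTION_ATTRIBUTES, &status);
  mtapi_job_hndl_t job = mtapi_job_get(JOB_ID, DOMAIN_ID, &status);

  /* ... and start a Task: one execution of the Job on concrete data. */
  int in = 21, out = 0;
  mtapi_task_hndl_t task =
      mtapi_task_start(MTAPI_TASK_ID_NONE, job, &in, sizeof(in),
                       &out, sizeof(out), MTAPI_DEFAULT_TASK_ATTRIBUTES,
                       MTAPI_GROUP_NONE, &status);
  mtapi_task_wait(task, MTAPI_INFINITE, &status);

  mtapi_finalize(&status);
  return 0;
}
```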
Plugins are a technique provided by EMB² to deal with heterogeneous systems in a flexible and transparent way. They are similar to device drivers in the sense that they abstract from the hardware via unified interfaces.
At the time of this writing, EMB² provides plugins for OpenCL, CUDA, and distributed systems connected via a network (sockets). We are continuously working on additional plugins; please contact us if your hardware is not yet supported.
No, the main purpose of the network plugin is to enable seamless computing on systems consisting of a moderate number of devices without shared memory. Sample use cases include interacting controllers, e.g., in building or industrial automation, local meshes of IoT devices, and many others.
Yes. The task scheduler is implemented in mtapi_c/src/embb_mtapi_scheduler_t.c. Currently, there are two task-stealing strategies (operating on different queues):
  • embb_mtapi_scheduler_get_next_task_vhpf: high priority first
  • embb_mtapi_scheduler_get_next_task_lf: local queues first
A new strategy can be implemented by extending embb_mtapi_scheduler_mode_enum in mtapi_c/src/embb_mtapi_scheduler_t.h and by adding a call to the corresponding function in embb_mtapi_scheduler_get_next_task. Distribution of tasks between multiple nodes is currently implemented in a round-robin fashion (see embb_mtapi_scheduler_schedule_task).
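Schematically, a new strategy plugs in at the two points just mentioned. The following is a hypothetical sketch only; the actual enum values, function names, and signatures in the sources differ and should be taken from mtapi_c/src/embb_mtapi_scheduler_t.[h|c]:

```c
/* Hypothetical sketch; consult the actual sources for real names. */

/* 1. embb_mtapi_scheduler_t.h: add a mode for the new strategy. */
enum embb_mtapi_scheduler_mode_enum {
  WORK_STEAL_VHPF,       /* existing: high priority first */
  WORK_STEAL_LF,         /* existing: local queues first */
  WORK_STEAL_MY_STRATEGY /* new strategy */
};

/* 2. embb_mtapi_scheduler_t.c: implement the strategy ... */
embb_mtapi_task_t* embb_mtapi_scheduler_get_next_task_my_strategy(
    embb_mtapi_scheduler_t* that, embb_mtapi_node_t* node,
    embb_mtapi_thread_context_t* context) {
  /* e.g., prefer stealing from a neighboring worker's queue first */
  return NULL; /* placeholder */
}

/* 3. ... and dispatch to it in embb_mtapi_scheduler_get_next_task: */
/*
switch (that->mode) {
  case WORK_STEAL_MY_STRATEGY:
    return embb_mtapi_scheduler_get_next_task_my_strategy(
        that, node, context);
  ...
}
*/
```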