This is an attempt to train a deep learning model on a microcontroller using 32-bit floating-point precision. We have implemented this by building a robot that learns to follow the nearest obstacle at a minimum distance using deep reinforcement learning. A demonstration of the training process can be found here. This project was inspired by the work of Pete Warden (Google TensorFlow) and Neil Tan (ARM) on running neural networks on microcontrollers.
Real-time machine learning (RTML) refers to proactively interpreting and learning from data that arrives in real time. It requires the acting agent to solve unfamiliar problems based on previous knowledge and often to make decisions in real time. We discuss RTML in the context of development and deployment on embedded systems.
Currently, there are nearly 15 billion embedded processors active on the planet, collecting and sometimes processing huge amounts of data. Most of this data is discarded because these tiny computers often lack the hardware capability to run complex interpretation tasks on it. For this reason, the data is sent over wireless networks to much more powerful centralized computers for processing. But this form of cloud-based architecture, in which edge devices only perform data collection, can be disadvantageous for the following reasons:
- Sending the data requires communication between edge devices and the central computer. This process is very energy-intensive for the processor and is therefore done only intermittently.
- A latency-guarantee trade-off while transferring the data is unavoidable because of the way wireless communication protocols are designed. This can introduce significant delays in taking critical actions based on the data.
- Wireless communication protocols also behave non-deterministically with respect to the time it can take for data packets to be sent from and received by the microprocessor, which is again highly undesirable for tasks that require time-critical responses.
- This non-deterministic communication link requires the microprocessor to be active almost all of the time, which is one of the major sources of idle power draw.
This is where the power of deep learning comes in. The largest and most computationally intensive operation in a deep learning algorithm is matrix multiplication, for which hardware-efficient implementations already exist, and it is entirely arithmetic-bound. Microprocessors and DSPs can process tens or hundreds of millions of such calculations for under a milliwatt, even with existing technologies, and much more efficient low-energy accelerators are on the horizon. Once trained, these models can run complex inference tasks while remaining robust to noise, which is typical of real-time data. Making edge devices smarter by incorporating deep learning capabilities would solve the issues originating from data exchange and make them apt, reliable candidates for taking quick, consequential decisions from real-world data. Devices can also continue to improve after they are deployed in the field.
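To make the arithmetic-bound nature of this workload concrete, here is a minimal sketch of a dense matrix-vector product of the kind such a library would build on. The function name and layout are illustrative, not the project's actual API; the point is that the inner loop is pure multiply-accumulate work, which microcontroller DSP instructions handle efficiently.

```cpp
#include <cstddef>

// Naive dense matrix-vector product: y = W * x.
// W is rows x cols, stored in row-major order as 32-bit floats.
// The inner loop is a single fused multiply-accumulate chain --
// exactly the kind of arithmetic-bound kernel DSPs are built for.
void matvec(const float* W, const float* x, float* y,
            std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            acc += W[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}
```

On real hardware this loop would typically be replaced by a vendor-optimized kernel, but the memory-access pattern and operation count stay the same.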
The following are some of the broader approaches for developing and running deep learning frameworks on resource-constrained embedded systems:
- Memory efficient implementation of existing algorithms
- Low precision quantized training/inference
- Low precision multiplication during training
- Compressing neural networks with pruning and Huffman coding
- Making floating point math highly efficient for AI hardware
In this project, we focus on the first method and briefly touch upon the second. To the best of our knowledge, this is the first deep learning library developed to both train and run inference on embedded systems with platform independence.
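Since low-precision quantization is only briefly touched upon here, the following is a minimal sketch of what the second method involves: symmetric 8-bit quantization, which maps 32-bit floats onto `int8` values via a per-tensor scale factor. The function names and the choice of symmetric quantization are assumptions for illustration, not the scheme this project uses.

```cpp
#include <cmath>
#include <cstdint>

// Symmetric 8-bit quantization: a float v is mapped to round(v / scale),
// clamped to [-127, 127]. The scale is typically max_abs_value / 127
// for the tensor being quantized.
int8_t quantize(float v, float scale) {
    float q = std::round(v / scale);
    if (q > 127.0f) q = 127.0f;
    if (q < -127.0f) q = -127.0f;
    return static_cast<int8_t>(q);
}

// Recover an approximate float from the quantized value.
float dequantize(int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}
```

Storing weights as `int8` cuts memory use by 4x versus 32-bit floats, at the cost of a bounded rounding error per value, which is why it pairs naturally with memory-efficient implementations.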
Refer to this tutorial for a great introduction to policy gradient networks. All libraries for matrix manipulations and implementing deep learning algorithms have been custom written in C++.
For this project, the network is constructed as follows:
- The input layer is of dimension 1x1. It carries the distance to the nearest obstacle measured by the ultrasonic sensor.
- The hidden layer has a dimension of 3x1, with RELU activation.
- The output layer has dimension 5x1, with softmax activation, corresponding to 5 possible velocities at which the bot can travel.
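The forward pass of this 1x1 -> 3x1 (ReLU) -> 5x1 (softmax) network can be sketched as follows. The weight and bias arrays here are placeholders (the project's custom matrix library is not shown), but the layer shapes and activations match the description above.

```cpp
#include <cmath>

// Forward pass for the network described above:
// input 1x1 (distance) -> hidden 3x1 with ReLU -> output 5x1 with softmax.
// probs[j] is the probability of choosing velocity j.
void forward(float distance,
             const float W1[3], const float b1[3],     // hidden layer
             const float W2[5][3], const float b2[5],  // output layer
             float probs[5]) {
    // Hidden layer: z = W1 * x + b1, then ReLU.
    float h[3];
    for (int i = 0; i < 3; ++i) {
        float z = W1[i] * distance + b1[i];
        h[i] = z > 0.0f ? z : 0.0f;
    }
    // Output logits: z = W2 * h + b2. Track the max for numerical stability.
    float logits[5];
    float maxz = -1e30f;
    for (int j = 0; j < 5; ++j) {
        logits[j] = b2[j];
        for (int i = 0; i < 3; ++i) logits[j] += W2[j][i] * h[i];
        if (logits[j] > maxz) maxz = logits[j];
    }
    // Softmax over the 5 logits.
    float sum = 0.0f;
    for (int j = 0; j < 5; ++j) {
        probs[j] = std::exp(logits[j] - maxz);
        sum += probs[j];
    }
    for (int j = 0; j < 5; ++j) probs[j] /= sum;
}
```

Subtracting the maximum logit before exponentiating is a standard guard against `exp` overflow in 32-bit floats, which matters on a microcontroller with no double-precision hardware.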
Training is performed after each episode, during which the network samples 5 actions, executes them, and stores the calculated rewards.
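Sampling an action during an episode means drawing an index from the 5-way softmax distribution. A minimal sketch of that step is below; the function name is illustrative, and the uniform random draw `u` is passed in so the sampling itself stays deterministic and testable.

```cpp
// Sample an action index from a 5-way categorical distribution
// given its probabilities and a uniform random draw u in [0, 1).
// Walks the cumulative distribution and returns the first bucket
// that u falls into.
int sample_action(const float probs[5], float u) {
    float cum = 0.0f;
    for (int j = 0; j < 5; ++j) {
        cum += probs[j];
        if (u < cum) return j;
    }
    return 4;  // guard against floating-point round-off in the cumsum
}
```

During an episode the bot would call this with a fresh random `u` for each of the 5 sampled actions, then log the resulting rewards for the policy-gradient update after the episode ends.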