Voice Control for a Magic Mirror

Voice Control with Amazon Alexa, Intel Movidius and Raspberry Pi

The Occasion

[Image: MagicMirror with welcome screen - "You look great today."]

For the last day of our "open house" we wanted to offer our company visitors something very special that would demonstrate voice control as vividly as possible. From our point of view, a "Magic Mirror" based on the idea of Michael Teeuw was especially suitable.

Our Magic Mirror was to stand out from ordinary ones through several technologies and requirements:

  • Cloud-based, real-time speech recognition
  • Machine Learning
  • Neural networks
  • Real-time image classification using deep neural networks
  • Information display (e.g. weather, calendar, news)

All necessary components were to be installed in a single housing. The magic mirror should be able to talk to a user and recognize objects (e.g. "What is it that I'm holding in my hand?" - answer: "That's a bicycle.").

In the end we decided on the following components:

  • Cloud-based speech recognition Amazon Alexa
  • Raspberry Pi, connected to a display behind a semi-transparent mirror
  • Microphone for transmission of audio input to Alexa
  • Camera for object recognition (as soon as a person steps in front of the magic mirror, they are recognized and the screen is switched on)

 

Alexa Skill, Speech Recognition and Network Connections

We defined an Alexa skill the usual way:

We first chose a wake-up phrase and appropriate example sentences that our mirror should display by default.

Then we defined use cases for our magic mirror. For example, it should:

  • Take selfies and e-mail them to a colleague (it's a mirror, there's a camera: the perfect use case)
  • Say what's in front of the mirror (object classification)
  • Control the hardware (e.g. shutdown, reboot). There were no physical buttons, so we created voice commands for that.

The skill has to run on a server which Alexa contacts on each user request. Amazon offers several ways to host skills, e.g. AWS Lambda functions or the newer Alexa-Hosted Skills. Both are easy to set up and free. But our application needed input from the object detection running on the device itself. Since some application had to serve this information to the skill anyway, we decided to host the whole skill server on the Raspberry Pi. We created the skill in JavaScript and ran it behind an nginx server.

Now Alexa could contact the Raspberry Pi directly and get the information it needed. The first piece of necessary information was the Pi's address. In our case, the mirror should be able to be taken to any home and connected to the Wi-Fi there, getting a new IP address that Alexa then needed to know. We worked around this by using ngrok, a service 'creating tunnels to localhost'. It gave us a URL on their domain, and this URL was always forwarded to our device, wherever it was. A minor issue was that the free version did not keep the same URL between runs (reboots), so the skill settings had to be changed on each boot - but a few lines of code updating the skill settings on every start fixed this. All other services that we tested were not reliable enough for our needs.
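
The startup step can be sketched as follows. ngrok exposes a local inspection API (by default at http://127.0.0.1:4040/api/tunnels) that reports the current public URL; a minimal Python sketch reads it out so it can then be pushed into the skill configuration (that update step, done via the Alexa developer tooling, is omitted here):

```python
import json
from urllib.request import urlopen

def extract_https_url(tunnels_json):
    """Return the first https public URL from ngrok's /api/tunnels response."""
    for tunnel in tunnels_json.get("tunnels", []):
        if tunnel.get("proto") == "https":
            return tunnel.get("public_url")
    return None

def current_ngrok_url(api="http://127.0.0.1:4040/api/tunnels"):
    """Query the local ngrok inspection API for the current public URL.

    Requires a running ngrok process on the same machine.
    """
    with urlopen(api) as resp:
        return extract_https_url(json.load(resp))
```

On boot, `current_ngrok_url()` yields the fresh URL, which a few more lines can then write into the skill's endpoint settings.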

For the skill itself, instead of writing all the code on our own, we decided to use the Jovo framework. This framework helps write skills that are compatible with both Alexa and Google Assistant at the same time.

Finally, after the skill was done, an Alexa client application had to be deployed. Had we been using an Amazon Echo, this functionality would already have been included; but since we had our own device, something had to take the microphone input and stream it to Alexa. The wake-up phrase also had to be recognized on-device, so we needed an on-device speech recognizer dedicated to recognizing exactly one phrase. Once this 'simple' recognizer detected the phrase ('Alexa'), the microphone input was sent to Alexa. Fortunately, Amazon has already made such sample applications available - gone are the days when we had to implement the raw HTTP/2 connections (with multiple streams) ourselves by following the specification. First we deployed the sample Java application, but we ran into some issues (such as it blocking for reasons unknown to us). In the end we took the then-new SDK application, which ran flawlessly. (Actually, it had one issue: it interrupted itself whenever it uttered a word containing 'Alexa'. For example, when it said 'OK, I'm sending your selfie to Alexander', it recognized the 'Alexa' in 'Alexander' and started listening again. But let's assume for now that our visitors won't send selfies to Alexander, and leave the topic of handling barge-in for another time.)
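
A crude workaround for that self-interruption would be to simply ignore the wake-word detector while the device itself is speaking. This is a hypothetical sketch, not what the SDK application does (proper barge-in handling is more involved), but it illustrates the gating logic:

```python
class WakeWordGate:
    """Suppress wake-word triggers while the device's own TTS is playing.

    Illustrative sketch only: it trades barge-in support (interrupting
    Alexa mid-sentence) for not triggering on words like 'Alexander'.
    """

    def __init__(self):
        self.device_is_speaking = False

    def on_tts_started(self):
        self.device_is_speaking = True

    def on_tts_finished(self):
        self.device_is_speaking = False

    def should_open_microphone(self, wake_word_detected):
        # Ignore 'Alexa' heard in the device's own output.
        return wake_word_detected and not self.device_is_speaking
```

The TTS callbacks would be wired to the client's playback events; while speech is playing, detections are dropped.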

Object Recognition, Deep Neural Networks and Intel Neural Compute Stick

The idea behind this work package was simple:

  • Use the input signal of the Raspberry Pi camera and detect the main objects in focus
  • When the user asks for an object classification, name the recognized objects

Object recognition is already a well-developed field, with many training libraries and pre-trained models available, mostly based on deep neural networks (DNNs).

It turned out that object recognition was also useful for saving energy: the microphone and display were only activated when a person stood in front of the magic mirror.
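
This gating can be sketched as a small debounce state machine: the screen and microphone switch on only after a person has been seen in several consecutive frames, and off again after a longer absence. The thresholds below are illustrative, not the values used in the project:

```python
class PresenceGate:
    """Debounce per-frame person detections into an on/off state
    for the display and microphone. Thresholds are illustrative."""

    def __init__(self, on_after=3, off_after=30):
        self.on_after = on_after    # consecutive hits before switching on
        self.off_after = off_after  # consecutive misses before switching off
        self.seen = 0
        self.missed = 0
        self.active = False

    def update(self, person_detected):
        if person_detected:
            self.seen += 1
            self.missed = 0
        else:
            self.missed += 1
            self.seen = 0
        if not self.active and self.seen >= self.on_after:
            self.active = True   # switch screen and microphone on
        elif self.active and self.missed >= self.off_after:
            self.active = False  # power them down again
        return self.active
```

Feeding `update()` once per camera frame keeps brief detection dropouts from flickering the display.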

We decided to use pre-trained models for object classification. There is a wide choice of freely available models, trained on different object sets and of different complexity. As we wanted them to run on a Raspberry Pi, we needed only the simplest ones. But no matter how small and simple the models were, we couldn't reach acceptable speed: each camera frame took a few seconds, a considerable delay. Note that on top of this classification there was additional communication with the Alexa server, which took a fraction of a second to complete (and the skill server was running on a Raspberry Pi over Wi-Fi - not the fastest setup).
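
Whatever model is used, turning its raw output into an answer the mirror can speak boils down to picking the top-scoring labels. A stdlib-only sketch, assuming the score and label lists come from whichever classification model is deployed:

```python
import math

def top_labels(scores, labels, k=3):
    """Convert raw class scores into the k most likely labels with
    softmax confidences. 'scores' and 'labels' are assumed to be
    parallel lists supplied by the classification model in use."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda lp: lp[1], reverse=True)
    return ranked[:k]
```

The top entry is what the skill would speak ("That's a bicycle."); the runners-up can be used to hedge when confidence is low.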

We needed a way to speed up object detection without losing accuracy. The Intel (Movidius) Neural Compute Stick (NCS) came to our aid: it accelerated detection remarkably and made the delay absolutely acceptable.

However, as Rosebrock later remarked, the NCS is not completely trouble-free to set up:

"The install process is not entirely isolated and can/will change existing libraries on your system."

Nevertheless, we followed the original instructions, which worked sufficiently well for our application.

 

E-mailing Selfies

Taking selfies was straightforward: we took the camera image and sent it to our mail server. However, it was also important to us that the magic mirror confirm the successful sending of the selfie. This was not quite so easy - it turned out that our mail server did not always confirm delivery instantly, and Alexa skills have a timeout by which the skill must respond, or the user is told that something is wrong with the skill. We had not considered this timeout (and honestly didn't know about it, as most of our skills return a response practically immediately).
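
One common way around such a timeout is to acknowledge the user immediately and hand the slow mail delivery to a background thread. A minimal sketch, where `send_selfie` and `respond` are placeholders for the real mail-sending and skill-response functions:

```python
import threading

def answer_selfie_request(send_selfie, respond):
    """Respond to the user right away and do the (possibly slow) mail
    delivery in the background, staying inside the skill's response
    timeout. 'send_selfie' and 'respond' are hypothetical callables."""
    worker = threading.Thread(target=send_selfie, daemon=True)
    worker.start()
    # Confirm dispatch, not delivery - the mail server may be slow.
    respond("OK, I'm sending your selfie.")
    return worker
```

The trade-off is that the spoken confirmation means "the selfie is on its way", not "it has arrived"; a delivery failure would have to be reported some other way.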

Another thing we had not considered at first: if a person can see himself or herself in the mirror, that does not necessarily mean the mirror can see the person, too. A wider camera lens made this task easier.

 

MagicMirror

To display some useful information on the screen, we used the "MagicMirror" software. Practically all of the information on the display is shown by this application. It has a large selection of customisable extensions. We chose a personal calendar, the local news and the local weather forecast.

Hardware: Overheating Raspberry Pi

Finally, we had one more challenge to overcome: with all the components mentioned above running together, the Raspberry Pi got so hot that a red thermometer icon (the firmware's overheating warning) appeared on the screen.

In the worst case we measured 85 °C. At this temperature the processor starts to throttle. Both effects were bad for our talking mirror: sustained high temperatures can damage the hardware, and throttling reduces performance - and we were already at the limit of acceptable delays.

This is where our hardware team came in. First we found out that we did not have much choice of heat sinks in storage, as our own hardware boards never reach temperatures that high. We researched the internet and found, with "ExplainingComputer", a very informative video on mounting passive coolers and the achievable temperature differences. Given the complex work involved, we decided on a very simple solution: thermal adhesive tape and a small heat sink led to a comparable improvement.
The thermals had not been considered when we designed the frame, and the air did not circulate well enough around the heat sink. Fortunately, we managed to find a small, almost silent fan to carry the hot air out of the frame. We could connect it directly to the GPIO header of the Raspberry Pi.
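
The temperature monitoring and fan switching can be sketched in a few lines. On the Raspberry Pi, `vcgencmd measure_temp` prints a line like `temp=62.3'C`; a simple hysteresis keeps a GPIO-driven fan from fluttering on and off. The thresholds here are illustrative, not the project's actual values:

```python
import subprocess

def read_cpu_temp(output=None):
    """Parse the Raspberry Pi firmware temperature reading.

    'vcgencmd measure_temp' prints a line like: temp=62.3'C
    Pass 'output' to parse a captured string instead of running the tool.
    """
    if output is None:
        output = subprocess.check_output(
            ["vcgencmd", "measure_temp"]).decode()
    return float(output.strip().split("=")[1].rstrip("'C"))

def fan_should_run(temp_c, fan_on, start_at=70.0, stop_at=60.0):
    """Hysteresis for the fan: start above one threshold, stop below a
    lower one. The Pi throttles at about 85 degrees C, so the fan
    should kick in well before that."""
    if temp_c >= start_at:
        return True
    if temp_c <= stop_at:
        return False
    return fan_on  # between the thresholds: keep the current state
```

A loop polling `read_cpu_temp()` every few seconds and driving the fan's GPIO pin with `fan_should_run()` completes the picture.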


The Result

Despite some trial-and-error phases, we managed to get all features running on a small Raspberry Pi on time. The talking Magic Mirror was very popular with our visitors and was in demand all day long.
