Foundation AI helps IoT platform develop an application to analyze video and display results in real time

Foundation AI is an Artificial Intelligence Solutions Provider. We help organizations process, manage, and leverage their unstructured data to automate labor-intensive tasks, make better data-driven decisions, and drive real business value.

Zededa is a scalable, cloud-based IoT edge orchestration solution that delivers visibility, control, and security for the distributed edge, with the freedom to deploy and manage any app on any hardware at scale and to connect to any cloud or on-premises system.


Objective

  • Build a hybrid cloud-edge, real-time facial recognition solution that takes full-faced (front-facing) images of people of interest as input.


Approach

  • Configured the Extract Vision Platform to use OpenFace to generate 128-dimensional vector representations of input faces.
  • Decoupled report generation and processing, so that computationally intensive tasks could be offloaded to the cloud.
  • Used Histogram of Oriented Gradients and face landmark estimation to conduct facial recognition and match detected faces to the input faces.


Results

  • The application ran smoothly, with minimal latency and without overloading either the network or the edge hardware.
  • Facial recognition achieved high accuracy when presented with video containing a full-faced view of a person of interest.


Zededa is an IoT platform company that helps companies deploy cloud-native applications to the edge. Their platform helps to increase control, visibility, and security in edge applications. Zededa wanted to demo their platform’s ability to analyze video and display results in real time at a leading Internet of Things conference, but they did not have a demonstrable computer vision solution deployed on their platform.


Zededa approached Foundation AI to configure its Extract Vision Platform to perform video streaming and facial recognition with a hybrid cloud-edge architecture. They wanted the facial recognition solution to be deployed on their IoT platform to display its capabilities.

Zededa’s objectives were to:

  • Build a hybrid cloud-edge, real-time facial recognition solution that takes full-faced (front-facing) images of people of interest as input.

  • Decouple report generation and processing, so that computationally intensive tasks can be offloaded to the cloud.

  • Sync logs to AWS S3 buckets to generate real-time reports and visualizations for users.


Solution

To start development, we decoupled the solution into two modules: video streaming and facial recognition. The customer-facing Video Streaming module had to be extremely lightweight because of the minimal processing power available to the webcam Zededa wanted to use on the edge. Since this hardware is incapable of doing the bulk of the processing, the facial recognition module had to be deployed to the cloud.

Video Streaming

In the real world, video streaming for facial recognition is done by a webcam or a security camera. Since the hardware supporting these cameras is usually very basic, we chose to push the data from this module to a cloud service. Video from these cameras is often captured at a higher frame rate than is needed for facial recognition, so we only needed to transmit every other frame to the cloud. This reduces the load on the cloud service as well as on the camera’s network. In addition to video frames, the camera module also sends a customer ID and a camera ID to the cloud. This enables the reporting function to differentiate between footage from different cameras and different customers.
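
The frame-skipping and ID-tagging logic above can be sketched in a few lines of Python. This is a minimal illustration, not the production module: the field names, the hex encoding of frame bytes, and the payload shape are all assumptions for the sketch, and a real camera module would capture frames with a library such as OpenCV and POST the payloads to the cloud service.

```python
import json

def every_other_frame(frames):
    """Yield every second frame: the cameras capture at a higher
    frame rate than recognition needs, so half the frames suffice
    and both network and cloud load are roughly halved."""
    for i, frame in enumerate(frames):
        if i % 2 == 0:
            yield frame

def build_payload(frame_bytes, customer_id, camera_id):
    """Attach the customer and camera IDs the reporting function
    uses to separate footage (field names are illustrative)."""
    return json.dumps({
        "customer_id": customer_id,
        "camera_id": camera_id,
        "frame": frame_bytes.hex(),  # hex-encode raw bytes for JSON transport
    })

# Six captured frames become three transmitted payloads.
captured = [bytes([i]) for i in range(6)]
payloads = [build_payload(f, "cust-1", "cam-01")
            for f in every_other_frame(captured)]
print(len(payloads))  # 3
```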

Facial Recognition

Zededa provided us with a directory of full-faced images of representatives that would be attending the IoT conference, so that these representatives could be recognized in real-time. The facial recognition module was trained on the images in this directory. We developed a RESTful API server using the lightweight Flask Framework to make these trained images available over the network. Incoming frames are processed, and the results are logged with accompanying confidence scores and labels in appropriate logs based on the customer ID and camera ID passed along with the footage. These logs are synced with a centralized store at regular intervals using an asynchronous cron job.
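
The per-customer, per-camera log routing described above can be sketched as follows. This is a simplified stdlib stand-in: the real service is a Flask API that writes logs to disk, with an asynchronous cron job syncing them to the centralized store; the record fields here are illustrative.

```python
import time
from collections import defaultdict

# In-memory stand-in for the per-camera log files the real service
# writes to disk and periodically syncs to a centralized store.
logs = defaultdict(list)

def log_result(customer_id, camera_id, label, confidence):
    """Route one recognition result to the log for its
    (customer, camera) pair, timestamped for later reporting."""
    logs[(customer_id, camera_id)].append({
        "timestamp": time.time(),
        "label": label,
        "confidence": confidence,
    })

log_result("cust-1", "cam-01", "alice", 0.97)
log_result("cust-1", "cam-02", "bob", 0.88)
print(len(logs))  # 2: one log per (customer, camera) pair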



Face recognition is achieved through a two-step process. The first step is face localization, which uses the Histogram of Oriented Gradients (HOG) method, introduced in 2005. This process begins by converting images to grayscale and is followed by gradient direction detection. Gradient direction detection (identifying the directional change in intensity or color in an image) is superior to pixel intensity detection (identifying how much color is in each individual pixel) because it helps negate the effect of variable illumination on the face. To make this process more robust, gradient detection is done in 16×16-pixel squares, with individual pixels taken into account to determine the overall gradient direction of each square. Once the image is converted to the HOG format, it is compared with a pre-trained HOG face pattern.
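
The gradient-direction idea can be shown with a toy computation on one 16×16 cell. This is a simplified stand-in for a real HOG descriptor (which bins directions into histograms across many cells); it only demonstrates why a cell's dominant gradient direction, unlike raw pixel intensity, captures the direction in which brightness changes.

```python
import math

def cell_gradient_direction(cell):
    """Dominant gradient direction (degrees) of one grayscale cell,
    via central differences. A real HOG descriptor would bin the
    per-pixel directions into a histogram instead of averaging."""
    gx = gy = 0.0
    h, w = len(cell), len(cell[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx += cell[y][x + 1] - cell[y][x - 1]  # horizontal change
            gy += cell[y + 1][x] - cell[y - 1][x]  # vertical change
    return math.degrees(math.atan2(gy, gx))

# A cell that brightens left-to-right: gradient points along +x (0 degrees).
cell = [[x * 10 for x in range(16)] for _ in range(16)]
print(round(cell_gradient_direction(cell)))  # 0
```

Note that scaling every pixel by a constant (brighter or dimmer lighting) leaves the direction unchanged, which is exactly the illumination robustness the text describes.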

The second step is done using face landmark estimation, a method developed around 2014. In this method, we define 68 landmarks on a face (e.g. the top of the chin, the outside edge of the eye, the inside edge of the eyebrow). Once the landmarks are detected, we use an affine transformation to rotate, scale, and shear the image to make sure that the face is centered. This process is important because it avoids complicated 3D warping, which would not preserve parallel lines and could introduce image distortion. During the training phase, each face is then encoded and saved to a central repository.
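
The rotate-and-scale part of that alignment can be sketched from just two eye centers. This is an assumption-laden simplification: the real pipeline derives a full affine transform (including translation) from the 68 landmarks and applies it to the image, whereas this sketch only computes the similarity parameters that would level the eyes and bring them to a canonical spacing.

```python
import math

def alignment_params(left_eye, right_eye, target_dist=0.4):
    """Rotation angle (degrees) and uniform scale that level the
    detected eye centers and normalize their spacing -- the
    rotate/scale portion of an affine face alignment."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))   # rotation needed to level the eyes
    scale = target_dist / math.hypot(dx, dy)   # scale eyes to canonical spacing
    return angle, scale

# A face tilted 45 degrees: the alignment must rotate by 45 degrees.
angle, scale = alignment_params((100.0, 100.0), (170.7, 170.7))
print(round(angle))  # 45
```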

The objective of the application is to recognize different people, then log each such event with a timestamp in a synchronized central repository (AWS S3). The simplest way of doing this would be a one-to-one comparison against every encoded face in the repository. This method, however, would become increasingly inefficient as the number of faces in the database grows.

We circumvented this issue by using an open-source model developed by OpenFace. This model is trained to generate a 128-dimensional vector representation of an input face. During the training phase, all the saved images are passed through the network, and a simple classifier such as an SVM or logistic regression is trained on the tagged encodings. Whenever a new face is detected, it is passed through the same encoding network and then through the classifier, which outputs the identity of the person. Based on the classifier’s confidence, we decide to report or discard the result. Additional people of interest can be added to the system by adding new full-faced images to the input directory. Our approach requires retraining whenever new faces are added to the database but enables much faster processing than other approaches.
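
The encode-classify-threshold flow can be illustrated with a toy classifier. This is a sketch under stated assumptions: a nearest-centroid rule stands in for the SVM/logistic-regression classifier, 3-dimensional vectors stand in for OpenFace's 128-dimensional encodings, and the distance threshold plays the role of the confidence gate that decides whether to report or discard a result.

```python
import math

def train_centroids(encodings_by_person):
    """Mean encoding per person; a toy stand-in for training an
    SVM or logistic regression on the tagged encodings."""
    return {name: [sum(dim) / len(vecs) for dim in zip(*vecs)]
            for name, vecs in encodings_by_person.items()}

def classify(encoding, centroids, threshold=0.6):
    """Nearest centroid with a confidence gate: matches whose
    distance exceeds the threshold are discarded, not reported."""
    name, dist = min(((n, math.dist(encoding, c))
                      for n, c in centroids.items()),
                     key=lambda t: t[1])
    return name if dist <= threshold else None

# Toy 3-d vectors stand in for the 128-d OpenFace encodings.
centroids = train_centroids({
    "alice": [[0.1, 0.2, 0.3], [0.12, 0.18, 0.3]],
    "bob":   [[0.9, 0.8, 0.7]],
})
print(classify([0.11, 0.2, 0.31], centroids))  # alice
print(classify([5.0, 5.0, 5.0], centroids))    # None (low confidence, discarded)
```

As in the deployed system, adding a new person means adding their encodings and retraining; classification itself stays a single fast pass rather than a one-to-one scan of the whole gallery.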


The deployed application runs smoothly, with minimal latency and without overloading either the network or the edge hardware. The application was designed to operate in scenarios where good networking infrastructure is available; when network speeds are sub-optimal, processing latency becomes more apparent. At the request of our client, the solution uses full-face images as input, so it is not effective at recognizing side profiles of individuals. If we were to design a solution where side-profile recognition was necessary, we would replace the full-face image input with video input, guiding each person to record their face from different angles.

If you are interested in deploying a solution built on our Extract Vision Platform, contact us to see what Foundation AI can do for you.
Artificial Intelligence for the Real World
© Foundation AI