Foundation AI is an Artificial Intelligence Solutions Provider. We help organizations process, manage, and leverage their unstructured data to automate labor-intensive tasks, make better data-driven decisions, and drive real business value.
COVID-19 has hit correctional facilities hard. While the vectors for infection are limited (inmate transfers and staff), once COVID-19 enters a facility it can be difficult to contain because of the close quarters in which inmates are housed. Prison administrators have a number of tools to control infections, including mass testing, contact tracing, and quarantining.
Prison populations are tested regularly for SARS-CoV-2 and observed for symptoms. Once a potential case of COVID-19 is discovered, the individual is quarantined and contact tracing begins. Through a manual review of inmate schedules and security camera footage, as well as interviews with regular contacts and staff, contact tracers can identify additional inmates who may have been infected through contact. Those individuals can then be quarantined, observed for symptoms, and tested.
While manual contact tracing works well when cases are limited, it is extremely time-consuming and can miss potential infections. This State Correctional System needed a solution that could provide faster and more accurate results without the need to manually comb through days of video footage.
This Correctional System approached Foundation AI to configure its Extract Video Search product to help automate contact tracing in their facilities. The Correctional System streams its security video feeds and inmate scheduling data into Extract Video Search. When prompted with a suspected case of COVID-19, Extract Video Search is configured to:
Comb through video footage over the previous 14 days and identify footage with the suspected individual.
Identify if anyone spent at least 10 minutes within 10 feet of the suspected individual (the 10/10 rule).
Compare footage of the individuals who meet the 10/10 rule with prison photo records to identify the names of those individuals.
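The three steps above can be sketched as a simple query flow. This is an illustrative sketch only; the function and field names are ours, not the Extract Video Search API, and the per-clip contact minutes are assumed to come from the downstream distance calculation.

```python
def trace(suspect_id, clips, trace_window_days=14):
    """Sketch of the contact-tracing query.

    clips: list of dicts with 'age_days' (int), 'seen' (set of person ids
    detected in the clip), and 'contacts' (person id -> minutes that person
    spent within 10 feet of the suspect in that clip).
    Returns the set of people who meet the 10/10 rule overall.
    """
    minutes = {}
    for clip in clips:
        # Step 1: restrict to footage from the trace window showing the suspect.
        if clip["age_days"] > trace_window_days or suspect_id not in clip["seen"]:
            continue
        # Step 2: accumulate time each other person spent within 10 feet.
        for other, m in clip["contacts"].items():
            minutes[other] = minutes.get(other, 0.0) + m
    # Step 3 (matching detected faces to OMS photos) happens upstream when
    # person ids are assigned; here we only apply the 10-minute threshold.
    return {person for person, m in minutes.items() if m >= 10}
```

Note that contact time accumulates across clips, so two 5-minute contacts within the window also trigger the rule.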
With COVID-19 already a common concern inside prisons, most inmates wear masks while in public areas. This posed an additional challenge, as most facial recognition solutions rely on parts of the face that a mask covers.
Once an inmate has tested positive for COVID-19 or presented symptoms and staff want to initiate contact tracing, they open Extract Video Search and input the inmate’s name or inmate identification number. The system then processes the previous 14 days of footage to identify other inmates or staff who have violated the 10/10 rule. When the system has finished processing, it generates a report for staff including the names of contacts, the date-time stamp and duration of extended contact, and links to the video footage of the contact.
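The report can be thought of as one row per qualifying contact event. A minimal sketch, assuming contact events have already been extracted from the footage; the row field names are illustrative, not the product's actual report schema:

```python
from datetime import datetime

def build_report(events, min_minutes=10):
    """events: list of (name, start: datetime, duration_min: float, clip_url).
    Returns report rows for contacts meeting the duration threshold,
    longest contact first."""
    rows = []
    for name, start, duration_min, clip_url in events:
        if duration_min >= min_minutes:
            rows.append({
                "contact": name,               # name resolved from OMS photos
                "timestamp": start.isoformat(),  # when the contact began
                "duration_min": duration_min,
                "footage": clip_url,           # link back to the video evidence
            })
    return sorted(rows, key=lambda r: -r["duration_min"])
```

Linking each row back to its footage lets staff verify a flagged contact before quarantining anyone.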
The Correctional System wanted to optimize for hardware cost; had they wanted to optimize for speed, we could have configured the system to process video footage on capture instead of at query time. The facial recognition pipeline processes video at a 1:1 ratio (1 second of video per second of processing), and multiple video files can be processed in parallel depending on the hardware available.
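Because each video file is independent, parallelism is straightforward to sketch. The function below is a stand-in for the real per-file pipeline; a CPU- or GPU-bound workload would typically use a process pool or GPU batching instead of threads, but the fan-out pattern is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def process_video(path: str) -> str:
    # Stand-in for the per-file detection/recognition pipeline,
    # which runs at roughly 1 second of video per second of processing.
    return f"processed {path}"

def process_all(paths, workers=4):
    # Each video file is independent, so files can be processed in
    # parallel up to the limits of the available hardware.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_video, paths))
```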
The WiderFace dataset was used to train RetinaFace, a deep learning model, to perform Facial Detection (identifying when human faces appear in digital images) and Facial Alignment (automatically rotating an identified face in 2D and 3D space so that it appears as a full-faced image). The WiderFace dataset consists of 32,203 images and 393,703 face bounding boxes with a high degree of variability in scene type (scale, pose, expression, occlusion, and illumination). It was split into three subsets, training (40%), validation (10%), and testing (50%), by randomly sampling from 61 scene types. During labeling, bounding boxes are drawn around the human faces and the non-face objects in each image, and the labels for every face and non-face are stored in a text file for each image. All the images and label files are used to train the RetinaFace model.
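The WIDER FACE annotation files follow a simple plain-text layout: an image path, a face count, then one line per face beginning with the bounding box (x, y, width, height) followed by attribute flags. A minimal parser sketch, keeping only the boxes (the attribute columns and the zero-face filler-line convention are based on the public annotation format, not on anything project-specific):

```python
def parse_wider_annotations(text: str):
    """Parse WIDER FACE-style annotations: an image path, a face count,
    then one 'x y w h ...' line per face. Only bounding boxes are kept."""
    lines = iter(text.strip().splitlines())
    samples = {}
    for path in lines:
        count = int(next(lines))
        boxes = []
        # Entries with 0 faces still carry one all-zero filler line.
        for _ in range(max(count, 1)):
            parts = next(lines).split()
            if count:
                x, y, w, h = map(int, parts[:4])
                boxes.append((x, y, w, h))
        samples[path] = boxes
    return samples
```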
CASIA WebFace was used to train FaceNet, a Facial Recognition model. Facial Recognition systems match a human face in a digital image or video frame against a database of faces. The training set consists of a total of 453,453 images covering 10,575 identities. We improved model performance by filtering the dataset by face alignment, brightness, and contrast before training.
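Brightness and contrast filtering can be done with simple pixel statistics: mean intensity for brightness, standard deviation for contrast. A pure-Python sketch on flat grayscale pixel lists; the thresholds are illustrative assumptions, not the values used in production:

```python
def brightness_contrast(pixels):
    """pixels: flat list of 0-255 grayscale values.
    Brightness = mean intensity, contrast = standard deviation."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    return mean, var ** 0.5

def keep_image(pixels, min_brightness=40, max_brightness=220, min_contrast=20):
    """Illustrative pre-training filter: drop images that are too dark,
    too bright, or too flat to be useful training examples.
    Thresholds here are assumptions."""
    b, c = brightness_contrast(pixels)
    return min_brightness <= b <= max_brightness and c >= min_contrast
```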
Photographs of each inmate are taken at admission and stored in the facility’s Offender Management System (OMS). These photographs and the corresponding names and inmate identification numbers for each inmate are used to recognize each individual.
The Correctional Facilities’ security video feeds are synced periodically to our cloud service. The video footage is captured at 25 frames per second; however, not all 25 frames are needed for Facial Detection, Alignment, and Recognition. Our server processes 15 frames per second from the source video in order to reduce computational cost and increase processing speed. We use an open-source computer vision and machine learning software library called OpenCV.
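In practice the frames are read with OpenCV (e.g. `cv2.VideoCapture`), but the downsampling decision itself is just index selection: which 15 of every 25 source frames to keep. A self-contained sketch of one evenly spaced selection scheme:

```python
def frames_to_keep(n_frames: int, src_fps: int = 25, dst_fps: int = 15):
    """Indices of source frames to process when downsampling src_fps -> dst_fps.
    Keeps the first frame of each output 'slot' so sampling stays evenly spaced."""
    kept = []
    last_slot = -1
    for i in range(n_frames):
        slot = i * dst_fps // src_fps  # which output slot this source frame maps to
        if slot != last_slot:
            kept.append(i)
            last_slot = slot
    return kept
```

When iterating a capture loop, a frame is simply skipped (not run through detection) if its index is not in the kept set.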
Each frame is processed through our cutting-edge Facial Detection and Recognition pipeline. RetinaFace detects and aligns each face, using facial landmarks such as the eyes, nose, and mouth corners to re-orient the face so that it is front-facing. This is done to increase matching accuracy for faces in video that appear at a different angle than the input images. FaceNet extracts embeddings from each detected face (converting the features of a face into a mathematical representation). This mathematical representation is then compared with the embeddings of each photograph in the OMS to find the most similar one.
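FaceNet embeddings are typically compared with a cosine or Euclidean distance. A pure-Python sketch of the "find the most similar OMS photograph" step using cosine similarity (production code would use vectorized operations over the whole gallery):

```python
def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def best_match(query, gallery):
    """gallery: inmate id -> embedding extracted from the OMS photograph.
    Returns the id whose embedding is most similar to the query embedding."""
    return max(gallery, key=lambda k: cosine_similarity(query, gallery[k]))
```

A real system would also apply a minimum-similarity threshold so that unknown faces are reported as unmatched rather than forced onto the nearest inmate.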
To accommodate inmates wearing masks, we processed each input photograph from the OMS. First, we overlaid an image of a mask over the mouth and nose in each photograph. Then we ran these new images through FaceNet to extract an alternate “masked embedding” for each inmate. We added these new embeddings to our database so that, when faces are detected in security video footage, they can be compared against both “non-masked” and “masked” embeddings.
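The lookup side of this scheme is a gallery with two embedding variants per inmate; a detected face is scored against both and the best overall match wins. A self-contained sketch (structure and names are illustrative):

```python
def best_match_dual(query, gallery):
    """gallery: inmate id -> {'unmasked': embedding, 'masked': embedding}.
    A detected face is compared against both variants, so a masked inmate
    can still be matched to their admission photograph."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / ((sum(x * x for x in a) ** 0.5) *
                      (sum(y * y for y in b) ** 0.5))
    best_id, best_sim = None, -1.0
    for inmate_id, variants in gallery.items():
        for emb in variants.values():
            sim = cos(query, emb)
            if sim > best_sim:
                best_id, best_sim = inmate_id, sim
    return best_id
```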
The last step is to compute a bird’s-eye view distance between each pair of people to identify whether anyone is violating the 10/10 rule. As the input video may be taken from an arbitrary perspective, we must compute the transformation (more specifically, the homography) that will morph the camera’s view into a bird’s-eye (top-down) view. This process is called calibration. As the input frames are monocular (taken from a single camera), the simplest calibration method involves selecting four points in the perspective view and mapping them to the corners of a rectangle in the bird’s-eye view. This assumes that every person is standing on the same flat ground plane. From this mapping, a transformation that can be applied to the entire perspective image can be derived. During this calibration step, the scale factor of the bird’s-eye view is also estimated, e.g. how many pixels correspond to 10 feet in real life. People whose distance is below the minimum acceptable distance are highlighted with a red line connecting them.
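In practice the four-point homography is computed with something like OpenCV’s `cv2.getPerspectiveTransform`. The sketch below assumes the 3x3 homography matrix `H` and the pixels-per-foot scale factor are already calibrated, and shows the remaining step: projecting each person’s foot point to the top-down view and converting pixel distance to feet.

```python
def project(H, point):
    """Apply a 3x3 homography H (nested lists) to an (x, y) image point,
    dividing through by the homogeneous coordinate."""
    x, y = point
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xs / w, ys / w

def distance_feet(H, p1, p2, pixels_per_foot):
    """Bird's-eye distance in feet between two people's foot points,
    using the scale factor estimated during calibration."""
    (x1, y1), (x2, y2) = project(H, p1), project(H, p2)
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 / pixels_per_foot
```

Pairs for which `distance_feet(...) < 10` are the ones highlighted with a red connecting line.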
All the deep learning models used in this pipeline were fine-tuned from pre-trained weights rather than trained from scratch.
RetinaFace is a robust single-stage face detector that performs three face localization tasks together, all within a single-shot framework: face detection, 2D face alignment, and 3D face reconstruction.
FaceNet is a Unified Embedding for Face Recognition system developed in 2015 by researchers at Google that achieved state-of-the-art results on a range of face recognition benchmark datasets. FaceNet provides a unified architecture for tasks like face recognition, verification, and clustering, using deep convolutional networks trained with a triplet loss.
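The triplet loss pushes an anchor embedding closer to a positive example (same identity) than to a negative example (different identity) by at least a margin. A minimal per-triplet version using squared Euclidean distances, as in the FaceNet formulation:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss for a single triplet of embeddings:
    max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

When the positive is already closer than the negative by more than the margin, the loss is zero and the triplet contributes no gradient; training therefore focuses on the hard triplets that still violate the margin.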
The application runs smoothly, processing footage in real time at a rate of 15 frames per second.
The facial recognition pipeline achieves high accuracy on video footage that contains full-faced views of people.
The system could be further improved by adding side-profile images of inmates and staff to the reference photo set.