Getting started with face detection

Today I felt like experimenting with ARKit, but I couldn’t think of a sample project to start with. When I want to explore a new framework or a new API, my approach is usually to start with something simple, but based on an actual use case.
So I thought: what about a video player that pauses the video while I’m not in front of my device and resumes when I look at the screen again?
Sounds interesting? Keep reading, I’ll show you how to do it with ARKit. What about supporting devices without a TrueDepth camera on the front? No problem, there is a fallback implementation using AVFoundation.

The sample project

Here’s the link to the sample project on GitHub.
The app is very simple: a button starts the video player and face tracking at the same time. If ARKit face tracking is not available on the front camera, the app falls back to the AVFoundation class. The ARKit and AVFoundation classes implement the same protocol, so you can use both of them in exactly the same way. Instead of a delegate, I preferred a Combine publisher, so the caller can subscribe and get status updates from the face tracker.


protocol FaceTracker: UIViewController {
    var trackingStatus: AnyPublisher<FaceTrackingStatus, Never> { get }
    static func isAvailable() -> Bool
    func start()
    func stop()
}

This is the protocol describing a FaceTracker.

As I said, we first check whether the ARKit version is available; if not, we fall back to the AVFoundation one.
The FaceTracker is added as a child view controller. In many ARKit examples you see AR applied to a SceneKit view; here there is no view added to the hierarchy and you don’t see your face mirrored on the screen, but I found out you do have to add the object acting as ARSessionDelegate as a child view controller in order for the delegate methods to be called. Not a big deal, but if you implement something in ARKit and your delegate is never called, try using a view controller. This is why the protocol requires the class to be a UIViewController.
Next, let’s see how to get updates via Combine. The protocol requires a trackingStatus property that is a publisher which never fails and publishes FaceTrackingStatus, an enum describing the detection status.
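The exact definition of the enum isn’t shown here, but based on how it is used throughout the post it could look roughly like this (we’ll add one more case in the blink section below):

enum FaceTrackingStatus {
    case faceDetected   // a face is currently in front of the camera
    case noFaceDetected // no face is being tracked
}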


private func startFaceTracking() {
    let tracker: FaceTracker = FaceTrackerAR.isAvailable() ? FaceTrackerAR() : FaceTrackerAV()
    addChild(tracker)
    tracker.start()
    cancellable = tracker.trackingStatus
        .debounce(for: 1.0, scheduler: RunLoop.main)
        .sink(receiveValue: { [weak self] value in
            self?.processTrackingStatus(value)
        })
}

private func processTrackingStatus(_ value: FaceTrackingStatus) {
    switch value {
    case .faceDetected:
        playVideo()
    case .noFaceDetected:
        pauseVideo()
    // ...
    }
}

This is the sample code: we instantiate the correct tracker, using the AVFoundation one as the fallback if ARKit is not available, then we start the tracker and use Combine to get updates.
If you’re not familiar with Combine you can check my previous post about it. For now, let’s just say you need to store the cancellable returned by sink to keep the subscription alive; that’s why we have the cancellable property in our class. Then, before receiving the value with sink, I put a debounce, so if the tracker changes status the video is paused/resumed only after a second.
Now that you know how to get status updates, let’s finally talk about ARKit.

Face tracking with ARKit

In order to use ARKit we need to create an ARSession. Before we start using it, though, let’s see how we can make sure our device has a TrueDepth camera on the front.


extension FaceTrackerAR: FaceTracker {
    static func isAvailable() -> Bool {
        ARFaceTrackingConfiguration.isSupported
    }
    
    func start() {
        let configuration = ARFaceTrackingConfiguration()
        session.run(configuration, options: [])
    }
    
    func stop() {
        session.pause()
    }
}

ARFaceTrackingConfiguration.isSupported is false if we can’t perform face tracking with ARKit, so we can easily perform this check before trying to start an ARSession.
ARKit doesn’t just perform face detection, there is much more, so when you start a session via run you have to provide a configuration. In our example we use ARFaceTrackingConfiguration, but you can also track images, bodies, geographic locations and so on.
Once you run the session, you should start receiving data in your ARSessionDelegate.


func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
    guard let faceAnchor = anchors.first as? ARFaceAnchor else { return }
    
    if faceAnchor.isTracked == true && isTracking == false {
        isTracking = true
        status = .faceDetected
    } else if faceAnchor.isTracked == false && isTracking == true {
        isTracking = false
        status = .noFaceDetected
    }
}

The delegate receives updates about detected ARAnchors, which represent the position and orientation of something of interest in the observed environment. We actually want to detect faces, so we expect an ARFaceAnchor, which contains information about the pose and expression of a face detected by ARKit. When using the TrueDepth camera, only one face at a time is detected, so you’ll get an anchor for only one of them if you point the device at multiple people.
ARFaceAnchor carries a lot of information: there is geometry describing the face, there are transform matrices for both eyes, and there is even the direction of the face’s gaze, so you know where the user is looking.
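None of this is needed for the sample, but just to give an idea, here is a quick sketch (the inspect function is my own) reading a few of those properties:

import ARKit

// Not part of the sample: a peek at some of the data an ARFaceAnchor exposes.
func inspect(_ faceAnchor: ARFaceAnchor) {
    let leftEye = faceAnchor.leftEyeTransform   // transform of the left eye, relative to the face anchor
    let rightEye = faceAnchor.rightEyeTransform // transform of the right eye
    let gaze = faceAnchor.lookAtPoint           // estimated point the eyes are looking at, in face space
    let vertices = faceAnchor.geometry.vertices // mesh vertices describing the detected face
    print(leftEye, rightEye, gaze, vertices.count)
}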
As I stated at the beginning, let’s start simple.
We only need to know if a face is detected, and we can use the isTracked property on the ARFaceAnchor. If isTracked is false, no face is currently being tracked, so we can send this update to our subscriber. If it is true, there is a face looking at the screen.
The variable status is @Published, so when we update it, Combine sends the new value to our subscribers; we don’t need to do anything more than set the variable’s value.
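To make that concrete, here is a sketch of how the ARKit tracker could expose the @Published property as the publisher required by the protocol; the actual declarations in the sample project may differ slightly:

import ARKit
import Combine
import UIKit

class FaceTrackerAR: UIViewController, ARSessionDelegate {
    let session = ARSession()

    @Published private var status: FaceTrackingStatus = .noFaceDetected
    private var isTracking = false

    // The @Published property projected as the AnyPublisher required by FaceTracker.
    var trackingStatus: AnyPublisher<FaceTrackingStatus, Never> {
        $status.eraseToAnyPublisher()
    }

    override func viewDidLoad() {
        super.viewDidLoad()
        session.delegate = self // the delegate method we just saw is called on this view controller
    }
}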

Detect eye blink

I had another idea: wouldn’t it be cool to skip forward 30 seconds after I blink my right eye? And go backwards when I blink the left one?
It is possible and I’ll show you how.
First we need to add another value to our enum so we can handle the blink. We can then use seek on AVPlayer to move forward or backwards in the video timeline.
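Something like the sketch below, where BlinkStatus and the skip helper are my own names and the actual sample may structure this differently:

import AVFoundation

// Which eye (or both) was detected as blinking.
enum BlinkStatus {
    case none, leftEye, rightEye, bothEyes
}

// FaceTrackingStatus gains an extra case: case blink(BlinkStatus)

// Skip forward (positive seconds) or backwards (negative seconds) in the current video.
func skip(by seconds: Double, on player: AVPlayer) {
    let target = CMTimeAdd(player.currentTime(),
                           CMTime(seconds: seconds, preferredTimescale: 600))
    player.seek(to: target)
}

A right eye blink can then call skip(by: 30, on: player) and a left eye blink skip(by: -30, on: player).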
That is the easy part; now let’s see how to actually detect the blink with the help of ARKit.


var blinkStatus: BlinkStatus = .none
// check blink status
// the values are mirrored: .eyeBlinkLeft refers to the right eye
if let leftEyeBlink = faceAnchor.blendShapes[.eyeBlinkLeft] as? Float {
    if leftEyeBlink > blinkThreshold {
        blinkStatus = .rightEye
    }
}
if let rightEyeBlink = faceAnchor.blendShapes[.eyeBlinkRight] as? Float {
    if rightEyeBlink > blinkThreshold {
        blinkStatus = blinkStatus == .rightEye ? .bothEyes : .leftEye
    }
}
if blinkStatus != .none {
    status = .blink(blinkStatus)
}

This code was added at the beginning of the ARSessionDelegate method we saw previously: if there is a blink we report it, otherwise we execute the code that decides whether a face was detected or not. If we detect a blink, of course a face was detected as well.
How does ARKit help us detect a blink? We can use the blendShapes property, which is a dictionary containing the detected facial expressions. There are coefficients for eye blink (left and right) and we’re only interested in those values for the sake of the example, but there are a lot more involving the eyes, mouth, nose, even the tongue.
I decided to set a threshold of 0.5: if the coefficient is lower than that I don’t report the value. The values in the dictionary range from 0.0 to 1.0.
In this example, using the front facing camera, the image we get is mirrored, so if I blink my right eye the coefficient is higher on the .eyeBlinkLeft key of the dictionary and vice versa.

Tracking with AVFoundation

What about older devices, or not so old ones that simply lack a TrueDepth camera?
Turns out, AVFoundation can do face detection. I’m not going into much detail about using an AVCaptureSession here; you can refer to my previous article about detecting barcodes, where I explained how to implement a barcode and QR code reader in Swift.


private func configureCaptureSession() {
    if captureSession != nil {
        return
    }
    let objectTypes: [AVMetadataObject.ObjectType] = [.face]
    let session = AVCaptureSession()
    let deviceDiscoverySession = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera],
                                                                  mediaType: AVMediaType.video,
                                                                  position: .front)
    
    guard let captureDevice = deviceDiscoverySession.devices.first,
        let videoDeviceInput = try? AVCaptureDeviceInput(device: captureDevice),
        session.canAddInput(videoDeviceInput)
        else { return }
    session.addInput(videoDeviceInput)
    
    let metadataOutput = AVCaptureMetadataOutput()
    session.addOutput(metadataOutput)
    metadataOutput.setMetadataObjectsDelegate(self, queue: queue)
    metadataOutput.metadataObjectTypes = objectTypes
    
    captureSession = session
}

This is the code to configure an AVCaptureSession. First you configure the device: you can use any of the device’s cameras, but in this example we want the front facing camera, so we set the position to .front.
Then we need to look for metadata; in this case we need the .face AVMetadataObject type, but we could look for barcodes, QR codes, a human body, or even dogs and cats.
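For completeness, here is a rough sketch of what start and stop could look like on the AVFoundation tracker; the unconditional isAvailable and reusing the same queue are my assumptions:

extension FaceTrackerAV: FaceTracker {
    static func isAvailable() -> Bool {
        // Metadata-based face detection doesn't need a TrueDepth camera,
        // so this fallback is considered always available.
        true
    }

    func start() {
        configureCaptureSession()
        // startRunning() blocks the calling thread, so keep it off the main queue.
        queue.async { [weak self] in
            self?.captureSession?.startRunning()
        }
    }

    func stop() {
        queue.async { [weak self] in
            self?.captureSession?.stopRunning()
        }
    }
}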
Once we have configured and started the session, our delegate should start receiving values.


func metadataOutput(_ output: AVCaptureMetadataOutput,
                           didOutput metadataObjects: [AVMetadataObject],
                           from connection: AVCaptureConnection) {
    let isFaceDetected = metadataObjects.contains(where: { $0.type == .face })
    if isFaceDetected == true && isTracking == false {
        isTracking = true
        status = .faceDetected
    } else if isFaceDetected == false && isTracking == true {
        isTracking = false
        status = .noFaceDetected
    }
}

If the detected metadata is of type .face, we have a face in front of the camera, so we can change the status and notify our subscriber.
I haven’t tested it much, but I guess AVFoundation can detect multiple faces, while ARKit only detects one at a time when using the TrueDepth camera, so if two people are looking at the screen and one of them turns away, the video should keep playing.
There is no blink detection here: AVFoundation doesn’t offer live face tracking comparable to what ARKit provides via the TrueDepth camera.

Conclusion

This was just a quick introduction to face detection. Obviously, there are privacy implications in constantly detecting the user’s face, and of course you need to set the Privacy – Camera Usage Description key in your Info.plist in order to use the front facing camera, otherwise the app will crash as soon as you start the tracking session.
As you can imagine, you can’t test this kind of stuff in the simulator, so you need an actual device.
I do plan to keep playing with ARKit in the future, so stay tuned for more.

Update: I experimented with ARKit on another project to detect nodding and shaking.
You can find it here.

Happy coding 🙂