Text recognition with Vision

In this post I’m going to show you how I implemented OCR, both on a static image and live from the camera feed, using Vision, and I’ll use the same view controller to perform a live scan of barcodes. I already wrote about barcodes in the past, so I won’t go into detail about that part here.
My package, called GFLiveScanner, can be found on GitHub, and there is a sample project I made to import the package and experiment with it.

Vision

The Vision framework was introduced in iOS 11 and enables developers to perform face and text detection, barcode recognition and tracking. It can be used with CoreML to provide a model that recognises specific objects, but that is not the purpose of this post. If you’re interested in those topics, I suggest watching this WWDC video from 2019.

What we need from Vision is its ability to perform OCR on an image, and we’ll use VNRecognizeTextRequest, available since iOS 13. The code in this chapter can be found in this class of my package.


private func processOCRRequest(_ request: GFOcrHelperRequest) {
    let requestHandler: VNImageRequestHandler
    if let orientation = request.orientation {
        requestHandler = VNImageRequestHandler(cgImage: request.image,
                                               orientation: orientation,
                                               options: [:])
    } else {
        requestHandler = VNImageRequestHandler(cgImage: request.image)
    }
    let visionRequest = VNRecognizeTextRequest(completionHandler: recognizeTextHandler)
    visionRequest.recognitionLevel = useFastRecognition ? .fast : .accurate
    do {
        try requestHandler.perform([visionRequest])
    } catch {
        print("Error while performing vision request: \(error).")
        currentRequestProcessed(strings: nil)
    }
}

This is the function responsible for processing a request. The request is a struct containing the image, its optional orientation and a callback to receive the recognised text.
First, a VNImageRequestHandler is created. This is the object responsible for performing requests on an image. In our example we’ll pass an OCR request, but as I mentioned before there are many more request types.
As you can see, we can initialise the request handler with an image orientation: if we know our image is upside down, we can specify that and help the recognition process.
The handler can perform multiple requests on the same image; that’s why we pass an array of VNRequest.
In our example we only need OCR, so we create a VNRecognizeTextRequest with a completion handler you’ll see below. The request has a recognitionLevel option you can set to .fast if you want the result as quickly as possible (necessary for a live capture), or .accurate if you can wait a little longer and get a better result.
It is also possible to specify an array of languages, so the OCR will prioritise the languages you set.
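As a sketch, the request could be configured like this before it is performed (the language identifiers below are illustrative, check the Vision documentation for the identifiers supported on your iOS version):

let visionRequest = VNRecognizeTextRequest(completionHandler: recognizeTextHandler)
visionRequest.recognitionLevel = .accurate
// Prioritise Italian over English; identifiers are just an example
visionRequest.recognitionLanguages = ["it-IT", "en-US"]
// Let Vision apply language-based correction to the raw result
visionRequest.usesLanguageCorrection = true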


private func recognizeTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        currentRequestProcessed(strings: nil)
        return
    }
    let recognizedStrings = observations.compactMap { observation in
        return observation.topCandidates(1).first?.string
    }
    currentRequestProcessed(strings: recognizedStrings)
}

This is the completion handler we set on the VNRecognizeTextRequest; it is called once the request has been fulfilled, either with success or with an error.
The results array is then processed via compactMap, and for each observation we get the top candidate as a string. The topCandidates function can be called with a parameter up to 10, so you can get an array of possible strings from a single observation; by calling it with 1 we only get the most likely string, so we don’t have to deal with multiple possible values.
Finally, a callback is called with the array of detected strings from the given image.
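If you do need more than the single best guess, each observation can return several candidates together with a confidence score. A small sketch, reusing the observations array from the handler above:

// Inspect up to three candidates per observation, with confidence (0...1)
for observation in observations {
    for candidate in observation.topCandidates(3) {
        print("\(candidate.string) - confidence: \(candidate.confidence)")
    }
}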

Live OCR

My package can be used to perform OCR on static images; you could call the helper class with an image obtained from UIImagePickerController, for example. But I called it GFLiveScanner because its main purpose is to perform live scanning for OCR or barcode recognition.
To perform a live scan, you need to present GFLiveScannerViewController or add it as a child of your view controller.
The view controller can either be fullscreen or be configured with a toolbar containing a close button. Examples of how to use the VC can be found in this project.
I won’t describe the entire view controller here; I’ll only cover the live scanning part: how to get the feed from the camera, pass it to the OCR and show a preview on screen, so the user can point the camera in the right direction to detect text.
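Embedding the scanner follows the standard UIKit containment pattern. How the scanner view controller itself is created depends on the package’s initialisers, so see the sample project for that part; the containment itself looks like this:

// scannerVC is a GFLiveScannerViewController created per the package's API,
// containerView is the view that should host the camera preview
addChild(scannerVC)
scannerVC.view.frame = containerView.bounds
containerView.addSubview(scannerVC.view)
scannerVC.didMove(toParent: self)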

Access the camera with AVFoundation

In order to access the camera for a live capture, we need to import the AVFoundation framework.
There is a lot we can do with AVFoundation: it isn’t just for images but for audio as well; for example, you can perform text to speech, record and play audio, and even edit video.
Our focus is on camera capture: we’ll set up a capture session, get a preview layer and then pull images from the live capture in order to perform OCR or barcode recognition.
In this post I’ll focus on the OCR scan more than the barcode one. There are two classes implementing the GFLiveScanner protocol, and this chapter is about GFLiveOcrViewController.


private func configureCaptureSession() {
    guard let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                               for: .video,
                                               position: .back),
          let input = try? AVCaptureDeviceInput(device: device) else {
        return
    }
    let session = AVCaptureSession()
    session.addInput(input)
    let output = AVCaptureVideoDataOutput()
    output.videoSettings = [String(kCVPixelBufferPixelFormatTypeKey): Int(kCVPixelFormatType_32BGRA)]
    output.setSampleBufferDelegate(self, queue: DispatchQueue.main)
    session.addOutput(output)
    self.captureSession = session
}

First, we need to get the input device. I use the wide camera, but you may choose to capture from the ultra wide or the telephoto lens on models with those cameras. You can even capture from the front facing camera, or from the TrueDepth camera if you need depth information.
Next, we create the AVCaptureSession, with the camera we chose before as our input.
We then need to add an AVCaptureVideoDataOutput to the session, and we’ll set our class as its SampleBufferDelegate. The delegate is called every time there is something available for us to process, in the form of a CMSampleBuffer. This object may contain video, audio or image data depending on the session; in our example we’ll get an image from the CMSampleBuffer and try to perform OCR on it.
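One thing to keep in mind: the camera requires an NSCameraUsageDescription entry in Info.plist, and the user must grant permission. A sketch of the authorisation check you would run before configuring the session:

switch AVCaptureDevice.authorizationStatus(for: .video) {
case .authorized:
    configureCaptureSession()
case .notDetermined:
    // First launch: ask the user, then configure on the main queue
    AVCaptureDevice.requestAccess(for: .video) { granted in
        if granted {
            DispatchQueue.main.async { self.configureCaptureSession() }
        }
    }
default:
    break // denied or restricted: explain why the camera is needed
}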

Get an image from the capture session

Once we set up the capture session, we’ll start receiving CMSampleBuffers in our delegate function captureOutput.


func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let image = GFLiveScannerUtils.getCGImageFromSampleBuffer(sampleBuffer) else {
        return
    }
    let orientation = GFLiveScannerUtils.imageOrientationForCurrentOrientation()
    ocrHelper.getTextFromImage(image, orientation: orientation) { success, strings in
        if let strings = strings {
            self.delegate?.capturedStrings(strings: strings)
            self.capturedStrings = strings
        }
    }
}

Once we have the image, we can call the OCR helper class to get text from it, so let’s see how to get a CGImage from the CMSampleBuffer. The code is in GFLiveScannerUtils.


class func getCGImageFromSampleBuffer(_ sampleBuffer: CMSampleBuffer) -> CGImage? {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return nil
    }
    CVPixelBufferLockBaseAddress(pixelBuffer, .readOnly)
    // Make sure the buffer is unlocked on every return path
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, .readOnly) }
    let baseAddress = CVPixelBufferGetBaseAddress(pixelBuffer)
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)
    let bytesPerRow = CVPixelBufferGetBytesPerRow(pixelBuffer)
    let colorSpace = CGColorSpaceCreateDeviceRGB()
    let bitmapInfo = CGBitmapInfo(rawValue: CGImageAlphaInfo.premultipliedFirst.rawValue | CGBitmapInfo.byteOrder32Little.rawValue)
    guard let context = CGContext(data: baseAddress, width: width,
                                  height: height, bitsPerComponent: 8, bytesPerRow: bytesPerRow,
                                  space: colorSpace, bitmapInfo: bitmapInfo.rawValue) else {
        return nil
    }
    return context.makeImage()
}

It took me a while to find the right settings, so I’m happy to share them with you.
I was able to make this work with the parameters you can see above, and by setting kCVPixelFormatType_32BGRA as the video format for AVCaptureVideoDataOutput.
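If you’d rather not deal with CGContext parameters at all, CoreImage can do the conversion for you. A possible alternative (note that creating a CIContext is expensive, so in real code you would cache and reuse it):

import CoreImage

class func getCGImageViaCoreImage(_ sampleBuffer: CMSampleBuffer) -> CGImage? {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return nil
    }
    // CoreImage handles the pixel format details internally
    let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
    let context = CIContext()
    return context.createCGImage(ciImage, from: ciImage.extent)
}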

Show a preview

If our user has to scan something for OCR or barcodes, they obviously need to see a live preview of the camera feed in order to point the phone at the object they’re interested in.
AVFoundation gives us the ability to get a CALayer from an AVCaptureSession, so we can add this layer to one of our views and show the preview.


private func configurePreview() {
    guard let session = captureSession else { return }
    if self.previewLayer == nil {
        let previewLayer = AVCaptureVideoPreviewLayer(session: session)
        previewLayer.frame = cameraView.layer.bounds
        previewLayer.videoGravity = .resizeAspectFill
        cameraView.layer.addSublayer(previewLayer)
        self.previewLayer = previewLayer
    }
}

This is how we can get a CALayer from the session. The videoGravity property can be set to resizeAspect or resizeAspectFill; both preserve the aspect ratio, with the former fitting the video inside the view bounds and the latter filling them, cropping if necessary.
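Since a CALayer doesn’t participate in Auto Layout, the preview layer’s frame needs to be updated manually whenever the hosting view changes size, for example on rotation. One way to do that:

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    // Keep the preview layer in sync with the container view's size
    previewLayer?.frame = cameraView.layer.bounds
}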

Scan for barcodes

I took most of the code for the barcode scanner from a previous project, please refer to this post if you want to find out more.
The main difference is in setting up the capture session:


let metadataOutput = AVCaptureMetadataOutput()
session.addOutput(metadataOutput)
metadataOutput.setMetadataObjectsDelegate(self, queue: queue)

In addition to the video output, I add an AVCaptureMetadataOutput to the session.
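Note that the metadata output only reports the types you opt into, and metadataObjectTypes must be set after the output has been added to the session. The list below is just an example:

// Opt in to the barcode types you care about (illustrative list)
metadataOutput.metadataObjectTypes = [.ean8, .ean13, .qr, .code128]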


public func metadataOutput(_ output: AVCaptureMetadataOutput, didOutput metadataObjects: [AVMetadataObject], from connection: AVCaptureConnection) {
    let codes = getBarcodeStringFromCapturedObjects(metadataObjects: metadataObjects)
    delegate?.capturedStrings(strings: codes)
}

private func getBarcodeStringFromCapturedObjects(metadataObjects: [AVMetadataObject]) -> [String] {
    var rectangles = [CGRect]()
    var codes: [String] = []
    for metadata in metadataObjects {
        if let object = metadata as? AVMetadataMachineReadableCodeObject,
           let stringValue = object.stringValue {
            codes.append(stringValue)
            if drawRectangles {
                rectangles.append(object.bounds)
            }
        }
    }
    if drawRectangles {
        drawRectangles(rectangles)
    }
    return codes
}

The delegate method metadataOutput is called every time there is new metadata available from the AVCaptureSession. As you can see, I iterate through the metadata objects, try to cast each one as an AVMetadataMachineReadableCodeObject and then append its string value to the results.
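One caveat if you draw rectangles around the codes: the bounds of an AVMetadataObject are expressed in the metadata output’s coordinate space, not in view coordinates. The preview layer can convert them for you; a sketch:

// Convert from metadata-output coordinates to the preview layer's space
if let transformed = previewLayer.transformedMetadataObject(for: object) {
    rectangles.append(transformed.bounds)
}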

Hope you found this post interesting; feel free to import my package via Swift Package Manager. This is my article about using SPM in your project, if you need help.
Happy coding 🙂