Predicting demographics from meibography using deep learning
Development and evaluation datasets
This study utilized a meibography image dataset from a previously published work20 along with corresponding subject demographic information for deep learning algorithm development and evaluation.
Subject recruitment and imaging
Adult human subjects (age \(\ge\) 18 years) were recruited from the University of California (UC), Berkeley campus and surrounding community for single-visit ocular surface evaluations at the UC Berkeley Clinical Research Center between 2012 and 2017. Eligible subjects were free of any eye conditions contraindicating meibography, were not currently taking medications affecting the anterior eye or adnexa, and had no history of ocular surgery. The research protocol adhered to the tenets of the Declaration of Helsinki and was approved by the institutional review board (UC Berkeley Committee for Protection of Human Subjects). Informed consent was obtained from all subjects after they were informed of the goals, procedures, risks, and potential benefits of the study. Meibography images of the upper eyelids of both eyes were captured with the OCULUS Keratograph 5M (OCULUS, Arlington, WA), a clinical instrument that uses infrared light with a wavelength of 880 nm for Meibomian gland imaging27. During image capture, the ambient light was turned off and the subject's head was positioned on a chin rest and forehead strap apparatus. A total of 750 images were collected and pre-screened to rule out images that did not capture the entire upper eyelid (61 images or 8.90%); the remaining 689 images were used in the analysis.
Demographics
Subject demographics were documented during the visit. Three demographic characteristics were studied in this work, namely age, gender and ethnicity. Histograms depicting the distributions of these demographic features are presented in Fig. 1. Because some ethnicities had too few subjects for adequate training of the model, accurate predictions could be made only for the two largest groups: Caucasians and Asians. The total number of images used for ethnicity prediction was thus 421, while all 689 images were used for age and gender prediction.
Morphological features
The development of an interpretable deep learning model for predicting demographic characteristics requires morphological features such as gland length and tortuosity as data sources. Eight morphological features were quantified for each meibography image as in our previous work: number of glands, gland density, percent area of gland atrophy, gland local contrast, gland length (mm), gland width (mm), gland tortuosity and percentage of ghost glands20. Histograms of these morphological features are presented in Fig. 2.
Data partitioning
Meibography images were partitioned into two mutually exclusive subsets for training and evaluating the deep learning model. Images collected from 2015 to 2017 constituted the development set, while those collected from 2012 to 2013 constituted the evaluation set. All images were taken with the same instrument under the same protocol. The development set was further divided randomly into two subsets for training and validating the model. Specifically, the validation set was used to fine-tune the hyperparameters (e.g., learning rate) of the model trained on the training set. The evaluation set was used to test the final performance of the model. Subject demographics stratified by development and evaluation datasets are shown in Table 1. The subsets had similar demographic feature distributions, minimizing the distributional shift between the training and evaluation sets.
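For concreteness, the year-based split and the random train/validation division could be implemented as in the following Python sketch. The metadata file name, column names, and the 80/20 train/validation ratio are illustrative assumptions, not details reported in the study.

```python
# Minimal sketch of the year-based data partitioning, assuming a metadata
# table with hypothetical "image_path" and "year" columns.
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("meibography_metadata.csv")  # hypothetical metadata file

development = meta[meta["year"].between(2015, 2017)]  # development set
evaluation = meta[meta["year"].between(2012, 2013)]   # held-out evaluation set

# Randomly split the development set into training and validation subsets
# (the 80/20 ratio here is an assumption for illustration).
train_meta, val_meta = train_test_split(development, test_size=0.2, random_state=0)
```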
Algorithm design and training
The overall goal is to design an interpretable deep learning model that can predict the demographic characteristics of a subject. Interpretability requires that the model identify the morphological features most heavily weighted by the algorithm when predicting a subject's demographic characteristics directly from their meibography image. A two-stage model was designed, with a first-stage attribute learning model to identify and quantify morphological features from input meibography images, and a second-stage demographic prediction model to predict subject demographic features from meibography images and the corresponding first-stage morphological features. Figure 3 depicts the overall pipeline.
Deep attribute learning
In the first stage, a deep learning model was developed to predict and quantify the morphological features of a given meibography image (first part of Fig. 3). The primary goal of the attribute learning model is to provide value ranges rather than exact values of morphological features for the final demographic predictions. There are two underlying reasons for this: (1) predicting coarser value ranges is easier for the deep learning model than predicting precise values, especially since the dataset (689 images in total) was not large enough to learn precise morphological feature values; and (2) morphological attribute prediction was an intermediate result whose major purpose was to interpret relationships between demographic and morphological features, for which value ranges are adequate. For example, when predicting gender it would be acceptable to find that females exhibit a high probability of having > 15 glands rather than a high probability of having exactly 16 glands. Therefore, the first-stage deep learning model predicts morphological features to fall within ordinal ranges (or, in the case of ghost gland percentage, binary classes).
The morphological attribute learning model specifically predicts a ternary level rather than an exact numerical value for each morphological feature. As depicted in Fig. 4, the model predicts each morphological feature value to fall below \(\mu -\sigma\) (level 1), between \(\mu -\sigma\) and \(\mu +\sigma\) (level 2), or above \(\mu +\sigma\) (level 3), where \(\mu\) and \(\sigma\) refer to the mean and standard deviation of that morphological feature's value distribution. Table 2 provides the \(\mu\) and \(\sigma\) for all morphological features investigated. In the case of the percentage of ghost glands, 77.1% of images (531 images) have 0 ghost glands. Therefore, a binary class was used for this feature (percentage of ghost glands = 0 or > 0).
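A minimal Python sketch of this binning scheme is given below, assuming the per-feature \(\mu\) and \(\sigma\) of Table 2 are supplied; the function names are illustrative rather than taken from the study code.

```python
import numpy as np

def ternary_level(values, mu, sigma):
    """Map raw morphological feature values to levels 1, 2, 3:
    below mu - sigma, within mu +/- sigma, above mu + sigma."""
    values = np.asarray(values, dtype=float)
    levels = np.full(values.shape, 2, dtype=int)  # default: level 2
    levels[values < mu - sigma] = 1
    levels[values > mu + sigma] = 3
    return levels

def ghost_gland_class(percentages):
    """Binary label for percentage of ghost glands: 0 if zero, 1 if > 0."""
    return (np.asarray(percentages, dtype=float) > 0).astype(int)
```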
Specifically, for each morphological attribute (e.g., gland length), a meibography image was fed to a ResNet18 (an 18-layer residual neural network)28 to obtain a 64-dimensional vector. This vector was produced by adding a fully connected layer directly after the last convolution layer of ResNet18. The 64-dimensional feature vector was then fed to another fully connected layer for classifying the corresponding attribute (e.g., the ternary level of gland length). The process was the same for all 8 morphological attribute prediction models, so for each meibography image there were eight 64-dimensional vectors, each one encoding the corresponding morphological attribute.
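The stage-one architecture described above could be sketched in PyTorch as follows. Training details (loss, optimizer, any pretraining of the backbone) are omitted, and the class name and the use of untrained ResNet18 weights are assumptions made for illustration.

```python
import torch.nn as nn
from torchvision import models

class AttributeNet(nn.Module):
    """Sketch of one stage-one attribute model: a ResNet18 backbone, a fully
    connected layer producing a 64-dimensional attribute vector, and a small
    head classifying the ternary (or binary) attribute level."""

    def __init__(self, num_levels: int = 3):
        super().__init__()
        backbone = models.resnet18(weights=None)   # pretraining choice is an assumption
        in_features = backbone.fc.in_features      # 512 for ResNet18
        backbone.fc = nn.Identity()                # keep pooled convolutional features
        self.backbone = backbone
        self.embed = nn.Linear(in_features, 64)    # 64-dimensional attribute vector
        self.classifier = nn.Linear(64, num_levels)

    def forward(self, x):
        feat = self.backbone(x)        # (batch, 512)
        vec = self.embed(feat)         # (batch, 64)
        logits = self.classifier(vec)  # (batch, num_levels)
        return vec, logits

# One such model would be trained per morphological attribute (8 in total).
```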
Demographic feature prediction
In the second stage, a deep learning model was developed to predict demographic features from both meibography images and the corresponding attributes from the stage-one attribute learning model (second part of Fig. 3). Specifically, a given image was input to ResNet1828 to obtain a 64-dimensional vector. This vector can be considered an embedding that encodes information about the image. The vector was combined with the 8 morphological feature vectors predicted by the stage-one deep attribute learning model; all vectors are of the same dimension. The combined 9 vectors were input to a fully connected layer for predicting the demographic features.
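A possible stage-two implementation is sketched below. Concatenating the nine 64-dimensional vectors is an assumption about how they are "combined", and the default class count corresponds to, for example, the three age categories; the class name is illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class DemographicNet(nn.Module):
    """Sketch of the stage-two model: an image embedding plus the 8 attribute
    vectors from stage one are concatenated and passed through a fully
    connected layer to predict one demographic feature."""

    def __init__(self, num_classes: int = 3, num_attributes: int = 8, dim: int = 64):
        super().__init__()
        backbone = models.resnet18(weights=None)
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.embed = nn.Linear(in_features, dim)                  # image embedding
        self.head = nn.Linear(dim * (1 + num_attributes), num_classes)

    def forward(self, image, attribute_vectors):
        # attribute_vectors: (batch, 8, 64) produced by the stage-one models
        img_vec = self.embed(self.backbone(image))                # (batch, 64)
        combined = torch.cat([img_vec, attribute_vectors.flatten(1)], dim=1)
        return self.head(combined)                                # (batch, num_classes)
```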
Among the three demographic features to be predicted, gender and ethnicity are categorical, while age is continuous numerical. Following Dana et al.29 and based on dry eye prevalence, subject age was stratified into three categories: (1) ≤ 39 years old, (2) > 39 and < 50 years old, and (3) ≥ 50 years old.
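This stratification can be expressed as a small helper function, assuming age is given in years; the function name is an illustrative choice.

```python
def age_category(age: float) -> int:
    """Stratify age into the three categories used for prediction:
    1: <= 39 years, 2: > 39 and < 50 years, 3: >= 50 years."""
    if age <= 39:
        return 1
    if age < 50:
        return 2
    return 3
```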
The final output of the demographic prediction model can be interpreted by analyzing the learned coefficients of the morphological features used to predict the demographic characteristics. Higher coefficient values indicate a stronger weighting of a morphological feature in predicting a demographic feature.
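One hedged way to read off such coefficients, assuming the stage-two head is the single linear layer from the sketch above, is to inspect the weight segments corresponding to each attribute vector; using the norm of each segment as an importance score is an illustrative choice, not necessarily the analysis performed in the study.

```python
import torch

def attribute_importance(head: torch.nn.Linear, num_attributes: int = 8, dim: int = 64):
    """Rough per-attribute importance from a linear head whose input is the
    image embedding followed by the 8 attribute vectors (width `dim` each)."""
    weights = head.weight.detach()                      # (num_classes, dim * 9)
    segments = weights.split(dim, dim=1)                # 9 segments of width 64
    return [seg.norm().item() for seg in segments[1:]]  # skip the image segment
```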
Evaluation metrics
The model was trained on the training set with varying hyperparameters (e.g., different learning rates), and the model with the highest performance on the validation set was selected for final evaluation on the evaluation set. This selection procedure was applied to both the attribute learning and demographic prediction models, and their performance was evaluated by classification accuracy.
Classification evaluation with tolerance threshold
This evaluation technique was used for assessing deep attribute learning performance. As described in the previous section, the stage-one deep attribute learning model predicts the ternary level of each morphological feature (or binary, for the percentage of ghost glands). However, near the transition limits between levels (\(\mu -\sigma\) and \(\mu +\sigma\)), the morphological features may be very similar and difficult to classify. A tolerance threshold near the grading transition limits was therefore applied, similar to the technique described in Wang et al.30. As illustrated in Fig. 5, the tolerance threshold was set at 0.03\(\sigma\), and classifying morphological feature values falling within \(\mu -1.06\sigma\) to \(\mu -0.94\sigma\) or \(\mu +0.94\sigma\) to \(\mu +1.06\sigma\) to either their ground-truth level or the adjacent level was considered a correct prediction. Note that the tolerance threshold does not apply to predictions of the percentage of ghost glands, as that is a binary classification.
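The tolerance-based scoring could be implemented as in the sketch below, assuming access to the raw feature values and the per-feature \(\mu\) and \(\sigma\). The band half-width is parameterized; the default of 0.06\(\sigma\) is chosen to match the \(\mu \pm 1.06\sigma\) and \(\mu \pm 0.94\sigma\) limits quoted above, and the function name is illustrative.

```python
import numpy as np

def accuracy_with_tolerance(values, pred_levels, mu, sigma, tol=0.06):
    """Accuracy for ternary attribute levels with a tolerance band of
    half-width tol*sigma around the transition limits mu - sigma and
    mu + sigma: inside the band, predicting either the ground-truth
    level or the adjacent level counts as correct."""
    values = np.asarray(values, dtype=float)
    pred_levels = np.asarray(pred_levels, dtype=int)

    # Ground-truth ternary levels (1, 2, 3) from the raw feature values.
    true_levels = np.full(values.shape, 2, dtype=int)
    true_levels[values < mu - sigma] = 1
    true_levels[values > mu + sigma] = 3

    near_limit = (np.abs(values - (mu - sigma)) <= tol * sigma) | \
                 (np.abs(values - (mu + sigma)) <= tol * sigma)
    adjacent = np.abs(pred_levels - true_levels) <= 1

    correct = (pred_levels == true_levels) | (near_limit & adjacent)
    return correct.mean()
```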
Five-fold cross-validation
For evaluating both attribute learning and demographic prediction performance, in addition to reporting classification accuracy on the evaluation set with the best performing model on the validation set, five-fold cross-validation accuracy is also reported. First, the entire dataset (including both development and evaluation subsets) was randomly partitioned into 5 folds. Second, 5 iterations of training and evaluation were conducted; at each iteration, 4 folds were used for training and the remaining fold for evaluation. The mean and standard deviation of the classification accuracy across the five folds were reported as the five-fold cross-validation accuracy.
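A minimal sketch of this protocol with scikit-learn's KFold is shown below; `train_and_evaluate` is a hypothetical stand-in for training a model on one split and returning its classification accuracy on the held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_accuracy(images, labels, train_and_evaluate, seed=0):
    """Run 5 train/evaluate iterations and report mean and standard
    deviation of the per-fold classification accuracy."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, eval_idx in kfold.split(images):
        acc = train_and_evaluate(images[train_idx], labels[train_idx],
                                 images[eval_idx], labels[eval_idx])
        accuracies.append(acc)
    return np.mean(accuracies), np.std(accuracies)
```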