Rapid bacteria identification using structured illumination microscopy and machine learning

Yingchuan He*, Weize Xu, Yao Zhi, Rohit Tyagi†‡, Zhe Hu‡||†† and Gang Cao†‡¶**††
*College of Engineering, Huazhong Agricultural University, Wuhan 430070, P. R. China
†College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070, P. R. China
‡State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan 430070, P. R. China
§Bio-Medical Center, Huazhong Agricultural University, Wuhan 430070, P. R. China
¶Key Laboratory of Development of Veterinary Diagnostic Products, Ministry of Agriculture, College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070, P. R. China
||huzhe@mail.hzau.edu.cn
**gcao@mail.hzau.edu.cn


Introduction
Bacteria are microorganisms with a typical length of several micrometers and different shapes (sphere, rod, spiral, etc.) [1]. Some bacteria are harmful to humans because they cause serious infections and diseases (and are thus called pathogenic bacteria). Bacteria detection and identification are critical for the diagnosis and treatment of infectious diseases. Currently, pathogenic bacteria are usually identified by morphological features, physiological and biochemical characteristics (such as nutritional type and antibiotic sensitivity), immunological markers (bacterial antigen, capsular antigen, etc.), chemical composition characteristics (for example, fatty acid composition and ribosomal protein) and genetic markers (such as 16S rDNA) [2]. With the advancement of biochemical analysis technology and the progress of nucleic acid sequencing, a number of bacteria identification methods have been commercialized as assay kits, equipment and technical services. Although the accuracy of bacteria identification has improved drastically, these methods still have limitations in practical applications. For example, physiological and biochemical identification methods require a pure culture to be isolated with microbiological techniques first. Methods based on high-throughput sequencing, on the other hand, are usually expensive, complicated and time-consuming. We therefore aim to avoid these limitations by applying optical microscopy directly, which offers a simple, cost-effective procedure with high accuracy.
The traditional method based on microscopic morphology seems to be a simple, fast and economical way to identify bacteria, especially bacteria with unique structural features [3]. However, the development of this method has been slow, mainly because conventional optical microscopy visualizes only limited morphological features and because no standard pattern image database exists. Furthermore, this microscopy-based method usually relies on manual identification, which is time-consuming and depends on operator training.
In recent years, the advent of super-resolution microscopy techniques, such as Stimulated Emission Depletion (STED) Microscopy, Stochastic Optical Reconstruction Microscopy (STORM), Photoactivated Localization Microscopy (PALM) and Structured Illumination Microscopy (SIM), has extended the application range of conventional optical microscopy beyond the diffraction limit and revealed more structural details in different applications [4]. Notably, SIM can image bacterial morphology without any additional requirements on biological sample preparation. In the meantime, the rapid development of machine learning has benefited many applications, including but not limited to those in the biomedical field [5]. Combining SIM with machine learning could therefore provide a rapid and automatic way to identify bacteria with higher accuracy than the conventional microscopy-based method.
Here we report a pilot study that combines SIM with machine learning for rapid bacteria identification. We first used SIM to image the fine structures of three model bacteria: Escherichia coli (E. coli), Mycobacterium smegmatis, and Pseudomonas aeruginosa. Then, we applied classical machine learning algorithms to extract morphological features of these bacteria. Finally, we established a machine learning system for rapid bacteria detection and identification. This study might open a new avenue for rapid clinical diagnosis of pathogenic bacteria by addressing the limitations in available morphological features and identification accuracy.
Cells were grown overnight to attain an OD600 of 0.5, then 200 µl of broth culture was harvested and suspended in 50 µl PBS. The bacterial membrane was stained by incubating with NanoOrange (Invitrogen, 1/10 v/v) for 30 min at room temperature [6]. Because NanoOrange exhibits very weak fluorescence when it is not bound to the membrane, we directly spotted 3 µl of this suspension onto a poly-L-lysine-treated glass coverslip without washing.

SIM imaging
A Nikon Structured Illumination Microscope (N-SIM) was used for super-resolution microscopy imaging of the bacteria. Images were captured with an EMCCD camera (Andor iXon DU-897) and a 100×, 1.49 NA TIRF objective (Nikon CFI Apo TIRF). The fluorescence was excited by a 488 nm laser and filtered by a bandpass emission filter (500-545 nm). Image acquisition and reconstruction were performed with Nikon NIS-Elements software in SIM and wide-field mode, respectively.

Image segmentation and negative samples generation
First, we used the watershed algorithm [7] in the open-source computer vision library OpenCV [8] to segment the SIM images into several target bacterium regions. Then, we produced standard images with a size of 250 × 250 pixels, each consisting of a segmented target bacterium on a noise-free background, for model training. In addition, since nonbacterial images were needed as negative samples during model training, we manually selected some sub-regions in the SIM images as reference areas. These sub-regions included as many noise types as possible and did not contain any bacteria. Finally, a sufficient number of negative samples with the same size (250 × 250 pixels) were generated by random selection from these sub-regions.
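The random selection of negative samples can be sketched as follows. This is a minimal illustration in NumPy, not the authors' code: the function name `sample_negative_patches`, the seed, and the synthetic background array are assumptions made for the example, and the reference region is assumed to be at least 250 × 250 pixels.

```python
import numpy as np

def sample_negative_patches(region, n_patches, size=250, seed=0):
    """Cut n_patches random size x size crops out of a bacteria-free region."""
    rng = np.random.default_rng(seed)
    h, w = region.shape[:2]
    if h < size or w < size:
        raise ValueError("reference region is smaller than the patch size")
    patches = []
    for _ in range(n_patches):
        # Upper-left corner chosen uniformly so the crop stays inside the region.
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        patches.append(region[top:top + size, left:left + size].copy())
    return np.stack(patches)

# Synthetic stand-in for a manually selected background region (assumption).
background = np.random.default_rng(1).normal(100.0, 5.0, size=(600, 800))
negatives = sample_negative_patches(background, n_patches=20)
```

Crops are allowed to overlap here, which keeps the number of obtainable negatives independent of the region size.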

Algorithm for feature extraction
Feature extraction determines the efficiency of model selection. If an image is input directly as a vector, rather than as extracted features, in the classifier training process, extra computing time and resources are required because of the high data dimension. Moreover, for most classification models, a high data dimension usually reduces the efficiency of classification. After considering the characteristics of our SIM images, we selected the Principal Component Analysis (PCA) [9] method to extract the algebraic features of the images. This method reduces the dimension of the SIM images to a size acceptable for classifier training.
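As a rough sketch of this dimension reduction, PCA can be computed from a singular value decomposition of the mean-centred, flattened images. The helper names `fit_pca` and `project`, the small 25 × 25 toy images, and the choice of 10 components are assumptions for illustration only; they are not taken from the study.

```python
import numpy as np

def fit_pca(images, n_components):
    """images: (n_samples, H, W). Returns the mean vector and top components."""
    X = images.reshape(len(images), -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions (eigenvectors of the covariance),
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(images, mean, components):
    """Map images onto the retained eigenvectors to obtain feature vectors."""
    X = images.reshape(len(images), -1).astype(np.float64)
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
toy_images = rng.random((30, 25, 25))   # small stand-in for 250 x 250 images
mean, components = fit_pca(toy_images, n_components=10)
features = project(toy_images, mean, components)
```

Each image of 625 pixels is represented here by 10 numbers; the same projection would be applied to new images before classification.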

Algorithm for classification
We selected three classifier models in this study: Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest. SVM is a widely used classifier model in computer vision with excellent classification performance [10]. KNN is a relatively simple classifier whose main idea is to classify a new data point according to its nearest K data points [11]. Random Forest is based on voting among multiple decision trees [12], and can reduce the impact of noise and the possibility of over-fitting [13]. Among the three classifiers, Random Forest and KNN support multi-classification, while the standard SVM is a two-class model and thus needs to be combined with a suitable strategy for multi-classification tasks. Here we applied the one-vs-rest [14] strategy to SVM for this purpose. All three classifiers were taken from the open-source Python machine learning library sklearn [15].
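The one-vs-rest idea can be illustrated without any particular library: one binary scorer is trained per class against all other classes, and a sample is assigned to the class whose scorer responds most strongly. In the sketch below a least-squares linear scorer stands in for the binary SVM, which is a deliberate simplification; the function names and the synthetic two-dimensional clusters are assumptions for the example.

```python
import numpy as np

def fit_binary_scorer(X, y):
    """Least-squares linear scorer (stand-in for a binary SVM); y is +1 or -1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def fit_one_vs_rest(X, labels):
    """Train one 'this class vs. everything else' scorer per class."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    weights = [fit_binary_scorer(X, np.where(labels == c, 1.0, -1.0))
               for c in classes]
    return classes, np.stack(weights)

def predict_one_vs_rest(model, X):
    """Assign each sample to the class whose scorer gives the highest score."""
    classes, W = model
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return classes[np.argmax(Xb @ W.T, axis=1)]

# Synthetic, well-separated clusters standing in for PCA features (assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
names = np.array(["E. coli", "M. smegmatis", "P. aeruginosa"])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
y = np.repeat(names, 20)
model = fit_one_vs_rest(X, y)
pred = predict_one_vs_rest(model, X)
```

With k classes this trains k binary models, whereas a one-vs-one scheme would train k(k-1)/2.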

Evaluating the classification models
Accuracy and F1-Score were used to evaluate the classification performance of the classifier models. For binary classification, the Accuracy and F1-Score were calculated from Eqs. (1)-(4), where "TP", "FP", "TN", and "FN" represent True Positive, False Positive, True Negative, and False Negative, respectively:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad (1)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad (2)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad (3)$$
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \qquad (4)$$
For multi-classification, the evaluation can be derived from the binary classification. Assuming that k types of data need to be classified, and that "TP_i", "FP_i", "TN_i", and "FN_i" represent the True Positive, False Positive, True Negative, and False Negative of the ith data type, respectively, we can calculate Accuracy, Precision and Recall with Eqs. (5)-(7), and then F1-Score using Eq. (4).

Results and Discussion

Resolution estimation
We used 140 ± 5 nm "GATTA-SIM" nanorulers (Gattaquant) to characterize the performance of SIM (Fig. 1). This kind of nanoruler carries two fluorescent markers, one at each end, and is an ideal sample for quantifying the lateral resolution of our SIM system. As shown in Figs. 1(a) and 1(b), SIM can clearly resolve the fine structure of the nanorulers. In contrast, conventional fluorescence microscopy provides only blurry, indistinguishable images (Fig. 1(c)). The distance between the two fluorescent spots in Fig. 1(b) was estimated to be 138 nm (Fig. 1(f)), which is consistent with the size of the nanoruler (140 ± 5 nm). We also directly compared SIM and conventional fluorescence microscopy imaging of a bacterium, and observed a significant improvement in resolution in the SIM image (Figs. 1(d)-1(e)). SIM imaging therefore provides more morphological features, which benefits subsequent machine learning.

Standard images for machine learning
The SIM images of three types of bacteria, E. coli MG1655 (178 images), Mycobacterium smegmatis MC155 (168 images) and Pseudomonas aeruginosa PAO1 (202 images), were acquired with a Nikon N-SIM with 50-100 ms exposure and an EM gain of 100. Representative SIM images are shown in Fig. 2(a). The bacteria in the SIM images were segmented into individual positive images, each containing only one bacterium (Figs. 2(b) and 3). Negative images (Fig. 2(b)) were also generated using the procedures described in Sec. 2.3.1. Both the positive and the negative images had the same size of 250 × 250 pixels. Table 1 shows the number of raw images and positive images in this study.

Structural features for bacteria identification
In this study, we used PCA to extract the structural features of the bacteria and obtained the eigenvectors for each type of bacteria. Figure 4(a) shows four of the most important eigenvectors, those with the largest contribution to the variance in PCA. The eigenvectors for each type of bacteria are different, and thus can be used to identify the type of bacteria. Furthermore, to find the best number of eigenvectors for bacteria classification, we quantified how the classification accuracy depends on the number of eigenvectors (Fig. 4(b)). The classification accuracy improved rapidly as the number of eigenvectors increased, and then became stable once the number of eigenvectors reached 40. In this study, we set the number of eigenvectors to 100 to obtain highly stable results.

Bacteria identification strategy
For bacteria identification, it is important to determine an optimal algorithm for feature extraction and a suitable classifier. Figure 5(a) shows the strategy used to classify a bacteria image. First, the structural features of the image are extracted. Then the features are sent to a first classifier, which determines whether the image belongs to any kind of bacteria. If the answer is "Yes", the structural features are sent to a second classifier, which determines the type of bacteria. The pipeline shown in Fig. 5(b) is used for classifier training. First, SIM images of bacteria are segmented and labeled as positive samples. Then, negative samples are generated from the same raw images. Finally, the structural features of both positive and negative samples are extracted and used for classifier training.
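The two-stage decision described here amounts to a short piece of control flow. The sketch below is only an illustration: the function name `identify` and the two toy rules standing in for the trained classifiers are hypothetical, and any pair of callables with the same roles could be substituted.

```python
from typing import Callable, Optional, Sequence

Features = Sequence[float]

def identify(features: Features,
             is_bacterium: Callable[[Features], bool],
             which_species: Callable[[Features], str]) -> Optional[str]:
    """Return a species label, or None when the first classifier rejects the image."""
    if not is_bacterium(features):      # stage 1: bacterium vs. background
        return None
    return which_species(features)      # stage 2: which type of bacteria

# Toy stand-ins for the two trained classifiers (purely illustrative rules).
def toy_detector(f: Features) -> bool:
    return sum(abs(v) for v in f) > 1.0

def toy_identifier(f: Features) -> str:
    return "E. coli" if f[0] > 0 else "P. aeruginosa"

result_background = identify([0.1, 0.2], toy_detector, toy_identifier)
result_bacterium = identify([2.0, 0.5], toy_detector, toy_identifier)
```

Keeping the detector separate means the species classifier is never asked to label a background patch.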

Identification performance
Cross-validation [16] was used to test the performance of the classifiers. We found that the SVM algorithm used in this study (Classifier one) was sufficient to distinguish the positive images from the negative images. In five-fold cross-validation testing, the accuracy and F1-Score of classification were both above 99%. Classifier two was responsible for identifying the type of bacteria. We tested the identification performance of three classifiers: SVM, KNN and Random Forest. The results for multi-classification are shown in Table 2. The parameters used in the classifier models are presented in Table 3. After careful optimization of the parameters, all of the classifiers provided excellent Accuracy and F1-Score (> 95%), with SVM performing best.
The confusion matrices for a representative multi-classification test are shown in Fig. 6(a), which gives a clear view of the identification performance of the classifiers. To further understand the classifiers' ability to differentiate the bacteria types, we performed a binary-classification test. From the results in Fig. 6(b), we concluded that the classifiers showed no preference for any particular type of bacteria.
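For reference, a five-fold split can be sketched as below: the samples are shuffled once, cut into five folds, and each fold serves as the test set exactly once. The helper names and the fixed seed are assumptions for the example, and `fit` and `predict` are placeholders for any classifier.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_validate(X, y, fit, predict, k=5):
    """Mean accuracy over k folds for any fit/predict pair of callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))

splits = list(k_fold_indices(23, k=5))
```

Because the folds are disjoint, the reported score is always measured on images the model did not see during training.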
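Per-class Precision, Recall and F1-Score can be read directly off a confusion matrix, as sketched below. The counts are made up for illustration and are not the study's results. Taking the overall accuracy as the fraction of correctly classified samples is an assumption, since the multi-class formulas are not reproduced here, and classes with no predictions or no samples are not handled.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as class i but belonging elsewhere
    fn = cm.sum(axis=1) - tp      # belonging to class i but predicted elsewhere
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()   # fraction of correctly classified samples
    return accuracy, precision, recall, f1

# Made-up counts for three classes, for illustration only.
cm = [[48, 1, 1],
      [2, 47, 1],
      [0, 2, 48]]
accuracy, precision, recall, f1 = per_class_metrics(cm)
```

Rows are true classes and columns are predicted classes in this sketch; other tools may use the transposed convention.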

Time performance
Time performance is also an important factor in choosing a good classifier. Here, with the same training and test datasets, the time performance of the classification models is similar (shown in Table 4), but Random Forest seems to be less efficient than the other two classifiers.

Robustness performance
In real applications, the SIM images of the bacteria may contain different levels of noise. We therefore tested the robustness of the classifiers under three types of noise: bar mask, square mask and Gaussian noise (shown in Fig. 7(a)). We first added these noises to the original images and then applied the same bacteria identification process to the resulting noisy images. Figure 7(b) shows the testing results.
We observed that SVM was sensitive to the bar and Gaussian noises, KNN was sensitive to the Gaussian noise, and Random Forest was robust to all of the noises.
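The three perturbations can be sketched roughly as follows. The mask positions, the widths and the noise level are hypothetical choices made for this illustration, since the exact settings behind Fig. 7 are not given in the text.

```python
import numpy as np

def add_bar_mask(img, width=20, value=0.0):
    """Blank out a vertical bar through the image centre."""
    out = img.copy()
    c = img.shape[1] // 2
    out[:, c - width // 2 : c + width // 2] = value
    return out

def add_square_mask(img, side=60, value=0.0):
    """Blank out a square centred in the image."""
    out = img.copy()
    r, c = img.shape[0] // 2, img.shape[1] // 2
    out[r - side // 2 : r + side // 2, c - side // 2 : c + side // 2] = value
    return out

def add_gaussian_noise(img, sigma=0.1, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    return img + rng.normal(0.0, sigma, size=img.shape)

image = np.ones((250, 250))          # uniform stand-in for a standard image
bar = add_bar_mask(image)
square = add_square_mask(image)
noisy = add_gaussian_noise(image)
```

Each function returns a modified copy, so the same clean image can be reused to test all three perturbations.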

Comparison of bacteria identification performance
To assess the benefit of SIM's higher resolution, we also trained the classification models using images captured by conventional fluorescence microscopy. Figure 8 shows that, for the original images, SIM improved accuracy slightly. For the deficient images, however, SIM is better than conventional fluorescence microscopy in most instances. This shows that SIM images improved both the robustness and the classification accuracy of all three kinds of classification models.

Cost performance
Cost performance is an important reference factor for rapid clinical diagnosis of pathogenic bacteria. Based on the local market, we compared biochemical and genotypic analysis methods with the SIM-based method (shown in Table 5). The comparison shows that the SIM-based method is cost-effective, less time-consuming and less technically demanding.

Conclusion
In this study, we report a new method for bacterial identification. This method is based on SIM, which provides more morphological features than conventional fluorescence microscopy. After applying a machine learning strategy to the SIM images, we obtained an identification accuracy of up to 98%. This study opens a new possibility for rapid bacteria identification, especially after further training on more bacteria types and optimization of the labeling strategies and machine learning algorithms.