Exploring the Impact of Batch Size on Deep Learning Artificial Intelligence Models for Malaria Detection

Introduction
Malaria is a major public health concern, especially in developing countries. It often presents with recurrent fever, malaise, and other nonspecific symptoms that can be mistaken for influenza, and delays in diagnosis increase morbidity and mortality. Light microscopy of peripheral blood smears is considered the gold standard diagnostic test for malaria, but microscopy can be time-consuming and is limited by the availability of skilled labor, infrastructure, and interobserver variability. Artificial intelligence (AI)-based tools for diagnostic screening can automate blood smear analysis without relying on a trained technician. Convolutional neural networks (CNNs), deep learning neural networks that can identify visual patterns, are being explored for abnormality detection in medical images. One parameter that can be optimized in CNN models is the batch size, the number of images processed at once in one forward and backward pass during model training. The choice of batch size in developing CNN-based malaria screening tools can affect model accuracy, training speed, and, ultimately, clinical usability. This study explores the impact of batch size on CNN model accuracy for malaria detection from thin blood smear images.

Methods
We used the publicly available “NIH-NLM-ThinBloodSmearsPf” dataset from the United States National Library of Medicine, consisting of blood smear images for Plasmodium falciparum. The collection consists of 13,779 “parasitized” and 13,779 “uninfected” single-cell images. We created four datasets containing all images, each with a unique randomized subset of images for model testing. Using Python, four identical 10-layer CNN models were developed and trained with varying batch sizes for 10 epochs against all datasets, resulting in 16 sets of outputs. Model prediction accuracy, training time, and F1-score, an accuracy metric used to quantify model performance, were collected.
Results
All models produced F1-scores of 94%-96%, with 10 of 16 instances producing F1-scores of 95%. After averaging the outputs of all four datasets by batch size, we observed that, as batch size increased from 16 to 128, the average combined false positives plus false negatives increased by 15.4% (from 130 to 150), and the average model F1-score decreased by one percentage point (from 95.3% to 94.3%). The average training time also decreased by 28.11% (from 1,556 to 1,119 seconds).

Conclusion
In each dataset, we observed an approximately one-percentage-point decrease in F1-score as the batch size was increased. Clinically, a 1% deviation at the population level can have a relatively significant impact on outcomes. Results from this study suggest that smaller batch sizes could improve accuracy in models with similar layer complexity and datasets, potentially resulting in better clinical outcomes. The reduced memory requirement for training also means that model training can be achieved with more economical hardware. Our findings suggest that smaller batch sizes could be evaluated for improvements in accuracy to help develop an AI model that could screen thin blood smears for malaria.


Introduction
Malaria is a major public health problem and a leading cause of death in developing countries worldwide [1]. According to the WHO, there were approximately 247 million cases of malaria in 2021 across 84 countries. Mortality has steadily decreased since 2000. However, malaria deaths increased to 625,000 in 2020 and remained at 619,000 in 2021. While malaria is a global disease, many cases can be localized to specific world regions: 29 of the 84 countries account for 96% of cases globally. The WHO Africa region was estimated to account for 234 million of the 247 million cases globally in 2021 [1,2].
Unicellular protozoa of the Plasmodium genus are responsible for the pathogenesis of malaria. Some species of Plasmodium include Plasmodium falciparum, Plasmodium vivax, Plasmodium ovale, and Plasmodium knowlesi [3]. Transmission of Plasmodium in humans is mainly via a bite from a female Anopheles mosquito [4]. When an infected female Anopheles mosquito bites a human, sporozoites from its salivary glands enter the bloodstream [5]. Plasmodium parasites ultimately invade erythrocytes, where growth and replication occur [6].

Diagnostic tests for malaria
For a patient suspected of having malaria, the two diagnostic tests for confirmation are rapid diagnostic tests and light microscopy of peripheral blood smears, with the latter considered the gold standard for malaria diagnosis [7]. Two types of blood smears can be prepared: thin and thick blood smears. Both involve placing a drop of blood on a slide; for the thin blood smear, the drop is spread into a thinner layer across a larger area and fixed in methanol [8]. Thin and thick smears are subsequently stained with Giemsa stain before observation [7]. Through microscopic visualization, the species can be identified, infection staging can be assessed, and parasite density can be determined. The major disadvantages of microscopy-based diagnosis are the lack of available skilled technicians and infrastructure to support the preparation and analysis of blood smears [9]. Immunochromatographic rapid diagnostic tests (RDTs), on the other hand, test for antigens or enzymes specific to Plasmodium species. Antigenic testing for P. falciparum histidine-rich protein 2 (pfHRP2) is specific for P. falciparum, Plasmodium lactate dehydrogenase (pLDH) can be specific for P. falciparum and P. vivax, and the aldolase enzyme is a pan-malarial antigen [7].
BinaxNOW (BinaxNOW™, Abbott, Chicago, IL) is an FDA-approved RDT that detects HRP2 and aldolase to identify P. falciparum and generic Plasmodium, respectively. Data from a trial with BinaxNOW suggested an overall sensitivity of 82% for pan-Plasmodium detection and a sensitivity of 95% for P. falciparum [10]. In a different study with patients in Cameroon, SD Bioline, an HRP2- and pLDH-based RDT, had an overall diagnostic sensitivity of 95.33%, with a specificity of 94.34%. In the same study, manual light microscopy-based diagnosis was determined to have a sensitivity of 94.86%, also with a specificity of 94.34%. PCR detection of malaria was the control methodology used to assess diagnostic accuracy [11]. In developing and evaluating new screening and diagnostic methodologies, these sensitivities and specificities can therefore serve as benchmarks.
The clinical presentation of malaria typically includes recurrent fever, headache, malaise, and muscle pain, among other symptoms that can be mistaken for gastrointestinal infection or influenza [12]. Rapid diagnosis and treatment are therefore important in the management of malaria because delays result in increased patient morbidity and mortality [13]. Delayed diagnosis and treatment can lead to serious complications such as cerebral malaria, renal failure, and pulmonary edema, elevating both mortality and morbidity risks [4]. Microscopy, the gold standard diagnostic modality, can be time-consuming and, as discussed above, is limited by skilled labor and infrastructure. Additionally, inter-technician variability in the interpretation of blood smears impacts interobserver reliability across time and geography [14].

The role of artificial intelligence (AI) in diagnostic screening
AI-based tools for diagnostic screening allow for an automated and scalable approach to analyzing blood films without the need for a trained human microscopy interpreter [15]. Computer vision is a field within the broader AI umbrella that focuses on allowing computers to analyze and obtain information from images and videos [16]. In medicine, computer vision can be applied to various types of images to identify specific image features for diagnostic or screening purposes [17]. One key computer vision-based method is the convolutional neural network (CNN), a form of deep learning neural network architecture that can be used to identify visual patterns. CNNs attempt to learn features of an image, including local relationships and patterns, from its pixels and their relationships to one another. CNNs are thus being explored for their application in medical image analysis, specifically for use cases such as abnormality detection and disease classification. CNNs can be applied to different image modalities such as X-rays, MRIs, and CTs [18]. CNNs are also being applied to screen for and identify pathology on blood films, as in the study conducted by Torres et al. in Peru that assessed the potential of a CNN-based malaria detection device [14].
Among the many parameters that can be optimized in CNN model building is the batch size hyperparameter, the number of images used at once during model training in one forward and backward pass [19]. The choice of batch size has a profound impact on model accuracy, the degree of overfitting, time to convergence, and training speed [20]. Beyond the quality of the trained model, batch size directly affects the amount of memory the model requires during training because it determines the number of images fed into the model concurrently [21]. Thus, factors such as hardware constraints are considerations when selecting a batch size for a particular use case. Additionally, depending on the type of image dataset being used, there may be an optimal batch size that maximizes model accuracy. At the intersection of optimizing for accuracy and working within hardware constraints sits a potential range of ideal batch size values for a particular problem. In this study, we developed a CNN model and trained it on single-cell thin blood smear images, the same Giemsa-stained images that would be analyzed in light microscopy. We then tested the model against different blood smear image datasets while varying batch size to isolate the impact that batch size may have on outputs such as model accuracy. We thus sought to explore the impact that batch size ultimately has on CNN model training with single-cell thin blood smear images via test accuracy, and the potential clinical impact that these variations can have at scale.

Dataset description
The "NIH-NLM-ThinBloodSmearsPf" dataset, a publicly available dataset courtesy of the United States National Library of Medicine, consists of a set of blood smear images for P. falciparum. The dataset was developed by the National Library of Medicine of the National Institutes of Health. This dataset was originally used for cell detection in the publication by Yasmin et al. [22]. The dataset consists of thin blood smear microscopy images originally obtained from 193 patients (148 P. falciparum-infected and 45 uninfected control patients). Single-cell images from the microscopy images, categorized as "parasitized" and "uninfected," were available for download and published on the National Library of Medicine Malaria data website (https://lhncbc.nlm.nih.gov/LHC-research/LHC-projects/image-processing/malariadatasheet.html).
Per the dataset documentation, the microscopy images were obtained from patients at Chittagong Medical College Hospital in Bangladesh. Thin blood smears were Giemsa-stained, and images were captured using a smartphone camera. Images were then annotated and reviewed manually by an expert. The original study from which these images were captured was approved under the Institutional Review Board (IRB) at the National Library of Medicine, National Institutes of Health (IRB#12972).
The single-cell image collection consists of 13,779 single-cell images categorized as "parasitized" and 13,779 categorized as "uninfected" (Figure 1).

Analysis tools
CNN models were developed with the Python programming language (Python version 3.10.12), an open-source language often used for data analysis and computing. Python libraries used in this study include pandas (v1.5.3), numpy (v1.23.5), tensorflow (v2.13.0), and keras (v2.13.1). A complete list of Python libraries and versions is included in Appendix 1.
Google Colab was used as the development environment. Randomized data was stored on Google Drive for direct access from Google Colab. Outcome metrics were recorded in Microsoft Excel, where subsequent analysis was also performed. Additional analysis and visualization were done using the R programming language (R Foundation for Statistical Computing, Vienna, Austria), an open-source language for statistical analysis and data visualization, via RStudio (version 2023.06.2+561).

Data randomization
The single-cell images were downloaded from the National Library of Medicine's Malaria Data Website.
Using a Python script (Appendix 2), four different randomized datasets were created: Randomization_1, Randomization_2, Randomization_3, and Randomization_4. For each dataset, the script randomly selected 1,300 of the original images labeled as "parasitized" (P. falciparum-infected cells) and 1,300 of the images categorized as "uninfected" and placed them into a "test" folder, which would then be used to assess the performance of trained CNN models. The remaining images were saved into a "train" folder and used to train the models (Figure 2).
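As a hedged illustration of the randomization step described above (the actual script appears in Appendix 2 and may differ), each randomized dataset can be produced by shuffling the two labeled image folders and copying 1,300 of each into a "test" folder and the rest into a "train" folder. The function name, folder layout, and copy-based split here are assumptions.

```python
import random
import shutil
from pathlib import Path

def make_randomized_split(source_dir, dest_dir, n_test=1300, seed=None):
    """Randomly split labeled single-cell images into test/train folders."""
    rng = random.Random(seed)
    for label in ("parasitized", "uninfected"):
        images = sorted(Path(source_dir, label).glob("*.png"))
        rng.shuffle(images)  # randomize which images land in the test set
        for subset, files in (("test", images[:n_test]),
                              ("train", images[n_test:])):
            out = Path(dest_dir, subset, label)
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy(f, out / f.name)

# e.g., make_randomized_split("cell_images", "Randomization_1", seed=1)
```

Running this four times with different seeds yields the four unique randomized test subsets while keeping all images in every dataset.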

Analysis
For each randomized dataset, images were imported from the corresponding zip file, resized to 64x64 pixels, and normalized by dividing pixel values by 255 prior to use in any of the models. Four identical 10-layer CNN models were created, consisting of "Conv2D," "MaxPooling2D," "Dropout," "Flatten," and "Dense" layers. Each model was trained with a validation split of 0.2, dividing the images set aside for training into 80% actual training images and 20% validation images during model fitting. Models were fitted without callbacks and trained for 10 epochs. The differentiating feature of the four models was that they were trained with batch sizes of 16, 32, 64, and 128, respectively. These four models, parameterized with different batch sizes, were then trained and tested with the four randomized datasets prepared during data randomization, yielding a total of 16 sets of results iterated across the four models and four datasets (Figure 3). For each of the 16 sets of model training and testing outputs, the following metrics were collected as data variables for analysis:
F1-score: An accuracy metric used to quantify the performance of binary medical tests, defined as the harmonic mean of precision and recall [23]. Mathematically, it can be calculated as follows: F1-score = 2 x (Precision x Recall)/(Precision + Recall) [24].
True positive: Total number of instances in which a CNN model correctly predicted "parasitized" for a test image. This was computed and visualized using the confusion_matrix() function from the sklearn library.
True negative: Total number of instances in which a CNN model correctly predicted "uninfected" for a test image. This was computed using the confusion_matrix() function from the sklearn library.
False positive: Total number of instances in which a CNN model incorrectly predicted "parasitized" for a test image that was actually classified as "uninfected." This was computed using the confusion_matrix() function from the sklearn library.
False negative: Total number of instances in which a CNN model incorrectly predicted "uninfected" for a test image that was actually classified as "parasitized." This was computed using the confusion_matrix() function from the sklearn library.
Duration of training: The total amount of time, in seconds, taken for model fitting to complete with the training data over 10 epochs. This was recorded by timing the model fitting step.
In this study, when we refer to the "confusion matrix," we are referring to the true-positive, true-negative, false-positive, and false-negative predictions by a CNN model.
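The layer types, training configuration, and timing described above can be sketched in Keras as follows. This is a hypothetical sketch: the study specifies only the layer types, depth, validation split, epochs, and batch sizes, so the filter counts, dropout rates, and layer ordering here are assumptions.

```python
import time
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model():
    # A hypothetical 10-layer stack of the layer types named in the text
    # (Conv2D, MaxPooling2D, Dropout, Flatten, Dense); the exact filter
    # counts and ordering used in the study may differ.
    return models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # binary: parasitized vs. uninfected
    ])

model = build_model()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A forward pass on a dummy batch of 64x64 RGB images shows the model
# emits one probability per image.
probs = model(np.zeros((4, 64, 64, 3), dtype="float32"))

# Training as described in the text: validation_split=0.2, 10 epochs,
# no callbacks, with batch_size varied across 16, 32, 64, and 128;
# duration of training is recorded by timing the fit call.
# start = time.time()
# model.fit(x_train, y_train, validation_split=0.2, epochs=10, batch_size=16)
# duration_seconds = time.time() - start
```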
The impact of batch size on model performance and clinical outcomes was assessed by comparing outcome variable data when simulating against test images for each randomization and batch size combination. Comparative analysis and graphical interpretation of the data variables were conducted in Microsoft Excel and R.
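The confusion-matrix counts and F1-score defined above can be derived with scikit-learn's confusion_matrix() and f1_score(). The following minimal sketch uses made-up labels and predictions (1 = "parasitized", 0 = "uninfected") to show how the counts relate to the harmonic-mean formula.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])                   # actual test labels
y_prob = np.array([0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3])  # model outputs
y_pred = (y_prob >= 0.5).astype(int)                          # threshold at 0.5

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels 0/1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# sklearn's f1_score agrees with the manual harmonic-mean formula
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```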

Results
The four models with batch sizes of 16, 32, 64, and 128 were run against four sets of randomized data to produce 16 sets of output data. All models produced F1-scores of 94% to 96%, with 10 of the 16 instances producing F1-scores of 95% and an average F1-score across all scenarios of 94.75% (Figure 4). The sum of the false negatives and false positives across all 16 instances ranged from 110 to 163, with a mean value of 138.125 (Figure 4). After averaging the outputs of all four datasets by batch size, we observed that, from batch sizes of 16 to 128, the average combined number of false-positive plus false-negative instances increased by 15.4% (from 130 to 150), which correlated with an overall decrease in model F1-score of one percentage point (from 95.3% to 94.3%). Training time also decreased by 28.11%, from 1,556 seconds to 1,119 seconds (Table 1). However, it should be noted that average false negatives alone did not increase as steadily with batch size as the combined false negatives plus false positives (Figures 5-6). Total training time for model fitting with varying batch size, plotted for each dataset, demonstrated that training time decreased as batch size increased (Figure 7). Ultimately, with the randomized blood smear datasets included in this study, it took less time to iterate and train a model over 10 epochs as batch size increased.
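The aggregation described above can be illustrated with a short pandas sketch that averages per-dataset outputs by batch size and computes the percent change between batch sizes 16 and 128. The per-dataset numbers below are invented, chosen only so that their means match the reported averages (130 to 150 combined errors, 1,556 to 1,119 seconds); they are not the study's data.

```python
import pandas as pd

# Made-up per-dataset outputs for two batch sizes (two datasets each shown)
results = pd.DataFrame({
    "batch_size": [16, 16, 128, 128],
    "fp_plus_fn": [128, 132, 148, 152],
    "train_seconds": [1550, 1562, 1115, 1123],
})

# Average the dataset outputs within each batch size
avg = results.groupby("batch_size").mean()

# Percent change from batch size 16 to batch size 128 for each metric
pct_change = (avg.loc[128] - avg.loc[16]) / avg.loc[16] * 100
```

With these illustrative values, combined errors rise by about 15.4% while training time falls by roughly 28%, mirroring the direction of the reported trends.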

Discussion
Training a CNN model is an iterative statistical exercise, and many parameters can be adjusted in the hyperparameter optimization of a model. The goal in optimizing a model is ultimately to increase accuracy. Modifying batch size as a parameter affects not only performance but also the memory usage and training speed of a CNN. In this study, modifying batch size from 16 to 128 did not have a large mathematical impact on overall model accuracy. In all four datasets, we observed a one-percentage-point decrease in F1-score as the batch size was increased, with the Randomization_3 dataset dropping from 96% to 95% and the remaining three datasets dropping from 95% to 94%.
Clinically, however, a 1% deviation at the population level can make a relatively significant impact on outcomes. A 1% deviation across 1,000,000 patients, for example, can result in 10,000 people with incorrect screening results. A major goal of developing a malaria screening tool with a CNN is to introduce scalable screening in regions where malaria is endemic and access to physicians may be restricted. An applied solution leveraging AI for screening might include a process wherein local skilled labor obtains blood samples, prepares slides, and uploads the images to the cloud, where a trained CNN model screens in real time whether the sample indicates that the patient is infected or uninfected. Such a solution reduces the need to locally maintain hardware for the model to run, which in some parts of the world may be costly and difficult. If a screening tool is used to flag patients as at risk of being infected and to speed up the time to diagnosis of malaria, false positives can be managed by subsequently checking all positive-screening blood smears manually. False negatives, however, would fall through the gap, as the screening tool would never indicate that the patient does in fact need treatment.
While the 16 batch size vs. dataset runs demonstrated an overall increase in CNN model accuracy with decreasing batch size, and an increase in combined false positives and false negatives with increasing batch size, the trend was less visible in false-negative model predictions alone (Table 3).
Kandel et al., in their study exploring the effect of batch size on the generalizability of CNNs, recommend using small batch sizes with low learning rates [19]. Additionally, Masters et al., in their study, state that small batch sizes had the best generalization performance for a given computational cost [25].
Results from this study suggest that the improved accuracy observed with smaller batch sizes in models of similar layer complexity comes at the cost of longer training times, which can make iterative training cumbersome; in clinical solutions, this trade-off can be managed by retraining a model less frequently. Once a model is trained and implemented clinically, subsequent retraining is something that needs to be managed over time. Retraining is the idea that, as new images are collected over time and datasets get larger, models can be updated to reflect the more current or comprehensive datasets.
While training or retraining a model with smaller batch sizes might take longer, once a model is trained, the implemented solution and subsequent screening for each patient are rapid. Screening would typically involve loading just one or a few images into the trained model for assessment, so there is ultimately no increase in screening time experienced by the patient.
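The per-patient screening step described above can be sketched as follows, assuming a trained Keras-style model and mirroring the training preprocessing (resize to 64x64, divide pixel values by 255). The function name, file path, and 0.5 decision threshold are illustrative, not from the study.

```python
import numpy as np
from PIL import Image

def screen_cell(model, image_path, threshold=0.5):
    """Classify a single cell image with a trained binary CNN model."""
    # Preprocess exactly as during training: 64x64 RGB, scaled to [0, 1]
    img = Image.open(image_path).convert("RGB").resize((64, 64))
    x = np.asarray(img, dtype="float32") / 255.0
    # Add a batch dimension and take the single sigmoid output
    prob = float(model.predict(x[np.newaxis, ...], verbose=0)[0, 0])
    label = "parasitized" if prob >= threshold else "uninfected"
    return label, prob

# e.g., label, prob = screen_cell(trained_model, "cell_001.png")
```

Because only one forward pass is needed per image, inference latency is negligible compared with training time, consistent with the point above.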

Limitations of the study
Limitations of the study include the limited diversity of images: while there were many images, the data all came from one source and were already processed and cleaned, whereas clinically, single-cell data might not be as quickly and easily available. The dataset also represented a relatively small number of patients overall. Additionally, in a clinical context, the blood smear images most readily available would display many cells, as opposed to single cells, which must be localized on the image and cropped.
Images with more cells may present a different set of optimization parameters for model development and may have varying results for accuracy.
Another limitation of the study was the hardware and software constraints.We used Google Colab as our environment for coding and analysis and were thus limited in computational resources available for free to use in the study.More powerful computers with dedicated GPUs could be useful in expanding the scope of the study and the complexity of the models used.

Implications for future research
Future research on CNN models needs to incorporate prospective real-time model performance in an applied clinical setting. Much of the CNN research being conducted uses retrospective datasets with images set aside for simulated real-time testing. The future of AI research needs to consider true prospective real-time performance to provide greater feedback on clinical applications and performance in a clinical setting. Training for these models should consider varying batch sizes for model optimization.

Conclusions
The results from our study suggest that improved optimization of the proposed model can be achieved when training with a batch size of 16 and that model accuracy may increase as batch size decreases.
Our findings thus suggest that smaller batch sizes could be evaluated for improvements in accuracy to help develop an AI model that could screen thin blood smears for malaria. In practice, smaller-batch-size training may be more feasible because of reduced hardware demands. The subsequent increase in model training time makes iterative training and retraining more cumbersome; however, as training can occur independently of screening, it should not affect screening speed for patients in practice.

Appendices Appendix 1: Python libraries
A comprehensive list of Python libraries and version numbers used in this study.

FIGURE 1:
FIGURE 1: Random selection of images from the 13,779 parasitized and 13,779 uninfected images, along with corresponding labels, after resizing and importing for use in models.

FIGURE 2:
FIGURE 2: Left: number of parasitized (n = 1,300) and uninfected (n = 1,300) images set aside for model testing. Right: graphical depiction of the number of parasitized (n = 12,479) and uninfected (n = 12,479) images set aside for model training.

FIGURE 3:
FIGURE 3: Overview of study design, depicting randomization of datasets, and the four specific batch sizes utilized in CNN models for each dataset.

FIGURE 4:
FIGURE 4: Left: F1-scores across batch sizes for each randomized dataset. Right: total false-positive and false-negative CNN model predictions across batch sizes for each randomized dataset.

FIGURE 5:
FIGURE 5: Stacked bar chart of confusion matrix data for each randomization vs. batch size combination analyzed in this study.

FIGURE 6:

FIGURE 7:
FIGURE 7: Total model training time for 10 epochs with varying batch size, for each dataset.

Table 2 presents a full summary of all model prediction accuracy variables that were calculated and the training time for each randomization vs. batch size CNN model analyzed in this study. The range (maximum minus minimum, or delta) for each metric by batch size was also calculated to analyze the dispersion of outcome data across randomized datasets and can be seen in Table 3.

TABLE 3: Maximum-minimum value (delta, ∆) across the four randomized datasets was computed for each outcome metric by batch size.
The false-negative deltas, or ranges of the observed false negatives, were 32, 41, 28, and 14 incorrect predictions for analyses run with batch sizes of 16, 32, 64, and 128, respectively. Maximum values ranged 58.2%, 87.2%, 45.9%, and 23.7% above the minimum values for each respective batch size. Notably, the dispersion metrics computed in Table 3, when compared to the total 2,600 predictions on test data made by the models for each randomized dataset, varied by only small amounts. The false-negative deltas in Table 3, for example, were only 1.2%, 1.6%, 1.1%, and 0.5% of total predictions for batch sizes of 16, 32, 64, and 128, respectively. From Tables 2-3, we can see that the increase in incorrect model predictions with batch size is somewhat linear, although not specifically isolated to false-positive or false-negative predictions. It should be noted from Table 3, however, that the dispersion of model prediction accuracy across datasets decreased as batch size increased; that is, the spread of predictions, both false and true, across Randomizations 1 through 4 narrowed as batch size grew. Another way to interpret this is that CNN models produced more consistent results across the different randomized test sets as batch size got larger.