An Internal Consistency Reliability Study of the Catalyst Datafinch Applied Behavior Analysis Data Collection Application With Autistic Individuals

Introduction: Many psychometric studies have scrutinized the dependability of different instruments for evaluating and treating autism using applied behavior analysis (ABA). However, there has been no exploration into the psychometric attributes of the Catalyst Datafinch Applied Behavior Analysis Data Collection Application, namely, its internal consistency reliability.

Materials and methods: Four datasets were extracted (n=100, 98, 103, and 62) from published studies at The Oxford Center, Brighton, MI, ranging from March 19, 2023, through January 8, 2024, using Catalyst Datafinch as the data collection tool. All data were gathered by Board Certified Behavior Analysts (BCBAs) and behavioral technicians and designed to replicate how practitioners collect traditional paper-and-pencil data. SPSS Statistics (v. 29.0) computed internal consistency reliability measures, including Cronbach's alpha, inter-item, split-half, and intraclass correlation coefficients.

Results: Dataset #1: Cronbach's alpha was 0.916 with seven items, indicating excellent reliability. Cronbach's split-half reliability was 0.777 for Part 1 (good) and 0.972 for Part 2 (excellent). The Guttman split-half coefficient was 0.817 (good). Inter-item correlation coefficients ranged from 0.474 to 0.970. The average measures intraclass correlation coefficient (ICC) was 0.916 (excellent), and the single measures ICC was 0.609 (acceptable). Dataset #2: Cronbach's alpha was 0.954 with three items (excellent). Cronbach's split-half reliability was 0.912 for Part 1 and 0.975 for Part 2 (both excellent). The Guttman split-half coefficient was 0.917 (excellent). Inter-item correlation coefficients ranged from 0.827 to 0.977. The average measures ICC was 0.954 (excellent), and the single measures ICC was 0.875 (good). Dataset #3: Cronbach's alpha was 0.974 with three items (excellent). Cronbach's split-half reliability was 0.978 for Part 1 and 0.970 for Part 2 (both excellent). The Guttman split-half coefficient was 0.935 (excellent). Inter-item correlation coefficients ranged from 0.931 to 0.972. The average measures ICC was 0.974 (excellent), and the single measures ICC was 0.926 (excellent). Dataset #4: Cronbach's alpha was 0.980 with 12 items (excellent). Cronbach's split-half reliability was 0.973 for Part 1 and 0.996 for Part 2 (both excellent). The Guttman split-half coefficient was 0.838 (good). Inter-item correlation coefficients ranged from 0.692 to 0.999. The average measures ICC was 0.980 (excellent), and the single measures ICC was 0.804 (good).

Conclusions: These results suggest that Catalyst Datafinch demonstrates high internal consistency reliability when used with individuals with autism, indicating that the application is a reliable tool for collecting and analyzing behavioral data in this population. The ratings ranged from good to excellent, indicating high consistency in the measurements.


Introduction
A variety of evaluation tools exist to identify and measure the primary symptoms and behavioral outcomes in individuals with autism spectrum disorder (ASD). Several instruments are frequently used in randomized trials or ongoing registries to gauge health outcomes for individuals with ASD. While many of these tools are employed in clinical environments to characterize autism symptoms and disabilities or to determine the disorder's severity for diagnostic purposes, they have been and continue to be utilized as outcome indicators in research contexts, particularly in clinical trials [1]. In their systematic review and meta-analysis, Yu et al. [2] identified general measures of symptomatic outcomes of ASD, including outcomes for socialization, communication, expressive language, receptive language, adaptive behavior, daily living skills, and intelligence quotient.
In the 1990s, many tracking methods were created, such as manual paper-and-pencil tools completed by parents, caregivers, or behavioral therapists to monitor progress or lack thereof. With the advent and integration of electronic tools into autism treatment, however, electronic data collection instruments have become indispensable for evaluating treatment outcomes in children with autism.
Early Intensive Behavioral Intervention (EIBI), a proven and effective treatment method for children with autism, depends heavily on data for progress evaluation [15]. Electronic tools can facilitate this process by offering a simplified and efficient method of data collection and analysis, ultimately contributing to improved treatment results. It is important to remember that, while these tools can significantly assist in treatment, they are merely one component of a comprehensive approach to autism treatment and should be used alongside other strategies and interventions in a well-rounded treatment plan.

Catalyst Datafinch
Catalyst is a data collection instrument developed by DataFinch Technologies and is a market-ready tool for electronic data collection. It was designed to aid applied behavior analysts in capturing and analyzing extensive datasets in the context of autism spectrum and related developmental disorders. Users of Catalyst, such as Board Certified Behavior Analysts (BCBAs) and behavioral technicians, set up a unique profile for each autistic patient and devise programs and data collection methods for managing problem behavior and skill acquisition programs [16].
Data are collected using a real-time data-stamping method, allowing for a detailed examination of the data down to the exact second it was collected rather than just a summary metric. For problem behavior, users define the problem operationally, choose from a variety of continuous (e.g., frequency, duration) and discontinuous measurement systems, and set the interval length for discontinuous systems (10 seconds, 30 seconds, or two minutes) from a drop-down menu in the portal [16].
All problem behavior topographies that use the same discontinuous measurement system share the same interval setting, which is set per patient rather than per topography. Typically, technicians, such as registered behavior technicians, use a portable electronic device, usually an iPad, to record data during ongoing therapy sessions [16]. When a discontinuous measurement system is in use, an auditory or vibratory stimulus (a user-selected setting) indicates the end of the interval. The technician then records whether each problem behavior is happening, has occurred since the last signal, or occurred for the entire interval [16].
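The interval-based (discontinuous) recording described above can be sketched in code. The following is an illustrative sketch only, not Catalyst's actual data model or API: it scores a session using partial-interval recording, marking each fixed-length interval as positive if the target behavior occurred at any point within it (the function name and parameters are hypothetical).

```python
def partial_interval_score(event_times, session_length, interval):
    """Fraction of fixed-length intervals in which the target behavior
    occurred at least once (partial-interval recording).

    event_times    -- timestamps (in seconds) of observed behavior events
    session_length -- total session length in seconds
    interval       -- interval length in seconds (e.g., 10, 30, or 120)
    """
    n_intervals = session_length // interval
    # Map each event to the index of the interval it falls in.
    hit_intervals = {int(t // interval) for t in event_times if t < session_length}
    return len(hit_intervals) / n_intervals
```

For example, events at 5, 7, and 35 seconds in a 60-second session with 10-second intervals score two of six intervals (the first and fourth), or about 0.33.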
Scientific discourse highlights the significant need for psychometric studies that delineate the reliability and validity of instruments designed to measure behaviors and skill acquisition in individuals with autism [17]. Reliability is a crucial aspect of any psychological measurement instrument [18]; reliability estimates help ensure that a tool consistently measures what it intends to measure. In the context of autism, this is particularly important due to the wide range of behaviors and skills present in this population [17,18]. Psychometric studies can improve assessment in several ways: they can enhance the accuracy of assessments, leading to more effective intervention strategies; they can help tailor interventions to the individual needs of each autistic person; and they can provide a reliable means of tracking progress over time, which is crucial for evaluating the effectiveness of interventions. Without these studies, there is a risk that the tools used may not accurately capture the complexities of behaviors and skill acquisition in individuals with autism. This could lead to misinterpretation of data, ineffective interventions, and missed opportunities for skill development. Therefore, ongoing psychometric research is essential [17][18][19].
While numerous psychometric studies have examined the reliability of various tools used in assessing and treating autism with ABA [20][21][22], no research studies have examined the psychometric properties of the Catalyst Datafinch application. This study addresses internal consistency reliability, assessing Cronbach's alpha, inter-item, split-half, and intraclass correlation coefficients.

Materials And Methods
Four datasets were extracted (n=100, n=98, n=103, and n=62) from previously published studies at The Oxford Center, Brighton and Troy, MI [23][24][25][26], ranging from March 19, 2023, through January 8, 2024, using Catalyst Datafinch as the data collection tool. All data were gathered by Board Certified Behavior Analysts (BCBAs) and behavioral technicians and designed to replicate how practitioners collect traditional paper-and-pencil data. Internal consistency reliability measures were computed and reported, including Cronbach's alpha, inter-item, split-half, and intraclass correlation coefficients.

Dataset #1
General target mastery data were collected daily by a team of multiple (three to five) behavioral technicians per child for 100 individuals with autism via retrospective chart reviews [23]. Behavior analysts administered a mixed model of discrete trial training, mass trials, and naturalistic environment treatment for three months between March 19, 2023, and June 11, 2023. Data were obtained at two-week intervals for seven time points (baseline, two weeks, four weeks, six weeks, eight weeks, 10 weeks, and 12 weeks). General target mastery data were collected for 89 children and four adults, with seven missing values [23]. Behavior technicians assigned to specific autistic individuals used real-time data-stamping procedures to enter data the second the behavior was observed. The behavior technician created an operational definition for the problem behavior and selected continuous (frequency, duration) measurement systems using a portable electronic device (an iPad; Apple Inc., Cupertino, CA). Researchers then accessed those data online for analysis and reporting [23].

Dataset #2
Target mastery data were collected via a retrospective chart review for 98 autistic individuals, including four adults over 18 years of age, who were administered ABA treatment for one month between June 7, 2023, and July 7, 2023. Data were obtained at two-week intervals for three time points (baseline, two weeks, and four weeks) [24]. Behavior technicians assigned to specific individuals with autism utilized procedures that allowed for data entry in real time the moment a behavior was observed. These technicians established a clear, operational definition for the problematic behavior. They used continuous measurement systems (tracking frequency and duration) with the help of a portable electronic device, specifically an iPad. These data were then accessible online for researchers to analyze and report on [24].

Dataset #3
Participant cohort target mastery data were gathered using a retrospective chart review from 103 autistic individuals who received ABA treatment. A repeated measures analysis covered three time points (baseline, two weeks, and four weeks) between June 7, 2023, and August 8, 2023, measuring cumulative target behaviors [25]. Behavior technicians tasked with monitoring specific individuals with autism employed real-time data entry methods, recording observations as they occurred. They formulated a detailed, operational definition for the behavior deemed problematic. They opted for continuous measurement systems (monitoring both frequency and duration) using a portable electronic device, an iPad. The collected data were subsequently made available online, enabling researchers to conduct their analysis and compile their reports [25].

Dataset #4
Retrospective chart review data were collected from a cohort of 62 autistic individuals who were administered ABA treatment over a five-month period from August 8, 2023, to January 8, 2024, covering 12 time points (baseline, two weeks, four weeks, six weeks, eight weeks, 10 weeks, 12 weeks, 14 weeks, 16 weeks, 18 weeks, 20 weeks, and 22 weeks) measuring cumulative target behaviors [26]. Behavior technicians assigned to observe specific individuals with autism used real-time data entry techniques, documenting behaviors when they happened. They developed a comprehensive, operational definition for the behavior identified as problematic. They chose continuous measurement systems (frequency and duration tracking) with a portable electronic device, specifically an iPad. The data were then posted online, providing researchers with the necessary information for their analysis and report generation [26].

Statistical methods
Statistical Product and Service Solutions (SPSS, version 29.0; IBM Corp., Armonk, NY) was used for all descriptive and reliability statistics [27]. Demographic characteristics were summarized, including summary statistics for the categorical variables gender and race/ethnicity, the continuous variable age (mean and standard deviation, median, range), and descriptive statistics for each timepoint variable. Each valid score in the four datasets was an equally weighted composite score of the number of aggregated general target behaviors mastered, measured at either three, seven, or 12 time points, which was the average of the multiple (three to five) behavioral technician ratings [27]. Internal consistency reliability estimates are presented as Cronbach's alpha [28,29], inter-item, split-half, and intraclass correlation coefficients.
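As a concrete illustration of the reliability statistics named above, the sketch below computes Cronbach's alpha and a Guttman split-half coefficient from a subjects-by-timepoints score matrix with NumPy. This is a generic textbook implementation, assuming an odd/even item split for the halves; it is not a reproduction of the SPSS procedure used in the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects x k_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each timepoint
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the composite score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def guttman_split_half(scores):
    """Guttman split-half coefficient for an odd/even item split."""
    scores = np.asarray(scores, dtype=float)
    half_a = scores[:, 0::2].sum(axis=1)         # composite of odd-numbered items
    half_b = scores[:, 1::2].sum(axis=1)         # composite of even-numbered items
    total_var = (half_a + half_b).var(ddof=1)
    return 2.0 * (1.0 - (half_a.var(ddof=1) + half_b.var(ddof=1)) / total_var)
```

When all columns are perfectly consistent (e.g., identical ratings across timepoints), both coefficients equal 1; inconsistent items pull them down.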

Internal consistency reliability interpretations
The internal consistency reliability of a test is often measured by a correlation coefficient, denoted as α (not to be confused with the α that represents the probability of a Type I error). The value of α can range from 0 to 1 and is interpreted as follows: if α is greater than or equal to 0.90, the test is considered to have excellent reliability; a test with an α value between 0.70 and 0.90 has good reliability; if α falls within the range of 0.60-0.70, the test has acceptable reliability; the test's reliability is poor if α is between 0.50 and 0.60; and a test with an α value less than or equal to 0.50 is considered to have unacceptable reliability. These ranges serve as a guideline for researchers to evaluate the consistency of their tests.
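The cut-offs above translate directly into a lookup, sketched here for clarity (the function name is ours, not from the study):

```python
def interpret_reliability(alpha):
    """Qualitative label for a reliability coefficient, per the cut-offs above."""
    if alpha >= 0.90:
        return "excellent"
    if alpha >= 0.70:
        return "good"
    if alpha >= 0.60:
        return "acceptable"
    if alpha > 0.50:
        return "poor"
    return "unacceptable"   # alpha <= 0.50
```

Under these cut-offs, Dataset #1's Cronbach's alpha of 0.916 is excellent, its single measures ICC of 0.609 is acceptable, and Dataset #2's single measures ICC of 0.875 is good.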

IRB approval
Consent was obtained or waived by all participants in this study. This research study retrospectively used data collected from chart reviews for clinical purposes. The study was submitted to the WIRB-Copernicus Group (WCG® IRB) for review and was granted an exemption (#1-1703366-1). The authors declare that this research investigation involves minimal risk and complies with the Belmont Report Regulations 45 CFR

Inter-Item Correlation Coefficients
Inter-item correlation coefficients (two-tailed) for the seven timepoint variables ranged from 0.474 to 0.971 and are presented in Table 1. The single measures ICC is the reliability coefficient for a single, typical rater and is used when an individual rating is the level of observation in the outcome; it implies fair to good reproducibility whether the test is performed on one or several occasions. Theoretically, the single measures reliability was 0.609, indicating acceptable reliability if a random single rater were used [30,31].
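The single measures and average measures ICCs reported throughout the results are linked by the Spearman-Brown step-up formula: averaging k ratings boosts reliability above the single-rater value. A quick sketch, assuming the standard formula rather than any output from the study's SPSS runs:

```python
def step_up_icc(icc_single, k):
    """Average-measures ICC implied by a single-measures ICC with k
    raters/items (Spearman-Brown prophecy formula)."""
    return k * icc_single / (1.0 + (k - 1) * icc_single)
```

With Dataset #1's single measures ICC of 0.609 and k = 7 timepoints, this yields approximately 0.916, matching the reported average measures ICC; Datasets #3 and #4 check out the same way (0.926 with k = 3 gives about 0.974, and 0.804 with k = 12 gives about 0.980).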

Inter-Item Correlation Coefficients
Inter-item correlations (two-tailed) for the three timepoint variables are presented in Table 3. Inter-item correlation coefficients ranged from 0.827 to 0.977. The single measures ICC is the reliability coefficient for a single, typical rater and is used when an individual rating is the level of observation in the outcome; it implies fair to good reproducibility whether the test is performed on one or several occasions. Theoretically, the single measures intraclass correlation equaled 0.875, indicating good reliability [30,31].

Inter-Item Correlation Coefficients
Inter-item correlation coefficients (two-tailed) for the three timepoint variables are presented in Table 5. The single measures ICC is the reliability coefficient for a single, typical rater and is used when an individual rating is the level of observation in the outcome; it implies fair to good reproducibility whether the test is performed on one or several occasions. Theoretically, the single measures reliability was 0.926, indicating excellent reliability if a random single rater were used [30,31].

Inter-Item Correlation Coefficients
Inter-item correlation coefficients for the 12 timepoint variables are presented in Table 7. Inter-item correlation coefficients ranged from 0.694 to 0.999.

The average measures ICC is used when the average of several ratings is the level of observation in the outcome; the average measures intraclass correlation equaled 0.980, indicating excellent reliability [30,31].
The single measures ICC is the reliability coefficient for a single, typical rater and is used when an individual rating is the level of observation in the outcome; it implies fair to good reproducibility whether the test is performed on one or several occasions. Theoretically, the single measures reliability was 0.804, indicating good reliability if a random single rater were used [30,31].

Discussion
This study aimed to address a gap in the existing literature by examining the psychometric properties of the Catalyst Datafinch data collection application using four distinct research datasets [23][24][25][26]. The objective was to evaluate internal consistency reliability, as determined by Cronbach's alpha, split-half reliability (both Cronbach's and Guttman), and inter-item and intraclass correlation coefficients. The Catalyst application has been extensively utilized as a digital alternative to traditional paper-and-pencil methods for tracking skill acquisition and behaviors in individuals with ASD. The results indicated that Cronbach's alpha, Cronbach's alpha split-half, Guttman split-half, and inter-item and intraclass correlations were predominantly excellent (α ≥ 0.90), with some being good (0.70 ≤ α < 0.90). We observed evidence of alignment when comparing our findings with those of previously published psychometric studies. This consistency lends credence to the robustness of the methodologies employed in these studies, further validating the use of such tools in this area of research.
The median internal consistency reliability for all five Mullen Scales of Early Learning (MSEL) scales ranged from 0.75 to 0.83. The internal consistency for the early learning composite, namely, the four cognitive scales (visual reception, fine motor, receptive language, and expressive language), was between 0.83 and 0.95. Test-retest reliability, with a mean retest interval of 11 days, ranged from 0.82 to 0.85 for children 1-24 months of age and was less than 0.80 for children 25-56 months of age [3].
Zander et al. [32] and Janvier et al. [33] commented on the reliability properties of the Autism Diagnostic Observation Schedule (ADOS). The median interrater reliability for items across the four modules was 0.74-0.83, with single ADOS items ranging from 0.23 to 0.94. The total score interrater reliability was 0.85-0.92. Test-retest reliability for the calibrated severity scores of the ADOS was strong.
Usry et al. reported evidence of excellent inter-rater reliability (ICC=0.95, p<0.001) across the Assessment of Basic Language and Learning Skills-Revised (ABLLS-R) scores obtained from a second panel of expert raters [34].
Schmidt et al. [35] and Hobden et al. [36] reported high internal consistency on the ABC, with Cronbach's alpha ranging from 0.86 to 0.94. The original test-retest reliabilities ranged from 0.96 to 0.99. The whole scale had low interrater reliability, with a mean correlation of 0.63. Subsequent studies have shown a range from 0.50 to 0.67 (teacher form) and 0.80 to 0.95 (parent form) [37].
The ADI-R had good internal consistency [38], with test-retest reliability very high, in the 0.93-0.97 range.
Interrater reliability was as high as in the initial study, with multi-rater kappas ranging from 0.62 to 0.96 for individual items [39].
The VABS had high internal consistency reliability, with split-half reliability for the adaptive behavior composite ranging from 0.93 to 0.97 [40,41], while subdomains were within the 0.80s to 0.90s. Test-retest reliabilities ran mostly from the 0.80s to 0.90s [41]. Inter-rater reliabilities ranged from the low 0.70s to the high 0.80s, while another study found inter-rater coefficients ranging from 0.62 to 0.78 [40,41].
The CARS had good internal consistency, with a Cronbach's alpha of 0.94 [43,44]. A meta-analysis of research using the CARS between 1980 and 2021 indicated an internal consistency of 0.89 [45]. After a 12-month interval, test-retest reliability for 91 cases was 0.88 for the total score [45]. The inter-rater reliability was found to be 0.71, and the same meta-analysis of CARS research between 1980 and 2021 found an inter-rater reliability of 0.79.
This current research is not without its limitations. The use of four non-random samples limits the scope of the study and restricts the generalizability of the results; the findings may not apply to a broader population or different contexts. The assumption of construct validity in the Catalyst Datafinch application implies that all items are designed to measure the same construct. However, this assumption may not always hold and could potentially impact the accuracy of the results.
Furthermore, the variability in task stimuli, the number of trials, the type of individuals participating, the administration conditions, and the focal task variable across different studies can introduce additional complexity. These factors can influence the outcomes and make it challenging to compare results across studies.
Therefore, while this study provides valuable insights, it is crucial to interpret the findings with these limitations in mind. Future research could address these limitations using a more diverse and randomized sample, ensuring consistent administration conditions, and verifying the tools' validity. This would help enhance the robustness and generalizability of the results.

Conclusions
These findings suggest that the Catalyst Datafinch Applied Behavior Analysis Data Collection Application demonstrates high internal consistency reliability when used with individuals on the autism spectrum. This indicates that the application is a reliable tool for collecting and analyzing behavioral data in this population. The ratings ranged from good to excellent, indicating a high consistency in the measurements obtained through this application. However, it is important to note that these findings, while promising, are part of an ongoing research process. Further studies are necessary to validate these results and ensure the tool's effectiveness and reliability in diverse settings and populations. Continued research will also help refine the application's features and functionality, ensuring it remains a valuable resource for those working in the field of ABA. This ongoing commitment to research and validation is crucial in ensuring that the Catalyst Datafinch application continues to meet the needs of practitioners and individuals with autism.

Cronbach's alpha
Cronbach's alpha for Dataset #2 was 0.954 with three items, indicating excellent reliability. Cronbach's alpha split-half Part 1 = 0.912, indicating excellent reliability; Part 2 = 0.975, indicating excellent reliability; and the Guttman split-half coefficient = 0.917, indicating excellent reliability. The items are Targets Mastered Baseline, Targets Mastered 2 Weeks, and Targets Mastered 4 Weeks.

Table 2.
The average measures ICC indicates the reliability of several raters averaged together and is used when the average of several ratings is the level of observation in the outcome; it implies excellent reproducibility if the test is repeated several times to calculate the mean value. The average measures intraclass correlation equaled 0.916, indicating excellent reliability [30,31].

TABLE 2: Dataset #1 - Intraclass Correlation Coefficients
Two-way random effects model where both people effects and measures effects are random.

Table 4.
The average measures ICC indicates the reliability of several raters averaged together and is used when the average of several ratings is the level of observation in the outcome; it implies excellent reproducibility if the test is repeated several times to calculate the mean value. The average measures intraclass correlation equaled 0.954, indicating excellent reliability [30,31].

TABLE 4: Dataset #2 - Intraclass Correlation Coefficients
Two-way random effects model where both people effects and measures effects are random.

Table 6.
The average measures ICC indicates the reliability of several raters averaged together and is used when the average of several ratings is the level of observation in the outcome; it implies excellent reproducibility if the test is repeated several times to calculate the mean value. The average measures intraclass correlation equaled 0.974, indicating excellent reliability [30,31].

TABLE 6: Dataset #3 - Intraclass Correlation Coefficients
Two-way random effects model where both people effects and measures effects are random.

TABLE 7: Dataset #4 - Inter-Item Correlations
Intraclass correlation coefficients are presented in Table 8. The average measures ICC indicates the reliability of several raters averaged together and is used when the average of several ratings is the level of observation in the outcome; it implies excellent reproducibility if the test is repeated several times to calculate the mean value.