Research deployed via the internet and administered via smartphones could have access to more diverse samples than lab-based research. Diverse samples could have relatively high variation in their traits and so yield relatively reliable measurements of individual differences in these traits. Cognitive tasks have been reported to yield relatively low reliabilities (Hedge et al., 2018), which could potentially be addressed by smartphone-mediated administration in diverse samples. We formulate several criteria to determine whether a cognitive task is suitable for individual differences research on commodity smartphones: no very brief or precise stimulus timing, relative response times (RTs), a maximum of two response options, and a small number of graphical stimuli. The Flanker Task meets these criteria. In a pre-registered study, we compared the reliability of individual differences in the Flanker Effect across samples and devices. We found no evidence that a more diverse sample yields higher reliabilities. We also found no evidence that commodity smartphones yield lower reliabilities than commodity laptops. Hence, diverse samples might not improve reliability beyond that of student samples, but smartphones may well measure individual differences with cognitive tasks reliably. In exploratory analyses, we examined different reliability coefficients, split-half reliabilities, and the development of reliability estimates as a function of task length.
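To make the notion of split-half reliability as a function of task length concrete, the following is a minimal sketch on simulated data. It is not the analysis pipeline of this study: the function names (`spearman_brown`, `split_half_reliability`, `simulate`), the stratified random splitting scheme, and all simulation parameters are assumptions chosen for illustration. The Flanker Effect is taken here as the mean incongruent RT minus the mean congruent RT per participant.

```python
import numpy as np


def spearman_brown(r, factor=2.0):
    """Spearman-Brown prediction of reliability when test length is scaled by `factor`."""
    return factor * r / (1.0 + (factor - 1.0) * r)


def split_half_reliability(rts, congruent, n_splits=200, seed=0):
    """Average Spearman-Brown corrected split-half reliability of the Flanker Effect.

    rts: float array of shape (participants, trials), RTs in ms.
    congruent: boolean array of shape (trials,), True for congruent trials.
    """
    rng = np.random.default_rng(seed)
    n_trials = rts.shape[1]
    estimates = []
    for _ in range(n_splits):
        # Stratified random split: halve the trials within each condition.
        half = np.zeros(n_trials, dtype=bool)
        for cond in (True, False):
            idx = rng.permutation(np.flatnonzero(congruent == cond))
            half[idx[: len(idx) // 2]] = True
        # Flanker Effect per participant in each half.
        a = rts[:, half & ~congruent].mean(axis=1) - rts[:, half & congruent].mean(axis=1)
        b = rts[:, ~half & ~congruent].mean(axis=1) - rts[:, ~half & congruent].mean(axis=1)
        r = np.corrcoef(a, b)[0, 1]
        estimates.append(spearman_brown(r))
    return float(np.mean(estimates))


def simulate(n_participants=200, n_trials=100, seed=1):
    """Simulate RTs with a stable per-participant Flanker Effect (illustrative values)."""
    rng = np.random.default_rng(seed)
    congruent = np.arange(n_trials) % 2 == 0
    base = rng.normal(500.0, 50.0, size=n_participants)        # baseline RT, ms
    effect = rng.normal(50.0, 20.0, size=n_participants)       # true Flanker Effect, ms
    noise = rng.normal(0.0, 100.0, size=(n_participants, n_trials))
    rts = base[:, None] + noise + effect[:, None] * (~congruent)[None, :]
    return rts, congruent


if __name__ == "__main__":
    # Reliability estimates as a function of task length (number of trials).
    for n_trials in (40, 100, 200, 400):
        rts, congruent = simulate(n_trials=n_trials)
        print(n_trials, round(split_half_reliability(rts, congruent), 2))
```

Under these simulation assumptions, the estimate should rise as trials are added, because trial-level noise averages out while the between-participant variance in the effect stays fixed.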