Introduction and Epistemological Positioning
This chapter provides a systematic account of the methodological framework, data collection procedures, analytical pipeline, and associated limitations that governed the first phase of empirical work undertaken for this research project. The chapter documents, in precise and reproducible terms, how data was obtained, processed, sampled, and subjected to preliminary analysis. It is intended both as an audit trail for the researcher's own subsequent work and as a methodological statement suitable for incorporation into formal academic outputs, including peer-reviewed publications and a prospective PhD dissertation.
The project adopts an inductive, interpretivist epistemological position. Rather than proceeding from fixed hypotheses or predetermined analytical categories, the research allows themes, discourses, and patterns to emerge from sustained engagement with the data. This positions the work within critical qualitative and discourse-analytic traditions, drawing particularly on Foucauldian approaches to knowledge, identity, and the production of truth (Foucault, 1972; 1980). The methodological choices documented below — including the selection of data source, the sampling strategy, and the analytical framework — were made in deliberate alignment with this epistemological commitment.
Analytical Framework: AI-Assisted Inductive Discourse Analysis
3.2.1 Defining the Method
The analytical method employed in this phase of research is best characterised as AI-Assisted Inductive Discourse Analysis (AIDA). This is an emergent methodological approach that combines the interpretive and critical traditions of discourse analysis with the large-scale pattern recognition capabilities of Large Language Models (LLMs), specifically Claude (Sonnet 4.5, Anthropic, 2025).
Discourse analysis, in its foundational sense, is concerned not merely with what is said but with how language constructs social realities, identities, and relations of power (Fairclough, 1992; Jørgensen and Phillips, 2002; Gee, 2014). In the context of social media research, discourse analysis has been applied to examine how communities produce, circulate, and contest meaning in digitally mediated environments (KhosraviNik, 2018; Herring, 2013).
The 'inductive' designation is methodologically significant. Unlike deductive approaches that apply a predetermined coding scheme, an inductive approach allows analytical categories to emerge from the data itself — consistent with grounded theory traditions (Glaser and Strauss, 1967) and with interpretivist approaches to social media discourse (boyd, 2014). Prompts submitted to the LLM were explicitly designed to resist the imposition of predetermined categories, instructing the model to allow themes and discourses to emerge from close reading of the posts.
3.2.2 The Role of the Large Language Model
The LLM in this study served as a first-pass analytical instrument, tasked with identifying emergent themes, characterising patterns in language and vocabulary, mapping identity constructions, and flagging sites of contestation across batches of sampled posts. The model did not replace the researcher's analytical judgement; all outputs were framed explicitly as provisional, requiring critical review and validation by the researcher.
This positions the LLM as analogous to a highly efficient research assistant capable of processing large volumes of text rapidly, surfacing patterns for the researcher's interpretive engagement — rather than as an autonomous analytical authority.
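The batching and prompting step described above can be illustrated with a minimal sketch. The batch size, the field names (`title`, `selftext`), and the exact instruction wording are illustrative assumptions rather than the prompts actually used in this study. The sketch stops short of any model call and only assembles the text that would be submitted.

```python
from typing import Iterator

# Hypothetical instruction text: it paraphrases the tasks named in this
# section and is NOT the prompt actually submitted in the study.
INSTRUCTIONS = (
    "Read the posts below closely. Do not apply any predetermined coding "
    "scheme. Describe the themes that emerge, patterns in language and "
    "vocabulary, how identities are constructed, and any sites of "
    "contestation. Treat every observation as provisional."
)


def batched(posts: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive batches of at most `size` posts."""
    for start in range(0, len(posts), size):
        yield posts[start:start + size]


def build_prompt(batch: list[dict]) -> str:
    """Join the instructions with a numbered listing of the batch."""
    numbered = "\n\n".join(
        f"[Post {i}]\n{post['title']}\n{post['selftext']}"
        for i, post in enumerate(batch, start=1)
    )
    return f"{INSTRUCTIONS}\n\n{numbered}"
```

Numbering the posts inside each prompt gives the researcher a handle for tracing any model observation back to a specific post during review.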
3.2.3 Data Collection Pipeline
Limitations of AI-Assisted Discourse Analysis
The AIDA approach carries a number of significant methodological limitations that must be acknowledged explicitly and addressed in any published output drawing on this preliminary analysis.
LLM Subjectivity and Bias
LLMs reflect biases embedded in training data. Analytical outputs may reproduce existing biases in how ADHD is framed, particularly around neurotypical assumptions, clinical framings, or Western mental health discourses.
Interpretive Opacity
LLM reasoning processes are not fully transparent. The model cannot provide the kind of traceable interpretive audit trail that conventional qualitative research demands.
Hallucination Risk
LLMs are susceptible to generating plausible-sounding but inaccurate characterisations. All AI-generated analytical claims require cross-checking against source data before any academic use.
Sampling Constraints
The 200-post sample, though stratified, represents less than 0.05% of the full 426,224-post dataset. Thematic saturation cannot be assumed; findings should be treated as orienting rather than definitive.
Platform-Specificity
Analysis is limited to Reddit's text-based format. This captures only one mode of ADHD discourse; visual, audio-visual, and ephemeral content on other platforms is outside scope.
Researcher Positionality
The researcher's own ADHD diagnosis and lived experience constitute both a methodological asset and a potential source of bias requiring explicit reflexive acknowledgement in formal outputs.
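One part of the cross-checking required under the hallucination-risk limitation can be partly automated: confirming that any excerpt the model attributes to a post actually appears in that post. The sketch below assumes, hypothetically, that model outputs have been recorded as dictionaries with `post_id` and `quote` keys and that source posts are held in a mapping from post ID to body text. Neither structure is specified in this chapter.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(claims: list[dict], posts_by_id: dict[str, str]) -> list[dict]:
    """Return the claims whose quoted excerpt is not found in the cited post.

    `claims` items are assumed to look like {"post_id": ..., "quote": ...};
    this record layout is an illustrative assumption.
    """
    missing = []
    for claim in claims:
        source = posts_by_id.get(claim["post_id"], "")
        if normalise(claim["quote"]) not in normalise(source):
            missing.append(claim)
    return missing
```

A check of this kind catches only fabricated or misattributed quotations. It does not validate the model's interpretive characterisations, which still require the researcher's own reading of the source posts.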
Data Collection, Processing & Sampling
Data Source: Arctic Shift Archive
Data was collected from r/ADHD via Arctic Shift (Heitmann, 2023), an independent archival resource providing structured access to historical Reddit data. Arctic Shift supplies post-level data in machine-readable CSV format, including post title, body text, score, comment count, flair, and timestamp, for subreddits not subject to API access restrictions. This is a methodological advantage over direct Reddit API access, which has been heavily restricted since the API policy changes of 2023.
The full dataset covers the period January 2023 to February 2026 and comprises 426,224 posts. Initial data cleaning using Python's pandas library removed deleted posts, posts with a null body, and posts with scores below a minimum threshold, producing a cleaned working dataset suitable for sampling.
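The cleaning step can be sketched in pandas as follows. The column names (`selftext`, `score`) and the placeholder strings used to detect deleted posts are assumptions about the export format, and `min_score` is a hypothetical parameter because the threshold value is not specified in this chapter.

```python
import pandas as pd

# Placeholder bodies assumed to mark deleted or removed posts in the export.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def clean_posts(df: pd.DataFrame, min_score: int) -> pd.DataFrame:
    """Drop null-body, deleted, and below-threshold posts.

    `min_score` is a hypothetical parameter; the study's actual
    threshold is not stated.
    """
    body = df["selftext"].fillna("").str.strip()
    has_body = body != ""                      # null or whitespace-only bodies
    not_deleted = ~body.isin(DELETED_MARKERS)  # deleted / removed placeholders
    meets_score = df["score"] >= min_score     # minimum score filter
    return df[has_body & not_deleted & meets_score].reset_index(drop=True)
```

Recording the row count before and after each filter would make the size of the cleaned dataset, and therefore the true sampling frame, reportable alongside the 426,224-post total.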
Sampling Strategy
A stratified random sample of 200 posts was drawn from the cleaned dataset. Stratification was by time period — proportional monthly sampling across the 38-month range — to ensure temporal representativeness and enable potential longitudinal analysis in subsequent phases. Within each temporal stratum, posts were selected using Python's random module, with no additional filtering by post score, flair, or engagement level, to avoid systematic bias toward high-engagement or algorithmically amplified content.
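A minimal sketch of proportional monthly sampling with Python's random module is given below. Two details are assumptions, since the chapter does not state them: fractional monthly quotas are resolved with the largest-remainder method so that the total is exactly the requested sample size, and a fixed `seed` is used for reproducibility. The `created_utc` field name is likewise assumed.

```python
import random
from collections import defaultdict
from datetime import datetime, timezone


def month_key(created_utc: float) -> str:
    """Map a Unix timestamp to a 'YYYY-MM' stratum label (UTC)."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")


def proportional_allocation(stratum_sizes: dict[str, int], total: int) -> dict[str, int]:
    """Allocate `total` draws across strata in proportion to their size.

    Uses the largest-remainder method (an assumption, not the documented
    procedure) so that the allocations sum exactly to `total`.
    """
    population = sum(stratum_sizes.values())
    if total > population:
        raise ValueError("sample size exceeds population")
    alloc, remainders = {}, {}
    for key, n in stratum_sizes.items():
        alloc[key], remainders[key] = divmod(total * n, population)
    shortfall = total - sum(alloc.values())
    for key in sorted(remainders, key=remainders.get, reverse=True)[:shortfall]:
        alloc[key] += 1
    return alloc


def stratified_sample(posts: list[dict], total: int, seed: int) -> list[dict]:
    """Draw a proportional random sample stratified by calendar month."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for post in posts:
        strata[month_key(post["created_utc"])].append(post)
    alloc = proportional_allocation({k: len(v) for k, v in strata.items()}, total)
    sample = []
    for month in sorted(strata):
        sample.extend(rng.sample(strata[month], alloc[month]))
    return sample
```

With 200 posts spread across 38 months, proportional allocation averages roughly five posts per month, so the rounding rule has a visible effect on which months receive an extra post and is worth reporting explicitly.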
Future Analytical Directions
Several analytical extensions are planned for subsequent phases of the project. These include close reading of the full 200-post sample by the researcher using conventional discourse analytic methods; corpus linguistic analysis using keyword-in-context (KWIC) tools; longitudinal trend analysis using the full 38-month dataset; and cross-platform comparison extending data collection to YouTube, TikTok, and Instagram in later phases.
The platform comparison work will enable analysis of how platform affordances — Reddit's threaded text format, TikTok's short video format, YouTube's long-form video — shape the discourses and identities that emerge, directly addressing the platform dynamics research problematic outlined in the project brief.
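For the planned keyword-in-context analysis, a concordance can be produced with a few lines of standard-library Python. This is an illustrative sketch only: the tokenisation rule and the default window size are assumptions, and a dedicated corpus tool may be used instead in the actual analysis.

```python
import re


def kwic(texts: list[str], keyword: str, window: int = 5) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each occurrence.

    Matching is case-insensitive on whole tokens; `window` is the number
    of tokens kept on each side.
    """
    target = keyword.lower()
    lines = []
    for text in texts:
        tokens = re.findall(r"[\w']+", text)
        for i, token in enumerate(tokens):
            if token.lower() == target:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append((left, token, right))
    return lines
```

Calling `kwic(posts, "diagnosis", window=6)` on a list of post bodies would return one tuple per occurrence, ready to be aligned in columns for reading.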
Ethical Considerations
This research works with publicly available social media data. All posts collected from r/ADHD were publicly visible at the time of data collection and were obtained through a legitimate archival resource. No private or restricted data was accessed. Nonetheless, the ethical complexities of social media research require explicit acknowledgement.
Key Ethical Commitments
- Informed consent: Users who posted to r/ADHD did not consent to their posts being used in academic research. In line with best practice, no individual will be named or identified in any published output; usernames will be removed and verbatim quotations paraphrased or modified to prevent identification.
- Researcher positionality: The researcher is a member of the ADHD community with lived experience of ADHD — a methodological asset requiring explicit reflexivity about the potential for the researcher's subject position to influence analytical choices.
- Data security: The raw dataset and any derivative files are stored on password-protected personal devices and in password-protected cloud storage. The dataset will not be shared with third parties.
- Platform terms of service: Data was obtained via Arctic Shift. The legal and terms-of-service status of archival Reddit data is a contested area; the researcher has proceeded on the basis that the data is publicly accessible and its use for non-commercial academic research is consistent with emerging norms in the field.
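The commitment to remove usernames can be supported by a de-identification pass applied before any file is used for analysis or quotation. The sketch below is a minimal illustration under stated assumptions: the `author`, `title`, and `selftext` field names are assumed, and the pattern only targets Reddit-style `u/name` mentions. It does not paraphrase quotations, which remains a manual step.

```python
import re

# Fields assumed to carry the poster's identity in the export.
IDENTITY_FIELDS = {"author"}

# Reddit-style user mentions such as "u/name" or "/u/name".
USER_MENTION = re.compile(r"(?<!\w)/?u/[A-Za-z0-9_-]+")


def strip_identifiers(post: dict) -> dict:
    """Return a copy of `post` without identity fields or user mentions."""
    cleaned = {k: v for k, v in post.items() if k not in IDENTITY_FIELDS}
    for field in ("title", "selftext"):
        if isinstance(cleaned.get(field), str):
            cleaned[field] = USER_MENTION.sub("[user]", cleaned[field])
    return cleaned
```

Because the function returns a copy, the raw record is left untouched and the de-identified version can be written to a separate derivative file.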
Summary
This chapter has documented the methodological framework, data collection procedure, processing pipeline, sampling strategy, analytical method, and associated limitations for the first phase of empirical work on ADHD discourse in the r/ADHD subreddit. The approach, AI-Assisted Inductive Discourse Analysis applied to a stratified random sample of 200 posts drawn from a large archival dataset, is offered as a defensible first-pass analytical strategy whose findings are provisional.
The outputs of this phase will orient the researcher's own systematic close reading in subsequent phases, which will apply conventional discourse analytic methods to a carefully selected corpus, with LLM-generated findings serving as an orienting framework rather than a definitive analytical account.
References
- Baumgartner, J. et al. (2020). The Pushshift Reddit Dataset. Proceedings of ICWSM. arXiv:2001.08435.
- Bouvier, G. and Machin, D. (2018). Critical Discourse Analysis and the Challenges and Opportunities of Social Media. Review of Communication, 18(3), pp. 178–192.
- boyd, d. (2014). It's Complicated: The Social Lives of Networked Teens. New Haven: Yale University Press.
- Brown, T. et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33.
- Fairclough, N. (1992). Discourse and Social Change. Cambridge: Polity Press.
- Foucault, M. (1972). The Archaeology of Knowledge. London: Tavistock.
- Gee, J.P. (2014). An Introduction to Discourse Analysis (4th ed.). Abingdon: Routledge.
- Gerber, M.S. et al. (2025). Digital Debating Cultures: Communicative Practices on Reddit. Digital Scholarship in the Humanities, 40(1), pp. 227–250.
- Glaser, B.G. and Strauss, A.L. (1967). The Discovery of Grounded Theory. Chicago: Aldine.
- Heitmann, A. (2023). Arctic Shift: Making Reddit Data Accessible to Researchers. Available at: https://arctic-shift.photon-reddit.com
- Herring, S.C. (2013). Relevance in Computer-Mediated Conversation. In: Pragmatics of Computer-Mediated Communication. Berlin: De Gruyter.
- Jørgensen, M. and Phillips, L. (2002). Discourse Analysis as Theory and Method. London: Sage.
- Kampen, J.K. (2024). A Textmining Approach to Discourse Analysis of Reddit Comments Aided by a LLM. Journal of Advanced Research in Natural and Applied Sciences.
- KhosraviNik, M. (2018). Social Media Critical Discourse Studies (SM-CDS). In: The Routledge Handbook of Critical Discourse Studies. London: Routledge.
- Törnberg, A. and Törnberg, P. (2016). Muslims in Social Media Discourse. Discourse, Context and Media, 13, pp. 132–142.