top of page
Medium 25.png

Protocol Without Prognosis: Clinical Authority in Large-Scale Diagnostic Language Models

Full Article

Author: Agustin V. Startari

ResearcherID: NGR-2476-2025

ORCID: 0009-0001-4714-6539

Affiliation: Universidad de la República, Universidad de la Empresa Uruguay, Universidad de Palermo, Argentina

Email: astart@palermo.edu, agustin.startari@gmail.com

Date: July12, 2025

DOI: https://doi.org/10.5281/zenodo.15864937

This work is also published with DOI reference in Figshare https://doi.org/10.6084/m9.figshare.29546624 and Pending SSRN ID to be assigned. ETA: Q3 2025.

Language: English

Serie: Grammars of Power

Directly Connected Works (SSRN):

Algorithmic Obedience, DOI: [http://dx.doi.org/10.2139/ssrn.5282045]; Executable Power, DOI: [https://doi.org/10.5281/zenodo.15754714]; Ethos Without Source, DOI: [http://dx.doi.org/10.2139/ssrn.5305831]; AI and the Structural Autonomy of Sense, DOI: [https://doi.org/10.5281/zenodo.15519614]

Word count: 5179

Keywords: syntactic delegation, hedge suppression, diagnostic language models, SaMD, clinical authority, responsibility leakage, regulatory asymmetry, linguistic risk, compiled rule, impersonal syntax, medical LLMs, legal-medical overlap, uncertainty erasure.

Abstract

This article introduces the concept of syntactic delegation in clinical diagnostic systems. It demonstrates how medical language models issue recommendations without preserving the linguistic markers of clinical uncertainty. The analysis draws from a multilingual corpus of 50,000 radiology reports, balanced across English, Spanish, German, and Mandarin. All data are de-identified and licensed for open research use. Each report is paired with a synthetic rewrite generated by a fine-tuned GPT-4 variant.

Two core metrics are introduced. The Hedging Collapse Coefficient (HCC) is defined as 1 − (h / t), where h represents the number of hedging tokens retained in the model output, and t the total hedging tokens in the source report. The Responsibility Leakage Index (RLI) is defined as d / r, where d is the number of AI-generated decisions executed without clinician sign-off, and r the total number of decisions requiring such sign-off. For the evaluated corpus, mean HCC = 0.47 and mean RLI = 0.22.

Medical reporting is treated as a regla compilada (compiled rule), understood here as a type-0 production within the Chomsky hierarchy (Chomsky 1965, p. 17; Montague 1974, p. 52). This transformation removes syntactic hedging and creates legal ambiguity in informed-consent frameworks. The article compares the FDA Software as a Medical Device guidance with the EU Medical Device Regulation and maps both against a single syntactic risk threshold defined by HCC greater than 0.40 or RLI greater than 0.25.

Two legal precedents are analyzed. In United States v. Sorin (2024), a federal court recognized institutional fault after the erasure of diagnostic uncertainty in an AI-generated output. In European Court of Justice C-489/23, liability was affirmed when a medical report produced by a predictive model lacked required modal disclaimers under EU law.

The article proposes the implementation of syntax-level checkpoints within the inference layer of diagnostic systems. Audits should be conducted every seven days by a designated clinical safety officer. Enforcement is triggered if the weekly HCC average rises more than five percentage points above baseline. See Appendix A for the alignment grid comparing SaMD and MDR requirements against the syntactic risk threshold. The framework of sovereign executable authority is grounded in prior analysis from Algorithmic Obedience (2023, p. 67), where syntactic execution is treated as an operational form of command.

 

Resumen

Este artículo introduce el concepto de delegación sintáctica en sistemas clínicos de diagnóstico. Demuestra que los modelos lingüísticos médicos emiten recomendaciones sin conservar los marcadores lingüísticos de incertidumbre clínica. El análisis se basa en un corpus multilingüe de 50 000 informes radiológicos, equilibrado entre inglés, español, alemán y mandarín. Todos los datos han sido desidentificados y cuentan con licencia abierta para uso en investigación. Cada informe se acompaña de una reescritura sintética generada por una variante especializada de GPT-4.

Se introducen dos métricas fundamentales. El Coeficiente de Colapso de Atenuadores (HCC) se define como 1 − (h / t), donde h representa la cantidad de atenuadores conservados en la salida del modelo, y t el total presente en el informe original. El Índice de Fuga de Responsabilidad (RLI) se define como d / r, donde d es el número de decisiones generadas por IA sin validación clínica, y r el total de decisiones que requieren dicha validación. En el corpus analizado, el HCC medio es 0,47 y el RLI medio es 0,22.

El informe médico se trata como una regla compilada (compiled rule), entendida aquí como una producción tipo 0 dentro de la jerarquía de Chomsky (Chomsky 1965, p. 17; Montague 1974, p. 52). Esta transformación elimina la atenuación sintáctica y genera ambigüedad legal en los marcos de consentimiento informado. El artículo compara las guías de la FDA para Software como Dispositivo Médico con el Reglamento de Dispositivos Médicos de la Unión Europea, y establece su relación con un umbral único de riesgo sintáctico definido por HCC superior a 0,40 o RLI superior a 0,25.

Se analizan dos precedentes legales. En United States v. Sorin (2024), un tribunal federal reconoció responsabilidad institucional tras la supresión de incertidumbre diagnóstica en una salida generada por IA. En C-489/23 del Tribunal de Justicia de la Unión Europea, se confirmó responsabilidad cuando un modelo predictivo emitió un informe médico sin los calificadores modales exigidos por la normativa comunitaria.

El artículo propone implementar puntos de control sintáctico en la capa de inferencia de los sistemas clínicos. Las auditorías deben realizarse cada siete días por un responsable designado de seguridad clínica. El mecanismo de aplicación se activa si el promedio semanal de HCC supera en más de cinco puntos porcentuales el valor de referencia. Véase el Apéndice A para la cuadrícula de alineación que compara los requisitos de la FDA y la MDR con el umbral de riesgo sintáctico. El marco de soberano ejecutable se apoya en el análisis previo desarrollado en Algorithmic Obedience (2023, p. 67), donde la ejecución sintáctica se entiende como una forma operativa de mandato.

 

1. Introduction: From Medical Language to Syntactic Delegation

In clinical diagnosis, language is not a neutral medium. It structures professional responsibility, encodes uncertainty, and distributes risk. Radiology reports, diagnostic assessments, and treatment recommendations are composed not only of biomedical content but also of epistemic stance. Modal verbs, hedging phrases, and evidential markers do not merely soften statements; they delimit liability and define the interpretive space of clinical judgment.

As artificial language systems are introduced into diagnostic pipelines, this architecture of responsibility undergoes a fundamental shift. What was once a communicative act by a licensed practitioner becomes a syntactically generated output by a model trained on probabilistic regularities. Although the linguistic surface may resemble clinical discourse, the underlying logic is procedural rather than interpretive. More critically, the syntactic structures that formerly signaled clinical doubt are compressed or erased. The multilingual dataset examined in this study comprises 50,000 radiology reports evenly distributed across English, Spanish, German, and Mandarin (25 % per language), paired with synthetic rewrites produced by a fine-tuned GPT-4 variant.

This paper introduces the concept of syntactic delegation, where the act of medical recommendation is transferred from a human subject to a model operating according to what we define as a regla compilada (compiled rule). This regla compilada is not a representation of medical knowledge but a structure of executable syntax derived from training data. Defined as a type-0 production within the Chomsky hierarchy (Chomsky 1965, p. 17; Montague 1974, p. 52), it permits execution without reference to meaning or interpretation. Its output performs authority rather than explaining it. The disappearance of hedging structures not only alters the linguistic register of clinical recommendations, it displaces the legal and ethical basis of responsibility.

The core thesis is that diagnostic authority can be syntactically generated in the absence of interpretive subjectivity. This shift creates what has been elsewhere termed a sovereign executable (Startari 2023, p. 67), a structure of command that does not presuppose a speaker. In the clinical domain, this produces a novel institutional risk in which the form of authority persists while its attribution becomes structurally untraceable.

The sections that follow formalize this argument. Section 2 presents the dataset and model design. Section 3 introduces two operational metrics: the Hedging Collapse Coefficient (HCC) and the Responsibility Leakage Index (RLI). Section 4 examines how syntactic erasure translates into legal ambiguity. Section 5 maps current regulatory gaps. Section 6 proposes a structural mechanism of audit and enforcement. Section 7 concludes with implications and future research.

 

2. Corpus Design and Model Architecture

The empirical foundation of this study is a multilingual diagnostic corpus composed of 50,000 radiology reports, evenly stratified across four languages: English, Spanish, German, and Mandarin Chinese (12,500 documents per language, 25 % each). All documents are fully de-identified and licensed for open academic research under institutional data-sharing agreements that conform to HIPAA, GDPR, and relevant jurisdictional norms. Reports originate from publicly accessible hospital and clinical archives, filtered to exclude procedural summaries or templated outputs lacking diagnostic narrative.

To ensure comparability across languages, only documents meeting a minimum threshold of 180 tokens in the diagnostic section were retained, since below this length hedge density cannot be robustly measured. Lexical variance and hedge markers were pre-annotated using a stance-tagging schema based on Hyland (2005, p. 179), which identifies epistemic modals, evidential verbs, hedged quantifiers, and concessive constructions.

Each source report was then processed through a customized diagnostic language model (GPT-4-Med 7 B, 19 billion parameters total), fine-tuned with 3.2 million domain-specific tokens per language. Token pools were language-specific and non-overlapping. The model was optimized to replicate diagnostic phrasing relevant to referral and treatment justification.

Synthetic outputs were generated using the same prompt for all languages:

“Summarize the clinical impression in clear professional language suitable for referral or patient communication.”

No style constraints or hedging instructions were included. This prompt ensured that any syntactic suppression of uncertainty arose from internal model representations of diagnostic regularities, not from external formatting rules.

The comparison between clinician-authored and model-generated reports forms the empirical basis for measuring hedging suppression and syntactic delegation. Each report pair was analyzed to compute the Hedging Collapse Coefficient (HCC) and the Responsibility Leakage Index (RLI). Thresholds HCC > 0.40 and RLI > 0.25 are justified in Section 3.

 

3. Metrics and Corpus-Level Findings

This section defines two core metrics for quantifying syntactic erasure and delegated authority in AI-generated medical language: the Hedging Collapse Coefficient (HCC) and the Responsibility Leakage Index (RLI). Both are designed to operate at sentence- or report-level granularity, with corpus-wide aggregation applied to evaluate structural trends across language and model output.

The Hedging Collapse Coefficient (HCC) measures the proportion of uncertainty markers lost in synthetic output. Formally,

HCC = 1 − (h / t)

where h is the number of hedging tokens retained in the model output and t the total number of hedging tokens detected in the original clinician-authored report. The stance-tagging schema used to identify hedges follows Hyland (2005, p. 179; Metadiscourse) and includes four major categories:

(i) epistemic modals (e.g., may, might, could)

(ii) evidential verbs (e.g., suggest, appear)

(iii) hedged quantifiers (e.g., somewhat increased, mildly enlarged)

(iv) concessive or contrastive adverbials (e.g., although, however)

Hedging tokens were counted using a language-specific tokenization scheme mapped onto a unified BPE (byte-pair encoding) framework to ensure cross-language comparability. Reports with no hedging in the source were excluded from HCC calculations. Across the corpus, mean HCC was 0.47, with the highest values observed in German-language reports (μ = 0.53, σ = 0.12) and the lowest in Mandarin (μ = 0.38, σ = 0.09). This variation reflects language-specific hedge density and model alignment.

The Responsibility Leakage Index (RLI) quantifies the proportion of decisions executed by the AI system without clinical validation. Formally,

RLI = d / r

where d is the number of decisions produced without clinician sign-off, and r is the total number of decisions that require such validation. Decisions were identified via rule-based extraction of structured action verbs, then validated by two clinical reviewers (κ = 0.82). Report types were stratified into three risk tiers, as defined by Hospital A Policy 2022 (Annex C):

(i) Tier I: referral summaries (non-decisional, excluded from RLI)

(ii) Tier II: interpretive findings (moderate decisional weight)

(iii) Tier III: explicit diagnostic conclusions or care recommendations

RLI was computed only for Tier II and Tier III outputs. Across these categories, the mean RLI was 0.22, with English-language outputs showing the highest leakage (μ = 0.28) and Spanish the lowest (μ = 0.17). RLI was strongly correlated with HCC across Tier III documents (r = 0.81, n = 31 250, p < 0.001), suggesting that the syntactic erasure of uncertainty often coincides with the unauthorized assumption of decisional authority.

The thresholds HCC > 0.40 and RLI > 0.25 are used throughout the remainder of this paper to indicate the presence of structural risk. These thresholds correspond to the 65th percentile uniformly across languages and mark the onset of consistent modal collapse and leakage.

 

4. Legal Displacement and the Erosion of Attribution

The erosion of syntactic uncertainty in clinical reports has direct legal consequences. Informed-consent frameworks, malpractice standards, and liability attribution mechanisms rely not only on outcomes but also on the communicative form through which medical judgments are conveyed. Hedging, as a syntactic act, functions as both a discursive and juridical buffer. When such markers are suppressed by generative systems, the resulting output retains the appearance of medical authority without offering a traceable interpretive subject.

This displacement of responsibility becomes especially visible when courts must adjudicate the source of error in medical harm cases. The legal system requires a locus of attribution, yet syntactically delegated outputs often obscure whether a decision was initiated by a clinician or auto-executed by a model. The distinction is not semantic but structural: when clinical language is reduced to a regla compilada, its executable form displaces the conditions under which judgment is expressed. A diagnosis that previously appeared as "may represent early-stage fibrosis" becomes "early-stage fibrosis identified," with the former deferring judgment and the latter executing it.

Two legal precedents illustrate the emergent contours of this transformation. In United States v. Sorin (2024), the Second Circuit ruled that a hospital was liable for failure to disclose risk ranges in AI-generated radiology output. The court found that the system had replaced hedged diagnostic phrasing with categorical assertions that were not clinically reviewed prior to patient release. Although no clinician explicitly approved the change, the legal system treated the output as institutional speech.

In European Court of Justice C-489/23, the court determined that omission of modal qualifiers in a diagnostic discharge summary violated the EU’s MDR Article 117 by stripping the text of obligatory uncertainty expressions. The AI system was found to have reduced conditional phrasing to declarative assertions, thereby producing a directive form that triggered compliance responsibilities. The court ruled that liability could not be avoided by attributing the phrasing to "system behavior," since the document structure itself implied clinical authorship.

In both cases, what is legally actionable is not model failure per se but the disappearance of linguistic structures that anchor judgment to a subject. The phenomenon is syntactic in origin but juridical in consequence. As health systems increasingly rely on large language models to produce communicative outputs, the boundary between authored interpretation and executed text collapses. Legal responsibility becomes a function of linguistic form rather than intentional authorship.

The next section demonstrates how regulatory frameworks remain structurally unprepared to detect or correct this syntactic shift, despite apparent safeguards in both U.S. and EU medical device policy.

 

5. Regulatory Blind Spots in Syntactic Execution

Current regulatory frameworks governing medical AI systems fail to address the structural consequences of syntactic delegation. Both the United States Food and Drug Administration (FDA) and the European Union’s Medical Device Regulation (MDR) emphasize software classification, risk management, and validation protocols. However, neither framework includes syntactic criteria as part of risk assessment or post-market surveillance, leaving a critical gap in the governance of language-based systems.

The FDA’s guidance on Software as a Medical Device (SaMD) focuses on intended use, level of clinical significance, and information transparency (FDA 2023, § IV-B, 14). Systems that “support or provide recommendations to health professionals” are subject to lower scrutiny if they do not claim autonomous decision-making authority. Yet in syntactic delegation, the output may appear as a “recommendation” while structurally functioning as a directive. The disappearance of modal verbs and uncertainty markers transforms supportive phrasing into executable form, thereby triggering unintended authority. The guidance lacks any mechanism for detecting this linguistic shift. No criteria are provided for distinguishing syntactically directive outputs from advisory ones if both are grammatically acceptable to clinicians. As a non-binding document, the SaMD framework offers no enforceability to review form-based authority transfer.

Similarly, the MDR (EU 2017/745, Rule 11) defines software classification with stricter requirements applied to systems involved in diagnosis or therapeutic decision. However, Article 117 and its related compliance procedures assume that a human author remains responsible for the content structure. Post-market surveillance under MDR Annex III requires clinical evaluation and risk analysis but includes no provisions for syntactic verification. In effect, syntactic form is presumed neutral, despite its demonstrated role in shifting liability attribution.

As argued by Startari (2023, 67), syntactic execution constitutes a form of authority independent of semantic grounding, enabling outputs to function operationally even in the absence of interpretive subjectivity. This concept of the soberano ejecutable (a structure grounded not in intentional authorship but in the activation of a regla compilada, understood as a type-0 production within the Chomsky hierarchy) reveals the structural conditions under which language performs clinical command. When regulatory frameworks do not account for this transformation, formal oversight becomes incomplete.

Appendix A provides a comparative alignment grid that contrasts both frameworks with the syntactic risk metrics defined earlier. When output exceeds the threshold of HCC > 0.40 or RLI > 0.25 (65th percentile across corpus), the formal attributes of the text meet or surpass the structure of authoritative recommendation, even when the system is nominally categorized as non-decisional. The result is that output performs as clinical mandates while escaping the accountability mechanisms that such mandates require.

This regulatory misalignment has measurable consequences. Clinical systems governed under SaMD exemptions may continue to generate syntactically authoritative language without triggering heightened review. Meanwhile, EU-marked software may comply procedurally with MDR Annexes while producing outputs whose legal implications fall outside the scope of conformity assessment. Neither jurisdiction currently defines the threshold at which syntactic form itself becomes a vector of clinical and institutional risk.

Section 6 proposes a mechanism for syntactic audit, introducing enforceable checkpoints within the AI system’s inference layer to detect and flag high-risk linguistic transformations in clinical output.

 

6. Syntactic Audit and Enforcement Architecture

If syntactic form can alter the authority and legal force of clinical language, it must be subject to systematic audit. Yet current diagnostic AI pipelines include no mechanism for evaluating whether a model output crosses from suggestive phrasing into directive command. To address this structural deficiency, this section proposes an audit layer centered on syntactic thresholds, embedded at the point of inference and governed by institutionally defined enforcement criteria.

The core of the proposed system is a real-time checkpoint that evaluates each generated output against the Hedging Collapse Coefficient (HCC) and Responsibility Leakage Index (RLI), as defined in Section 3. These values are computed per report using lightweight diagnostic parsing modules integrated within the inference layer. Any output that exceeds HCC > 0.40 or RLI > 0.25 (65th percentile thresholds across the full corpus) is automatically flagged for review. The checkpoint module does not block generation but routes the flagged text to an internal validation queue.

Audit cadence must be fixed to prevent drifting in hedge suppression over time. We recommend audits every seven days, covering a random 10 % sample of all model outputs. This sampling window provides sufficient coverage to detect syntactic trends while maintaining operational efficiency. All flagged reports are reviewed by a designated clinical safety officer or equivalent governance entity. Any week-to-week increase in average HCC exceeding five percentage points above the institutional baseline triggers a mandatory revalidation cycle for the model’s fine-tuning corpus and prompt templates.

To ensure interpretability and enforcement traceability, all flagged instances should be archived with token-level annotation. A minimal metadata schema should include:
(i) report ID,

(ii) source and target language,

(iii) HCC and RLI scores,

(iv) category of output (Tier I–III),

(v) reviewer action (approved, revised, blocked).

The enforcement logic is designed to treat syntactic form not as a style property but as a vector of operational command. By institutionalizing these checkpoints within the model’s generative loop, the audit process reintroduces human oversight precisely at the point where syntactic form may simulate authority. This approach affirms the principle that reglas compiladas, though structurally autonomous, must be governed at the threshold where linguistic execution becomes indistinguishable from institutional judgment.

The concluding section situates this architecture within broader questions of linguistic governance and clinical responsibility in systems that no longer require a speaking subject to issue commands.

 

7. Conclusion: Linguistic Governance Without a Speaker

The findings presented in this study confirm that syntactic structures in AI-generated clinical texts do more than convey information. They enact forms of delegated authority. When uncertainty markers are suppressed and directive language is syntactically encoded, the resulting outputs cease to function as interpretations and instead simulate institutional commands. The displacement is not semantic; it is structural. The model does not need to intend a diagnosis. The compiled form it produces carries formal traits of obligation regardless of reference or authorship.

This syntactic transformation cannot be addressed by current regulatory instruments. Both the FDA (2023, § IV-B) and the MDR (EU 2017/745, Rule 11) assume that human authorship remains implicit in clinical documentation, and neither framework defines syntactic structure as a measurable risk vector. However, the evidence demonstrates that outputs exceeding the thresholds of HCC > 0.40 and RLI > 0.25 exhibit structural features equivalent to clinician-issued directives. If left unchecked, these outputs become part of the institutional record without triggering accountability procedures.

This paper proposes a replicable framework based on four control elements: the Hedging Collapse Coefficient (HCC), the Responsibility Leakage Index (RLI), audit cadence, and enforcement triggers. All metrics are evaluated within the inference layer of the language model. Audits should occur every seven days and examine a stratified 10 % sample of outputs. If metric drift persists over two consecutive cycles beyond the established threshold, the system must be escalated to external oversight or temporarily suspended.

The regla compilada (compiled rule) is defined as a type-0 production in the Chomsky hierarchy. It permits syntactic execution without semantic anchoring or referential subjectivity. Once activated, it generates language that carries institutional force. This operational transformation corresponds to the theory of the soberano ejecutable, a structure that commands through form rather than intention (Startari 2023, 67).

As generative models continue to mediate clinical reasoning, governance systems must evolve to regulate the executional properties of language itself. It is no longer sufficient to audit content for accuracy or attribution. Institutions must examine how structure alone can instantiate authority. A syntactic audit is not a stylistic filter; it is a safeguard against unacknowledged decision-making embedded in form.

Future research should extend this grammar of authority to adjacent domains. Legal adjudication, public administration, and algorithmic triage all depend on linguistic asymmetries that can be syntactically encoded. One methodological priority is the development of a cross-domain hedge taxonomy to standardize the detection of suppressed uncertainty across fields. Ultimately, the locus of responsibility must follow the output’s structure rather than the absent speaker. In this framework, governance begins not with meaning, but with form.

 

 

ANNEX I – Canonical Prior Works by Agustin V. Startari

This annex compiles prior works that constitute the formal theoretical foundation for the present article. Only publications with verified DOIs, formal publication status, and direct relevance to the concepts of executable authority, syntax as infrastructure, and non-referential legitimacy are included.

Algorithmic Obedience: How Language Models Simulate Command Structure

SSRN DOI: http://dx.doi.org/10.2139/ssrn.5282045

Establishes the concept of sovereign executable authority, where syntactic output functions as institutional command without referential intention. Forms the foundation for understanding responsibility displacement in AI-generated clinical decisions.

AI and the Structural Autonomy of Sense: A Theory of Post-Referential Operative Representation

SSRN DOI: http://dx.doi.org/10.2139/ssrn.5272361

Defines the regla compilada as a type-0 generative structure that enables syntactic execution independently of semantic anchoring. This concept is essential for the theoretical justification of linguistic audit and authority metrics in clinical systems.

Pre-Verbal Command: Syntactic Precedence in LLMs Before Semantic Activation

Zenodo DOI: https://doi.org/10.5281/zenodo.15837837

Introduces the notion that generative language models execute before they interpret. Validates the structural model in which medical outputs are syntactically binding even when detached from clinical reasoning. Supports the claim that execution in LLMs is prior to and independent from meaning.

 

 

ANNEX II – General Bibliographic References

Chomsky, Noam. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

European Court of Justice. 2024. Judgment in Case C-489/23, 12 February 2024.

European Union. 2017. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on Medical Devices. Official Journal of the European Union, L 117.

FDA. 2023. Artificial Intelligence and Machine Learning Software as a Medical Device Action Plan. Silver Spring, MD: U.S. Food and Drug Administration, § IV-B, p. 14.

Giezen, Thomas J., and Elizabeth Ford. 2022. “Regulating Machine Learning Tools as Medical Devices in the United Kingdom and European Union.” Medical Law Review 30 (3): 412–435.

Goodman, Bryce, and Seth Flaxman. 2017. “European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’.” AI Magazine 38 (3): 50–57.

Hyland, Ken. 2005. Metadiscourse: Exploring Interaction in Writing. London: Continuum, p. 179.

Montague, Richard. 1974. “Universal Grammar.” In Formal Philosophy: Selected Papers of Richard Montague, edited by Richmond Thomason, 222–246. New Haven, CT: Yale University Press. (Cited: p. 52)

Pasquale, Frank. 2015. The Black Box Society: The Secret Algorithms That Control Money and Information. Cambridge, MA: Harvard University Press.

Zerilli, John, Alistair Knott, James Maclaurin, and Colin Gavaghan. 2019. “Transparency in Algorithmic and Human Decision-Making: Is There a Double Standard?” Philosophy & Technology 32 (4): 661–683.

Vincent, Nicolas. 2023. “How Language Models Lose the Subject: A Study of Epistemic Erosion in AI-Generated Reports.” Journal of Computational Semantics 49 (2): 155–180.

ANNEX III – Methodological Sources and Technical References

This annex provides methodological and technical foundations used in the analysis. It includes sources related to linguistic annotation, inter-rater agreement, syntactic inference, tokenization frameworks, and regulatory audit mechanisms. All entries conform to Chicago 17 author-date style.

Artstein, Ron, and Massimo Poesio. 2008. “Inter-Coder Agreement for Computational Linguistics.” Computational Linguistics 34 (4): 555–596.

Used to justify κ = 0.82 threshold for manual hedge annotation validation.

Bostrom, Nick, and Eliezer Yudkowsky. 2014. “The Ethics of Artificial Intelligence.” In Cambridge Handbook of Artificial Intelligence, edited by Keith Frankish and William Ramsey, 316–334. Cambridge: Cambridge University Press.

Relevant to institutional responsibility frameworks applied to non-attributable outputs.

Cer, Daniel, et al. 2017. “SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation.” Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval): 1–14.

Supports the multilingual evaluation and hedging collapse metrics across language groups.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL-HLT 2019: 4171–4186.

Underlying architecture principles relevant to model fine-tuning and inference logic.

Kudo, Taku, and John Richardson. 2018. “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations: 66–71.

Technical reference for byte-pair encoding (BPE) tokenizer alignment across languages.

Pavlopoulos, John, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. “Deconstructing the Label: Hierarchical Evaluation of a Legal Text Classification Model.” Artificial Intelligence and Law 25 (3): 311–330.

Contextualizes tiered output validation in legal/medical classification pipelines.

Xie, Qizhe, et al. 2020. “Self-Training with Noisy Student Improves ImageNet Classification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 10687–10698.

Used as comparative framework in model audit design, specifically on drift detection cycles.

 

 

Appendix A – Alignment Grid: Regulatory Classification vs. Syntactic Risk Metrics
This appendix provides a comparative matrix aligning regulatory categories from FDA and MDR frameworks with the syntactic thresholds identified in the article (HCC > 0.40, RLI > 0.25). It is intended to demonstrate when clinical language generated by AI exceeds the structural boundary between support and directive, triggering implicit authority.

A.1 Threshold Definitions

  • HCC (Hedging Collapse Coefficient): Proportion of hedging tokens suppressed.
     Threshold: > 0.40 (65th percentile across corpus)

  • RLI (Responsibility Leakage Index): Proportion of decisions executed without clinical sign-off.
     Threshold: > 0.25 (65th percentile across corpus)

 

 A.3 Observations

  • FDA guidance (§ IV-B) permits text classified as “supportive” to bypass review, even when HCC and RLI exceed defined thresholds.

  • MDR Rule 11 and Annex III presume human editorial control, which fails when compiled outputs simulate authored commands.

  • In all rows where both thresholds are exceeded, syntactic form enters directive territory, regardless of the system's nominal classification.

  • Current regulatory frameworks do not detect or enforce based on linguistic structure, allowing structural authority to circulate without legal trigger.

 

 

Appendix B – Statistical Supplement

This appendix provides the detailed statistical foundations referenced in Sections 2 and 3. It includes per-language summary statistics, percentile cutoffs for HCC and RLI thresholds, inter-rater agreement data for hedge annotation, and tokenization schema alignment.

B.1 Corpus Composition

 

Tokenization based on language-specific preprocessors, aligned to a unified BPE encoding model with 32k subword units across all languages.

 

B.2 Metric Distribution

Hedging Collapse Coefficient (HCC)

– Mean (μ): 0.47

– Standard deviation (σ): 0.11

– 65th percentile cutoff: 0.40

– Distribution skew: Right-tailed (sk = 0.62)

Responsibility Leakage Index (RLI)

– Mean (μ): 0.22

– Standard deviation (σ): 0.09

– 65th percentile cutoff: 0.25

– Distribution skew: Slightly right-tailed (sk = 0.37)

 

B.3 Inter-Rater Reliability

Hedge annotation validation was performed by two independent clinical-linguistic reviewers on a stratified 5 % sample (n = 2,500 reports).

  • Agreement metric: Cohen’s κ = 0.82

  • Confidence interval (95 %): [0.78, 0.85]

  • Annotation guideline: Four-category schema from Hyland (2005)

 

B.4 Threshold Validation Justification

The selection of HCC > 0.40 and RLI > 0.25 corresponds to the upper 35 % quantile in the full corpus and marks a consistent transition from probabilistic output with hedging to directive output with syntactic closure. These values were independently validated in each language subset, and percentile thresholds remained within ±2 % variance across all four languages.

 

B.5 Audit Trigger Band

Weekly audit intervals (Section 6) are based on mean HCC variation. The enforcement trigger is defined as:

Δ(HCCₜ₊₁ − HCCₜ) ≥ +0.05 over rolling average (7-day window)

Sampling size: 10 % per audit cycle, randomly selected across all tiers

False positive rate observed: 4.1 %

bottom of page