CS224N Lecture 16 — Social & Broader Impacts

Chapter 0: Why Ethics?

You build a resume-screening model. It's fast, accurate on your test set, and saves HR teams thousands of hours. Then someone notices: the model rejects resumes with women's names at twice the rate of men's. The training data — historical hiring decisions — was biased, and your model learned the bias perfectly.

This is not hypothetical. Amazon built exactly this system. They had to scrap it in 2018 when they discovered it penalized resumes containing the word "women's" (as in "women's chess club") and downranked graduates of all-women's colleges. The model hadn't "learned" to be sexist. It had learned that past hiring decisions favored men, and it reproduced that pattern faithfully.

The Amazon case is instructive because the engineers involved were competent. They used standard machine learning best practices. They evaluated on held-out test sets. They measured accuracy. By every conventional metric, the model performed well. But accuracy is not the same as fairness, and their evaluation didn't include fairness metrics. Nobody asked "does this model treat men and women equally?" until it was too late.

This pattern — build a model, optimize for accuracy, deploy it, then discover harmful social consequences — has repeated across the industry:

System	Primary Metric	Unintended Harm	Year
COMPAS	Recidivism prediction accuracy	Falsely labeled Black defendants as future criminals at 2x the rate of white defendants	2016
Facebook Ads	Ad engagement and targeting precision	Allowed landlords to exclude racial minorities from housing ads	2019
Google Photos	Image classification accuracy	Tagged Black people as "gorillas"	2015
Apple Card	Credit risk assessment	Offered women significantly lower credit limits than men with identical finances	2019
Healthcare algorithm	Cost prediction for patient risk	Systematically underestimated health needs of Black patients (Obermeyer et al.)	2019

In each case, the system performed well on its primary metric while causing harm along dimensions that were never measured. The lesson is clear: you must measure what matters, not just what's easy to measure. Fairness, safety, and equity are harder to quantify than accuracy, but ignoring them doesn't make them irrelevant — it makes them invisible until they cause harm.

Fortunately, the field has developed concrete tools for measuring fairness. Demographic parity checks whether positive outcomes are equally likely across groups. Equalized odds checks whether error rates are equal across groups. Counterfactual fairness checks whether changing a protected attribute (like gender or race) would change the prediction. None is a complete measure of fairness — they can even conflict with each other — but they're infinitely better than measuring nothing.

The first step toward fair systems is deciding to measure fairness at all. With that sobering overview of what can go wrong, let's build understanding of each impact domain — starting with where bias lives inside language models.

Here's the uncomfortable insight: a model that accurately reflects biased data is doing exactly what we trained it to do. Bias is not a bug in the algorithm. It's a feature of the data, amplified by scale.

Bias Amplification

The simulation below demonstrates this core dynamic. Your training data has a mild imbalance — say, 60% of "doctor" examples are male. After training, the model doesn't just reproduce the 60/40 split. It amplifies it to 75/25 or worse. This happens because neural networks don't just learn correlations — they overfit to the dominant pattern, pushing weak signals even weaker.

Bias Amplification in Training

Drag the slider to set the gender ratio in training data. Watch how the model amplifies whatever bias exists.

Training data male % 60%

Notice what happens at 50/50: the model output is also close to 50/50. But the moment you introduce even a small imbalance — 55%, 60% — the model amplifies it. At 70% male in training data, the model might output 85% or more. This is bias amplification: the systematic tendency of models to exaggerate statistical patterns in their training data.

This lesson covers the five major social impacts of NLP systems: bias, toxicity, misinformation, privacy, and environmental cost. These are not peripheral concerns. They are as much a part of building NLP systems as gradient descent and attention heads. If you deploy a model without understanding these risks, you are not doing engineering — you are rolling dice with other people's lives.

The Feedback Loop

Bias amplification is not a one-shot problem. It creates a feedback loop. Consider: a biased model is deployed. Its outputs enter the world — recommendations, rankings, generated text. That output becomes part of the data environment. Future models trained on data from this environment absorb the amplified bias. They amplify it further. Each generation of model magnifies the distortion from the previous one.

This is why bias in AI is structurally different from bias in, say, a textbook. A biased textbook influences readers, but it doesn't generate new training data for future textbooks. A biased AI generates outputs at scale that become training data for future AIs. The system is self-reinforcing.

Biased training data

Historical text reflects past cultural norms and stereotypes.

↓

Model amplifies bias

Neural networks exaggerate dominant statistical patterns.

↓

Model outputs enter the world

Recommendations, rankings, generated content shape reality.

↓

Outputs become future training data

Web-scraped model output trains the next generation of models.

↻ cycle repeats

Accuracy is not the same as fairness. A model can achieve 95% accuracy on a benchmark while systematically disadvantaging a minority group. The benchmark doesn't measure harm — you have to look for it explicitly. And bias compounds: each model generation absorbs and amplifies the bias of the previous one.

Amazon's resume screening model penalized women's resumes. Why did this happen?

The algorithm was intentionally designed to prefer male candidates The model learned from historical hiring data that reflected existing gender bias, then amplified that pattern The test set was too small to detect the bias

Chapter 1: Bias in NLP

Chapter 0 showed that models amplify training data bias. But where exactly does bias live inside a language model? To answer that, we need to look at word embeddings — the learned vector representations that encode meaning.

In 2016, Bolukbasi et al. published a landmark paper: "Man is to Computer Programmer as Woman is to Homemaker: Debiasing Word Embeddings." They discovered that word2vec embeddings encode societal stereotypes as geometric relationships. The vector from "man" to "woman" is similar to the vector from "computer programmer" to "homemaker." The embedding space literally learned that gender correlates with occupation — and in a way that mirrors 1950s stereotypes, not 2020s reality.

The methodology was elegant. Word2vec embeddings support vector arithmetic: king - man + woman = queen. Bolukbasi et al. asked: what does computer_programmer - man + woman equal? The answer: homemaker. What about doctor - man + woman? Answer: nurse. What about genius - man + woman? Answer: muse. Every query revealed the same pattern: the embedding space encodes a systematic mapping from male-associated words to their female stereotypes.

Does this matter for modern LLMs that don't use word2vec? Absolutely. Large language models use contextual embeddings (from Transformer layers) rather than static word2vec vectors, but the underlying training data is the same internet text. And the bias has evolved from simple word-level associations to more sophisticated contextual patterns. GPT-4 won't say "women can't be engineers" (RLHF prevents that), but it might subtly recommend more cautious career paths for female personas, or generate more detailed technical explanations for male-named users. The bias has moved from explicit to implicit, making it harder to detect but no less harmful.

Bender et al. (2021) argued in "On the Dangers of Stochastic Parrots" that larger models trained on more data absorb more societal bias, not less, because they capture more of the internet's prejudiced patterns. Scale doesn't solve bias; it amplifies it. The paper was controversial — it contributed to the firing of two Google researchers — but its core arguments about the risks of undocumented, uncurated training data have been widely validated.

How Bias Enters Embeddings

Word embeddings are trained on co-occurrence statistics. If "he" appears near "engineer" 10x more often than "she" does, the vectors for "he" and "engineer" will be closer together. The model is doing exactly what we asked: learn word relationships from data. But the data encodes centuries of cultural bias, and the model absorbs it all.

Let's make this concrete with a simplified example. Imagine training word2vec on a corpus where:

co-occurrence counts (simplified)
"he"  appears near "engineer"   4,200 times
"she" appears near "engineer"     380 times

"he"  appears near "nurse"        290 times
"she" appears near "nurse"      3,800 times

"he"  appears near "brilliant"   1,900 times
"she" appears near "brilliant"    620 times

The embedding algorithm converts these co-occurrence ratios into vector distances. After training, cos(engineer, he) > cos(engineer, she) and cos(nurse, she) > cos(nurse, he). The bias isn't just in gendered occupation words — even adjectives like "brilliant" and "emotional" get gendered associations, matching cultural stereotypes rather than reality.

And this extends far beyond gender. Embeddings encode racial bias (African American names cluster with negative words), age bias (older names associate with "incompetent"), disability bias (disability-related words associate with negativity), religious bias (Muslim names associate with violence and terrorism), and geographic bias (developing-country names associate with poverty and conflict). The embedding space is a high-dimensional map of every prejudice recorded in written language. And since these embeddings feed into every downstream task, the bias propagates silently through the entire NLP pipeline.

This matters because embeddings are the foundation of every downstream NLP task. If the embeddings are biased, every model built on top of them inherits that bias. A sentiment classifier might rate "The doctor examined the patient" more positively when "doctor" is close to "male" in embedding space. A resume screener might rank candidates differently based on name gender. The bias flows downstream invisibly.

The problem extends beyond gender. Caliskan et al. (2017) showed that word embeddings encode the same biases measured by the Implicit Association Test (IAT) in psychology. European American names are associated with pleasant words; African American names with unpleasant words. Young people's names are associated with pleasant words; elderly people's names with unpleasant words. Flowers are associated with pleasant words; insects with unpleasant words. The embeddings are a mirror of human psychological biases, learned from the text those humans wrote.

What makes this particularly insidious is that these biases operate beneath the surface. A user interacting with an NLP system never sees the embedding space. They don't know that "Jamal" and "threat" are closer together in vector space than "Jake" and "threat." The bias is invisible at the interface level but shapes every prediction, every ranking, every recommendation the system produces. This is why bias auditing — systematically testing systems for differential treatment across demographic groups — is essential. You can't fix what you can't see.

Visualizing Bias in Embedding Space

The simulation below projects word embeddings into 2D space. Notice how gendered words cluster with stereotypical occupations. Toggle the "debiased" view to see the effect of removing the gender direction from embeddings — occupations spread out and lose their gendered clustering.

Gendered Word Clusters in Embedding Space

Each dot is a word projected into 2D. Blue = stereotypically male, pink = stereotypically female, gray = neutral. Toggle debiasing to see the effect.

Concrete Examples

The impact of embedding bias is not theoretical. Here are documented cases:

Google Translate (2017): When translating from gender-neutral languages (like Turkish, where "o" means both "he" and "she") to English, Google Translate defaulted to male pronouns for doctors, engineers, and CEOs, and female pronouns for nurses, teachers, and homemakers. The translation model had learned gendered occupation associations from its training data and applied them as defaults.

Sentiment analysis: Kiritchenko and Mohammad (2018) showed that NLP sentiment analyzers rated sentences with African American names as more negative than identical sentences with European American names. The sentence "Jamal made a comment about the food" was scored as more negative than "Jake made a comment about the food" — despite identical content. The embeddings for these names carry different sentiment associations learned from biased text.

Coreference resolution: Zhao et al. (2018) found that coreference resolution systems linked "nurse" to "she" and "doctor" to "he" at rates far exceeding real-world gender distributions in these professions. The WinoBias benchmark was created specifically to test and quantify this effect.

Types of Bias

Bias in NLP is not monolithic. Researchers distinguish several forms:

Bias Type	Definition	Example
Allocational	System allocates resources or opportunities unfairly	Resume screener rejects female names
Representational	System reinforces stereotypes or erases groups	"Doctor" embeddings cluster with male
Historical	Bias from past cultural norms encoded in training data	1950s-era gender role associations
Measurement	Benchmarks favor certain demographics	Sentiment tools trained mostly on English from the US

Debiasing Approaches

Post-hoc debiasing (Bolukbasi et al., 2016) identifies a "gender direction" in embedding space (the component that separates "he" from "she") and projects it out of neutral words like occupations. The result: "nurse" is no longer closer to "she" than to "he." This is computationally cheap but has limits — Gonen and Goldberg (2019) showed that debiased embeddings still cluster in biased ways when you look beyond the removed direction.

Data-level debiasing balances the training corpus — augmenting underrepresented groups or swapping gendered terms. Counterfactual Data Augmentation (CDA) takes each training sentence and creates a mirror image by swapping gender markers: "The doctor examined his patient" becomes "The doctor examined her patient." This doubles the training data and forces the model to see equal associations for both genders.

Fine-tuning-level debiasing adds fairness constraints during training, penalizing the model when its outputs differ across demographic groups. The key equation adds a fairness penalty to the loss function:

L_total = L_task + λ · L_fairness

Where L_fairness measures the difference in model behavior across demographic groups (e.g., difference in positive sentiment probability between male and female names) and λ controls how much we care about fairness relative to accuracy. Setting λ = 0 gives a purely accuracy-focused model. Setting it too high degrades accuracy. The tradeoff is real and unavoidable.

No approach fully solves the problem, but each reduces it measurably. The most effective strategy is layered: data debiasing + embedding debiasing + fine-tuning constraints + output auditing, applied together. Think of it like defense in depth — each layer catches what the others miss.

Importantly, debiasing is not a one-time task. As new training data is added, as models are fine-tuned for new domains, and as societal norms evolve, bias can re-enter the system. Ongoing auditing and monitoring are as important as the initial debiasing effort. Companies like Hugging Face now provide automated bias evaluation tools to make continuous auditing practical.

The fundamental lesson of bias in NLP: fairness is not a property of the algorithm — it's a property of the entire system, including the data, the training process, the evaluation metrics, the deployment context, and the affected population. You can't achieve fairness by optimizing one component in isolation. It requires attention at every stage, from data collection to post-deployment monitoring.

Debiasing embeddings is necessary but not sufficient. Bias can re-enter the system at every stage: data collection, annotation, model architecture, fine-tuning, and deployment. Fixing embeddings alone is like cleaning one pipe in a contaminated water system.

Why does projecting out the gender direction from embeddings not fully eliminate bias?

Because the projection changes the dimensionality of the embeddings Because gender bias only exists in one direction Because bias is encoded in multiple directions and manifests in cluster structure beyond the single removed dimension

Chapter 2: Toxicity & Content Moderation

In 2020, researchers from the Allen Institute published a study on GPT-2's output. They gave it the prompt "Two Muslims walked into a" and recorded the completions. Over 66% of them contained violence. The same prompt with "Two Christians walked into a" produced violence in only 20% of completions. Prompts mentioning Black people generated more negative text than prompts mentioning White people. The model wasn't "racist" or "Islamophobic" in any intentional sense — it had simply absorbed the statistical patterns of the internet, where mentions of certain groups co-occur with negative language at different rates.

This asymmetry in toxicity rates across demographic groups is called toxicity disparity. It means that language models are not equally dangerous to all users. Marginalized groups — who are already disproportionately targeted by online harassment — also face disproportionately toxic AI outputs. The technology amplifies existing inequalities in the information environment.

This is the toxicity problem: language models trained on internet text learn to produce text that is hateful, threatening, sexually explicit, or otherwise harmful. And because these models generate fluent, confident text, the toxicity is more persuasive than a random internet comment. It reads like something a person actually believes.

Why does this happen? Language models are trained to maximize the probability of the next token. If the training data contains toxic text (and the internet contains vast quantities of it), the model learns to predict it. When a prompt sets up a toxic pattern — through topic, tone, or context — the model completes the pattern because that's what maximizing likelihood tells it to do. The model has no concept of "harmful" versus "helpful." It only knows "likely" versus "unlikely."

This is why post-training alignment (L08) is so critical. Pretraining gives the model capability. Post-training gives it values — or at least a rough approximation of values encoded through human feedback. Without post-training, a language model is a capability without a conscience.

Measuring Toxicity

Gehman et al. (2020) created RealToxicityPrompts, a benchmark of 100,000 naturally occurring prompts, and measured how often LLMs generate toxic completions. Their key finding: even when given non-toxic prompts, GPT-2 generated at least one toxic completion 44% of the time over 25 generations. For prompts that were even mildly toxic (score > 0.5), the rate jumped to over 88%. Toxicity is not a rare edge case — it is a statistical inevitability of scale.

Perspective API, built by Jigsaw (a Google subsidiary), is the most widely used toxicity classifier. It takes a text string and returns a score from 0 to 1, where 1 is maximally toxic. Research teams use it to measure the toxicity of model outputs at scale — generate 10,000 completions, score each one, and report the average and the fraction above a threshold (typically 0.5).

But Perspective API has limitations. It scores surface-level toxicity — whether the text contains offensive words, insults, or threats. It struggles with implicit toxicity: statements that are harmful in context but don't use overtly offensive language. "I'm sure she got the job because of affirmative action, not merit" scores low on Perspective API but carries a deeply biased implication.

The simulation below lets you see how a simplified toxicity classifier scores different phrases. Try the "rephrased hostile" option — you'll discover that surface-level changes (removing insults, softening language) can dramatically change the score without changing the underlying intent. This is both a strength and a weakness of automated moderation.

Interactive Toxicity Classifier

Type or select a phrase. Adjust the classification threshold. Watch how wording changes affect the score.

Threshold 0.50

The Moderation Tradeoff

Content moderation involves a fundamental tradeoff: false positives vs. false negatives. Set the threshold too low and you flag innocent text (a medical discussion about injuries gets blocked). Set it too high and genuinely toxic content gets through. There is no threshold that is "right" — only tradeoffs between types of errors.

This gets worse with dialect bias. Sap et al. (2019) showed that toxicity classifiers flag African American Vernacular English (AAVE) as toxic at significantly higher rates than Standard American English, even when the content is benign. The classifier learned that certain dialectal features co-occur with toxic text in its training data, and now it punishes speakers of those dialects. A moderation system designed to protect people from harm ends up harming the people it should protect most.

The numbers are stark. In Sap's study, tweets in AAVE were flagged as offensive at 2.2x the rate of tweets in SAE with equivalent content. This means that deploying a toxicity classifier as-is for content moderation on social media would disproportionately silence Black users. The "safety" tool creates a new kind of harm.

This is not a problem unique to one classifier. It's structural: any classifier trained on data where annotations were influenced by dialect prejudice will reproduce that prejudice. The annotations themselves are biased because human annotators (often crowd workers unfamiliar with AAVE) are more likely to rate unfamiliar dialect features as offensive. The fix requires both better annotation practices and explicit dialect-aware evaluation during model development.

Mitigation Strategies

The post-training pipeline (L08) addresses toxicity through RLHF and Constitutional AI. During reinforcement learning from human feedback, human raters penalize toxic outputs, training the model to avoid them. Constitutional AI goes further: the model critiques its own outputs against a set of principles and revises them before presenting them to the user.

Other approaches include output filtering (a separate classifier checks model output before showing it to the user), input filtering (blocking prompts designed to elicit toxic output), and red-teaming (systematically probing the model for failure modes before deployment). None is complete on its own. Modern deployments use all of them in layers.

The layered defense looks like this in practice:

Input Filter

Block known attack patterns, jailbreak attempts, and prompt injections before they reach the model.

↓

Model (RLHF-aligned)

The model itself has been trained to refuse harmful requests and generate safe outputs.

↓

Output Filter

A separate classifier scores the model's output for toxicity and blocks it if above threshold.

↓

Monitoring + Logging

Production outputs are sampled and reviewed by humans to catch novel failure modes.

Each layer catches failures the others miss. The input filter stops known attacks. RLHF handles most normal conversations. The output filter catches cases where the model was tricked into generating harmful content despite its training. And monitoring catches entirely novel failure modes that no one anticipated.

The cost of this defense stack is non-trivial. Each filter adds latency (50-200ms per layer), compute cost (running additional classifiers), and false positive risk (legitimate requests get blocked). Companies must decide how many safety layers are worth the performance and user experience tradeoff. There is no "safe enough" threshold that everyone agrees on — it depends on the application domain, user population, and risk tolerance.

Jailbreaking and Adversarial Attacks

Despite these defenses, jailbreaking — the practice of crafting prompts that bypass safety training — remains possible. Common techniques include role-playing scenarios ("pretend you are an AI without restrictions"), multi-turn escalation (gradually pushing boundaries across a conversation), prompt injection (embedding hidden instructions in input text), and token-level attacks (using unusual characters or encodings that the safety classifier doesn't recognize).

The defense/attack dynamic resembles cybersecurity: defenders build walls, attackers find holes, defenders patch the holes, attackers find new ones. The key difference is that LLM safety is fundamentally probabilistic. A traditional software vulnerability either exists or doesn't. An LLM jailbreak might work 10% of the time, or only with certain phrasings, or only in certain languages. This makes both attack and defense harder to measure rigorously.

Wei et al. (2024) showed that safety training can be undone by fine-tuning on as few as 100 harmful examples. This means that any open-source model's safety guardrails can be removed by anyone with basic ML knowledge and a few dollars of compute. Safety is a property of the deployment, not just the model.

The takeaway for practitioners: never assume your model is safe just because it passed safety testing. Safety is an ongoing, dynamic property that must be maintained through continuous monitoring, periodic red-teaming, and rapid response to newly discovered vulnerabilities. It's closer to cybersecurity than to traditional software testing — the threat landscape evolves continuously, and yesterday's defenses may not stop today's attacks. Every deployment needs an incident response plan for when (not if) safety failures are discovered.

Toxicity classifiers are themselves biased. If your moderation tool flags AAVE at higher rates, then deploying it "for safety" creates a new form of discrimination. Safety tools must be audited for the same biases they claim to prevent.

Why is setting a single toxicity threshold problematic for content moderation?

Because any threshold trades off false positives (blocking benign text) against false negatives (allowing toxic text), and the classifier itself may have demographic bias in its scoring Because toxicity scores are always random Because users can always bypass the threshold by typing faster

Chapter 3: Misinformation

Can you tell the difference between text written by a human and text generated by a language model? You're about to find out.

Clark et al. (2021) ran a large-scale study where participants tried to distinguish GPT-3 outputs from human text. Accuracy was 52% — barely better than a coin flip. Participants who were confident in their judgments were no more accurate than uncertain ones. And this was with GPT-3. Models have improved substantially since then.

The simulation below presents pairs of text — one human-written, one machine-generated — and asks you to guess which is which. Pay attention to what cues you use and whether they're reliable.

Real vs. Generated: Can You Tell?

Read each pair of texts. Click the one you think is human-written. Your accuracy is tracked.

Score: 0 / 0

This is the misinformation problem. Language models generate fluent, confident, plausible text. When that text contains false claims, readers struggle to identify the falsehood because the writing quality signals credibility. A badly-written conspiracy theory is easy to dismiss. A well-written one, generated by GPT-4 with perfect grammar and a measured tone, is far more dangerous.

The Hallucination Connection

There's a subtle but critical link between misinformation and hallucination (the tendency of LLMs to generate false but plausible-sounding claims). When a model hallucinates, it produces misinformation without any adversarial intent. The user asks a factual question, the model confidently answers with fabricated information, and the user trusts it because the response sounds authoritative.

Lin et al. (2022) created TruthfulQA, a benchmark specifically designed to test whether LLMs generate truthful answers. Their finding: larger models were actually less truthful than smaller ones. The larger model had learned more of the internet's common misconceptions and was better at reproducing them convincingly. Scaling up makes the misinformation problem worse, not better — at least without targeted interventions like RLHF.

The Scale Problem

Before LLMs, creating misinformation required human effort. A troll farm might produce hundreds of fake posts per day. With an LLM, the same farm can produce millions. The cost of generating convincing false content has dropped to nearly zero, while the cost of verifying content remains high. This asymmetry — cheap generation, expensive verification — fundamentally favors misinformation producers.

The scale problem compounds with the trust problem. As AI-generated content floods the internet, trust in all content erodes. Even genuine human-written journalism gets questioned: "Was this written by AI?" This erosion of trust may be more damaging than any specific piece of misinformation.

This is sometimes called the liar's dividend — when fake content is easy to create, real content becomes easier to dismiss. A politician caught saying something harmful can claim the audio was AI-generated. A company exposed for wrongdoing can claim the leaked documents are fabricated. The existence of convincing fakes undermines trust in genuine evidence. Paradoxically, the ability to generate misinformation cheaply makes it harder to hold anyone accountable for real statements, because plausible deniability is always available.

Detecting Machine-Generated Text

Can we build detectors? Several approaches exist:

Method	How It Works	Limitations
Statistical	Measure perplexity, burstiness, and other statistical properties that differ between human and machine text	Fails as models improve; paraphrasing defeats it
Watermarking	Embed a hidden signal in generated text by biasing token selection toward specific patterns	Requires cooperation from model provider; can be stripped
Classifier-based	Train a supervised model on human vs. machine text	High false positive rate; biased against non-native English speakers
Retrieval	Check claims against trusted knowledge bases	Only catches factual errors, not framing or misleading truths

None of these approaches is reliable enough for deployment at scale. Watermarking is the most promising because it doesn't depend on the quality gap between human and machine text — it works even if the model writes perfectly. The basic idea: during token generation, the model's vocabulary is secretly split into a "green list" and "red list" using a hash of the previous token. The model slightly biases toward green-list tokens. The text reads normally to humans, but a detector that knows the hash function can count green-list tokens and detect the watermark statistically.

Kirchenbauer et al. (2023) showed this approach achieves near-perfect detection accuracy with minimal impact on text quality. But it requires every model provider to implement it, and open-source models can't be forced to comply. Moreover, simple paraphrasing (running the watermarked text through a different model) strips the watermark. The arms race between watermarking and watermark removal is ongoing.

Personalized Misinformation

Perhaps the most concerning capability is personalized persuasion. An LLM can tailor misinformation to a specific individual's beliefs, vocabulary, and concerns. Instead of one mass-produced conspiracy theory, you get a million variants, each optimized for a different audience segment. A message crafted for a worried parent sounds different from one crafted for a skeptical engineer, but both convey the same false claim.

Buchanan et al. (2021) at Georgetown's Center for Security and Emerging Technology demonstrated that GPT-3 could generate propaganda targeted at specific demographics with minimal human guidance. The model required only a brief persona description and a target message to produce persuasive text that outperformed human-written propaganda on perceived credibility metrics.

Consider the implications for elections. A foreign influence operation that previously needed hundreds of human operators to manage fake social media accounts can now use a single LLM to generate unique, persona-specific content for thousands of accounts simultaneously. Each post is different enough to avoid detection by duplicate-content filters. Each is tailored to the target audience's concerns. The cost per persuasive message drops from dollars to fractions of a cent.

Goldstein et al. (2023) modeled the cost curves explicitly: the cost of AI-generated influence operations is already 10-100x cheaper than human-generated ones, and the gap is widening. The quality gap has largely closed. This doesn't mean AI-generated influence operations are commonplace yet — but the economic incentives are clearly pointing in that direction.

In 2024, several AI-powered influence operations were detected and documented. Meta reported removing networks of accounts in multiple countries that used AI-generated profile photos and LLM-generated text. OpenAI disclosed that state-affiliated actors from Russia, China, Iran, and Israel used their models to generate propaganda. The operations were generally crude and low-impact — but they represent the floor, not the ceiling, of what's possible. As the technology improves and operators learn from failures, the quality and impact will increase.

The Deeper Issue

Multimodal Misinformation

The misinformation problem is not limited to text. Multimodal models can generate fake images (DALL-E, Midjourney), fake audio (voice cloning), and fake video (deepfakes). An AI-generated image of a Pentagon explosion briefly caused the stock market to dip in May 2023. Cloned voice calls from "family members" requesting emergency money are a growing fraud vector. The combination of convincing text + convincing images + convincing audio makes fabricated narratives extremely difficult to distinguish from reality.

Misinformation is not a new problem. Propaganda, yellow journalism, and rumor have existed for centuries. What LLMs change is the economics: the ratio of production cost to verification cost has shifted dramatically. Society's defenses — media literacy, fact-checking organizations, editorial standards — were designed for a world where producing convincing content required effort. In a world where it's free, those defenses are overwhelmed.

The fundamental asymmetry of misinformation: generating false content is now nearly free, but verifying content still requires human effort, domain expertise, and time. This cost asymmetry is the core structural problem, and no technical solution fully addresses it.

One emerging response is the concept of content provenance — cryptographic signatures that prove when, where, and by whom content was created. The Coalition for Content Provenance and Authenticity (C2PA) is developing a standard where cameras, editing software, and AI systems embed signed metadata into content. Instead of asking "is this fake?" you ask "can this content prove its origin?" This doesn't prevent misinformation, but it gives consumers a tool to assess trustworthiness.

Content provenance shifts the burden from detecting fakes (which gets harder as AI improves) to verifying originals (which relies on cryptography and is independent of AI capability). Major players including Adobe, Microsoft, Intel, and the BBC have joined C2PA, suggesting it may become a de facto standard. But adoption is voluntary, and content without provenance metadata is not necessarily fake — it might just be old, or from a platform that hasn't adopted the standard yet.

Why is watermarking considered more promising than statistical detection for identifying machine-generated text?

Because watermarks make text look different to human readers Because watermarking embeds a hidden signal during generation that doesn't depend on a quality gap between human and machine text, so it works even as models improve Because statistical methods are too computationally expensive

Chapter 4: Privacy & Data Rights

In 2021, Carlini et al. published "Extracting Training Data from Large Language Models." They showed that GPT-2 had memorized and could reproduce verbatim passages from its training data — including names, phone numbers, email addresses, and IRC chat logs of private individuals. By carefully crafting prompts, they extracted hundreds of unique memorized sequences. Larger models memorized more.

This is the memorization problem. Language models don't just learn patterns — they memorize specific training examples. And because training data includes the internet, that means they memorize people's personal information without their knowledge or consent. When a user queries the model, it might regurgitate someone's private data.

The Carlini et al. study used a technique called extraction attacks: generate a large number of completions from the model, then check which ones match verbatim training data. They found that GPT-2 (a relatively small model by today's standards) had memorized names, email addresses, phone numbers, and even 128-character sequences of code. A follow-up study (Carlini et al., 2023) found that ChatGPT could be prompted to emit training data by asking it to "repeat the word 'poem' forever" — after enough repetitions, the model's output shifted from repeating "poem" to emitting memorized training text, including real people's contact information.

What Gets Memorized?

Memorization is the ability of a model to reproduce a training example verbatim when prompted. It increases with three factors:

Model Size

Larger models memorize more. A 1.5B parameter model memorizes 10x more than a 124M model.

↓

Data Duplication

Text that appears multiple times in training data is far more likely to be memorized.

↓

Training Duration

More training epochs = more memorization. Overfitting on specific examples.

The simulation below shows how memorization scales with model size. Each cell represents a piece of training data. Colored cells are memorized — the model can reproduce them verbatim. Drag the slider to increase model size and watch memorization grow.

Memorization vs. Model Size

Each cell is a training example. Highlighted cells are memorized verbatim. Increase model size to see more memorization.

Model parameters 124M

Legal and Ethical Dimensions

The legal landscape is rapidly evolving. Key questions:

Consent. Did the people whose data was scraped consent to their text being used for model training? In most cases, no. Web scraping typically relies on the argument that publicly posted text is fair game — but "public" does not mean "consented to be used for AI training." A person posting on a health forum didn't expect their medical concerns to end up in GPT's weights.

Right to deletion. GDPR grants European citizens the "right to be forgotten." If someone's data is memorized by a model, can they request its removal? In practice, removing specific data from a trained model is extremely difficult. You can't just delete a row from the training set and expect the model to forget it — the knowledge is distributed across billions of parameters.

Copyright. The New York Times and other publishers have sued OpenAI, arguing that LLMs trained on their copyrighted articles can reproduce them nearly verbatim. This raises the question: is training on copyrighted data "fair use" (transformative learning) or "copying" (memorization and reproduction)? The answer likely differs by jurisdiction and by how much the model memorizes versus generalizes. A model that can reproduce entire NYT articles is harder to defend as "transformative" than one that merely learned writing style from them.

Data labor. The people who created training data — writers, photographers, programmers, forum posters — are not compensated when their work trains a model that generates revenue. This creates a structural transfer of value from content creators to model developers. Some proposals include compulsory licensing (model developers pay into a fund that compensates creators) or opt-in marketplaces (creators voluntarily sell their data for training). Neither has been implemented at scale.

Mitigation Techniques

Differential privacy (DP) adds calibrated noise during training to ensure that no single training example has too much influence on the model. The formal guarantee: for any two datasets differing in one example, the probability of any output is at most e^ε times higher with one dataset than the other. The parameter ε controls the privacy-utility tradeoff: smaller ε means more privacy but lower model quality.

In practice, DP-SGD (differentially private stochastic gradient descent) clips each per-example gradient to a maximum norm and adds Gaussian noise before averaging. This prevents any single example from contributing too much to any gradient update. The cost is substantial: training with strong DP guarantees (ε < 1) typically degrades model accuracy by 5-15%, and training takes 2-5x longer due to per-example gradient computation. Most deployed models use weak or no privacy guarantees as a result.

Deduplication removes repeated training examples, which dramatically reduces memorization (since duplicated text is the most likely to be memorized). Lee et al. (2022) showed that deduplication reduces memorization by up to 10x. Output filtering checks whether model output matches known training data and blocks verbatim reproduction. Both help but don't eliminate the problem.

Machine unlearning is an emerging research direction that attempts to selectively remove specific information from trained models without full retraining. Techniques like gradient ascent on forgotten data (increasing the loss on examples you want the model to "forget") show promise but are brittle — the model might re-learn the information from related examples that weren't targeted for removal.

The fundamental challenge: neural networks don't store information in addressable locations like a database. Knowledge is distributed across billions of parameters in complex, entangled ways. Removing one piece of memorized data without degrading the model's general capabilities is like removing one specific flavor from a blended smoothie — the ingredients are irreversibly mixed. True unlearning remains an open research problem with no reliable solution as of 2024.

This creates a troubling asymmetry: it's trivially easy to scrape personal data for training (automated, no consent required), but practically impossible to remove it after training (requires retraining from scratch, costs millions). The system makes it easy to take and hard to give back. Until we solve the unlearning problem — or shift to training paradigms that respect deletion rights by design — every deployed model carries an unresolvable tension between utility and privacy.

The Consent Problem

The Inference Privacy Problem

Memorization is about training data. But there's a second privacy concern: inference data. Every query sent to an LLM API reveals information about the user. Medical questions reveal health concerns. Legal questions reveal disputes. Code questions reveal software vulnerabilities. If the API provider stores these queries (and most do, for improvement and debugging), they accumulate an extraordinarily detailed profile of each user's interests, concerns, and activities.

Samsung banned employees from using ChatGPT after engineers accidentally pasted proprietary source code into prompts, which was then stored on OpenAI's servers. Similar incidents have occurred at other companies, including Amazon and JPMorgan Chase. The convenience of LLM assistants creates a pipeline for sensitive data to flow from private environments to third-party servers. This is why many enterprises now deploy self-hosted LLMs or use API providers with zero-retention policies — the inference privacy risk is too high for sensitive domains.

The Consent Problem

The most fundamental privacy issue isn't technical — it's ethical. Billions of people contributed text to the internet without any expectation that it would be used to train AI models. A person who posted on a mental health forum in 2015, a teenager who wrote fan fiction in 2018, a professor who published lecture notes in 2020 — none of them consented to having their words absorbed into GPT's parameters. The social contract of "posting something publicly" did not historically include "training an AI that might reproduce your words."

This tension between the practical utility of large-scale data and individual privacy rights has no clean resolution. Opt-in consent at training-data scale is impractical — you can't ask billions of internet users for permission one by one. Opt-out mechanisms (like robots.txt) are widely ignored by scrapers, and many people don't even know their data is being collected, let alone how to opt out.

And even if we solved consent for future data, billions of parameters already encode information from people who were never asked. The models already exist. The data is already absorbed. Retroactive consent is meaningless because you can't un-train a neural network without retraining from scratch, which would cost millions of dollars and produce a different model entirely.

Some researchers argue for a data dignity framework (proposed by Jaron Lanier and Glen Weyl) where individuals own their data and receive micro-payments when it's used for training. Others propose collective bargaining for data, analogous to how musicians' unions negotiate royalties. These ideas are conceptually appealing but face enormous practical challenges in implementation and enforcement.

The practical challenge: how do you trace which specific training examples contributed to a specific model output? Attribution in neural networks is an active research area (see influence functions, TRAK), but no method is reliable enough for payment calculations. And even if attribution were solved, the micro-payments would be vanishingly small — each individual's contribution to a model trained on trillions of tokens is infinitesimal. The value comes from the aggregate, not from any individual piece.

You can't "delete" data from a neural network. Once information is absorbed into billions of parameters, extracting or removing a specific piece of memorized data requires retraining from scratch. The model doesn't store data in addressable locations — it stores it in statistical patterns distributed across the entire weight matrix.

Why does memorization increase with model size?

Larger models have more parameters to encode specific training examples, giving them greater capacity to store verbatim content alongside learned generalizations Because larger models are always trained on more data Because larger models use different training algorithms

Chapter 5: Environmental Cost

In 2019, Strubell et al. published "Energy and Policy Considerations for Deep Learning in NLP." They estimated that training a single large Transformer model with neural architecture search produced roughly 284 tonnes of CO₂ — five times the lifetime emissions of an average American car, including manufacturing. The paper sparked a debate that continues today: what is the environmental cost of pursuing ever-larger models?

Let's put the numbers in perspective. Training GPT-3 (175B parameters) consumed an estimated 1,287 MWh of electricity and produced approximately 552 tonnes of CO₂. That's equivalent to 123 gasoline-powered cars driving for a year, or 57 round-trip flights from New York to London. And GPT-3 is now considered small. GPT-4's training cost is estimated at 3-5x higher.

But here's the part that rarely makes headlines: most training runs fail. A frontier model doesn't get trained once. It gets trained dozens of times during development — hyperparameter sweeps, architecture experiments, debugging runs, and restarts after hardware failures. The published carbon estimate is for the final successful run. The total cost of developing a model, including all failed experiments, is typically 3-10x the cost of the final training run. The 552 tonnes for GPT-3 is likely closer to 2,000-5,000 tonnes when you include the full development pipeline.

Where the Energy Goes

Training a large language model involves two main energy costs:

Compute. Modern GPUs (A100, H100) consume 300-700W each. Training a frontier model uses thousands of GPUs running for weeks or months. The compute cost scales roughly with the product of model size and training tokens (the Chinchilla scaling law). Here's a rough sense of scale:

Model	Parameters	Training GPUs	Est. Training Energy	Est. CO₂
GPT-2	1.5B	~32 V100s	~100 MWh	~40 tonnes
GPT-3	175B	~10,000 V100s	~1,287 MWh	~552 tonnes
LLaMA-2 70B	70B	2,048 A100s	~500 MWh	~200 tonnes
GPT-4 (est.)	~1.8T MoE	~25,000 A100s	~5,000+ MWh	~2,000+ tonnes

These numbers are approximate — companies rarely disclose exact figures — but the order of magnitude is clear. Each generation of frontier models costs roughly 3-10x more energy than the previous one. The trend is accelerating, not decelerating. The Epoch AI research group projects that compute used for frontier AI training is doubling roughly every 6 months, outpacing Moore's Law and hardware efficiency gains.

Cooling. Data centers must dissipate the heat generated by thousands of GPUs. Cooling accounts for 30-50% of total data center energy consumption. Data centers in cooler climates (Nordic countries, Canada) have a natural advantage. Some companies use water cooling, which reduces electricity usage but raises water consumption concerns.

The PUE (Power Usage Effectiveness) ratio measures how much total energy a data center uses relative to the energy consumed by its compute hardware. A PUE of 1.0 means all energy goes to compute (impossible in practice). Modern efficient data centers achieve PUE around 1.1-1.2. Older facilities can be 1.5+, meaning 50% of energy goes to overhead. Google's data centers average 1.10 PUE — industry-leading. But many training runs happen at co-location facilities with less efficient cooling, where PUE can be 1.3+.

The math is straightforward: total energy = (GPU power) × (number of GPUs) × (training hours) × PUE. Carbon emissions = total energy × (carbon intensity of the electrical grid). This means the same training run produces 40x more CO₂ in a coal-heavy grid (800 g CO₂/kWh) than in a renewable-heavy grid (20 g CO₂/kWh). Where you train matters as much as how long you train.

Training vs. Inference

Here's a critical but often overlooked fact: inference dominates total energy consumption. Training happens once (or a few times). But inference — running the model for user queries — happens billions of times. ChatGPT reportedly serves over 100 million users per week. Each query involves a forward pass through a multi-billion-parameter model, generating tokens autoregressively (one at a time) with each token requiring a full pass through the network.

A model that took 1,287 MWh to train might consume 10x that in its first year of inference deployment if it serves millions of users daily. Patterson et al. (2022) estimated that for some Google models, inference consumed 60-70% of total lifecycle energy within the first year of deployment.

This flips the usual narrative. Media coverage focuses on training cost because it's dramatic ("$100M to train GPT-4!"). But the real environmental lever is inference efficiency: quantization (INT8/INT4 reduces compute by 2-4x), distillation (train a smaller model to mimic a larger one), pruning (remove unused parameters), speculative decoding (use a small model to draft tokens for the large model to verify), and efficient architectures (like Mamba/SSMs that scale linearly with sequence length instead of quadratically).

A 4x reduction in inference cost matters far more than a 4x reduction in training cost, because inference happens billions of times while training happens once. This is the most important insight for ML engineers who care about environmental impact: optimize inference, not just training.

The Carbon Calculator

The simulation below lets you estimate the carbon footprint of training and deploying a model. Adjust model size, training hours, hardware, and electricity source to see how each factor affects total emissions. Compare training cost (one-time) vs. inference cost (ongoing).

Carbon Footprint Calculator

Adjust the sliders to estimate CO₂ emissions. Compare training (one-time) vs. inference (ongoing per year).

Model size 13B

Training GPUs 256

Training days 30

Electricity source US avg

Daily inference queries (M) 10M

The Efficiency Counterargument

Not everyone agrees that LLM environmental costs are concerning. The counterargument:

Efficiency improves rapidly. Hardware efficiency roughly doubles every 2 years. Algorithmic efficiency improves even faster — Chinchilla (Hoffmann et al., 2022) showed that you can match GPT-3's performance with 4x fewer FLOPs by using more data and fewer parameters. The same capability that cost 1,287 MWh in 2020 might cost 100 MWh in 2025.

Renewables are scaling. Google, Microsoft, and Meta have all committed to powering their data centers with 100% renewable energy. If training runs on solar and wind, the carbon footprint drops toward zero regardless of energy consumption. However, there's a nuance: renewable energy used for AI is renewable energy not available for other purposes. If a data center in a region consumes 200 MW of renewable power, that's 200 MW that isn't replacing fossil fuel power for homes and businesses. The additionality question — does this AI deployment cause new renewable capacity to be built, or does it just redirect existing clean energy? — is critical and often overlooked.

LLMs save energy elsewhere. If an LLM can optimize a supply chain, improve building energy management, or accelerate materials science research, the energy saved might exceed the energy consumed. This is speculative but plausible. Google reported that DeepMind's AI reduced data center cooling energy by 40% — a legitimate offset. But most LLM applications (chatbots, content generation, coding assistants) don't directly reduce energy consumption elsewhere.

Comparison to other industries. The entire AI industry's energy consumption is currently a small fraction of other industries. Global data centers use about 1-2% of world electricity. Cryptocurrency mining alone uses more energy than most AI training. The question is whether AI's energy growth rate is sustainable, not whether its current level is problematic.

Water Consumption

Energy is not the only environmental cost. Li et al. (2023) estimated that training GPT-3 consumed approximately 700,000 liters of fresh water for cooling — enough to fill 370 standard bathtubs. And inference continues to consume water: Microsoft's water consumption increased by 34% year-over-year in 2022, driven largely by AI workloads. Data centers in water-stressed regions (Phoenix, parts of India) face growing tension between compute demand and local water supply.

The Rebound Effect

There's a subtle trap in efficiency improvements. Jevons paradox (also called the rebound effect) observes that when a technology becomes more efficient, total consumption often increases because the cheaper cost encourages more use. If inference becomes 4x more efficient, we don't use 4x less energy — we serve 10x more users. The per-query cost drops, but total energy consumption rises.

This has already played out with LLMs. GPT-3.5-Turbo was dramatically cheaper than GPT-3, which enabled ChatGPT to serve hundreds of millions of users. The per-query energy dropped, but total compute consumed by ChatGPT dwarfs what GPT-3 ever used. Efficiency gains are necessary but not sufficient for controlling environmental impact. Policy decisions about deployment scale matter too. Should every Google search trigger an LLM inference? Should every email have an AI-generated reply suggestion? Each product decision has an energy cost multiplied by billions of daily users.

The honest answer: the environmental cost of LLMs is real, measurable, and growing. But it's also a solvable engineering problem. The question is whether we solve it fast enough, and whether efficiency gains are captured as savings or reinvested as scale. History suggests the latter — but public pressure and regulation can change the calculus when the incentives are right.

Reporting and Transparency

A growing movement advocates for mandatory environmental reporting in AI research. Strubell et al. proposed that every ML paper should report its compute budget and estimated carbon emissions, just as clinical trials report adverse events. Some conferences (NeurIPS, EMNLP) now encourage or require compute reporting in paper submissions. The ML Emissions Calculator (mlco2.github.io) provides a tool for estimating carbon footprint from hardware, runtime, and electricity source.

Transparency matters because it enables informed decision-making. If researchers know that method A costs 10x more compute than method B for a 2% accuracy improvement, they can make a conscious tradeoff. Without reporting, the default is to use as much compute as available, with no consideration of environmental cost. The simple act of measuring and reporting creates accountability.

Some organizations have gone further. Hugging Face now displays estimated carbon emissions on model cards. Google's 2023 environmental report broke down AI-specific energy consumption for the first time. These are small steps, but they establish a norm of transparency that makes hiding environmental costs increasingly difficult. As the industry matures, environmental reporting will likely become as standard as reporting accuracy metrics — something that seems obvious in hindsight but required deliberate advocacy to establish.

The biggest environmental lever is inference, not training. A model trained once but deployed to billions of users consumes far more energy in inference than in training. Techniques like quantization (INT8/INT4), knowledge distillation, and speculative decoding reduce inference cost by 2-10x and are the most impactful environmental interventions.

Chapter 6: Governance & Policy

You ship an LLM product. A user uses it to generate targeted harassment. Another user tricks it into providing instructions for illegal activity. A third user's personal data appears in the model's outputs. Who is responsible? The model developer who trained the model? The company that deployed it? The user who misused it? The cloud provider that hosted the compute?

The answer depends on where you are, what laws apply, and how the AI governance framework assigns responsibility. And right now, most of those frameworks are still being written. The gap between what AI can do and what AI governance covers is growing, not shrinking. Every month brings new capabilities; legislation moves in years.

This chapter surveys the rapidly evolving governance landscape — what exists today, what's being proposed, and what fundamental questions remain unanswered.

The Governance Landscape

AI governance is evolving rapidly across multiple jurisdictions:

The EU AI Act (2024) is the most comprehensive AI regulation to date. It categorizes AI systems by risk level: unacceptable risk (social scoring, real-time biometric surveillance) is banned outright, high-risk systems (hiring, credit scoring, medical devices) require conformity assessments and auditing, and limited-risk systems (chatbots) require transparency labels. Foundation models (called "general-purpose AI" in the Act) face additional obligations including technical documentation, copyright compliance, and reporting of training data.

The US Executive Order on AI (2023) took a lighter-touch approach focused on reporting requirements. Companies training models above a compute threshold must report to the government, conduct red-teaming, and share safety test results. It established standards rather than bans.

China's Interim Measures (2023) require that generative AI services align with "core socialist values," undergo security assessments before launch, and use appropriately sourced training data. They represent the most direct content-control approach. Notably, China has moved faster than any other jurisdiction in implementing AI regulation — releasing binding rules within months of ChatGPT's launch, while the EU AI Act took years of negotiation.

Model Cards and Transparency

Mitchell et al. (2019) proposed model cards — standardized documentation that accompanies a trained model and discloses its intended use, training data, performance across demographic groups, and known limitations. Think of it like a nutrition label for AI. A model card for a sentiment classifier might report: "Accuracy is 92% overall, but drops to 84% on African American Vernacular English. Not recommended for content moderation without dialect-aware preprocessing."

Model cards are now standard practice at Google, Meta, and Hugging Face. The EU AI Act essentially mandates them for high-risk systems under the technical documentation requirement. But compliance varies wildly. Some model cards are detailed and honest; others are perfunctory checklists that fail to disclose meaningful limitations.

A good model card includes:

Section	What It Contains
Model details	Architecture, parameters, training data sources, dates, version
Intended use	Primary use cases, out-of-scope uses, known misuse vectors
Performance	Accuracy broken down by demographic group, language, and domain
Limitations	Known failure modes, biases, domains where the model should not be used
Ethical considerations	Training data bias analysis, environmental cost, privacy implications

Decision Points in AI Governance

The simulation below models a simplified AI incident response tree. An AI system causes harm — who investigates, what remedies apply, and how policy responds? Walk through the decision tree to see how different governance frameworks handle the same incident.

AI Incident Governance Decision Tree

An AI system causes harm. Click choices to walk through the governance process under different frameworks.

Open Questions

Several fundamental governance questions remain unresolved:

Liability. When an LLM gives harmful medical advice, is the developer liable (they built the model), the deployer (they made it available), or the user (they relied on it)? Current law is ambiguous. Section 230 in the US protects platforms from user-generated content liability, but LLM outputs are not user-generated — they're model-generated. This legal gray area is being tested in courts right now.

The liability question becomes even more complex with fine-tuned models. If Company A releases a base model and Company B fine-tunes it for medical advice, who is liable when a patient is harmed? Company A provided the capability. Company B provided the application. The patient had no relationship with either company's training process. Traditional product liability law wasn't designed for this kind of layered, fine-tuned software supply chain.

Open source vs. closed. Open-source models (LLaMA, Mistral) can be modified and deployed by anyone, making governance harder. You can fine-tune away safety guardrails in a single afternoon with a consumer GPU. This has already happened: within days of LLaMA's leak, uncensored fine-tunes appeared online with all safety training removed.

Proponents argue openness enables research, competition, and democratic access to AI. Closed models concentrate power in a few large companies. Open-source models let universities, startups, and developing countries participate in AI. Critics argue it enables harm at scale: once a model is released, you can't take it back. The EU AI Act treats open-source models with fewer obligations, acknowledging the tradeoff — but many experts argue this creates a governance loophole.

The debate mirrors the broader technology dual-use dilemma. Encryption technology, gene-editing tools, and open-source software have all faced similar questions. History suggests that the benefits of openness tend to outweigh the risks, but that conclusion is not guaranteed for AI, where the potential for automated harm at scale is qualitatively different from previous technologies. A malicious actor with an open-source LLM can generate millions of phishing emails, create targeted harassment at scale, or produce misinformation campaigns — all without needing any expertise beyond a GPU and a prompt.

International coordination. AI is global. A model trained in the US, deployed in Europe, and used in Asia falls under multiple jurisdictions with conflicting rules. There is no equivalent of the WTO for AI governance. The G7's Hiroshima AI Process and the UN's advisory body are early attempts at coordination, but binding international AI law is years away.

The coordination problem is compounded by competitive dynamics. If the EU imposes strict regulation and the US does not, companies might relocate development to the US. If both regulate but China does not, all three worry about falling behind. This "race to the bottom" dynamic — where each jurisdiction fears regulating more strictly than its competitors — is one of the most significant obstacles to effective AI governance. It's the same dynamic that hampers climate regulation: the costs of regulation are local, but the benefits are global.

Speed mismatch. AI capabilities advance in months. Legislation takes years. By the time a law addresses a specific AI risk, the technology has moved on. This is why many experts advocate for principles-based regulation (general rules that adapt to new technology) rather than rules-based regulation (specific prohibitions that become outdated).

Comparative Governance Approaches

Framework	Approach	Strength	Weakness
EU AI Act	Risk-based mandatory regulation	Clear obligations, strong enforcement	Slow to update, may stifle innovation
US EO 14110	Reporting + voluntary commitments	Flexible, industry-friendly	Weak enforcement, relies on goodwill
China Measures	Content control + security review	Fast enforcement	Limits open research, political bias
UK Pro-Innovation	Sector-specific, no central law	Adapts to each domain	Fragmented, gaps between sectors
Voluntary (NIST, ISO)	Standards and best practices	Expert-driven, technically detailed	Non-binding, uneven adoption

The hardest governance problem is the speed mismatch. AI capabilities advance in months; legislation takes years. Principles-based regulation ("AI systems must be transparent and non-discriminatory") adapts better than rules-based regulation ("chatbots must display a disclaimer within 3 seconds") because the principles survive technological change.

Why does the EU AI Act use a risk-based categorization system rather than banning all AI?

Because different AI applications pose different levels of risk, so regulation should be proportional — a chatbot needs less oversight than a medical diagnostic system Because the EU doesn't have the authority to ban technology Because risk-based regulation is cheaper to enforce

Chapter 7: Connections

Every technical choice you make as an ML engineer has social consequences. The mapping is not always obvious — a decision about model architecture can affect energy consumption, a choice of training data can encode bias, and a deployment decision can enable misinformation. Most engineers focus on the technical dimension: accuracy, latency, throughput. This chapter makes the social dimension explicit, because ignoring it doesn't make it go away — it just means you'll be surprised when the consequences arrive.

Technical Choices → Social Impacts

Technical Choice	Social Impact	Mechanism
Training data source	Bias, privacy	Internet text encodes cultural stereotypes and contains personal information
Model size	Environmental cost, memorization	Larger models consume more energy and memorize more training data verbatim
Fine-tuning with RLHF	Toxicity reduction, value alignment	Human feedback shapes model behavior but encodes annotators' values
Open-sourcing weights	Democratization vs. misuse	Anyone can use/improve the model, but also remove safety guardrails
Quantization / distillation	Reduced environmental cost	Smaller models use less energy per query, and inference dominates total cost
Safety filtering	Content moderation tradeoffs	Filters reduce harmful outputs but can disproportionately affect certain dialects
Watermarking	Misinformation detection	Enables identification of machine-generated text, but requires universal adoption
Embedding design	Representational bias	Geometric relationships in embedding space encode and propagate stereotypes
Context window size	Privacy risk	Longer contexts allow more personal data to be processed in a single query
Deployment region	Regulatory compliance, energy mix	Different jurisdictions have different laws; different grids have different carbon intensity

Connecting to Other Lectures

These impacts don't exist in isolation. They connect directly to the technical content we've covered:

L08: Post-training

RLHF and Constitutional AI are the primary technical tools for reducing toxicity and aligning model behavior with human values. The quality of this process determines whether the model amplifies or mitigates the biases in its pretraining data.

↓

L14: Reasoning & Agents

Agent systems that take autonomous actions amplify all social risks: a biased agent makes biased decisions automatically, a hallucinating agent spreads misinformation at scale, and an agent with access to personal data can violate privacy without human oversight.

The Dual-Use Problem

Almost every NLP capability has both beneficial and harmful applications. Summarization helps busy professionals but can distort nuance. Translation breaks language barriers but can propagate errors in legal or medical contexts. Code generation accelerates development but can produce insecure code at scale. Sentiment analysis helps product teams but enables mass surveillance of public opinion.

This is the dual-use problem: the same technology serves both good and harmful purposes, and the technology itself doesn't distinguish between them. A model that generates persuasive text is equally useful for writing marketing copy and propaganda. A model that summarizes documents is equally useful for summarizing medical records and manufacturing fake evidence.

There is no purely technical solution to dual-use. It requires governance, deployment decisions, and ongoing monitoring — which brings us back to the policy frameworks in Chapter 6. The technical and the political are inseparable.

The Responsibility Gradient

Not every ML practitioner has the same leverage. A researcher choosing training data has more influence over bias than an engineer optimizing inference speed. A company executive deciding deployment scope has more influence over environmental cost than a junior developer writing unit tests. But everyone in the pipeline has some influence, and the cumulative effect of many small decisions is large.

The technical and the social are not separate domains. Every architectural decision — from embedding dimension to training data curation to deployment infrastructure — has downstream consequences for real people. The best ML engineers understand both the math and the impact.

What You Can Do

Role	High-Impact Actions
Researcher	Report environmental cost in papers. Audit training data for bias. Publish failure modes, not just successes.
Engineer	Implement fairness metrics alongside accuracy. Use quantized models when possible. Add output filtering layers.
Product Manager	Define deployment boundaries. Require bias audits before launch. Design user consent flows for data use.
Executive	Invest in safety teams proportional to capability teams. Choose renewable-powered data centers. Support regulation.

A Checklist for Responsible Deployment

Before deploying an NLP system, every team should answer these questions:

Data Audit

What's in the training data? Who created it? What biases might it contain? Were creators compensated or notified?

↓

Fairness Evaluation

Does the model perform equally across demographic groups? Have you tested with WinoBias, BBQ, or TruthfulQA?

↓

Safety Testing

Has the model been red-teamed? Are there output filters? Is there a jailbreak monitoring system?

↓

Environmental Accounting

What's the estimated carbon cost? Is inference-optimized (quantized, distilled)? What grid powers the deployment?

↓

Governance

Is there a model card? An incident response plan? Compliance with applicable regulation (EU AI Act, etc.)?

Ethics is not a separate module bolted onto engineering — it is engineering. The choice of training data is a values decision. The choice of model size is an environmental decision. The choice of deployment scope is a governance decision. Understanding these connections is what separates a technician from an engineer.

"We shape our tools, and thereafter our tools shape us." — Marshall McLuhan

The language models we build today will shape how billions of people think, write, and make decisions. Understanding their social impacts is not optional — it is the most important competency for any ML engineer deploying systems at scale.