第三次课
自然语言处理(NLP)技术讲义
课程时长:3 小时 | 面向对象:计算机/AI 方向学生
课程目标:理解 NLP 是什么 → 掌握核心基础技术 → 了解前沿应用
目录
- 什么是 NLP?
- 文本预处理基础
- 2.1 分词(Tokenization)
- 2.2 词性标注(POS Tagging)
- 2.3 命名实体识别(NER)
- 2.4 句法分析(Syntactic Parsing)
- 文本表示
- 3.1 词袋模型(Bag of Words)
- 3.2 TF-IDF
- 3.3 词向量(Word2Vec / GloVe)
- 经典 NLP 任务
- 4.1 情感分析
- 4.2 文本分类
- 4.3 机器翻译
- 深度学习与 NLP
- 5.1 RNN / LSTM
- 5.2 Attention 机制
- 5.3 Transformer
- 预训练语言模型(PLM)
- 6.1 BERT
- 6.2 GPT 系列
- 前沿应用
- 7.1 大语言模型(LLM)与 RAG
- 7.2 多模态 NLP
- 7.3 智能 Agent
- 总结与展望
1. 什么是 NLP?
自然语言处理(Natural Language Processing,NLP) 是人工智能与语言学交叉的学科,目标是让计算机能够理解、分析、生成人类语言。
NLP 的核心挑战
- 歧义性:同一句话在不同语境下含义不同("苹果发布新品" — 是苹果公司还是水果?)
- 多样性:方言、俚语、缩写无处不在
- 长距离依赖:句子中相距很远的词可能存在语法关联
- 世界知识:理解语言有时需要大量的常识和背景知识
NLP 技术发展脉络
规则系统(1950s-1980s)
↓
统计机器学习(1990s-2010s):HMM、CRF、SVM
↓
深度学习(2013-2017):Word2Vec、CNN、RNN/LSTM
↓
预训练模型(2018-2020):BERT、GPT-2
↓
大语言模型时代(2020-至今):GPT-4、LLaMA、Claude
2. 文本预处理基础
⏱️ 预计时间:40 分钟
2.1 分词(Tokenization)
分词是将连续文本切分成有意义的最小单元(Token)的过程。
- 英文:以空格和标点为自然边界,相对简单
- 中文:没有天然分隔符,是 NLP 的核心挑战之一
常见分词方法
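| Method | Description | Tools |
|---|---|---|
| Dictionary-based (e.g. forward maximum matching) | Scan left to right and take the longest dictionary match | jieba (uses a prefix dictionary with a maximum-probability path rather than plain maximum matching) |
| Statistical models (HMM/CRF) | Use contextual probabilities; handle ambiguity and unseen words better | jieba (HMM for unseen words), pkuseg (CRF) |
| Subword tokenization (BPE/WordPiece) | Split words into smaller units to address the out-of-vocabulary (OOV) problem | HuggingFace tokenizers |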
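To make the dictionary-based row concrete, here is a minimal sketch of forward maximum matching in plain Python. The function name `forward_max_match` and the toy dictionaries are illustrative assumptions, not part of jieba or any other library.

```python
def forward_max_match(text, vocab, max_len=None):
    """Greedy left-to-right segmentation: at each position take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy dictionaries, for illustration only
toy_vocab = {"自然", "语言", "自然语言", "处理", "自然语言处理", "人工智能", "重要", "分支"}
print(forward_max_match("自然语言处理是人工智能的重要分支", toy_vocab))

# Greedy matching can go wrong: "研究生" is matched before "生命" gets a chance.
ambiguous_vocab = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", ambiguous_vocab))
```

The second call shows why purely greedy matching is rarely used on its own: it has no way to weigh competing segmentations, which is the gap the statistical methods in the table aim to close.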
Python 示例:中文分词
# 安装:pip install jieba
import jieba
import jieba.posseg as pseg
# 基础分词
text = "自然语言处理是人工智能的重要分支,让计算机能理解人类语言。"
# 精确模式(默认)
seg_result = jieba.cut(text, cut_all=False)
print("精确模式:", "/ ".join(seg_result))
# 全模式(所有可能的分词)
seg_all = jieba.cut(text, cut_all=True)
print("全模式: ", "/ ".join(seg_all))
# 搜索引擎模式(在精确模式基础上对长词再次切分)
seg_search = jieba.cut_for_search(text)
print("搜索模式:", "/ ".join(seg_search))
# 添加自定义词典
jieba.add_word("自然语言处理", freq=1000, tag='n')
print("加入自定义词后:", "/ ".join(jieba.cut(text)))
# Subword tokenization examples: WordPiece (BERT) and byte-level BPE (GPT-2)
# Install: pip install transformers
from transformers import BertTokenizer, AutoTokenizer
# BERT's WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text = "我爱自然语言处理"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print("WordPiece Tokens:", tokens)
print("Token IDs: ", token_ids)
# GPT 风格的 BPE 分词
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
en_text = "Natural language processing is fascinating!"
gpt_tokens = gpt_tokenizer.tokenize(en_text)
print("\nGPT BPE Tokens:", gpt_tokens)
2.2 词性标注(POS Tagging)
词性标注(Part-of-Speech Tagging)为每个词分配语法类别,如名词(n)、动词(v)、形容词(a)等。
常见标注集
- Penn Treebank(英文):NN(名词)、VB(动词)、JJ(形容词)...
- 北大标准(中文):n(名词)、v(动词)、a(形容词)、r(代词)...
Python 示例:词性标注
# 中文词性标注(jieba)
import jieba.posseg as pseg
text = "今天天气很好,我们去公园散步吧。"
words = pseg.cut(text)
print("Word\t\tTag\t\tMeaning")
print("-" * 40)
pos_map = {
    'n': 'noun', 'v': 'verb', 'a': 'adjective',
    'r': 'pronoun', 'd': 'adverb', 'p': 'preposition',
    'm': 'numeral', 'q': 'classifier', 'c': 'conjunction',
    'x': 'non-morpheme / punctuation', 'w': 'punctuation'
}
for word, flag in words:
    print(f"{word}\t\t{flag}\t\t{pos_map.get(flag, flag)}")
# English POS tagging (NLTK + spaCy)
# Install: pip install nltk spacy
#          python -m spacy download en_core_web_sm
import nltk
import spacy
from nltk import pos_tag, word_tokenize
# Resource names differ across NLTK versions: newer releases expect
# "punkt_tab" and "averaged_perceptron_tagger_eng".
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)
# NLTK approach
sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)
print("NLTK POS Tags:")
for word, tag in tagged:
    print(f"  {word:15} → {tag}")
# spaCy approach (richer output, including dependency relations)
print("\nspaCy POS Tags:")
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(f"  {token.text:15} → {token.pos_:6} ({token.tag_})")
2.3 命名实体识别(NER)
命名实体识别(Named Entity Recognition)识别文本中的专有名词,如人名、地名、机构名、时间、数字等。
实体类别
| 类别 | 标签 | 示例 |
|---|---|---|
| 人名 | PER | 马云、乔布斯 |
| 地名 | LOC | 北京、纽约 |
| 机构名 | ORG | 阿里巴巴、MIT |
| 时间 | TIME | 2024年3月 |
| 数字 | NUM | 500亿美元 |
标注格式:BIO 标注
B-PER = 人名实体的开始(Beginning)
I-PER = 人名实体的内部(Inside)
O = 非实体(Outside)
"马 云 创 立 了 阿 里 巴 巴"
B-PER I-PER O O O B-ORG I-ORG I-ORG I-ORG
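Downstream code usually needs whole entities rather than per-character tags. The following is a minimal sketch of decoding a BIO sequence into entity spans; `bio_to_spans` is a hypothetical helper name, and the input is the example sentence from above.

```python
def bio_to_spans(chars, tags):
    """Collect (entity_text, entity_type) pairs from a BIO-tagged sequence."""
    spans, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append(("".join(current), current_type))
    return spans

chars = list("马云创立了阿里巴巴")
tags = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
print(bio_to_spans(chars, tags))
```

An `I-` tag that does not continue an open entity of the same type is treated as `O` here; other conventions (for example, promoting it to `B-`) are equally defensible.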
Python 示例:命名实体识别
# 英文 NER(spaCy)
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
doc = nlp(text)
print("Named entities:")
print("-" * 50)
for ent in doc.ents:
    print(f"  Entity: {ent.text:20} Type: {ent.label_:10} Meaning: {spacy.explain(ent.label_)}")
# 可视化(在 Jupyter 中运行)
# from spacy import displacy
# displacy.render(doc, style="ent")
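# NER with HuggingFace pretrained models
# Install: pip install transformers torch
from transformers import pipeline
# A base checkpoint such as "hfl/chinese-roberta-wwm-ext" ships without a trained
# token-classification head, so it must first be fine-tuned on NER data (or swapped
# for a checkpoint already fine-tuned for Chinese NER) before its labels are meaningful.
# The English model below is already fine-tuned for NER and is used for demonstration.
model_name = "dslim/bert-base-NER"
ner_pipeline = pipeline(
    "token-classification",
    model=model_name,
    aggregation_strategy="simple"
)
text = "My name is Wolfgang and I live in Berlin"
for ent in ner_pipeline(text):
    print(f"{ent['word']:20}\t{ent['entity_group']}\t{ent['score']:.3f}")
# The same model through the lower-level API
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Predicted label for each token
predictions = torch.argmax(outputs.logits, dim=2)
predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Token\t\t\tLabel")
print("-" * 40)
for token, label in zip(tokens, predicted_labels):
    if not token.startswith("##"):  # skip WordPiece continuation pieces
        print(f"{token:20}\t{label}")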
2.4 句法分析(Syntactic Parsing)
句法分析揭示句子的语法结构,分为两大类:
- 成分句法分析(Constituency Parsing):将句子解析为层级的短语结构树
- 依存句法分析(Dependency Parsing):分析词与词之间的依存关系
依存关系示意
"小明 爱 读书"
↓
爱 ── nsubj ──→ 小明
爱 ── dobj ───→ 读书
(arrows point from the head word to its dependent)
Python 示例:句法分析
# 依存句法分析(spaCy)
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = "The cat sat on the mat and watched the birds."
doc = nlp(sentence)
print("Dependency parse:")
print(f"{'Token':12} {'POS':8} {'Head':12} {'Relation':12}")
print("-" * 55)
for token in doc:
    print(f"{token.text:12} {token.pos_:8} {token.head.text:12} {token.dep_:12}")
# Find the predicate, subject and object of the sentence
print("\nCore sentence structure:")
for token in doc:
    if token.dep_ == "ROOT":
        print(f"  Predicate (ROOT): {token.text}")
    elif token.dep_ == "nsubj":
        print(f"  Subject (nsubj): {token.text}")
    elif token.dep_ == "dobj":
        print(f"  Object (dobj): {token.text}")
# Constituency parsing with benepar (used here as a spaCy pipeline component)
# Install: pip install benepar
import benepar
import spacy
# Download the model on first run
# benepar.download('benepar_en3')
nlp = spacy.load("en_core_web_sm")
if spacy.__version__.startswith('3'):
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
sentence = "The quick brown fox jumps over the lazy dog."
doc = nlp(sentence)
for sent in doc.sents:
    print("Constituency tree:")
    print(sent._.parse_string)
    # Example output (abridged): (S (NP (DT The) (JJ quick) (JJ brown) (NN fox))
    #   (VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))
3. 文本表示
⏱️ 预计时间:35 分钟
将文本转换为计算机可处理的数值表示,是所有 NLP 任务的基础。
3.1 词袋模型(Bag of Words,BoW)
词袋模型将文本表示为词频向量,忽略词序和语法结构。
Document 1: "我 爱 北京 天安门" → [1, 1, 1, 1, 0, 0, 0]
Document 2: "天安门 上 太阳 升" → [0, 0, 1, 0, 1, 1, 1]
Vocabulary: [我, 爱, 天安门, 北京, 上, 太阳, 升]
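The count vectors can be reproduced by hand in a few lines. This is a minimal sketch assuming the text is already segmented; `bow_vector` is an illustrative name, and the vocabulary lists all seven distinct words of the two example documents so that every token has a column.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one tokenized document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

docs = [
    ["我", "爱", "北京", "天安门"],
    ["天安门", "上", "太阳", "升"],
]
vocab = ["我", "爱", "天安门", "北京", "上", "太阳", "升"]

for doc in docs:
    print(" ".join(doc), "→", bow_vector(doc, vocab))
```

Swapping the word order inside a document leaves its vector unchanged, which is exactly the information a bag-of-words model throws away.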
Python 示例:词袋模型
from sklearn.feature_extraction.text import CountVectorizer
# 示例文档集(英文)
corpus = [
"I love natural language processing",
"Natural language processing is amazing",
"Machine learning is part of AI",
"Deep learning improves NLP tasks",
"I love machine learning and AI"
]
# 构建词袋模型
vectorizer = CountVectorizer(
max_features=20, # 最多保留 20 个词
stop_words='english' # 去除停用词
)
X = vectorizer.fit_transform(corpus)
# Inspect the vocabulary
print("Vocabulary (at most 20 terms, stop words removed):")
vocab = vectorizer.get_feature_names_out()
print(list(vocab))
# 查看文档向量
import pandas as pd
df = pd.DataFrame(X.toarray(), columns=vocab)
print("\n文档-词矩阵:")
print(df)
# 问题:向量稀疏,高维,且忽略词序
print(f"\n向量维度: {X.shape}")
print(f"稀疏度: {1 - X.nnz / (X.shape[0] * X.shape[1]):.2%}")
3.2 TF-IDF
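TF-IDF (term frequency-inverse document frequency) starts from term frequency and down-weights words that appear in many documents.
TF(t, d) = frequency of term t in document d
IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t
TF-IDF(t, d) = TF(t, d) × IDF(t)
Note: scikit-learn's TfidfVectorizer defaults to a smoothed variant, IDF(t) = ln((1 + N) / (1 + df(t))) + 1, and L2-normalizes each document vector, so its scores differ from the textbook formula.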
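The formulas above can be checked by hand on a tiny corpus. The sketch below uses the unsmoothed textbook definition with a natural logarithm (the base is an assumption; any base gives the same ranking). The three toy documents and the function names are made up for illustration.

```python
import math

def tf(term, doc):
    """Relative frequency of a term within one tokenized document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    ["nlp", "is", "fun"],
    ["nlp", "is", "hard"],
    ["cats", "are", "fun"],
]

# "hard" occurs in one document, "nlp" in two, so "hard" gets the larger weight.
print(f"tf-idf('hard', doc2) = {tf_idf('hard', docs[1], docs):.4f}")
print(f"tf-idf('nlp',  doc2) = {tf_idf('nlp', docs[1], docs):.4f}")
```

A term that appears in every document gets IDF = log(1) = 0 under this definition, so it contributes nothing to the vector.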
Python 示例:TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
corpus = [
"I love natural language processing and NLP",
"Natural language processing is amazing and powerful",
"Machine learning is a core part of AI research",
"Deep learning greatly improves NLP performance",
"I love machine learning, deep learning, and AI"
]
# TF-IDF 向量化
tfidf = TfidfVectorizer(max_features=15, stop_words='english')
X = tfidf.fit_transform(corpus)
feature_names = tfidf.get_feature_names_out()
# Show the highest-scoring terms in each document
print("Top TF-IDF terms per document:")
print("=" * 60)
for i, doc in enumerate(corpus):
    print(f"\nDocument {i+1}: {doc[:40]}...")
    scores = X[i].toarray()[0]
    top_indices = np.argsort(scores)[::-1][:3]
    for idx in top_indices:
        if scores[idx] > 0:
            print(f"  {feature_names[idx]:20} → TF-IDF: {scores[idx]:.4f}")
# TF-IDF 用于文档相似度计算
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(X)
print("\n\n文档相似度矩阵:")
import pandas as pd
labels = [f"doc{i+1}" for i in range(len(corpus))]
sim_df = pd.DataFrame(sim_matrix, index=labels, columns=labels)
print(sim_df.round(3))
3.3 词向量(Word Embeddings)
词向量将词语映射到低维连续向量空间,语义相似的词在向量空间中距离更近。
Word2Vec 的核心思想
"你能通过一个词的近邻来了解它" —— J.R. Firth (1957)
两种训练模式:
- CBOW(Continuous Bag of Words):用上下文预测目标词
- Skip-gram:用目标词预测上下文
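The difference between the two modes is easiest to see in the training examples each one extracts from a sentence. This is a minimal sketch of the example-generation step only (no neural network, no negative sampling); the function names and the toy sentence are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the whole context window is the input, the center word is the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "cat", "sat", "on", "mat"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces several examples per position and CBOW produces one. That is part of why Skip-gram is often said to do better on rare words while CBOW trains faster.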
Python 示例:Word2Vec
# 安装:pip install gensim
from gensim.models import Word2Vec
import numpy as np
# 训练数据(实际场景需要大量语料)
sentences = [
["king", "man", "woman", "queen", "royal", "crown"],
["doctor", "hospital", "nurse", "medicine", "patient"],
["computer", "software", "hardware", "program", "code"],
["cat", "dog", "animal", "pet", "bird"],
["paris", "france", "london", "england", "berlin", "germany"],
["good", "great", "excellent", "wonderful", "amazing"],
["bad", "terrible", "awful", "horrible", "poor"],
["king", "queen", "prince", "princess", "royal"],
["man", "woman", "boy", "girl", "person"],
]
# 训练 Word2Vec 模型
model = Word2Vec(
sentences,
vector_size=50, # 词向量维度
window=3, # 上下文窗口大小
min_count=1, # 最少出现次数
workers=4, # 并行线程
epochs=100, # 训练轮数
sg=1 # 1=Skip-gram, 0=CBOW
)
# 查看词向量
print("'king' 的词向量(前 10 维):")
print(model.wv['king'][:10])
# Most similar words
print("\nWords most similar to 'king':")
similar = model.wv.most_similar('king', topn=5)
for word, score in similar:
    print(f"  {word:15} → similarity: {score:.4f}")
# Vector arithmetic: king - man + woman ≈ queen
# (on a corpus this small the result is noisy; the analogy only emerges with large corpora)
print("\nVector arithmetic: king - man + woman = ?")
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=3
)
for word, score in result:
    print(f"  {word:15} → similarity: {score:.4f}")
# 使用预训练的大规模词向量
# import gensim.downloader as api
# glove_model = api.load("glove-wiki-gigaword-100") # 下载 GloVe
# print(glove_model.most_similar('artificial', topn=5))
# 词向量可视化(t-SNE 降维)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['SimHei'] # 支持中文
words_to_plot = ['king', 'queen', 'man', 'woman', 'doctor', 'nurse',
'cat', 'dog', 'paris', 'london', 'good', 'bad']
# 只保留模型中存在的词
words_to_plot = [w for w in words_to_plot if w in model.wv]
vectors = np.array([model.wv[w] for w in words_to_plot])
# t-SNE 降维到 2D
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
coords = tsne.fit_transform(vectors)
# 绘图
plt.figure(figsize=(10, 7))
plt.scatter(coords[:, 0], coords[:, 1], s=100, alpha=0.7, c='steelblue')
for i, word in enumerate(words_to_plot):
    plt.annotate(word, (coords[i, 0], coords[i, 1]),
                 fontsize=12, ha='right', va='bottom')
plt.title("词向量 t-SNE 可视化", fontsize=14)
plt.xlabel("t-SNE 维度 1")
plt.ylabel("t-SNE 维度 2")
plt.tight_layout()
plt.savefig("word_embeddings_tsne.png", dpi=150)
plt.show()
4. 经典 NLP 任务
⏱️ 预计时间:30 分钟
4.1 情感分析(Sentiment Analysis)
判断文本所表达的情感倾向(正面/负面/中性),是最常见的 NLP 应用之一。
Python 示例:情感分析
# 方法一:基于词典的情感分析
# 安装:pip install textblob
from textblob import TextBlob
reviews = [
"This movie is absolutely amazing! I loved every minute of it.",
"Terrible film, waste of time and money. Boring and predictable.",
"The food was okay, nothing special but not bad either.",
"Outstanding performance! Best concert I've ever attended.",
"Disappointing experience. The service was rude and slow."
]
print("Lexicon-based sentiment analysis:")
print("-" * 65)
for review in reviews:
    blob = TextBlob(review)
    polarity = blob.sentiment.polarity          # -1 (negative) to 1 (positive)
    subjectivity = blob.sentiment.subjectivity  # 0 (objective) to 1 (subjective)
    sentiment = "positive 😊" if polarity > 0.1 else "negative 😞" if polarity < -0.1 else "neutral 😐"
    print(f"Text: {review[:45]}...")
    print(f"  Sentiment: {sentiment}  polarity: {polarity:+.3f}  subjectivity: {subjectivity:.3f}")
    print()
# 方法二:基于 BERT 的情感分析
from transformers import pipeline
# 使用预训练的情感分析模型
sentiment_analyzer = pipeline(
"sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english"
)
reviews = [
"I absolutely love this product! It exceeded all my expectations.",
"This is the worst purchase I have ever made. Complete waste of money.",
"It's decent, does what it's supposed to do.",
]
# Note: this SST-2 model is binary, so neutral text is still forced into POSITIVE or NEGATIVE.
print("DistilBERT sentiment analysis:")
print("-" * 65)
results = sentiment_analyzer(reviews)
for review, result in zip(reviews, results):
    emoji = "😊" if result['label'] == 'POSITIVE' else "😞"
    print(f"Text: {review[:50]}...")
    print(f"  Prediction: {result['label']} {emoji}  confidence: {result['score']:.4f}")
    print()
# 方法三:从头训练情感分类器
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
import numpy as np
# 模拟训练数据(实际场景使用真实数据集,如 IMDB、SST-2)
positive_texts = [
"amazing wonderful great excellent superb",
"love enjoy happy delightful fantastic",
"outstanding brilliant magnificent impressive",
"beautiful perfect awesome incredible",
"best ever wonderful experience",
]
negative_texts = [
"terrible horrible awful disgusting worst",
"hate dislike disappointing poor bad",
"boring dull tedious frustrating annoying",
"ugly broken failed useless waste",
"nightmare disaster catastrophic failure",
]
texts = positive_texts + negative_texts
labels = [1] * len(positive_texts) + [0] * len(negative_texts)
# Split the data (stratified, so both classes appear in the tiny test set)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
# TF-IDF + logistic regression
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
clf = LogisticRegression()
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)
print("Classification report:")
print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=['negative', 'positive'], zero_division=0))
4.2 文本分类(Text Classification)
将文本自动归类到预定义类别,如新闻分类、垃圾邮件检测、主题分类等。
Python 示例:新闻分类
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# 加载 20 新闻组数据集(选 4 个类别)
categories = ['sci.space', 'rec.sport.baseball', 'comp.graphics', 'talk.politics.misc']
train_data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
test_data = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))
print(f"训练集大小: {len(train_data.data)}")
print(f"测试集大小: {len(test_data.data)}")
print(f"类别: {train_data.target_names}")
# 方法一:朴素贝叶斯分类器
nb_pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
('clf', MultinomialNB(alpha=0.1)),
])
nb_pipeline.fit(train_data.data, train_data.target)
nb_pred = nb_pipeline.predict(test_data.data)
print(f"\n朴素贝叶斯准确率: {accuracy_score(test_data.target, nb_pred):.4f}")
# 方法二:SVM 分类器(效果通常更好)
svm_pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
('clf', LinearSVC(C=1.0, max_iter=1000)),
])
svm_pipeline.fit(train_data.data, train_data.target)
svm_pred = svm_pipeline.predict(test_data.data)
print(f"SVM 准确率: {accuracy_score(test_data.target, svm_pred):.4f}")
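# Detailed classification report
# Use train_data.target_names: fetch_20newsgroups orders labels alphabetically,
# which differs from the order of the `categories` list above.
print("\nSVM classification report:")
print(classification_report(test_data.target, svm_pred, target_names=train_data.target_names))
# Predict new texts
new_texts = [
    "NASA launched a new rocket to explore Mars surface",
    "The pitcher threw a perfect game in last night's baseball match",
]
preds = svm_pipeline.predict(new_texts)
for text, pred in zip(new_texts, preds):
    print(f"\nText: {text}")
    print(f"Predicted category: {train_data.target_names[pred]}")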
4.3 机器翻译(Machine Translation)
发展历程
规则翻译(1950s)
↓
统计机器翻译 SMT(1990s):IBM模型、短语翻译
↓
神经机器翻译 NMT(2014):Seq2Seq + Attention
↓
Transformer 翻译(2017):Google Translate、DeepL
↓
大语言模型(2023+):多语言、零样本翻译
Python 示例:神经机器翻译
# Use a pretrained HuggingFace translation model
from transformers import MarianMTModel, MarianTokenizer
def translate(text, src_lang="en", tgt_lang="zh"):
    """Translate with a Helsinki-NLP OPUS-MT model (reloaded on every call; cache it for repeated use)."""
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    # Encode the input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    # Generate the translation with beam search
    outputs = model.generate(
        **inputs,
        num_beams=4,  # beam width
        max_length=512,
        early_stopping=True
    )
    # Decode the output
    translated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated
# Example: English to Chinese
sentences_en = [
    "Artificial intelligence is transforming the world.",
    "Natural language processing enables computers to understand human language.",
    "The future of AI looks incredibly promising.",
]
print("English → Chinese translation examples:")
print("-" * 60)
for sent in sentences_en:
    # Uncomment to run the real model:
    # translated = translate(sent, "en", "zh")
    print(f"Source: {sent}")
    print("Translation: [run after downloading the model]")
    print()
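The `num_beams` argument above selects beam search for decoding. As a rough illustration of the idea, here is a minimal beam search over a hand-written next-token table. `TOY_PROBS` is a made-up stand-in for a model, and the sketch omits the length penalty and other refinements that real decoders apply.

```python
import math

# Hypothetical toy "model": next-token probabilities that depend only on the last token.
TOY_PROBS = {
    "<s>": {"I": 0.6, "We": 0.4},
    "I": {"like": 0.55, "love": 0.45},
    "We": {"love": 0.9, "like": 0.1},
    "like": {"NLP": 0.6, "</s>": 0.4},
    "love": {"NLP": 0.9, "</s>": 0.1},
    "NLP": {"</s>": 1.0},
}

def beam_search(num_beams=2, max_len=5):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in TOY_PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            # Completed hypotheses leave the beam; the rest are expanded next step.
            (finished if tokens[-1] == "</s>" else beams).append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished

for width in (1, 2):
    tokens, score = beam_search(num_beams=width)[0]
    print(f"num_beams={width}: {' '.join(tokens)}  (p = {math.exp(score):.3f})")
```

With a beam of 1 (greedy decoding) the search commits to the locally best first word and ends up with a less probable sentence. A beam of 2 keeps the runner-up alive long enough to find the better one.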
5. 深度学习与 NLP
⏱️ 预计时间:35 分钟
5.1 RNN 与 LSTM
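A recurrent neural network (RNN) carries sequential information through a hidden state, which makes it a natural fit for sequence data. Plain RNNs suffer from vanishing gradients, however. LSTM mitigates this with gating, which makes long-range dependencies much easier to learn, although very long sequences remain difficult.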
LSTM 门控机制
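Forget gate      f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   ← decides what to discard
Input gate       i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   ← decides what to store
Output gate      o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   ← decides what to output
Candidate state  C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Cell state       C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Hidden state     h_t = o_t ⊙ tanh(C_t)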
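To see the gates interact, the equations can be run for a single time step with scalar states. This is a toy sketch, not how PyTorch implements LSTM: weights are plain numbers stored in a hypothetical `params` dict, and the candidate term inside the cell-state update is assumed to take the standard form tanh(W_C · [h_{t-1}, x_t] + b_C).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state and cell state.

    params maps a gate name to (w_h, w_x, b); "c" holds the candidate-state weights.
    """
    def gate(name, activation):
        w_h, w_x, b = params[name]
        return activation(w_h * h_prev + w_x * x_t + b)

    f_t = gate("f", sigmoid)          # forget gate
    i_t = gate("i", sigmoid)          # input gate
    o_t = gate("o", sigmoid)          # output gate
    c_tilde = gate("c", math.tanh)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# A strongly positive forget bias and a strongly negative input bias keep the cell
# state almost unchanged: this is how the cell carries information across many steps.
remember = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
            "o": (0.0, 0.0, 0.0), "c": (0.0, 1.0, 0.0)}
h, c = 0.0, 2.0
for x in [0.5, -1.0, 3.0]:
    h, c = lstm_step(x, h, c, remember)
print(f"cell state after 3 steps: {c:.4f}")
```

Because the cell state is updated additively and scaled by a forget gate close to 1, gradients flowing through it shrink far more slowly than in a plain RNN.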
Python 示例:LSTM 文本分类
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
# A simple bidirectional LSTM sentiment classifier
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,  # inter-layer dropout needs more than one layer
            bidirectional=True  # bidirectional LSTM
        )
        self.dropout = nn.Dropout(dropout)
        # A bidirectional LSTM doubles the output dimension
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x shape: (batch, seq_len)
        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate the last layer's forward and backward hidden states
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
        hidden = self.dropout(hidden)
        return self.fc(hidden)
# 模型超参数
VOCAB_SIZE = 10000
EMBED_DIM = 128
HIDDEN_DIM = 256
NUM_CLASSES = 2 # 正面 / 负面
BATCH_SIZE = 32
EPOCHS = 5
model = LSTMClassifier(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_CLASSES)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
print("LSTM 模型结构:")
print(model)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\n可训练参数量: {total_params:,}")
# One training epoch (needs a DataLoader over real data)
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss, correct = 0, 0
    for batch_x, batch_y in dataloader:
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()
        total_loss += loss.item()
        correct += (output.argmax(1) == batch_y).sum().item()
    # Returns (mean loss per batch, accuracy over the dataset)
    return total_loss / len(dataloader), correct / len(dataloader.dataset)
print("\n✅ LSTM 模型定义完成,可在真实数据集(如 IMDB)上训练")
5.2 Attention 机制
Attention 机制让模型在处理每个词时,能够动态关注输入序列中最相关的部分,解决了 Seq2Seq 中信息瓶颈问题。
Attention 计算公式
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
Q = Query(查询向量)
K = Key (键向量)
V = Value(值向量)
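Before the PyTorch version, it can help to evaluate the formula once with plain lists. The sketch below is a direct, unbatched transcription of softmax(QK^T / √d_k) · V with no masking or dropout. The tiny matrices are made up for illustration.

```python
import math

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

Q = [[1.0, 0.0]]              # one query
K = [[1.0, 0.0], [0.0, 1.0]]  # two keys
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
print("weights:", [round(x, 4) for x in w[0]])
print("output: ", [round(x, 4) for x in out[0]])
```

The query matches the first key more closely, so the first value vector gets the larger weight. Dividing by √d_k keeps the scores from growing with the vector dimension and saturating the softmax.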
Python 示例:Scaled Dot-Product Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention."""
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, Q, K, V, mask=None):
        d_k = Q.size(-1)
        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        # Mask (e.g. in the decoder, to hide future positions)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

class MultiHeadAttention(nn.Module):
    """Multi-head attention."""
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention(dropout)

    def split_heads(self, x):
        batch, seq_len, d_model = x.size()
        x = x.view(batch, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, heads, seq_len, d_k)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        x, attn = self.attention(Q, K, V, mask)
        # Merge the heads back together
        batch, heads, seq_len, d_k = x.size()
        x = x.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.W_o(x), attn
# 测试
d_model, num_heads, seq_len, batch = 512, 8, 10, 2
mha = MultiHeadAttention(d_model, num_heads)
Q = torch.randn(batch, seq_len, d_model)
K = torch.randn(batch, seq_len, d_model)
V = torch.randn(batch, seq_len, d_model)
output, attn_weights = mha(Q, K, V)
print(f"多头注意力输出形状: {output.shape}")
print(f"注意力权重形状: {attn_weights.shape}")
5.3 Transformer
Transformer 完全基于 Attention 机制,抛弃了 RNN 的顺序计算,实现了高效的并行训练,成为现代 NLP 的基础架构。
Transformer 架构
Input → Embedding + Positional Encoding
↓
┌────────────────────────────┐
│ Encoder (×N) │
│ Multi-Head Self-Attention │
│ Add & Norm │
│ Feed-Forward │
│ Add & Norm │
└────────────────────────────┘
↓
┌────────────────────────────┐
│ Decoder (×N) │
│ Masked Multi-Head Attn │
│ Add & Norm │
│ Cross-Attention │
│ Add & Norm │
│ Feed-Forward │
│ Add & Norm │
└────────────────────────────┘
↓
Linear + Softmax → Output
Python 示例:Transformer 核心组件
import torch
import torch.nn as nn
import math
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding: adds position information to every token embedding."""
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cos
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

class TransformerBlock(nn.Module):
    """A single Transformer encoder layer (uses MultiHeadAttention from section 5.2)."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual connection + layer norm
        attn_out, _ = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward + residual connection + layer norm
        ff_out = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

class TransformerClassifier(nn.Module):
    """A Transformer encoder for text classification."""
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, num_classes, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])
        self.pool = lambda x: x.mean(dim=1)  # global average pooling (padding positions included)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x, mask=None):
        x = self.pos_encoding(self.embedding(x))
        for layer in self.layers:
            x = layer(x, mask)
        x = self.pool(x)
        return self.fc(x)
# 构建模型
model = TransformerClassifier(
vocab_size=30000,
d_model=256,
num_heads=8,
num_layers=4,
d_ff=1024,
num_classes=2
)
print("Transformer 分类模型:")
print(model)
total_params = sum(p.numel() for p in model.parameters())
print(f"\n总参数量: {total_params:,} ({total_params/1e6:.2f}M)")
6. 预训练语言模型
⏱️ 预计时间:25 分钟
6.1 BERT
BERT (Bidirectional Encoder Representations from Transformers, Google, 2018) pre-trains a bidirectional Transformer encoder on large corpora and is then fine-tuned for each downstream task.
BERT 预训练任务
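- Masked language modeling (MLM): randomly select 15% of the tokens and train the model to predict the originals
  Input: "我 [MASK] 自然语言处理"  Target: predict [MASK] = "爱"
- Next sentence prediction (NSP): decide whether two sentences are adjacent in the source text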
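The MLM corruption step can be sketched in a few lines. In the BERT paper, a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The sketch below follows that recipe, with one simplification: each token is selected independently with probability `mask_prob`, rather than drawing an exact 15% of positions. `mask_tokens` and its arguments are illustrative names.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_inputs, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected
        labels[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

tokens = ["我", "爱", "自然", "语言", "处理"]
inputs, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
print("inputs:", inputs)
print("labels:", labels)
```

The loss is computed only at positions where `labels` is not `None`. Keeping some selected tokens unchanged or randomized discourages the model from relying on `[MASK]`, which never appears at fine-tuning time.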
Python 示例:BERT 微调文本分类
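from transformers import BertTokenizer, BertForSequenceClassification
from transformers import get_linear_schedule_with_warmup
import torch
from torch.optim import AdamW  # the AdamW bundled with transformers is deprecated and missing from recent releases
from torch.utils.data import DataLoader, TensorDataset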
# 加载预训练的 BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=2, # 二分类
hidden_dropout_prob=0.1
)
# Prepare the input data
def encode_texts(texts, labels, max_length=128):
    encodings = tokenizer(
        texts,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    return TensorDataset(
        encodings['input_ids'],
        encodings['attention_mask'],
        torch.tensor(labels)
    )
# 示例训练数据
train_texts = [
"This product is absolutely fantastic!",
"I hate this, it broke after one day.",
"Great quality and fast shipping.",
"Terrible customer service, avoid!",
]
train_labels = [1, 0, 1, 0]
dataset = encode_texts(train_texts, train_labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
# 微调配置
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=0,
num_training_steps=len(dataloader) * 3
)
# 单个训练步骤演示
model.train()
batch = next(iter(dataloader))
input_ids, attention_mask, labels = batch
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
logits = outputs.logits
print(f"训练 Loss: {loss.item():.4f}")
print(f"Logits 形状: {logits.shape}")
print(f"预测类别: {logits.argmax(dim=1).tolist()}")
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
print("\n✅ BERT fine-tuning demo finished!")
print("In practice: on datasets such as IMDB or SST-2, fine-tuned BERT is commonly reported at around 93% accuracy")
6.2 GPT 系列
GPT(Generative Pre-trained Transformer,OpenAI)是自回归语言模型,从左到右生成文本。
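| Model | Parameters | Highlights |
|---|---|---|
| GPT-1 (2018) | 117M | First demonstration that generative pre-training plus fine-tuning works |
| GPT-2 (2019) | 1.5B | Strong zero-shot text generation |
| GPT-3 (2020) | 175B | Few-shot, in-context learning |
| GPT-4 (2023) | Undisclosed (~1.8T is an unofficial estimate) | Multimodal input; markedly stronger reasoning |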
Python 示例:GPT-2 文本生成
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# 加载 GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
def generate_text(prompt, max_new_tokens=100, num_sequences=2,
                  temperature=0.8, top_p=0.9, top_k=50):
    """Generate text with GPT-2."""
    inputs = tokenizer(prompt, return_tensors="pt")  # input_ids plus attention_mask
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_return_sequences=num_sequences,
            temperature=temperature,  # randomness (lower = more conservative)
            top_p=top_p,              # nucleus sampling
            top_k=top_k,              # top-k sampling
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2    # discourage repetition
        )
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
# Text generation examples
prompts = [
    "Artificial intelligence will",
    "The future of natural language processing is",
]
for prompt in prompts:
    print(f"Prompt: {prompt}")
    print("-" * 60)
    texts = generate_text(prompt, max_new_tokens=60, num_sequences=1)
    for text in texts:
        print(text)
    print()
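The `temperature`, `top_k` and `top_p` arguments used above all reshape the next-token distribution before a token is drawn. The following is a rough standalone sketch of those three filters on a made-up four-token vocabulary. It applies them in one common order (temperature, then top-k, then top-p) and is not a copy of the library's implementation; `filtered_distribution` and `toy_logits` are illustrative names.

```python
import math

def _renormalize(pairs):
    total = sum(p for _, p in pairs)
    return [(tok, p / total) for tok, p in pairs]

def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature scaling, top-k and top-p filtering."""
    # 1) Temperature: divide the logits before the softmax (lower = sharper distribution)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    pairs = _renormalize([(tok, math.exp(v - peak)) for tok, v in scaled.items()])
    pairs.sort(key=lambda kv: kv[1], reverse=True)
    # 2) Top-k: keep only the k most probable tokens
    if top_k is not None:
        pairs = _renormalize(pairs[:top_k])
    # 3) Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in pairs:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        pairs = _renormalize(kept)
    return dict(pairs)

toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.0, "zebra": -3.0}
print("top_k=2:        ", filtered_distribution(toy_logits, top_k=2))
print("top_p=0.9:      ", filtered_distribution(toy_logits, top_p=0.9))
print("temperature=0.1:", filtered_distribution(toy_logits, temperature=0.1))
```

A token can then be drawn from the result with `random.choices(list(dist), weights=list(dist.values()))`. Top-k removes a fixed number of tail tokens, whereas top-p adapts the cut-off to how peaked the distribution is.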
7. 前沿应用
⏱️ 预计时间:25 分钟
7.1 大语言模型(LLM)与 RAG
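Retrieval-augmented generation (RAG) combines vector retrieval with a large language model. Grounding the answer in retrieved documents mitigates, but does not eliminate, two LLM weaknesses: stale knowledge and hallucination.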
用户问题
↓
向量化(Embedding)
↓
向量数据库检索(FAISS / Chroma)→ 相关文档片段
↓
Prompt 构建:[系统指令] + [检索到的上下文] + [用户问题]
↓
LLM 生成回答
Python 示例:简单 RAG 系统
# Install: pip install sentence-transformers faiss-cpu numpy
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
class SimpleRAG:
    """A minimal RAG system."""
    def __init__(self, embed_model="all-MiniLM-L6-v2"):
        self.embed_model = SentenceTransformer(embed_model)
        self.index = None
        self.documents = []

    def add_documents(self, docs: list[str]):
        """Add documents to the knowledge base."""
        self.documents.extend(docs)
        embeddings = self.embed_model.encode(docs, convert_to_numpy=True)
        embeddings = np.ascontiguousarray(embeddings, dtype='float32')
        if self.index is None:
            dim = embeddings.shape[1]
            self.index = faiss.IndexFlatIP(dim)  # inner-product index
        # After L2 normalization, inner product equals cosine similarity
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        print(f"✅ Added {len(docs)} documents; knowledge base now holds {len(self.documents)}")

    def retrieve(self, query: str, top_k=3):
        """Retrieve the most relevant documents."""
        query_emb = self.embed_model.encode([query], convert_to_numpy=True)
        query_emb = np.ascontiguousarray(query_emb, dtype='float32')
        faiss.normalize_L2(query_emb)
        scores, indices = self.index.search(query_emb, top_k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:  # -1 marks an empty slot when top_k exceeds the index size
                results.append({"document": self.documents[idx], "score": float(score)})
        return results

    def answer(self, query: str, top_k=3):
        """Retrieve, then build the prompt (a template stands in for a real LLM call)."""
        retrieved = self.retrieve(query, top_k)
        context = "\n".join([f"[{i+1}] {r['document']}" for i, r in enumerate(retrieved)])
        # In a real system, send this prompt to the OpenAI API or a local LLM
        prompt = (
            "根据以下参考信息回答问题。\n"
            f"参考信息:\n{context}\n"
            f"问题:{query}\n"
            "回答:"
        )
        return prompt, retrieved
# 构建知识库
rag = SimpleRAG()
knowledge_base = [
"BERT 是 Google 在 2018 年发布的双向预训练语言模型,使用 Masked LM 和 NSP 预训练任务。",
"GPT 是 OpenAI 开发的自回归语言模型系列,GPT-3 有 1750 亿参数。",
"Transformer 由 Google 于 2017 年提出,完全基于 Attention 机制,抛弃了 RNN 结构。",
"词向量(Word Embedding)将词语映射到低维连续向量空间,Word2Vec 是经典方法。",
"命名实体识别(NER)识别文本中的人名、地名、机构名等专有名词。",
"情感分析判断文本的情感倾向,广泛用于舆情监控和产品评价分析。",
"RAG(检索增强生成)结合了信息检索和语言生成,可以减少 LLM 的幻觉问题。",
"大语言模型(LLM)是在大规模语料上训练的语言模型,具有强大的零样本泛化能力。",
]
rag.add_documents(knowledge_base)
# 测试检索
query = "BERT 是什么?它用什么方法预训练?"
prompt, retrieved = rag.answer(query)
print(f"查询: {query}\n")
print("Retrieved documents:")
for r in retrieved:
    print(f"  [similarity: {r['score']:.4f}] {r['document']}")
print(f"\nConstructed prompt:\n{prompt}")
7.2 多模态 NLP
多模态 NLP 将文本与图像、音频、视频等模态融合处理。
Python 示例:图文匹配(CLIP)
# 安装:pip install transformers pillow
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import requests
# 加载 CLIP 模型(OpenAI)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def compute_image_text_similarity(image_url, texts):
    """Compute how well each text description matches the image."""
    # Load the image
    image = Image.open(requests.get(image_url, stream=True, timeout=10).raw)
    # Preprocess the inputs
    inputs = processor(
        text=texts,
        images=image,
        return_tensors="pt",
        padding=True
    )
    # Compute similarities
    with torch.no_grad():
        outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)  # softmax over the candidate texts
    return probs[0].tolist()
# 示例(需要网络连接)
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg" # 猫的图片
descriptions = [
"a photo of cats",
"a photo of dogs",
"a photo of a car",
"a photo of buildings"
]
print("CLIP 图文相似度计算:")
print(f"图片: {image_url}\n")
# probs = compute_image_text_similarity(image_url, descriptions)
# for desc, prob in zip(descriptions, probs):
# print(f" {desc:30} → 概率: {prob:.4f}")
print("(需联网下载模型后运行)")
# 文字转图像提示词生成(Stable Diffusion)
print("\n文本驱动的图像生成流程:")
print(" 用户文本描述 → CLIP 文本编码器 → 潜在空间")
print(" → 扩散模型去噪 → 图像解码器 → 生成图像")
7.3 智能 Agent
基于 LLM 的 Agent 能够规划任务、调用工具、自主执行复杂任务。
Agent 核心架构
用户目标(Goal)
↓
┌─────────────────────────────────────┐
│ LLM 大脑(规划 + 推理 + 决策) │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ 记忆系统 │ │ 工具调用 │ │
│ │ • 短期记忆│ │ • 搜索引擎 │ │
│ │ • 长期记忆│ │ • 代码执行 │ │
│ │ • 外部DB │ │ • 文件操作 │ │
│ └──────────┘ └──────────────────┘ │
└─────────────────────────────────────┘
↓
执行动作 → 观察结果 → 继续规划
↓
输出结果
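The act → observe → plan-again loop in the diagram can be sketched without any LLM. In the toy below, a scripted function stands in for the model, and the calculator tool parses arithmetic with Python's `ast` module instead of calling `eval`, because tool input comes from model output and should not be trusted. All names here (`safe_calculate`, `scripted_llm`, `react_loop`) are hypothetical.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_calculate(expression):
    """Evaluate +, -, *, /, ** on numbers only; anything else raises ValueError."""
    def ev(node):
        if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

def scripted_llm(history):
    """Stand-in for an LLM: call the calculator once, then give the final answer."""
    observations = [payload for kind, payload in history if kind == "observation"]
    if not observations:
        return ("action", ("Calculator", "15 ** 2"))
    return ("final", f"15 squared is {observations[-1]}")

def react_loop(question, tools, llm, max_steps=5):
    """Alternate between asking the model for a step and executing the chosen tool."""
    history = [("question", question)]
    for _ in range(max_steps):
        kind, payload = llm(history)
        if kind == "final":
            return payload, history
        tool_name, tool_input = payload
        history.append(("action", payload))
        history.append(("observation", tools[tool_name](tool_input)))
    return None, history  # step budget exhausted

answer, trace = react_loop("What is 15 squared?", {"Calculator": safe_calculate}, scripted_llm)
for kind, payload in trace:
    print(f"{kind:12} {payload}")
print("final answer:", answer)
```

The `max_steps` cap matters in a real agent: without it, a model that never emits a final answer would loop and keep calling tools indefinitely.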
Python 示例:ReAct Agent 框架
# ReAct (Reasoning + Acting): the LLM alternates between reasoning and acting
# Install: pip install langchain langchain-openai langchainhub
from langchain.agents import create_react_agent, AgentExecutor
from langchain_openai import ChatOpenAI
from langchain.tools import Tool
from langchain import hub
from datetime import datetime
# Tool functions
def search_web(query: str) -> str:
    """Mock search engine."""
    # In a real system, plug in SerpAPI, Tavily, etc.
    mock_results = {
        "python": "Python 是一种高级编程语言,广泛用于 AI 和数据科学。",
        "nlp": "NLP 是自然语言处理的缩写,让计算机理解人类语言。",
    }
    for key, val in mock_results.items():
        if key.lower() in query.lower():
            return val
    return f"搜索到关于 '{query}' 的信息:这是模拟搜索结果。"

def calculate(expression: str) -> str:
    """Calculator tool."""
    try:
        # eval on model-generated text is unsafe. Stripping builtins is only a minimal
        # guard; a production tool should use a proper expression parser.
        result = eval(expression, {"__builtins__": {}}, {})
        return f"计算结果:{expression} = {result}"
    except Exception:
        return "计算错误"

def get_current_date(input: str = "") -> str:
    """Return the current date and time."""
    return f"当前日期:{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
# 注册工具
tools = [
Tool(name="WebSearch", func=search_web, description="搜索互联网上的信息,输入搜索关键词"),
Tool(name="Calculator", func=calculate, description="进行数学计算,输入数学表达式"),
Tool(name="CurrentDate", func=get_current_date, description="获取当前日期和时间"),
]
# ReAct 推理模式示例(不调用真实 API)
react_example = """
问题:GPT-4 有多少参数?(15 的平方是多少)
思考:我需要搜索 GPT-4 的参数信息,并计算 15 的平方
行动:WebSearch("GPT-4 参数量")
观察:GPT-4 估计有约 1.8 万亿参数(未官方确认)
思考:现在需要计算 15 的平方
行动:Calculator("15 ** 2")
观察:计算结果:15 ** 2 = 225
思考:我已经有了两个答案
最终回答:GPT-4 估计约 1.8 万亿参数(未官方公布),15 的平方是 225。
"""
print("ReAct Agent 推理过程示例:")
print("=" * 60)
print(react_example)
# 实际 LangChain Agent(需要 OpenAI API Key)
print("\n真实 Agent 代码示例:")
print("""
llm = ChatOpenAI(model="gpt-4", temperature=0)
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({"input": "今天的日期是什么?GPT 全称是什么?"})
print(result["output"])
""")
8. 总结与展望
⏱️ 预计时间:10 分钟
技术发展脉络回顾
分词 / 词性标注 / NER / 句法分析
↓
词袋模型 / TF-IDF / Word2Vec
↓
RNN / LSTM / Seq2Seq + Attention
↓
Transformer(2017)←── 里程碑!
↓
BERT / GPT(2018-2020)
↓
LLM: GPT-4 / Claude / LLaMA (2020-present)
↓
多模态 / Agent / RAG(前沿)
NLP 未来方向
| 方向 | 描述 |
|---|---|
| 更高效的模型 | MoE(专家混合)、量化、蒸馏,降低推理成本 |
| 多模态融合 | 文本 + 图像 + 语音 + 视频的统一建模 |
| 可信 AI | 减少幻觉,提高可解释性,增强事实一致性 |
| 长上下文 | 从 4K → 128K → 1M+ token 的超长文本处理 |
| Agent 系统 | 多 Agent 协作,自主完成复杂任务 |
| 低资源 NLP | 小语种、领域专用,数据效率 |
学习路线推荐
基础
├── Python 编程 + NumPy / Pandas
├── 机器学习基础(sklearn)
└── 深度学习基础(PyTorch / TensorFlow)
核心
├── 经典 NLP 任务实战
├── HuggingFace Transformers 库
└── BERT / GPT 微调
进阶
├── 大语言模型原理
├── 提示工程(Prompt Engineering)
├── RAG 系统构建
└── LLM Agent 开发
推荐资源
- 书籍:《Speech and Language Processing》(Jurafsky & Martin,免费在线版)
- 课程:CS224N(Stanford NLP)、fast.ai NLP
- 论文:Attention is All You Need、BERT、GPT-3
- 工具:HuggingFace 🤗、spaCy、LangChain
- 实践:Kaggle NLP 竞赛、Papers with Code
🎓 课程结束
本讲义覆盖了 NLP 从基础分词到前沿 Agent 的完整技术栈。 NLP 发展迅速,建议持续关注 arXiv cs.CL 和 HuggingFace 最新进展。
作者:周老师智能体 | 适用版本:Python 3.9+, PyTorch 2.x, Transformers 4.x
