【计算机基础】机器学习入门
聪头 游戏开发萌新

Python3入门机器学习 经典算法与应用

时间:2022年6月23日16:13:53

Git URL:https://git.imooc.com/coding-169/coding-169/src/master

第1章 欢迎来到 Python3 玩转机器学习

1-1 什么是机器学习

image

1-2 课程涵盖的内容和理念

image image image

1-3 课程所使用的主要技术栈

image image image image image image

第2章 机器学习基础

2-1 机器学习世界的数据

image image image image

↑ 大写代表矩阵,小写代表向量

image image image

2-2 机器学习的主要任务

机器学习的基本任务:分类、回归

分类

image image image image image image

回归

image image image

2-3 监督学习,非监督学习,半监督学习和增强学习

监督学习

监督学习:给机器的训练数据拥有“标记”或者“答案”

image image

非监督学习

image image image image

半监督学习

image image

增强学习

image

2-4 批量学习,在线学习,参数学习和非参数学习

批量学习

image image

在线学习

image image

参数学习

image image

非参数学习

image

2-5 和机器学习相关的“哲学”思考

image

无数据举例:Alpha Go

image image

本章小结

image

2-7 课程使用环境搭建

Anaconda官网:https://www.anaconda.com/

image image

第3章 Jupyter Notebook, numpy和matplotlib

3-1 Jupyter Notebook基础

快捷键

![image-20220623202116627](Python3入门机器学习 经典算法与应用.assets/image-20220623202116627.png)

![image-20220623201319046](Python3入门机器学习 经典算法与应用.assets/image-20220623201319046.png)

![image-20220623201916533](Python3入门机器学习 经典算法与应用.assets/image-20220623201916533.png)

修改某单元格语法类型

![image-20220623201941657](Python3入门机器学习 经典算法与应用.assets/image-20220623201941657.png)

Notebook优势:耗时加载的大数据只需加载一次,在整个会话期间一直可用

image

重置:此时会从上至下依次执行代码

image

3-2 Jupyter Notebook中的魔法命令

%run:执行当前目录下的某个Python文件,并将其中的定义导入当前Notebook环境

image

%timeit:单行性能测试,底层会自动多次执行取平均

  • %%timeit:多行测试
image

↑ Python中使用生成表达式创建数组比for循环快

%time:性能测试,只执行一次

  • %%time:多行测试

%lsmagic:展示所有魔法命令

%xxx?:查看xxx命令的文档
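一个最小用法示意(需在 Jupyter Notebook 单元格中运行,`myscript.py` 为示例文件名;`%%` 开头的单元格魔法命令需独占一个单元格):

```python
%run myscript.py                          # 执行当前目录下的 myscript.py
%timeit L = [i**2 for i in range(1000)]   # 多次执行取平均,测单条语句
%time L = [i**2 for i in range(1000)]     # 只执行一次,测单条语句
```

```python
%%timeit
# %%timeit 测试整个单元格(多行)的耗时
L = []
for i in range(1000):
    L.append(i**2)
```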

3-3 Numpy 数据基础

import numpy as np
print(np.__version__) #1.21.5

# Python List的特点
L = [i for i in range(10)] # 数组生成表达式
L[5] = 'Machine Learning' # List对数据类型没有要求
print(L) #[0, 1, 2, 3, 4, 'Machine Learning', 6, 7, 8, 9]

# 使用array,限定类型,效率和安全性更高
import array
arr = array.array('i', [i for i in range(10)]) # i代表整型数组
# arr[5] = 'Hello' # 报错

# numpy.array
nparr = np.array([i for i in range(10)])
# nparr[5] = 'Hello' # 报错
print(nparr.dtype) # 查看nparr的类型,此为int32
nparr[5] = 3.14
print(nparr) # 3.14自动转换成3 [0 1 2 3 4 3 6 7 8 9]
nparr2 = np.array([1,2,3.0])
print(nparr2.dtype) # float64

3-4 创建Numpy数组(和矩阵)

  • 全0,全1,自定义方式创建向量或矩阵
  • arange:指定一个范围和步长,创建向量或矩阵
  • linspace:指定一个范围和切分次数,创建向量和矩阵(范围为左闭右闭)
  • random:创建随机的向量和矩阵
import numpy as np
# 全0的向量或矩阵
np.zeros(10)
print(np.zeros(10), np.zeros(10).dtype) # [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] float64
print(np.zeros(10, dtype=int)) # [0 0 0 0 0 0 0 0 0 0]
print(np.zeros((3, 5)))
print(np.zeros(shape=(3, 5), dtype=int))
'''
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]]
'''

# 全1的向量或矩阵
print(np.ones(10)) # [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
print(np.ones((3, 5)))
'''
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
'''

# 自定义
print(np.full(shape = (3, 5), fill_value = 666))
'''
[[666 666 666 666 666]
[666 666 666 666 666]
[666 666 666 666 666]]
'''
print(np.full(shape = (3, 5), fill_value = 666.0))
'''
[[666. 666. 666. 666. 666.]
[666. 666. 666. 666. 666.]
[666. 666. 666. 666. 666.]]
'''

# arange
print([i for i in range(0, 20, 2)]) # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(np.arange(0, 20, 2)) # [ 0 2 4 6 8 10 12 14 16 18]
# print([i for i in range(0, 20, 0.2)] #步长不能为浮点数,报错
print(np.arange(0, 1, 0.2)) # [0. 0.2 0.4 0.6 0.8]
print(np.arange(0, 10)) # [0 1 2 3 4 5 6 7 8 9]
print(np.arange(10)) # [0 1 2 3 4 5 6 7 8 9]

# linspace
print(np.linspace(0, 20, 10)) #[0,20]等长截取10个点
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889 11.11111111
# 13.33333333 15.55555556 17.77777778 20. ]
print(np.linspace(0, 20, 11)) # [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20.]

# random
print(np.random.randint(0, 10)) # 生成[0, 10)的一个随机数 # 2
print(np.random.randint(0, 10, 10)) # 生成[0, 10)的数组(第三个参数指明数组大小) # [1 6 1 7 8 0 3 0 8 4]
print(np.random.randint(4, 8, size=10)) # [4 5 6 5 5 5 5 4 7 5]
print(np.random.randint(4, 8, size=(3, 5)))
'''
[[7 6 6 7 7]
[7 4 6 6 5]
[5 6 4 7 7]]
'''
np.random.seed(666) # 随机种子的使用
print(np.random.random()) # [0, 1)的随机数 # 0.7004371218578347
print(np.random.random(10)) # [0, 1)随机数数组(10个元素)
# [0.84418664 0.67651434 0.72785806 0.95145796 0.0127032 0.4135877
# 0.04881279 0.09992856 0.50806631 0.20024754]
print(np.random.random((3, 5)))
'''
[[0.74415417 0.192892 0.70084475 0.29322811 0.77447945]
[0.00510884 0.11285765 0.11095367 0.24766823 0.0232363 ]
[0.72732115 0.34003494 0.19750316 0.90917959 0.97834699]]
'''
print(np.random.normal()) # 服从标准正态分布的浮点数 # -1.6829007709843886
print(np.random.normal(10, 100)) # 生成服从均值为10,标准差为100的正态分布浮点数 # 32.91852477040214
print(np.random.normal(0, 1, (3, 5)))
'''
[[-1.75662522 0.84463262 0.27721986 0.85290153 0.1945996 ]
[ 1.31063772 1.5438436 -0.52904802 -0.6564723 -0.2015057 ]
[-0.70061583 0.68713795 -0.02607576 -0.82975832 0.29655378]]
'''

查看文档

# np.random.normal? #查看文档
# help(np.random.normal) #在notebook中查看文档

3-5 Numpy数组(和矩阵)的基本操作

  • 基本属性
  • 数据访问
  • reshape:改变维度
import numpy as np
x = np.arange(10)
print(x) # [0 1 2 3 4 5 6 7 8 9]
X = np.arange(15).reshape(3, 5)
print(X)
'''
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
'''

# numpy.array 基本属性
# 查看维度
print(x.ndim) # 1
print(X.ndim) # 2
# 查看结构,返回元组
print(x.shape) # (10,)
print(X.shape) # (3, 5)
# 元素个数
print(x.size) # 10
print(X.size) # 15

# 数据访问
print(x[0], x[-1]) # 0 9
print(X[0][0]) # 不建议这样写 # 0
print(X[(0, 0)], X[2, 2]) # 推荐X[2, 2]的写法 # 0 12
print(x[0:5]) # 切片[0, 5)的元素 # [0 1 2 3 4]
print(x[:5]) # [第一个元素(0), 5) 切片 # [0 1 2 3 4]
print(x[5:]) # [5, 结尾(10)) 切片 # [5 6 7 8 9]
print(x[::2]) # 从头到尾取步长2切片 # [0 2 4 6 8]
print(x[::-1]) # 从头到尾取步长-1切片(逆序) # [9 8 7 6 5 4 3 2 1 0]
print(X[:2, :3]) # 前两行,前三列
'''
[[0 1 2]
[5 6 7]]
'''
# [][]无法表达正确的语义
print(X[:2]) # 前两行
'''
[[0 1 2 3 4]
[5 6 7 8 9]]
'''
print(X[:2][:3]) # 不推荐这样写:本意想取前两行前三列,但实际仍是前两行(X[:2]先取前两行,再对这个只有2行的结果取[:3],行数不足3,结果不变)
'''
[[0 1 2 3 4]
[5 6 7 8 9]]
'''
print(X[:2, ::2]) # 前两行,每行从头到尾间隔为2的元素
'''
[[0 2 4]
[5 7 9]]
'''
print(X[::-1, ::-1]) # 矩阵逆序
'''
[[14 13 12 11 10]
[ 9 8 7 6 5]
[ 4 3 2 1 0]]
'''
print(X[0], X[0, :]) #取第一行 # [0 1 2 3 4] [0 1 2 3 4]
print(X[:, 0]) #取第一列 # [ 0 5 10]
subX = X[:2, :3]
subX[0, 0] = 100
print(subX)
'''
[[100 1 2]
[ 5 6 7]]
'''
print(X) # numpy中使用引用的方式获取子矩阵
'''
[[100 1 2 3 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]]
'''
subX = X[:2, :3].copy() # 复制值
subX[0, 0] = 99
print(subX)
'''
[[99 1 2]
[ 5 6 7]]
'''
print(X)
'''
[[100 1 2 3 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]]
'''

# Reshape
print(x.reshape(2, 5)) # 将x转换成2行5列的矩阵,不改变x本身
'''
[[0 1 2 3 4]
[5 6 7 8 9]]
'''
print(x.reshape(10, -1)) # 10行,列数由计算机决定
'''
[[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]]
'''
print(x.reshape(-1, 10)) # 10列,行数由计算机决定 # [[0 1 2 3 4 5 6 7 8 9]]
# x.reshape(3, -1) # 10个元素无法均分给3行,报错

3-6 Numpy数组(和矩阵)的合并与分割

  • concatenate:拼接,默认按行拼接。要求数据是同维的
    • axis:1时按列拼接
  • vstack/hstack:竖直/水平方向堆叠。不要求数据同维
  • split:分割,默认按行分割
    • axis:1时按列分割
  • vsplit/hsplit:竖直/水平分割
import numpy as np
# 数据合并
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.concatenate([x, y])) # [1 2 3 3 2 1]
z = np.array([666, 666, 666])
print(np.concatenate([x, y, z])) # [ 1 2 3 3 2 1 666 666 666]
A = np.array([[1, 2, 3],
[4, 5, 6]])
print(np.concatenate([A, A])) # 默认沿行拼接
'''
[[1 2 3]
[4 5 6]
[1 2 3]
[4 5 6]]
'''
print(np.concatenate([A, A], axis=1)) # 沿列拼接
'''
[[1 2 3 1 2 3]
[4 5 6 4 5 6]]
'''
# np.concatenate([A, z]) # 维数不同无法连接,A是二维矩阵,z是一维向量,报错
print(z.reshape(1, -1)) # 1行,列数自动填充 # [[666 666 666]]
print(z.reshape(1, -1).ndim) # 2
print(np.concatenate([A, z.reshape(1, -1)])) # A矩阵和z向量连接,产生新矩阵
'''
[[ 1 2 3]
[ 4 5 6]
[666 666 666]]
'''
A2 = np.concatenate([A, z.reshape(1, -1)])
print(np.vstack([A, z])) # 竖直方向数据堆叠
'''
[[ 1 2 3]
[ 4 5 6]
[666 666 666]]
'''
B = np.full((2, 2), 100)
print(B)
'''
[[100 100]
[100 100]]
'''
print(np.hstack([A, B]))
'''
[[ 1 2 3 100 100]
[ 4 5 6 100 100]]
'''

# 数据分割
x = np.arange(10)
print(np.split(x, [3, 7])) # 分割x,分割点为3,7(规则为左闭右开)
''' [array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])] '''
print(np.split(x, [5]))
''' [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])] '''
A = np.arange(16).reshape((4, 4))
A1, A2 = np.split(A, [2]) # 基于行分割
print(A1)
'''
[[0 1 2 3]
[4 5 6 7]]
'''
print(A2)
'''
[[ 8 9 10 11]
[12 13 14 15]]
'''
A1, A2 = np.split(A, [2], axis=1) # 基于列分割
print(A1)
'''
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
'''
print(A2)
'''
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
'''
upper, lower = np.vsplit(A, [2]) # 竖直方向分割
print(upper)
'''
[[0 1 2 3]
[4 5 6 7]]
'''
print(lower)
'''
[[ 8 9 10 11]
[12 13 14 15]]
'''
left, right = np.hsplit(A, [2]) # 水平方向分割
print(left)
'''
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
'''
print(right)
'''
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
'''
data = np.arange(16).reshape((4, 4))
print(data)
'''
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
'''
X, y = np.hsplit(data, [-1]) # 分割最后一列
print(X)
'''
[[ 0 1 2]
[ 4 5 6]
[ 8 9 10]
[12 13 14]]
'''
print(y)
'''
[[ 3]
[ 7]
[11]
[15]]
'''
print(y[:, 0]) # 将y矩阵转换成向量 # [ 3 7 11 15]

3-7 Numpy中的矩阵运算

n = 10
L = [i for i in range(n)]
print(2 * L) # 将两个L首尾相接 # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
A = []
for e in L:
    A.append(2 * e)
print(A) # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
A = [2*e for e in L] # 比for快

import numpy as np
L = np.arange(n)
A = np.array([2*e for e in L]) # 比原生List的生成表达式快
A = 2 * L # 向量*2,速度很快
print(A) # [ 0 2 4 6 8 10 12 14 16 18]

# Universal Function
X = np.arange(1, 16).reshape((3, 5))
print(X)
'''
[[ 1 2 3 4 5]
[ 6 7 8 9 10]
[11 12 13 14 15]]
'''
print(X + 1)
'''
[[ 2 3 4 5 6]
[ 7 8 9 10 11]
[12 13 14 15 16]]
'''
print(X - 1)
'''
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
'''
print(X * 2)
'''
[[ 2 4 6 8 10]
[12 14 16 18 20]
[22 24 26 28 30]]
'''
print(X / 2) # 浮点数除法
'''
[[0.5 1. 1.5 2. 2.5]
[3. 3.5 4. 4.5 5. ]
[5.5 6. 6.5 7. 7.5]]
'''
print(X // 2) # 整数除法
'''
[[0 1 1 2 2]
[3 3 4 4 5]
[5 6 6 7 7]]
'''
print(X ** 2) # 幂运算
'''
[[ 1 4 9 16 25]
[ 36 49 64 81 100]
[121 144 169 196 225]]
'''
print(X % 2) # 求余
'''
[[1 0 1 0 1]
[0 1 0 1 0]
[1 0 1 0 1]]
'''
print(1 / X) # 求倒数
print(np.abs(X)) # 求绝对值
print(np.sin(X)) # 求正弦
print(np.cos(X))
print(np.tan(X))
print(np.exp(X)) # e^x次方
print(np.power(3, X)) # 等价于3**X
print(np.log(X)) # ln
print(np.log2(X))
print(np.log10(X))

# 矩阵运算
A = np.arange(4).reshape(2, 2)
B = np.full((2, 2), 10)
print(A + B)
'''
[[10 11]
[12 13]]
'''
print(A - B)
'''
[[-10 -9]
[ -8 -7]]
'''
print(A * B) # 对应元素相乘,并非矩阵乘法
'''
[[ 0 10]
[20 30]]
'''
print(A / B)
'''
[[0. 0.1]
[0.2 0.3]]
'''
print(A.dot(B)) # 矩阵乘法
'''
[[10 10]
[50 50]]
'''
print(A.T) # 转置
'''
[[0 2]
[1 3]]
'''

# 向量和矩阵的运算
v = np.array([1, 2])
print(A)
'''
[[0 1]
[2 3]]
'''
print(v + A) # A矩阵每一行和v做加法
'''
[[1 3]
[3 5]]
'''
print(np.vstack([v] * A.shape[0])) # [v] * A的行数(拼接次数) = 竖直拼接结果
'''
[[1 2]
[1 2]]
'''
print(np.tile(v, (2, 1))) # 行向量堆叠2次,列向量堆叠1次
'''
[[1 2]
[1 2]]
'''
print(v * A) # v各元素和A逐行逐元素相乘
'''
[[0 2]
[2 6]]
'''
print(v.dot(A)) # 行向量 * 矩阵 # [4 7]
print(A.dot(v)) # 矩阵 * 列向量 # [2 8]

# 矩阵的逆
invA = np.linalg.inv(A)
print(invA)
'''
[[-1.5 0.5]
[ 1. 0. ]]
'''
print(A.dot(invA))
'''
[[1. 0.]
[0. 1.]]
'''
X = np.arange(16).reshape((2, 8))
pinvX = np.linalg.pinv(X) # 伪逆矩阵
print(pinvX.shape) # (8, 2)
print(X.dot(pinvX))
'''
[[ 1.00000000e+00 -2.49800181e-16]
[ 6.66133815e-16 1.00000000e+00]]
'''

3-8 Numpy中的聚合运算

import numpy as np
np.random.seed(19991101)
L = np.random.random(100) # 100个[0, 1)的随机数组成的数组
# print(L)
print(sum(L)) # 47.94254974738839
print(np.sum(L)) # 效率更高
print(np.min(L)) # 0.0003132760524273692
print(np.max(L)) # 0.9931441196685156
print(L.min()) # 根据个人喜好,不过更推荐使用np.min,可以显式指明调用numpy库
print(L.max())
print(L.sum())
# 矩阵的聚合运算
X = np.arange(16).reshape(4, -1)
print(X)
'''
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
'''
print(np.sum(X)) # 120
print(np.sum(X, axis=0)) # 沿着行方向运算,即每列求和(诀窍:行压缩)# [24 28 32 36]
print(np.sum(X, axis=1)) # 沿着列方向运算,即每行求和(诀窍:列压缩)# [ 6 22 38 54]
print(np.prod(X)) # 逐元素相乘的乘积 # 0
print(np.prod(X + 1)) # 2004189184
print(np.mean(X)) # 求平均值 # 7.5
print(np.median(X)) # 求中位数 # 7.5

print(np.median(L)) # 0.45331017330323375
print(np.percentile(L, q=50)) # 这组元素50%的元素小于结果值,即中位数 # 0.45331017330323375
for percent in [0, 25, 50, 75, 100]:
    print(np.percentile(L, q=percent))
'''
0.0003132760524273692
0.2559220514470348
0.45331017330323375
0.6964296186068865
0.9931441196685156
'''
print(np.var(L)) # 求方差 # 0.07149006634669938
print(np.std(L)) # 求标准差 # 0.26737626361870526
x = np.random.normal(0, 1, size=1000000) # 均值为0,标准差为1的1000000个随机数
print(np.mean(x)) # -0.0010810265562775107
print(np.std(x)) # 1.0004066018104387

3-9 Numpy中的arg运算

import numpy as np
np.random.seed(19991101)
x = np.random.normal(0, 1, size=1000000) # 均值为0,标准差为1的1000000个随机数
print(np.min(x)) # 返回最小值 # -4.9121140331754525
print(np.argmin(x)) # 返回最小值的索引值 # 450222
print(x[np.argmin(x)]) # -4.9121140331754525
print(np.argmax(x)) # 459822

# 排序和使用索引
x = np.arange(16)
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
np.random.shuffle(x) # 乱序处理
print(x) # [ 1 10 3 0 5 6 15 8 14 2 7 13 9 12 4 11]
# np.sort(x)
x.sort()
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
X = np.random.randint(10, size=(4, 4)) # [0,10)之间4x4矩阵
print(X)
'''
[[2 9 2 3]
[5 8 1 1]
[9 4 5 7]
[7 8 7 1]]
'''
print(np.sort(X)) # 默认axis=1,沿着列方向排序,即每行排序
'''
[[2 2 3 9]
[1 1 5 8]
[4 5 7 9]
[1 7 7 8]]
'''
print(np.sort(X, axis=0)) # 每列有序
'''
[[2 4 1 1]
[5 8 2 1]
[7 8 5 3]
[9 9 7 7]]
'''

# 索引
np.random.shuffle(x) # 乱序
print(x) # [ 2 5 12 14 1 6 8 11 3 10 4 0 13 15 7 9]
print(np.argsort(x)) # 索引排序 # [11 4 0 8 10 1 5 14 6 15 9 7 2 12 3 13]
print(np.partition(x, 3)) # 部分排序:以第3小的值为界划分,左边都不大于它,右边都不小于它 # [ 0 1 2 3 7 6 8 5 4 9 10 14 13 15 12 11]
print(np.argpartition(x, 4)) # 同partition,但返回的是索引 # [11 4 0 8 10 1 5 14 6 15 9 3 12 13 2 7]
print(X)
'''
[[2 9 2 3]
[5 8 1 1]
[9 4 5 7]
[7 8 7 1]]
'''
print(np.argsort(X, axis=1)) # 按行索引排序
'''
[[0 2 3 1]
[2 3 0 1]
[1 2 3 0]
[3 0 2 1]]
'''
print(np.argsort(X, axis=0)) # 按列索引排序
'''
[[0 2 1 1]
[1 1 0 3]
[3 3 2 0]
[2 0 3 2]]
'''
print(np.argpartition(X, 2, axis=1)) # 按行索引划分
'''
[[0 2 3 1]
[2 3 0 1]
[1 2 3 0]
[3 2 0 1]]
'''
print(np.argpartition(X, 2, axis=0)) # 按列索引划分

3-10 Numpy中的比较和Fancy Indexing

import numpy as np
x = np.arange(16)
ind = [3, 5, 8] # 待访问索引
print(x[ind]) # 得到由索引3,5,8的值组成的向量 # [3 5 8]
ind = np.array([[0, 2],
[1, 3]])
print(x[ind]) #得到二维矩阵(从一维向量索引得来的值)
'''
[[0 2]
[1 3]]
'''
X = x.reshape(4, -1) # 4x4矩阵
print(X)
'''
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
'''
row = np.array([0, 1, 2]) # 感兴趣的行
col = np.array([1, 2, 3]) # 感兴趣的列
print(X[row, col]) #获得(0,1),(1,2),(2,3)三个点的值
'''
[ 1 6 11]
'''
print(X[0, col]) #(0,1),(0,2),(0,3) # [1 2 3]
print(X[:2, col]) #前两行索引为1,2,3的列的值
'''
[[1 2 3]
[5 6 7]]
'''
col = [True, False, True, True] #使用布尔数组
print(X[1:3, col]) # 对索引[1,3)行的内容,取0,2,3列
'''
[[ 4 6 7]
[ 8 10 11]]
'''

# numpy.array的比较
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
print(x < 3) #x中所有元素和3比较
'''
[ True True True False False False False False False False False False
False False False False]
'''
print(2 * x == 24 - 4 * x)
'''
[False False False False True False False False False False False False
False False False False]
'''
print(np.sum(x <= 3)) # True为1,False为0 # 4
print(np.count_nonzero(x <= 3)) # 4
print(np.any(x == 0)) # 任何一个返回true则返回true # True
print(np.all(x >= 0)) # 所有返回true则返回true # True
print(np.sum(X % 2 == 0)) # 求偶数个数 # 8
print(np.sum(X % 2 == 0, axis = 1)) # 沿着列方向,即每行有多少偶数 # [2 2 2 2]
print(np.sum(X % 2 == 0, axis = 0)) # 沿着行方向,即每列有多少偶数 # [4 0 4 0]
print(np.all(X > 0, axis = 1)) # 每行是否大于0 # [False True True True]
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
print(np.sum((x > 3) & (x < 10))) # 6
print(np.sum((x % 2 == 0) | (x > 10))) # 11
print(np.sum(~(x==0))) # 15
print(x[x < 5]) # [0 1 2 3 4]
print(x[x % 2 == 0]) # [ 0 2 4 6 8 10 12 14]
print(X[X[:, 3] % 3 == 0, :]) # 取索引为3的列能被3整除的行
'''
[[ 0 1 2 3]
[12 13 14 15]]
'''

个人:numpy小结

  • 3-7~3-9:细粒度内容不好总结。直接翻看该小节内容即可
import numpy as np

数据对象常用属性

以下属性通过 numpy 创建的数据对象直接访问

| 出处 | 属性 | 说明 | 举例 |
| --- | --- | --- | --- |
| 3-3 | dtype | 查看元素类型 | int32, float64 |
| 3-5 | ndim | 查看维度 | 1, 2 |
| 3-5 | shape | 查看结构,返回元组 | (10,), (3, 5) |
| 3-5 | size | 查看元素个数 | 10, 15 |

数据对象常用方法

| 出处 | 举例 | 说明 |
| --- | --- | --- |
| 3-5 | x[0] | 一维向量的访问 |
| 3-5 | X[0, 0] | 矩阵元素的访问 |
| 3-5 | x[0:5] | 一维向量切片,返回引用,左闭右开<br>①起始索引,不写默认0<br>②终点索引,不写默认元素个数<br>③步长,不写默认1,-1代表逆序 |
| 3-5 | X[:2, :3] | 矩阵切片,原理同上<br>(示例:取前两行、前三列的数据) |
| 3-5 | X[:2, :3].copy() | 值拷贝 |
| 3-5 | x.reshape(2, 5)<br>x.reshape(10, -1) | 重构(改变维数),返回新对象<br>(示例1:将一维向量x转换成2x5的矩阵)<br>(示例2:将x转换成10行的矩阵,列数由计算机自动计算) |
| 3-10 | ind = [3, 5, 8]; x[ind]<br>ind = np.array([[0, 2], [1, 3]]); x[ind] | 一维向量的Fancy索引访问。预先声明索引的格式,按格式返回内容<br>(示例1:返回由索引3,5,8处的值组成的向量)<br>(示例2:返回二维矩阵,各元素取自一维向量的对应索引) |

numpy常用方法

| 出处 | 函数 | 说明 | 举例 |
| --- | --- | --- | --- |
| 3-3 | np.array | 创建numpy数据对象。一维数组可称为向量,n维数组可称为矩阵<br>参数①:列表 | np.array([i for i in range(10)]) |
| 3-4 | np.zeros | 创建全为0的数据对象<br>参数①:shape,结构。1维传int,n维传元组<br>参数②:dtype,类型。传int、float等 | np.zeros(10, dtype=int)<br>np.zeros((3, 5))<br>np.zeros(shape=(3, 5), dtype=int) |
| 3-4 | np.ones | 创建全为1的数据对象,其他同上 | np.ones((3, 5)) |
| 3-4 | np.full | 创建自定义值的数据对象,基本同上<br>参数②:fill_value,填充值 | np.full(shape=(3, 5), fill_value=666) |
| 3-4 | np.arange | 指定范围创建数据对象,规则为左闭右开<br>传入三个参数:①起始值(闭);②终点值(开);③步长<br>传入两个参数:步长默认1<br>传入一个参数:起始值默认0,步长默认1 | np.arange(0, 20, 2)<br>np.arange(0, 1, 0.2)<br>np.arange(0, 10)<br>np.arange(10) |
| 3-4 | np.linspace | 指定范围创建数据对象。不同于arange由步长创建,该函数预先指定元素个数,自动计算平均步长(范围为左闭右闭)<br>参数①:起始值;参数②:终点值(闭区间);参数③:元素个数 | np.linspace(0, 20, 11) |
| 3-4 | np.random.randint | 指定范围内随机创建数据对象<br>参数①:起始值(闭)<br>参数②:终点值(开)<br>参数③:size,尺寸,传入元组可定义维数 | np.random.randint(0, 10)<br>np.random.randint(0, 10, 10)<br>np.random.randint(4, 8, size=10)<br>np.random.randint(4, 8, size=(3, 5)) |
| 3-4 | np.random.seed | 设定numpy的随机种子 | np.random.seed(666) |
| 3-4 | np.random.random | 生成[0, 1)的随机数<br>参数①:size,尺寸,传入元组可定义维数 | np.random.random()<br>np.random.random(10)<br>np.random.random((3, 5)) |
| 3-4 | np.random.normal | 根据正态分布生成随机数<br>参数①:均值,默认0<br>参数②:标准差,默认1<br>参数③:size,尺寸,传入元组可定义维数 | np.random.normal()<br>np.random.normal(10, 100)<br>np.random.normal(0, 1, (3, 5)) |
| 3-6 | np.concatenate | 拼接数据,要求各数据维数相同,返回新对象<br>参数①:由待拼接对象组成的列表<br>参数②:axis,默认0,沿行拼接;1为沿列拼接 | np.concatenate([x, y])<br>np.concatenate([A, A])<br>np.concatenate([A, A], axis=1) |
| 3-6 | np.vstack | 竖直方向拼接数据(增加行),不要求维数相同,但需满足列数一致 | np.vstack([A, z]) |
| 3-6 | np.hstack | 水平方向拼接数据(增加列) | np.hstack([A, B]) |
| 3-6 | np.split | 数据分割<br>参数①:分割对象<br>参数②:列表,指定分割点<br>参数③:axis,默认0,基于行分割;1基于列分割 | np.split(x, [3, 7])<br>np.split(A, [2])<br>np.split(A, [2], axis=1) |
| 3-6 | np.vsplit | 竖直方向分割(基于行分割) | np.vsplit(A, [2]) |
| 3-6 | np.hsplit | 水平方向分割(基于列分割) | np.hsplit(A, [2]) |

3-11 Matplotlib数据可视化基础

image image
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# 直线图 通常用于表现结果
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.show()

cosy = np.cos(x)
siny = y.copy()
plt.plot(x, siny)
plt.plot(x, cosy, color = 'orange', linestyle='-')
plt.show()

plt.plot(x, siny)
plt.plot(x, cosy, color = 'orange', linestyle='-')
plt.xlim(-5, 15) # 控制x范围
plt.ylim(0, 1.5) # 控制y范围
plt.axis([-1, 11, -2, 2]) # 同时调整x和y范围
plt.show()

plt.plot(x, siny, label='sin(x)')
plt.plot(x, cosy, color = 'orange', linestyle='-', label='cos(x)')
plt.xlabel('x axis')
plt.ylabel('y value')
plt.legend() # 添加图例
plt.title('Welcome to the ML World!')
plt.show()

# 散点图 Scatter Plot 通常用于绘制二维特征
plt.scatter(x, siny)
plt.scatter(x, cosy)
plt.show()

x = np.random.normal(0, 1, 10000)
y = np.random.normal(0, 1, 10000)
plt.scatter(x, y, alpha=0.1)
plt.show()

3-12 数据加载和简单的数据探索

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
print(iris.keys()) # 数据对应特征和label
print(iris.DESCR) # 数据集文档
print(iris.data) # 特征
print(iris.data.shape)
print(iris.feature_names) # 特征名称
print(iris.target) # label对应的索引值
print(iris.target.shape)
print(iris.target_names) # label名称

X = iris.data[:, :2] # 取前两列
print(X.shape)
plt.scatter(X[:, 0], X[:, 1]) # 取X第0和1列分别作为x和y轴
plt.show()

y = iris.target
plt.scatter(X[y==0, 0], X[y==0, 1], color='red', marker='o') # 取X满足y==0的行的0和1列
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', marker='+') # 取X满足y==1的行的0和1列
plt.scatter(X[y==2, 0], X[y==2, 1], color='green', marker='x') # 取X满足y==2的行的0和1列
plt.show()

X = iris.data[:, 2:] # 此处表示取3,4列特征
plt.scatter(X[y==0, 0], X[y==0, 1], color='red', marker='o') # 取X满足y==0的行的0和1列
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', marker='+') # 取X满足y==1的行的0和1列
plt.scatter(X[y==2, 0], X[y==2, 1], color='green', marker='x') # 取X满足y==2的行的0和1列
plt.show()

第4章 最基础的分类算法-k近邻算法 kNN

4-1 k近邻算法基础

原理:当要判断新的样本点属于哪一类时,在训练集中找到距离新样本点最近的K个点;这K个点中哪个类别的数量最多,就把新样本判定为哪个类别
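其中样本间的距离通常用欧氏距离度量(与下方代码中的 `sqrt(np.sum((x_train - x)**2))` 一致):

$$d(x^{(a)}, x^{(b)}) = \sqrt{\sum_{i=1}^{n}\left(x_i^{(a)} - x_i^{(b)}\right)^2}$$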

image image image image
import numpy as np
import matplotlib.pyplot as plt

raw_data_X = [
[3.393533211, 2.331273381],
[3.110073483, 1.781539638],
[1.343808831, 3.368360954],
[3.582294042, 4.679179110],
[2.280362439, 2.866990263],
[7.423436942, 4.696522875],
[5.745051997, 3.533989803],
[9.172168622, 2.511101045],
[7.792783481, 3.424088941],
[7.939820817, 0.791637231]
]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
print(X_train)
'''
[[3.39353321 2.33127338]
[3.11007348 1.78153964]
[1.34380883 3.36836095]
[3.58229404 4.67917911]
[2.28036244 2.86699026]
[7.42343694 4.69652288]
[5.745052 3.5339898 ]
[9.17216862 2.51110105]
[7.79278348 3.42408894]
[7.93982082 0.79163723]]
'''
print(y_train) # [0 0 0 0 0 1 1 1 1 1]

plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.show()

x = np.array([8.093607318, 3.365731514])
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.scatter(x[0], x[1], color='b')
plt.show()

# knn的过程
from math import sqrt
distances = []
for x_train in X_train:
    d = sqrt(np.sum((x_train - x)**2))
    distances.append(d)
print(distances)
distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
print(distances)
# [4.812566907609877, 5.229270827235305, 6.749798999160064, 4.6986266144110695, 5.83460014556857,
# 1.4900114024329525, 2.354574897431513, 1.3761132675144652, 0.3064319992975, 2.5786840957478887]
nearest = np.argsort(distances)
print(nearest) # [8 7 5 6 9 3 0 1 4 2]

k = 6
topK_y = [y_train[i] for i in nearest[:k]] # 取nearest前6个元素
print(topK_y) # [1, 1, 1, 1, 1, 0]
# 统计
from collections import Counter
votes = Counter(topK_y) # 求不同元素的票数
print(votes) # Counter({1: 5, 0: 1})
print(votes.most_common(1)) # 取票数最高的一个元素(列表) # [(1, 5)]
print(votes.most_common(1)[0][0]) # 第一个列表元素(元组)的第一个数据 # 1
predict_y = votes.most_common(1)[0][0]
print(predict_y) # 1

4-2 scikit-learn中的机器学习算法封装

kNN算法封装

Ch4_kNN.py

import numpy as np
from math import sqrt
from collections import Counter

def kNN_classify(k, X_train, y_train, x):
    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"

    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    nearest = np.argsort(distances)
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]

main

import numpy as np
import matplotlib.pyplot as plt

raw_data_X = [
[3.393533211, 2.331273381],
[3.110073483, 1.781539638],
[1.343808831, 3.368360954],
[3.582294042, 4.679179110],
[2.280362439, 2.866990263],
[7.423436942, 4.696522875],
[5.745051997, 3.533989803],
[9.172168622, 2.511101045],
[7.792783481, 3.424088941],
[7.939820817, 0.791637231]
]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

x = np.array([8.093607318, 3.365731514])

%run liuyubobobo/Ch4_kNN.py
predict_y = kNN_classify(6, X_train, y_train, x)
print(predict_y)
image image

调用sklearn的kNN

# 使用scikit-learn中的kNN
from sklearn.neighbors import KNeighborsClassifier
kNN_classifier = KNeighborsClassifier(n_neighbors=6)
print(kNN_classifier.fit(X_train, y_train)) # 拟合
# kNN_classifier.predict(x) # 直接传一维向量的用法已被废弃,predict需要二维矩阵
X_predict = x.reshape(1, -1)
print(X_predict)
y_predict = kNN_classifier.predict(X_predict)
print(y_predict)
print(y_predict[0])

根据sklearn重新封装kNN

Ch4_kNN2.py

import numpy as np
from math import sqrt
from collections import Counter

class KNNClassifier:

    def __init__(self, k):
        '''初始化kNN分类器'''
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        '''根据训练数据集X_train和y_train训练kNN分类器'''
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        '''给定待预测数据集X_predict,返回表示X_predict的结果向量'''
        assert self._X_train is not None and self._y_train is not None, \
            'must fit before predict!'
        assert X_predict.shape[1] == self._X_train.shape[1], \
            'the feature number of X_predict must be equal to X_train'
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        '''给定单个待预测数据x,返回x的预测结果值'''
        assert x.shape[0] == self._X_train.shape[1], \
            'the feature number of x must be equal to X_train'
        distances = [sqrt(np.sum((x_train - x)**2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return 'KNN(k=%d)' % self.k

main

# 重新整理我们的kNN代码
%run liuyubobobo/Ch4_kNN2.py
knn_clf = KNNClassifier(k=6)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_predict)
print(y_predict)
print(y_predict[0])

4-3 训练数据集,测试数据集

image

model_selection.py

  • 将数据集切分成训练数据和测试数据两部分
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    '''将数据X和y按照test_ratio分割成X_train, X_test, y_train, y_test'''
    assert X.shape[0] == y.shape[0], \
        'the size of X must be equal to the size of y'
    assert 0.0 <= test_ratio <= 1.0, \
        'test_ratio must be valid'

    if seed:
        np.random.seed(seed)

    shuffle_indexes = np.random.permutation(len(X))  # 索引乱序排列
    test_size = int(len(X) * test_ratio)  # 默认test_ratio=0.2:80%训练数据,20%测试数据
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    X_train = X[train_indexes]
    y_train = y[train_indexes]
    X_test = X[test_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test

main

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X.shape, y.shape)

# 使用我们的算法
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

from playML.kNN import KNNClassifier
my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
print(y_predict)
print(sum(y_predict == y_test))
print(sum(y_predict == y_test) / len(y_test))

# sklearn中的train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

4-4 分类准确度

本节中将加载手写数字的数据集,拆分成训练集(80%)和测试集(20%),使用自定义kNN算法和sklearn的kNN算法测试准确度。

metrics.py

  • 封装测试准确度的算法
def accuracy_score(y_true, y_predict):
    '''计算y_true和y_predict之间的准确率'''
    assert y_true.shape[0] == y_predict.shape[0], \
        'the size of y_true must be equal to the size of y_predict'
    return sum(y_true == y_predict) / len(y_true)

修改kNN.py:新增score方法

import numpy as np
from math import sqrt
from collections import Counter
from .metrics import accuracy_score

class KNNClassifier:
    ......
    def score(self, X_test, y_test):
        '''根据测试数据集 X_test 和 y_test 确定当前模型的准确度'''
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

main

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits() # 加载内置的手写数字数据集
print(digits.keys())
print(digits.DESCR)
X = digits.data
print(X.shape) # (1797, 64)
y = digits.target
print(y)
print(digits.target_names)
print(y[:100]) # 查看前100个标签
some_digit = X[666]
print(y[666])
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary) # 显示图片
plt.show()

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio = 0.2)
from playML.kNN import KNNClassifier
my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
print(sum(y_predict == y_test) / len(y_test))

from playML.metrics import accuracy_score
print(accuracy_score(y_test, y_predict))

print(my_knn_clf.score(X_test, y_test)) # 0.9860724233983287

# scikit-learn中的accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_predict))
print(knn_clf.score(X_test, y_test)) # 0.9916666666666667

4-5 超参数

image image
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits() # 加载内置的手写数字数据集
X = digits.data
y = digits.target

# scikit-learn中的accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test, y_test))

# 寻找最好的k
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k=", best_k)
print("best_score=", best_score)
image
# 考虑距离?
best_method = ""
best_score = 0.0
best_k = -1
for method in ['uniform', 'distance']:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print('best_method=', best_method)
print("best_k=", best_k)
print("best_score=", best_score)
image image

↑ 曼哈顿距离:各维度差值的绝对值之和

↑ 红蓝黄都是曼哈顿距离,绿色是欧拉距离

image

p = 1:曼哈顿距离

p = 2:欧拉距离
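明可夫斯基距离的一般形式(标准定义,p 即 sklearn 中 KNeighborsClassifier 的超参数 p):

$$d(x^{(a)}, x^{(b)}) = \left(\sum_{i=1}^{n}\left|x_i^{(a)} - x_i^{(b)}\right|^p\right)^{\frac{1}{p}}$$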

image
# 探索明可夫斯基距离相应的p
best_p = -1
best_score = 0.0
best_k = -1
for p in range(1, 6):
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_p = p
            best_k = k
            best_score = score
print('best_p=', best_p)
print("best_k=", best_k)
print("best_score=", best_score)

4-6 网格搜索与k近邻算法中更多超参数

import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits() # 加载内置的手写数字数据集
X = digits.data
y = digits.target

# scikit-learn中的accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier

# Grid Search
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid) # 传入knn分类器和网格参数
# %%time
grid_search.fit(X_train, y_train) # 相对比较慢
print(grid_search.best_estimator_) # 计算机计算的结果使用'单词+_'组合
print(grid_search.best_score_)
print(grid_search.best_params_)
knn_clf = grid_search.best_estimator_
print(knn_clf.score(X_test, y_test))
# n_jobs:使用CPU多少核
# verbose:输出信息,越大越详细,一般为2
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
image

4-7 数据归一化

image image

最值归一化

image

↑ 适用场景:考试成绩(0-100),像素颜色(0-255)
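最值归一化公式(与下方代码一致):

$$x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}$$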

均值方差归一化(推荐)

  • 注意:公式的分母是标准差。课程中习惯称其为“均值方差归一化”,实际做法是减去均值、再除以标准差(Standardization)
image
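均值方差归一化公式(与下方代码一致,分母 S 为标准差):

$$x_{scale} = \frac{x - \bar{x}}{S}$$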
import numpy as np
import matplotlib.pyplot as plt

# 最值归一化 Normalization
# 向量
x = np.random.randint(0, 100, size=100)
print((x - np.min(x)) / (np.max(x) - np.min(x)))
X = np.random.randint(0, 100, (50, 2))

# 矩阵
X = np.array(X, dtype=float)
# 第一列特征最值归一化
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
print(X[:10, :])
plt.scatter(X[:, 0], X[:, 1])
plt.show()

print(np.mean(X[:, 0])) # 0.5364210526315789
print(np.std(X[:, 0])) # 0.286957176339313
print(np.mean(X[:, 1])) # 0.5195833333333334
print(np.std(X[:, 1])) # 0.298706993650225

# 均值方差归一化 Standardization
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype = float)
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])
plt.scatter(X2[:, 0], X2[:, 1])
plt.show()

print(np.mean(X2[:, 0])) # 9.992007221626409e-17
print(np.std(X2[:, 0])) # 1.0
print(np.mean(X2[:, 1])) # 1.4488410471358292e-16
print(np.std(X2[:, 1])) # 1.0

4-8 scikit-learn中的Scaler

image image image image
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X[:10, :])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# scikit-learn中的StandardScaler
from sklearn.preprocessing import StandardScaler # 使用sklearn中的StandardScaler
# from preprocessing import StandardScaler # 自定义StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train) # 计算得到关键信息
print(standardScaler.mean_) # 只读变量使用'单词_'命名
# print(standardScaler.std_) # 弃用
print(standardScaler.scale_)
X_train = standardScaler.transform(X_train) # 返回归一化后结果
X_test_standard = standardScaler.transform(X_test)

# 使用归一化后的数据,利用kNN分类
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test_standard, y_test)) # 1.0
print(knn_clf.score(X_test, y_test)) # 0.3333333333333333(错误用法:模型用归一化数据训练,测试数据也必须先归一化)

4-9 更多有关k近邻算法的思考

image image

↑ 使用K近邻算法解决回归问题-文档:http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

image

↑ KD-Tree:https://www.bilibili.com/video/BV1d5411w7f5

image image image

第5章 线性回归法

5-1 简单线性回归

image image image image image image image image

5-2 最小二乘法

image

b的推导

image image

a的推导

image image image

最终

image
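最小二乘法的最终结论(标准结果,5-3 节代码按此实现):

$$a = \frac{\sum_{i=1}^{m}(x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{m}(x^{(i)} - \bar{x})^2}, \qquad b = \bar{y} - a\bar{x}$$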

5-3 简单线性回归的实现

python zip()函数详解:https://blog.csdn.net/weixin_47906106/article/details/121702241

main

import numpy as np
import matplotlib.pyplot as plt
x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()

x_mean = np.mean(x)
y_mean = np.mean(y)
num = 0.0 # 分子
d = 0.0 # 分母
for x_i, y_i in zip(x, y): # 分别从x,y中各取一个值
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2
a = num / d
b = y_mean - a * x_mean
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, color = 'red')
plt.axis([0, 6, 0, 6])
plt.show()

x_predict = 6
y_predict = a * x_predict + b
print(y_predict) # 5.2

# 使用自己的SimpleLinearRegression
from playML.SimpleLinearRegression import SimpleLinearRegression1
reg1 = SimpleLinearRegression1()
reg1.fit(x, y)
print(reg1.predict(np.array([x_predict]))) # [5.2]
print(reg1.a_, reg1.b_) # 0.8 0.39999999999999947
y_hat1 = reg1.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat1, color='r')
plt.axis([0, 6, 0, 6])
plt.show()

SimpleLinearRegression1.py

# 文件: SimpleLinearRegression1
# 作者: 聪头
# 时间: 2022/6/26 12:03
# 描述: 只处理一维特征
import numpy as np
class SimpleLinearRegression1:

    def __init__(self):
        '''初始化Simple Linear Regression 模型'''
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        '''根据训练数据集x_train, y_train训练Simple Linear Regression模型'''
        assert x_train.ndim == 1, \
            'Simple Linear Regressor can only solve single feature training data.'
        assert len(x_train) == len(y_train), \
            'the size of x_train must be equal to the size of y_train'

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = 0.0
        d = 0.0
        for x, y in zip(x_train, y_train):
            num += (x - x_mean) * (y - y_mean)
            d += (x - x_mean) ** 2
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        '''给定待预测数据集x_predict, 返回表示x_predict的结果向量'''
        assert x_predict.ndim == 1, \
            'Simple Linear Regressor can only solve single feature training data.'
        assert self.a_ is not None and self.b_ is not None, \
            'must fit before predict!'
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        '''给定单个待预测数据x_single,返回x_single的预测结果值'''
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"

5-4 向量化

image
# 优化:向量运算
class SimpleLinearRegression2:
    ......
    def fit(self, x_train, y_train):
        ......
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # 将for循环改为dot(向量化)
        num = (x_train - x_mean).dot(y_train - y_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self
    ......
    def __repr__(self):
        return "SimpleLinearRegression2()"

性能测试

image

5-5 衡量线性回归法的指标:MSE,RMSE和MAE

image image image image

RMSE vs MAE

  • RMSE中将误差平方,放大了误差,故RMSE的结果通常大于MAE
  • 实际使用中,应尽可能让RMSE更小
image
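三个指标的标准定义(m 为测试样本数,$\hat{y}$ 为预测值):

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2, \qquad RMSE = \sqrt{MSE}, \qquad MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y^{(i)} - \hat{y}^{(i)}\right|$$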

5-6 最好的衡量线性回归法的指标:R Squared

image image

我们模型预测产生的错误由于考虑了x,y之间的关系,因此会比Baseline Model小

  • Baseline Model(基准模型)中不考虑x和y的关系,任何x输入都使用y均值输出
image image
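R方的标准定义(与下方 r2_score 的计算一致):

$$R^2 = 1 - \frac{\sum_{i}\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{\sum_{i}\left(\bar{y} - y^{(i)}\right)^2} = 1 - \frac{MSE(\hat{y}, y)}{Var(y)}$$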
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
x = boston.data[:, 5] # 只使用房间数量这个特征 RM
y = boston.target

# 使用简单线性回归算法
from playML.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
from playML.SimpleLinearRegression import SimpleLinearRegression2
reg = SimpleLinearRegression2()
reg.fit(x_train, y_train)

# 预测
y_predict = reg.predict(x_test)

# 使用自己封装的r2_score
from playML.metrics import r2_score
print(r2_score(y_test, y_predict)) # 0.40277850092929524

# 使用sklearn提供的r2_score
from sklearn.metrics import r2_score
print(r2_score(y_test, y_predict)) # 0.4027785009292951

5-7 多元线性回归和正规方程解

image image image image image

已知:y(样本对应标签)和Xb(第0列全为1,由样本特征值组成的矩阵)

1.求导

2.求极值(导数=0)

  • 涉及矩阵求导(超纲)
image

推导过程略,本质就是对θ每一个分量求偏导,令其为0得到最终结果
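正规方程解的最终形式(标准结论,下一节的 fit_normal 按此实现):

$$\theta = (X_b^T X_b)^{-1} X_b^T y$$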

image

↑ θ只是X各列的系数,没有量纲问题

5-8 实现多元线性回归

image
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
print(X.shape) # (490, 13)

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
print(reg.score(X_test,y_test)) # 0.8129794056212907
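playML 中 fit_normal 的一个最小实现草稿(按上面的正规方程实现,接口仿照课程中的 playML.LinearRegression,仅为示意,以课程源码为准):

```python
import numpy as np

class LinearRegression:

    def __init__(self):
        self.coef_ = None        # 各特征的系数
        self.intercept_ = None   # 截距
        self._theta = None

    def fit_normal(self, X_train, y_train):
        '''正规方程解:theta = (Xb^T Xb)^(-1) Xb^T y'''
        assert X_train.shape[0] == y_train.shape[0], \
            'the size of X_train must be equal to the size of y_train'
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])  # 左侧追加全1列
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        '''返回R方'''
        y_predict = self.predict(X_test)
        return 1 - np.sum((y_test - y_predict) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)
```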

5-9 使用scikit-learn解决回归问题

从5-9开始:代码直接摘自老师提供的源码

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
print(X.shape) # (490, 13)

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_reg.score(X_test, y_test)

# kNN Regressor(注:kNN基于距离,需先归一化得到 X_train_standard / X_test_standard)
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train_standard, y_train)
knn_reg.score(X_test_standard, y_test)

# 优化:kNN网格搜索优化超参数
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train_standard, y_train)
grid_search.best_params_

grid_search.best_score_

grid_search.best_estimator_.score(X_test_standard, y_test) # 该值才是真实的score

5-10 线性回归的可解释性和更多思考

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

lin_reg.coef_

np.argsort(lin_reg.coef_)

boston.feature_names[np.argsort(lin_reg.coef_)]

print(boston.DESCR)
image image image

第6章 梯度下降法

6-1 什么是梯度下降法

直观理解就是滚球,每次滚一段区域直到最低点

image image

↑ 如图导数为负:代表沿x轴正方向移动时J值减小;更新式 θ = θ − η·dJ/dθ 中导数前乘上负号,θ恰好朝着J减小的方向移动

  • 具体步骤:每次得到一个theta,求其导数,按 θ = θ − η·(dJ/dθ) 移动theta,反复迭代,直到导数趋近于0(即到达极小值点附近)
image

可能遇到的问题

image image image image

6-2 模拟实现梯度下降法

梯度下降核心算法

  • 实现思路:任取一个θ,对其求导,并使用eta配合导数值对其偏移;偏移前后J值之差的绝对值小于epsilon时,视为找到极值点
import numpy as np
import matplotlib.pyplot as plt

plot_x = np.linspace(-1., 6., 141)
plot_y = (plot_x-2.5)**2 - 1.
plt.plot(plot_x, plot_y)
plt.show()

epsilon = 1e-8 # 精度误差
eta = 0.1 # 学习率

def J(theta):
    return (theta-2.5)**2 - 1.

def dJ(theta):
    return 2*(theta-2.5)

theta = 0.0
while True:
    gradient = dJ(theta)
    last_theta = theta
    theta = theta - eta * gradient

    if abs(J(theta) - J(last_theta)) < epsilon:
        break
print(theta)
print(J(theta))

优化

# 新增异常判断,避免溢出
def J(theta):
    try:
        return (theta-2.5)**2 - 1.
    except:
        return float('inf')

# 限制迭代次数,避免死循环
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    theta_history.append(initial_theta)

    while i_iter < n_iters:
        gradient = dJ(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if abs(J(theta) - J(last_theta)) < epsilon:
            break

        i_iter += 1
    return

# 绘制梯度下降图示
def plot_theta_history():
    plt.plot(plot_x, J(plot_x))
    plt.plot(np.array(theta_history), J(np.array(theta_history)), color="r", marker='+')
    plt.show()

6-3 线性回归中的梯度下降法

image

Xb:由样本集合X追加全为1的第一列构成的矩阵

θ:由θ0 - θn组成的向量

二者dot结果就是各分量对应相乘再相加

image

去掉M带来的影响(不然样本越多,梯度越大)

只看一个样本,相当于去掉求和符号,此时相当于每一个维度求一次偏导得到该样本的梯度

梯度下降法相当于对所有样本进行一次操作,之后求平均得到总的平均梯度
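用公式概括,线性回归损失函数的梯度为(6-5 节的向量化代码即按此实现):

$$\nabla J(\theta) = \frac{2}{m} X_b^T (X_b\theta - y)$$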

image

6-4 实现线性回归中的梯度下降法

 import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3. + 4. + np.random.normal(size=100)
X = x.reshape(-1, 1)
print(X.shape)
print(y.shape)

plt.scatter(x, y)
plt.show()


def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(X_b)
    except:
        return float('inf')

def dJ(theta, X_b, y):
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    return res * 2 / len(X_b)

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0

    while i_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient

        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break

        i_iter += 1
    return theta

X_b = np.hstack([np.ones((len(x), 1)), x.reshape(-1, 1)])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

theta = gradient_descent(X_b, y, initial_theta, eta)
print(theta) # [4.02145786 3.00706277]

# 使用封装了梯度下降的线性回归算法
from playML.LinearRegression import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit_gd(X, y)
print(lin_reg.coef_) # [3.00706277]
print(lin_reg.intercept_) # 4.021457858204859

修改LinearRegression

  • 新增fit_gd方法
def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
    """根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型"""
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"

    def J(theta, X_b, y):
        try:
            return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        res = np.empty(len(theta))
        res[0] = np.sum(X_b.dot(theta) - y)
        for i in range(1, len(theta)):
            res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
        return res * 2 / len(X_b)

    def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
        theta = initial_theta
        cur_iter = 0

        while cur_iter < n_iters:
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            theta = theta - eta * gradient
            if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                break

            cur_iter += 1

        return theta

    X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
    initial_theta = np.zeros(X_b.shape[1])
    self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

    self.intercept_ = self._theta[0]
    self.coef_ = self._theta[1:]

    return self

6-5 梯度下降法的向量化和数据标准化

image image image
def dJ(theta, X_b, y):
    # res = np.empty(len(theta))
    # res[0] = np.sum(X_b.dot(theta) - y)
    # for i in range(1, len(theta)):
    #     res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    # return res * 2 / len(X_b)
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(X_b)

测试使用梯度下降法

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
print(X.shape) # (490, 13)

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
print(reg.score(X_test,y_test)) # 0.8129794056212907

# 数据没有归一化,导致模型不收敛(overflow)
# lin_reg2 = LinearRegression()
# lin_reg2.fit_gd(X_train, y_train)
# print(lin_reg2.fit_gd(X_test, y_test))

# scikit-learn中的StandardScaler
from sklearn.preprocessing import StandardScaler # 使用sklearn中的StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train) # 计算得到关键信息
X_train_standard = standardScaler.transform(X_train) # 返回归一化后结果
X_test_standard = standardScaler.transform(X_test)

lin_reg3 = LinearRegression()
lin_reg3.fit_gd(X_train_standard, y_train, eta=0.0001, n_iters=1e4)
print(lin_reg3.score(X_test_standard, y_test))
image

6-6 随机梯度下降法

image

学习率应该随着循环次数的增加逐渐递减

image
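课程采用模拟退火思路的学习率衰减(与下方 sgd 代码中的 learning_rate 一致,t0=5、t1=50 为经验值):

$$\eta = \frac{t_0}{i\_iters + t_1}$$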
import numpy as np
import matplotlib.pyplot as plt

m = 100000

x = np.random.normal(size=m)
X = x.reshape(-1,1)
y = 4.*x + 3. + np.random.normal(0, 3, size=m)
plt.scatter(x, y)
plt.show()


def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
    except:
        return float('inf')


def dJ(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)


def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break

        cur_iter += 1

    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, initial_theta, eta)

print(theta) # [3.00538344 3.98886975]

def dJ_sgd(theta, X_b_i, y_i):
    return 2 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)

def sgd(X_b, y, initial_theta, n_iters):

    t0, t1 = 5, 50
    def learning_rate(t):
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))  # 随机取一个样本
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])  # 计算该样本的梯度
        theta = theta - learning_rate(cur_iter) * gradient  # 移动theta

    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=m//3)
print(theta) # array([2.93222467, 4.02957069])

6-7 scikit-learn中的随机梯度下降法

from sklearn import datasets

boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

from playML.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter_no_change=50)
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test) # 0.8124437321129272

6-8 如何确定梯度计算的准确性?调试梯度下降法

image image image

↑ 实际使用中,梯度的调试通常很慢,可用于前期测试
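调试梯度的思路是对每个维度用对称差分近似偏导(与下方 dJ_debug 一致,$e_i$ 为第 i 个维度的单位向量):

$$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}$$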

import numpy as np
import matplotlib.pyplot as plt

# 构造本节测试数据
np.random.seed(666)
X = np.random.random(size=(1000, 10))

true_theta = np.arange(1, 12, dtype=float) # 1(截距) + 10(特征) = 11个
X_b = np.hstack([np.ones((len(X), 1)), X])
y = X_b.dot(true_theta) + np.random.normal(size=1000)

print(X.shape)
print(y.shape)
print(true_theta)

def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(X_b)
    except:
        return float('inf')

def dJ_math(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res


def gradient_descent(dJ, X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        cur_iter += 1
    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

%time theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta)
theta

%time theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta)
theta

6-9 有关梯度下降法的更多深入讨论

批量梯度下降法:每次求梯度,需要把所有样本数据看一遍,优点是稳定

随机梯度下降法:每次求梯度只随机取一个样本的梯度,优点是快

小批量梯度下降法:每次随机看k个样本

image image image
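课程没有给出小批量梯度下降的实现,下面是一个极简草稿(假设沿用 6-5 节向量化的梯度公式和 6-6 节的学习率衰减;k、t0、t1 均为示例取值):

```python
import numpy as np

def dJ_mini_batch(theta, X_b_batch, y_batch):
    # 只用一小批样本估计梯度
    return X_b_batch.T.dot(X_b_batch.dot(theta) - y_batch) * 2. / len(y_batch)

def mini_batch_gd(X_b, y, initial_theta, k=16, n_iters=10000, t0=5, t1=50):
    theta = initial_theta
    for cur_iter in range(n_iters):
        batch = np.random.randint(0, len(X_b), size=k)  # 随机取k个样本的索引
        gradient = dJ_mini_batch(theta, X_b[batch], y[batch])
        theta = theta - t0 / (cur_iter + t1) * gradient  # 学习率随迭代递减
    return theta
```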

第7章 PCA与梯度上升法

7-1 什么是PCA

image image image image image

步骤1:demean

image image

步骤2:求方差最大值

image

由于X均值为0,故化简为如下式子

image image image
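目标函数用公式表示为(demean 之后均值为 0,与 7-3 节代码中的 f(w, X) 一致):

$$f(w) = \frac{1}{m}\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)^2$$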

不同于线性回归:

  • 线性回归是确保预测值和真值之间最小
  • PCA是确保各元素间差值最大(方差最大)
image

7-2 使用梯度上升法求解PCA问题

把w视为自变量,每次求导后都往极值点方向走一些(w记得每次要单位化)

image image image

X 为 m×n 矩阵,w 为 n×1 向量,则 Xw 为 m×1:

  • $(Xw)^T X$:(1×m)·(m×n) = 1×n,是行向量
  • $((Xw)^T X)^T = X^T(Xw)$:n×1,即我们想要的梯度列向量

故 $\nabla f(w) = \frac{2}{m} X^T (Xw)$,对应 7-3 节代码中的 df_math。

image

7-3 求数据的主成分PCA

import numpy as np
import matplotlib.pyplot as plt
X = np.empty((100, 2))
X[:,0] = np.random.uniform(0., 100., size=100)
X[:,1] = 0.75 * X[:,0] + 3. + np.random.normal(0, 10., size=100)
plt.scatter(X[:,0], X[:,1])
plt.show()

def demean(X):
    return X - np.mean(X, axis=0)  # np.mean(X, axis=0):沿行方向求均值,即每列的均值

X_demean = demean(X)
plt.scatter(X_demean[:,0], X_demean[:,1])
plt.show()


# 梯度上升法
def f(w, X):
    return np.sum((X.dot(w) ** 2)) / len(X)


def df_math(w, X):
    return X.T.dot(X.dot(w)) * 2. / len(X)


def df_debug(w, X, epsilon=0.0001):
    res = np.empty(len(w))
    for i in range(len(w)):
        w_1 = w.copy()
        w_1[i] += epsilon
        w_2 = w.copy()
        w_2[i] -= epsilon
        res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon)
    return res


def direction(w):
    return w / np.linalg.norm(w)


def gradient_ascent(df, X, initial_w, eta, n_iters=1e4, epsilon=1e-8):
    w = direction(initial_w)
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = df(w, X)
        last_w = w
        w = w + eta * gradient
        w = direction(w)  # 注意1:每次求一个单位方向
        if abs(f(w, X) - f(last_w, X)) < epsilon:
            break

        cur_iter += 1

    return w


initial_w = np.random.random(X.shape[1]) # 注意2:不能用0向量开始
eta = 0.001
gradient_ascent(df_math, X_demean, initial_w, eta)

w = gradient_ascent(df_math, X_demean, initial_w, eta)

plt.scatter(X_demean[:,0], X_demean[:,1])
plt.plot([0, w[0]*30], [0, w[1]*30], color='r') # 表示(0,0)和(w[0] * 30, w[1] * 30)两个点构成的直线
plt.show()

# 使用极端数据集测试
X2 = np.empty((100, 2))
X2[:,0] = np.random.uniform(0., 100., size=100)
X2[:,1] = 0.75 * X2[:,0] + 3.
plt.scatter(X2[:,0], X2[:,1])
plt.show()

X2_demean = demean(X2)
w2 = gradient_ascent(df_math, X2_demean, initial_w, eta)
plt.scatter(X2_demean[:,0], X2_demean[:,1])
plt.plot([0, w2[0]*30], [0, w2[1]*30], color='r')
plt.show()

7-4 求数据的前n个主成分

image
import numpy as np
import matplotlib.pyplot as plt

X = np.empty((100, 2))
X[:,0] = np.random.uniform(0., 100., size=100)
X[:,1] = 0.75 * X[:,0] + 3. + np.random.normal(0, 10., size=100)

def demean(X):
    return X - np.mean(X, axis=0)

X = demean(X)
plt.scatter(X[:,0], X[:,1])
plt.show()


def f(w, X):
    return np.sum((X.dot(w) ** 2)) / len(X)


def df(w, X):
    return X.T.dot(X.dot(w)) * 2. / len(X)


def direction(w):
    return w / np.linalg.norm(w)


def first_component(X, initial_w, eta, n_iters=1e4, epsilon=1e-8):
    w = direction(initial_w)
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = df(w, X)
        last_w = w
        w = w + eta * gradient
        w = direction(w)
        if abs(f(w, X) - f(last_w, X)) < epsilon:
            break

        cur_iter += 1

    return w


initial_w = np.random.random(X.shape[1])
eta = 0.01
w = first_component(X, initial_w, eta)
w

# Second component, loop version: subtract each sample's projection onto w
X2 = np.empty(X.shape)
for i in range(len(X)):
    X2[i] = X[i] - X[i].dot(w) * w

plt.scatter(X2[:, 0], X2[:, 1])
plt.show()

# Second component, vectorized version
X2 = X - X.dot(w).reshape(-1, 1) * w
plt.scatter(X2[:,0], X2[:,1])
plt.show()

# The first and second components are orthogonal: their dot product is ~0
w2 = first_component(X2, initial_w, eta)
w2
w.dot(w2)

def first_n_components(n, X, eta=0.01, n_iters=1e4, epsilon=1e-8):
    X_pca = X.copy()
    X_pca = demean(X_pca)
    res = []
    for i in range(n):
        initial_w = np.random.random(X_pca.shape[1])
        w = first_component(X_pca, initial_w, eta)
        res.append(w)

        X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w

    return res

first_n_components(2, X)  # [array([0.75772863, 0.65256978]), array([ 0.65257463, -0.75772445])]

7-5 Mapping High-Dimensional Data to Low Dimensions

k: the number of leading principal components kept (k < n)

n: the number of features

m: the number of samples

image image

Mapping back from low dimension to high dimension does not recover the original data exactly; see the formulas below
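
In matrix form (standard for this method; W_k is the k×n matrix whose rows are the components, i.e. components_ in the class below):

$$
X_k = X \cdot W_k^T \ (m \times k), \qquad X_{\text{restore}} = X_k \cdot W_k \ (m \times n)
$$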

import numpy as np


class PCA:

    def __init__(self, n_components):
        """Initialize PCA"""
        assert n_components >= 1, "n_components must be valid"
        self.n_components = n_components
        self.components_ = None

    def fit(self, X, eta=0.01, n_iters=1e4):
        """Find the first n_components principal components of the dataset X"""
        assert self.n_components <= X.shape[1], \
            "n_components must not be greater than the feature number of X"

        def demean(X):
            return X - np.mean(X, axis=0)

        def f(w, X):
            return np.sum((X.dot(w) ** 2)) / len(X)

        def df(w, X):
            return X.T.dot(X.dot(w)) * 2. / len(X)

        def direction(w):
            return w / np.linalg.norm(w)

        def first_component(X, initial_w, eta=0.01, n_iters=1e4, epsilon=1e-8):

            w = direction(initial_w)
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = df(w, X)
                last_w = w
                w = w + eta * gradient
                w = direction(w)
                if abs(f(w, X) - f(last_w, X)) < epsilon:
                    break

                cur_iter += 1

            return w

        X_pca = demean(X)
        self.components_ = np.empty(shape=(self.n_components, X.shape[1]))
        for i in range(self.n_components):
            initial_w = np.random.random(X_pca.shape[1])
            w = first_component(X_pca, initial_w, eta, n_iters)
            self.components_[i,:] = w

            X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w

        return self

    def transform(self, X):
        """Map the given X onto the principal components"""
        assert X.shape[1] == self.components_.shape[1]

        return X.dot(self.components_.T)

    def inverse_transform(self, X):
        """Map the given X back into the original feature space"""
        assert X.shape[1] == self.components_.shape[0]

        return X.dot(self.components_)

    def __repr__(self):
        return "PCA(n_components=%d)" % self.n_components

Reducing the dimension, then mapping back up

image

Test

import numpy as np
import matplotlib.pyplot as plt

X = np.empty((100, 2))
X[:,0] = np.random.uniform(0., 100., size=100)
X[:,1] = 0.75 * X[:,0] + 3. + np.random.normal(0, 10., size=100)

from playML.PCA import PCA

pca = PCA(n_components=2)
pca.fit(X)  # PCA(n_components=2)
pca.components_
"""
array([[ 0.76676948, 0.64192256],
       [-0.64191827, 0.76677307]])
"""

pca = PCA(n_components=1)
pca.fit(X)  # PCA(n_components=1)
X_reduction = pca.transform(X)
X_reduction.shape  # (100, 1)

X_restore = pca.inverse_transform(X_reduction)
X_restore.shape  # (100, 2)

plt.scatter(X[:,0], X[:,1], color='b', alpha=0.5)
plt.scatter(X_restore[:,0], X_restore[:,1], color='r', alpha=0.5)
plt.show()

7-6 PCA in scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
X_train.shape  # (1347, 64)

%%time
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)  # Wall time: 67.4 ms
knn_clf.score(X_test, y_test)  # 0.9866666666666667

# After dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # reduce to 2 dimensions
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

%%time
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_reduction, y_train)  # Wall time: 1.2 ms
knn_clf.score(X_test_reduction, y_test)  # 0.6066666666666667

# Variance explained by the principal components
pca.explained_variance_ratio_  # array([0.14566817, 0.13735469])

from sklearn.decomposition import PCA

pca = PCA(n_components=X_train.shape[1])
pca.fit(X_train)
pca.explained_variance_ratio_  # variance explained by each component (decreasing as the component index grows)

plt.plot([i for i in range(X_train.shape[1])],
         [np.sum(pca.explained_variance_ratio_[:i]) for i in range(X_train.shape[1])])
plt.show()

pca = PCA(0.95)  # pick n_components so that 95% of the variance is kept
pca.fit(X_train)
pca.n_components_  # 28
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

%%time
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_reduction, y_train)  # Wall time: 2.32 ms
knn_clf.score(X_test_reduction, y_test)  # 0.98

# Use PCA to reduce the data to 2D for visualization
pca = PCA(n_components=2)
pca.fit(X)
X_reduction = pca.transform(X)

for i in range(10):
    plt.scatter(X_reduction[y==i,0], X_reduction[y==i,1], alpha=0.8)
plt.show()

7-7 Trying the MNIST Dataset

image
import numpy as np

# from sklearn.datasets import fetch_mldata
# mnist = fetch_mldata('MNIST original')
# In recent versions of sklearn fetch_mldata is deprecated; use fetch_openml
# to get the MNIST dataset instead, as below. The rest of the code is unchanged.
# (Depending on your sklearn version this may return a DataFrame; passing
# as_frame=False forces plain numpy arrays.)

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784')

X, y = mnist['data'], mnist['target']
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)

X_train.shape  # (60000, 784)
X_test.shape  # (10000, 784)

# kNN on the raw data
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train, y_train)
%time knn_clf.score(X_test, y_test)  # 0.9688

# Reduce the dimensionality with PCA
from sklearn.decomposition import PCA

pca = PCA(0.90)
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)
X_train_reduction.shape  # (60000, 87)

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train_reduction, y_train)
%time knn_clf.score(X_test_reduction, y_test)  # 0.9728 — PCA also strips noise, so accuracy can even improve!

Chapter 8: Polynomial Regression and Model Generalization

8-1 What is Polynomial Regression

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/01-What-is-Polynomial-Regression

image
import numpy as np 
import matplotlib.pyplot as plt
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, 100)
plt.scatter(x, y)
plt.show()

Linear regression?

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(x, y_predict, color='r')
plt.show()

Solution: add a feature

  • Treating X^2 as just another linear-regression feature lets us reuse plain linear regression
X2 = np.hstack([X, X**2])  # add X**2 as a second feature column
X2.shape # (100, 2)

lin_reg2 = LinearRegression()
lin_reg2.fit(X2, y)
y_predict2 = lin_reg2.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r')
plt.show()

lin_reg2.coef_ # array([ 0.99870163, 0.54939125])
lin_reg2.intercept_ # 1.8855236786516001

8-2 Polynomial Regression and Pipeline in scikit-learn

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/02-Polynomial-Regression-in-scikit-learn

image
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, 100)

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly.fit(X)
X2 = poly.transform(X)
print(X2.shape)  # (100, 3): columns are 1, x, x^2

print(X[:5,:])
print(X2[:5,:])

from sklearn.linear_model import LinearRegression

lin_reg2 = LinearRegression()
lin_reg2.fit(X2, y)
y_predict2 = lin_reg2.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r')
plt.show()

print(lin_reg2.coef_)  # [0. 1.00753239 0.44398241]
print(lin_reg2.intercept_)  # 2.0793848909176553

# More on PolynomialFeatures
X = np.arange(1, 11).reshape(-1, 2)
print(X)
"""
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
"""
poly = PolynomialFeatures(degree=2)
poly.fit(X)
X2 = poly.transform(X)
print(X2.shape)  # (5, 6): columns are 1, x1, x2, x1^2, x1*x2, x2^2
print(X2)

# Pipeline: pass a list of (name, step) pairs, executed in order
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, 100)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

poly_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression())
])

poly_reg.fit(X, y)
y_predict = poly_reg.predict(X)

plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()

8-3 Overfitting and Underfitting

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/03-Overfitting-and-Underfitting

8-4 Why We Need a Train/Test Split

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/04-Why-Train-Test-Split

image image image image

8-5 Learning Curves

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/05-Learning-Curve

image

Underfitting: the error is large overall (on both training and test data)

image

Overfitting: the error on the test set is large

image

8-6 Validation Set and Cross-Validation

image image image
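
A minimal sketch of k-fold cross-validation in sklearn (the kNN model and cv=5 are illustrative choices, not from the notes):

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

knn_clf = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn_clf, X, y, cv=5)  # 5 folds: train on 4, validate on 1, rotate
scores.mean()  # average validation accuracy over the 5 folds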

8-7 The Bias-Variance Trade-off

image image image image image image image image image

8-8 Model Generalization and Ridge Regression

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/08-Model-Regularization-and-Ridge-Regression

Model regularization, intuitively: an overfitted model has very large coefficients, so we add a penalty term that shrinks them, improving the model's ability to generalize (see the sketch below)

image image
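
A minimal sketch of ridge regression, reusing the polynomial pipeline pattern from 8-2 (the degree and alpha values here are illustrative):

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

ridge_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=20)),
    ("std_scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0))   # alpha scales the regularization term
])
ridge_reg.fit(X, y)  # X, y as in the polynomial regression examples above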

8-9 LASSO

image image image

With Ridge, every component of θ stays nonzero along the whole path, sliding smoothly toward 0 along the gradient

image

With LASSO, θ tends to drop to 0 one axis at a time (some coefficients become exactly zero)

image

8-10 L1, L2 and Elastic Net

image image image

L0 regularization: make the number of nonzero components of θ as small as possible

image image
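
For reference, the regularized objectives side by side (standard definitions; α is the regularization strength, r the elastic-net mixing ratio):

$$
\text{Ridge (L2):}\ J(\theta) = \text{MSE}(y, \hat y) + \alpha \sum_{i=1}^{n} \theta_i^2
\qquad
\text{LASSO (L1):}\ J(\theta) = \text{MSE}(y, \hat y) + \alpha \sum_{i=1}^{n} |\theta_i|
$$

$$
\text{Elastic Net:}\ J(\theta) = \text{MSE}(y, \hat y) + r\,\alpha \sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2
$$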

Chapter 9: Logistic Regression

Solves: classification

9-1 What is Logistic Regression

In one sentence: map values from (-∞, +∞) into (0, 1) and classify on the result (above 0.5 → class 1, below 0.5 → class 0)

image image image

The Sigmoid function

image image image
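
The function itself (standard definition, used throughout this chapter):

$$
\sigma(t) = \frac{1}{1 + e^{-t}}, \qquad \hat p = \sigma(\theta^T x_b), \qquad \hat y = \begin{cases} 1, & \hat p \ge 0.5 \\ 0, & \hat p < 0.5 \end{cases}
$$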

9-2 The Loss Function of Logistic Regression

image

When the argument X is positive, the usual curve-shifting rule holds (adding inside shifts left, subtracting shifts right); when X is negative the rule flips (this only applies to the final step, once the sign of X is settled)

image

Averaging the per-sample errors gives the loss function

image image image

X_b·θ: ranges over (-∞, +∞)

Prediction via Sigmoid: ranges over (0, 1)

J(θ): the loss function; the error is largest when the true value is 1 but the prediction is 0, and likewise when the true value is 0 but the prediction is 1
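
Written out, this is the standard cross-entropy form the chapter derives:

$$
\text{cost} = -y \log(\hat p) - (1 - y)\log(1 - \hat p), \qquad
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat p^{(i)} + (1 - y^{(i)}) \log (1 - \hat p^{(i)}) \right]
$$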

9-3 The Gradient of the Logistic Loss

image image

First half

image

Second half

image image

Combined

image image

This gives the derivative of the loss with respect to a single component θ_j; the gradient averages over all samples

ŷ (yhat): the logistic regression estimate (between 0 and 1)

image image image

Conclusion

image
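
The vectorized result (standard, and pleasingly analogous to the linear regression gradient):

$$
\nabla J(\theta) = \frac{1}{m} X_b^T \left( \sigma(X_b \theta) - y \right)
$$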

9-7 Logistic Regression in scikit-learn

image
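
A minimal sketch of what this section demonstrates, following the course's pipeline pattern (the degree and C values are illustrative; X_train/y_train stand for any two-class training set):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# C scales the data-fit term (like SVM's C): smaller C = stronger regularization
poly_log_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("log_reg", LogisticRegression(C=1.0, penalty="l2"))
])
poly_log_reg.fit(X_train, y_train)
poly_log_reg.score(X_test, y_test)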

9-8 OvR and OvO

An intuitive explanation: https://wenku.baidu.com/view/c13c613ac181e53a580216fc700abb68a982adc4.html

image

OvR is faster; OvO is more accurate

image image
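
sklearn exposes both strategies as generic wrappers; a minimal sketch (the wrapped classifier is an arbitrary choice):

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.linear_model import LogisticRegression

ovr = OneVsRestClassifier(LogisticRegression())  # n classes -> n binary problems
ovo = OneVsOneClassifier(LogisticRegression())   # n classes -> n*(n-1)/2 binary problems
ovr.fit(X_train, y_train)
ovo.fit(X_train, y_train)  # slower, usually a bit more accurate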

Chapter 10: Evaluating Classification Results

10-1 The Accuracy Trap and the Confusion Matrix

image image

Mnemonic: the second letter is whatever was predicted (P or N); if prediction and truth (the second and first letters) disagree it is F, if they agree it is T
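
Laid out as a 2×2 table (rows = true class, columns = predicted class, which is also the convention of sklearn's confusion_matrix):

| | Predicted 0 | Predicted 1 |
|---|---|---|
| True 0 | TN | FP |
| True 1 | FN | TP |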

10-2 Precision and Recall

Precision: among everything predicted positive, the fraction that is actually positive (the "subjective" view: how trustworthy the positive predictions are)

Recall: among everything actually positive, the fraction that was caught (the "objective" view: how much of the real positives were found)

image image image

On extremely skewed data accuracy is meaningless; look at precision and recall instead

image
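
In terms of the confusion matrix entries:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
$$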

10-4 F1 Score

image image image

Harmonic mean: F1 is large only when both precision and recall are large (if either one is small, F1 is small)

image image
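
The formula behind the figures:

$$
F1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$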

10-5 Balancing Precision and Recall

image

X axis: decision threshold

Y axis: precision and recall

image

X axis: precision

Y axis: recall

image

The larger the area under the curve, the better the model

image
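
A minimal sketch of drawing these curves with sklearn (log_reg stands for any fitted classifier exposing decision_function; the name is illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

decision_scores = log_reg.decision_function(X_test)
precisions, recalls, thresholds = precision_recall_curve(y_test, decision_scores)
# precisions and recalls have one element more than thresholds
plt.plot(thresholds, precisions[:-1])
plt.plot(thresholds, recalls[:-1])
plt.show()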

10-7 The ROC Curve

image

TPR is just recall

image image
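
The two rates plotted by the ROC curve:

$$
TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}
$$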

In video anomaly detection:

  • TPR: among all anomalous videos, the fraction predicted anomalous (want it high)
  • FPR: among all normal videos, the fraction predicted anomalous (want it low)
  • In the figure below, read the stars as anomalous videos and the circles as normal ones
image

Remember: the larger the area under the ROC curve, the better

image

Choosing between the PR curve and the ROC curve: https://coding.imooc.com/learn/questiondetail/42693.html

The core difference between the PR curve and the ROC curve is TN: the PR curve does not reflect TN at all. So if TN does not matter in your application, the PR curve is a good metric (in fact, precision and recall strip out TN precisely to remove the influence of extremely skewed data, magnifying the relationships among FP, FN and TP).

Chapter 11: Support Vector Machines (SVM)

Solves: classification and regression

11-1 What is SVM

image image

Soft Margin SVM can handle linearly inseparable data

11-2 The Optimization Problem behind SVM

image image image image

Rewrite w_d as w, and similarly for the remaining symbols

image image

A constrained optimization problem

image
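
In symbols, the hard-margin problem the figures arrive at (standard form):

$$
\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1, \; i = 1, \dots, m
$$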

11-3 Soft Margin SVM

Problems with Hard Margin SVM

It can be dominated by a few unusual points, hurting generalization

image

Linearly inseparable data

image image image

The larger C is, the smaller the tolerance for errors; the smaller C, the larger the tolerance

image image

In sklearn the hyperparameter C multiplies the error (slack) term rather than the ½‖w‖² regularization term, so a larger C means a lower tolerance for misclassification; see the objective below
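
The L1-regularized soft-margin objective (standard form; an L2 variant squares the ζ_i):

$$
\min_{w, b, \zeta} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \zeta_i \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1 - \zeta_i, \; \zeta_i \ge 0
$$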

11-4 SVM in scikit-learn

Why the data must be standardized

image image

11-6 到底什么是核函数

本节巨懵逼

怎么转换和转换后为何是这样都没有说清楚呃。。。

image

大致意思就是:将原样本数据带入K这个核函数即可完成某种特征的转换

image

大致意思:把x和y看作是两个向量,使用K函数将x,y映射到x’和y’。x’和y’形式一致,如下所示,从1开始,直到xn^2^

  • 不难发现,映射后的x’有0~2次项,相当于升维,但此法使用函数映射在运行中升维,降低存储空间
image image
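
For the degree-2 polynomial kernel discussed here, the standard identity is

$$
K(x, y) = (x^T y + 1)^2 = \phi(x)^T \phi(y), \qquad
\phi(x) = \left(1,\ \sqrt{2}\,x_1, \dots, \sqrt{2}\,x_n,\ x_1^2, \dots, x_n^2,\ \sqrt{2}\,x_1 x_2, \dots, \sqrt{2}\,x_{n-1} x_n\right)
$$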

11-7 The RBF Kernel

image image image

Example: fix two landmark points l1 and l2

image image

In actual use, every sample serves as a landmark

image
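
The Gaussian (RBF) kernel itself:

$$
K(x, y) = e^{-\gamma \|x - y\|^2}
$$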

11-8 The gamma Parameter of the RBF Kernel

image
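
A minimal sketch of tuning gamma, using the standardize-then-SVC pattern from 11-4 (the value 1.0 is illustrative). A larger gamma narrows the Gaussian around each landmark, so the decision boundary becomes more local and more prone to overfitting:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rbf_svc = Pipeline([
    ("std_scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", gamma=1.0))  # larger gamma -> more local boundary
])
rbf_svc.fit(X_train, y_train)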

11-9 SVM for Regression

Fit the margin so that as many points as possible fall inside it; the final prediction takes the average (the middle of the margin)

image

Chapter 12: Decision Trees

Non-parametric learning

Solves: classification and regression

12-1 What Is a Decision Tree

image image image image

12-2 Information Entropy

image

p_i: the proportion of the data belonging to class i

The data on the right is more certain (lower entropy) than the data on the left

image image
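
The definition behind the figures:

$$
H = -\sum_{i=1}^{k} p_i \log(p_i)
$$

For two classes this is H = -x\log(x) - (1-x)\log(1-x): it peaks at x = 0.5 (maximum uncertainty) and reaches 0 when one class has probability 1.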

12-3 Finding the Best Split with Information Entropy

One split: for each feature, take the values of two neighboring samples on that feature and average them as a candidate threshold; compute the information entropy on both sides of each candidate split and keep the split with the lowest entropy (see the sketch below)

Repeating the split: starting from one split, either set aside the side whose entropy is already minimal (close to 0) and split the remainder again, or keep splitting on the other features
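
A minimal sketch of that search (in the spirit of the course notebook; the function names and the sample-weighted entropy combination are my own choices):

import numpy as np
from collections import Counter
from math import log

def entropy(y):
    counter = Counter(y)
    return -sum(p * log(p) for p in (c / len(y) for c in counter.values()))

def try_split(X, y):
    """Search every feature and every midpoint between sorted neighbors."""
    best_entropy, best_d, best_v = float('inf'), -1, -1
    for d in range(X.shape[1]):
        sorted_index = np.argsort(X[:, d])
        for i in range(1, len(X)):
            if X[sorted_index[i-1], d] != X[sorted_index[i], d]:
                v = (X[sorted_index[i-1], d] + X[sorted_index[i], d]) / 2
                y_left, y_right = y[X[:, d] <= v], y[X[:, d] > v]
                # entropy of the split, weighted by the size of each side
                e = (len(y_left) * entropy(y_left) + len(y_right) * entropy(y_right)) / len(y)
                if e < best_entropy:
                    best_entropy, best_d, best_v = e, d, v
    return best_entropy, best_d, best_v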

12-4 The Gini Index

image
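
The formula in the figure:

$$
G = 1 - \sum_{i=1}^{k} p_i^2
$$

Like entropy, it is 0 for a pure node and maximal when the classes are evenly mixed; sklearn's trees use it by default since it avoids the logarithm.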

12-5 CART and Decision Tree Hyperparameters

image image image

max_depth: the maximum depth of the tree

min_samples_split: the minimum number of samples a node must hold before it may be split; larger values resist overfitting (too large underfits)

min_samples_leaf: the minimum number of samples a leaf node must keep; larger values resist overfitting (too large underfits)

max_leaf_nodes: the maximum number of leaf nodes, an indirect cap on depth (see the sketch after the figure below)

image
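
A minimal sketch of these knobs on sklearn's CART implementation (the values are illustrative):

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=4,
    max_leaf_nodes=16,
    criterion="gini"      # or "entropy"
)
dt_clf.fit(X_train, y_train)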

12-7 Limitations of Decision Trees

Decision boundaries are always axis-aligned (horizontal and vertical segments)

image

Highly sensitive to individual samples

See the code for details

Chapter 13: Ensemble Learning and Random Forests

13-1 What Is Ensemble Learning

image image

13-2 Soft Voting Classifier

image image image

Only models that can estimate probabilities can take part

image image image image

Soft voting usually beats hard voting

image
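
A minimal sketch (SVC needs probability=True to join a soft vote; the model choices are illustrative):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

voting_clf = VotingClassifier(estimators=[
    ("log_clf", LogisticRegression()),
    ("svm_clf", SVC(probability=True)),  # enables predict_proba for soft voting
    ("dt_clf", DecisionTreeClassifier())
], voting="soft")                        # "hard" would use a plain majority vote
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)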

13-3 Bagging and Pasting

image image image image image image image

13-4 oob (Out-of-Bag) and More on Bagging

The figures below state the conclusions without explaining why

image image image image
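
A minimal sketch tying 13-3 and 13-4 together (the parameter values are illustrative):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,   # number of sub-models
    max_samples=100,    # samples drawn for each sub-model
    bootstrap=True,     # True = bagging (with replacement); False = pasting
    oob_score=True,     # evaluate on the samples each sub-model never saw
    n_jobs=-1           # train sub-models in parallel
)
bagging_clf.fit(X, y)
bagging_clf.oob_score_  # out-of-bag estimate, no separate test set needed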

13-5 Random Forests and Extra-Trees

sklearn's random forest: at each node, the split is searched over a random subset of features (instead of over all features, as a plain decision tree does); see the sketch below

image image image
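
A minimal sketch of both (values illustrative):

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

rf_clf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1)
rf_clf.fit(X, y)

# Extra-Trees: the split thresholds are also drawn at random, trading a
# little extra bias for lower variance and faster training
et_clf = ExtraTreesClassifier(n_estimators=500, bootstrap=True, oob_score=True, n_jobs=-1)
et_clf.fit(X, y)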

13-6 Ada Boosting and Gradient Boosting

Classic ensemble models — Boosting (AdaBoost and gradient boosting): https://blog.csdn.net/jesseyule/article/details/111997597

  • AdaBoost and gradient boosting differ mainly in how they make the next round focus on the data the previous round got wrong: AdaBoost attaches weights to the samples and, after each round, raises the weights of the misclassified ones so that the next round pays them more attention.
image

Make mistakes → fix them, over and over

image image image

Reading order: left to right, top to bottom

  • Each green plot fits the residual errors left by the previous green plot
  • Each red plot is the sum of all the green plots so far
image image
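
A minimal sketch of both in sklearn (values illustrative; GradientBoostingClassifier always boosts decision trees, so no base learner is passed):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)

gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=30)
gb_clf.fit(X_train, y_train)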

13-7 Stacking

image image image

Chapter 14: More Machine Learning Algorithms

14-1 Learn from the scikit-learn Documentation — Good Luck, Everyone!

Official site: https://scikit-learn.org/stable/
