Python3 Machine Learning: Classic Algorithms and Applications (notes last updated 2022-06-23 16:13:53)
Git URL:https://git.imooc.com/coding-169/coding-169/src/master
Chapter 1: Welcome to Python3 Machine Learning
1-1 What is machine learning
1-2 What the course covers and its philosophy
1-3 The main technology stack used in the course
Chapter 2: Machine Learning Basics
2-1 Data in the world of machine learning
↑ Uppercase letters denote matrices, lowercase letters denote vectors
2-2 The main tasks of machine learning. The basic tasks are classification and regression
Classification
Regression
2-3 Supervised, unsupervised, semi-supervised and reinforcement learning
Supervised learning: the training data given to the machine carries "labels" or "answers"
Unsupervised learning
Semi-supervised learning
Reinforcement learning
2-4 Batch learning, online learning, parametric learning and non-parametric learning
Batch learning
Online learning
Parametric learning
Non-parametric learning
2-5 "Philosophical" reflections related to machine learning
An example without (pre-existing) data: AlphaGo
Chapter summary
2-7 Setting up the course environment. Anaconda official site: https://www.anaconda.com/
Chapter 3: Jupyter Notebook, numpy and matplotlib
3-1 Jupyter Notebook basics
Keyboard shortcuts
![image-20220623202116627](Python3入门机器学习 经典算法与应用.assets/image-20220623202116627.png)
![image-20220623201319046](Python3入门机器学习 经典算法与应用.assets/image-20220623201319046.png)
![image-20220623201916533](Python3入门机器学习 经典算法与应用.assets/image-20220623201916533.png)
Changing the syntax type of a cell
![image-20220623201941657](Python3入门机器学习 经典算法与应用.assets/image-20220623201941657.png)
Notebook advantage: a large dataset only needs to be loaded once and then stays available for the rest of the session
Reset (restart the kernel): the cells then have to be executed again from top to bottom
3-2 Magic commands in Jupyter Notebook
%run
: run a Python file from the current directory and bring its names into the notebook
%timeit
: time a single statement; under the hood it may run the statement many times to get a stable estimate
↑ In Python, building a list with a comprehension/generator expression is faster than an explicit for loop
%time
: time a statement, executing it only once
%lsmagic
: list all available magic commands
%xxx?
: show the documentation for the magic command xxx
3-3 Numpy data basics

```python
import numpy as np
print(np.__version__)

# a Python list can hold mixed types
L = [i for i in range(10)]
L[5] = 'Machine Learning'
print(L)

# array.array enforces a single type, but has no vector/matrix semantics
import array
arr = array.array('i', [i for i in range(10)])

# numpy arrays are typed; assigning a float into an int array truncates it
nparr = np.array([i for i in range(10)])
print(nparr.dtype)
nparr[5] = 3.14
print(nparr)          # the 3.14 is stored as 3

# if any element is a float, the whole array becomes float64
nparr2 = np.array([1, 2, 3.0])
print(nparr2.dtype)
```
3-4 Creating Numpy arrays (and matrices)
Create vectors or matrices filled with zeros, ones, or a custom value
arange: create a vector/matrix from a range and a step size
linspace: create a vector/matrix from a range and a number of points (both endpoints are included)
random: create random vectors and matrices
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 import numpy as npnp.zeros(10 ) print (np.zeros(10 ), np.zeros(10 ).dtype) print (np.zeros(10 , dtype=int )) print (np.zeros((3 , 5 )))print (np.zeros(shape=(3 , 5 ), dtype=int ))''' [[0 0 0 0 0] [0 0 0 0 0] [0 0 0 0 0]] ''' print (np.ones(10 )) print (np.ones((3 , 5 )))''' [[1. 1. 1. 1. 1.] [1. 1. 1. 1. 1.] [1. 1. 1. 1. 1.]] ''' print (np.full(shape = (3 , 5 ), fill_value = 666 ))''' [[666 666 666 666 666] [666 666 666 666 666] [666 666 666 666 666]] ''' print (np.full(shape = (3 , 5 ), fill_value = 666.0 ))''' [[666. 666. 666. 666. 666.] [666. 666. 666. 666. 666.] [666. 666. 666. 666. 666.]] ''' print ([i for i in range (0 , 20 , 2 )]) print (np.arange(0 , 20 , 2 )) print (np.arange(0 , 1 , 0.2 )) print (np.arange(0 , 10 )) print (np.arange(10 )) print (np.linspace(0 , 20 , 10 )) print (np.linspace(0 , 20 , 11 )) print (np.random.randint(0 , 10 )) print (np.random.randint(0 , 10 , 10 )) print (np.random.randint(4 , 8 , size=10 )) print (np.random.randint(4 , 8 , size=(3 , 5 )))''' [[7 6 6 7 7] [7 4 6 6 5] [5 6 4 7 7]] ''' np.random.seed(666 ) print (np.random.random()) print (np.random.random(10 )) print (np.random.random((3 , 5 )))''' [[0.74415417 0.192892 0.70084475 0.29322811 0.77447945] [0.00510884 0.11285765 0.11095367 0.24766823 0.0232363 ] [0.72732115 0.34003494 0.19750316 0.90917959 0.97834699]] ''' print (np.random.normal()) print (np.random.normal(10 , 100 )) print (np.random.normal(0 , 1 , (3 , 5 )))''' [[-1.75662522 0.84463262 0.27721986 0.85290153 0.1945996 ] [ 1.31063772 1.5438436 -0.52904802 -0.6564723 -0.2015057 ] [-0.70061583 0.68713795 -0.02607576 -0.82975832 0.29655378]] '''
Looking up documentation
3-5 Basic operations on Numpy arrays (and matrices)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 import numpy as npx = np.arange(10 ) print (x) X = np.arange(15 ).reshape(3 , 5 ) print (X)''' [[ 0 1 2 3 4] [ 5 6 7 8 9] [10 11 12 13 14]] ''' print (x.ndim) print (X.ndim) print (x.shape) print (X.shape) print (x.size) print (X.size) print (x[0 ], x[-1 ]) print (X[0 ][0 ]) print (X[(0 , 0 )], X[2 , 2 ]) print (x[0 :5 ]) print (x[:5 ]) print (x[5 :]) print (x[::2 ]) print (x[::-1 ]) print (X[:2 , :3 ]) ''' [[0 1 2] [5 6 7]] ''' print (X[:2 ]) ''' [[0 1 2 3 4] [5 6 7 8 9]] ''' print (X[:2 ][:3 ]) ''' [[0 1 2 3 4] [5 6 7 8 9]] ''' print (X[:2 , ::2 ]) ''' [[0 2 4] [5 7 9]] ''' print (X[::-1 , ::-1 ]) ''' [[14 13 12 11 10] [ 9 8 7 6 5] [ 4 3 2 1 0]] ''' print (X[0 ], X[0 , :]) print (X[:, 0 ]) subX = X[:2 , :3 ] subX[0 , 0 ] = 100 print (subX)''' [[100 1 2] [ 5 6 7]] ''' print (X) ''' [[100 1 2 3 4] [ 5 6 7 8 9] [ 10 11 12 13 14]] ''' subX = X[:2 , :3 ].copy() subX[0 , 0 ] = 99 print (subX)''' [[99 1 2] [ 5 6 7]] ''' print (X)''' [[100 1 2 3 4] [ 5 6 7 8 9] [ 10 11 12 13 14]] ''' print (x.reshape(2 , 5 )) ''' [[0 1 2 3 4] [5 6 7 8 9]] ''' print (x.reshape(10 , -1 )) ''' [[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]] ''' print (x.reshape(-1 , 10 ))
3-6 Concatenating and splitting Numpy arrays (and matrices)
concatenate: join arrays, along rows (axis=0) by default; the arrays must have the same number of dimensions
vstack/hstack: stack vertically/horizontally; the arrays do not need the same number of dimensions
split: split an array, along rows (axis=0) by default
vsplit/hsplit: split vertically/horizontally
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 import numpy as npx = np.array([1 , 2 , 3 ]) y = np.array([3 , 2 , 1 ]) print (np.concatenate([x, y])) z = np.array([666 , 666 , 666 ]) print (np.concatenate([x, y, z])) A = np.array([[1 , 2 , 3 ], [4 , 5 , 6 ]]) print (np.concatenate([A, A])) ''' [[1 2 3] [4 5 6] [1 2 3] [4 5 6]] ''' print (np.concatenate([A, A], axis=1 )) ''' [[1 2 3 1 2 3] [4 5 6 4 5 6]] ''' print (z.reshape(1 , -1 )) print (z.reshape(1 , -1 ).ndim) print (np.concatenate([A, z.reshape(1 , -1 )])) ''' [[ 1 2 3] [ 4 5 6] [666 666 666]] ''' A2 = np.concatenate([A, z.reshape(1 , -1 )]) print (np.vstack([A, z])) ''' [[ 1 2 3] [ 4 5 6] [666 666 666]] ''' B = np.full((2 , 2 ), 100 ) print (B)''' [[100 100] [100 100]] ''' print (np.hstack([A, B]))''' [[ 1 2 3 100 100] [ 4 5 6 100 100]] ''' x = np.arange(10 ) print (np.split(x, [3 , 7 ])) ''' [array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])] ''' print (np.split(x, [5 ]))''' [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])] ''' A = np.arange(16 ).reshape((4 , 4 )) A1, A2 = np.split(A, [2 ]) print (A1)''' [[0 1 2 3] [4 5 6 7]] ''' print (A2)''' [[ 8 9 10 11] [12 13 14 15]] ''' A1, A2 = np.split(A, [2 ], axis=1 ) print (A1)''' [[ 0 1] [ 4 5] [ 8 9] [12 13]] ''' print (A2)''' [[ 2 3] [ 6 7] [10 11] [14 15]] ''' upper, lower = np.vsplit(A, [2 ]) print (upper)''' [[0 1 2 3] [4 5 6 7]] ''' print (lower)''' [[ 8 9 10 11] [12 13 14 15]] ''' left, right = np.hsplit(A, [2 ]) print (left)''' [[ 0 1] [ 4 5] [ 8 9] [12 13]] ''' print (right)''' [[ 2 3] [ 6 7] [10 11] [14 15]] ''' data = np.arange(16 ).reshape((4 , 4 )) print (data)''' [[ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11] [12 13 14 15]] ''' X, y = np.hsplit(data, [-1 ]) print (X)''' [[ 0 1 2] [ 4 5 6] [ 8 9 10] [12 13 14]] ''' print (y)''' [[ 3] [ 7] [11] [15]] ''' print (y[:, 0 ])
3-7 Numpy中的矩阵运算 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 n = 10 L = [i for i in range (n)] print (2 * L) A = [] for e in L: A.append(2 * e) print (A) A = [2 *e for e in L] import numpy as npL = np.arange(n) A = np.array([2 *e for e in L]) A = 2 * L print (A) X = np.arange(1 , 16 ).reshape((3 , 5 )) print (X)''' [[ 1 2 3 4 5] [ 6 7 8 9 10] [11 12 13 14 15]] ''' print (X + 1 )''' [[ 2 3 4 5 6] [ 7 8 9 10 11] [12 13 14 15 16]] ''' print (X - 1 )''' [[ 0 1 2 3 4] [ 5 6 7 8 9] [10 11 12 13 14]] ''' print (X * 2 )''' [[ 2 4 6 8 10] [12 14 16 18 20] [22 24 26 28 30]] ''' print (X / 2 ) ''' [[0.5 1. 1.5 2. 2.5] [3. 3.5 4. 4.5 5. ] [5.5 6. 6.5 7. 7.5]] ''' print (X // 2 ) ''' [[0 1 1 2 2] [3 3 4 4 5] [5 6 6 7 7]] ''' print (X ** 2 ) ''' [[ 1 4 9 16 25] [ 36 49 64 81 100] [121 144 169 196 225]] ''' print (X % 2 ) ''' [[1 0 1 0 1] [0 1 0 1 0] [1 0 1 0 1]] ''' print (1 / X) print (np.abs (X)) print (np.sin(X)) print (np.cos(X))print (np.tan(X))print (np.exp(X)) print (np.power(3 , X)) print (np.log(X)) print (np.log2(X))print (np.log10(X))A = np.arange(4 ).reshape(2 , 2 ) B = np.full((2 , 2 ), 10 ) print (A + B)''' [[10 11] [12 13]] ''' print (A - B)''' [[-10 -9] [ -8 -7]] ''' print (A * B) ''' [[ 0 10] [20 30]] ''' print (A / B)''' [[0. 0.1] [0.2 0.3]] ''' print (A.dot(B)) ''' [[10 10] [50 50]] ''' print (A.T) ''' [[0 2] [1 3]] ''' v = np.array([1 , 2 ]) print (A)''' [[0 1] [2 3]] ''' print (v + A) ''' [[1 3] [3 5]] ''' print (np.vstack([v] * A.shape[0 ])) ''' [[1 2] [1 2]] ''' print (np.tile(v, (2 , 1 ))) ''' [[1 2] [1 2]] ''' print (v * A) ''' [[0 2] [2 6]] ''' print (v.dot(A)) print (A.dot(v)) invA = np.linalg.inv(A) print (invA)''' [[-1.5 0.5] [ 1. 0. ]] ''' print (A.dot(invA))''' [[1. 0.] [0. 1.]] ''' X = np.arange(16 ).reshape((2 , 8 )) pinvX = np.linalg.pinv(X) print (pinvX.shape) print (X.dot(pinvX))''' [[ 1.00000000e+00 -2.49800181e-16] [ 6.66133815e-16 1.00000000e+00]] '''
3-8 Numpy中的聚合运算 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 import numpy as npnp.random.seed(19991101 ) L = np.random.random(100 ) print (sum (L)) print (np.sum (L)) print (np.min (L)) print (np.max (L)) print (L.min ()) print (L.max ())print (L.sum ())X = np.arange(16 ).reshape(4 , -1 ) print (X)''' [[ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11] [12 13 14 15]] ''' print (np.sum (X)) print (np.sum (X, axis=0 )) print (np.sum (X, axis=1 )) print (np.prod(X)) print (np.prod(X + 1 )) print (np.mean(X)) print (np.median(X)) print (np.median(L)) print (np.percentile(L, q=50 )) for percent in [0 , 25 , 50 , 75 , 100 ]: print (np.percentile(L, q=percent)) ''' 0.0003132760524273692 0.2559220514470348 0.45331017330323375 0.6964296186068865 0.9931441196685156 ''' print (np.var(L)) print (np.std(L)) x = np.random.normal(0 , 1 , size=1000000 ) print (np.mean(x)) print (np.std(x))
3-9 Numpy中的arg运算 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 import numpy as npnp.random.seed(19991101 ) x = np.random.normal(0 , 1 , size=1000000 ) print (np.min (x)) print (np.argmin(x)) print (x[np.argmin(x)]) print (np.argmax(x)) x = np.arange(16 ) print (x) np.random.shuffle(x) print (x) x.sort() print (x) X = np.random.randint(10 , size=(4 , 4 )) print (X)''' [[2 9 2 3] [5 8 1 1] [9 4 5 7] [7 8 7 1]] ''' print (np.sort(X)) ''' [[2 2 3 9] [1 1 5 8] [4 5 7 9] [1 7 7 8]] ''' print (np.sort(X, axis=0 )) ''' [[2 4 1 1] [5 8 2 1] [7 8 5 3] [9 9 7 7]] ''' np.random.shuffle(x) print (x) print (np.argsort(x)) print (np.partition(x, 3 )) print (np.argpartition(x, 4 )) print (X)''' [[2 9 2 3] [5 8 1 1] [9 4 5 7] [7 8 7 1]] ''' print (np.argsort(X, axis=1 )) ''' [[0 2 3 1] [2 3 0 1] [1 2 3 0] [3 0 2 1]] ''' print (np.argsort(X, axis=0 )) ''' [[0 2 1 1] [1 1 0 3] [3 3 2 0] [2 0 3 2]] ''' print (np.argpartition(X, 2 , axis=1 )) ''' [[0 2 3 1] [2 3 0 1] [1 2 3 0] [3 2 0 1]] ''' print (np.argpartition(X, 2 , axis=1 )) ''' [[0 2 3 1] [2 3 0 1] [1 2 3 0] [3 2 0 1]] '''
3-10 Numpy中的比较和Fancy Indexing 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 import numpy as npx = np.arange(16 ) ind = [3 , 5 , 8 ] print (x[ind]) ind = np.array([[0 , 2 ], [1 , 3 ]]) print (x[ind]) ''' [[0 2] [1 3]] ''' X = x.reshape(4 , -1 ) print (X)''' [[ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11] [12 13 14 15]] ''' row = np.array([0 , 1 , 2 ]) col = np.array([1 , 2 , 3 ]) print (X[row, col]) ''' [ 1 6 11] ''' print (X[0 , col]) print (X[:2 , col]) ''' [[1 2 3] [5 6 7]] ''' col = [True , False , True , True ] print (X[1 :3 , col]) ''' [[ 4 6 7] [ 8 10 11]] ''' print (x) print (x < 3 ) ''' [ True True True False False False False False False False False False False False False False] ''' print (2 * x == 24 - 4 * x)''' [False False False False True False False False False False False False False False False False] ''' print (np.sum (x <= 3 )) print (np.count_nonzero(x <= 3 )) print (np.any (x == 0 )) print (np.all (x >= 0 )) print (np.sum (X % 2 == 0 )) print (np.sum (X % 2 == 0 , axis = 1 )) print (np.sum (X % 2 == 0 , axis = 0 )) print (np.all (X > 0 , axis = 1 )) print (x) print (np.sum ((x > 3 ) & (x < 10 ))) print (np.sum ((x % 2 == 0 ) | (x > 10 ))) print (np.sum (~(x==0 ))) print (x[x < 5 ]) print (x[x % 2 == 0 ]) print (X[X[:, 3 ] % 3 == 0 , :]) ''' [[ 0 1 2 3] [12 13 14 15]] '''
Personal notes: numpy summary
3-7 to 3-9: the fine-grained content there is hard to summarize; just refer back to those sections directly
Common attributes of array objects (accessed by default on objects created by numpy)

| Section | Attribute | Description | Example |
| ------- | --------- | ----------- | ------- |
| 3-3 | dtype | the element type of the array | int32, float64 |
| 3-5 | ndim | the number of dimensions | 1, 2 |
| 3-5 | shape | the shape of the array, returned as a tuple | (10,), (3, 5) |
| 3-5 | size | the total number of elements | 10, 15 |
Common operations on array objects

| Section | Example | Description |
| ------- | ------- | ----------- |
| 3-5 | x[0] | access an element of a 1-D vector |
| 3-5 | X[0, 0] | access an element of a matrix |
| 3-5 | x[0:5] | slice of a 1-D vector, returns a view, half-open interval. ① start index, defaults to 0; ② end index, defaults to the number of elements; ③ step, defaults to 1, and -1 means reverse order |
| 3-5 | X[:2, :3] | matrix slice, same rules as above (example: take the first two rows and first three columns) |
| 3-5 | X[:2, :3].copy() | copy by value |
| 3-5 | x.reshape(2, 5), x.reshape(10, -1) | reshape (change the dimensions), returns a new object (example 1: turn the 1-D vector x into a 2x5 matrix; example 2: turn x into a matrix with 10 rows, letting the column count be computed automatically) |
| 3-10 | ind = [3, 5, 8]; x[ind]; ind = np.array([[0, 2], [1, 3]]); x[ind] | Fancy indexing of a 1-D vector: declare an index structure in advance and the result follows that structure (example 1: a vector made of the values at indices 3, 5, 8; example 2: a 2-D matrix whose entries are taken from the 1-D vector) |
Common numpy functions

| Section | Function | Description | Example |
| ------- | -------- | ----------- | ------- |
| 3-3 | np.array | create a numpy array object; a 1-D array created by numpy can be called a vector, an n-D array a matrix. Arg ①: a list | np.array([i for i in range(10)]) |
| 3-4 | np.zeros | create an array of all zeros. Arg ①: shape (an int for 1-D, a tuple for n-D); arg ②: dtype (int, float, ...) | np.zeros(10, dtype=int); np.zeros((3, 5)); np.zeros(shape=(3, 5), dtype=int) |
| 3-4 | np.ones | create an array of all ones; otherwise the same as above | |
| 3-4 | np.full | create an array with a custom fill value; mostly the same as above. Arg ②: fill_value | np.full(shape=(3, 5), fill_value=666) |
| 3-4 | np.arange | create an array from a range; the interval is usually half-open (end excluded). Three args: ① start (inclusive), ② end (exclusive), ③ step. Two args: step defaults to 1. One arg: start defaults to 0, step to 1, and the value is the end | np.arange(0, 20, 2); np.arange(0, 1, 0.2); np.arange(0, 10); np.arange(10) |
| 3-4 | np.linspace | create an array from a range; unlike arange (step-based), this fixes the number of points and computes an even step automatically. Arg ①: start; arg ②: end (inclusive); arg ③: number of points | np.linspace(0, 20, 11) |
| 3-4 | np.random.randint | create an array of random integers in a range. Arg ①: low; arg ②: high (exclusive); arg ③: size, pass a tuple for an n-D shape | np.random.randint(0, 10); np.random.randint(0, 10, 10); np.random.randint(4, 8, size=10); np.random.randint(4, 8, size=(3, 5)) |
| 3-4 | np.random.seed | set numpy's random seed | np.random.seed(666) |
| 3-4 | np.random.random | generate random floats in [0, 1). Arg ①: size, pass a tuple for an n-D shape | np.random.random(); np.random.random(10); np.random.random((3, 5)) |
| 3-4 | np.random.normal | generate normally distributed random numbers. Arg ①: mean, default 0; arg ②: standard deviation, default 1; arg ③: size, pass a tuple for an n-D shape | np.random.normal(); np.random.normal(10, 100); np.random.normal(0, 1, (3, 5)) |
| 3-6 | np.concatenate | join arrays; they must have the same number of dimensions; returns a new object. Arg ①: a list of arrays; arg ②: axis, default 0 (join along rows), 1 joins along columns | np.concatenate([x, y]); np.concatenate([A, A]); np.concatenate([A, A], axis=1) |
| 3-6 | np.vstack | stack vertically (add rows); the arrays need not have the same number of dimensions, but the shapes must be compatible | np.vstack([A, z]) |
| 3-6 | np.hstack | stack horizontally (add columns) | np.hstack([A, B]) |
| 3-6 | np.split | split an array. Arg ①: the array to split; arg ②: a list of split points; arg ③: axis, default 0 (split by rows), 1 splits by columns | np.split(x, [3, 7]); np.split(A, [2]); np.split(A, [2], axis=1) |
| 3-6 | np.vsplit | split vertically (by rows) | np.vsplit(A, [2]) |
| 3-6 | np.hsplit | split horizontally (by columns) | np.hsplit(A, [2]) |
3-11 Matplotlib data visualization basics
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 import matplotlib as mplimport matplotlib.pyplot as pltimport numpy as npx = np.linspace(0 , 10 , 100 ) y = np.sin(x) plt.plot(x, y) plt.show() cosy = np.cos(x) siny = y.copy() plt.plot(x, siny) plt.plot(x, cosy, color = 'orange' , linestyle='-' ) plt.show() plt.plot(x, siny) plt.plot(x, cosy, color = 'orange' , linestyle='-' ) plt.xlim(-5 , 15 ) plt.ylim(0 , 1.5 ) plt.axis([-1 , 11 , -2 , 2 ]) plt.show() plt.plot(x, siny, label='sin(x)' ) plt.plot(x, cosy, color = 'orange' , linestyle='-' , label='cos(x)' ) plt.xlabel('x axis' ) plt.ylabel('y value' ) plt.legend() plt.title('Welcome to the ML World!' ) plt.show() plt.scatter(x, siny) plt.scatter(x, cosy) plt.show() x = np.random.normal(0 , 1 , 10000 ) y = np.random.normal(0 , 1 , 10000 ) plt.scatter(x, y, alpha=0.1 ) plt.show()
3-12 数据加载和简单的数据探索 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsiris = datasets.load_iris() print (iris.keys()) print (iris.DESCR) print (iris.data) print (iris.data.shape)print (iris.feature_names) print (iris.target) print (iris.target.shape)print (iris.target_names) X = iris.data[:, :2 ] print (X.shape)plt.scatter(X[:, 0 ], X[:, 1 ]) plt.show() y = iris.target plt.scatter(X[y==0 , 0 ], X[y==0 , 1 ], color='red' , marker='o' ) plt.scatter(X[y==1 , 0 ], X[y==1 , 1 ], color='blue' , marker='+' ) plt.scatter(X[y==2 , 0 ], X[y==2 , 1 ], color='green' , marker='x' ) plt.show() X = iris.data[:, 2 :] plt.scatter(X[y==0 , 0 ], X[y==0 , 1 ], color='red' , marker='o' ) plt.scatter(X[y==1 , 0 ], X[y==1 , 1 ], color='blue' , marker='+' ) plt.scatter(X[y==2 , 0 ], X[y==2 , 1 ], color='green' , marker='x' ) plt.show()
Chapter 4: The Most Basic Classification Algorithm, k-Nearest Neighbors (kNN)
4-1 kNN basics. Idea: to classify a new sample, find the K training points closest to it; whichever class appears most often among those K neighbors is the predicted class
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 import numpy as npimport matplotlib.pyplot as pltraw_data_X = [ [3.393533211 , 2.331273381 ], [3.110073483 , 1.781539638 ], [1.343808831 , 3.368360954 ], [3.582294042 , 4.679179110 ], [2.280362439 , 2.866990263 ], [7.423436942 , 4.696522875 ], [5.745051997 , 3.533989803 ], [9.172168622 , 2.511101045 ], [7.792783481 , 3.424088941 ], [7.939820817 , 0.791637231 ] ] raw_data_y = [0 , 0 , 0 , 0 , 0 , 1 , 1 , 1 , 1 , 1 ] X_train = np.array(raw_data_X) y_train = np.array(raw_data_y) print (X_train)''' [[3.39353321 2.33127338] [3.11007348 1.78153964] [1.34380883 3.36836095] [3.58229404 4.67917911] [2.28036244 2.86699026] [7.42343694 4.69652288] [5.745052 3.5339898 ] [9.17216862 2.51110105] [7.79278348 3.42408894] [7.93982082 0.79163723]] ''' print (y_train) plt.scatter(X_train[y_train==0 , 0 ], X_train[y_train==0 , 1 ], color='g' ) plt.scatter(X_train[y_train==1 , 0 ], X_train[y_train==1 , 1 ], color='r' ) plt.show() x = np.array([8.093607318 , 3.365731514 ]) plt.scatter(X_train[y_train==0 , 0 ], X_train[y_train==0 , 1 ], color='g' ) plt.scatter(X_train[y_train==1 , 0 ], X_train[y_train==1 , 1 ], color='r' ) plt.scatter(x[0 ], x[1 ], color='b' ) plt.show() from math import sqrtdistances = [] for x_train in X_train: d = sqrt(np.sum ((x_train - x)**2 )) distances.append(d) print (distances)distances = [sqrt(np.sum ((x_train - x)**2 )) for x_train in X_train] print (distances)nearest = np.argsort(distances) print (nearest) k = 6 topK_y = [y_train[i] for i in nearest[:k]] print (topK_y) from collections import Countervotes = Counter(topK_y) print (votes) print (votes.most_common(1 )) print (votes.most_common(1 )[0 ][0 ]) predict_y = votes.most_common(1 )[0 ][0 ] print (predict_y)
4-2 Wrapping a machine learning algorithm in the scikit-learn style
kNN wrapped as a function: Ch4_kNN.py

```python
import numpy as np
from math import sqrt
from collections import Counter

def kNN_classify(k, X_train, y_train, x):
    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"

    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    nearest = np.argsort(distances)
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]
```
main
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 import numpy as npimport matplotlib.pyplot as pltraw_data_X = [ [3.393533211 , 2.331273381 ], [3.110073483 , 1.781539638 ], [1.343808831 , 3.368360954 ], [3.582294042 , 4.679179110 ], [2.280362439 , 2.866990263 ], [7.423436942 , 4.696522875 ], [5.745051997 , 3.533989803 ], [9.172168622 , 2.511101045 ], [7.792783481 , 3.424088941 ], [7.939820817 , 0.791637231 ] ] raw_data_y = [0 , 0 , 0 , 0 , 0 , 1 , 1 , 1 , 1 , 1 ] X_train = np.array(raw_data_X) y_train = np.array(raw_data_y) x = np.array([8.093607318 , 3.365731514 ]) %run liuyubobobo/Ch4_kNN.py predict_y = kNN_classify(6 , X_train, y_train, x) print (predict_y)
Calling sklearn's kNN

```python
from sklearn.neighbors import KNeighborsClassifier

kNN_classifier = KNeighborsClassifier(n_neighbors=6)
print(kNN_classifier.fit(X_train, y_train))
X_predict = x.reshape(1, -1)     # sklearn expects a 2-D matrix of samples
print(X_predict)
y_predict = kNN_classifier.predict(X_predict)
print(y_predict)
print(y_predict[0])
```
Re-wrapping kNN in the sklearn style: Ch4_kNN2.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 import numpy as npfrom math import sqrtfrom collections import Counterclass KNNClassifier : def __init__ (self, k ): '''初始化kNN分类器''' assert k>=1 , "k must be valid" self.k = k self._X_train = None self._y_train = None def fit (self, X_train, y_train ): '''根据训练数据集X_train和y_train训练kNN分类器''' assert X_train.shape[0 ] == y_train.shape[0 ], \ "the size of X_train must be equal to the size of y_train" assert self.k <= X_train.shape[0 ], \ "the size of X_train must be at least k" self._X_train = X_train self._y_train = y_train return self def predict (self, X_predict ): '''给定待遇测数据集X_predict,返回表示X_predict的结果向量''' assert self._X_train is not None and self._y_train is not None , \ 'must fit before predict!' assert X_predict.shape[1 ] == self._X_train.shape[1 ], \ 'the feature number of X_predict must be equal to X_train' y_predict = [self._predict(x) for x in X_predict] return np.array(y_predict) def _predict (self, x ): '''给定单个待预测数据x,返回x的预测结果值''' assert x.shape[0 ] == self._X_train.shape[1 ], \ 'the feature number of x must be equal to X_train' distances = [sqrt(np.sum ((x_train - x)**2 )) for x_train in self._X_train] nearest = np.argsort(distances) topK_y = [self._y_train[i] for i in nearest[:self.k]] votes = Counter(topK_y) return votes.most_common(1 )[0 ][0 ] def __repr__ (self ): return 'KNN(k=%d)' % self.k
main
```python
%run liuyubobobo/Ch4_kNN2.py

knn_clf = KNNClassifier(k=6)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_predict)
print(y_predict)
print(y_predict[0])
```
4-3 Training set and test set
model_selection.py
```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    '''Split X and y into X_train, X_test, y_train, y_test according to test_ratio'''
    assert X.shape[0] == y.shape[0], \
        'the size of X must be equal to the size of y'
    assert 0.0 <= test_ratio <= 1.0, \
        'test_ratio must be valid'
    if seed:
        np.random.seed(seed)

    shuffle_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]
    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test
```
main
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsiris = datasets.load_iris() X = iris.data y = iris.target print (X.shape, y.shape)from playML.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y) print (X_train.shape)print (y_train.shape)print (X_test.shape)print (y_test.shape)from playML.kNN import KNNClassifiermy_knn_clf = KNNClassifier(k=3 ) my_knn_clf.fit(X_train, y_train) y_predict = my_knn_clf.predict(X_test) print (y_predict)print (sum (y_predict == y_test))print (sum (y_predict == y_test) / len (y_test))from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2 , random_state=666 ) print (X_train.shape)print (y_train.shape)print (X_test.shape)print (y_test.shape)
4-4 Classification accuracy. This section loads the handwritten digits dataset, splits it into a training set (80%) and a test set (20%), and measures accuracy with both our own kNN implementation and sklearn's.
metrics.py
```python
def accuracy_score(y_true, y_predict):
    '''Compute the accuracy of y_predict against y_true'''
    assert y_true.shape[0] == y_predict.shape[0], \
        'the size of y_true must be equal to the size of y_predict'
    return sum(y_true == y_predict) / len(y_true)
```
Modify kNN.py: add a score method

```python
import numpy as np
from math import sqrt
from collections import Counter
from .metrics import accuracy_score

class KNNClassifier:
    # ... existing __init__ / fit / predict unchanged ...

    def score(self, X_test, y_test):
        '''Evaluate the accuracy of the current model on X_test and y_test'''
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)
```
main
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 import numpy as npimport matplotlibimport matplotlib.pyplot as pltfrom sklearn import datasetsdigits = datasets.load_digits() print (digits.keys())print (digits.DESCR)X = digits.data print (X.shape) y = digits.target print (y)print (digits.target_names)print (y[:100 ]) some_digit = X[666 ] print (y[666 ])some_digit_image = some_digit.reshape(8 , 8 ) plt.imshow(some_digit_image, cmap = matplotlib.cm.binary) plt.show() from playML.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio = 0.2 ) from playML.kNN import KNNClassifiermy_knn_clf = KNNClassifier(k=3 ) my_knn_clf.fit(X_train, y_train) y_predict = my_knn_clf.predict(X_test) print (sum (y_predict == y_test) / len (y_test))from playML.metrics import accuracy_scoreprint (accuracy_score(y_test, y_predict))print (my_knn_clf.score(X_test, y_test)) from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2 , random_state=666 ) from sklearn.neighbors import KNeighborsClassifierknn_clf = KNeighborsClassifier(n_neighbors=3 ) knn_clf.fit(X_train, y_train) y_predict = knn_clf.predict(X_test) from sklearn.metrics import accuracy_scoreprint (accuracy_score(y_test, y_predict))print (knn_clf.score(X_test, y_test))
4-5 Hyperparameters
```python
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test, y_test))

# search for the best k
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score

print("best_k=", best_k)
print("best_score=", best_score)
```
```python
best_method = ""
best_score = 0.0
best_k = -1
for method in ['uniform', 'distance']:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method

print('best_method=', best_method)
print("best_k=", best_k)
print("best_score=", best_score)
```
↑ Manhattan distance: the sum of the absolute differences along each dimension
↑ The red, blue and yellow paths all have the same Manhattan distance; the green path is the Euclidean distance
p = 1: Manhattan distance
p = 2: Euclidean distance
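Both are special cases of the Minkowski distance, which is what the hyperparameter p in the search below controls:

$$
d(a, b) = \left(\sum_{i=1}^{n} \left|a_i - b_i\right|^{p}\right)^{\frac{1}{p}}
$$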
```python
best_p = -1
best_score = 0.0
best_k = -1
for p in range(1, 6):
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_p = p
            best_k = k
            best_score = score

print('best_p=', best_p)
print("best_k=", best_k)
print("best_score=", best_score)
```
4-6 Grid search and more kNN hyperparameters

```python
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.neighbors import KNeighborsClassifier

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

knn_clf = KNeighborsClassifier()

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)

print(grid_search.best_estimator_)
print(grid_search.best_score_)
print(grid_search.best_params_)

knn_clf = grid_search.best_estimator_
print(knn_clf.score(X_test, y_test))

# n_jobs=-1 uses all CPU cores, verbose=2 prints progress during the search
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
```
4-7 Feature scaling (normalization)
Min-max scaling
↑ Suitable when values have a clear, bounded range: exam scores (0-100), pixel values (0-255)
Mean-variance normalization, i.e. standardization (recommended); both rules are written out below
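For reference, the two scaling rules (min-max scaling on the left, mean-variance normalization on the right, where mu and sigma are the mean and standard deviation of the feature):

$$
x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad\qquad x_{scale} = \frac{x - \mu}{\sigma}
$$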
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 import numpy as npimport matplotlib.pyplot as pltx = np.random.randint(0 , 100 , size=100 ) print ((x - np.min (x)) / (np.max (x) - np.min (x)))X = np.random.randint(0 , 100 , (50 , 2 )) X = np.array(X, dtype=float ) X[:, 0 ] = (X[:, 0 ] - np.min (X[:, 0 ])) / (np.max (X[:, 0 ]) - np.min (X[:, 0 ])) X[:, 1 ] = (X[:, 1 ] - np.min (X[:, 1 ])) / (np.max (X[:, 1 ]) - np.min (X[:, 1 ])) print (X[:10 , :])plt.scatter(X[:, 0 ], X[:, 1 ]) plt.show() print (np.mean(X[:, 0 ])) print (np.std(X[:, 0 ])) print (np.mean(X[:, 1 ])) print (np.std(X[:, 1 ])) X2 = np.random.randint(0 , 100 , (50 , 2 )) X2 = np.array(X2, dtype = float ) X2[:, 0 ] = (X2[:, 0 ] - np.mean(X2[:, 0 ])) / np.std(X2[:, 0 ]) X2[:, 1 ] = (X2[:, 1 ] - np.mean(X2[:, 1 ])) / np.std(X2[:, 1 ]) plt.scatter(X2[:, 0 ], X2[:, 1 ]) plt.show() print (np.mean(X2[:, 0 ])) print (np.std(X2[:, 0 ])) print (np.mean(X2[:, 1 ])) print (np.std(X2[:, 1 ]))
4-8 The Scaler in scikit-learn
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 import numpy as npfrom sklearn import datasetsiris = datasets.load_iris() X = iris.data y = iris.target print (X[:10 , :])from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 , random_state=666 ) from sklearn.preprocessing import StandardScaler standardScaler = StandardScaler() standardScaler.fit(X_train) print (standardScaler.mean_) print (standardScaler.scale_)X_train = standardScaler.transform(X_train) X_test_standard = standardScaler.transform(X_test) from sklearn.neighbors import KNeighborsClassifierknn_clf = KNeighborsClassifier(n_neighbors=3 ) knn_clf.fit(X_train, y_train) print (knn_clf.score(X_test_standard, y_test)) print (knn_clf.score(X_test, y_test))
4-9 More thoughts on the k-nearest neighbors algorithm
↑ Using kNN for regression, see the docs: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
↑ KD-Tree: https://www.bilibili.com/video/BV1d5411w7f5
Chapter 5: Linear Regression
5-1 Simple linear regression
5-2 The least squares method
Derivation of b
Derivation of a
Final result
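The final least-squares solution for the model y_hat = a*x + b (this is exactly what the fit code below computes):

$$
a = \frac{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^{2}} \qquad\qquad b = \bar{y} - a\,\bar{x}
$$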
5-3 Implementing simple linear regression. A detailed explanation of Python's zip() function: https://blog.csdn.net/weixin_47906106/article/details/121702241
main
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 import numpy as npimport matplotlib.pyplot as pltx = np.array([1. , 2. , 3. , 4. , 5. ]) y = np.array([1. , 3. , 2. , 3. , 5. ]) plt.scatter(x, y) plt.axis([0 , 6 , 0 , 6 ]) plt.show() x_mean = np.mean(x) y_mean = np.mean(y) num = 0.0 d = 0.0 for x_i, y_i in zip (x, y): num += (x_i - x_mean) * (y_i - y_mean) d += (x_i - x_mean) ** 2 a = num / d b = y_mean - a * x_mean y_hat = a * x + b plt.scatter(x, y) plt.plot(x, y_hat, color = 'red' ) plt.axis([0 , 6 , 0 , 6 ]) plt.show() x_predict = 6 y_predict = a * x_predict + b print (y_predict) from playML.SimpleLinearRegression import SimpleLinearRegression1reg1 = SimpleLinearRegression1() reg1.fit(x, y) print (reg1.predict(np.array([x_predict]))) print (reg1.a_, reg1.b_) y_hat1 = reg1.predict(x) plt.scatter(x, y) plt.plot(x, y_hat1, color='r' ) plt.axis([0 , 6 , 0 , 6 ]) plt.show()
SimpleLinearRegression1.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 import numpy as npclass SimpleLinearRegression1 : def __init__ (self ): '''初始化Simple Linear Regression 模型''' self.a_ = None self.b_ = None def fit (self, x_train, y_train ): '''根据训练数据集x_train, y_train训练Simple Linear Regression模型''' assert x_train.ndim == 1 , \ 'Simple Linear Regressor can only solve single feature training data.' assert len (x_train) == len (y_train), \ 'the size of x_train must be equal to the size of y_train' x_mean = np.mean(x_train) y_mean = np.mean(y_train) num = 0.0 d = 0.0 for x, y in zip (x_train, y_train): num += (x - x_mean) * (y - y_mean) d += (x - x_mean) ** 2 self.a_ = num / d self.b_ = y_mean - self.a_ * x_mean return self def predict (self, x_predict ): '''给定待预测数据集x_predict, 返回表示x_predict的结果向量''' assert x_predict.ndim == 1 , \ 'Simple Linear Regressor can only solve single feature training data.' assert self.a_ is not None and self.b_ is not None , \ 'must fit before predict!' return np.array([self._predict(x) for x in x_predict]) def _predict (self, x_single ): '''给定单个待预测数据x_single,返回x_single的预测结果值''' return self.a_ * x_single + self.b_ def __repr__ (self ): return "SimpleLinearRegression1()"
5-4 Vectorization
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 class SimpleLinearRegression2 : ...... def fit (self, x_train, y_train ): ...... x_mean = np.mean(x_train) y_mean = np.mean(y_train) num = 0.0 d = 0.0 num = (x_train - x_mean).dot(y_train - y_mean) d = (x_train - x_mean).dot(x_train - x_mean) self.a_ = num / d self.b_ = y_mean - self.a_ * x_mean return self ...... def __repr__ (self ): return "SimpleLinearRegression2()"
Performance comparison
5-5 Metrics for linear regression: MSE, RMSE and MAE
RMSE vs MAE
RMSE squares the errors, which amplifies the large ones, so RMSE is usually larger than MAE
In practice we try to make RMSE as small as possible, since minimizing it penalizes the largest errors hardest; the three metrics are written out below
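The three metrics on the test set:

$$
MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2} \qquad RMSE = \sqrt{MSE} \qquad MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right|
$$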
5-6 The best metric for linear regression: R Squared
The error our model makes is smaller than the Baseline Model's, because our model actually uses the relationship between x and y
The Baseline Model ignores the relationship between x and y: for any input x it simply predicts the mean of y
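R Squared compares our model against that baseline (this is what r2_score in the code below computes):

$$
R^{2} = 1 - \frac{\sum_{i}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{\sum_{i}\left(\bar{y} - y^{(i)}\right)^{2}} = 1 - \frac{MSE(\hat{y}, y)}{Var(y)}
$$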
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsboston = datasets.load_boston() x = boston.data[:, 5 ] y = boston.target from playML.model_selection import train_test_splitx_train, x_test, y_train, y_test = train_test_split(x, y, seed=666 ) from playML.SimpleLinearRegression import SimpleLinearRegression2reg = SimpleLinearRegression2() reg.fit(x_train, y_train) y_predict = reg.predict(x_test) from playML.metrics import r2_scoreprint (r2_score(y_test, y_predict)) from sklearn.metrics import r2_scoreprint (r2_score(y_test, y_predict))
5-7 Multiple linear regression and the normal equation
Given: y (the label of each sample) and X_b (the feature matrix with an extra column of all ones prepended as column 0)
1. Take the derivative
2. Find the extremum (set the derivative to 0)
Derivation omitted; essentially we take the partial derivative of the loss with respect to each component of θ, set it to 0, and solve for the final result
↑ θ is just a set of coefficients for the columns of X, so there is no unit/scale problem (no feature scaling is needed here)
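The resulting closed-form solution, the normal equation that fit_normal implements:

$$
\theta = \left(X_b^{T} X_b\right)^{-1} X_b^{T}\, y
$$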
5-8 Implementing multiple linear regression
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsboston = datasets.load_boston() X = boston.data y = boston.target X = X[y < 50.0 ] y = y[y < 50.0 ] print (X.shape) from playML.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, seed=666 ) from playML.LinearRegression import LinearRegressionreg = LinearRegression() reg.fit_normal(X_train, y_train) print (reg.score(X_test,y_test))
5-9 Solving regression problems with scikit-learn
From 5-9 onward, the code is taken directly from the instructor's provided source code
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsboston = datasets.load_boston() X = boston.data y = boston.target X = X[y < 50.0 ] y = y[y < 50.0 ] print (X.shape) from playML.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, seed=666 ) from sklearn.linear_model import LinearRegressionlin_reg = LinearRegression() lin_reg.fit(X_train, y_train) lin_reg.score(X_test, y_test) from sklearn.neighbors import KNeighborsRegressorknn_reg = KNeighborsRegressor() knn_reg.fit(X_train_standard, y_train) knn_reg.score(X_test_standard, y_test) from sklearn.model_selection import GridSearchCVparam_grid = [ { "weights" : ["uniform" ], "n_neighbors" : [i for i in range (1 , 11 )] }, { "weights" : ["distance" ], "n_neighbors" : [i for i in range (1 , 11 )], "p" : [i for i in range (1 ,6 )] } ] knn_reg = KNeighborsRegressor() grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1 , verbose=1 ) grid_search.fit(X_train_standard, y_train) grid_search.best_params_ grid_search.best_score_ grid_search.best_estimator_.score(X_test_standard, y_test)
5-10 线性回归的可解释性和更多思考 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsboston = datasets.load_boston() X = boston.data y = boston.target X = X[y < 50.0 ] y = y[y < 50.0 ] from sklearn.linear_model import LinearRegressionlin_reg = LinearRegression() lin_reg.fit(X, y) lin_reg.coef_ np.argsort(lin_reg.coef_) boston.feature_names[np.argsort(lin_reg.coef_)] print (boston.DESCR)
Chapter 6: Gradient Descent
6-1 What is gradient descent
Intuition: it is like rolling a ball downhill; each step moves it a little further down until it reaches the lowest point
↑ As the figure shows, when the derivative is negative, y increases toward the negative x direction and decreases toward the positive x direction; multiplying the derivative by the negative step -eta therefore always moves theta in the direction in which y decreases
Concrete procedure: at each step take the current theta, compute its derivative, and move theta according to the update rule above; repeat until the derivative is (close to) 0
Possible problems
6-2 Simulating gradient descent
The core gradient descent loop
Implementation idea: start from an arbitrary θ, compute the derivative, and shift θ by eta times the derivative; when the change in J(θ) between two consecutive steps is smaller than epsilon, we treat the extremum as found
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 import numpy as npimport matplotlib.pyplot as pltplot_x = np.linspace(-1. , 6. , 141 ) plot_y = (plot_x-2.5 )**2 - 1. plt.plot(plot_x, plot_y) plt.show() epsilon = 1e-8 eta = 0.1 def J (theta ): return (theta-2.5 )**2 - 1. def dJ (theta ): return 2 *(theta-2.5 ) theta = 0.0 while True : gradient = dJ(theta) last_theta = theta theta = theta - eta * gradient if (abs (J(theta) - J(last_theta)) < epsilon): break print (theta)print (J(theta))
Refined version
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 def J (theta ): try : return (theta-2.5 )**2 - 1. except : return float ('inf' ) def gradient_descent (initial_theta, eta, n_iters = 1e4 , epsilon=1e-8 ): theta = initial_theta i_iter = 0 theta_history.append(initial_theta) while i_iter < n_iters: gradient = dJ(theta) last_theta = theta theta = theta - eta * gradient theta_history.append(theta) if (abs (J(theta) - J(last_theta)) < epsilon): break i_iter += 1 return def plot_theta_history (): plt.plot(plot_x, J(plot_x)) plt.plot(np.array(theta_history), J(np.array(theta_history)), color="r" , marker='+' ) plt.show()
6-3 Gradient descent for linear regression
X_b: the matrix formed by prepending a column of all ones to the sample matrix X
θ: the vector (θ0, θ1, ..., θn)
Their dot product multiplies corresponding components and sums them
Divide by m to remove the effect of the sample count (otherwise the more samples, the larger the gradient)
Looking at a single sample amounts to dropping the summation sign; taking one partial derivative per dimension then gives that sample's gradient
(Batch) gradient descent does this for every sample and averages, giving the overall average gradient; the result is written out below
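The gradient of the MSE loss, in both component and vectorized form (the vectorized form is what section 6-5 uses):

$$
\frac{\partial J}{\partial \theta_j} = \frac{2}{m}\sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)X_j^{(i)} \qquad\qquad \nabla J(\theta) = \frac{2}{m}\,X_b^{T}\left(X_b\,\theta - y\right)
$$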
6-4 实现线性回归中的梯度下降法 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 import numpy as np import matplotlib.pyplot as pltnp.random.seed(666 ) x = 2 * np.random.random(size=100 ) y = x * 3. + 4. + np.random.normal(size=100 ) X = x.reshape(-1 , 1 ) print (X.shape)print (y.shape)plt.scatter(x, y) plt.show() def J (theta, X_b, y ): try : return np.sum ((y - X_b.dot(theta))**2 ) / len (X_b) except : return float ('inf' ) def dJ (theta, X_b, y ): res = np.empty(len (theta)) res[0 ] = np.sum (X_b.dot(theta) - y) for i in range (1 , len (theta)): res[i] = (X_b.dot(theta) - y).dot(X_b[:, i]) return res * 2 / len (X_b) def gradient_descent (X_b, y, initial_theta, eta, n_iters=1e4 , epsilon=1e-8 ): theta = initial_theta i_iter = 0 while i_iter < n_iters: gradient = dJ(theta, X_b, y) last_theta = theta theta = theta - eta * gradient if (abs (J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon): break i_iter += 1 return theta X_b = np.hstack([np.ones((len (x), 1 )), x.reshape(-1 , 1 )]) initial_theta = np.zeros(X_b.shape[1 ]) eta = 0.01 theta = gradient_descent(X_b, y, initial_theta, eta) print (theta) from playML.LinearRegression import LinearRegressionlin_reg = LinearRegression() lin_reg.fit_gd(X, y) print (lin_reg.coef_) print (lin_reg.intercept_)
Modify LinearRegression
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 def fit_gd (self, X_train, y_train, eta=0.01 , n_iters=1e4 ):"""根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型""" assert X_train.shape[0 ] == y_train.shape[0 ], \ "the size of X_train must be equal to the size of y_train" def J (theta, X_b, y ): try : return np.sum ((y - X_b.dot(theta)) ** 2 ) / len (y) except : return float ('inf' ) def dJ (theta, X_b, y ): res = np.empty(len (theta)) res[0 ] = np.sum (X_b.dot(theta) - y) for i in range (1 , len (theta)): res[i] = (X_b.dot(theta) - y).dot(X_b[:, i]) return res * 2 / len (X_b) def gradient_descent (X_b, y, initial_theta, eta, n_iters=1e4 , epsilon=1e-8 ): theta = initial_theta cur_iter = 0 while cur_iter < n_iters: gradient = dJ(theta, X_b, y) last_theta = theta theta = theta - eta * gradient if (abs (J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon): break cur_iter += 1 return theta X_b = np.hstack([np.ones((len (X_train), 1 )), X_train]) initial_theta = np.zeros(X_b.shape[1 ]) self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters) self.intercept_ = self._theta[0 ] self.coef_ = self._theta[1 :] return self
6-5 Vectorizing gradient descent and standardizing the data

```python
def dJ(theta, X_b, y):
    # vectorized gradient: 2/m * X_b^T (X_b.theta - y)
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(X_b)
```
Testing gradient descent on real data
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsboston = datasets.load_boston() X = boston.data y = boston.target X = X[y < 50.0 ] y = y[y < 50.0 ] print (X.shape) from playML.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, seed=666 ) from playML.LinearRegression import LinearRegressionreg = LinearRegression() reg.fit_normal(X_train, y_train) print (reg.score(X_test,y_test)) from sklearn.preprocessing import StandardScaler standardScaler = StandardScaler() standardScaler.fit(X_train) X_train_standard = standardScaler.transform(X_train) X_test_standard = standardScaler.transform(X_test) lin_reg3 = LinearRegression() lin_reg3.fit_gd(X_train_standard, y_train, eta=0.0001 , n_iters=1e4 ) print (lin_reg3.score(X_test_standard, y_test))
6-6 Stochastic gradient descent
The learning rate should decay gradually as the number of iterations grows
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 import numpy as npimport matplotlib.pyplot as pltm = 100000 x = np.random.normal(size=m) X = x.reshape(-1 ,1 ) y = 4. *x + 3. + np.random.normal(0 , 3 , size=m) plt.scatter(x, y) plt.show() def J (theta, X_b, y ): try : return np.sum ((y - X_b.dot(theta)) ** 2 ) / len (y) except : return float ('inf' ) def dJ (theta, X_b, y ): return X_b.T.dot(X_b.dot(theta) - y) * 2. / len (y) def gradient_descent (X_b, y, initial_theta, eta, n_iters=1e4 , epsilon=1e-8 ): theta = initial_theta cur_iter = 0 while cur_iter < n_iters: gradient = dJ(theta, X_b, y) last_theta = theta theta = theta - eta * gradient if (abs (J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon): break cur_iter += 1 return theta X_b = np.hstack([np.ones((len (X), 1 )), X]) initial_theta = np.zeros(X_b.shape[1 ]) eta = 0.01 theta = gradient_descent(X_b, y, initial_theta, eta) print (theta) def dJ_sgd (theta, X_b_i, y_i ): return 2 * X_b_i.T.dot(X_b_i.dot(theta) - y_i) def sgd (X_b, y, initial_theta, n_iters ): t0, t1 = 5 , 50 def learning_rate (t ): return t0 / (t + t1) theta = initial_theta for cur_iter in range (n_iters): rand_i = np.random.randint(len (X_b)) gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i]) theta = theta - learning_rate(cur_iter) * gradient return theta X_b = np.hstack([np.ones((len (X), 1 )), X]) initial_theta = np.zeros(X_b.shape[1 ]) theta = sgd(X_b, y, initial_theta, n_iters=m//3 ) print (theta)
6-7 scikit-learn中的随机梯度下降法 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 from sklearn import datasetsboston = datasets.load_boston() X = boston.data y = boston.target X = X[y < 50.0 ] y = y[y < 50.0 ] from playML.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, seed=666 ) from sklearn.preprocessing import StandardScalerstandardScaler = StandardScaler() standardScaler.fit(X_train) X_train_standard = standardScaler.transform(X_train) X_test_standard = standardScaler.transform(X_test) from sklearn.linear_model import SGDRegressorsgd_reg = SGDRegressor(n_iter_no_change=50 ) %time sgd_reg.fit(X_train_standard, y_train) sgd_reg.score(X_test_standard, y_test)
6-8 How do we verify that the gradient is computed correctly? Debugging gradient descent
↑ In practice the debug (finite-difference) gradient is slow; use it only for early-stage verification
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 import numpy as npimport matplotlib.pyplot as pltnp.random.seed(666 ) X = np.random.random(size=(1000 , 10 )) true_theta = np.arange(1 , 12 , dtype=float ) X_b = np.hstack([np.ones((len (X), 1 )), X]) y = X_b.dot(true_theta) + np.random.normal(size=1000 ) print (X.shape)print (y.shape)print (true_theta)def J (theta, X_b, y ): try : return np.sum ((y - X_b.dot(theta))**2 ) / len (X_b) except : return float ('inf' ) def dJ_math (theta, X_b, y ): return X_b.T.dot(X_b.dot(theta) - y) * 2. / len (y) def dJ_debug (theta, X_b, y, epsilon=0.01 ): res = np.empty(len (theta)) for i in range (len (theta)): theta_1 = theta.copy() theta_1[i] += epsilon theta_2 = theta.copy() theta_2[i] -= epsilon res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon) return res def gradient_descent (dJ, X_b, y, initial_theta, eta, n_iters=1e4 , epsilon=1e-8 ): theta = initial_theta cur_iter = 0 while cur_iter < n_iters: gradient = dJ(theta, X_b, y) last_theta = theta theta = theta - eta * gradient if (abs (J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon): break cur_iter += 1 return theta X_b = np.hstack([np.ones((len (X), 1 )), X]) initial_theta = np.zeros(X_b.shape[1 ]) eta = 0.01 %time theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta) theta %time theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta) theta
6-9 More in-depth discussion of gradient descent
Batch gradient descent: every gradient computation goes through all the samples; advantage: stable
Stochastic gradient descent: each step uses the gradient of a single randomly chosen sample; advantage: fast
Mini-batch gradient descent: each step looks at a random batch of k samples (a minimal sketch follows)
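The notes give no code for mini-batch gradient descent, so here is a minimal sketch in the style of the sgd function above; the batch size k and the decaying learning-rate schedule are illustrative assumptions, not the course's implementation.

```python
import numpy as np

def dJ_mini_batch(theta, X_b_batch, y_batch):
    # average MSE gradient over one random batch
    return X_b_batch.T.dot(X_b_batch.dot(theta) - y_batch) * 2. / len(y_batch)

def mini_batch_gd(X_b, y, initial_theta, n_iters=10000, k=16, t0=5, t1=50):
    def learning_rate(t):
        return t0 / (t + t1)                         # decaying rate, as in 6-6

    theta = initial_theta
    for cur_iter in range(n_iters):
        batch = np.random.randint(0, len(X_b), size=k)   # pick k random samples
        gradient = dJ_mini_batch(theta, X_b[batch], y[batch])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta
```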
Chapter 7: PCA and Gradient Ascent
7-1 What is PCA
Step 1: demean the data
Step 2: find the direction w that maximizes the variance of the projected samples
Since the mean of the demeaned X is 0, the objective simplifies to the expression below
Unlike linear regression:
linear regression minimizes the difference between the predicted values and the true values,
while PCA maximizes the spread of the projected samples (maximizes the variance)
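The simplified objective, the function f(w, X) maximized in the code that follows:

$$
Var\left(X_{project}\right) = \frac{1}{m}\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)^{2}
$$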
7-2 Solving PCA with gradient ascent
Treat w as the variable: after each gradient step, move a little toward the maximum (and remember to re-normalize w to unit length every step)
Shape check for the gradient:
X: m x n
w: n x 1
(Xw)^T^X = (1 x m)(m x n) = 1 x n
((Xw)^T^X)^T^ = n x 1 = X^T^(Xw)   # the column vector we want
7-3 求数据的主成分PCA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 import numpy as npimport matplotlib.pyplot as pltX = np.empty((100 , 2 )) X[:,0 ] = np.random.uniform(0. , 100. , size=100 ) X[:,1 ] = 0.75 * X[:,0 ] + 3. + np.random.normal(0 , 10. , size=100 ) plt.scatter(X[:,0 ], X[:,1 ]) plt.show() def demean (X ): return X - np.mean(X, axis=0 ) X_demean = demean(X) plt.scatter(X_demean[:,0 ], X_demean[:,1 ]) plt.show() def f (w, X ): return np.sum ((X.dot(w) ** 2 )) / len (X) def df_math (w, X ): return X.T.dot(X.dot(w)) * 2. / len (X) def df_debug (w, X, epsilon=0.0001 ): res = np.empty(len (w)) for i in range (len (w)): w_1 = w.copy() w_1[i] += epsilon w_2 = w.copy() w_2[i] -= epsilon res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon) return res def direction (w ): return w / np.linalg.norm(w) def gradient_ascent (df, X, initial_w, eta, n_iters=1e4 , epsilon=1e-8 ): w = direction(initial_w) cur_iter = 0 while cur_iter < n_iters: gradient = df(w, X) last_w = w w = w + eta * gradient w = direction(w) if (abs (f(w, X) - f(last_w, X)) < epsilon): break cur_iter += 1 return w initial_w = np.random.random(X.shape[1 ]) eta = 0.001 gradient_ascent(df_math, X_demean, initial_w, eta) w = gradient_ascent(df_math, X_demean, initial_w, eta) plt.scatter(X_demean[:,0 ], X_demean[:,1 ]) plt.plot([0 , w[0 ]*30 ], [0 , w[1 ]*30 ], color='r' ) plt.show() X2 = np.empty((100 , 2 )) X2[:,0 ] = np.random.uniform(0. , 100. , size=100 ) X2[:,1 ] = 0.75 * X2[:,0 ] + 3. plt.scatter(X2[:,0 ], X2[:,1 ]) plt.show() X2_demean = demean(X2) w2 = gradient_ascent(df_math, X2_demean, initial_w, eta) plt.scatter(X2_demean[:,0 ], X2_demean[:,1 ]) plt.plot([0 , w2[0 ]*30 ], [0 , w2[1 ]*30 ], color='r' ) plt.show()
7-4 Finding the first n principal components of the data
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 import numpy as npimport matplotlib.pyplot as pltX = np.empty((100 , 2 )) X[:,0 ] = np.random.uniform(0. , 100. , size=100 ) X[:,1 ] = 0.75 * X[:,0 ] + 3. + np.random.normal(0 , 10. , size=100 ) def demean (X ): return X - np.mean(X, axis=0 ) X = demean(X) plt.scatter(X[:,0 ], X[:,1 ]) plt.show() def f (w, X ): return np.sum ((X.dot(w) ** 2 )) / len (X) def df (w, X ): return X.T.dot(X.dot(w)) * 2. / len (X) def direction (w ): return w / np.linalg.norm(w) def first_component (X, initial_w, eta, n_iters=1e4 , epsilon=1e-8 ): w = direction(initial_w) cur_iter = 0 while cur_iter < n_iters: gradient = df(w, X) last_w = w w = w + eta * gradient w = direction(w) if (abs (f(w, X) - f(last_w, X)) < epsilon): break cur_iter += 1 return w initial_w = np.random.random(X.shape[1 ]) eta = 0.01 w = first_component(X, initial_w, eta) w X2 = np.empty(X.shape) for i in range (len (X)): X2[i] = X[i] - X[i].dot(w) * w plt.scatter(X2[:, 0 ], X2[:, 1 ]) plt.show() X2 = X - X.dot(w).reshape(-1 , 1 ) * w plt.scatter(X2[:,0 ], X2[:,1 ]) plt.show() w2 = first_component(X2, initial_w, eta) w2 w.dot(w2) def first_n_components (n, X, eta=0.01 , n_iters = 1e4 , epsilon=1e-8 ): X_pca = X.copy() X_pca = demean(X_pca) res = [] for i in range (n): initial_w = np.random.random(X_pca.shape[1 ]) w = first_component(X_pca, initial_w, eta) res.append(w) X_pca = X_pca - X_pca.dot(w).reshape(-1 , 1 ) * w return res first_n_components(2 , X)
7-5 Mapping high-dimensional data to low dimensions
k: the number of leading principal components used (k < n)
n: the number of features
m: the number of samples
Mapping down to low dimensions and then back up does not recover the original data exactly (information is lost); both mappings are written out below
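With W_k the k x n matrix whose rows are the first k principal components (components_ in the class below), transform and inverse_transform compute:

$$
X_{k} = X \cdot W_k^{T} \quad (m\times n)(n\times k)=m\times k \qquad\qquad X_{m} = X_{k} \cdot W_k \quad (m\times k)(k\times n)=m\times n
$$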
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 import numpy as npclass PCA : def __init__ (self, n_components ): """初始化PCA""" assert n_components >= 1 , "n_components must be valid" self.n_components = n_components self.components_ = None def fit (self, X, eta=0.01 , n_iters=1e4 ): """获得数据集X的前n个主成分""" assert self.n_components <= X.shape[1 ], \ "n_components must not be greater than the feature number of X" def demean (X ): return X - np.mean(X, axis=0 ) def f (w, X ): return np.sum ((X.dot(w) ** 2 )) / len (X) def df (w, X ): return X.T.dot(X.dot(w)) * 2. / len (X) def direction (w ): return w / np.linalg.norm(w) def first_component (X, initial_w, eta=0.01 , n_iters=1e4 , epsilon=1e-8 ): w = direction(initial_w) cur_iter = 0 while cur_iter < n_iters: gradient = df(w, X) last_w = w w = w + eta * gradient w = direction(w) if (abs (f(w, X) - f(last_w, X)) < epsilon): break cur_iter += 1 return w X_pca = demean(X) self.components_ = np.empty(shape=(self.n_components, X.shape[1 ])) for i in range (self.n_components): initial_w = np.random.random(X_pca.shape[1 ]) w = first_component(X_pca, initial_w, eta, n_iters) self.components_[i,:] = w X_pca = X_pca - X_pca.dot(w).reshape(-1 , 1 ) * w return self def transform (self, X ): """将给定的X,映射到各个主成分分量中""" assert X.shape[1 ] == self.components_.shape[1 ] return X.dot(self.components_.T) def inverse_transform (self, X ): """将给定的X,反向映射回原来的特征空间""" assert X.shape[1 ] == self.components_.shape[0 ] return X.dot(self.components_) def __repr__ (self ): return "PCA(n_components=%d)" % self.n_components
Reducing the dimension and then mapping back up
Test
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 import numpy as npimport matplotlib.pyplot as pltX = np.empty((100 , 2 )) X[:,0 ] = np.random.uniform(0. , 100. , size=100 ) X[:,1 ] = 0.75 * X[:,0 ] + 3. + np.random.normal(0 , 10. , size=100 ) from playML.PCA import PCApca = PCA(n_components=2 ) pca.fit(X) PCA(n_components=2 ) pca.components_ """ array([[ 0.76676948, 0.64192256], [-0.64191827, 0.76677307]]) """ pca = PCA(n_components=1 ) pca.fit(X) PCA(n_components=1 ) X_reduction = pca.transform(X) X_reduction.shape X_restore = pca.inverse_transform(X_reduction) X_restore.shape plt.scatter(X[:,0 ], X[:,1 ], color='b' , alpha=0.5 ) plt.scatter(X_restore[:,0 ], X_restore[:,1 ], color='r' , alpha=0.5 ) plt.show()
7-6 scikit-learn中的PCA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 import numpy as npimport matplotlib.pyplot as pltfrom sklearn import datasetsdigits = datasets.load_digits() X = digits.data y = digits.target from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666 ) X_train.shape %%time from sklearn.neighbors import KNeighborsClassifierknn_clf = KNeighborsClassifier() knn_clf.fit(X_train, y_train) knn_clf.score(X_test, y_test) from sklearn.decomposition import PCApca = PCA(n_components=2 ) pca.fit(X_train) X_train_reduction = pca.transform(X_train) X_test_reduction = pca.transform(X_test) %%time knn_clf = KNeighborsClassifier() knn_clf.fit(X_train_reduction, y_train) knn_clf.score(X_test_reduction, y_test) pca.explained_variance_ratio_ from sklearn.decomposition import PCApca = PCA(n_components=X_train.shape[1 ]) pca.fit(X_train) pca.explained_variance_ratio_ plt.plot([i for i in range (X_train.shape[1 ])], [np.sum (pca.explained_variance_ratio_[:i]) for i in range (X_train.shape[1 ])]) plt.show() pca = PCA(0.95 ) pca.fit(X_train) pca.n_components_ X_train_reduction = pca.transform(X_train) X_test_reduction = pca.transform(X_test) %%time knn_clf = KNeighborsClassifier() knn_clf.fit(X_train_reduction, y_train) knn_clf.score(X_test_reduction, y_test) pca = PCA(n_components=2 ) pca.fit(X) X_reduction = pca.transform(X) for i in range (10 ): plt.scatter(X_reduction[y==i,0 ], X_reduction[y==i,1 ], alpha=0.8 ) plt.show()
7-7 Trying it on the MNIST dataset
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 import numpy as np from sklearn.datasets import fetch_openml mnist = fetch_openml('mnist_784' ) X, y = mnist['data' ], mnist['target' ] X_train = np.array(X[:60000 ], dtype=float ) y_train = np.array(y[:60000 ], dtype=float ) X_test = np.array(X[60000 :], dtype=float ) y_test = np.array(y[60000 :], dtype=float ) X_train.shape X_test.shape from sklearn.neighbors import KNeighborsClassifierknn_clf = KNeighborsClassifier() %time knn_clf.fit(X_train, y_train) %time knn_clf.score(X_test, y_test) from sklearn.decomposition import PCA pca = PCA(0.90 ) pca.fit(X_train) X_train_reduction = pca.transform(X_train) X_test_reduction = pca.transform(X_test) X_train_reduction.shape knn_clf = KNeighborsClassifier() %time knn_clf.fit(X_train_reduction, y_train) %time knn_clf.score(X_test_reduction, y_test)
Chapter 8: Polynomial Regression and Model Generalization
8-1 What is polynomial regression. URL: https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/01-What-is-Polynomial-Regression
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, 100)
plt.scatter(x, y)
plt.show()
```
Fit it with plain linear regression?
```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(x, y_predict, color='r')
plt.show()
```
Solution: add a feature
By treating X^2 as just another linear regression feature, we can still use linear regression
```python
X2 = np.hstack([X, X**2])
X2.shape

lin_reg2 = LinearRegression()
lin_reg2.fit(X2, y)
y_predict2 = lin_reg2.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r')
plt.show()

lin_reg2.coef_
lin_reg2.intercept_
```
8-2 Polynomial regression and Pipeline in scikit-learn. URL: https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/02-Polynomial-Regression-in-scikit-learn
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 import numpy as npimport matplotlib.pyplot as pltx = np.random.uniform(-3 , 3 , size=100 ) X = x.reshape(-1 , 1 ) y = 0.5 * x**2 + x + 2 + np.random.normal(0 , 1 , 100 ) from sklearn.preprocessing import PolynomialFeaturespoly = PolynomialFeatures(degree=2 ) poly.fit(X) X2 = poly.transform(X) print (X2.shape) print (X[:5 ,:])print (X2[:5 ,:])from sklearn.linear_model import LinearRegressionlin_reg2 = LinearRegression() lin_reg2.fit(X2, y) y_predict2 = lin_reg2.predict(X2) plt.scatter(x, y) plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r' ) plt.show() print (lin_reg2.coef_) print (lin_reg2.intercept_) X = np.arange(1 , 11 ).reshape(-1 , 2 ) print (X)""" [[ 1 2] [ 3 4] [ 5 6] [ 7 8] [ 9 10]] """ poly = PolynomialFeatures(degree=2 ) poly.fit(X) X2 = poly.transform(X) print (X2.shape)print (X2)x = np.random.uniform(-3 , 3 , size=100 ) X = x.reshape(-1 , 1 ) y = 0.5 * x**2 + x + 2 + np.random.normal(0 , 1 , 100 ) from sklearn.pipeline import Pipelinefrom sklearn.preprocessing import StandardScalerpoly_reg = Pipeline([ ("poly" , PolynomialFeatures(degree=2 )), ("std_scaler" , StandardScaler()), ("lin_reg" , LinearRegression()) ]) poly_reg.fit(X, y) y_predict = poly_reg.predict(X) plt.scatter(x, y) plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r' ) plt.show()
8-3 Overfitting and underfitting. URL: https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/03-Overfitting-and-Underfitting
8-4 Why we need separate training and test sets. URL: https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/04-Why-Train-Test-Split
8-5 Learning curves. URL: https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/05-Learning-Curve
Underfitting: the error is large overall (on both the training and the test data)
Overfitting: the error on the test set is large (while the training error stays small)
8-6 Validation set and cross-validation
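There is no code for this section in these notes; a minimal sketch of picking kNN hyperparameters with scikit-learn's cross_val_score on the digits data used earlier (the parameter ranges and cv=3 are illustrative choices):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

best_score, best_k, best_p = 0.0, -1, -1
for k in range(2, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        scores = cross_val_score(knn_clf, X_train, y_train, cv=3)  # 3-fold cross-validation
        score = scores.mean()
        if score > best_score:
            best_score, best_k, best_p = score, k, p

print(best_k, best_p, best_score)

# finally evaluate the chosen model on the untouched test set
best_knn_clf = KNeighborsClassifier(n_neighbors=best_k, weights="distance", p=best_p)
best_knn_clf.fit(X_train, y_train)
print(best_knn_clf.score(X_test, y_test))
```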
8-7 The bias-variance trade-off
8-8 Model generalization and ridge regression. URL: https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/08-Model-Regularization-and-Ridge-Regression
Model regularization, intuitively: an overfit model has coefficients that are too large, so we add a penalty term that keeps them small and thereby improves generalization (a sketch follows)
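Ridge regression adds a penalty proportional to the sum of squared coefficients to the loss. A minimal hedged sketch with scikit-learn, reusing the polynomial Pipeline pattern from 8-2; the data, degree and alpha below are illustrative, not the course's exact example:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

np.random.seed(42)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x + 3 + np.random.normal(0, 1, size=100)

ridge_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=20)),   # deliberately too-high degree
    ("std_scaler", StandardScaler()),
    ("ridge", Ridge(alpha=0.001)),             # alpha: regularization strength
])
ridge_reg.fit(X, y)
print(ridge_reg.score(X, y))    # sklearn's Lasso can be swapped in the same way
```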
8-9 LASSO
With Ridge, θ shrinks smoothly along the gradient: every component keeps some value on its way toward 0
With LASSO, θ tends to travel along the axes: components are driven to exactly 0 one after another
8-10 L1, L2 and Elastic Net
L0 regularization: make the number of nonzero components of θ as small as possible; the common penalties are summarized below
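For reference, the L1 (LASSO) and L2 (ridge) penalties added to the MSE term, plus one common parameterization of the elastic net with mixing ratio r:

$$
J_{L1} = MSE + \alpha\sum_{i=1}^{n}\left|\theta_i\right| \qquad J_{L2} = MSE + \alpha\sum_{i=1}^{n}\theta_i^{2} \qquad J_{EN} = MSE + \alpha\left(r\sum_{i=1}^{n}\left|\theta_i\right| + (1-r)\sum_{i=1}^{n}\theta_i^{2}\right)
$$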
Chapter 9: Logistic Regression
Solves: classification
9-1 What is logistic regression. In one sentence: map a value from (-∞, +∞) into (0, 1) and classify on it (predict 1 if it is greater than 0.5, 0 if it is smaller)
The Sigmoid function
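Written out, the Sigmoid function and the resulting classifier:

$$
\sigma(t) = \frac{1}{1 + e^{-t}} \qquad \hat{p} = \sigma\!\left(\theta^{T} x_b\right) = \frac{1}{1 + e^{-\theta^{T} x_b}} \qquad \hat{y} = \begin{cases}1, & \hat{p} \ge 0.5\\ 0, & \hat{p} < 0.5\end{cases}
$$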
9-2 The loss function of logistic regression
When the argument X is positive, the usual "adding shifts left, subtracting shifts right" rule applies; when X is negative it is reversed (this rule only applies at the last step, once the sign of X is fixed)
Summing the error over all samples and averaging gives the loss function
X_b·θ: ranges over (-∞, +∞)
Predicted value after the Sigmoid: ranges over (0, 1)
J(θ): the loss; it is largest when the true label is 1 but the prediction is near 0, and equally large when the true label is 0 but the prediction is near 1 (written out below)
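The loss in full (the cross-entropy form the course derives):

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\hat{p}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{p}^{(i)}\right)\right] \qquad \hat{p}^{(i)} = \sigma\!\left(\theta^{T} x_b^{(i)}\right)
$$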
9-3 The gradient of the logistic regression loss
First half
Second half
Combined
This gives the derivative of the loss with respect to a single component θ_j, averaged over all samples
y_hat: the estimate produced by logistic regression (a value between 0 and 1)
Conclusion
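The conclusion, written out; note how it mirrors the linear regression gradient, with σ(X_bθ) in place of X_bθ:

$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma\!\left(\theta^{T}x_b^{(i)}\right) - y^{(i)}\right)X_j^{(i)} \qquad\qquad \nabla J(\theta) = \frac{1}{m}\,X_b^{T}\left(\sigma\!\left(X_b\theta\right) - y\right)
$$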
9-7 Logistic regression in scikit-learn
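The notes contain no code for this section; a minimal sketch of sklearn's LogisticRegression on the two-feature iris setup from 3-12 (the feature choice and C value are illustrative assumptions):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris.data[:, :2]                   # two features, as in 3-12
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression(C=1.0, max_iter=1000)   # C is the inverse regularization strength
log_reg.fit(X_train, y_train)
print(log_reg.score(X_test, y_test))
print(log_reg.predict_proba(X_test[:5]))              # per-class probabilities
```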
9-8 OvR and OvO. A plain-language explanation: https://wenku.baidu.com/view/c13c613ac181e53a580216fc700abb68a982adc4.html
OvR is faster; OvO is more accurate
Chapter 10: Evaluating Classification Results
10-1 The accuracy trap and the confusion matrix
Memory trick: the second letter (P/N) records whatever the prediction was; the first letter is F when prediction and truth differ, and T when they agree
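A minimal sketch of building a confusion matrix with sklearn.metrics; the skewed "is it a 9?" digits setup is an illustrative assumption:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

digits = datasets.load_digits()
y = (digits.target == 9).astype(int)     # highly skewed binary problem: "is this digit a 9?"
X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state=666)

log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)
y_predict = log_reg.predict(X_test)
print(confusion_matrix(y_test, y_predict))  # rows: true class, columns: predicted class
```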
10-2 Precision and recall
Precision: of everything predicted as positive, the fraction that is actually positive (the "subjective" view: among my predictions, how many are right)
Recall: of all truly positive samples, the fraction that was correctly predicted (the "objective" view: among the real positives, how many did I find)
On extremely skewed data, accuracy is meaningless; look at precision and recall instead
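In terms of the confusion matrix entries:

$$
precision = \frac{TP}{TP + FP} \qquad\qquad recall = \frac{TP}{TP + FN}
$$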
10-4 F1 Score
Harmonic mean: F1 is large only when both precision and recall are large (if either one is small, the result is small)
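The F1 score is the harmonic mean of the two:

$$
F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}
$$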
10-5 Balancing precision and recall
X axis: decision threshold
Y axis: precision and recall
X axis: precision
Y axis: recall
The larger the area under the (precision-recall) curve, the better the classifier
10-7 The ROC curve
TPR is simply the recall
Take video anomaly detection as an example:
TPR: of all the abnormal videos, the fraction predicted as abnormal (we want this as high as possible)
FPR: of all the normal videos, the fraction predicted as abnormal (we want this as low as possible)
In the figure below, the stars can be read as abnormal videos and the circles as normal ones
To remember: the larger the area under the ROC curve, the better the classifier
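In terms of the confusion matrix (sklearn.metrics provides roc_curve and roc_auc_score for computing the curve and the area under it):

$$
TPR = \frac{TP}{TP + FN} \qquad\qquad FPR = \frac{FP}{FP + TN}
$$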
Choosing between the PR curve and the ROC curve: https://coding.imooc.com/learn/questiondetail/42693.html
Concretely, the core difference between the PR curve and the ROC curve is TN: the PR curve does not reflect TN at all. So if TN does not matter in your application, the PR curve is a very good indicator (in fact, Precision and Recall remove the influence of extremely skewed data precisely by discarding TN, thereby amplifying the relationships among FP, FN and TP).
Chapter 11: Support Vector Machines (SVM)
Solves: classification and regression
11-1 What is an SVM
A Soft Margin SVM can handle data that is not linearly separable
11-2 The optimization problem behind the SVM
Rewrite w_d simply as w, and the other symbols analogously
This is a constrained optimization problem (written out below)
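The Hard Margin SVM objective in its standard form:

$$
\min_{w,b}\ \frac{1}{2}\left\|w\right\|^{2} \qquad s.t.\ \ y^{(i)}\left(w^{T}x^{(i)} + b\right) \ge 1,\ \ i = 1,\dots,m
$$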
11-3 Soft Margin SVM
Problems with the Hard Margin SVM:
it can be dominated by a few unusual points, which hurts generalization
it cannot handle data that is not linearly separable
The larger C is, the smaller the tolerance for violations; the smaller C is, the larger the tolerance
In sklearn the hyperparameter C multiplies the slack (error) term that sits next to min ½‖w‖², rather than scaling a regularization term, so the larger C is, the less tolerant the model is of mistakes (see the objective below)
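The Soft Margin objective with slack variables ζ_i:

$$
\min_{w,b,\zeta}\ \frac{1}{2}\left\|w\right\|^{2} + C\sum_{i=1}^{m}\zeta_i \qquad s.t.\ \ y^{(i)}\left(w^{T}x^{(i)} + b\right) \ge 1 - \zeta_i,\ \ \zeta_i \ge 0
$$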
11-4 SVM in scikit-learn
Why standardization is needed (the SVM is distance-based, so the features must be on comparable scales)
11-6 What exactly is a kernel function (this section left me quite confused)
Neither how the transformation works nor why the result looks the way it does was explained clearly...
Rough idea: plugging the original samples into the kernel function K effectively performs a certain feature transformation
Rough idea: treat x and y as two vectors; the function K implicitly maps x and y to x' and y'. x' and y' have the same form, shown below, starting from the constant 1 and going up to x_n^2^
It is easy to see that the mapped x' contains terms of degree 0 through 2, so this is effectively a dimensionality increase; but the kernel performs the mapping on the fly instead of materializing the expanded features, which saves storage
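The degree-2 polynomial kernel being described, and its general form:

$$
K(x, y) = \left(x \cdot y + 1\right)^{2} \qquad\qquad K(x, y) = \left(x \cdot y + c\right)^{d}
$$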
11-7 The RBF (Gaussian) kernel
Example: fix two landmark (reference) points l1 and l2
In actual use, every sample serves as a landmark point
11-8 gamma in the RBF kernel
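The RBF kernel itself; a larger gamma means narrower Gaussian bumps around each landmark, hence a more complex model that overfits more easily, and a smaller gamma means the opposite:

$$
K(x, y) = e^{-\gamma\left\|x - y\right\|^{2}}
$$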
11-9 Using the SVM idea for regression: we want as many points as possible to fall inside the margin, and the final prediction is taken as the average (the middle line of the margin)
Chapter 12: Decision Trees
A non-parametric learning method
Solves: classification and regression
12-1 What is a decision tree
12-2 Information entropy
p_i: the proportion of the data belonging to class i
The data on the right is more certain (lower entropy) than the data on the left
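The entropy formula (the lower the entropy, the more certain the data):

$$
H = -\sum_{i=1}^{k} p_i \log\left(p_i\right)
$$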
12-3 Using information entropy to find the best split
One split: take the values of two different samples on the same feature dimension, use their average as a candidate threshold, compute the entropy of the two resulting parts, and keep the split with the lowest entropy
Repeated splits: building on one split, the part whose entropy is already minimal (close to 0) can be left alone while the rest is split again, or further splits can be made on other features
12-4 The Gini coefficient
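Gini impurity plays the same role as entropy (lower means purer):

$$
G = 1 - \sum_{i=1}^{k} p_i^{2}
$$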
12-5 CART and decision tree hyperparameters
max_depth: the maximum depth of the tree
min_samples_split: the minimum number of samples a node must hold before it may be split; the larger the value, the less prone to overfitting (too large leads to underfitting)
min_samples_leaf: the minimum number of samples a leaf node must contain; the larger the value, the less prone to overfitting (too large leads to underfitting)
max_leaf_nodes: the maximum number of leaf nodes, which indirectly limits the depth (a sketch of these parameters follows)
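A minimal hedged sketch of these hyperparameters on sklearn's DecisionTreeClassifier; the moons dataset and the specific values are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.25, random_state=666)

dt_clf = DecisionTreeClassifier(
    criterion="entropy",       # or "gini"
    max_depth=2,
    min_samples_split=10,
    min_samples_leaf=6,
    max_leaf_nodes=4,
)
dt_clf.fit(X, y)
print(dt_clf.score(X, y))
```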
12-7 Limitations of decision trees
The decision boundaries are always axis-aligned (horizontal and vertical segments)
Very sensitive to individual data points
See the code for details
Chapter 13: Ensemble Learning and Random Forests
13-1 What is ensemble learning
13-2 Soft Voting Classifier
Requires models that can estimate class probabilities
Soft voting is usually better than hard voting (a sketch follows)
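A minimal hedged sketch of a soft VotingClassifier in sklearn; the base models follow the usual logistic regression / SVM / decision tree trio, and the dataset and hyperparameters are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC(probability=True)),   # probability=True so the SVM can vote with probabilities
    ('dt_clf', DecisionTreeClassifier()),
], voting='soft')
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```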
13-3 Bagging and Pasting
13-4 oob (Out-of-Bag) and more discussion of Bagging
(The figure below was shown without the reason being explained)
sklearn's random forest: at every node, the split is chosen among a random subset of the features (instead of searching all features for the best split, as a single decision tree does); a sketch covering Bagging, oob and the random forest follows
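A minimal hedged sketch of BaggingClassifier with an out-of-bag score, plus RandomForestClassifier; the dataset and parameter values are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Bagging: 500 trees, each trained on 100 samples drawn with replacement (bootstrap=True);
# oob_score=True lets the samples left out of each bootstrap act as a built-in validation set
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=500, max_samples=100,
                                bootstrap=True, oob_score=True, n_jobs=-1)
bagging_clf.fit(X, y)
print(bagging_clf.oob_score_)

# Random forest: a random subset of features is considered at every split
rf_clf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1, random_state=666)
rf_clf.fit(X, y)
print(rf_clf.oob_score_)
```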
13-6 Ada Boosting and Gradient Boosting
Classic machine learning models: ensemble learning, Boosting (AdaBoost and gradient boosting): https://blog.csdn.net/jesseyule/article/details/111997597
AdaBoost and gradient boosting differ in how they make each round pay more attention to the data the previous round got wrong: AdaBoost attaches weights to the samples and, after each round of training, increases the weights of the misclassified samples so that the next round focuses on them.
Make mistakes -> correct them, and repeat
Reading order of the figure: left to right, top to bottom
Each green plot corrects the errors of the previous green plot
Each red plot is the sum of all the green plots so far (a sketch follows)
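A minimal hedged sketch of both boosting classifiers in sklearn; the dataset, base learner and parameter values are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# AdaBoost: re-weights misclassified samples after every round
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))

# Gradient boosting: each new tree fits the residual errors of the current ensemble
gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=30)
gb_clf.fit(X_train, y_train)
print(gb_clf.score(X_test, y_test))
```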
13-7 Stacking
Chapter 14: More Machine Learning Algorithms
14-1 Learn from the scikit-learn documentation. Good luck, everyone! Official site: https://scikit-learn.org/stable/