【计算机基础】机器学习入门
聪头 游戏开发萌新

Python3入门机器学习 经典算法与应用

时间:2022年6月23日16:13:53

Git URL:https://git.imooc.com/coding-169/coding-169/src/master

第1章 欢迎来到 Python3 玩转机器学习

1-1 什么是机器学习

image

1-2 课程涵盖的内容和理念

image image image

1-3 课程所使用的主要技术栈

image image image image image image

第2章 机器学习基础

2-1 机器学习世界的数据

image image image image

↑ 大写代表矩阵,小写代表向量

image image image

2-2 机器学习的主要任务

机器学习的基本任务:分类、回归

分类

image image image image image image

回归

image image image

2-3 监督学习,非监督学习,半监督学习和增强学习

监督学习

监督学习:给机器的训练数据拥有“标记”或者“答案”

image image

非监督学习

image image image image

半监督学习

image image

增强学习

image

2-4 批量学习,在线学习,参数学习和非参数学习

批量学习

image image

在线学习

image image

参数学习

image image

非参数学习

image

2-5 和机器学习相关的“哲学”思考

image

无数据举例:Alpha Go

image image

本章小结

image

2-7 课程使用环境搭建

Anaconda官网:https://www.anaconda.com/

image image

第3章 Jupyter Notebook, numpy和matplotlib

3-1 Jupyter Notebook基础

快捷键

![image-20220623202116627](Python3入门机器学习 经典算法与应用.assets/image-20220623202116627.png)

![image-20220623201319046](Python3入门机器学习 经典算法与应用.assets/image-20220623201319046.png)

![image-20220623201916533](Python3入门机器学习 经典算法与应用.assets/image-20220623201916533.png)

修改某单元格语法类型

![image-20220623201941657](Python3入门机器学习 经典算法与应用.assets/image-20220623201941657.png)

Notebook优势:耗时加载的大数据只需加载一次,在整个会话期间一直可用

image

重置:此时会从上至下依次执行代码

image

3-2 Jupyter Notebook中的魔法命令

%run:执行当前目录下的某个Python文件,并将其中的定义导入当前Notebook环境

image

%timeit:单行性能测试,底层会自动多次执行取平均

  • %%timeit:多行测试
image

↑ Python中使用生成表达式创建数组比for循环快

%time:性能测试,只执行一次

  • %%time:多行测试

%lsmagic:展示所有魔法命令

%xxx?:查看xxx命令的文档
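一个最小用法示意(需在 Jupyter Notebook 单元格中运行,`myscript.py` 为示例文件名;`%%` 开头的单元格魔法命令需独占一个单元格):

```python
%run myscript.py                          # 执行当前目录下的 myscript.py
%timeit L = [i**2 for i in range(1000)]   # 多次执行取平均,测单条语句
%time L = [i**2 for i in range(1000)]     # 只执行一次,测单条语句
```

```python
%%timeit
# %%timeit 测试整个单元格(多行)的耗时
L = []
for i in range(1000):
    L.append(i**2)
```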

3-3 Numpy 数据基础

import numpy as np
print(np.__version__) #1.21.5

# Python List的特点
L = [i for i in range(10)] # 数组生成表达式
L[5] = 'Machine Learning' # List对数据类型没有要求
print(L) #[0, 1, 2, 3, 4, 'Machine Learning', 6, 7, 8, 9]

# 使用array,限定类型,效率和安全性更高
import array
arr = array.array('i', [i for i in range(10)]) # i代表整型数组
# arr[5] = 'Hello' # 报错

# numpy.array
nparr = np.array([i for i in range(10)])
# nparr[5] = 'Hello' # 报错
print(nparr.dtype) # 查看nparr的类型,此为int32
nparr[5] = 3.14
print(nparr) # 3.14自动转换成3 [0 1 2 3 4 3 6 7 8 9]
nparr2 = np.array([1,2,3.0])
print(nparr2.dtype) # float64

3-4 创建Numpy数组(和矩阵)

  • 全0,全1,自定义方式创建向量或矩阵
  • arange:指定一个范围和步长,创建向量或矩阵
  • linspace:指定一个范围和切分次数,创建向量和矩阵(范围为左闭右闭)
  • random:创建随机的向量和矩阵
import numpy as np
# 全0的向量或矩阵
np.zeros(10)
print(np.zeros(10), np.zeros(10).dtype) # [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] float64
print(np.zeros(10, dtype=int)) # [0 0 0 0 0 0 0 0 0 0]
print(np.zeros((3, 5)))
print(np.zeros(shape=(3, 5), dtype=int))
'''
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]]
'''

# 全1的向量或矩阵
print(np.ones(10)) # [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
print(np.ones((3, 5)))
'''
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
'''

# 自定义
print(np.full(shape = (3, 5), fill_value = 666))
'''
[[666 666 666 666 666]
[666 666 666 666 666]
[666 666 666 666 666]]
'''
print(np.full(shape = (3, 5), fill_value = 666.0))
'''
[[666. 666. 666. 666. 666.]
[666. 666. 666. 666. 666.]
[666. 666. 666. 666. 666.]]
'''

# arange
print([i for i in range(0, 20, 2)]) # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(np.arange(0, 20, 2)) # [ 0 2 4 6 8 10 12 14 16 18]
# print([i for i in range(0, 20, 0.2)] #步长不能为浮点数,报错
print(np.arange(0, 1, 0.2)) # [0. 0.2 0.4 0.6 0.8]
print(np.arange(0, 10)) # [0 1 2 3 4 5 6 7 8 9]
print(np.arange(10)) # [0 1 2 3 4 5 6 7 8 9]

# linspace
print(np.linspace(0, 20, 10)) #[0,20]等长截取10个点
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889 11.11111111
# 13.33333333 15.55555556 17.77777778 20. ]
print(np.linspace(0, 20, 11)) # [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20.]

# random
print(np.random.randint(0, 10)) # 生成[0, 10)的一个随机数 # 2
print(np.random.randint(0, 10, 10)) # 生成[0, 10)的数组(第三个参数指明数组大小) # [1 6 1 7 8 0 3 0 8 4]
print(np.random.randint(4, 8, size=10)) # [4 5 6 5 5 5 5 4 7 5]
print(np.random.randint(4, 8, size=(3, 5)))
'''
[[7 6 6 7 7]
[7 4 6 6 5]
[5 6 4 7 7]]
'''
np.random.seed(666) # 随机种子的使用
print(np.random.random()) # [0, 1)的随机数 # 0.7004371218578347
print(np.random.random(10)) # [0, 1)随机数数组(10个元素)
# [0.84418664 0.67651434 0.72785806 0.95145796 0.0127032 0.4135877
# 0.04881279 0.09992856 0.50806631 0.20024754]
print(np.random.random((3, 5)))
'''
[[0.74415417 0.192892 0.70084475 0.29322811 0.77447945]
[0.00510884 0.11285765 0.11095367 0.24766823 0.0232363 ]
[0.72732115 0.34003494 0.19750316 0.90917959 0.97834699]]
'''
print(np.random.normal()) # 服从标准正态分布的浮点数 # -1.6829007709843886
print(np.random.normal(10, 100)) # 生成服从均值为10,标准差为100的正态分布浮点数 # 32.91852477040214
print(np.random.normal(0, 1, (3, 5)))
'''
[[-1.75662522 0.84463262 0.27721986 0.85290153 0.1945996 ]
[ 1.31063772 1.5438436 -0.52904802 -0.6564723 -0.2015057 ]
[-0.70061583 0.68713795 -0.02607576 -0.82975832 0.29655378]]
'''

查看文档

# np.random.normal? #查看文档
# help(np.random.normal) #在notebook中查看文档

3-5 Numpy数组(和矩阵)的基本操作

  • 基本属性
  • 数据访问
  • reshape:改变维度
import numpy as np
x = np.arange(10)
print(x) # [0 1 2 3 4 5 6 7 8 9]
X = np.arange(15).reshape(3, 5)
print(X)
'''
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
'''

# numpy.array 基本属性
# 查看维度
print(x.ndim) # 1
print(X.ndim) # 2
# 查看结构,返回元组
print(x.shape) # (10,)
print(X.shape) # (3, 5)
# 元素个数
print(x.size) # 10
print(X.size) # 15

# 数据访问
print(x[0], x[-1]) # 0 9
print(X[0][0]) # 不建议这样写 # 0
print(X[(0, 0)], X[2, 2]) # 推荐X[2, 2]的写法 # 0 12
print(x[0:5]) # 切片[0, 5)的元素 # [0 1 2 3 4]
print(x[:5]) # [第一个元素(0), 5) 切片 # [0 1 2 3 4]
print(x[5:]) # [5, 结尾(10)) 切片 # [5 6 7 8 9]
print(x[::2]) # 从头到尾取步长2切片 # [0 2 4 6 8]
print(x[::-1]) # 从头到尾取步长-1切片(逆序) # [9 8 7 6 5 4 3 2 1 0]
print(X[:2, :3]) # 前两行,前三列
'''
[[0 1 2]
[5 6 7]]
'''
# [][]无法表达正确的语义
print(X[:2]) # 前两行
'''
[[0 1 2 3 4]
[5 6 7 8 9]]
'''
print(X[:2][:3]) # 不推荐这样写:本意想取前两行前三列,但实际仍是前两行(X[:2]先取前两行,再对这个只有2行的结果取[:3],行数不足3,结果不变)
'''
[[0 1 2 3 4]
[5 6 7 8 9]]
'''
print(X[:2, ::2]) # 前两行,每行从头到尾间隔为2的元素
'''
[[0 2 4]
[5 7 9]]
'''
print(X[::-1, ::-1]) # 矩阵逆序
'''
[[14 13 12 11 10]
[ 9 8 7 6 5]
[ 4 3 2 1 0]]
'''
print(X[0], X[0, :]) #取第一行 # [0 1 2 3 4] [0 1 2 3 4]
print(X[:, 0]) #取第一列 # [ 0 5 10]
subX = X[:2, :3]
subX[0, 0] = 100
print(subX)
'''
[[100 1 2]
[ 5 6 7]]
'''
print(X) # numpy中使用引用的方式获取子矩阵
'''
[[100 1 2 3 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]]
'''
subX = X[:2, :3].copy() # 复制值
subX[0, 0] = 99
print(subX)
'''
[[99 1 2]
[ 5 6 7]]
'''
print(X)
'''
[[100 1 2 3 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]]
'''

# Reshape
print(x.reshape(2, 5)) # 将x转换成2行5列的矩阵,不改变x本身
'''
[[0 1 2 3 4]
[5 6 7 8 9]]
'''
print(x.reshape(10, -1)) # 10行,列数由计算机决定
'''
[[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]]
'''
print(x.reshape(-1, 10)) # 10列,行数由计算机决定 # [[0 1 2 3 4 5 6 7 8 9]]
# x.reshape(3, -1) # 10个元素无法均分给3行,报错

3-6 Numpy数组(和矩阵)的合并与分割

  • concatenate:拼接,默认按行拼接。要求数据是同维的
    • axis:1时按列拼接
  • vstack/hstack:竖直/水平方向堆叠。不要求数据同维
  • split:分割,默认按行分割
    • axis:1时按列分割
  • vsplit/hsplit:竖直/水平分割
import numpy as np
# 数据合并
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.concatenate([x, y])) # [1 2 3 3 2 1]
z = np.array([666, 666, 666])
print(np.concatenate([x, y, z])) # [ 1 2 3 3 2 1 666 666 666]
A = np.array([[1, 2, 3],
[4, 5, 6]])
print(np.concatenate([A, A])) # 默认沿行拼接
'''
[[1 2 3]
[4 5 6]
[1 2 3]
[4 5 6]]
'''
print(np.concatenate([A, A], axis=1)) # 沿列拼接
'''
[[1 2 3 1 2 3]
[4 5 6 4 5 6]]
'''
# np.concatenate([A, z]) # 维数不同无法连接,A是二维矩阵,z是一维向量,报错
print(z.reshape(1, -1)) # 1行,列数自动填充 # [[666 666 666]]
print(z.reshape(1, -1).ndim) # 2
print(np.concatenate([A, z.reshape(1, -1)])) # A矩阵和z向量连接,产生新矩阵
'''
[[ 1 2 3]
[ 4 5 6]
[666 666 666]]
'''
A2 = np.concatenate([A, z.reshape(1, -1)])
print(np.vstack([A, z])) # 竖直方向数据堆叠
'''
[[ 1 2 3]
[ 4 5 6]
[666 666 666]]
'''
B = np.full((2, 2), 100)
print(B)
'''
[[100 100]
[100 100]]
'''
print(np.hstack([A, B]))
'''
[[ 1 2 3 100 100]
[ 4 5 6 100 100]]
'''

# 数据分割
x = np.arange(10)
print(np.split(x, [3, 7])) # 分割x,分割点为3,7(规则为左闭右开)
''' [array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])] '''
print(np.split(x, [5]))
''' [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])] '''
A = np.arange(16).reshape((4, 4))
A1, A2 = np.split(A, [2]) # 基于行分割
print(A1)
'''
[[0 1 2 3]
[4 5 6 7]]
'''
print(A2)
'''
[[ 8 9 10 11]
[12 13 14 15]]
'''
A1, A2 = np.split(A, [2], axis=1) # 基于列分割
print(A1)
'''
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
'''
print(A2)
'''
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
'''
upper, lower = np.vsplit(A, [2]) # 竖直方向分割
print(upper)
'''
[[0 1 2 3]
[4 5 6 7]]
'''
print(lower)
'''
[[ 8 9 10 11]
[12 13 14 15]]
'''
left, right = np.hsplit(A, [2]) # 水平方向分割
print(left)
'''
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
'''
print(right)
'''
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
'''
data = np.arange(16).reshape((4, 4))
print(data)
'''
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
'''
X, y = np.hsplit(data, [-1]) # 分割最后一列
print(X)
'''
[[ 0 1 2]
[ 4 5 6]
[ 8 9 10]
[12 13 14]]
'''
print(y)
'''
[[ 3]
[ 7]
[11]
[15]]
'''
print(y[:, 0]) # 将y矩阵转换成向量 # [ 3 7 11 15]

3-7 Numpy中的矩阵运算

n = 10
L = [i for i in range(n)]
print(2 * L) # 将两个L首尾相接 # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
A = []
for e in L:
    A.append(2 * e)
print(A) # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
A = [2*e for e in L] # 比for快

import numpy as np
L = np.arange(n)
A = np.array([2*e for e in L]) # 比原生List的生成表达式快
A = 2 * L # 向量*2,速度很快
print(A) # [ 0 2 4 6 8 10 12 14 16 18]

# Universal Function
X = np.arange(1, 16).reshape((3, 5))
print(X)
'''
[[ 1 2 3 4 5]
[ 6 7 8 9 10]
[11 12 13 14 15]]
'''
print(X + 1)
'''
[[ 2 3 4 5 6]
[ 7 8 9 10 11]
[12 13 14 15 16]]
'''
print(X - 1)
'''
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
'''
print(X * 2)
'''
[[ 2 4 6 8 10]
[12 14 16 18 20]
[22 24 26 28 30]]
'''
print(X / 2) # 浮点数除法
'''
[[0.5 1. 1.5 2. 2.5]
[3. 3.5 4. 4.5 5. ]
[5.5 6. 6.5 7. 7.5]]
'''
print(X // 2) # 整数除法
'''
[[0 1 1 2 2]
[3 3 4 4 5]
[5 6 6 7 7]]
'''
print(X ** 2) # 幂运算
'''
[[ 1 4 9 16 25]
[ 36 49 64 81 100]
[121 144 169 196 225]]
'''
print(X % 2) # 求余
'''
[[1 0 1 0 1]
[0 1 0 1 0]
[1 0 1 0 1]]
'''
print(1 / X) # 求倒数
print(np.abs(X)) # 求绝对值
print(np.sin(X)) # 求正弦
print(np.cos(X))
print(np.tan(X))
print(np.exp(X)) # e^x次方
print(np.power(3, X)) # 等价于3**X
print(np.log(X)) # ln
print(np.log2(X))
print(np.log10(X))

# 矩阵运算
A = np.arange(4).reshape(2, 2)
B = np.full((2, 2), 10)
print(A + B)
'''
[[10 11]
[12 13]]
'''
print(A - B)
'''
[[-10 -9]
[ -8 -7]]
'''
print(A * B) # 对应元素相乘,并非矩阵乘法
'''
[[ 0 10]
[20 30]]
'''
print(A / B)
'''
[[0. 0.1]
[0.2 0.3]]
'''
print(A.dot(B)) # 矩阵乘法
'''
[[10 10]
[50 50]]
'''
print(A.T) # 转置
'''
[[0 2]
[1 3]]
'''

# 向量和矩阵的运算
v = np.array([1, 2])
print(A)
'''
[[0 1]
[2 3]]
'''
print(v + A) # A矩阵每一行和v做加法
'''
[[1 3]
[3 5]]
'''
print(np.vstack([v] * A.shape[0])) # [v] * A的行数(拼接次数) = 竖直拼接结果
'''
[[1 2]
[1 2]]
'''
print(np.tile(v, (2, 1))) # 行向量堆叠2次,列向量堆叠1次
'''
[[1 2]
[1 2]]
'''
print(v * A) # v各元素和A逐行逐元素相乘
'''
[[0 2]
[2 6]]
'''
print(v.dot(A)) # 行向量 * 矩阵 # [4 7]
print(A.dot(v)) # 矩阵 * 列向量 # [2 8]

# 矩阵的逆
invA = np.linalg.inv(A)
print(invA)
'''
[[-1.5 0.5]
[ 1. 0. ]]
'''
print(A.dot(invA))
'''
[[1. 0.]
[0. 1.]]
'''
X = np.arange(16).reshape((2, 8))
pinvX = np.linalg.pinv(X) # 伪逆矩阵
print(pinvX.shape) # (8, 2)
print(X.dot(pinvX))
'''
[[ 1.00000000e+00 -2.49800181e-16]
[ 6.66133815e-16 1.00000000e+00]]
'''

3-8 Numpy中的聚合运算

import numpy as np
np.random.seed(19991101)
L = np.random.random(100) # 100个[0, 1)的随机数组成的数组
# print(L)
print(sum(L)) # 47.94254974738839
print(np.sum(L)) # 效率更高
print(np.min(L)) # 0.0003132760524273692
print(np.max(L)) # 0.9931441196685156
print(L.min()) # 根据个人喜好,不过更推荐使用np.min,可以显式指明调用numpy库
print(L.max())
print(L.sum())
# 矩阵的聚合运算
X = np.arange(16).reshape(4, -1)
print(X)
'''
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
'''
print(np.sum(X)) # 120
print(np.sum(X, axis=0)) # 沿着行方向运算,即每列求和(诀窍:行压缩)# [24 28 32 36]
print(np.sum(X, axis=1)) # 沿着列方向运算,即每行求和(诀窍:列压缩)# [ 6 22 38 54]
print(np.prod(X)) # 逐元素相乘的乘积 # 0
print(np.prod(X + 1)) # 2004189184
print(np.mean(X)) # 求平均值 # 7.5
print(np.median(X)) # 求中位数 # 7.5

print(np.median(L)) # 0.45331017330323375
print(np.percentile(L, q=50)) # 这组元素50%的元素小于结果值,即中位数 # 0.45331017330323375
for percent in [0, 25, 50, 75, 100]:
    print(np.percentile(L, q=percent))
'''
0.0003132760524273692
0.2559220514470348
0.45331017330323375
0.6964296186068865
0.9931441196685156
'''
print(np.var(L)) # 求方差 # 0.07149006634669938
print(np.std(L)) # 求标准差 # 0.26737626361870526
x = np.random.normal(0, 1, size=1000000) # 均值为0,标准差为1的1000000个随机数
print(np.mean(x)) # -0.0010810265562775107
print(np.std(x)) # 1.0004066018104387

3-9 Numpy中的arg运算

import numpy as np
np.random.seed(19991101)
x = np.random.normal(0, 1, size=1000000) # 均值为0,标准差为1的1000000个随机数
print(np.min(x)) # 返回最小值 # -4.9121140331754525
print(np.argmin(x)) # 返回最小值的索引值 # 450222
print(x[np.argmin(x)]) # -4.9121140331754525
print(np.argmax(x)) # 459822

# 排序和使用索引
x = np.arange(16)
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
np.random.shuffle(x) # 乱序处理
print(x) # [ 1 10 3 0 5 6 15 8 14 2 7 13 9 12 4 11]
# np.sort(x)
x.sort()
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
X = np.random.randint(10, size=(4, 4)) # [0,10)之间4x4矩阵
print(X)
'''
[[2 9 2 3]
[5 8 1 1]
[9 4 5 7]
[7 8 7 1]]
'''
print(np.sort(X)) # 默认axis=1,沿着列方向排序,即每行排序
'''
[[2 2 3 9]
[1 1 5 8]
[4 5 7 9]
[1 7 7 8]]
'''
print(np.sort(X, axis=0)) # 每列有序
'''
[[2 4 1 1]
[5 8 2 1]
[7 8 5 3]
[9 9 7 7]]
'''

# 索引
np.random.shuffle(x) # 乱序
print(x) # [ 2 5 12 14 1 6 8 11 3 10 4 0 13 15 7 9]
print(np.argsort(x)) # 索引排序 # [11 4 0 8 10 1 5 14 6 15 9 7 2 12 3 13]
print(np.partition(x, 3)) # 部分排序:以第3小的值为界划分,左边都不大于它,右边都不小于它 # [ 0 1 2 3 7 6 8 5 4 9 10 14 13 15 12 11]
print(np.argpartition(x, 4)) # 同partition,但返回的是索引 # [11 4 0 8 10 1 5 14 6 15 9 3 12 13 2 7]
print(X)
'''
[[2 9 2 3]
[5 8 1 1]
[9 4 5 7]
[7 8 7 1]]
'''
print(np.argsort(X, axis=1)) # 按行索引排序
'''
[[0 2 3 1]
[2 3 0 1]
[1 2 3 0]
[3 0 2 1]]
'''
print(np.argsort(X, axis=0)) # 按列索引排序
'''
[[0 2 1 1]
[1 1 0 3]
[3 3 2 0]
[2 0 3 2]]
'''
print(np.argpartition(X, 2, axis=1)) # 按行索引划分
'''
[[0 2 3 1]
[2 3 0 1]
[1 2 3 0]
[3 2 0 1]]
'''
print(np.argpartition(X, 2, axis=0)) # 按列索引划分

3-10 Numpy中的比较和Fancy Indexing

import numpy as np
x = np.arange(16)
ind = [3, 5, 8] # 待访问索引
print(x[ind]) # 得到由索引3,5,8的值组成的向量 # [3 5 8]
ind = np.array([[0, 2],
[1, 3]])
print(x[ind]) #得到二维矩阵(从一维向量索引得来的值)
'''
[[0 2]
[1 3]]
'''
X = x.reshape(4, -1) # 4x4矩阵
print(X)
'''
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
'''
row = np.array([0, 1, 2]) # 感兴趣的行
col = np.array([1, 2, 3]) # 感兴趣的列
print(X[row, col]) #获得(0,1),(1,2),(2,3)三个点的值
'''
[ 1 6 11]
'''
print(X[0, col]) #(0,1),(0,2),(0,3) # [1 2 3]
print(X[:2, col]) #前两行索引为1,2,3的列的值
'''
[[1 2 3]
[5 6 7]]
'''
col = [True, False, True, True] #使用布尔数组
print(X[1:3, col]) # 对索引[1,3)行的内容,取0,2,3列
'''
[[ 4 6 7]
[ 8 10 11]]
'''

# numpy.array的比较
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
print(x < 3) #x中所有元素和3比较
'''
[ True True True False False False False False False False False False
False False False False]
'''
print(2 * x == 24 - 4 * x)
'''
[False False False False True False False False False False False False
False False False False]
'''
print(np.sum(x <= 3)) # True为1,False为0 # 4
print(np.count_nonzero(x <= 3)) # 4
print(np.any(x == 0)) # 任何一个返回true则返回true # True
print(np.all(x >= 0)) # 所有返回true则返回true # True
print(np.sum(X % 2 == 0)) # 求偶数个数 # 8
print(np.sum(X % 2 == 0, axis = 1)) # 沿着列方向,即每行有多少偶数 # [2 2 2 2]
print(np.sum(X % 2 == 0, axis = 0)) # 沿着行方向,即每列有多少偶数 # [4 0 4 0]
print(np.all(X > 0, axis = 1)) # 每行是否大于0 # [False True True True]
print(x) # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
print(np.sum((x > 3) & (x < 10))) # 6
print(np.sum((x % 2 == 0) | (x > 10))) # 11
print(np.sum(~(x==0))) # 15
print(x[x < 5]) # [0 1 2 3 4]
print(x[x % 2 == 0]) # [ 0 2 4 6 8 10 12 14]
print(X[X[:, 3] % 3 == 0, :]) # 取索引为3的列能被3整除的行
'''
[[ 0 1 2 3]
[12 13 14 15]]
'''

个人:numpy小结

  • 3-7~3-9:细粒度内容不好总结。直接翻看该小节内容即可
import numpy as np

数据对象常用属性

以下属性通过 numpy 创建的数据对象直接访问

| 出处 | 属性 | 说明 | 举例 |
| --- | --- | --- | --- |
| 3-3 | dtype | 查看元素类型 | int32, float64 |
| 3-5 | ndim | 查看维度 | 1, 2 |
| 3-5 | shape | 查看结构,返回元组 | (10,), (3, 5) |
| 3-5 | size | 查看元素个数 | 10, 15 |

数据对象常用方法

| 出处 | 举例 | 说明 |
| --- | --- | --- |
| 3-5 | x[0] | 一维向量的访问 |
| 3-5 | X[0, 0] | 矩阵元素的访问 |
| 3-5 | x[0:5] | 一维向量切片,返回引用,左闭右开<br>①起始索引,不写默认0<br>②终点索引,不写默认元素个数<br>③步长,不写默认1,-1代表逆序 |
| 3-5 | X[:2, :3] | 矩阵切片,原理同上<br>(示例:取前两行、前三列的数据) |
| 3-5 | X[:2, :3].copy() | 值拷贝 |
| 3-5 | x.reshape(2, 5)<br>x.reshape(10, -1) | 重构(改变维数),返回新对象<br>(示例1:将一维向量x转换成2x5的矩阵)<br>(示例2:将x转换成10行的矩阵,列数由计算机自动计算) |
| 3-10 | ind = [3, 5, 8]; x[ind]<br>ind = np.array([[0, 2], [1, 3]]); x[ind] | 一维向量的Fancy索引访问。预先声明索引的格式,按格式返回内容<br>(示例1:返回由索引3,5,8处的值组成的向量)<br>(示例2:返回二维矩阵,各元素取自一维向量的对应索引) |

numpy常用方法

| 出处 | 函数 | 说明 | 举例 |
| --- | --- | --- | --- |
| 3-3 | np.array | 创建numpy数据对象。一维数组可称为向量,n维数组可称为矩阵<br>参数①:列表 | np.array([i for i in range(10)]) |
| 3-4 | np.zeros | 创建全为0的数据对象<br>参数①:shape,结构。1维传int,n维传元组<br>参数②:dtype,类型。传int、float等 | np.zeros(10, dtype=int)<br>np.zeros((3, 5))<br>np.zeros(shape=(3, 5), dtype=int) |
| 3-4 | np.ones | 创建全为1的数据对象,其他同上 | np.ones((3, 5)) |
| 3-4 | np.full | 创建自定义值的数据对象,基本同上<br>参数②:fill_value,填充值 | np.full(shape=(3, 5), fill_value=666) |
| 3-4 | np.arange | 指定范围创建数据对象,规则为左闭右开<br>传入三个参数:①起始值(闭);②终点值(开);③步长<br>传入两个参数:步长默认1<br>传入一个参数:起始值默认0,步长默认1 | np.arange(0, 20, 2)<br>np.arange(0, 1, 0.2)<br>np.arange(0, 10)<br>np.arange(10) |
| 3-4 | np.linspace | 指定范围创建数据对象。不同于arange由步长创建,该函数预先指定元素个数,自动计算平均步长(范围为左闭右闭)<br>参数①:起始值;参数②:终点值(闭区间);参数③:元素个数 | np.linspace(0, 20, 11) |
| 3-4 | np.random.randint | 指定范围内随机创建数据对象<br>参数①:起始值(闭)<br>参数②:终点值(开)<br>参数③:size,尺寸,传入元组可定义维数 | np.random.randint(0, 10)<br>np.random.randint(0, 10, 10)<br>np.random.randint(4, 8, size=10)<br>np.random.randint(4, 8, size=(3, 5)) |
| 3-4 | np.random.seed | 设定numpy的随机种子 | np.random.seed(666) |
| 3-4 | np.random.random | 生成[0, 1)的随机数<br>参数①:size,尺寸,传入元组可定义维数 | np.random.random()<br>np.random.random(10)<br>np.random.random((3, 5)) |
| 3-4 | np.random.normal | 根据正态分布生成随机数<br>参数①:均值,默认0<br>参数②:标准差,默认1<br>参数③:size,尺寸,传入元组可定义维数 | np.random.normal()<br>np.random.normal(10, 100)<br>np.random.normal(0, 1, (3, 5)) |
| 3-6 | np.concatenate | 拼接数据,要求各数据维数相同,返回新对象<br>参数①:由待拼接对象组成的列表<br>参数②:axis,默认0,沿行拼接;1为沿列拼接 | np.concatenate([x, y])<br>np.concatenate([A, A])<br>np.concatenate([A, A], axis=1) |
| 3-6 | np.vstack | 竖直方向拼接数据(增加行),不要求维数相同,但需满足列数一致 | np.vstack([A, z]) |
| 3-6 | np.hstack | 水平方向拼接数据(增加列) | np.hstack([A, B]) |
| 3-6 | np.split | 数据分割<br>参数①:分割对象<br>参数②:列表,指定分割点<br>参数③:axis,默认0,基于行分割;1基于列分割 | np.split(x, [3, 7])<br>np.split(A, [2])<br>np.split(A, [2], axis=1) |
| 3-6 | np.vsplit | 竖直方向分割(基于行分割) | np.vsplit(A, [2]) |
| 3-6 | np.hsplit | 水平方向分割(基于列分割) | np.hsplit(A, [2]) |

3-11 Matplotlib数据可视化基础

image image
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# 直线图 通常用于表现结果
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.show()

cosy = np.cos(x)
siny = y.copy()
plt.plot(x, siny)
plt.plot(x, cosy, color = 'orange', linestyle='-')
plt.show()

plt.plot(x, siny)
plt.plot(x, cosy, color = 'orange', linestyle='-')
plt.xlim(-5, 15) # 控制x范围
plt.ylim(0, 1.5) # 控制y范围
plt.axis([-1, 11, -2, 2]) # 同时调整x和y范围
plt.show()

plt.plot(x, siny, label='sin(x)')
plt.plot(x, cosy, color = 'orange', linestyle='-', label='cos(x)')
plt.xlabel('x axis')
plt.ylabel('y value')
plt.legend() # 添加图例
plt.title('Welcome to the ML World!')
plt.show()

# 散点图 Scatter Plot 通常用于绘制二维特征
plt.scatter(x, siny)
plt.scatter(x, cosy)
plt.show()

x = np.random.normal(0, 1, 10000)
y = np.random.normal(0, 1, 10000)
plt.scatter(x, y, alpha=0.1)
plt.show()

3-12 数据加载和简单的数据探索

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
print(iris.keys()) # 数据对应特征和label
print(iris.DESCR) # 数据集文档
print(iris.data) # 特征
print(iris.data.shape)
print(iris.feature_names) # 特征名称
print(iris.target) # label对应的索引值
print(iris.target.shape)
print(iris.target_names) # label名称

X = iris.data[:, :2] # 取前两列
print(X.shape)
plt.scatter(X[:, 0], X[:, 1]) # 取X第0和1列分别作为x和y轴
plt.show()

y = iris.target
plt.scatter(X[y==0, 0], X[y==0, 1], color='red', marker='o') # 取X满足y==0的行的0和1列
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', marker='+') # 取X满足y==1的行的0和1列
plt.scatter(X[y==2, 0], X[y==2, 1], color='green', marker='x') # 取X满足y==2的行的0和1列
plt.show()

X = iris.data[:, 2:] # 此处表示取3,4列特征
plt.scatter(X[y==0, 0], X[y==0, 1], color='red', marker='o') # 取X满足y==0的行的0和1列
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', marker='+') # 取X满足y==1的行的0和1列
plt.scatter(X[y==2, 0], X[y==2, 1], color='green', marker='x') # 取X满足y==2的行的0和1列
plt.show()

第4章 最基础的分类算法-k近邻算法 kNN

4-1 k近邻算法基础

原理:当要判断新的样本点属于哪一类时,在训练集中找到距离新样本点最近的K个点;这K个点中哪个类别的数量最多,就把新样本判定为哪个类别
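其中样本间的距离通常用欧氏距离度量(与下方代码中的 `sqrt(np.sum((x_train - x)**2))` 一致):

$$d(x^{(a)}, x^{(b)}) = \sqrt{\sum_{i=1}^{n}\left(x_i^{(a)} - x_i^{(b)}\right)^2}$$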

image image image image
import numpy as np
import matplotlib.pyplot as plt

raw_data_X = [
[3.393533211, 2.331273381],
[3.110073483, 1.781539638],
[1.343808831, 3.368360954],
[3.582294042, 4.679179110],
[2.280362439, 2.866990263],
[7.423436942, 4.696522875],
[5.745051997, 3.533989803],
[9.172168622, 2.511101045],
[7.792783481, 3.424088941],
[7.939820817, 0.791637231]
]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
print(X_train)
'''
[[3.39353321 2.33127338]
[3.11007348 1.78153964]
[1.34380883 3.36836095]
[3.58229404 4.67917911]
[2.28036244 2.86699026]
[7.42343694 4.69652288]
[5.745052 3.5339898 ]
[9.17216862 2.51110105]
[7.79278348 3.42408894]
[7.93982082 0.79163723]]
'''
print(y_train) # [0 0 0 0 0 1 1 1 1 1]

plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.show()

x = np.array([8.093607318, 3.365731514])
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.scatter(x[0], x[1], color='b')
plt.show()

# knn的过程
from math import sqrt
distances = []
for x_train in X_train:
    d = sqrt(np.sum((x_train - x)**2))
    distances.append(d)
print(distances)
distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
print(distances)
# [4.812566907609877, 5.229270827235305, 6.749798999160064, 4.6986266144110695, 5.83460014556857,
# 1.4900114024329525, 2.354574897431513, 1.3761132675144652, 0.3064319992975, 2.5786840957478887]
nearest = np.argsort(distances)
print(nearest) # [8 7 5 6 9 3 0 1 4 2]

k = 6
topK_y = [y_train[i] for i in nearest[:k]] # 取nearest前6个元素
print(topK_y) # [1, 1, 1, 1, 1, 0]
# 统计
from collections import Counter
votes = Counter(topK_y) # 求不同元素的票数
print(votes) # Counter({1: 5, 0: 1})
print(votes.most_common(1)) # 取票数最高的一个元素(列表) # [(1, 5)]
print(votes.most_common(1)[0][0]) # 第一个列表元素(元组)的第一个数据 # 1
predict_y = votes.most_common(1)[0][0]
print(predict_y) # 1

4-2 scikit-learn中的机器学习算法封装

kNN算法封装

Ch4_kNN.py

import numpy as np
from math import sqrt
from collections import Counter

def kNN_classify(k, X_train, y_train, x):
    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"

    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    nearest = np.argsort(distances)
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]

main

import numpy as np
import matplotlib.pyplot as plt

raw_data_X = [
[3.393533211, 2.331273381],
[3.110073483, 1.781539638],
[1.343808831, 3.368360954],
[3.582294042, 4.679179110],
[2.280362439, 2.866990263],
[7.423436942, 4.696522875],
[5.745051997, 3.533989803],
[9.172168622, 2.511101045],
[7.792783481, 3.424088941],
[7.939820817, 0.791637231]
]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

x = np.array([8.093607318, 3.365731514])

%run liuyubobobo/Ch4_kNN.py
predict_y = kNN_classify(6, X_train, y_train, x)
print(predict_y)
image image

调用sklearn的kNN

# 使用scikit-learn中的kNN
from sklearn.neighbors import KNeighborsClassifier
kNN_classifier = KNeighborsClassifier(n_neighbors=6)
print(kNN_classifier.fit(X_train, y_train)) # 拟合
# kNN_classifier.predict(x) # 直接传一维向量的用法已被废弃,predict需要二维矩阵
X_predict = x.reshape(1, -1)
print(X_predict)
y_predict = kNN_classifier.predict(X_predict)
print(y_predict)
print(y_predict[0])

根据sklearn重新封装kNN

Ch4_kNN2.py

import numpy as np
from math import sqrt
from collections import Counter

class KNNClassifier:

    def __init__(self, k):
        '''初始化kNN分类器'''
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        '''根据训练数据集X_train和y_train训练kNN分类器'''
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        '''给定待预测数据集X_predict,返回表示X_predict的结果向量'''
        assert self._X_train is not None and self._y_train is not None, \
            'must fit before predict!'
        assert X_predict.shape[1] == self._X_train.shape[1], \
            'the feature number of X_predict must be equal to X_train'
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        '''给定单个待预测数据x,返回x的预测结果值'''
        assert x.shape[0] == self._X_train.shape[1], \
            'the feature number of x must be equal to X_train'
        distances = [sqrt(np.sum((x_train - x)**2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return 'KNN(k=%d)' % self.k

main

# 重新整理我们的kNN代码
%run liuyubobobo/Ch4_kNN2.py
knn_clf = KNNClassifier(k=6)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_predict)
print(y_predict)
print(y_predict[0])

4-3 训练数据集,测试数据集

image

model_selection.py

  • 将数据集切分成训练数据和测试数据两部分
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    '''将数据X和y按照test_ratio分割成X_train, X_test, y_train, y_test'''
    assert X.shape[0] == y.shape[0], \
        'the size of X must be equal to the size of y'
    assert 0.0 <= test_ratio <= 1.0, \
        'test_ratio must be valid'

    if seed:
        np.random.seed(seed)

    shuffle_indexes = np.random.permutation(len(X))  # 索引乱序排列
    test_size = int(len(X) * test_ratio)  # 默认test_ratio=0.2:80%训练数据,20%测试数据
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    X_train = X[train_indexes]
    y_train = y[train_indexes]
    X_test = X[test_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test

main

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X.shape, y.shape)

# 使用我们的算法
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

from playML.kNN import KNNClassifier
my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
print(y_predict)
print(sum(y_predict == y_test))
print(sum(y_predict == y_test) / len(y_test))

# sklearn中的train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

4-4 分类准确度

本节中将加载手写数字的数据集,拆分成训练集(80%)和测试集(20%),使用自定义kNN算法和sklearn的kNN算法测试准确度。

metrics.py

  • 封装测试准确度的算法
def accuracy_score(y_true, y_predict):
    '''计算y_true和y_predict之间的准确率'''
    assert y_true.shape[0] == y_predict.shape[0], \
        'the size of y_true must be equal to the size of y_predict'
    return sum(y_true == y_predict) / len(y_true)

修改kNN.py:新增score方法

import numpy as np
from math import sqrt
from collections import Counter
from .metrics import accuracy_score

class KNNClassifier:
    ......
    def score(self, X_test, y_test):
        '''根据测试数据集 X_test 和 y_test 确定当前模型的准确度'''
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

main

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits() # 加载内置的手写数字数据集
print(digits.keys())
print(digits.DESCR)
X = digits.data
print(X.shape) # (1797, 64)
y = digits.target
print(y)
print(digits.target_names)
print(y[:100]) # 查看前100个标签
some_digit = X[666]
print(y[666])
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary) # 显示图片
plt.show()

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio = 0.2)
from playML.kNN import KNNClassifier
my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
print(sum(y_predict == y_test) / len(y_test))

from playML.metrics import accuracy_score
print(accuracy_score(y_test, y_predict))

print(my_knn_clf.score(X_test, y_test)) # 0.9860724233983287

# scikit-learn中的accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_predict))
print(knn_clf.score(X_test, y_test)) # 0.9916666666666667

4-5 超参数

image image
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits() # 加载内置的手写数字数据集
X = digits.data
y = digits.target

# scikit-learn中的accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test, y_test))

# 寻找最好的k
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k=", best_k)
print("best_score=", best_score)
image
# 考虑距离?
best_method = ""
best_score = 0.0
best_k = -1
for method in ['uniform', 'distance']:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print('best_method=', best_method)
print("best_k=", best_k)
print("best_score=", best_score)
image image

↑ 曼哈顿距离:各维度差值的绝对值之和

↑ 红蓝黄都是曼哈顿距离,绿色是欧拉距离

image

p = 1:曼哈顿距离

p = 2:欧拉距离
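明可夫斯基距离的一般形式(标准定义,p 即 sklearn 中 KNeighborsClassifier 的超参数 p):

$$d(x^{(a)}, x^{(b)}) = \left(\sum_{i=1}^{n}\left|x_i^{(a)} - x_i^{(b)}\right|^p\right)^{\frac{1}{p}}$$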

image
# 探索明可夫斯基距离相应的p
best_p = -1
best_score = 0.0
best_k = -1
for p in range(1, 6):
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_p = p
            best_k = k
            best_score = score
print('best_p=', best_p)
print("best_k=", best_k)
print("best_score=", best_score)

4-6 网格搜索与k近邻算法中更多超参数

import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits() # 加载内置的手写数字数据集
X = digits.data
y = digits.target

# scikit-learn中的accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier

# Grid Search
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid) # 传入knn分类器和网格参数
# %%time
grid_search.fit(X_train, y_train) # 相对比较慢
print(grid_search.best_estimator_) # 计算机计算的结果使用'单词+_'组合
print(grid_search.best_score_)
print(grid_search.best_params_)
knn_clf = grid_search.best_estimator_
print(knn_clf.score(X_test, y_test))
# n_jobs:使用CPU多少核
# verbose:输出信息,越大越详细,一般为2
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
image

4-7 数据归一化

image image

最值归一化

image

↑ 适用场景:考试成绩(0-100),像素颜色(0-255)
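最值归一化公式(与下方代码一致):

$$x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}$$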

均值方差归一化(推荐)

  • 注意:公式的分母是标准差。课程中习惯称其为“均值方差归一化”,实际做法是减去均值、再除以标准差(Standardization)
image
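均值方差归一化公式(与下方代码一致,分母 S 为标准差):

$$x_{scale} = \frac{x - \bar{x}}{S}$$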
import numpy as np
import matplotlib.pyplot as plt

# 最值归一化 Normalization
# 向量
x = np.random.randint(0, 100, size=100)
print((x - np.min(x)) / (np.max(x) - np.min(x)))
X = np.random.randint(0, 100, (50, 2))

# 矩阵
X = np.array(X, dtype=float)
# 第一列特征最值归一化
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
print(X[:10, :])
plt.scatter(X[:, 0], X[:, 1])
plt.show()

print(np.mean(X[:, 0])) # 0.5364210526315789
print(np.std(X[:, 0])) # 0.286957176339313
print(np.mean(X[:, 1])) # 0.5195833333333334
print(np.std(X[:, 1])) # 0.298706993650225

# 均值方差归一化 Standardization
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype = float)
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])
plt.scatter(X2[:, 0], X2[:, 1])
plt.show()

print(np.mean(X2[:, 0])) # 9.992007221626409e-17
print(np.std(X2[:, 0])) # 1.0
print(np.mean(X2[:, 1])) # 1.4488410471358292e-16
print(np.std(X2[:, 1])) # 1.0

4-8 scikit-learn中的Scaler

image image image image
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X[:10, :])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# scikit-learn中的StandardScaler
from sklearn.preprocessing import StandardScaler # 使用sklearn中的StandardScaler
# from preprocessing import StandardScaler # 自定义StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train) # 计算得到关键信息
print(standardScaler.mean_) # 只读变量使用'单词_'命名
# print(standardScaler.std_) # 弃用
print(standardScaler.scale_)
X_train = standardScaler.transform(X_train) # 返回归一化后结果
X_test_standard = standardScaler.transform(X_test)

# 使用归一化后的数据,利用kNN分类
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test_standard, y_test)) # 1.0
print(knn_clf.score(X_test, y_test)) # 0.3333333333333333(错误用法:模型用归一化数据训练,测试数据也必须先归一化)

4-9 更多有关k近邻算法的思考

image image

↑ 使用K近邻算法解决回归问题-文档:http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

image

↑ KD-Tree:https://www.bilibili.com/video/BV1d5411w7f5

image image image

第5章 线性回归法

5-1 简单线性回归

image image image image image image image image

5-2 最小二乘法

image

b的推导

image image

a的推导

image image image

最终

image
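最小二乘法的最终结论(标准结果,5-3 节代码按此实现):

$$a = \frac{\sum_{i=1}^{m}(x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{m}(x^{(i)} - \bar{x})^2}, \qquad b = \bar{y} - a\bar{x}$$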

5-3 简单线性回归的实现

python zip()函数详解:https://blog.csdn.net/weixin_47906106/article/details/121702241

main

import numpy as np
import matplotlib.pyplot as plt
x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()

x_mean = np.mean(x)
y_mean = np.mean(y)
num = 0.0 # 分子
d = 0.0 # 分母
for x_i, y_i in zip(x, y): # 分别从x,y中各取一个值
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2
a = num / d
b = y_mean - a * x_mean
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, color = 'red')
plt.axis([0, 6, 0, 6])
plt.show()

x_predict = 6
y_predict = a * x_predict + b
print(y_predict) # 5.2

# 使用自己的SimpleLinearRegression
from playML.SimpleLinearRegression import SimpleLinearRegression1
reg1 = SimpleLinearRegression1()
reg1.fit(x, y)
print(reg1.predict(np.array([x_predict]))) # [5.2]
print(reg1.a_, reg1.b_) # 0.8 0.39999999999999947
y_hat1 = reg1.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat1, color='r')
plt.axis([0, 6, 0, 6])
plt.show()

SimpleLinearRegression1.py

# 文件: SimpleLinearRegression1
# 作者: 聪头
# 时间: 2022/6/26 12:03
# 描述: 只处理一维特征
import numpy as np
class SimpleLinearRegression1:

    def __init__(self):
        '''初始化Simple Linear Regression 模型'''
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        '''根据训练数据集x_train, y_train训练Simple Linear Regression模型'''
        assert x_train.ndim == 1, \
            'Simple Linear Regressor can only solve single feature training data.'
        assert len(x_train) == len(y_train), \
            'the size of x_train must be equal to the size of y_train'

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = 0.0
        d = 0.0
        for x, y in zip(x_train, y_train):
            num += (x - x_mean) * (y - y_mean)
            d += (x - x_mean) ** 2
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        '''给定待预测数据集x_predict, 返回表示x_predict的结果向量'''
        assert x_predict.ndim == 1, \
            'Simple Linear Regressor can only solve single feature training data.'
        assert self.a_ is not None and self.b_ is not None, \
            'must fit before predict!'
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        '''给定单个待预测数据x_single,返回x_single的预测结果值'''
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"

5-4 向量化

image
# 优化:向量运算
class SimpleLinearRegression2:
    ......
    def fit(self, x_train, y_train):
        ......
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # 将for循环改为dot(向量化)
        num = (x_train - x_mean).dot(y_train - y_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self
    ......
    def __repr__(self):
        return "SimpleLinearRegression2()"

性能测试

image

5-5 衡量线性回归法的指标:MSE,RMSE和MAE

image image image image

RMSE vs MAE

  • RMSE中将误差平方,放大了误差,故RMSE的结果通常大于MAE
  • 实际使用中,应尽可能让RMSE更小
image
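三个指标的标准定义(m 为测试样本数,$\hat{y}$ 为预测值):

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2, \qquad RMSE = \sqrt{MSE}, \qquad MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y^{(i)} - \hat{y}^{(i)}\right|$$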

5-6 最好的衡量线性回归法的指标:R Squared

image image

我们模型预测产生的错误由于考虑了x,y之间的关系,因此会比Baseline Model小

  • Baseline Model(基准模型)中不考虑x和y的关系,任何x输入都使用y均值输出
image image
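R方的标准定义(与下方 r2_score 的计算一致):

$$R^2 = 1 - \frac{\sum_{i}\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{\sum_{i}\left(\bar{y} - y^{(i)}\right)^2} = 1 - \frac{MSE(\hat{y}, y)}{Var(y)}$$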
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
x = boston.data[:, 5] # 只使用房间数量这个特征 RM
y = boston.target

# 使用简单线性回归算法
from playML.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
from playML.SimpleLinearRegression import SimpleLinearRegression2
reg = SimpleLinearRegression2()
reg.fit(x_train, y_train)

# 预测
y_predict = reg.predict(x_test)

# 使用自己封装的r2_score
from playML.metrics import r2_score
print(r2_score(y_test, y_predict)) # 0.40277850092929524

# 使用sklearn提供的r2_score
from sklearn.metrics import r2_score
print(r2_score(y_test, y_predict)) # 0.4027785009292951

5-7 多元线性回归和正规方程解

image image image image image

已知:y(样本对应标签)和Xb(第0列全为1,由样本特征值组成的矩阵)

1.求导

2.求极值(导数=0)

  • 涉及矩阵求导(超纲)
image

推导过程略,本质就是对θ每一个分量求偏导,令其为0得到最终结果
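正规方程解的最终形式(标准结论,下一节的 fit_normal 按此实现):

$$\theta = (X_b^T X_b)^{-1} X_b^T y$$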

image

↑ θ只是X各列的系数,没有量纲问题

5-8 实现多元线性回归

image
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
print(X.shape) # (490, 13)

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
print(reg.score(X_test,y_test)) # 0.8129794056212907
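playML 中 fit_normal 的一个最小实现草稿(按上面的正规方程实现,接口仿照课程中的 playML.LinearRegression,仅为示意,以课程源码为准):

```python
import numpy as np

class LinearRegression:

    def __init__(self):
        self.coef_ = None        # 各特征的系数
        self.intercept_ = None   # 截距
        self._theta = None

    def fit_normal(self, X_train, y_train):
        '''正规方程解:theta = (Xb^T Xb)^(-1) Xb^T y'''
        assert X_train.shape[0] == y_train.shape[0], \
            'the size of X_train must be equal to the size of y_train'
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])  # 左侧追加全1列
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        '''返回R方'''
        y_predict = self.predict(X_test)
        return 1 - np.sum((y_test - y_predict) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)
```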

5-9 使用scikit-learn解决回归问题

从5-9开始:代码直接摘自老师提供的源码

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
print(X.shape) # (490, 13)

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_reg.score(X_test, y_test)

# kNN Regressor(注:kNN基于距离,需先归一化得到 X_train_standard / X_test_standard)
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train_standard, y_train)
knn_reg.score(X_test_standard, y_test)

# 优化:kNN网格搜索优化超参数
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train_standard, y_train)
grid_search.best_params_

grid_search.best_score_

grid_search.best_estimator_.score(X_test_standard, y_test) # 该值才是真实的score

5-10 线性回归的可解释性和更多思考

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

lin_reg.coef_

np.argsort(lin_reg.coef_)

boston.feature_names[np.argsort(lin_reg.coef_)]

print(boston.DESCR)
image image image

第6章 梯度下降法

6-1 什么是梯度下降法

直观理解就是滚球,每次滚一段区域直到最低点

image image

↑ 如图导数为负:代表沿x轴正方向移动时J值减小;更新式 θ = θ − η·dJ/dθ 中导数前乘上负号,θ恰好朝着J减小的方向移动

  • 具体步骤:每次得到一个theta,求其导数,按 θ = θ − η·(dJ/dθ) 移动theta,反复迭代,直到导数趋近于0(即到达极小值点附近)
image

可能遇到的问题

image image image image

6-2 模拟实现梯度下降法

梯度下降核心算法

  • 实现思路:任取一个θ,对其求导,并使用eta配合导数值对其偏移;偏移前后J值之差的绝对值小于epsilon时,视为找到极值点
import numpy as np
import matplotlib.pyplot as plt

plot_x = np.linspace(-1., 6., 141)
plot_y = (plot_x-2.5)**2 - 1.
plt.plot(plot_x, plot_y)
plt.show()

epsilon = 1e-8 # 精度误差
eta = 0.1 # 学习率

def J(theta):
    return (theta-2.5)**2 - 1.

def dJ(theta):
    return 2*(theta-2.5)

theta = 0.0
while True:
    gradient = dJ(theta)
    last_theta = theta
    theta = theta - eta * gradient

    if abs(J(theta) - J(last_theta)) < epsilon:
        break
print(theta)
print(J(theta))

优化

# 新增异常判断,避免溢出
def J(theta):
    try:
        return (theta-2.5)**2 - 1.
    except:
        return float('inf')

# 限制迭代次数,避免死循环
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    theta_history.append(initial_theta)

    while i_iter < n_iters:
        gradient = dJ(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if abs(J(theta) - J(last_theta)) < epsilon:
            break

        i_iter += 1
    return

# 绘制梯度下降图示
def plot_theta_history():
    plt.plot(plot_x, J(plot_x))
    plt.plot(np.array(theta_history), J(np.array(theta_history)), color="r", marker='+')
    plt.show()

6-3 线性回归中的梯度下降法

image

Xb:由样本集合X追加全为1的第一列构成的矩阵

θ:由θ0 - θn组成的向量

二者dot结果就是各分量对应相乘再相加

image

去掉M带来的影响(不然样本越多,梯度越大)

只看一个样本,相当于去掉求和符号,此时相当于每一个维度求一次偏导得到该样本的梯度

梯度下降法相当于对所有样本进行一次操作,之后求平均得到总的平均梯度
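用公式概括,线性回归损失函数的梯度为(6-5 节的向量化代码即按此实现):

$$\nabla J(\theta) = \frac{2}{m} X_b^T (X_b\theta - y)$$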

image

6-4 实现线性回归中的梯度下降法

 import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3. + 4. + np.random.normal(size=100)
X = x.reshape(-1, 1)
print(X.shape)
print(y.shape)

plt.scatter(x, y)
plt.show()


def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(X_b)
    except:
        return float('inf')

def dJ(theta, X_b, y):
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    return res * 2 / len(X_b)

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0

    while i_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient

        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break

        i_iter += 1
    return theta

X_b = np.hstack([np.ones((len(x), 1)), x.reshape(-1, 1)])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

theta = gradient_descent(X_b, y, initial_theta, eta)
print(theta) # [4.02145786 3.00706277]

# 使用封装了梯度下降的线性回归算法
from playML.LinearRegression import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit_gd(X, y)
print(lin_reg.coef_) # [3.00706277]
print(lin_reg.intercept_) # 4.021457858204859

修改LinearRegression

  • 新增fit_gd方法
def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
    """根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型"""
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"

    def J(theta, X_b, y):
        try:
            return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        res = np.empty(len(theta))
        res[0] = np.sum(X_b.dot(theta) - y)
        for i in range(1, len(theta)):
            res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
        return res * 2 / len(X_b)

    def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
        theta = initial_theta
        cur_iter = 0

        while cur_iter < n_iters:
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            theta = theta - eta * gradient
            if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                break

            cur_iter += 1

        return theta

    X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
    initial_theta = np.zeros(X_b.shape[1])
    self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

    self.intercept_ = self._theta[0]
    self.coef_ = self._theta[1:]

    return self

6-5 梯度下降法的向量化和数据标准化

image image image
def dJ(theta, X_b, y):
    # res = np.empty(len(theta))
    # res[0] = np.sum(X_b.dot(theta) - y)
    # for i in range(1, len(theta)):
    #     res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    # return res * 2 / len(X_b)
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(X_b)

测试使用梯度下降法

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# 波士顿房产数据
boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
print(X.shape) # (490, 13)

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
print(reg.score(X_test,y_test)) # 0.8129794056212907

# 数据没有归一化,导致模型不收敛(overflow)
# lin_reg2 = LinearRegression()
# lin_reg2.fit_gd(X_train, y_train)
# print(lin_reg2.fit_gd(X_test, y_test))

# scikit-learn中的StandardScaler
from sklearn.preprocessing import StandardScaler # 使用sklearn中的StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train) # 计算得到关键信息
X_train_standard = standardScaler.transform(X_train) # 返回归一化后结果
X_test_standard = standardScaler.transform(X_test)

lin_reg3 = LinearRegression()
lin_reg3.fit_gd(X_train_standard, y_train, eta=0.0001, n_iters=1e4)
print(lin_reg3.score(X_test_standard, y_test))
image

6-6 随机梯度下降法

image

学习率应该随着循环次数的增加逐渐递减

image
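课程采用模拟退火思路的学习率衰减(与下方 sgd 代码中的 learning_rate 一致,t0=5、t1=50 为经验值):

$$\eta = \frac{t_0}{i\_iters + t_1}$$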
import numpy as np
import matplotlib.pyplot as plt

m = 100000

x = np.random.normal(size=m)
X = x.reshape(-1,1)
y = 4.*x + 3. + np.random.normal(0, 3, size=m)
plt.scatter(x, y)
plt.show()


def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
    except:
        return float('inf')


def dJ(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)


def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break

        cur_iter += 1

    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, initial_theta, eta)

print(theta) # [3.00538344 3.98886975]

def dJ_sgd(theta, X_b_i, y_i):
    return 2 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)

def sgd(X_b, y, initial_theta, n_iters):

    t0, t1 = 5, 50
    def learning_rate(t):
        return t0 / (t + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))  # 随机取一个样本
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])  # 计算该样本的梯度
        theta = theta - learning_rate(cur_iter) * gradient  # 移动theta

    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=m//3)
print(theta) # array([2.93222467, 4.02957069])

6-7 scikit-learn中的随机梯度下降法

from sklearn import datasets

boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

from playML.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter_no_change=50)
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test) # 0.8124437321129272

6-8 如何确定梯度计算的准确性?调试梯度下降法

image image image

↑ 实际使用中,梯度的调试通常很慢,可用于前期测试
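调试梯度的思路是对每个维度用对称差分近似偏导(与下方 dJ_debug 一致,$e_i$ 为第 i 个维度的单位向量):

$$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}$$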

import numpy as np
import matplotlib.pyplot as plt

# 构造本节测试数据
np.random.seed(666)
X = np.random.random(size=(1000, 10))

true_theta = np.arange(1, 12, dtype=float) # 1(截距) + 10(特征) = 11个
X_b = np.hstack([np.ones((len(X), 1)), X])
y = X_b.dot(true_theta) + np.random.normal(size=1000)

print(X.shape)
print(y.shape)
print(true_theta)

def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(X_b)
    except:
        return float('inf')

def dJ_math(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res


def gradient_descent(dJ, X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        cur_iter += 1
    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

%time theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta)
theta

%time theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta)
theta

6-9 有关梯度下降法的更多深入讨论

批量梯度下降法:每次求梯度,需要把所有样本数据看一遍,优点是稳定

随机梯度下降法:每次求梯度只随机取一个样本的梯度,优点是快

小批量梯度下降法:每次随机看k个样本

image image image
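课程没有给出小批量梯度下降的实现,下面是一个极简草稿(假设沿用 6-5 节向量化的梯度公式和 6-6 节的学习率衰减;k、t0、t1 均为示例取值):

```python
import numpy as np

def dJ_mini_batch(theta, X_b_batch, y_batch):
    # 只用一小批样本估计梯度
    return X_b_batch.T.dot(X_b_batch.dot(theta) - y_batch) * 2. / len(y_batch)

def mini_batch_gd(X_b, y, initial_theta, k=16, n_iters=10000, t0=5, t1=50):
    theta = initial_theta
    for cur_iter in range(n_iters):
        batch = np.random.randint(0, len(X_b), size=k)  # 随机取k个样本的索引
        gradient = dJ_mini_batch(theta, X_b[batch], y[batch])
        theta = theta - t0 / (cur_iter + t1) * gradient  # 学习率随迭代递减
    return theta
```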

第7章 PCA与梯度上升法

7-1 什么是PCA

image image image image image

步骤1:demean

image image

步骤2:求方差最大值

image

由于X均值为0,故化简为如下式子

image image image
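目标函数用公式表示为(demean 之后均值为 0,与 7-3 节代码中的 f(w, X) 一致):

$$f(w) = \frac{1}{m}\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)^2$$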

不同于线性回归:

  • 线性回归是确保预测值和真值之间最小
  • PCA是确保各元素间差值最大(方差最大)
image

7-2 使用梯度上升法求解PCA问题

把w视为自变量,每次求导后都往极值点方向走一些(w记得每次要单位化)

image image image

X 为 m×n 矩阵,w 为 n×1 向量,则 Xw 为 m×1:

  • $(Xw)^T X$:(1×m)·(m×n) = 1×n,是行向量
  • $((Xw)^T X)^T = X^T(Xw)$:n×1,即我们想要的梯度列向量

故 $\nabla f(w) = \frac{2}{m} X^T (Xw)$,对应 7-3 节代码中的 df_math。

image

7-3 求数据的主成分PCA

import numpy as np
import matplotlib.pyplot as plt
X = np.empty((100, 2))
X[:,0] = np.random.uniform(0., 100., size=100)
X[:,1] = 0.75 * X[:,0] + 3. + np.random.normal(0, 10., size=100)
plt.scatter(X[:,0], X[:,1])
plt.show()

def demean(X):
    return X - np.mean(X, axis=0)  # np.mean(X, axis=0):沿行方向求均值,即每列的均值

X_demean = demean(X)
plt.scatter(X_demean[:,0], X_demean[:,1])
plt.show()


# 梯度上升法
def f(w, X):
    return np.sum((X.dot(w) ** 2)) / len(X)


def df_math(w, X):
    return X.T.dot(X.dot(w)) * 2. / len(X)


def df_debug(w, X, epsilon=0.0001):
    res = np.empty(len(w))
    for i in range(len(w)):
        w_1 = w.copy()
        w_1[i] += epsilon
        w_2 = w.copy()
        w_2[i] -= epsilon
        res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon)
    return res


def direction(w):
    return w / np.linalg.norm(w)


def gradient_ascent(df, X, initial_w, eta, n_iters=1e4, epsilon=1e-8):
    w = direction(initial_w)
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = df(w, X)
        last_w = w
        w = w + eta * gradient
        w = direction(w)  # 注意1:每次求一个单位方向
        if abs(f(w, X) - f(last_w, X)) < epsilon:
            break

        cur_iter += 1

    return w


initial_w = np.random.random(X.shape[1]) # 注意2:不能用0向量开始
eta = 0.001
gradient_ascent(df_math, X_demean, initial_w, eta)

w = gradient_ascent(df_math, X_demean, initial_w, eta)

plt.scatter(X_demean[:,0], X_demean[:,1])
plt.plot([0, w[0]*30], [0, w[1]*30], color='r') # 表示(0,0)和(w[0] * 30, w[1] * 30)两个点构成的直线
plt.show()

# 使用极端数据集测试
X2 = np.empty((100, 2))
X2[:,0] = np.random.uniform(0., 100., size=100)
X2[:,1] = 0.75 * X2[:,0] + 3.
plt.scatter(X2[:,0], X2[:,1])
plt.show()

X2_demean = demean(X2)
w2 = gradient_ascent(df_math, X2_demean, initial_w, eta)
plt.scatter(X2_demean[:,0], X2_demean[:,1])
plt.plot([0, w2[0]*30], [0, w2[1]*30], color='r')
plt.show()

7-4 求数据的前n个主成分

image
import numpy as np
import matplotlib.pyplot as plt

X = np.empty((100, 2))
X[:,0] = np.random.uniform(0., 100., size=100)
X[:,1] = 0.75 * X[:,0] + 3. + np.random.normal(0, 10., size=100)

def demean(X):
    return X - np.mean(X, axis=0)

X = demean(X)
plt.scatter(X[:,0], X[:,1])
plt.show()


def f(w, X):
    return np.sum((X.dot(w) ** 2)) / len(X)


def df(w, X):
    return X.T.dot(X.dot(w)) * 2. / len(X)


def direction(w):
    return w / np.linalg.norm(w)


def first_component(X, initial_w, eta, n_iters=1e4, epsilon=1e-8):
    w = direction(initial_w)
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = df(w, X)
        last_w = w
        w = w + eta * gradient
        w = direction(w)
        if abs(f(w, X) - f(last_w, X)) < epsilon:
            break

        cur_iter += 1

    return w


initial_w = np.random.random(X.shape[1])
eta = 0.01
w = first_component(X, initial_w, eta)
w

# Second component, loop version: subtract each sample's projection onto w
X2 = np.empty(X.shape)
for i in range(len(X)):
    X2[i] = X[i] - X[i].dot(w) * w

plt.scatter(X2[:, 0], X2[:, 1])
plt.show()

# Second component, vectorized version
X2 = X - X.dot(w).reshape(-1, 1) * w
plt.scatter(X2[:,0], X2[:,1])
plt.show()

# The first and second components are orthogonal: their dot product is ~0
w2 = first_component(X2, initial_w, eta)
w2
w.dot(w2)

def first_n_components(n, X, eta=0.01, n_iters=1e4, epsilon=1e-8):
    X_pca = X.copy()
    X_pca = demean(X_pca)
    res = []
    for i in range(n):
        initial_w = np.random.random(X_pca.shape[1])
        w = first_component(X_pca, initial_w, eta)
        res.append(w)

        X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w

    return res

first_n_components(2, X)  # [array([0.75772863, 0.65256978]), array([ 0.65257463, -0.75772445])]

7-5 Mapping High-Dimensional Data to Low Dimensions

k: the number of leading principal components kept (k < n)

n: the number of features

m: the number of samples

image image

Mapping back from low dimension to high dimension does not recover the original data exactly; see the formulas below
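
In matrix form (standard for this method; W_k is the k×n matrix whose rows are the components, i.e. components_ in the class below):

$$
X_k = X \cdot W_k^T \ (m \times k), \qquad X_{\text{restore}} = X_k \cdot W_k \ (m \times n)
$$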

import numpy as np


class PCA:

    def __init__(self, n_components):
        """Initialize PCA"""
        assert n_components >= 1, "n_components must be valid"
        self.n_components = n_components
        self.components_ = None

    def fit(self, X, eta=0.01, n_iters=1e4):
        """Find the first n_components principal components of the dataset X"""
        assert self.n_components <= X.shape[1], \
            "n_components must not be greater than the feature number of X"

        def demean(X):
            return X - np.mean(X, axis=0)

        def f(w, X):
            return np.sum((X.dot(w) ** 2)) / len(X)

        def df(w, X):
            return X.T.dot(X.dot(w)) * 2. / len(X)

        def direction(w):
            return w / np.linalg.norm(w)

        def first_component(X, initial_w, eta=0.01, n_iters=1e4, epsilon=1e-8):

            w = direction(initial_w)
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = df(w, X)
                last_w = w
                w = w + eta * gradient
                w = direction(w)
                if abs(f(w, X) - f(last_w, X)) < epsilon:
                    break

                cur_iter += 1

            return w

        X_pca = demean(X)
        self.components_ = np.empty(shape=(self.n_components, X.shape[1]))
        for i in range(self.n_components):
            initial_w = np.random.random(X_pca.shape[1])
            w = first_component(X_pca, initial_w, eta, n_iters)
            self.components_[i,:] = w

            X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w

        return self

    def transform(self, X):
        """Map the given X onto the principal components"""
        assert X.shape[1] == self.components_.shape[1]

        return X.dot(self.components_.T)

    def inverse_transform(self, X):
        """Map the given X back into the original feature space"""
        assert X.shape[1] == self.components_.shape[0]

        return X.dot(self.components_)

    def __repr__(self):
        return "PCA(n_components=%d)" % self.n_components

Reducing the dimension, then mapping back up

image

Test

import numpy as np
import matplotlib.pyplot as plt

X = np.empty((100, 2))
X[:,0] = np.random.uniform(0., 100., size=100)
X[:,1] = 0.75 * X[:,0] + 3. + np.random.normal(0, 10., size=100)

from playML.PCA import PCA

pca = PCA(n_components=2)
pca.fit(X)  # PCA(n_components=2)
pca.components_
"""
array([[ 0.76676948, 0.64192256],
       [-0.64191827, 0.76677307]])
"""

pca = PCA(n_components=1)
pca.fit(X)  # PCA(n_components=1)
X_reduction = pca.transform(X)
X_reduction.shape  # (100, 1)

X_restore = pca.inverse_transform(X_reduction)
X_restore.shape  # (100, 2)

plt.scatter(X[:,0], X[:,1], color='b', alpha=0.5)
plt.scatter(X_restore[:,0], X_restore[:,1], color='r', alpha=0.5)
plt.show()

7-6 PCA in scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
X_train.shape  # (1347, 64)

%%time
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)  # Wall time: 67.4 ms
knn_clf.score(X_test, y_test)  # 0.9866666666666667

# After dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # reduce to 2 dimensions
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

%%time
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_reduction, y_train)  # Wall time: 1.2 ms
knn_clf.score(X_test_reduction, y_test)  # 0.6066666666666667

# Variance explained by the principal components
pca.explained_variance_ratio_  # array([0.14566817, 0.13735469])

from sklearn.decomposition import PCA

pca = PCA(n_components=X_train.shape[1])
pca.fit(X_train)
pca.explained_variance_ratio_  # variance explained by each component (decreasing as the component index grows)

plt.plot([i for i in range(X_train.shape[1])],
         [np.sum(pca.explained_variance_ratio_[:i]) for i in range(X_train.shape[1])])
plt.show()

pca = PCA(0.95)  # pick n_components so that 95% of the variance is kept
pca.fit(X_train)
pca.n_components_  # 28
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

%%time
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_reduction, y_train)  # Wall time: 2.32 ms
knn_clf.score(X_test_reduction, y_test)  # 0.98

# Use PCA to reduce the data to 2D for visualization
pca = PCA(n_components=2)
pca.fit(X)
X_reduction = pca.transform(X)

for i in range(10):
    plt.scatter(X_reduction[y==i,0], X_reduction[y==i,1], alpha=0.8)
plt.show()

7-7 Trying the MNIST Dataset

image
import numpy as np

# from sklearn.datasets import fetch_mldata
# mnist = fetch_mldata('MNIST original')
# In recent versions of sklearn fetch_mldata is deprecated; use fetch_openml
# to get the MNIST dataset instead, as below. The rest of the code is unchanged.
# (Depending on your sklearn version this may return a DataFrame; passing
# as_frame=False forces plain numpy arrays.)

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784')

X, y = mnist['data'], mnist['target']
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)

X_train.shape  # (60000, 784)
X_test.shape  # (10000, 784)

# kNN on the raw data
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train, y_train)
%time knn_clf.score(X_test, y_test)  # 0.9688

# Reduce the dimensionality with PCA
from sklearn.decomposition import PCA

pca = PCA(0.90)
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)
X_train_reduction.shape  # (60000, 87)

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train_reduction, y_train)
%time knn_clf.score(X_test_reduction, y_test)  # 0.9728 — PCA also strips noise, so accuracy can even improve!

Chapter 8: Polynomial Regression and Model Generalization

8-1 What is Polynomial Regression

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/01-What-is-Polynomial-Regression

image
import numpy as np 
import matplotlib.pyplot as plt
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, 100)
plt.scatter(x, y)
plt.show()

Linear regression?

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(x, y_predict, color='r')
plt.show()

Solution: add a feature

  • Treating X^2 as just another linear-regression feature lets us reuse plain linear regression
X2 = np.hstack([X, X**2])  # add X**2 as a second feature column
X2.shape # (100, 2)

lin_reg2 = LinearRegression()
lin_reg2.fit(X2, y)
y_predict2 = lin_reg2.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r')
plt.show()

lin_reg2.coef_ # array([ 0.99870163, 0.54939125])
lin_reg2.intercept_ # 1.8855236786516001

8-2 Polynomial Regression and Pipeline in scikit-learn

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/02-Polynomial-Regression-in-scikit-learn

image
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, 100)

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly.fit(X)
X2 = poly.transform(X)
print(X2.shape)  # (100, 3): columns are 1, x, x^2

print(X[:5,:])
print(X2[:5,:])

from sklearn.linear_model import LinearRegression

lin_reg2 = LinearRegression()
lin_reg2.fit(X2, y)
y_predict2 = lin_reg2.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r')
plt.show()

print(lin_reg2.coef_)  # [0. 1.00753239 0.44398241]
print(lin_reg2.intercept_)  # 2.0793848909176553

# More on PolynomialFeatures
X = np.arange(1, 11).reshape(-1, 2)
print(X)
"""
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
"""
poly = PolynomialFeatures(degree=2)
poly.fit(X)
X2 = poly.transform(X)
print(X2.shape)  # (5, 6): columns are 1, x1, x2, x1^2, x1*x2, x2^2
print(X2)

# Pipeline: pass a list of (name, step) pairs, executed in order
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, 100)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

poly_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression())
])

poly_reg.fit(X, y)
y_predict = poly_reg.predict(X)

plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()

8-3 Overfitting and Underfitting

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/03-Overfitting-and-Underfitting

8-4 Why We Need a Train/Test Split

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/04-Why-Train-Test-Split

image image image image

8-5 Learning Curves

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/05-Learning-Curve

image

Underfitting: the error is large overall (on both training and test data)

image

Overfitting: the error on the test set is large

image

8-6 Validation Set and Cross-Validation

image image image
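
A minimal sketch of k-fold cross-validation in sklearn (the kNN model and cv=5 are illustrative choices, not from the notes):

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

knn_clf = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn_clf, X, y, cv=5)  # 5 folds: train on 4, validate on 1, rotate
scores.mean()  # average validation accuracy over the 5 folds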

8-7 The Bias-Variance Trade-off

image image image image image image image image image

8-8 Model Generalization and Ridge Regression

URL:https://git.imooc.com/coding-169/coding-169/src/master/08-Polynomial-Regression-and-Model-Generalization/08-Model-Regularization-and-Ridge-Regression

Model regularization, intuitively: an overfitted model has very large coefficients, so we add a penalty term that shrinks them, improving the model's ability to generalize (see the sketch below)

image image
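
A minimal sketch of ridge regression, reusing the polynomial pipeline pattern from 8-2 (the degree and alpha values here are illustrative):

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

ridge_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=20)),
    ("std_scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0))   # alpha scales the regularization term
])
ridge_reg.fit(X, y)  # X, y as in the polynomial regression examples above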

8-9 LASSO

image image image

With Ridge, every component of θ stays nonzero along the whole path, sliding smoothly toward 0 along the gradient

image

With LASSO, θ tends to drop to 0 one axis at a time (some coefficients become exactly zero)

image

8-10 L1, L2 and Elastic Net

image image image

L0 regularization: make the number of nonzero components of θ as small as possible

image image
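
For reference, the regularized objectives side by side (standard definitions; α is the regularization strength, r the elastic-net mixing ratio):

$$
\text{Ridge (L2):}\ J(\theta) = \text{MSE}(y, \hat y) + \alpha \sum_{i=1}^{n} \theta_i^2
\qquad
\text{LASSO (L1):}\ J(\theta) = \text{MSE}(y, \hat y) + \alpha \sum_{i=1}^{n} |\theta_i|
$$

$$
\text{Elastic Net:}\ J(\theta) = \text{MSE}(y, \hat y) + r\,\alpha \sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2
$$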

Chapter 9: Logistic Regression

Solves: classification

9-1 What is Logistic Regression

In one sentence: map values from (-∞, +∞) into (0, 1) and classify on the result (above 0.5 → class 1, below 0.5 → class 0)

image image image

The Sigmoid function

image image image
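
The function itself (standard definition, used throughout this chapter):

$$
\sigma(t) = \frac{1}{1 + e^{-t}}, \qquad \hat p = \sigma(\theta^T x_b), \qquad \hat y = \begin{cases} 1, & \hat p \ge 0.5 \\ 0, & \hat p < 0.5 \end{cases}
$$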

9-2 The Loss Function of Logistic Regression

image

When the argument X is positive, the usual curve-shifting rule holds (adding inside shifts left, subtracting shifts right); when X is negative the rule flips (this only applies to the final step, once the sign of X is settled)

image

Averaging the per-sample errors gives the loss function

image image image

X_b·θ: ranges over (-∞, +∞)

Prediction via Sigmoid: ranges over (0, 1)

J(θ): the loss function; the error is largest when the true value is 1 but the prediction is 0, and likewise when the true value is 0 but the prediction is 1
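
Written out, this is the standard cross-entropy form the chapter derives:

$$
\text{cost} = -y \log(\hat p) - (1 - y)\log(1 - \hat p), \qquad
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat p^{(i)} + (1 - y^{(i)}) \log (1 - \hat p^{(i)}) \right]
$$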

9-3 The Gradient of the Logistic Loss

image image

First half

image

Second half

image image

Combined

image image

This gives the derivative of the loss with respect to a single component θ_j; the gradient averages over all samples

ŷ (yhat): the logistic regression estimate (between 0 and 1)

image image image

Conclusion

image
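
The vectorized result (standard, and pleasingly analogous to the linear regression gradient):

$$
\nabla J(\theta) = \frac{1}{m} X_b^T \left( \sigma(X_b \theta) - y \right)
$$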

9-7 Logistic Regression in scikit-learn

image
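
A minimal sketch of what this section demonstrates, following the course's pipeline pattern (the degree and C values are illustrative; X_train/y_train stand for any two-class training set):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# C scales the data-fit term (like SVM's C): smaller C = stronger regularization
poly_log_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("log_reg", LogisticRegression(C=1.0, penalty="l2"))
])
poly_log_reg.fit(X_train, y_train)
poly_log_reg.score(X_test, y_test)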

9-8 OvR and OvO

An intuitive explanation: https://wenku.baidu.com/view/c13c613ac181e53a580216fc700abb68a982adc4.html

image

OvR is faster; OvO is more accurate

image image
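
sklearn exposes both strategies as generic wrappers; a minimal sketch (the wrapped classifier is an arbitrary choice):

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.linear_model import LogisticRegression

ovr = OneVsRestClassifier(LogisticRegression())  # n classes -> n binary problems
ovo = OneVsOneClassifier(LogisticRegression())   # n classes -> n*(n-1)/2 binary problems
ovr.fit(X_train, y_train)
ovo.fit(X_train, y_train)  # slower, usually a bit more accurate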

Chapter 10: Evaluating Classification Results

10-1 The Accuracy Trap and the Confusion Matrix

image image

Mnemonic: the second letter is whatever was predicted (P or N); if prediction and truth (the second and first letters) disagree it is F, if they agree it is T
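
Laid out as a 2×2 table (rows = true class, columns = predicted class, which is also the convention of sklearn's confusion_matrix):

| | Predicted 0 | Predicted 1 |
|---|---|---|
| True 0 | TN | FP |
| True 1 | FN | TP |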

10-2 Precision and Recall

Precision: among everything predicted positive, the fraction that is actually positive (the "subjective" view: how trustworthy the positive predictions are)

Recall: among everything actually positive, the fraction that was caught (the "objective" view: how much of the real positives were found)

image image image

On extremely skewed data accuracy is meaningless; look at precision and recall instead

image
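
In terms of the confusion matrix entries:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
$$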

10-4 F1 Score

image image image

Harmonic mean: F1 is large only when both precision and recall are large (if either one is small, F1 is small)

image image
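
The formula behind the figures:

$$
F1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$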

10-5 Balancing Precision and Recall

image

X axis: decision threshold

Y axis: precision and recall

image

X axis: precision

Y axis: recall

image

The larger the area under the curve, the better the model

image
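
A minimal sketch of drawing these curves with sklearn (log_reg stands for any fitted classifier exposing decision_function; the name is illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

decision_scores = log_reg.decision_function(X_test)
precisions, recalls, thresholds = precision_recall_curve(y_test, decision_scores)
# precisions and recalls have one element more than thresholds
plt.plot(thresholds, precisions[:-1])
plt.plot(thresholds, recalls[:-1])
plt.show()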

10-7 The ROC Curve

image

TPR is just recall

image image
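
The two rates plotted by the ROC curve:

$$
TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}
$$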

In video anomaly detection:

  • TPR: among all anomalous videos, the fraction predicted anomalous (want it high)
  • FPR: among all normal videos, the fraction predicted anomalous (want it low)
  • In the figure below, read the stars as anomalous videos and the circles as normal ones
image

Remember: the larger the area under the ROC curve, the better

image

Choosing between the PR curve and the ROC curve: https://coding.imooc.com/learn/questiondetail/42693.html

The core difference between the PR curve and the ROC curve is TN: the PR curve does not reflect TN at all. So if TN does not matter in your application, the PR curve is a good metric (in fact, precision and recall strip out TN precisely to remove the influence of extremely skewed data, magnifying the relationships among FP, FN and TP).

Chapter 11: Support Vector Machines (SVM)

Solves: classification and regression

11-1 What is SVM

image image

Soft Margin SVM can handle linearly inseparable data

11-2 The Optimization Problem behind SVM

image image image image

Rewrite w_d as w, and similarly for the remaining symbols

image image

A constrained optimization problem

image
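
In symbols, the hard-margin problem the figures arrive at (standard form):

$$
\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1, \; i = 1, \dots, m
$$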

11-3 Soft Margin SVM

Problems with Hard Margin SVM

It can be dominated by a few unusual points, hurting generalization

image

Linearly inseparable data

image image image

The larger C is, the smaller the tolerance for errors; the smaller C, the larger the tolerance

image image

In sklearn the hyperparameter C multiplies the error (slack) term rather than the ½‖w‖² regularization term, so a larger C means a lower tolerance for misclassification; see the objective below
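
The L1-regularized soft-margin objective (standard form; an L2 variant squares the ζ_i):

$$
\min_{w, b, \zeta} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \zeta_i \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1 - \zeta_i, \; \zeta_i \ge 0
$$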

11-4 SVM in scikit-learn

Why the data must be standardized

image image

11-6 到底什么是核函数

本节巨懵逼

怎么转换和转换后为何是这样都没有说清楚呃。。。

image

大致意思就是:将原样本数据带入K这个核函数即可完成某种特征的转换

image

大致意思:把x和y看作是两个向量,使用K函数将x,y映射到x’和y’。x’和y’形式一致,如下所示,从1开始,直到xn^2^

  • 不难发现,映射后的x’有0~2次项,相当于升维,但此法使用函数映射在运行中升维,降低存储空间
image image
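
For the degree-2 polynomial kernel discussed here, the standard identity is

$$
K(x, y) = (x^T y + 1)^2 = \phi(x)^T \phi(y), \qquad
\phi(x) = \left(1,\ \sqrt{2}\,x_1, \dots, \sqrt{2}\,x_n,\ x_1^2, \dots, x_n^2,\ \sqrt{2}\,x_1 x_2, \dots, \sqrt{2}\,x_{n-1} x_n\right)
$$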

11-7 The RBF Kernel

image image image

Example: fix two landmark points l1 and l2

image image

In actual use, every sample serves as a landmark

image
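
The Gaussian (RBF) kernel itself:

$$
K(x, y) = e^{-\gamma \|x - y\|^2}
$$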

11-8 The gamma Parameter of the RBF Kernel

image
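
A minimal sketch of tuning gamma, using the standardize-then-SVC pattern from 11-4 (the value 1.0 is illustrative). A larger gamma narrows the Gaussian around each landmark, so the decision boundary becomes more local and more prone to overfitting:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rbf_svc = Pipeline([
    ("std_scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", gamma=1.0))  # larger gamma -> more local boundary
])
rbf_svc.fit(X_train, y_train)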

11-9 SVM for Regression

Fit the margin so that as many points as possible fall inside it; the final prediction takes the average (the middle of the margin)

image

Chapter 12: Decision Trees

Non-parametric learning

Solves: classification and regression

12-1 What Is a Decision Tree

image image image image

12-2 Information Entropy

image

p_i: the proportion of the data belonging to class i

The data on the right is more certain (lower entropy) than the data on the left

image image
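
The definition behind the figures:

$$
H = -\sum_{i=1}^{k} p_i \log(p_i)
$$

For two classes this is H = -x\log(x) - (1-x)\log(1-x): it peaks at x = 0.5 (maximum uncertainty) and reaches 0 when one class has probability 1.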

12-3 Finding the Best Split with Information Entropy

One split: for each feature, take the values of two neighboring samples on that feature and average them as a candidate threshold; compute the information entropy on both sides of each candidate split and keep the split with the lowest entropy (see the sketch below)

Repeating the split: starting from one split, either set aside the side whose entropy is already minimal (close to 0) and split the remainder again, or keep splitting on the other features
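
A minimal sketch of that search (in the spirit of the course notebook; the function names and the sample-weighted entropy combination are my own choices):

import numpy as np
from collections import Counter
from math import log

def entropy(y):
    counter = Counter(y)
    return -sum(p * log(p) for p in (c / len(y) for c in counter.values()))

def try_split(X, y):
    """Search every feature and every midpoint between sorted neighbors."""
    best_entropy, best_d, best_v = float('inf'), -1, -1
    for d in range(X.shape[1]):
        sorted_index = np.argsort(X[:, d])
        for i in range(1, len(X)):
            if X[sorted_index[i-1], d] != X[sorted_index[i], d]:
                v = (X[sorted_index[i-1], d] + X[sorted_index[i], d]) / 2
                y_left, y_right = y[X[:, d] <= v], y[X[:, d] > v]
                # entropy of the split, weighted by the size of each side
                e = (len(y_left) * entropy(y_left) + len(y_right) * entropy(y_right)) / len(y)
                if e < best_entropy:
                    best_entropy, best_d, best_v = e, d, v
    return best_entropy, best_d, best_v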

12-4 The Gini Index

image
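
The formula in the figure:

$$
G = 1 - \sum_{i=1}^{k} p_i^2
$$

Like entropy, it is 0 for a pure node and maximal when the classes are evenly mixed; sklearn's trees use it by default since it avoids the logarithm.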

12-5 CART and Decision Tree Hyperparameters

image image image

max_depth: the maximum depth of the tree

min_samples_split: the minimum number of samples a node must hold before it may be split; larger values resist overfitting (too large underfits)

min_samples_leaf: the minimum number of samples a leaf node must keep; larger values resist overfitting (too large underfits)

max_leaf_nodes: the maximum number of leaf nodes, an indirect cap on depth (see the sketch after the figure below)

image
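
A minimal sketch of these knobs on sklearn's CART implementation (the values are illustrative):

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=4,
    max_leaf_nodes=16,
    criterion="gini"      # or "entropy"
)
dt_clf.fit(X_train, y_train)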

12-7 Limitations of Decision Trees

Decision boundaries are always axis-aligned (horizontal and vertical segments)

image

Highly sensitive to individual samples

See the code for details

Chapter 13: Ensemble Learning and Random Forests

13-1 What Is Ensemble Learning

image image

13-2 Soft Voting Classifier

image image image

Only models that can estimate probabilities can take part

image image image image

Soft voting usually beats hard voting

image
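
A minimal sketch (SVC needs probability=True to join a soft vote; the model choices are illustrative):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

voting_clf = VotingClassifier(estimators=[
    ("log_clf", LogisticRegression()),
    ("svm_clf", SVC(probability=True)),  # enables predict_proba for soft voting
    ("dt_clf", DecisionTreeClassifier())
], voting="soft")                        # "hard" would use a plain majority vote
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)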

13-3 Bagging and Pasting

image image image image image image image

13-4 oob (Out-of-Bag) and More on Bagging

The figures below state the conclusions without explaining why

image image image image
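
A minimal sketch tying 13-3 and 13-4 together (the parameter values are illustrative):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,   # number of sub-models
    max_samples=100,    # samples drawn for each sub-model
    bootstrap=True,     # True = bagging (with replacement); False = pasting
    oob_score=True,     # evaluate on the samples each sub-model never saw
    n_jobs=-1           # train sub-models in parallel
)
bagging_clf.fit(X, y)
bagging_clf.oob_score_  # out-of-bag estimate, no separate test set needed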

13-5 Random Forests and Extra-Trees

sklearn's random forest: at each node, the split is searched over a random subset of features (instead of over all features, as a plain decision tree does); see the sketch below

image image image
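
A minimal sketch of both (values illustrative):

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

rf_clf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1)
rf_clf.fit(X, y)

# Extra-Trees: the split thresholds are also drawn at random, trading a
# little extra bias for lower variance and faster training
et_clf = ExtraTreesClassifier(n_estimators=500, bootstrap=True, oob_score=True, n_jobs=-1)
et_clf.fit(X, y)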

13-6 Ada Boosting and Gradient Boosting

Classic ensemble models — Boosting (AdaBoost and gradient boosting): https://blog.csdn.net/jesseyule/article/details/111997597

  • AdaBoost and gradient boosting differ mainly in how they make the next round focus on the data the previous round got wrong: AdaBoost attaches weights to the samples and, after each round, raises the weights of the misclassified ones so that the next round pays them more attention.
image

Make mistakes → fix them, over and over

image image image

Reading order: left to right, top to bottom

  • Each green plot fits the residual errors left by the previous green plot
  • Each red plot is the sum of all the green plots so far
image image
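
A minimal sketch of both in sklearn (values illustrative; GradientBoostingClassifier always boosts decision trees, so no base learner is passed):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)

gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=30)
gb_clf.fit(X_train, y_train)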

13-7 Stacking

image image image

Chapter 14: More Machine Learning Algorithms

14-1 Learn from the scikit-learn Documentation — Good Luck, Everyone!

Official site: https://scikit-learn.org/stable/
