1. Univariate Linear Regression
1.1. Problem Setup
Given the sizes of different houses, predict how much each house can sell for. The task is to build a model from the given dataset.
This is clearly a regression problem: based on previous data, predict a real-valued output.
To describe this regression problem, we use the following notation:
The given dataset is called the training set.
$m$ denotes the number of examples in the training set.
$x$ denotes the input variable/feature.
$y$ denotes the output variable/target variable.
$(x, y)$ denotes an example in the training set.
$(x^{(i)}, y^{(i)})$ denotes the $i$-th training example.
$h$ denotes the solution or function produced by the learning algorithm, also called the hypothesis.
In this example, $h$ is a function whose input is the size of a house; $h$ maps the input $x$ to a predicted value of $y$. We therefore take the hypothesis to be $h_{\theta}(x)=\theta_{0}+\theta_{1}x$. Because there is only one feature/input variable, a problem of this kind is called univariate linear regression.
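As a minimal sketch (not from the original notes), the hypothesis can be written as a plain Python function:

```python
# Univariate hypothesis: a line with intercept theta_0 and slope theta_1.
def h(x, theta_0, theta_1):
    return theta_0 + theta_1 * x
```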
1.2. Cost Function
In linear regression, the most commonly used cost is the ==mean squared error==, which takes the form:
$$J(\theta_{0},\theta_{1})=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^{2}=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$$
Our goal is to choose the model parameters that minimize the cost function $J$.
Suppose $\theta_{0}=0$ and we are given the data points (1,1), (2,2), (3,3). The graph of $J(\theta_{1})$ is shown in the figure.
It is easy to see that the cost is smallest at the lowest point of $J$, i.e. the error is minimized at $\theta_{1}=1$.
So, to find suitable model parameters, we simply look for the lowest point of $J$.
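The curve described above can be reproduced with a short sketch; the plotting range here is an arbitrary choice:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data (1,1), (2,2), (3,3); with theta_0 fixed at 0, J depends on theta_1 only.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

theta_1 = np.linspace(-1, 3, 200)
J = [np.sum((t * x - y) ** 2) / (2 * len(x)) for t in theta_1]

plt.plot(theta_1, J)
plt.xlabel('theta_1')
plt.ylabel('J(theta_1)')
plt.show()  # the bowl bottoms out at theta_1 = 1, where J = 0
```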
1.3. Gradient Descent
Gradient descent is an algorithm for finding the minimum of a function. The idea: start from a randomly chosen parameter combination $(\theta_{0},\theta_{1})$, evaluate the cost function, and repeatedly update both parameters simultaneously along the gradient direction of the cost function until a local minimum is reached. As the figure suggests, different initial parameter combinations may lead to different local minima.
The update rule is $\theta_{j}=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1})$, where $\alpha$ is the learning rate, which determines how large a step is taken downhill along the gradient direction.
The choice of $\alpha$ matters. If the learning rate is too small, descent along the gradient is very slow and many steps are needed to reach the minimum; if it is too large, the update may step right over the minimum, overshoot it again on the next iteration, and keep overshooting so that the minimum is never reached.
In this example, the graph of $J$ (shown in the figure) has only one minimum, so the minimum found by gradient descent is the optimal solution. We now apply gradient descent to the cost function $J$; the descent direction is given by its partial derivatives.
Taking the partial derivatives of the cost function:
$$\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1})=\frac{\partial}{\partial\theta_{j}}\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$$
For $j=0$: $\frac{\partial}{\partial\theta_{0}}J(\theta_{0},\theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})$
For $j=1$: $\frac{\partial}{\partial\theta_{1}}J(\theta_{0},\theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$
Gradient descent then repeatedly updates the current values:
$$\theta_{0}=\theta_{0}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})$$
$$\theta_{1}=\theta_{1}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$$
This continues until the lowest point is reached; the values of $\theta_{0},\theta_{1}$ at that point are the best-fitting model parameters we are looking for.
In practice, gradient descent is stopped in one of two ways (see the sketch below):
① Stop once a maximum number of iterations has been reached.
② Stop when the loss function falls below a small preset value, at which point the minimum is considered reached.
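A minimal sketch combining the two stopping rules; the names `grad_J`, `cost_J` and the default values are illustrative assumptions, not part of the original notes:

```python
def gradient_descent(theta, grad_J, cost_J, alpha=0.01, max_iters=1000, eps=1e-6):
    # grad_J and cost_J are assumed callables returning the gradient and cost.
    for i in range(max_iters):        # rule 1: cap on the number of iterations
        theta = theta - alpha * grad_J(theta)
        if cost_J(theta) < eps:       # rule 2: loss below a small preset value
            break
    return theta
```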
1.4. Implementation
1. Gradient descent by hand
```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```
```python
# load the data: column 0 is house size, column 1 is price
data = np.genfromtxt("/root/jupyter_projects/data/fangjia1.csv", delimiter=',')
x_data = data[:, 0]
y_data = data[:, 1]
plt.scatter(x_data, y_data)
plt.show()
```
```python
lr = 0.0001    # learning rate
theta_0 = 0    # intercept, initialized to 0
theta_1 = 0    # slope, initialized to 0
epochs = 50    # number of iterations
```
$$J(\theta_{0},\theta_{1})=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^{2}=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$$
```python
def compute_error(theta_0, theta_1, x_data, y_data):
    # mean squared error J(theta_0, theta_1)
    totalError = 0
    for i in range(0, len(x_data)):
        totalError += (theta_1 * x_data[i] + theta_0 - y_data[i]) ** 2
    return totalError / float(2 * len(x_data))
```
Taking the partial derivatives of the cost function:
$$\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1})=\frac{\partial}{\partial\theta_{j}}\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$$
For $j=0$: $\frac{\partial}{\partial\theta_{0}}J(\theta_{0},\theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})$
For $j=1$: $\frac{\partial}{\partial\theta_{1}}J(\theta_{0},\theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$
Gradient descent then repeatedly updates the current values:
$$\theta_{0}=\theta_{0}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})$$
$$\theta_{1}=\theta_{1}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$$
```python
def gradient_descent_runner(x_data, y_data, theta_0, theta_1, lr, epochs):
    m = len(x_data)
    for i in range(epochs):
        theta_0_grad = 0
        theta_1_grad = 0
        # accumulate the partial derivatives over all training examples
        for j in range(0, len(x_data)):
            theta_0_grad += (1 / m) * ((theta_1 * x_data[j] + theta_0) - y_data[j])
            theta_1_grad += (1 / m) * ((theta_1 * x_data[j] + theta_0) - y_data[j]) * x_data[j]
        # simultaneous update of both parameters
        theta_0 = theta_0 - (theta_0_grad * lr)
        theta_1 = theta_1 - (theta_1_grad * lr)
        # plot the current fit every 5 epochs
        if i % 5 == 0:
            print("epochs:", i)
            plt.scatter(x_data, y_data)
            plt.plot(x_data, theta_1 * x_data + theta_0, 'r')
            plt.show()
    return theta_0, theta_1
```
```python
print("starting theta_0={0},theta_1={1},error={2}".format(
    theta_0, theta_1, compute_error(theta_0, theta_1, x_data, y_data)))
print('running')
theta_0, theta_1 = gradient_descent_runner(x_data, y_data, theta_0, theta_1, lr, epochs)
print("end{0} theta_0={1},theta_1={2},error={3}".format(
    epochs, theta_0, theta_1, compute_error(theta_0, theta_1, x_data, y_data)))
```
```
starting theta_0=0,theta_1=0,error=2782.5539172416056
running
epochs: 0
epochs: 10
epochs: 20
epochs: 30
epochs: 40
epochs: 45
end50 theta_0=0.030569950649287983,theta_1=1.4788903781318357,error=56.32488184238028
```
2. sklearn implementation
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline
```
```python
data = np.genfromtxt("/root/jupyter_projects/data/fangjia1.csv", delimiter=',')
x_data = data[:, 0]
y_data = data[:, 1]
plt.scatter(x_data, y_data)
plt.show()
print(x_data.shape)
```
```
(100,)
```
```python
# sklearn expects a 2-D array of shape (n_samples, n_features)
x_data = data[:, 0, np.newaxis]
print(x_data.shape)
y_data = data[:, 1, np.newaxis]
```
```
(100, 1)
```
```python
model = LinearRegression()
theta = model.fit(x_data, y_data)  # fit returns the fitted estimator itself
```
```python
plt.scatter(x_data, y_data)
plt.plot(x_data, model.predict(x_data), 'r')
plt.show()
print(theta)
print(model.coef_)       # theta_1
print(model.intercept_)  # theta_0
```
2. Multivariate Linear Regression
1. Multivariate linear regression
Adding more features to the housing-price model, such as the number of rooms and the number of floors, gives a model with multiple variables, whose features are $(x_{1},x_{2},...,x_{n})$.
Here $n$ is the number of features and $m$ is the number of examples in the training set.
$x^{(i)}$ denotes the $i$-th training example.
$x^{(i)}_{j}$ denotes the $j$-th feature of the $i$-th row of the feature matrix.
The hypothesis $h$ for multivariate regression is $h_{\theta}(x)=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...+\theta_{n}x_{n}$. Since there are $n+1$ parameters and $n$ variables, we introduce $x_{0}=1$ to simplify the computation. Then:
$$X=\left[\begin{matrix}x_{0}\\x_{1}\\x_{2}\\\vdots\\x_{n}\end{matrix}\right]\qquad\theta=\left[\begin{matrix}\theta_{0}\\\theta_{1}\\\theta_{2}\\\vdots\\\theta_{n}\end{matrix}\right]$$
so $h_{\theta}(x)=\theta_{0}x_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...+\theta_{n}x_{n}=\theta^{T}X$.
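With $x_{0}=1$ prepended, the hypothesis is a single dot product. A small sketch with made-up feature values:

```python
import numpy as np

x = np.array([1.0, 2104.0, 3.0])     # [x_0, size, number of rooms]; values made up
theta = np.array([80.0, 0.1, 10.0])  # illustrative parameters
print(theta @ x)                     # theta^T x, approximately 320.4
```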
Next we need to solve for $\theta$.
1. Gradient descent
Construct the cost function:
$$J(\theta_{0},\theta_{1},...,\theta_{n})=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$$
Applying gradient descent:
$$\theta_{j}=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1},...,\theta_{n})$$
$$\theta_{j}=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$$
$$\theta_{j}=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$$
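In vectorized form, every $\theta_{j}$ is updated at once. A minimal sketch, assuming `X` is the $m\times(n+1)$ design matrix with a leading column of ones and `y` is a length-$m$ column vector:

```python
import numpy as np

def gd_step(theta, X, y, alpha):
    # one simultaneous gradient-descent update for all theta_j
    m = len(y)
    grad = (X.T @ (X @ theta - y)) / m   # (1/m) * sum_i (h(x_i) - y_i) * x_i
    return theta - alpha * grad
```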
2. Normal equation
The normal equation obtains $\theta$ by differentiating the cost function and setting the derivative to zero. Let $X$ be the feature matrix of the training set and $y$ the vector of targets; the normal equation then gives $\theta=(X^{T}X)^{-1}X^{T}y$.
When the number of features is below roughly 10,000, the normal equation is usually used to solve for $\theta$; above that, gradient descent is preferred.
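As a sketch, the closed form can be computed with NumPy; solving the linear system $X^{T}X\theta=X^{T}y$ is numerically preferable to forming the explicit inverse:

```python
import numpy as np

def normal_equation(X, y):
    # solve X^T X theta = X^T y rather than inverting X^T X explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)
```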
2. Gradient Descent in Practice
1. Feature scaling
Take housing prices as an example. Suppose we use two features (house size and number of rooms) and set $\theta_{0}=0$; size ranges from 0 to 2000 square feet while the number of rooms ranges from 0 to 5. Plotting the contour map of the cost function with the two parameters on the axes yields very elongated ellipses, and gradient descent then typically needs many iterations to converge.
To fix this, the features are scaled, as shown in the figure.
Another form of scaling is mean normalization:
$$x_{i}=\frac{x_{i}-u_{i}}{S_{i}}$$
where $u_{i}$ is the mean of feature $x_{i}$ over the training set and $S_{i}$ is its standard deviation, although the range (maximum minus minimum) can be used instead.
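A minimal NumPy sketch of this normalization, returning the statistics so that new inputs can be scaled the same way:

```python
import numpy as np

def scale_features(X):
    # subtract the per-feature mean, divide by the per-feature standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```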
2. Learning rate
Since the number of iterations gradient descent needs to converge cannot be known in advance, we usually plot the cost function against the number of iterations to observe when gradient descent is approaching convergence.
Convergence can also be tested automatically, by comparing the change in the cost function against some threshold $w$ (e.g. 0.001) and declaring convergence when the change is smaller. But choosing a suitable threshold is itself difficult, so the first method is used more often.
When the plot of cost against iteration count behaves as in the figure, $\alpha$ is too large and a somewhat smaller value should be chosen.
When $\alpha$ is too small, gradient descent converges very slowly and needs many iterations.
Learning rates to try, stepping up roughly threefold each time: ……, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ……
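A common way to choose among these is to overlay the cost curves for each candidate. In this sketch, `run_gd` is a hypothetical helper (not defined in these notes) that runs gradient descent with the given learning rate and returns the per-iteration costs:

```python
import matplotlib.pyplot as plt

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]:
    plt.plot(run_gd(alpha), label='alpha={}'.format(alpha))  # run_gd is hypothetical
plt.xlabel('iterations')
plt.ylabel('J')
plt.legend()
plt.show()  # keep the largest alpha whose curve still decreases steadily
```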
3. Boston housing price prediction
```python
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
```
```python
# note: load_boston was removed in recent scikit-learn versions; this follows the original notes
boston = datasets.load_boston()
x_data = boston.data
y_data = boston.target
print(x_data.shape)
print(y_data.shape)

# mean normalization: subtract the mean, divide by the range of each feature
for i in range(x_data.shape[1]):
    x_data[:, i] = (x_data[:, i] - x_data[:, i].mean()) / (x_data[:, i].max() - x_data[:, i].min())

# prepend the x_0 = 1 column to form the design matrix
X = np.concatenate((np.ones((x_data.shape[0], 1)), x_data), axis=1)
Y = y_data.reshape(-1, 1)
print(X.shape, Y.shape)
print(X[:5, 0:])
```
```
(506, 13)
(506,)
(506, 14) (506, 1)
[[ 1.         -0.0405441   0.06636364 -0.32356227 -0.06916996 -0.03435197
   0.05563625 -0.03475696  0.02682186 -0.37171335 -0.21419304 -0.33569506
   0.10143217 -0.21172912]
 [ 1.         -0.04030818 -0.11363636 -0.14907546 -0.06916996 -0.17632728
   0.02612869  0.10633469  0.1065807  -0.32823509 -0.31724648 -0.06973762
   0.10143217 -0.09693883]
 [ 1.         -0.0403084  -0.11363636 -0.14907546 -0.06916996 -0.17632728
   0.17251688 -0.07698147  0.1065807  -0.32823509 -0.31724648 -0.06973762
   0.09116942 -0.23794325]
 [ 1.         -0.0402513  -0.11363636 -0.32832766 -0.06916996 -0.19896103
   0.13668626 -0.23455099  0.20616331 -0.28475683 -0.35541442  0.02600706
   0.09570823 -0.26802051]
 [ 1.         -0.03983903 -0.11363636 -0.32832766 -0.06916996 -0.19896103
   0.16523579 -0.14804224  0.20616331 -0.28475683 -0.35541442  0.02600706
   0.10143217 -0.20207128]]
```
$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$$
Writing the sum in matrix form, and dropping the constant factor $\frac{1}{2m}$ since it does not affect where the derivative vanishes:
$$\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}=(y-X\theta)^{T}(y-X\theta)$$
Useful transpose identities:
$$(A+B)^{T}=A^{T}+B^{T}\qquad(AB)^{T}=B^{T}A^{T}$$
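Both identities are easy to sanity-check numerically; a throwaway sketch:

```python
import numpy as np

A = np.random.rand(3, 2)
B = np.random.rand(2, 4)
C = np.random.rand(3, 2)
assert np.allclose((A @ B).T, B.T @ A.T)  # (AB)^T = B^T A^T
assert np.allclose((A + C).T, A.T + C.T)  # (A+B)^T = A^T + B^T
```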
$$\frac{\partial J(\theta)}{\partial\theta}=\frac{\partial(y-X\theta)^{T}(y-X\theta)}{\partial\theta}$$
$$=\frac{\partial(y^{T}y-y^{T}X\theta-\theta^{T}X^{T}y+\theta^{T}X^{T}X\theta)}{\partial\theta}$$
$$=\frac{\partial y^{T}y}{\partial\theta}-\frac{\partial y^{T}X\theta}{\partial\theta}-\frac{\partial\theta^{T}X^{T}y}{\partial\theta}+\frac{\partial\theta^{T}X^{T}X\theta}{\partial\theta}$$
Numerator layout: the numerator is a column vector or the denominator is a row vector.
Denominator layout: the numerator is a row vector or the denominator is a column vector.

| | Numerator layout | Denominator layout |
| --- | --- | --- |
| $\frac{\partial b^{T}Ax}{\partial x}$ | $b^{T}A$ | $A^{T}b$ |
| $\frac{\partial X^{T}AX}{\partial X}$ | $X^{T}(A+A^{T})$ | $(A+A^{T})X$ |
Using the denominator layout:
$$\frac{\partial y^{T}y}{\partial\theta}=0$$
$$\frac{\partial y^{T}X\theta}{\partial\theta}=X^{T}y$$
Because $\theta^{T}X^{T}y$ is a scalar, it equals its own transpose, so
$$\frac{\partial\theta^{T}X^{T}y}{\partial\theta}=\frac{\partial(\theta^{T}X^{T}y)^{T}}{\partial\theta}=\frac{\partial y^{T}X\theta}{\partial\theta}=X^{T}y$$
$$\frac{\partial\theta^{T}X^{T}X\theta}{\partial\theta}=(X^{T}X+(X^{T}X)^{T})\theta=2X^{T}X\theta$$
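The quadratic-form derivative can be checked against a central-difference approximation; a quick sketch:

```python
import numpy as np

X = np.random.rand(5, 3)
theta = np.random.rand(3)
f = lambda t: t @ X.T @ X @ t   # f(theta) = theta^T X^T X theta
eps = 1e-6
numeric = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, 2 * X.T @ X @ theta, atol=1e-4)
```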
Substituting these back in:
$$\frac{\partial J(\theta)}{\partial\theta}=2X^{T}X\theta-2X^{T}y$$
Setting this derivative to zero gives $X^{T}X\theta=X^{T}y$, from which $\theta$ can be solved directly:
$$\theta=(X^{T}X)^{-1}X^{T}y$$
```python
theta = np.random.rand(14, 1)   # random initialization of the 14 parameters
print(theta)
```
```
[[0.59826263]
 [0.7714136 ]
 [0.96782367]
 [0.05007622]
 [0.28160071]
 [0.59145316]
 [0.86688075]
 [0.32402074]
 [0.83781872]
 [0.9972824 ]
 [0.03729666]
 [0.3470009 ]
 [0.35738919]
 [0.25179409]]
```
```python
learning_rate = 0.0001
epochs = 10000
costList = []
for i in range(epochs):
    # gradient of the (unscaled) cost: X^T (X theta - Y)
    grad_theta = np.matmul(X.T, np.matmul(X, theta) - Y)
    theta = theta - learning_rate * grad_theta
    # record the mean squared error every 10 epochs
    if i % 10 == 0:
        cost = np.mean(np.square(Y - np.matmul(X, theta)))
        costList.append(cost)
ep = np.linspace(0, epochs, 1000)
print(costList[-1])
plt.plot(ep, costList)
plt.xlabel('epochs')
plt.ylabel('cost')
plt.show()
```
```
21.901811902045843
```
The parameters learned by gradient descent:

```
array([[ 22.53280632],
       [ -8.60798018],
       [  4.47504867],
       [  0.50817969],
       [  2.71178142],
       [ -8.41145881],
       [ 20.00501683],
       [  0.09039972],
       [-15.82506381],
       [  6.74485343],
       [ -6.25266277],
       [ -8.92882317],
       [  3.73796887],
       [-19.13688284]])
```
```python
# normal equation: theta = (X^T X)^{-1} X^T y
Xmat = np.matrix(X)
Ymat = np.matrix(Y)
xt = Xmat.T * Xmat
w = xt.I * Xmat.T * Ymat
w
```
```
matrix([[ 22.53280632],
        [ -9.60975755],
        [  4.64204584],
        [  0.56083933],
        [  2.68673382],
        [ -8.63457306],
        [ 19.88368651],
        [  0.06721501],
        [-16.22666104],
        [  7.03913802],
        [ -6.46332721],
        [ -8.95582398],
        [  3.69282735],
        [-19.01724361]])
```
```python
cost = np.mean(np.square(Ymat - Xmat * w))
cost
```
```
21.894831181729206
```
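As expected for a convex cost, the closed-form solution reaches a slightly lower cost (about 21.895) than 10,000 steps of gradient descent (about 21.902), and the two parameter vectors agree closely.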