

Author: 五速梦信息网
Date: 2026-06-19 07:46


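Contents

1. Project analysis
    1. Framing the problem
    2. Performance measure
2. Getting the data
    1. Preparing the workspace
    2. Downloading the data
    3. Looking at the data
    4. Creating a test set
3. Exploring the data
    1. Visualizing geographical data
    2. Looking for correlations
    3. Combining attributes
4. Preparing the data
    1. Data cleaning
    2. Scikit-Learn design
    3. Handling text and categorical attributes
    4. Custom transformers
    5. Feature scaling
    6. Pipelines
5. Selecting and training a model
    1. Training and evaluating on the training set
    2. Cross-validation
6. Fine-tuning the model
    1. Grid search
    2. Randomized search
    3. Ensemble methods
    4. Model errors
    5. Evaluating the system on the test set
7. Deployment, monitoring, and maintenance
    1. Deployment
    2. Monitoring
    3. Maintenance
8. Available data sources

1. Project analysis

Goal: use California census data to build a model of housing prices in the state, so that the median housing price of any district can be predicted from all the other metrics.

Machine learning project checklist:

- Frame the problem and look at the big picture.
- Get the data.
- Explore the data to gain insights.
- Prepare the data so that the underlying patterns are exposed to the machine learning algorithms.
- Explore different models and shortlist the best ones.
- Fine-tune the models and combine them into a good solution.
- Present the solution.
- Launch, monitor, and maintain the system.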

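1. Framing the problem

Business objective: the model's output (a prediction of a district's median housing price) is fed, along with other signals, to another machine learning system. That downstream system decides whether the district is worth investing in.

Pipeline: a sequence of data processing components, used in machine learning systems to manipulate and transform data.

- Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and writes the result to another data store. The next component then pulls that output and produces its own, and so on.
- Components are independent of each other; the data store is their only interface. This keeps each component simple and prevents them from interfering with one another.
- Proper monitoring is required. A broken component does not stop the other components from running, but if it goes unnoticed for long, the stale data degrades the performance of the whole system.

Current solution: a team of experts estimates district housing prices manually. They keep gathering up-to-date district information and compute the median housing price; when it cannot be computed, they estimate it with complex rules.

- The current solution serves as a reference.
- It is an expert system: people distill the knowledge and then teach it to the computer. This is costly, and the estimates are not necessarily satisfactory.

This is a typical supervised learning task, because labeled training examples are given. It is also a typical regression task: a multiple regression, since the system predicts a value from several features, and a univariate regression, since it predicts a single value per district. Finally, it is a batch learning system: there is no continuous stream of incoming data, no particular need to adjust to rapidly changing data, and the dataset is small enough to fit in memory.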
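2. Performance measure

Root mean square error (RMSE), which corresponds to the Euclidean norm, gives an idea of how much error the system typically makes in its predictions:

$$\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{i}) - y^{i} \right)^{2}}$$

- $m$ is the number of instances in the dataset on which the RMSE is measured. For example, when evaluating the RMSE on a validation set of 2000 districts, $m = 2000$.
- $x^{i}$ is the vector of all feature values (excluding the label) of the $i$-th instance in the dataset, and $y^{i}$ is its label (the desired output value for that instance). For example, if the first district is located at longitude -118.29°, latitude 33.91°, has 1416 inhabitants with a median income of 38372 dollars, and a median house value of 156400 dollars, then:

$$x^{1} = \begin{pmatrix} -118.29 \\ 33.91 \\ 1416 \\ 38372 \end{pmatrix}, \qquad y^{1} = 156400$$

- $X$ is the matrix containing all feature values (excluding labels) of all instances in the dataset. Each row represents one instance, and the $i$-th row equals the transpose of $x^{i}$, written $(x^{i})^{T}$:

$$X = \begin{pmatrix} (x^{1})^{T} \\ (x^{2})^{T} \\ \vdots \\ (x^{1999})^{T} \\ (x^{2000})^{T} \end{pmatrix} = \begin{pmatrix} -118.29 & 33.91 & 1416 & 38372 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}$$

- $h$ is the system's prediction function, also called a hypothesis. Given an instance's feature vector $x^{i}$, it outputs a predicted value $\hat{y}^{i} = h(x^{i})$ for that instance. For example, if the system predicts a median house value of 158400 dollars for the first district, then $\hat{y}^{1} = h(x^{1}) = 158400$ and the prediction error is $\hat{y}^{1} - y^{1} = 2000$.
- $\mathrm{RMSE}(X, h)$ is the cost function measured on a set of instances using the hypothesis $h$.

Another option is the mean absolute error (MAE), also called the average absolute deviation, which corresponds to the Manhattan norm:

$$\mathrm{MAE}(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{i}) - y^{i} \right|$$

Both RMSE and MAE measure the distance between two vectors: the vector of predictions and the vector of target values. The higher the norm index, the more the measure focuses on large values and neglects small ones. RMSE is therefore more sensitive to outliers than MAE. When outliers are exponentially rare (as in a bell-shaped distribution), RMSE performs very well.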
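To make the difference between the two measures concrete, here is a minimal sketch using only the standard library. The label and prediction values are made up for illustration (only 156400 and 158400 come from the text above). Both prediction sets have the same MAE, but the one that concentrates the error in a single district has twice the RMSE.

```python
import math


def rmse(predictions, labels):
    """Root mean square error between two equally long sequences."""
    m = len(labels)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, labels)) / m)


def mae(predictions, labels):
    """Mean absolute error between two equally long sequences."""
    m = len(labels)
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / m


# Hypothetical labels (median house values) for four districts.
labels = [156400.0, 200000.0, 310000.0, 95000.0]

# Case 1: every prediction is off by 2000 dollars.
preds_even = [158400.0, 198000.0, 312000.0, 93000.0]

# Case 2: three perfect predictions and one that is off by 8000 dollars.
preds_outlier = [156400.0, 200000.0, 310000.0, 103000.0]

rmse_even, mae_even = rmse(preds_even, labels), mae(preds_even, labels)
rmse_outlier, mae_outlier = rmse(preds_outlier, labels), mae(preds_outlier, labels)

print(rmse_even, mae_even)        # same total absolute error, spread evenly
print(rmse_outlier, mae_outlier)  # same MAE, larger RMSE
```

The single large error is squared before averaging, which is why RMSE reacts to it while MAE does not.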
2. Getting the data

1. Preparing the workspace

Create the workspace directory:

    export ML_HOME="$HOME/Documents/workspace/projects/aurelius/lmsl/studying/ml/handson-ml2/workspace"
    mkdir -p "$ML_HOME"

Install Python. Installation details are omitted here; use a recent version of Python 3 and keep pip up to date.

    # Check the pip version
    python3 -m pip --version
    # Upgrade pip to the latest version
    python3 -m pip install --user -U pip

Create a dedicated Python environment:

    cd "$ML_HOME"
    # Create a dedicated Python environment named .venv
    python3 -m venv .venv
    # Enter the environment
    source .venv/bin/activate      # on Linux or macOS
    .venv\Scripts\activate         # on Windows
    # Leave the environment
    deactivate

Install the required modules: requests, Jupyter, NumPy, pandas, Matplotlib, and Scikit-Learn.

    # Install the required modules with pip
    pip install requests jupyter matplotlib numpy pandas scipy scikit-learn
    # Register the environment with Jupyter and give it a name
    python -m ipykernel install --user --name=ml-venv

If installation is slow, switch pip to the Tsinghua mirror:

    cd ~/.pip
    vi pip.conf

    # Add the following to ~/.pip/pip.conf
    [global]
    index-url = https://pypi.tuna.tsinghua.edu.cn/simple
    [install]
    trusted-host = pypi.tuna.tsinghua.edu.cn

Start Jupyter Notebook:

    jupyter notebook

This starts a local web service that is reachable at http://localhost:8888. Alternatively, the Jupyter extension for VS Code is recommended: it runs notebooks without starting the Jupyter service manually with the jupyter notebook command. The details are left for the reader to explore.
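2. Downloading the data

The data for this project is a compressed archive containing a CSV file. It can be downloaded with a browser and extracted with the tar command, but writing a Python function makes the process reusable:

    import os
    import tarfile
    import requests

    def fetch_data(url, path, tgz):
        # Create the target directory if it does not exist yet
        if not os.path.isdir(path):
            os.makedirs(path)
        tgz_path = os.path.join(path, tgz)
        # Download the archive
        with open(tgz_path, "wb") as w:
            w.write(requests.get(url).content)
        # Extract it next to the archive
        housing_tgz = tarfile.open(tgz_path)
        housing_tgz.extractall(path=path)
        housing_tgz.close()

Download the data and extract it into the workspace:

    DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
    HOUSING_PATH = os.path.join("workspace", "datasets", "housing")
    HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
    HOUSING_TGZ = "housing.tgz"

    fetch_data(HOUSING_URL, HOUSING_PATH, HOUSING_TGZ)

Load the data with pandas:

    import pandas as pd

    def load_data(path, csv):
        csv_path = os.path.join(path, csv)
        return pd.read_csv(csv_path)

    housing = load_data(HOUSING_PATH, "housing.csv")

3. Looking at the data

Show the first 5 rows of the dataset:

    housing.head()

Instance attributes:

- longitude: longitude
- latitude: latitude
- housing_median_age: median age of the housing
- total_rooms: total number of rooms
- total_bedrooms: total number of bedrooms
- population: population
- households: number of households
- median_income: median income
- median_house_value: median house value
- ocean_proximity: proximity to the ocean

Show a brief description of the dataset:

    housing.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 20640 entries, 0 to 20639
    Data columns (total 10 columns):
     #   Column              Non-Null Count  Dtype
    ---  ------              --------------  -----
     0   longitude           20640 non-null  float64
     1   latitude            20640 non-null  float64
     2   housing_median_age  20640 non-null  float64
     3   total_rooms         20640 non-null  float64
     4   total_bedrooms      20433 non-null  float64
     5   population          20640 non-null  float64
     6   households          20640 non-null  float64
     7   median_income       20640 non-null  float64
     8   median_house_value  20640 non-null  float64
     9   ocean_proximity     20640 non-null  object
    dtypes: float64(9), object(1)
    memory usage: 1.6 MB

Summary: the dataset contains 20640 instances. total_bedrooms has only 20433 non-null values. ocean_proximity has type object; all other attributes are numerical.

Show the categories of the categorical attribute:

    housing["ocean_proximity"].value_counts()

    <1H OCEAN     9136
    INLAND        6551
    NEAR OCEAN    2658
    NEAR BAY      2290
    ISLAND           5
    Name: ocean_proximity, dtype: int64

ocean_proximity takes five distinct values, distributed as shown above.

Show a summary of the numerical attributes:

    housing.describe()

- std: standard deviation, which measures how dispersed the values are.
- 25% / 50% / 75%: percentiles; each indicates the value below which the given percentage of observations falls.
- count: number of rows; null values are ignored, so the count for total_bedrooms is 20433.

Plot a histogram for each attribute:

    # Tell Matplotlib which backend to use (not needed in VS Code).
    # This makes Matplotlib use Jupyter's own backend so plots render in the notebook.
    %matplotlib inline   # only in a Jupyter notebook
    import matplotlib.pyplot as plt

    # hist() relies on Matplotlib
    housing.hist(bins=50, figsize=(20,15))
    plt.show()

Observations from the histograms:

- median_income is clearly not expressed in US dollars. The values have been scaled (they are roughly in tens of thousands of dollars) and capped at about 15 for the highest incomes and 0.5 for the lowest. Other attributes have also been scaled to varying degrees.
- housing_median_age and median_house_value are capped as well. Because median_house_value is the target attribute, this needs special attention. The options are:
    - collect proper labels for the districts whose labels were capped, or
    - remove the districts whose label exceeds the cap.
- Most histograms are tail-heavy (long-tailed). This can make it harder for some machine learning algorithms to detect patterns, so these attributes should be transformed toward a more bell-shaped distribution.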
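4. Creating a test set

Data snooping bias: if you look at the test set beforehand, you may stumble upon a seemingly interesting pattern in the test data and pick a particular kind of machine learning model because of it. When you then estimate the generalization error on the test set, the estimate is too optimistic, and the system performs worse than expected once it goes into production.

Pick some instances at random (typically 20%, or less if the dataset is very large) and set them aside:

    import numpy as np

    def split_train_test(data, test_ratio):
        shuffled_indices = np.random.permutation(len(data))
        test_set_size = int(len(data) * test_ratio)
        test_indices = shuffled_indices[:test_set_size]
        train_indices = shuffled_indices[test_set_size:]
        return data.iloc[train_indices], data.iloc[test_indices]

    train_set, test_set = split_train_test(housing, 0.2)
    print(len(train_set), len(test_set))

    16512 4128

Running this split again produces a different test set each time. Over repeated runs the learning algorithm (and you) would end up seeing the whole dataset, which is exactly what a test set is meant to prevent.

Possible fixes:

- Save the test set on the first run, or fix the random number generator's seed (for example with np.random.seed(42)) so that the same indices are always selected. Both break down as soon as an updated dataset has to be split.
- A better approach is a deterministic rule: compute a hash of each instance's unique identifier and use it to decide whether the instance goes into the test set.

    from zlib import crc32

    # Deterministic rule that decides whether an instance goes into the test set
    def test_set_check(identifier, test_ratio):
        return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

    def split_train_test_by_id(data, test_ratio, id_column):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
        return data.loc[~in_test_set], data.loc[in_test_set]

Use the row index as the unique identifier:

    housing_with_id = housing.reset_index()   # adds an `index` column
    train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

This requires that new data is only ever appended to the end of the dataset and that no row is ever deleted.

Use longitude and latitude as the unique identifier:

    housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
    train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

Use Scikit-Learn's train_test_split():

    from sklearn.model_selection import train_test_split
    train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

- random_state sets the random number generator's seed.
- Several datasets with the same number of rows can be passed at once, and they are split on the same indices.

Random sampling is appropriate when the dataset is large enough relative to the number of attributes. Otherwise it can introduce a significant sampling bias.

Stratified sampling: divide the dataset into subsets (strata) according to an attribute, then draw the same proportion of instances from each stratum and merge them into the test set.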
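The stability of the identifier-hash split described earlier in this subsection can be checked with the standard library alone. The following is a minimal sketch, not the article's pandas code: identifiers are plain integers, `in_test_set` and `split_ids` are hypothetical helper names, and the integer is converted to 8 little-endian bytes as an assumed stand-in for `np.int64`.

```python
from zlib import crc32


def in_test_set(identifier: int, test_ratio: float) -> bool:
    """Decide test-set membership from a hash of the identifier alone."""
    # Assumption: 8-byte little-endian encoding, mimicking a 64-bit integer.
    raw = identifier.to_bytes(8, "little", signed=True)
    return (crc32(raw) & 0xFFFFFFFF) < test_ratio * 2**32


def split_ids(ids, test_ratio):
    """Return (train_ids, test_ids) using the deterministic hash rule."""
    train, test = [], []
    for i in ids:
        (test if in_test_set(i, test_ratio) else train).append(i)
    return train, test


# Split an initial dataset, then a grown dataset with 5000 appended rows.
old_ids = list(range(20000))
new_ids = list(range(25000))
train_old, test_old = split_ids(old_ids, 0.2)
train_new, test_new = split_ids(new_ids, 0.2)

# No instance that was used for training ever moves into the test set.
test_new_set = set(test_new)
moved = [i for i in train_old if i in test_new_set]
print(len(test_old), len(test_new), moved)
```

Because membership depends only on the identifier, appending rows never reshuffles earlier rows, which is the property a fixed random seed cannot provide.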
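The aim is to preserve the original distribution of an important attribute in the test set. Median income is an important attribute for predicting the median house value, so the test set should be representative of the various income categories in the whole dataset.

    # Split income into 5 strata: 0 ~ 1.5 ~ 3 ~ 4.5 ~ 6 ~ infinity
    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                   labels=[1, 2, 3, 4, 5])
    housing["income_cat"].hist()

Stratified sampling by income category with Scikit-Learn's StratifiedShuffleSplit:

    from sklearn.model_selection import StratifiedShuffleSplit

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]

    # Check the proportion of instances in each stratum
    strat_test_set["income_cat"].value_counts() / len(strat_test_set)

    3    0.350533
    2    0.318798
    4    0.176357
    5    0.114341
    1    0.039971
    Name: income_cat, dtype: float64

Compare the income category proportions in the full dataset, the stratified test set, and the randomly sampled test set:

    def income_cat_proportions(data):
        return data["income_cat"].value_counts() / len(data)

    train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

    compare_props = pd.DataFrame({
        "Overall": income_cat_proportions(housing),
        "Stratified": income_cat_proportions(strat_test_set),
        "Random": income_cat_proportions(test_set),
    }).sort_index()
    compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
    compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

    compare_props

Remove the income_cat attribute:

    for set_ in (strat_train_set, strat_test_set):
        set_.drop("income_cat", axis=1, inplace=True)

3. Exploring the data

Create a copy of the training set so that later experiments cannot damage it:

    housing = strat_train_set.copy()

1. Visualizing geographical data

Plot longitude against latitude, with transparency showing data density:

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

The high-density areas stand out clearly in the plot.

Plot longitude against latitude, adding population and median house value:

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"]/100, label="population", figsize=(10,7),
                 c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    )
    plt.legend()

- Population is represented by the size of each circle (option s).
- Median house value is represented by color (option c), with the color range (option cmap) taken from the predefined color map jet.

The plot confirms that housing prices are closely tied to location and population density.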
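2. Looking for correlations

Use corr() to compute the standard correlation coefficient (Pearson's r) between every pair of attributes:

    # With pandas 2.0 or later, use housing.corr(numeric_only=True),
    # because ocean_proximity is not numeric.
    corr_matrix = housing.corr()
    corr_matrix["median_house_value"].sort_values(ascending=False)

    median_house_value    1.000000
    median_income         0.687151
    total_rooms           0.135140
    housing_median_age    0.114146
    households            0.064590
    total_bedrooms        0.047781
    population           -0.026882
    longitude            -0.047466
    latitude             -0.142673
    Name: median_house_value, dtype: float64

The correlation coefficient measures linear correlation and ranges from -1 to 1. The closer it is to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; 0 means there is no linear correlation between the two attributes.

Plot the correlations with pandas' scatter_matrix():

    from pandas.plotting import scatter_matrix

    attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
    scatter_matrix(housing[attributes], figsize=(12, 8))

The main diagonal shows a histogram of each attribute; every other cell plots one attribute against another.

Look at the most promising attribute for predicting the median house value, which is the median income (the most strongly correlated attribute):

    housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)

The plot confirms that the correlation is strong. It also shows clear horizontal lines around 500,000, 450,000, and 350,000 dollars, which probably stem from price caps in the data. To keep the learning algorithm from reproducing these quirks, consider removing the corresponding districts.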
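The statement that the coefficient captures only linear relationships can be illustrated with a minimal standard-library sketch. `pearson_r` is a hypothetical helper written for this example, and the data points are made up; it is not how pandas implements corr().

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # perfectly linear, increasing
r_down = pearson_r(xs, [-0.5 * x for x in xs])    # perfectly linear, decreasing
r_parabola = pearson_r(xs, [x * x for x in xs])   # fully dependent, but not linear

print(r_up, r_down, r_parabola)
```

The parabola case yields a coefficient of zero even though y is completely determined by x, so a low coefficient does not rule out a strong nonlinear relationship.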
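3. Combining attributes

The correlation analysis above revealed some data quirks (such as the horizontal lines) that should be cleaned up beforehand, and some tail-heavy distributions that should be transformed (for example by taking the logarithm). Combining attributes may additionally uncover new attributes that correlate strongly with the target.

Try a few combinations and check their correlation with the target attribute:

    housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
    housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
    housing["population_per_household"] = housing["population"]/housing["households"]

    corr_matrix = housing.corr()
    corr_matrix["median_house_value"].sort_values(ascending=False)

    median_house_value          1.000000
    median_income               0.687151
    rooms_per_household         0.146255
    total_rooms                 0.135140
    housing_median_age          0.114146
    households                  0.064590
    total_bedrooms              0.047781
    population_per_household   -0.021991
    population                 -0.026882
    longitude                  -0.047466
    latitude                   -0.142673
    bedrooms_per_room          -0.259952
    Name: median_house_value, dtype: float64

The new attribute bedrooms_per_room is much more strongly correlated with the median house value (in absolute terms) than the original attributes total_bedrooms and total_rooms.

4. Preparing the data

Create a fresh copy of the training set and separate the predictors from the labels:

    housing = strat_train_set.drop("median_house_value", axis=1)
    housing_labels = strat_train_set["median_house_value"].copy()

drop() does not modify strat_train_set; it returns a new copy of the data.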
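1. Data cleaning

Some values of total_bedrooms are missing. There are three options:

- dropna(): get rid of the districts with missing values.
- drop(): get rid of the whole attribute.
- fillna(): set the missing values to some value (zero, the mean, the median, and so on).

    housing.dropna(subset=["total_bedrooms"])      # option 1
    housing.drop("total_bedrooms", axis=1)         # option 2
    median = housing["total_bedrooms"].median()    # option 3
    housing["total_bedrooms"].fillna(median, inplace=True)

Handle missing values with Scikit-Learn's SimpleImputer:

    from sklearn.impute import SimpleImputer

    # Create an imputer that fills in the median
    imputer = SimpleImputer(strategy="median")
    # The median can only be computed on numerical attributes, so drop ocean_proximity
    housing_num = housing.drop("ocean_proximity", axis=1)
    # fit() adapts the imputer to the training data: it computes the median
    # of each attribute and stores the result in statistics_
    imputer.fit(housing_num)
    # Inspect the medians
    imputer.statistics_
    # Check that the medians were computed correctly
    housing_num.median().values
    # transform() replaces the missing values with the learned medians
    X = imputer.transform(housing_num)
    # Load the NumPy array back into a pandas DataFrame
    housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

2. Scikit-Learn design

Scikit-Learn's API follows a few design principles.

Consistency: all objects share a simple, consistent interface.

- Estimators: any object that can estimate some parameters from a dataset, as the imputer estimates medians.
    - The estimation is performed by the fit() method.
    - fit() takes a single dataset as a parameter, or two for supervised learning: one holding the training data and one holding the labels.
    - Any other parameter that guides the estimation process is a hyperparameter, such as strategy in strategy="median". Hyperparameters must be set as instance variables.
- Transformers: estimators that can also transform a dataset, such as the imputer.
    - The transformation is performed by the transform() method, which takes the dataset to transform and returns the transformed dataset.
    - The transformation generally relies on the learned parameters, such as imputer.statistics_.
    - fit_transform() is equivalent to calling fit() and then transform(); it is sometimes optimized and runs faster.
- Predictors: estimators that can make predictions given a dataset, such as the LinearRegression model.
    - The predict() method takes a dataset of new instances and returns a dataset of corresponding predictions.
    - The score() method measures the quality of the predictions on a given test set (and the corresponding labels, for supervised learning algorithms).

Inspection: all of an estimator's hyperparameters are directly accessible through public instance variables (for example imputer.strategy), and all of its learned parameters are accessible through public instance variables with a trailing underscore (for example imputer.statistics_).

Nonproliferation of classes: datasets are represented as NumPy arrays or SciPy sparse matrices instead of custom classes. Hyperparameters are plain Python strings or numbers.

Composition: existing building blocks are reused as much as possible. Any sequence of transformers followed by a final predictor can be assembled into a Pipeline estimator.

Sensible defaults: Scikit-Learn provides reasonable default values for most parameters, which makes it quick to set up a basic working system.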
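The estimator and transformer conventions above can be sketched without Scikit-Learn. The toy class below is an illustration only, not SimpleImputer's implementation: `MedianImputer`, its `fill_when_empty` hyperparameter, and the sample values are all hypothetical, and missing values are represented by `None` in a plain list.

```python
from statistics import median


class MedianImputer:
    """Toy transformer for one numeric column; None marks a missing value."""

    def __init__(self, fill_when_empty=0.0):
        # Hyperparameter: stored as a public instance variable.
        self.fill_when_empty = fill_when_empty

    def fit(self, values, y=None):
        # Learned parameter: public, with a trailing underscore.
        known = [v for v in values if v is not None]
        self.statistic_ = median(known) if known else self.fill_when_empty
        return self  # returning self allows call chaining

    def transform(self, values):
        # Uses only what fit() learned; never recomputes from new data.
        return [self.statistic_ if v is None else v for v in values]

    def fit_transform(self, values, y=None):
        return self.fit(values, y).transform(values)


train_bedrooms = [129.0, None, 190.0, 235.0, None, 280.0]
imputer = MedianImputer()
filled_train = imputer.fit_transform(train_bedrooms)

# New data is filled with the median learned on the training data.
filled_new = imputer.transform([None, 50.0])
print(imputer.statistic_, filled_train, filled_new)
```

Keeping the learned median on the object is what lets the same value be applied later to the test set and to new data in production.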
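3. Handling text and categorical attributes

Look at the first 10 rows of the text attribute:

    housing_cat = housing[["ocean_proximity"]]
    housing_cat.head(10)

          ocean_proximity
    12655          INLAND
    15502      NEAR OCEAN
    2908           INLAND
    14053      NEAR OCEAN
    20496       <1H OCEAN
    1481         NEAR BAY
    18125       <1H OCEAN
    5830        <1H OCEAN
    17989       <1H OCEAN
    4861        <1H OCEAN

ocean_proximity is not arbitrary text: it takes a limited number of values, so it is a categorical attribute.

Convert the text attribute to numbers with Scikit-Learn's OrdinalEncoder:

    from sklearn.preprocessing import OrdinalEncoder

    ordinal_encoder = OrdinalEncoder()
    housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
    housing_cat_encoded[:10]

    array([[1.],
           [4.],
           [1.],
           [4.],
           [0.],
           [3.],
           [0.],
           [0.],
           [0.],
           [0.]])

List the categories:

    ordinal_encoder.categories_

    [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
           dtype=object)]

One-hot encoding creates one binary attribute per category value: 1 means hot and 0 means cold. This prevents algorithms from assuming that two categories are similar merely because their numeric codes are close.

Convert the text attribute to one-hot vectors with Scikit-Learn's OneHotEncoder:

    from sklearn.preprocessing import OneHotEncoder

    cat_encoder = OneHotEncoder()
    housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
    housing_cat_1hot

    # The output is a SciPy sparse matrix
    <16512x5 sparse matrix of type '<class 'numpy.float64'>'
        with 16512 stored elements in Compressed Sparse Row format>

A sparse matrix stores only the locations of the nonzero elements, yet it can mostly be used like an ordinary two-dimensional array.

Show the sparse matrix as a dense two-dimensional array:

    housing_cat_1hot.toarray()

    array([[0., 1., 0., 0., 0.],
           [0., 0., 0., 0., 1.],
           [0., 1., 0., 0., 0.],
           ...,
           [1., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0.],
           [0., 1., 0., 0., 0.]])

List the encoder's categories:

    cat_encoder.categories_

    [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
           dtype=object)]

If a categorical attribute has a large number of possible categories, one-hot encoding produces a large number of input features, which may slow down training and degrade performance. In that case, consider replacing the categorical input with a related numerical feature (for example the distance to the ocean instead of ocean_proximity), or replacing each category with a learnable low-dimensional vector.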
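The difference between the two encodings is easy to see on a handful of values. This is a minimal sketch in plain Python with hypothetical helper names (`fit_categories`, `ordinal_encode`, `one_hot_encode`); it mirrors the idea, not the Scikit-Learn encoders' internals, and returns dense lists rather than a sparse matrix.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories."""
    return sorted(set(values))


def ordinal_encode(values, categories):
    """Map each category to its position in the category list."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


def one_hot_encode(values, categories):
    """Map each category to a vector with a single 1.0 (the 'hot' entry)."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0
        rows.append(row)
    return rows


sample = ["INLAND", "NEAR OCEAN", "INLAND", "<1H OCEAN", "NEAR BAY"]
categories = fit_categories(sample)
ordinal = ordinal_encode(sample, categories)
one_hot = one_hot_encode(sample, categories)

print(categories)
print(ordinal)
for row in one_hot:
    print(row)
```

In the ordinal codes, "NEAR BAY" and "NEAR OCEAN" happen to be adjacent, while "<1H OCEAN" and "NEAR OCEAN" are far apart; the one-hot rows carry no such accidental ordering.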
4. Custom transformers

Custom transformers can implement tasks such as cleanup operations or combining specific attributes, and they work seamlessly with Scikit-Learn's own functionality.

- Scikit-Learn relies on duck typing rather than inheritance: it is enough to create a class that implements fit() (returning self), transform(), and fit_transform().
- TransformerMixin: provides fit_transform() automatically.
- BaseEstimator: provides get_params() and set_params(), which are useful for automatic hyperparameter tuning.

A custom transformer that adds the combined attributes:

    from sklearn.base import BaseEstimator, TransformerMixin

    # Column indices of the attributes that are combined
    rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
            self.add_bedrooms_per_room = add_bedrooms_per_room
        def fit(self, X, y=None):
            return self  # nothing else to do
        def transform(self, X):
            rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
            population_per_household = X[:, population_ix] / X[:, households_ix]
            if self.add_bedrooms_per_room:
                bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
                return np.c_[X, rooms_per_household, population_per_household,
                             bedrooms_per_room]
            else:
                return np.c_[X, rooms_per_household, population_per_household]

    attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
    housing_extra_attribs = attr_adder.transform(housing.values)

The hyperparameter add_bedrooms_per_room controls whether the bedrooms_per_room attribute is added. Gating preparation steps behind hyperparameters in this way makes it easy to try more combinations.
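5. Feature scaling

Feature scaling is one of the most important transformations to apply to the data. The two common ways to bring all attributes to the same scale are min-max scaling and standardization.

- Min-max scaling (normalization): values are shifted and rescaled so that they range from 0 to 1, by subtracting the minimum and dividing by the difference between the maximum and the minimum. Scikit-Learn's MinMaxScaler transformer does this, and its feature_range hyperparameter changes the target range.
- Standardization: the mean is subtracted from every value (so standardized values always have a zero mean), and the result is divided by the standard deviation, so that the resulting distribution has unit variance. Standardization does not bound values to a specific range, and it is much less affected by outliers. Scikit-Learn's StandardScaler transformer implements it.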
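Both scaling rules fit in a few lines of plain Python. This is a minimal sketch with made-up values and hypothetical function names, not the Scikit-Learn scalers themselves; like StandardScaler, it uses the population standard deviation.

```python
import math


def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so they span feature_range."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    m = len(values)
    mean = sum(values) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / m)
    return [(v - mean) / std for v in values]


# Hypothetical feature with one outlier (100.0).
rooms = [2.0, 4.0, 6.0, 8.0, 100.0]

scaled = min_max_scale(rooms)
standardized = standardize(rooms)

print(scaled)        # the outlier squeezes the other values close to 0
print(standardized)  # zero mean, unit variance, no fixed bounds
```

With min-max scaling the single outlier pushes the four ordinary values into less than a tenth of the 0 to 1 range, which is the sensitivity to outliers mentioned above.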
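6. Pipelines

A pipeline applies several data transformations in a fixed order. Scikit-Learn's Pipeline class supports such sequences:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("attribs_adder", CombinedAttributesAdder()),
        ("std_scaler", StandardScaler()),
    ])

    housing_num_tr = num_pipeline.fit_transform(housing_num)

- The Pipeline constructor takes a list of name/estimator pairs that defines a sequence of steps. All steps except the last must be transformers (they must implement fit_transform()); the last step can be any estimator.
- Calling the pipeline's fit() method calls fit_transform() on each transformer in turn, passing the output of each call as the input to the next, until it reaches the final estimator, for which it calls fit().

Handle all columns with Scikit-Learn's ColumnTransformer:

    from sklearn.compose import ColumnTransformer

    num_attribs = list(housing_num)
    cat_attribs = ["ocean_proximity"]

    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])
    housing_prepared = full_pipeline.fit_transform(housing)

- ColumnTransformer takes lists of column names, applies each transformer to the specified columns, and concatenates the outputs along the second axis. All transformers must return the same number of rows.
- When sparse and dense matrices are mixed, ColumnTransformer estimates the density of the final matrix (the ratio of nonzero cells) and returns a sparse matrix if the density is below a threshold (sparse_threshold, 0.3 by default).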
5. Selecting and training a model

1. Training and evaluating on the training set

Train a linear regression model:

    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)

Try it out on a few instances from the training set:

    some_data = housing.iloc[:5]
    some_labels = housing_labels.iloc[:5]
    some_data_prepared = full_pipeline.transform(some_data)
    print("Predictions:", lin_reg.predict(some_data_prepared))
    print("Labels:", list(some_labels))

    Predictions: [ 86208. 304704. 153536. 185728. 244416.]
    Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]

Measure the regression model's RMSE on the training set, using Scikit-Learn's mean_squared_error():

    from sklearn.metrics import mean_squared_error

    housing_predictions = lin_reg.predict(housing_prepared)
    lin_mse = mean_squared_error(housing_labels, housing_predictions)
    lin_rmse = np.sqrt(lin_mse)
    print(lin_rmse)

    68633.40810776998

The typical prediction error is 68,633 dollars, while most median_house_value values lie between 120,000 and 265,000 dollars. An error this large means the model underfits the training data. The options at this point are to select a more powerful model, to feed the training algorithm better features, or to reduce the constraints on the model.

Train a decision tree with DecisionTreeRegressor, which can find complex nonlinear relationships in the data:

    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor()
    tree_reg.fit(housing_prepared, housing_labels)

Measure its RMSE on the training set:

    housing_predictions = tree_reg.predict(housing_prepared)
    tree_mse = mean_squared_error(housing_labels, housing_predictions)
    tree_rmse = np.sqrt(tree_mse)
    print(tree_rmse)

    0.0

An error of 0.0 means either that the model is absolutely perfect (which is not plausible) or that it has badly overfit the data.
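2. Cross-validation

Evaluate the decision tree model with K-fold cross-validation, using Scikit-Learn's cross_val_score:

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-scores)

The scores are the negative of the MSE because cross-validation expects a utility function (greater is better); np.sqrt(-scores) therefore yields the RMSE.

    def display_scores(scores):
        print("Scores:", scores)
        print("Mean:", scores.mean())
        print("Standard deviation:", scores.std())

    display_scores(tree_rmse_scores)

    Scores: [73444.02930862 69237.91537492 67003.65412022 71810.57760783
     70631.08058123 77465.52053272 70962.67507776 73613.93631416
     68442.91744801 72364.26672416]
    Mean: 71497.65730896383
    Standard deviation: 2835.532019536459

On the validation folds the decision tree has a mean RMSE of 71,497 (versus 0 on the training set), with a standard deviation of 2,835.

Cross-validate the linear regression model:

    lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                                 scoring="neg_mean_squared_error", cv=10)
    lin_rmse_scores = np.sqrt(-lin_scores)
    display_scores(lin_rmse_scores)

    Scores: [71800.38078269 64114.99166359 67844.95431254 68635.19072082
     66801.98038821 72531.04505346 73992.85834976 68824.54092094
     66474.60750419 70143.79750458]
    Mean: 69116.4347200802
    Standard deviation: 2880.6588594759014

On the validation folds the linear regression model has a mean RMSE of 69,116 (versus 68,633 on the training set), with a standard deviation of 2,880. The decision tree's RMSE is even higher, and therefore worse, than the linear model's, which confirms that the tree overfits severely.

Train a random forest with RandomForestRegressor. A random forest trains many decision trees on random subsets of the features and averages their predictions. Building a model on top of many other models is called ensemble learning.

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor()
    forest_reg.fit(housing_prepared, housing_labels)
    housing_predictions = forest_reg.predict(housing_prepared)
    forest_mse = mean_squared_error(housing_labels, housing_predictions)
    forest_rmse = np.sqrt(forest_mse)
    print(forest_rmse)

    forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                    scoring="neg_mean_squared_error", cv=10)
    forest_rmse_scores = np.sqrt(-forest_scores)
    display_scores(forest_rmse_scores)

    18580.285001969234
    Scores: [51420.10657898 48950.26905778 46724.70163181 52032.16751813
     47382.48485738 51644.10218989 52532.85241798 50040.96772226
     48869.83863791 53727.35461654]
    Mean: 50332.484522865096
    Standard deviation: 2191.1726721020977

The RMSE is 18,580 on the training set and 50,332 on the validation folds, with a standard deviation of 2,191. This is much better than the two previous models, but the training score is still far lower than the validation score, so the model is still overfitting. Before simplifying or constraining the model, try several other machine learning algorithms (support vector machines with different kernels, neural networks, and so on). The goal is to shortlist a few promising models without spending too much time tuning hyperparameters.

Save the model:

    import joblib

    joblib.dump(forest_reg, "./workspace/models/forest_reg.pkl")
    # and later, reload the model
    forest_reg_loaded = joblib.load("./workspace/models/forest_reg.pkl")

6. Fine-tuning the model

Once a few promising candidate models have been shortlisted, they can be fine-tuned.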
1. Grid search

Given the hyperparameters to experiment with and the values to try, Scikit-Learn's GridSearchCV uses cross-validation to evaluate every possible combination of hyperparameter values and finds the best one:

    from sklearn.model_selection import GridSearchCV

    param_grid = [
        {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
        {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
    ]
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring="neg_mean_squared_error",
                               return_train_score=True)
    grid_search.fit(housing_prepared, housing_labels)

- param_grid: the hyperparameter grid.
    - n_estimators and max_features (hyperparameter names): 3 × 4 = 12 combinations.
    - bootstrap, n_estimators, and max_features (hyperparameter names): 1 × 2 × 3 = 6 combinations.
- cv: each of the 18 combinations is trained 5 times (5-fold cross-validation), for 90 training rounds in total.
- refit=True (the default): once GridSearchCV has found the best estimator by cross-validation, it retrains it on the whole training set. More data is likely to improve its performance.

Show the result of the grid search:

    grid_search.best_params_

    {'max_features': 6, 'n_estimators': 30}

Get the best estimator:

    grid_search.best_estimator_

Show the evaluation score of every combination:

    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)

    63475.5397459137 {'max_features': 2, 'n_estimators': 3}
    55754.473565553184 {'max_features': 2, 'n_estimators': 10}
    52830.64714547093 {'max_features': 2, 'n_estimators': 30}
    60296.33920014068 {'max_features': 4, 'n_estimators': 3}
    52504.03498357088 {'max_features': 4, 'n_estimators': 10}
    50328.7606181505 {'max_features': 4, 'n_estimators': 30}
    59328.255990059035 {'max_features': 6, 'n_estimators': 3}
    51909.34406264884 {'max_features': 6, 'n_estimators': 10}
    49802.234477838996 {'max_features': 6, 'n_estimators': 30}
    58997.87515871176 {'max_features': 8, 'n_estimators': 3}
    52036.752607340735 {'max_features': 8, 'n_estimators': 10}
    50321.971231209965 {'max_features': 8, 'n_estimators': 30}
    62389.547952235145 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
    53800.36505088281 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
    59953.45347364427 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
    52115.46931655621 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
    59061.9294179386 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
    52197.755732390906 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

The best estimator (max_features: 6, n_estimators: 30) reaches an RMSE of 49,802, slightly better than the 50,332 obtained with the default hyperparameters, so the model has improved.

The data preparation steps can also be treated as hyperparameters (controlling outlier handling, missing features, feature selection, and so on) and included in the grid search, so that the best solution to the problem is explored automatically.
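The count of 18 combinations and 90 training rounds can be reproduced by expanding the grid by hand. The sketch below uses only `itertools`; `expand_grid` is a hypothetical helper that illustrates the enumeration, not GridSearchCV's own code.

```python
from itertools import product

# Same grid as in the GridSearchCV call above.
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]


def expand_grid(grid):
    """Return every hyperparameter combination described by the grid."""
    combos = []
    for block in grid:
        names = sorted(block)
        # Cartesian product of the value lists within one dictionary.
        for values in product(*(block[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos


combinations = expand_grid(param_grid)
cv_folds = 5
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)
```

Each dictionary in the list is expanded independently, so the totals add (12 + 6) rather than multiply. The cost of a grid search grows with the product of the list lengths inside each dictionary, which is the main reason to switch to randomized search for larger search spaces.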
2. Randomized search

Scikit-Learn's RandomizedSearchCV works much like GridSearchCV, but at every iteration it picks a random value for each hyperparameter and evaluates a given number of such random combinations.

- Running the search for more iterations explores different values of every hyperparameter each time, instead of the few fixed values per hyperparameter that a grid search is limited to.
- The number of iterations gives simple, direct control over the computing budget allocated to the hyperparameter search.

Evaluate a support vector machine regressor with RandomizedSearchCV:

    from sklearn.svm import SVR
    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import expon, reciprocal

    # see https://docs.scipy.org/doc/scipy/reference/stats.html
    # for `expon()` and `reciprocal()` documentation and more probability distribution functions.

    # Note: gamma is ignored when kernel is "linear"
    param_distribs = {
        "kernel": ["linear", "rbf"],
        "C": reciprocal(20, 200000),
        "gamma": expon(scale=1.0),
    }

    svm_reg = SVR()
    rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                    n_iter=50, cv=5, scoring="neg_mean_squared_error",
                                    verbose=2, random_state=42)
    rnd_search.fit(housing_prepared, housing_labels)

    Fitting 5 folds for each of 50 candidates, totalling 250 fits
    [CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time= 3.3s
    [CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time= 3.3s
    [CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time= 3.2s
    [CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time= 3.2s
    [CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time= 3.2s
    …

    negative_mse = rnd_search.best_score_
    rmse = np.sqrt(-negative_mse)
    print(rmse)

    54767.960710084146

    print(rnd_search.best_params_)

    {'C': 157055.10989448498, 'gamma': 0.26497040005002437, 'kernel': 'rbf'}

The randomized search found a good set of hyperparameters for the support vector machine regressor, with a final RMSE of 54,767.
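To show what "picking a random value for each hyperparameter" amounts to, here is a minimal sketch of the sampling step only, using the standard `random` module. It assumes that `reciprocal(20, 200000)` is a log-uniform distribution and that `expon(scale=1.0)` is an exponential distribution with mean 1; `sample_log_uniform` and `sample_candidates` are hypothetical helpers, and the numbers drawn differ from SciPy's.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_candidates(n_iter, seed=42):
    """Draw n_iter random hyperparameter combinations for an SVR-like model."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        candidates.append({
            "kernel": rng.choice(["linear", "rbf"]),
            "C": sample_log_uniform(rng, 20, 200000),
            "gamma": rng.expovariate(1.0),  # exponential with mean 1
        })
    return candidates


candidates = sample_candidates(50)

# A log-uniform draw spends equal probability on each order of magnitude.
small_c = sum(1 for c in candidates if c["C"] < 2000)
print(len(candidates), small_c, candidates[0])
```

A log-uniform distribution suits hyperparameters such as C whose useful scale is unknown, because values between 20 and 200 are as likely to be tried as values between 20,000 and 200,000.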
3. Ensemble methods

Combining the best-performing models often works better than the best individual model (just as a random forest outperforms the individual decision trees it relies on), especially when the individual models make different types of errors.

4. Model errors

Show the relative importance of each attribute:

    feature_importances = grid_search.best_estimator_.feature_importances_
    print(feature_importances)

    array([8.30181927e-02, 7.09849240e-02, 4.24425223e-02, 1.76691115e-02,
           1.61540923e-02, 1.71789859e-02, 1.59395934e-02, 3.39837758e-01,
           6.50843504e-02, 1.04717194e-01, 6.48945156e-02, 1.47186585e-02,
           1.38881431e-01, 6.76526692e-05, 3.02499407e-03, 5.38602332e-03])

    extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
    cat_encoder = full_pipeline.named_transformers_["cat"]
    cat_one_hot_attribs = list(cat_encoder.categories_[0])
    attributes = num_attribs + extra_attribs + cat_one_hot_attribs
    sorted(zip(feature_importances, attributes), reverse=True)

    [(0.3398377582278221, 'median_income'),
     (0.13888143088401578, 'INLAND'),
     (0.10471719429817675, 'pop_per_hhold'),
     (0.0830181926813895, 'longitude'),
     (0.07098492396156919, 'latitude'),
     (0.06508435039879204, 'rooms_per_hhold'),
     (0.06489451561779028, 'bedrooms_per_room'),
     (0.042442522257867, 'housing_median_age'),
     (0.017669111520336293, 'total_rooms'),
     (0.017178985883288055, 'population'),
     (0.016154092256827887, 'total_bedrooms'),
     (0.015939593408818325, 'households'),
     (0.0147186585483286, '<1H OCEAN'),
     (0.005386023320075893, 'NEAR OCEAN'),
     (0.0030249940656810405, 'NEAR BAY'),
     (6.765266922142473e-05, 'ISLAND')]

- Consider dropping some of the less useful features. In this example only one ocean_proximity category (INLAND) is really useful, so the others could be dropped.
- The model can also be improved by adding extra features, removing uninformative ones, cleaning up outliers, and so on.
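5. Evaluating the system on the test set

Evaluate the final model on the test set:

    final_model = grid_search.best_estimator_

    X_test = strat_test_set.drop("median_house_value", axis=1)
    y_test = strat_test_set["median_house_value"].copy()

    # Only transform the test set; never fit the pipeline on it
    X_test_prepared = full_pipeline.transform(X_test)
    final_predictions = final_model.predict(X_test_prepared)

    final_mse = mean_squared_error(y_test, final_predictions)
    final_rmse = np.sqrt(final_mse)
    print(final_rmse)

    47785.02562107877

Compute a 95% confidence interval for the generalization error with scipy.stats.t.interval():

    from scipy import stats

    confidence = 0.95
    squared_errors = (final_predictions - y_test) ** 2
    np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                             loc=squared_errors.mean(),
                             scale=stats.sem(squared_errors)))

    array([45805.04012754, 49686.17157851])

- After extensive hyperparameter tuning, performance on the test set is usually slightly worse than what was measured with cross-validation. Resist the temptation to tweak the hyperparameters further to make the test set numbers look good: such improvements are unlikely to generalize to new data.
- The system's final performance may not be better than the experts' estimates (it may be, say, about 20% off), but that does not make the effort pointless. The machine learning system still provides useful information and takes some of the workload off the experts.
- Evaluating the model on specific subsets of the test set (for example inland districts, or districts near the ocean) reveals its strengths and weaknesses.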
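The confidence interval computation can be approximated without SciPy. The sketch below swaps the Student t interval for a normal (z) interval, which is an assumption that is reasonable only for large samples such as a test set of a few thousand districts. The error values are synthetic, and `rmse_confidence_interval` is a hypothetical helper.

```python
import math
from statistics import NormalDist, mean, stdev


def rmse_confidence_interval(errors, confidence=0.95):
    """Approximate confidence interval for the RMSE from per-instance errors."""
    squared = [e * e for e in errors]
    center = mean(squared)
    # Standard error of the mean of the squared errors (sample std, ddof=1).
    sem = stdev(squared) / math.sqrt(len(squared))
    # Normal quantile instead of the t quantile: a large-sample approximation.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low = max(center - z * sem, 0.0)
    high = center + z * sem
    # The interval is built on the MSE, then mapped to the RMSE scale.
    return math.sqrt(low), math.sqrt(high)


# Synthetic prediction errors in dollars, spread between -50000 and +50000.
errors = [((i * 37) % 101 - 50) * 1000.0 for i in range(2000)]

point_estimate = math.sqrt(mean(e * e for e in errors))
low, high = rmse_confidence_interval(errors)
print(low, point_estimate, high)
```

The interval is computed around the mean squared error and only then square-rooted, which is why it is not symmetric around the RMSE point estimate.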
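7. Deployment, monitoring, and maintenance

1. Deployment

Expose the model through a REST API:

- Serialize the trained Scikit-Learn model with joblib. The saved model contains the full preprocessing and prediction pipeline.
- In production, a web service loads the model and exposes an endpoint that calls the model's predict function.
- A web app can sit in front of the model service and interact with it: it accepts new input data, post-processes the predictions, and serves the results to desktop and mobile users.

Deploy on Google Cloud AI Platform:

- Upload the joblib-serialized model to Google Cloud Storage (GCS).
- On Google Cloud AI Platform, create a new model version that points to the model file on GCS.
- Google Cloud AI Platform then provides a simple web service directly, similar to the model service described above.

2. Monitoring

Objective: write monitoring code that checks the system's live performance at regular intervals and triggers an alert when it drops.

What to watch for:

- A broken component in the infrastructure can cause a steep drop in performance.
- A gentle decay in performance can go unnoticed for a long time.
- The world changes: a trained model may stop fitting the incoming data after a while.

How to evaluate:

- Infer the model's performance from downstream metrics. For a recommender system, for example, comparing the number of orders generated with and without recommendations indicates how well it works.
- Bring human analysis into the evaluation, for example by having experts, non-experts, or workers on a crowdsourcing platform label data. Google's CAPTCHAs also serve to label training data.
- Monitor the quality of the model's input data, for example by comparing the mean and standard deviation of incoming data with those of the training set, or by watching for new categories in categorical features. This can reveal the cause of a performance drop before it shows up.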
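The input-quality checks mentioned above can be sketched in a few lines of plain Python. Everything here is hypothetical: the helper names, the threshold of half a standard deviation, the sample values, and the "LAKESIDE" category are illustrative choices, not part of any monitoring library.

```python
from statistics import mean, stdev


def summarize(values):
    """Record the statistics of a training feature for later comparison."""
    return {"mean": mean(values), "std": stdev(values)}


def mean_shift_alert(train_stats, live_values, max_shift_in_std=0.5):
    """Return an alert message if the live mean drifts too far, else None."""
    shift = abs(mean(live_values) - train_stats["mean"]) / train_stats["std"]
    if shift > max_shift_in_std:
        return f"mean shifted by {shift:.2f} training standard deviations"
    return None


def unseen_categories(train_categories, live_categories):
    """List categories that appear in live data but not in the training set."""
    return sorted(set(live_categories) - set(train_categories))


# Statistics captured at training time (hypothetical median_income values).
train_stats = summarize([2.0, 3.0, 4.0, 5.0, 6.0])

alert_ok = mean_shift_alert(train_stats, [3.5, 4.5, 4.0])
alert_drift = mean_shift_alert(train_stats, [7.0, 8.0, 9.0])
new_cats = unseen_categories(["INLAND", "NEAR BAY"], ["INLAND", "LAKESIDE"])

print(alert_ok, alert_drift, new_cats)
```

Checks like these run on the inputs alone, so they can raise an alert before any labels are available to measure the model's actual error.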
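3. Maintenance

The best practice for system maintenance is to automate the whole process. Maintenance involves the following tasks:

- Collect fresh data regularly and label it (manually when necessary).
- Write a script that trains the model and fine-tunes the hyperparameters automatically, and run it on a schedule that matches your needs.
- Write a script that evaluates both the new model and the old model on the updated test set, compares their performance, and decides whether the new model should replace the old one in production.
- Keep every version of the model so that you can roll back quickly.
- Keep every version of the dataset so that you can roll back if a new dataset turns out to be corrupted (for example, if outliers were added to it), and so that other models can be evaluated against earlier datasets.

Machine learning involves a lot of infrastructure work. It is normal for a first machine learning project to take considerable time and effort to build and deploy these components. Once the process is in place, launching and iterating on later model services becomes much easier. Readers are encouraged to pick a suitable target from a competition website such as Kaggle and run through the whole process end to end.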
8. Available data sources

Popular open data repositories:

- UC Irvine Machine Learning Repository
- Kaggle datasets
- Amazon's AWS datasets

Meta portals (they list open data repositories):

- Data Portals
- OpenDataMonitor
- Quandl

Other pages listing many popular open data repositories:

- Wikipedia's list of Machine Learning datasets
- Quora.com
- The datasets subreddit