探索性分析
首先读入数据housing。将数据导入到data变量中,观察data变量的结构。
str(data)
## 'data.frame': 2919 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
## $ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
## $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
## $ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
## $ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
## $ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
从上面可以观察到,变量主要分为数值型变量和因子型变量。
bianliang <- sapply(data,class)
table(bianliang)
## bianliang
## factor integer
## 43 38
#对每列就是变量使用class函数然后使用table函数输出类型
数据中有43个因子型变量和38个整数型变量。
变量处理
通过对数据观察发现,存在许多缺失值,首先处理数据的缺失值。
queshizhi<-sapply(data,function(x) sum(is.na(x)))
#使用r语言中的function函数计算为na的个数
# 按照缺失的大小进行排名
queshi <-sort(queshizhi,decreasing = T)
queshi[queshi>0]
## PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
## 2909 2814 2721 2348 1459 1420
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond GarageType
## 486 159 159 159 159 157
## BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1 MasVnrType
## 82 82 81 80 79 24
## MasVnrArea MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 23 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
我们进一步看这些缺失的变量的情况。
summary(data[,names(queshi)[queshi>0]])
## PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
## Ex : 4 Gar2: 5 Grvl: 120 GdPrv: 118 Min. : 34900 Ex : 43
## Fa : 2 Othr: 4 Pave: 78 GdWo : 112 1st Qu.:129975 Fa : 74
## Gd : 4 Shed: 95 NA's:2721 MnPrv: 329 Median :163000 Gd : 744
## NA's:2909 TenC: 1 MnWw : 12 Mean :180921 Po : 46
## NA's:2814 NA's :2348 3rd Qu.:214000 TA : 592
## Max. :755000 NA's:1420
## NA's :1459
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond
## Min. : 21.00 Min. :1895 Fin : 719 Ex : 3 Ex : 3
## 1st Qu.: 59.00 1st Qu.:1960 RFn : 811 Fa : 124 Fa : 74
## Median : 68.00 Median :1979 Unf :1230 Gd : 24 Gd : 15
## Mean : 69.31 Mean :1978 NA's: 159 Po : 5 Po : 14
## 3rd Qu.: 80.00 3rd Qu.:2002 TA :2604 TA :2654
## Max. :313.00 Max. :2207 NA's: 159 NA's: 159
## NA's :486 NA's :159
## GarageType BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1
## 2Types : 23 Fa : 104 Av : 418 Ex : 258 ALQ : 52 ALQ :429
## Attchd :1723 Gd : 122 Gd : 276 Fa : 88 BLQ : 68 BLQ :269
## Basment: 36 Po : 5 Mn : 239 Gd :1209 GLQ : 34 GLQ :849
## BuiltIn: 186 TA :2606 No :1904 TA :1283 LwQ : 87 LwQ :154
## CarPort: 15 NA's: 82 NA's: 82 NA's: 81 Rec : 105 Rec :288
## Detchd : 779 Unf :2493 Unf :851
## NA's : 157 NA's: 80 NA's: 79
## MasVnrType MasVnrArea MSZoning Utilities BsmtFullBath
## BrkCmn : 25 Min. : 0.0 C (all): 25 AllPub:2916 Min. :0.0000
## BrkFace: 879 1st Qu.: 0.0 FV : 139 NoSeWa: 1 1st Qu.:0.0000
## None :1742 Median : 0.0 RH : 26 NA's : 2 Median :0.0000
## Stone : 249 Mean : 102.2 RL :2265 Mean :0.4299
## NA's : 24 3rd Qu.: 164.0 RM : 460 3rd Qu.:1.0000
## Max. :1600.0 NA's : 4 Max. :3.0000
## NA's :23 NA's :2
## BsmtHalfBath Functional Exterior1st Exterior2nd
## Min. :0.00000 Typ :2717 VinylSd:1025 VinylSd:1014
## 1st Qu.:0.00000 Min2 : 70 MetalSd: 450 MetalSd: 447
## Median :0.00000 Min1 : 65 HdBoard: 442 HdBoard: 406
## Mean :0.06136 Mod : 35 Wd Sdng: 411 Wd Sdng: 391
## 3rd Qu.:0.00000 Maj1 : 19 Plywood: 221 Plywood: 270
## Max. :2.00000 (Other): 11 (Other): 369 (Other): 390
## NA's :2 NA's : 2 NA's : 1 NA's : 1
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 220.0 1st Qu.: 793.0
## Median : 368.5 Median : 0.00 Median : 467.0 Median : 989.5
## Mean : 441.4 Mean : 49.58 Mean : 560.8 Mean :1051.8
## 3rd Qu.: 733.0 3rd Qu.: 0.00 3rd Qu.: 805.5 3rd Qu.:1302.0
## Max. :5644.0 Max. :1526.00 Max. :2336.0 Max. :6110.0
## NA's :1 NA's :1 NA's :1 NA's :1
## Electrical KitchenQual GarageCars GarageArea SaleType
## FuseA: 188 Ex : 205 Min. :0.000 Min. : 0.0 WD :2525
## FuseF: 50 Fa : 70 1st Qu.:1.000 1st Qu.: 320.0 New : 239
## FuseP: 8 Gd :1151 Median :2.000 Median : 480.0 COD : 87
## Mix : 1 TA :1492 Mean :1.767 Mean : 472.9 ConLD : 26
## SBrkr:2671 NA's: 1 3rd Qu.:2.000 3rd Qu.: 576.0 CWD : 12
## NA's : 1 Max. :5.000 Max. :1488.0 (Other): 29
## NA's :1 NA's :1 NA's : 1
对于缺失数据很多的PoolQC、MiscFeature、Alley、Fence、FireplaceQu这些变量,无法进行插值补充,我们直接去除这些变量。
# 去除如下变量
quchu <- names(data) %in% c("PoolQC","MiscFeature","Alley","Fence","FireplaceQu")
data <- data[!quchu]
#使用逻辑运算符提取出变量然后去除相对应的列
去除了NA值较多的变量之后,对Garage系列的变量和Bsmt系列的变量通过网上搜索发现这是车库和地下室的相关数据,对这些NA值我们使用NONE来替代缺失值。
通过查询描述文件得知,GarrageYrBLt为车库建造年份,使用房子的建造年份代替。
data$GarageYrBlt[is.na(data$GarageYrBlt)] <- data$YearBuilt[is.na(data$GarageYrBlt)]
补全缺失值
对LotFrontage是房屋到街道的距离,用中位数来填充缺失。
data$LotFrontage[is.na(data$LotFrontage)] <- median(data$LotFrontage, na.rm = T)
#使用中位数补全
MasVnrType是外墙的装饰材料,对售价的影响不大。用NONE补充
data[["MasVnrType"]][is.na(data[["MasVnrType"]])] = "None"
MasVnrArea是外墙装饰材料的面积,用数值0来填充。
data[["MasVnrArea"]][is.na(data[["MasVnrArea"]])] <- 0
Utilities 没有分析的意义,直接去除
data$Utilities <- NULL
变量 BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF GarageCars GarageArea 是和车库和地下室相关的变量,是数值型变量,补充为0就可以。
drop <- c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","GarageCars","GarageArea")
for (x in drop ) data[[x]][is.na(data[[x]])] <- 0
#使用简单的循环进行数据清洗
MSZoning,Functional,Exterior1st,Exterior2nd,KitchenQual,Electrical,SaleType,这些是因子型变量,并且缺失值很少,用最高频率的因子来替代即可。
tidai <- c("MSZoning","Functional","Exterior1st","Exterior2nd","KitchenQual","Electrical","SaleType")
for (x in tidai ) data[[x]][is.na(data[[x]])] <- levels(data[[x]])[which.max(table(data[[x]]))]
# 通过SalePrice是否为空来区分训练集和测试集
train <- data[!is.na(data$SalePrice), ]
test <- data[is.na(data$SalePrice), ]
建立模型
数据中自变量很多,根据常识我们要从自变量中选出对房价影响最大的因素。我们可以首先人工筛选出一些对房价影响大的因素,然后再添加新的变量来看是否对模型会有改善。 在国内的话,房价的主要影响因素有房子面积、房子所在的区域、小区等,房龄、房型(小高层、多层、别墅等)、特殊场景(地铁房、学区房等)、装修等也会影响价格。对美国房价来说其实也差不多。因此我们选择如下的变量作为观测。
- LotArea 房子的面积
- Neighborhood 城市街区 用来初步代替 区域、小区
- Condition1 Condition2 附近的交通情况
- BldgType 房屋类型 独栋别墅、联排别墅
- HouseStyle 房子的层数
- YearBuilt 房子建造的年份
- YearRemodAdd: 房子的改造年份
- OverallQual: 房子整体质量,考量材料和完成度
- OverallCond:房子整体条件
观察变量关系
写出函数来观察因子型变量和数值型变量的分布图。
# 加载库
library(ggplot2)
library(Rmisc)
## 载入需要的程辑包:lattice
## 载入需要的程辑包:plyr
# 将对于因子变量画图
plot2_factor <- function(var_name){
plots <- list()
plots[[1]] <- ggplot(train, aes_string(x = var_name, fill = var_name) ) +
geom_bar() +
guides(fill = FALSE) +
ggtitle(paste("count of ", var_name)) +
theme(axis.text.x = element_text(angle = 90, hjust =1))
plots[[2]] <- ggplot(train, aes_string(x = var_name, y = "SalePrice", fill = var_name) ) +
geom_boxplot() +
guides(fill = FALSE) +
ggtitle(paste( var_name, " vs SalePrice")) +
theme(axis.text.x = element_text(angle = 90, hjust =1))
multiplot(plotlist = plots, cols = 2)
}
# 对于连续数字变量画图
plot2_number <- function(var_name){
plots <- list()
plots[[1]] <- ggplot(train, aes_string(x = var_name) ) +
geom_histogram() +
ggtitle(paste("count of ", var_name))
plots[[2]] <- ggplot(train, aes_string(x = var_name, y = "SalePrice") ) +
geom_point() +
ggtitle(paste( var_name, " vs SalePrice"))
multiplot(plotlist = plots, cols = 2)
}
#构建两个绘图函数,分别可以直接输出变量名画出相对应的图像,分别是因子型变量和数值型变量。
首先观察街区和房间的关系图。
plot2_factor("Neighborhood")
通过上图可以看出不同的社区,房价差异很大,因此这个变量应该是影响比较大的。
plot2_number("YearBuilt")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
通过对建筑年限这一数值型变量进行绘图研究看出,售价和建筑年限也有强烈的线性关系,说明该变量是有意义的。
plot2_number("OverallQual")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
从上图看出装修越好的房子价格越高。
各个变量之间的相关性
library(corrgram)
##
## 载入程辑包:'corrgram'
## The following object is masked from 'package:plyr':
##
## baseball
## The following object is masked from 'package:lattice':
##
## panel.fill
sel <- c("LotArea","Neighborhood","BldgType","HouseStyle","YearBuilt","YearRemodAdd","OverallQual","OverallCond","MSZoning")
corrgram(train[,sel], order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt)
训练线性模型
tezheng <- SalePrice ~ LotArea + Neighborhood + BldgType + HouseStyle + YearBuilt + YearRemodAdd + OverallQual + OverallCond
# 训练模型
lm1 <- lm(tezheng, train)
# 查看模型概要
summary(lm1)
##
## Call:
## lm(formula = tezheng, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -208970 -20882 -2917 15544 351199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.455e+06 1.850e+05 -7.862 7.42e-15 ***
## LotArea 1.084e+00 1.156e-01 9.375 < 2e-16 ***
## NeighborhoodBlueste -1.068e+03 2.953e+04 -0.036 0.971141
## NeighborhoodBrDale -1.440e+04 1.518e+04 -0.949 0.342806
## NeighborhoodBrkSide -1.876e+04 1.278e+04 -1.468 0.142460
## NeighborhoodClearCr -2.352e+03 1.332e+04 -0.177 0.859842
## NeighborhoodCollgCr -2.917e+04 1.086e+04 -2.685 0.007335 **
## NeighborhoodCrawfor 1.747e+04 1.246e+04 1.402 0.161225
## NeighborhoodEdwards -2.813e+04 1.165e+04 -2.414 0.015924 *
## NeighborhoodGilbert -4.030e+04 1.157e+04 -3.484 0.000508 ***
## NeighborhoodIDOTRR -3.357e+04 1.343e+04 -2.499 0.012570 *
## NeighborhoodMeadowV 1.338e+04 1.446e+04 0.925 0.354867
## NeighborhoodMitchel -2.819e+04 1.196e+04 -2.356 0.018617 *
## NeighborhoodNAmes -2.202e+04 1.130e+04 -1.950 0.051426 .
## NeighborhoodNoRidge 6.105e+04 1.226e+04 4.980 7.13e-07 ***
## NeighborhoodNPkVill 6.340e+03 1.650e+04 0.384 0.700928
## NeighborhoodNridgHt 4.876e+04 1.104e+04 4.417 1.08e-05 ***
## NeighborhoodNWAmes -2.126e+04 1.166e+04 -1.823 0.068457 .
## NeighborhoodOldTown -2.915e+04 1.243e+04 -2.344 0.019194 *
## NeighborhoodSawyer -2.575e+04 1.188e+04 -2.168 0.030350 *
## NeighborhoodSawyerW -2.224e+04 1.154e+04 -1.927 0.054234 .
## NeighborhoodSomerst -1.228e+04 1.093e+04 -1.123 0.261764
## NeighborhoodStoneBr 5.984e+04 1.249e+04 4.790 1.84e-06 ***
## NeighborhoodSWISU -2.365e+04 1.433e+04 -1.651 0.099024 .
## NeighborhoodTimber -1.326e+04 1.236e+04 -1.073 0.283489
## NeighborhoodVeenker 2.303e+04 1.555e+04 1.481 0.138905
## BldgType2fmCon 1.230e+03 7.413e+03 0.166 0.868218
## BldgTypeDuplex -7.231e+02 5.831e+03 -0.124 0.901330
## BldgTypeTwnhs -6.675e+04 7.811e+03 -8.546 < 2e-16 ***
## BldgTypeTwnhsE -4.916e+04 4.892e+03 -10.049 < 2e-16 ***
## HouseStyle1.5Unf -2.835e+04 1.102e+04 -2.573 0.010184 *
## HouseStyle1Story -3.981e+03 3.977e+03 -1.001 0.316972
## HouseStyle2.5Fin 5.328e+04 1.472e+04 3.619 0.000306 ***
## HouseStyle2.5Unf -4.606e+03 1.250e+04 -0.368 0.712613
## HouseStyle2Story 4.069e+03 4.205e+03 0.968 0.333393
## HouseStyleSFoyer -1.173e+04 7.791e+03 -1.505 0.132424
## HouseStyleSLvl -6.438e+03 6.197e+03 -1.039 0.299077
## YearBuilt 4.285e+02 8.428e+01 5.084 4.20e-07 ***
## YearRemodAdd 3.114e+02 7.505e+01 4.149 3.53e-05 ***
## OverallQual 2.849e+04 1.187e+03 24.010 < 2e-16 ***
## OverallCond 1.613e+03 1.150e+03 1.402 0.161035
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38880 on 1419 degrees of freedom
## Multiple R-squared: 0.7671, Adjusted R-squared: 0.7605
## F-statistic: 116.8 on 40 and 1419 DF, p-value: < 2.2e-16
模型中部分特征没有显著,但是模型整体的F检验通过,说明该模型还是可以的。模型调整后的\(R^2=0.7605\)效果还不错。
变量选择
首先进行人工变量选择,去除不显著的变量。去掉OverallCond,重新进行拟合
# 初步决定的 lm.base 模型的变量
fm.base <- SalePrice ~ LotArea + Neighborhood + BldgType + HouseStyle + YearBuilt + YearRemodAdd + OverallQual
# 训练模型
lm.base <- lm(fm.base, train)
summary(lm.base)
##
## Call:
## lm(formula = fm.base, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -209817 -20321 -2822 15210 352218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.464e+06 1.850e+05 -7.914 4.98e-15 ***
## LotArea 1.081e+00 1.156e-01 9.352 < 2e-16 ***
## NeighborhoodBlueste 7.266e+02 2.951e+04 0.025 0.980358
## NeighborhoodBrDale -1.362e+04 1.517e+04 -0.898 0.369408
## NeighborhoodBrkSide -1.798e+04 1.277e+04 -1.408 0.159440
## NeighborhoodClearCr -1.805e+03 1.331e+04 -0.136 0.892207
## NeighborhoodCollgCr -2.898e+04 1.087e+04 -2.667 0.007744 **
## NeighborhoodCrawfor 1.883e+04 1.243e+04 1.516 0.129852
## NeighborhoodEdwards -2.782e+04 1.166e+04 -2.387 0.017109 *
## NeighborhoodGilbert -4.028e+04 1.157e+04 -3.481 0.000515 ***
## NeighborhoodIDOTRR -3.359e+04 1.344e+04 -2.499 0.012553 *
## NeighborhoodMeadowV 1.425e+04 1.445e+04 0.986 0.324151
## NeighborhoodMitchel -2.759e+04 1.196e+04 -2.307 0.021210 *
## NeighborhoodNAmes -2.091e+04 1.127e+04 -1.855 0.063807 .
## NeighborhoodNoRidge 6.106e+04 1.226e+04 4.980 7.14e-07 ***
## NeighborhoodNPkVill 7.526e+03 1.649e+04 0.456 0.648117
## NeighborhoodNridgHt 4.836e+04 1.104e+04 4.380 1.27e-05 ***
## NeighborhoodNWAmes -1.993e+04 1.163e+04 -1.714 0.086670 .
## NeighborhoodOldTown -2.849e+04 1.243e+04 -2.292 0.022025 *
## NeighborhoodSawyer -2.475e+04 1.186e+04 -2.086 0.037120 *
## NeighborhoodSawyerW -2.201e+04 1.155e+04 -1.907 0.056764 .
## NeighborhoodSomerst -1.248e+04 1.094e+04 -1.141 0.254229
## NeighborhoodStoneBr 5.963e+04 1.250e+04 4.772 2.01e-06 ***
## NeighborhoodSWISU -2.346e+04 1.433e+04 -1.637 0.101780
## NeighborhoodTimber -1.325e+04 1.236e+04 -1.072 0.283915
## NeighborhoodVeenker 2.475e+04 1.551e+04 1.596 0.110733
## BldgType2fmCon 7.637e+02 7.408e+03 0.103 0.917900
## BldgTypeDuplex -1.566e+03 5.802e+03 -0.270 0.787285
## BldgTypeTwnhs -6.658e+04 7.813e+03 -8.522 < 2e-16 ***
## BldgTypeTwnhsE -4.951e+04 4.888e+03 -10.129 < 2e-16 ***
## HouseStyle1.5Unf -2.800e+04 1.102e+04 -2.541 0.011173 *
## HouseStyle1Story -4.114e+03 3.977e+03 -1.035 0.301043
## HouseStyle2.5Fin 5.310e+04 1.473e+04 3.606 0.000322 ***
## HouseStyle2.5Unf -5.807e+03 1.248e+04 -0.465 0.641658
## HouseStyle2Story 3.910e+03 4.205e+03 0.930 0.352516
## HouseStyleSFoyer -1.110e+04 7.781e+03 -1.427 0.153881
## HouseStyleSLvl -6.096e+03 6.195e+03 -0.984 0.325288
## YearBuilt 3.934e+02 8.052e+01 4.886 1.15e-06 ***
## YearRemodAdd 3.548e+02 6.840e+01 5.186 2.46e-07 ***
## OverallQual 2.863e+04 1.183e+03 24.206 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38890 on 1420 degrees of freedom
## Multiple R-squared: 0.7668, Adjusted R-squared: 0.7603
## F-statistic: 119.7 on 39 and 1420 DF, p-value: < 2.2e-16
发现模型的效果没有显著提升。
变量选择方法
传统的变量选择方法有很多,例如LASSO
,Ridge
等方法,这里我们使用Lasso
,随机森林和梯度下降法来进行变量选择,提升模型性能。
# 安装
library(glmnet)
## 载入需要的程辑包:Matrix
## Loaded glmnet 4.1-1
# 准备数据
formula <- as.formula( log(SalePrice)~ .-Id )
# model.matrix 会自动将分类变量变成哑变量
x <- model.matrix(formula, train)
y <- log(train$SalePrice)
#执行 lasso
set.seed(999)
lm.lasso <- cv.glmnet(x, y, alpha=1)
# 画图
plot(lm.lasso)
# 得到各变量的系数
coef(lm.lasso, s = "lambda.min")
## 242 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 7.314346e+00
## (Intercept) .
## MSSubClass -3.089056e-04
## MSZoningFV 1.793795e-02
## MSZoningRH .
## MSZoningRL 4.132766e-02
## MSZoningRM .
## LotFrontage .
## LotArea 1.272041e-06
## StreetPave 4.921358e-02
## LotShapeIR2 1.071380e-02
## LotShapeIR3 -9.683435e-02
## LotShapeReg .
## LandContourHLS .
## LandContourLow .
## LandContourLvl .
## LotConfigCulDSac 3.543690e-02
## LotConfigFR2 .
## LotConfigFR3 .
## LotConfigInside .
## LandSlopeMod .
## LandSlopeSev .
## NeighborhoodBlueste .
## NeighborhoodBrDale -1.493220e-02
## NeighborhoodBrkSide .
## NeighborhoodClearCr 3.747801e-02
## NeighborhoodCollgCr .
## NeighborhoodCrawfor 1.115702e-01
## NeighborhoodEdwards -4.439143e-02
## NeighborhoodGilbert .
## NeighborhoodIDOTRR -6.746018e-02
## NeighborhoodMeadowV -7.649854e-02
## NeighborhoodMitchel .
## NeighborhoodNAmes .
## NeighborhoodNoRidge 5.440480e-02
## NeighborhoodNPkVill .
## NeighborhoodNridgHt 1.179915e-01
## NeighborhoodNWAmes .
## NeighborhoodOldTown -2.833497e-02
## NeighborhoodSawyer .
## NeighborhoodSawyerW .
## NeighborhoodSomerst 6.349095e-02
## NeighborhoodStoneBr 1.204762e-01
## NeighborhoodSWISU .
## NeighborhoodTimber 2.260192e-03
## NeighborhoodVeenker 2.878460e-02
## Condition1Feedr -1.568548e-02
## Condition1Norm 3.532074e-02
## Condition1PosA .
## Condition1PosN .
## Condition1RRAe -2.434992e-02
## Condition1RRAn .
## Condition1RRNe .
## Condition1RRNn .
## Condition2Feedr .
## Condition2Norm .
## Condition2PosA .
## Condition2PosN -5.253862e-01
## Condition2RRAe .
## Condition2RRAn .
## Condition2RRNn .
## BldgType2fmCon .
## BldgTypeDuplex .
## BldgTypeTwnhs -7.598136e-02
## BldgTypeTwnhsE -1.805548e-02
## HouseStyle1.5Unf .
## HouseStyle1Story .
## HouseStyle2.5Fin .
## HouseStyle2.5Unf .
## HouseStyle2Story .
## HouseStyleSFoyer .
## HouseStyleSLvl .
## OverallQual 6.845347e-02
## OverallCond 3.219936e-02
## YearBuilt 1.274539e-03
## YearRemodAdd 8.437361e-04
## RoofStyleGable -5.642072e-03
## RoofStyleGambrel .
## RoofStyleHip .
## RoofStyleMansard .
## RoofStyleShed .
## RoofMatlCompShg 8.744853e-03
## RoofMatlMembran .
## RoofMatlMetal .
## RoofMatlRoll .
## RoofMatlTar&Grv .
## RoofMatlWdShake .
## RoofMatlWdShngl 7.458154e-02
## Exterior1stAsphShn .
## Exterior1stBrkComm -1.345774e-01
## Exterior1stBrkFace 4.770749e-02
## Exterior1stCBlock .
## Exterior1stCemntBd .
## Exterior1stHdBoard -8.584669e-03
## Exterior1stImStucc .
## Exterior1stMetalSd .
## Exterior1stPlywood .
## Exterior1stStone .
## Exterior1stStucco .
## Exterior1stVinylSd .
## Exterior1stWd Sdng -6.333826e-03
## Exterior1stWdShing .
## Exterior2ndAsphShn .
## Exterior2ndBrk Cmn .
## Exterior2ndBrkFace .
## Exterior2ndCBlock .
## Exterior2ndCmentBd 1.903051e-03
## Exterior2ndHdBoard .
## Exterior2ndImStucc .
## Exterior2ndMetalSd .
## Exterior2ndOther .
## Exterior2ndPlywood .
## Exterior2ndStone .
## Exterior2ndStucco -4.652564e-02
## Exterior2ndVinylSd .
## Exterior2ndWd Sdng .
## Exterior2ndWd Shng -1.387069e-02
## MasVnrTypeBrkFace .
## MasVnrTypeNone .
## MasVnrTypeStone .
## MasVnrArea .
## ExterQualFa -1.044328e-02
## ExterQualGd .
## ExterQualTA -6.848606e-03
## ExterCondFa -9.876932e-03
## ExterCondGd .
## ExterCondPo .
## ExterCondTA 4.288596e-03
## FoundationCBlock .
## FoundationPConc 2.610683e-02
## FoundationSlab -2.033377e-03
## FoundationStone .
## FoundationWood .
## BsmtQualFa .
## BsmtQualGd .
## BsmtQualTA -2.813238e-04
## BsmtQualNone -3.399457e-03
## BsmtCondGd .
## BsmtCondPo .
## BsmtCondTA 3.624419e-03
## BsmtCondNone .
## BsmtExposureGd 4.858853e-02
## BsmtExposureMn .
## BsmtExposureNo -6.556593e-03
## BsmtExposureNone -3.795778e-02
## BsmtFinType1BLQ .
## BsmtFinType1GLQ 9.606824e-03
## BsmtFinType1LwQ .
## BsmtFinType1Rec .
## BsmtFinType1Unf -3.321896e-02
## BsmtFinType1None .
## BsmtFinSF1 .
## BsmtFinType2BLQ -7.694026e-03
## BsmtFinType2GLQ .
## BsmtFinType2LwQ .
## BsmtFinType2Rec .
## BsmtFinType2Unf .
## BsmtFinType2None -2.374994e-02
## BsmtFinSF2 .
## BsmtUnfSF .
## TotalBsmtSF 2.725149e-05
## HeatingGasA .
## HeatingGasW 6.541262e-02
## HeatingGrav -1.079210e-01
## HeatingOthW .
## HeatingWall .
## HeatingQCFa .
## HeatingQCGd -2.803107e-03
## HeatingQCPo .
## HeatingQCTA -1.893740e-02
## CentralAirY 6.505505e-02
## ElectricalFuseF .
## ElectricalFuseP .
## ElectricalMix .
## ElectricalSBrkr .
## X1stFlrSF 2.739845e-05
## X2ndFlrSF .
## LowQualFinSF .
## GrLivArea 1.960863e-04
## BsmtFullBath 4.369629e-02
## BsmtHalfBath .
## FullBath 2.498523e-02
## HalfBath 1.102765e-02
## BedroomAbvGr .
## KitchenAbvGr -2.694865e-02
## KitchenQualFa .
## KitchenQualGd .
## KitchenQualTA -7.346020e-03
## TotRmsAbvGrd 8.445023e-03
## FunctionalMaj2 -1.661537e-01
## FunctionalMin1 .
## FunctionalMin2 .
## FunctionalMod .
## FunctionalSev -1.691674e-01
## FunctionalTyp 3.674019e-02
## Fireplaces 2.925835e-02
## GarageTypeAttchd 1.017935e-02
## GarageTypeBasment -2.805227e-04
## GarageTypeBuiltIn .
## GarageTypeCarPort .
## GarageTypeDetchd .
## GarageTypeNone -5.483381e-04
## GarageYrBlt .
## GarageFinishRFn .
## GarageFinishUnf -3.452388e-03
## GarageFinishNone -6.132096e-05
## GarageCars 5.606530e-02
## GarageArea 3.428652e-05
## GarageQualFa -9.605669e-03
## GarageQualGd 1.663474e-02
## GarageQualPo .
## GarageQualTA .
## GarageQualNone -2.467408e-04
## GarageCondFa -1.397724e-02
## GarageCondGd .
## GarageCondPo .
## GarageCondTA .
## GarageCondNone -2.685390e-06
## PavedDriveP .
## PavedDriveY 1.160351e-02
## WoodDeckSF 9.033916e-05
## OpenPorchSF .
## EnclosedPorch 2.864247e-06
## X3SsnPorch 1.227765e-05
## ScreenPorch 2.102909e-04
## PoolArea -6.262337e-05
## MiscVal .
## MoSold .
## YrSold -4.603035e-04
## SaleTypeCon .
## SaleTypeConLD .
## SaleTypeConLI .
## SaleTypeConLw .
## SaleTypeCWD .
## SaleTypeNew 7.765443e-02
## SaleTypeOth .
## SaleTypeWD .
## SaleConditionAdjLand .
## SaleConditionAlloca .
## SaleConditionFamily .
## SaleConditionNormal 3.497182e-02
## SaleConditionPartial .
#由于 SalePrice 为 NA 无法数组化
test$SalePrice <- 1
test_x <- model.matrix(formula, test)
# 预测、输出结果
lm.pred <- predict(lm.lasso, newx = test_x, s = "lambda.min")
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "res_lasso.csv", row.names = FALSE)
使用随机森林方法进行回归建模.
library(randomForest)
library(caret)
#设定种子
set.seed(223)
# 设定控制参数
# method = "cv" -- k 折交叉验证
# number -- K 折交叉验证中的 K, number=10 则是 10 折交叉验证
# repeats -- 交叉验证的次数
# verboseIter -- 打印训练日志
ctrl <- trainControl(method = "cv", number = 10, repeats = 20, verboseIter = TRUE)
# 训练模型
lm.rf <- train(log(SalePrice)~ .-Id, data = train, method = "rf", trControl = ctrl, tuneLength = 3)
# 输出结果
#write_res(lm.rf, test, 'rf')
# 输出结果
lm.pred <- predict(lm.rf, test)
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "res_rf.csv", row.names = FALSE)
使用梯度下降法进行模型建立。
lm.gbm <- train(log(SalePrice)~ .-Id, data = train, method = "gbm", trControl = ctrl)
# 输出结果
lm.pred <- predict(lm.gbm, test)
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "res_gbm.csv", row.names = FALSE)
最终我们得到了各个方法进行预测的残差,并且进行计算得到了各个方法的RES值,通过对比可以得出相对最好的方法,想要进一步提升模型的性能,可以使用bagging,stacking等模型融合的方法来进行改进。
使用BeSS方法进行变量选择回归。
之前使用的是比较传统的方法,这里本文使用最优子集法,使用BeSS
包对该问题进行分析建模。
最优子集法即best subset selection
,最早可见Hocking[@hocking1967selection]等人1967年的文章,其思想十分简单。从零号模型(null model)\(M_0\)开始,这个模型只有截距项而没有任何自变量。然后用不同的特征组合进行拟合,从特征中分别挑选出一个最好的模型(RSS最小或\(R^2\)最大),也就是包含1个特征的模型\(M_1\),包含2个特征的模型\(M_2\),直至包含p个特征的模型\(M_p\)。然后从这总共p+1个模型中选出其中最好的模型(根据交叉验证误差,\(C_p\),BIC或adjusted \(R^2\))(注:为什么不能用RSS或\(R^2\)来衡量?因为增加任何特征,模型的训练RSS只会变小,\(R^2\)只会增大)。这个最好模型所配置的特征就是筛选出的特征。最优子集法在理想的条件下可以筛选出最好的特征集合出来,但是其高昂的计算成本是阻碍其应用的主要问题。Zhu[@Zhu2021]等人在2021年提出了一种最优子集法的多项式算法,证明了在一定条件下,该算法具有以下三个优良性质:
- 计算复杂度是多项式的
- 选择出来的子集能够覆盖真实的集合
- 该算法的解是全局最优的
作者将该方法称为Adaptive Best-Subset Selection
以下简称BeSS
,在作者提出的SIC(special information criterion)准则下,该方法的模型选择连续性得到了证明。SIC准则如下: \[\operatorname{SIC}(\mathcal{A})=n \log \mathcal{L}_{\mathcal{A}}+|\mathcal{A}| \log (p) \log \log n\] 其中\(\mathcal{A}\)为筛选出来的特征的集合。
剪接方法简介
首先定义一些变量名称,\(\boldsymbol{\beta}=\left(\beta_{1}, \ldots, \beta_{p}\right)^{\top} \in \mathbb{R}^{p}\),将\(\ell_{q}\)定义为:\(\|\boldsymbol{\beta}\|_{q}=\left(\sum_{j=1}^{p}\left|\beta_{j}\right|^{q}\right)^{1 / q},q \in[1, \infty)\)。\(\mathcal{S}=\{1, \ldots, p\}\),对任何\(\mathcal{A} \subseteq \mathcal{S}\),记\(\mathcal{A}^{c}=\mathcal{S} \backslash \mathcal{A}\)作为\(\mathcal{A}\)的补集,\(|\mathcal{A}|\)作为他的基。将\(\beta\)的子集定义为\(\operatorname{supp}(\boldsymbol{\beta})=\left\{j: \beta_{j} \neq 0\right\}\)。对一个指标集\(\mathcal{A} \subseteq\{1, \ldots, p\}\),\(\boldsymbol{\beta}_{\mathcal{A}}=\left(\beta_{j}, j \in \mathcal{A}\right) \in \mathbb{R}^{|\mathcal{A}|}\)。对一个矩阵\(\boldsymbol{X} \in \mathbb{R}^{n \times p}\)定义\(\boldsymbol{X}_{\mathcal{A}} =(X_{j},j \in \mathcal{A}) \in \mathbb{R}^{n \times|\mathcal{A}|}\)。对任何一个向量\(t\)和任何一个集合\(\mathcal{A}\),\(t^{\mathcal{A}}\)是一个第\(j\)个元素\(\left(\boldsymbol{t}^{\mathcal{A}}\right)_{j}\)如果\(j \in \mathcal{A}\)就等于\(t_{j}\),否则为0。
在文章中,作者最重要的贡献是提出了剪接方法,即splicing method
,通过使用该种方法,显著的提升了最优子集算法的性能,使其在当前的环境下变的可行。 考虑\(\ell_{0}\)限制最小化问题:\[
\min _{\boldsymbol{\beta}} \mathcal{L}_{n}(\boldsymbol{\beta}), \quad \text { s. } t\|\boldsymbol{\beta}\|_{0} \leq \boldsymbol{s}
\] 其中:\(\mathcal{L}_{n}(\boldsymbol{\beta})=\frac{1}{2 n}\|\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta}\|_{2}^{2}\)。在不考虑全局损失的情况下,我们考虑\(\|\boldsymbol{\beta}\|_{0}=\boldsymbol{s}\)。给定一个初始集合\(\mathcal{A} \subset \mathcal{S}=\{1,2, \ldots, p\}\),且\(|\mathcal{A}|=s\),记\(\mathcal{I}=\mathcal{A}^{c}\)并且计算\[
\hat{\boldsymbol{\beta}}=\arg \min _{\boldsymbol{\beta}_{\mathcal{I}}=0} \mathcal{L}_{n}(\boldsymbol{\beta})
\] 将\(\mathcal{A}\)和\(\mathcal{I}\)定义为激活集合和非激活集合,这里的激活就是在真实集合中的意思。在给定了\(\mathcal{A}\)和\(\hat{\beta}\)之后,可以定义两种损失如下:
后退损失\[ \xi_{j}=\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A} \backslash\{j\}}\right)-\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\right)=\frac{\boldsymbol{X}_{j}^{\top} \boldsymbol{X}_{j}}{2 n}\left(\hat{\beta}_{j}\right)^{2} \]
前进损失:\[ \zeta_{j}=\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\right)-\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}+\hat{\boldsymbol{t}}^{\{j\}}\right)=\frac{\boldsymbol{X}_{j}^{\top} \boldsymbol{X}_{j}}{2 n}\left(\frac{\hat{d}_{j}}{\boldsymbol{X}_{j}^{\top} \boldsymbol{X}_{j} / n}\right)^{2} \] 其中:\(\hat{t}=\arg \min _{t} \mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}+\boldsymbol{t}^{\{j\}}\right), \hat{d}_{j}=\boldsymbol{X}_{j}^{\top}(\boldsymbol{y}-\boldsymbol{X} \hat{\boldsymbol{\beta}}) / n\)
直观的而言,对\(j \in \mathcal{A}\)一个大的\(\xi_{j}\)说明这个变量是潜在重要的。但是由于子集大小不同,这两个损失是无法比较的。然而如果将\(\mathcal{A}\)中的一些不是那么相关的变量和\(\mathcal{I}\)中一些重要的变量交换,这也许能够获得比较好的结果,这就是剪接法的思想所在。 特别的,对任何给定的\(k <= s\)定义如下: \[ \mathcal{A}_{k}=\left\{j \in \mathcal{A}: \sum_{i \in \mathcal{A}} \mathrm{I}\left(\xi_{j} \geq \xi_{i}\right) \leq k\right\} \] \[ \mathcal{I}_{k}=\left\{j \in \mathcal{I}: \sum_{i \in \mathcal{I}} \mathrm{I}\left(\zeta_{j} \leq \zeta_{i}\right) \leq k\right\} \] 通过交换\(\mathcal{A}_{k}\)和\(\mathcal{I}_{k}\),从而实现对\(\mathcal{A}\)和\(\mathcal{I}\)的切片,得到了新的集合: \[ \tilde{\mathcal{A}}=\left(\mathcal{A} \backslash \mathcal{A}_{k}\right) \cup \mathcal{I}_{k} \] 记\(\tilde{\mathcal{I}}=\tilde{\mathcal{A}}^{c}, \tilde{\boldsymbol{\beta}}=\arg \min _{\boldsymbol{\beta}_{\tilde{\mathcal{I}}}=0} \mathcal{L}_{n}(\boldsymbol{\beta})\),并且\(\tau_{s}>0\)为阈值。如果\(\tau_{s} < \mathcal{L}_{n}(\hat{\boldsymbol{\beta}})-\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}})\) 则说明\(\tilde{\mathcal{A}}\)是优于\(\mathcal{A}\)。通过这样的方法则可以更新集合\(\mathcal{A}\)指导损失函数不能够通过剪接方法来进行提升。还有的问题就是设定初始集。通常来说我们选定第一批\(s\)个特征,这些特征是与\(y\)关联程度最大的特征。设\(k_{max}\)为剪接最大尺寸,\(k_{max}<s\),接下来的算法演示了如何计算具体的方法。
# 这里使用之前构造好的x矩阵,将分类因子型变量转化为哑变量
library(BeSS)
bessmodel<-bess(x,log(train$SalePrice),family = "gaussian",K.max = 40,max.steps = 30,method = "gsection")
## 1-th iteration s.left:1 s.split:124 s.right:200
## 2-th iteration s.left:1 s.split:77 s.right:124
## 3-th iteration s.left:77 s.split:106 s.right:124
## 4-th iteration s.left:77 s.split:95 s.right:106
## 5-th iteration s.left:95 s.split:102 s.right:106
## 6-th iteration s.left:102 s.split:104 s.right:106
## 7-th iteration s.left:104 s.split:105 s.right:106
plot(bessmodel,type = "both",breaks = TRUE,K=10)
bessmodel
## Df MSE AIC BIC EBIC
## [1,] 1 0.052971658 -4287.477 -4282.191 -4271.222
## [2,] 200 0.009208594 -6443.922 -5386.684 -3192.765
## [3,] 124 0.009723193 -6516.532 -5861.044 -4500.815
## [4,] 77 0.010871005 -6447.618 -6040.581 -5195.922
## [5,] 106 0.009944842 -6519.624 -5959.288 -4796.511
## [6,] 95 0.010160562 -6510.293 -6008.104 -4965.993
## [7,] 102 0.010012268 -6517.758 -5978.567 -4859.668
## [8,] 104 0.009982011 -6518.177 -5968.413 -4827.575
## [9,] 105 0.009951605 -6520.631 -5965.581 -4813.774
输出使用bess方法建立的最优模型
bestmodel<-bessmodel$bestmodel
summary(bestmodel)
##
## Call:
## lm(formula = ys ~ xbest, weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75141 -0.05143 0.00246 0.05453 0.75141
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.630e+00 5.931e-01 2.749 0.006061 **
## xbestMSSubClass -2.562e-04 1.032e-04 -2.482 0.013169 *
## xbestMSZoningFV 4.833e-01 3.969e-02 12.176 < 2e-16 ***
## xbestMSZoningRH 4.325e-01 4.437e-02 9.749 < 2e-16 ***
## xbestMSZoningRL 4.287e-01 3.681e-02 11.646 < 2e-16 ***
## xbestMSZoningRM 3.635e-01 3.676e-02 9.890 < 2e-16 ***
## xbestLotFrontage 5.194e-04 1.748e-04 2.971 0.003019 **
## xbestLotArea 2.768e-06 4.207e-07 6.579 6.73e-11 ***
## xbestLotShapeIR2 2.952e-02 1.778e-02 1.660 0.097089 .
## xbestLotConfigCulDSac 4.148e-02 1.237e-02 3.353 0.000822 ***
## xbestLandSlopeSev -2.262e-01 4.596e-02 -4.921 9.66e-07 ***
## xbestNeighborhoodBrkSide 4.672e-02 1.579e-02 2.960 0.003134 **
## xbestNeighborhoodCrawfor 1.276e-01 1.697e-02 7.518 1.00e-13 ***
## xbestNeighborhoodEdwards -5.667e-02 1.285e-02 -4.412 1.11e-05 ***
## xbestNeighborhoodMeadowV -1.291e-01 3.391e-02 -3.808 0.000146 ***
## xbestNeighborhoodMitchel -3.483e-02 1.626e-02 -2.142 0.032354 *
## xbestNeighborhoodNAmes -1.736e-02 9.775e-03 -1.776 0.075923 .
## xbestNeighborhoodNoRidge 3.049e-02 1.902e-02 1.603 0.109270
## xbestNeighborhoodNridgHt 7.814e-02 1.559e-02 5.012 6.10e-07 ***
## xbestNeighborhoodStoneBr 1.220e-01 2.347e-02 5.197 2.34e-07 ***
## xbestCondition1Norm 5.229e-02 9.153e-03 5.713 1.37e-08 ***
## xbestCondition1PosN 4.631e-02 2.784e-02 1.664 0.096435 .
## xbestCondition1RRAe -6.459e-02 3.342e-02 -1.932 0.053511 .
## xbestCondition2PosN -9.288e-01 8.233e-02 -11.282 < 2e-16 ***
## xbestCondition2RRAe -4.751e-01 1.662e-01 -2.858 0.004327 **
## xbestBldgTypeTwnhs -5.001e-02 1.974e-02 -2.533 0.011412 *
## xbestHouseStyle1Story -1.750e-02 1.125e-02 -1.555 0.120267
## xbestHouseStyle2.5Fin -9.306e-02 5.011e-02 -1.857 0.063540 .
## xbestHouseStyle2.5Unf 6.948e-02 3.587e-02 1.937 0.052923 .
## xbestHouseStyle2Story -1.909e-02 1.309e-02 -1.459 0.144796
## xbestOverallQual 4.507e-02 4.075e-03 11.061 < 2e-16 ***
## xbestOverallCond 3.851e-02 3.582e-03 10.752 < 2e-16 ***
## xbestYearBuilt 2.042e-03 2.419e-04 8.439 < 2e-16 ***
## xbestYearRemodAdd 8.796e-04 2.262e-04 3.889 0.000105 ***
## xbestRoofStyleShed 3.562e-01 1.253e-01 2.844 0.004528 **
## xbestRoofMatlCompShg 2.774e+00 1.273e-01 21.798 < 2e-16 ***
## xbestRoofMatlMembran 3.171e+00 1.752e-01 18.105 < 2e-16 ***
## xbestRoofMatlMetal 3.038e+00 1.736e-01 17.499 < 2e-16 ***
## xbestRoofMatlRoll 2.786e+00 1.649e-01 16.897 < 2e-16 ***
## xbestRoofMatlTar&Grv 2.823e+00 1.316e-01 21.454 < 2e-16 ***
## xbestRoofMatlWdShake 2.750e+00 1.375e-01 20.007 < 2e-16 ***
## xbestRoofMatlWdShngl 2.832e+00 1.335e-01 21.210 < 2e-16 ***
## xbestExterior1stBrkComm -2.223e-01 8.083e-02 -2.750 0.006043 **
## xbestExterior1stBrkFace 5.533e-02 1.759e-02 3.146 0.001690 **
## xbestExterior1stCBlock -1.221e-01 1.063e-01 -1.149 0.250871
## xbestExterior1stCemntBd -9.214e-02 6.328e-02 -1.456 0.145572
## xbestExterior1stHdBoard -2.106e-02 9.233e-03 -2.281 0.022695 *
## xbestExterior1stMetalSd 3.414e-02 3.482e-02 0.981 0.327009
## xbestExterior1stWd Sdng -5.200e-02 1.737e-02 -2.994 0.002805 **
## xbestExterior2ndCBlock NA NA NA NA
## xbestExterior2ndCmentBd 1.078e-01 6.387e-02 1.688 0.091632 .
## xbestExterior2ndMetalSd -2.332e-02 3.491e-02 -0.668 0.504290
## xbestExterior2ndWd Sdng 4.314e-02 1.705e-02 2.531 0.011486 *
## xbestMasVnrTypeBrkFace 3.599e-02 2.902e-02 1.240 0.215138
## xbestMasVnrTypeNone 2.806e-02 2.868e-02 0.979 0.327997
## xbestMasVnrTypeStone 5.264e-02 3.073e-02 1.713 0.086947 .
## xbestExterCondFa -1.019e-01 5.860e-02 -1.739 0.082273 .
## xbestExterCondGd -8.444e-02 5.523e-02 -1.529 0.126521
## xbestExterCondTA -6.340e-02 5.480e-02 -1.157 0.247490
## xbestFoundationPConc 3.250e-02 9.392e-03 3.460 0.000557 ***
## xbestFoundationStone 1.050e-01 4.479e-02 2.345 0.019168 *
## xbestFoundationWood -1.488e-01 6.331e-02 -2.351 0.018863 *
## xbestBsmtCondPo 1.864e-01 8.724e-02 2.137 0.032813 *
## xbestBsmtExposureGd 3.626e-02 1.185e-02 3.060 0.002259 **
## xbestBsmtExposureNo -1.372e-02 7.164e-03 -1.916 0.055630 .
## xbestBsmtFinType2BLQ -4.045e-02 2.050e-02 -1.973 0.048716 *
## xbestBsmtFinType2Unf 1.684e-02 9.977e-03 1.688 0.091634 .
## xbestBsmtUnfSF -6.426e-05 1.026e-05 -6.263 5.07e-10 ***
## xbestTotalBsmtSF 1.430e-04 1.544e-05 9.264 < 2e-16 ***
## xbestHeatingGasA 1.503e-01 3.650e-02 4.117 4.08e-05 ***
## xbestHeatingGasW 2.172e-01 4.375e-02 4.963 7.81e-07 ***
## xbestHeatingWall 2.131e-01 6.604e-02 3.227 0.001282 **
## xbestHeatingQCGd -1.962e-02 8.609e-03 -2.279 0.022830 *
## xbestHeatingQCTA -3.092e-02 8.057e-03 -3.838 0.000130 ***
## xbestCentralAirY 6.336e-02 1.525e-02 4.154 3.47e-05 ***
## xbestX1stFlrSF 2.705e-04 1.766e-05 15.314 < 2e-16 ***
## xbestX2ndFlrSF 2.713e-04 1.755e-05 15.463 < 2e-16 ***
## xbestLowQualFinSF 2.957e-04 7.443e-05 3.972 7.49e-05 ***
## xbestGrLivArea NA NA NA NA
## xbestBsmtFullBath 2.886e-02 7.709e-03 3.743 0.000189 ***
## xbestHalfBath 1.465e-02 7.951e-03 1.843 0.065608 .
## xbestKitchenAbvGr -3.935e-02 1.639e-02 -2.400 0.016524 *
## xbestKitchenQualFa -6.725e-02 2.499e-02 -2.691 0.007210 **
## xbestKitchenQualGd -6.974e-02 1.363e-02 -5.116 3.56e-07 ***
## xbestKitchenQualTA -6.501e-02 1.574e-02 -4.130 3.85e-05 ***
## xbestFunctionalMaj2 -3.014e-01 5.277e-02 -5.712 1.37e-08 ***
## xbestFunctionalMod -9.172e-02 3.227e-02 -2.843 0.004541 **
## xbestFunctionalSev -3.083e-01 1.166e-01 -2.644 0.008285 **
## xbestFunctionalTyp 4.074e-02 1.350e-02 3.018 0.002589 **
## xbestFireplaces 2.228e-02 5.557e-03 4.010 6.40e-05 ***
## xbestGarageTypeAttchd 4.771e-02 1.703e-02 2.801 0.005168 **
## xbestGarageTypeBuiltIn 2.901e-02 2.168e-02 1.338 0.181093
## xbestGarageTypeDetchd 3.832e-02 1.709e-02 2.242 0.025093 *
## xbestGarageCars 1.939e-02 9.408e-03 2.061 0.039488 *
## xbestGarageArea 1.187e-04 3.048e-05 3.894 0.000103 ***
## xbestGarageQualFa -5.842e-02 2.303e-02 -2.536 0.011315 *
## xbestGarageQualTA -1.868e-02 1.808e-02 -1.034 0.301542
## xbestWoodDeckSF 1.003e-04 2.506e-05 4.003 6.59e-05 ***
## xbestEnclosedPorch 1.159e-04 5.240e-05 2.212 0.027141 *
## xbestX3SsnPorch 1.381e-04 9.669e-05 1.428 0.153542
## xbestScreenPorch 2.527e-04 5.262e-05 4.802 1.75e-06 ***
## xbestSaleTypeConLD 1.290e-01 3.728e-02 3.460 0.000558 ***
## xbestSaleTypeCWD 7.657e-02 5.373e-02 1.425 0.154357
## xbestSaleTypeNew 1.091e-01 1.521e-02 7.176 1.18e-12 ***
## xbestSaleConditionAdjLand 8.365e-02 5.605e-02 1.493 0.135800
## xbestSaleConditionNormal 5.456e-02 1.016e-02 5.370 9.25e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1035 on 1356 degrees of freedom
## Multiple R-squared: 0.9376, Adjusted R-squared: 0.9328
## F-statistic: 197.8 on 103 and 1356 DF, p-value: < 2.2e-16
mse<-(sum((bestmodel$residuals)^2))/1460
mse
## [1] 0.009951527
各种方法对比
jieguo<-read.csv("jieguo.csv",header = T)
library(knitr)
knitr::kable(jieguo)
锘縨ethod | MSE |
---|---|
BASE | 0.1716800 |
stepwise | 0.1319700 |
LASSO | 0.1317300 |
random forest | 0.1418900 |
gbdt | 0.1313700 |
BeSS | 0.0099515 |
使用bess方法建立的模型最终的均方误差只有\(0.009951527\),效果很好,但是也许建模过程中对因子化处理有些问题。通过对比得出相对而言bess
方法是最优的。