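I recently practiced on Kaggle's House Prices: Advanced Regression Techniques competition. At its core this is a regression problem, and machine learning methods and techniques handle this class of problem well. In the end this article uses the BeSS method on the problem, with an \(MSE\) close to 0, a good result, so I am sharing it here. The first part of this article was inspired by that blog post, for which I am grateful.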

Exploratory Analysis

First, read in the housing data. Load it into the variable data and inspect its structure.
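The loading step itself is not shown. Since data has 2919 rows and SalePrice is missing in 1459 of them, it looks as if the competition's training and test files were stacked into one data frame. Below is a minimal sketch of that step under this assumption; the toy frames stand in for the real files, and the names train_raw, test_raw and combined are hypothetical.

```r
# Toy stand-ins for the two competition files; in practice these would
# come from something like read.csv("train.csv") and read.csv("test.csv")
# (the file names are an assumption, they are not given in the text).
train_raw <- data.frame(Id = 1:3,
                        LotArea = c(8450L, 9600L, 11250L),
                        SalePrice = c(208500L, 181500L, 223500L))
test_raw <- data.frame(Id = 4:5,
                       LotArea = c(11622L, 14267L))

# The test file has no SalePrice column, so add it as NA before stacking
test_raw$SalePrice <- NA_integer_

# Stack the rows; the NA SalePrice later tells the two sets apart
combined <- rbind(train_raw, test_raw)
str(combined)
```

The str() output in this article shows Factor columns, so under R 4.0 or later read.csv would need stringsAsFactors = TRUE to reproduce it.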

str(data)
## 'data.frame':    2919 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

As the output above shows, the variables fall mainly into numeric and factor types.

bianliang <- sapply(data,class)
table(bianliang)
## bianliang
##  factor integer 
##      43      38

The data contains 43 factor variables and 38 integer variables.

Variable Processing

Inspecting the data shows many missing values, so we handle those first.

queshizhi<-sapply(data,function(x) sum(is.na(x)))

# Rank the variables by number of missing values
queshi <- sort(queshizhi, decreasing = TRUE)
queshi[queshi>0]
##       PoolQC  MiscFeature        Alley        Fence    SalePrice  FireplaceQu 
##         2909         2814         2721         2348         1459         1420 
##  LotFrontage  GarageYrBlt GarageFinish   GarageQual   GarageCond   GarageType 
##          486          159          159          159          159          157 
##     BsmtCond BsmtExposure     BsmtQual BsmtFinType2 BsmtFinType1   MasVnrType 
##           82           82           81           80           79           24 
##   MasVnrArea     MSZoning    Utilities BsmtFullBath BsmtHalfBath   Functional 
##           23            4            2            2            2            2 
##  Exterior1st  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1

Let us look more closely at the variables that have missing values.

summary(data[,names(queshi)[queshi>0]])
##   PoolQC     MiscFeature  Alley        Fence        SalePrice      FireplaceQu
##  Ex  :   4   Gar2:   5   Grvl: 120   GdPrv: 118   Min.   : 34900   Ex  :  43  
##  Fa  :   2   Othr:   4   Pave:  78   GdWo : 112   1st Qu.:129975   Fa  :  74  
##  Gd  :   4   Shed:  95   NA's:2721   MnPrv: 329   Median :163000   Gd  : 744  
##  NA's:2909   TenC:   1               MnWw :  12   Mean   :180921   Po  :  46  
##              NA's:2814               NA's :2348   3rd Qu.:214000   TA  : 592  
##                                                   Max.   :755000   NA's:1420  
##                                                   NA's   :1459                
##   LotFrontage      GarageYrBlt   GarageFinish GarageQual  GarageCond 
##  Min.   : 21.00   Min.   :1895   Fin : 719    Ex  :   3   Ex  :   3  
##  1st Qu.: 59.00   1st Qu.:1960   RFn : 811    Fa  : 124   Fa  :  74  
##  Median : 68.00   Median :1979   Unf :1230    Gd  :  24   Gd  :  15  
##  Mean   : 69.31   Mean   :1978   NA's: 159    Po  :   5   Po  :  14  
##  3rd Qu.: 80.00   3rd Qu.:2002                TA  :2604   TA  :2654  
##  Max.   :313.00   Max.   :2207                NA's: 159   NA's: 159  
##  NA's   :486      NA's   :159                                        
##    GarageType   BsmtCond    BsmtExposure BsmtQual    BsmtFinType2 BsmtFinType1
##  2Types :  23   Fa  : 104   Av  : 418    Ex  : 258   ALQ :  52    ALQ :429    
##  Attchd :1723   Gd  : 122   Gd  : 276    Fa  :  88   BLQ :  68    BLQ :269    
##  Basment:  36   Po  :   5   Mn  : 239    Gd  :1209   GLQ :  34    GLQ :849    
##  BuiltIn: 186   TA  :2606   No  :1904    TA  :1283   LwQ :  87    LwQ :154    
##  CarPort:  15   NA's:  82   NA's:  82    NA's:  81   Rec : 105    Rec :288    
##  Detchd : 779                                        Unf :2493    Unf :851    
##  NA's   : 157                                        NA's:  80    NA's: 79    
##    MasVnrType     MasVnrArea        MSZoning     Utilities     BsmtFullBath   
##  BrkCmn :  25   Min.   :   0.0   C (all):  25   AllPub:2916   Min.   :0.0000  
##  BrkFace: 879   1st Qu.:   0.0   FV     : 139   NoSeWa:   1   1st Qu.:0.0000  
##  None   :1742   Median :   0.0   RH     :  26   NA's  :   2   Median :0.0000  
##  Stone  : 249   Mean   : 102.2   RL     :2265                 Mean   :0.4299  
##  NA's   :  24   3rd Qu.: 164.0   RM     : 460                 3rd Qu.:1.0000  
##                 Max.   :1600.0   NA's   :   4                 Max.   :3.0000  
##                 NA's   :23                                    NA's   :2       
##   BsmtHalfBath       Functional    Exterior1st    Exterior2nd  
##  Min.   :0.00000   Typ    :2717   VinylSd:1025   VinylSd:1014  
##  1st Qu.:0.00000   Min2   :  70   MetalSd: 450   MetalSd: 447  
##  Median :0.00000   Min1   :  65   HdBoard: 442   HdBoard: 406  
##  Mean   :0.06136   Mod    :  35   Wd Sdng: 411   Wd Sdng: 391  
##  3rd Qu.:0.00000   Maj1   :  19   Plywood: 221   Plywood: 270  
##  Max.   :2.00000   (Other):  11   (Other): 369   (Other): 390  
##  NA's   :2         NA's   :   2   NA's   :   1   NA's   :   1  
##    BsmtFinSF1       BsmtFinSF2        BsmtUnfSF       TotalBsmtSF    
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 220.0   1st Qu.: 793.0  
##  Median : 368.5   Median :   0.00   Median : 467.0   Median : 989.5  
##  Mean   : 441.4   Mean   :  49.58   Mean   : 560.8   Mean   :1051.8  
##  3rd Qu.: 733.0   3rd Qu.:   0.00   3rd Qu.: 805.5   3rd Qu.:1302.0  
##  Max.   :5644.0   Max.   :1526.00   Max.   :2336.0   Max.   :6110.0  
##  NA's   :1        NA's   :1         NA's   :1        NA's   :1       
##  Electrical   KitchenQual   GarageCars      GarageArea        SaleType   
##  FuseA: 188   Ex  : 205   Min.   :0.000   Min.   :   0.0   WD     :2525  
##  FuseF:  50   Fa  :  70   1st Qu.:1.000   1st Qu.: 320.0   New    : 239  
##  FuseP:   8   Gd  :1151   Median :2.000   Median : 480.0   COD    :  87  
##  Mix  :   1   TA  :1492   Mean   :1.767   Mean   : 472.9   ConLD  :  26  
##  SBrkr:2671   NA's:   1   3rd Qu.:2.000   3rd Qu.: 576.0   CWD    :  12  
##  NA's :   1               Max.   :5.000   Max.   :1488.0   (Other):  29  
##                           NA's   :1       NA's   :1        NA's   :   1

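PoolQC, MiscFeature, Alley, Fence and FireplaceQu are each missing in roughly half or more of the 2919 rows (1420 to 2909 values), so imputing them is not practical and we drop these variables outright.

# Drop the following variables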
quchu <- names(data) %in% c("PoolQC","MiscFeature","Alley","Fence","FireplaceQu")
data <- data[!quchu]

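With the high-NA variables removed, the Garage and Bsmt series remain. An online search shows that these describe the garage and the basement, so for the factor variables in these series we replace the NA values with a None level.

According to the data description file, GarageYrBlt is the year the garage was built; we fill its missing values with the year the house was built.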
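The code for this replacement is not shown in the article. A minimal sketch of one way to do it follows; the helper add_none_level and the toy vector are hypothetical, and the level name "None" is taken from the dummy names (for example BsmtQualNone) that appear in the lasso output later on.

```r
# Hypothetical helper: turn NA in a factor into an explicit "None" level
add_none_level <- function(f) {
  levels(f) <- c(levels(f), "None")  # register the new level first
  f[is.na(f)] <- "None"              # then assign it to the missing entries
  f
}

# Toy example standing in for a column such as GarageType
garage_type <- factor(c("Attchd", NA, "Detchd", NA))
filled <- add_none_level(garage_type)
table(filled)

# On the real data the same helper could be applied column by column, e.g.
# for (x in none_cols) data[[x]] <- add_none_level(data[[x]])
# where none_cols is a character vector of the Garage*/Bsmt* factor columns.
```

Assigning a value that is not yet a level would only produce NA with a warning, which is why the level is added before the assignment.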

data$GarageYrBlt[is.na(data$GarageYrBlt)] <- data$YearBuilt[is.na(data$GarageYrBlt)]

Imputing Missing Values

LotFrontage is the linear feet of street connected to the property; we fill its missing values with the median.

data$LotFrontage[is.na(data$LotFrontage)] <- median(data$LotFrontage, na.rm = T)

MasVnrType is the masonry veneer type, which has little effect on the sale price. We fill it with None.

data[["MasVnrType"]][is.na(data[["MasVnrType"]])] = "None"

MasVnrArea is the masonry veneer area; we fill it with the value 0.

data[["MasVnrArea"]][is.na(data[["MasVnrArea"]])] <- 0

Utilities is almost constant (2916 of 2919 rows are AllPub), so it adds nothing to the analysis and is dropped.

data$Utilities <- NULL

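The variables BsmtFullBath, BsmtHalfBath, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, GarageCars and GarageArea are numeric variables describing the garage and the basement; filling them with 0 is sufficient.

# Numeric garage/basement variables whose NA values are set to 0
zero_fill <- c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","GarageCars","GarageArea")
for (x in zero_fill)    data[[x]][is.na(data[[x]])] <- 0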

MSZoning, Functional, Exterior1st, Exterior2nd, KitchenQual, Electrical and SaleType are factor variables with very few missing values, so we simply replace each NA with the most frequent level.

tidai <- c("MSZoning","Functional","Exterior1st","Exterior2nd","KitchenQual","Electrical","SaleType")
for (x in tidai )    data[[x]][is.na(data[[x]])] <- levels(data[[x]])[which.max(table(data[[x]]))]
# Split into training and test sets by whether SalePrice is missing
train <- data[!is.na(data$SalePrice), ]
test <- data[is.na(data$SalePrice), ]

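Building the Model

The data has many predictors, and common sense says we should pick out the ones that affect house prices most. We can first hand-pick factors that are likely to matter, then add further variables and see whether the model improves. In China, the main drivers of house prices are floor area and location (district and residential complex); age, building type (mid-rise, multi-story, villa and so on), special circumstances (near a subway line, in a school district and so on) and interior finish also affect the price. US house prices behave in much the same way. We therefore choose the following variables to examine.

  • LotArea: lot area
  • Neighborhood: city neighborhood, a first proxy for district and residential complex
  • Condition1, Condition2: proximity to nearby roads and railways
  • BldgType: building type (single-family detached, townhouse)
  • HouseStyle: style of dwelling (number of stories)
  • YearBuilt: year the house was built
  • YearRemodAdd: year the house was remodeled
  • OverallQual: overall quality, rating materials and finish
  • OverallCond: overall condition

Examining Variable Relationships

We write functions to plot the distributions of factor variables and numeric variables.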

# Load libraries
library(ggplot2)
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
# Plots for a factor variable
plot2_factor <- function(var_name){
    plots <- list()
    plots[[1]] <- ggplot(train, aes_string(x = var_name, fill = var_name) ) + 
        geom_bar() +
        guides(fill = "none") +
            ggtitle(paste("count of ", var_name)) +
            theme(axis.text.x = element_text(angle = 90, hjust =1))

    plots[[2]] <- ggplot(train, aes_string(x = var_name, y = "SalePrice", fill = var_name) ) +
        geom_boxplot() +
        guides(fill = "none") +
        ggtitle(paste( var_name, " vs SalePrice")) +
        theme(axis.text.x = element_text(angle = 90, hjust =1))

    multiplot(plotlist = plots, cols = 2)   
}

# Plots for a continuous numeric variable
plot2_number <- function(var_name){
    plots <- list()
    plots[[1]] <- ggplot(train, aes_string(x = var_name) ) + 
        geom_histogram() +
        ggtitle(paste("count of ", var_name))

    plots[[2]] <- ggplot(train, aes_string(x = var_name, y = "SalePrice") ) +
        geom_point() +
        ggtitle(paste( var_name, " vs SalePrice"))

    multiplot(plotlist = plots, cols = 2)   
}

First, plot the relationship between neighborhood and sale price.

plot2_factor("Neighborhood")

The plot above shows that sale prices differ considerably between neighborhoods, so this variable should have a fairly large effect.

plot2_number("YearBuilt")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

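Plotting the numeric variable YearBuilt shows that sale price also has a strong, roughly linear relationship with the year of construction, so this variable is informative.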

plot2_number("OverallQual")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot above shows that the higher a house's overall quality rating, the higher its price.

Correlation Between Variables

library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
## 
##     baseball
## The following object is masked from 'package:lattice':
## 
##     panel.fill
sel <- c("LotArea","Neighborhood","BldgType","HouseStyle","YearBuilt","YearRemodAdd","OverallQual","OverallCond","MSZoning")

corrgram(train[,sel], order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt)

Training a Linear Model

tezheng <- SalePrice ~ LotArea + Neighborhood + BldgType + HouseStyle + YearBuilt + YearRemodAdd + OverallQual + OverallCond

# Fit the model
lm1 <- lm(tezheng, train)

# View the model summary
summary(lm1)
## 
## Call:
## lm(formula = tezheng, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -208970  -20882   -2917   15544  351199 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.455e+06  1.850e+05  -7.862 7.42e-15 ***
## LotArea              1.084e+00  1.156e-01   9.375  < 2e-16 ***
## NeighborhoodBlueste -1.068e+03  2.953e+04  -0.036 0.971141    
## NeighborhoodBrDale  -1.440e+04  1.518e+04  -0.949 0.342806    
## NeighborhoodBrkSide -1.876e+04  1.278e+04  -1.468 0.142460    
## NeighborhoodClearCr -2.352e+03  1.332e+04  -0.177 0.859842    
## NeighborhoodCollgCr -2.917e+04  1.086e+04  -2.685 0.007335 ** 
## NeighborhoodCrawfor  1.747e+04  1.246e+04   1.402 0.161225    
## NeighborhoodEdwards -2.813e+04  1.165e+04  -2.414 0.015924 *  
## NeighborhoodGilbert -4.030e+04  1.157e+04  -3.484 0.000508 ***
## NeighborhoodIDOTRR  -3.357e+04  1.343e+04  -2.499 0.012570 *  
## NeighborhoodMeadowV  1.338e+04  1.446e+04   0.925 0.354867    
## NeighborhoodMitchel -2.819e+04  1.196e+04  -2.356 0.018617 *  
## NeighborhoodNAmes   -2.202e+04  1.130e+04  -1.950 0.051426 .  
## NeighborhoodNoRidge  6.105e+04  1.226e+04   4.980 7.13e-07 ***
## NeighborhoodNPkVill  6.340e+03  1.650e+04   0.384 0.700928    
## NeighborhoodNridgHt  4.876e+04  1.104e+04   4.417 1.08e-05 ***
## NeighborhoodNWAmes  -2.126e+04  1.166e+04  -1.823 0.068457 .  
## NeighborhoodOldTown -2.915e+04  1.243e+04  -2.344 0.019194 *  
## NeighborhoodSawyer  -2.575e+04  1.188e+04  -2.168 0.030350 *  
## NeighborhoodSawyerW -2.224e+04  1.154e+04  -1.927 0.054234 .  
## NeighborhoodSomerst -1.228e+04  1.093e+04  -1.123 0.261764    
## NeighborhoodStoneBr  5.984e+04  1.249e+04   4.790 1.84e-06 ***
## NeighborhoodSWISU   -2.365e+04  1.433e+04  -1.651 0.099024 .  
## NeighborhoodTimber  -1.326e+04  1.236e+04  -1.073 0.283489    
## NeighborhoodVeenker  2.303e+04  1.555e+04   1.481 0.138905    
## BldgType2fmCon       1.230e+03  7.413e+03   0.166 0.868218    
## BldgTypeDuplex      -7.231e+02  5.831e+03  -0.124 0.901330    
## BldgTypeTwnhs       -6.675e+04  7.811e+03  -8.546  < 2e-16 ***
## BldgTypeTwnhsE      -4.916e+04  4.892e+03 -10.049  < 2e-16 ***
## HouseStyle1.5Unf    -2.835e+04  1.102e+04  -2.573 0.010184 *  
## HouseStyle1Story    -3.981e+03  3.977e+03  -1.001 0.316972    
## HouseStyle2.5Fin     5.328e+04  1.472e+04   3.619 0.000306 ***
## HouseStyle2.5Unf    -4.606e+03  1.250e+04  -0.368 0.712613    
## HouseStyle2Story     4.069e+03  4.205e+03   0.968 0.333393    
## HouseStyleSFoyer    -1.173e+04  7.791e+03  -1.505 0.132424    
## HouseStyleSLvl      -6.438e+03  6.197e+03  -1.039 0.299077    
## YearBuilt            4.285e+02  8.428e+01   5.084 4.20e-07 ***
## YearRemodAdd         3.114e+02  7.505e+01   4.149 3.53e-05 ***
## OverallQual          2.849e+04  1.187e+03  24.010  < 2e-16 ***
## OverallCond          1.613e+03  1.150e+03   1.402 0.161035    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38880 on 1419 degrees of freedom
## Multiple R-squared:  0.7671, Adjusted R-squared:  0.7605 
## F-statistic: 116.8 on 40 and 1419 DF,  p-value: < 2.2e-16

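Some individual terms in the model are not significant, but the overall F-test is passed (p-value < 2.2e-16), so the model is usable. The adjusted \(R^2=0.7605\) is a reasonably good fit.

Variable Selection

We start with manual variable selection, removing non-significant variables: drop OverallCond and refit.

# Variables of the initial lm.base model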
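As a quick check on how the adjusted \(R^2\) relates to the plain \(R^2\) in the summary, the standard adjustment can be recomputed by hand. This is only an illustrative sketch; the function name adjusted_r2 is hypothetical, and the inputs are the rounded figures printed by summary(lm1).

```r
# Adjusted R^2 from R^2, sample size n and number of estimated slopes p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read off summary(lm1): R^2 = 0.7671, 40 model df, 1419 residual df,
# hence n = 1419 + 40 + 1 = 1460 training rows
adjusted_r2(0.7671, n = 1460, p = 40)  # about 0.7605
```

The penalty factor (n - 1) / (n - p - 1) is close to 1 here because 40 coefficients is small relative to 1460 rows, which is why the two values differ only in the third decimal place.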
fm.base <- SalePrice ~ LotArea + Neighborhood + BldgType + HouseStyle + YearBuilt + YearRemodAdd + OverallQual

# Fit the model
lm.base <- lm(fm.base, train)
summary(lm.base)
## 
## Call:
## lm(formula = fm.base, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -209817  -20321   -2822   15210  352218 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.464e+06  1.850e+05  -7.914 4.98e-15 ***
## LotArea              1.081e+00  1.156e-01   9.352  < 2e-16 ***
## NeighborhoodBlueste  7.266e+02  2.951e+04   0.025 0.980358    
## NeighborhoodBrDale  -1.362e+04  1.517e+04  -0.898 0.369408    
## NeighborhoodBrkSide -1.798e+04  1.277e+04  -1.408 0.159440    
## NeighborhoodClearCr -1.805e+03  1.331e+04  -0.136 0.892207    
## NeighborhoodCollgCr -2.898e+04  1.087e+04  -2.667 0.007744 ** 
## NeighborhoodCrawfor  1.883e+04  1.243e+04   1.516 0.129852    
## NeighborhoodEdwards -2.782e+04  1.166e+04  -2.387 0.017109 *  
## NeighborhoodGilbert -4.028e+04  1.157e+04  -3.481 0.000515 ***
## NeighborhoodIDOTRR  -3.359e+04  1.344e+04  -2.499 0.012553 *  
## NeighborhoodMeadowV  1.425e+04  1.445e+04   0.986 0.324151    
## NeighborhoodMitchel -2.759e+04  1.196e+04  -2.307 0.021210 *  
## NeighborhoodNAmes   -2.091e+04  1.127e+04  -1.855 0.063807 .  
## NeighborhoodNoRidge  6.106e+04  1.226e+04   4.980 7.14e-07 ***
## NeighborhoodNPkVill  7.526e+03  1.649e+04   0.456 0.648117    
## NeighborhoodNridgHt  4.836e+04  1.104e+04   4.380 1.27e-05 ***
## NeighborhoodNWAmes  -1.993e+04  1.163e+04  -1.714 0.086670 .  
## NeighborhoodOldTown -2.849e+04  1.243e+04  -2.292 0.022025 *  
## NeighborhoodSawyer  -2.475e+04  1.186e+04  -2.086 0.037120 *  
## NeighborhoodSawyerW -2.201e+04  1.155e+04  -1.907 0.056764 .  
## NeighborhoodSomerst -1.248e+04  1.094e+04  -1.141 0.254229    
## NeighborhoodStoneBr  5.963e+04  1.250e+04   4.772 2.01e-06 ***
## NeighborhoodSWISU   -2.346e+04  1.433e+04  -1.637 0.101780    
## NeighborhoodTimber  -1.325e+04  1.236e+04  -1.072 0.283915    
## NeighborhoodVeenker  2.475e+04  1.551e+04   1.596 0.110733    
## BldgType2fmCon       7.637e+02  7.408e+03   0.103 0.917900    
## BldgTypeDuplex      -1.566e+03  5.802e+03  -0.270 0.787285    
## BldgTypeTwnhs       -6.658e+04  7.813e+03  -8.522  < 2e-16 ***
## BldgTypeTwnhsE      -4.951e+04  4.888e+03 -10.129  < 2e-16 ***
## HouseStyle1.5Unf    -2.800e+04  1.102e+04  -2.541 0.011173 *  
## HouseStyle1Story    -4.114e+03  3.977e+03  -1.035 0.301043    
## HouseStyle2.5Fin     5.310e+04  1.473e+04   3.606 0.000322 ***
## HouseStyle2.5Unf    -5.807e+03  1.248e+04  -0.465 0.641658    
## HouseStyle2Story     3.910e+03  4.205e+03   0.930 0.352516    
## HouseStyleSFoyer    -1.110e+04  7.781e+03  -1.427 0.153881    
## HouseStyleSLvl      -6.096e+03  6.195e+03  -0.984 0.325288    
## YearBuilt            3.934e+02  8.052e+01   4.886 1.15e-06 ***
## YearRemodAdd         3.548e+02  6.840e+01   5.186 2.46e-07 ***
## OverallQual          2.863e+04  1.183e+03  24.206  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38890 on 1420 degrees of freedom
## Multiple R-squared:  0.7668, Adjusted R-squared:  0.7603 
## F-statistic: 119.7 on 39 and 1420 DF,  p-value: < 2.2e-16

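The model's fit does not improve noticeably: the adjusted \(R^2\) is essentially unchanged (0.7603 versus 0.7605).

Variable Selection Methods

There are many established variable selection and regularization methods, such as LASSO and Ridge. Here we use the lasso, a random forest and gradient boosting to select variables and improve model performance.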
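Because the two models are nested, one way to make this comparison formal is a partial F-test with anova(). The article does not do this; the sketch below only illustrates the idea on the built-in mtcars data, and the model names full and reduced are hypothetical.

```r
# Toy illustration on the built-in mtcars data, not the housing data
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: does dropping qsec significantly worsen the fit?
cmp <- anova(reduced, full)
print(cmp)

# For the models in this article the analogous call would be
# anova(lm.base, lm1)
```

When only one numeric term is dropped, as with OverallCond here, the p-value of this test equals the p-value of that term's t-test in the larger model's summary.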

# Load the package
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-1
# Prepare the data
formula <- as.formula( log(SalePrice)~ .-Id )

# model.matrix automatically expands categorical variables into dummy variables
x <- model.matrix(formula, train)
y <- log(train$SalePrice)

# Run the lasso with cross-validation
set.seed(999)
lm.lasso <- cv.glmnet(x, y, alpha=1)

# Plot the cross-validation curve
plot(lm.lasso)

# Get the coefficient of each variable
coef(lm.lasso, s = "lambda.min")
## 242 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)           7.314346e+00
## (Intercept)           .           
## MSSubClass           -3.089056e-04
## MSZoningFV            1.793795e-02
## MSZoningRH            .           
## MSZoningRL            4.132766e-02
## MSZoningRM            .           
## LotFrontage           .           
## LotArea               1.272041e-06
## StreetPave            4.921358e-02
## LotShapeIR2           1.071380e-02
## LotShapeIR3          -9.683435e-02
## LotShapeReg           .           
## LandContourHLS        .           
## LandContourLow        .           
## LandContourLvl        .           
## LotConfigCulDSac      3.543690e-02
## LotConfigFR2          .           
## LotConfigFR3          .           
## LotConfigInside       .           
## LandSlopeMod          .           
## LandSlopeSev          .           
## NeighborhoodBlueste   .           
## NeighborhoodBrDale   -1.493220e-02
## NeighborhoodBrkSide   .           
## NeighborhoodClearCr   3.747801e-02
## NeighborhoodCollgCr   .           
## NeighborhoodCrawfor   1.115702e-01
## NeighborhoodEdwards  -4.439143e-02
## NeighborhoodGilbert   .           
## NeighborhoodIDOTRR   -6.746018e-02
## NeighborhoodMeadowV  -7.649854e-02
## NeighborhoodMitchel   .           
## NeighborhoodNAmes     .           
## NeighborhoodNoRidge   5.440480e-02
## NeighborhoodNPkVill   .           
## NeighborhoodNridgHt   1.179915e-01
## NeighborhoodNWAmes    .           
## NeighborhoodOldTown  -2.833497e-02
## NeighborhoodSawyer    .           
## NeighborhoodSawyerW   .           
## NeighborhoodSomerst   6.349095e-02
## NeighborhoodStoneBr   1.204762e-01
## NeighborhoodSWISU     .           
## NeighborhoodTimber    2.260192e-03
## NeighborhoodVeenker   2.878460e-02
## Condition1Feedr      -1.568548e-02
## Condition1Norm        3.532074e-02
## Condition1PosA        .           
## Condition1PosN        .           
## Condition1RRAe       -2.434992e-02
## Condition1RRAn        .           
## Condition1RRNe        .           
## Condition1RRNn        .           
## Condition2Feedr       .           
## Condition2Norm        .           
## Condition2PosA        .           
## Condition2PosN       -5.253862e-01
## Condition2RRAe        .           
## Condition2RRAn        .           
## Condition2RRNn        .           
## BldgType2fmCon        .           
## BldgTypeDuplex        .           
## BldgTypeTwnhs        -7.598136e-02
## BldgTypeTwnhsE       -1.805548e-02
## HouseStyle1.5Unf      .           
## HouseStyle1Story      .           
## HouseStyle2.5Fin      .           
## HouseStyle2.5Unf      .           
## HouseStyle2Story      .           
## HouseStyleSFoyer      .           
## HouseStyleSLvl        .           
## OverallQual           6.845347e-02
## OverallCond           3.219936e-02
## YearBuilt             1.274539e-03
## YearRemodAdd          8.437361e-04
## RoofStyleGable       -5.642072e-03
## RoofStyleGambrel      .           
## RoofStyleHip          .           
## RoofStyleMansard      .           
## RoofStyleShed         .           
## RoofMatlCompShg       8.744853e-03
## RoofMatlMembran       .           
## RoofMatlMetal         .           
## RoofMatlRoll          .           
## RoofMatlTar&Grv       .           
## RoofMatlWdShake       .           
## RoofMatlWdShngl       7.458154e-02
## Exterior1stAsphShn    .           
## Exterior1stBrkComm   -1.345774e-01
## Exterior1stBrkFace    4.770749e-02
## Exterior1stCBlock     .           
## Exterior1stCemntBd    .           
## Exterior1stHdBoard   -8.584669e-03
## Exterior1stImStucc    .           
## Exterior1stMetalSd    .           
## Exterior1stPlywood    .           
## Exterior1stStone      .           
## Exterior1stStucco     .           
## Exterior1stVinylSd    .           
## Exterior1stWd Sdng   -6.333826e-03
## Exterior1stWdShing    .           
## Exterior2ndAsphShn    .           
## Exterior2ndBrk Cmn    .           
## Exterior2ndBrkFace    .           
## Exterior2ndCBlock     .           
## Exterior2ndCmentBd    1.903051e-03
## Exterior2ndHdBoard    .           
## Exterior2ndImStucc    .           
## Exterior2ndMetalSd    .           
## Exterior2ndOther      .           
## Exterior2ndPlywood    .           
## Exterior2ndStone      .           
## Exterior2ndStucco    -4.652564e-02
## Exterior2ndVinylSd    .           
## Exterior2ndWd Sdng    .           
## Exterior2ndWd Shng   -1.387069e-02
## MasVnrTypeBrkFace     .           
## MasVnrTypeNone        .           
## MasVnrTypeStone       .           
## MasVnrArea            .           
## ExterQualFa          -1.044328e-02
## ExterQualGd           .           
## ExterQualTA          -6.848606e-03
## ExterCondFa          -9.876932e-03
## ExterCondGd           .           
## ExterCondPo           .           
## ExterCondTA           4.288596e-03
## FoundationCBlock      .           
## FoundationPConc       2.610683e-02
## FoundationSlab       -2.033377e-03
## FoundationStone       .           
## FoundationWood        .           
## BsmtQualFa            .           
## BsmtQualGd            .           
## BsmtQualTA           -2.813238e-04
## BsmtQualNone         -3.399457e-03
## BsmtCondGd            .           
## BsmtCondPo            .           
## BsmtCondTA            3.624419e-03
## BsmtCondNone          .           
## BsmtExposureGd        4.858853e-02
## BsmtExposureMn        .           
## BsmtExposureNo       -6.556593e-03
## BsmtExposureNone     -3.795778e-02
## BsmtFinType1BLQ       .           
## BsmtFinType1GLQ       9.606824e-03
## BsmtFinType1LwQ       .           
## BsmtFinType1Rec       .           
## BsmtFinType1Unf      -3.321896e-02
## BsmtFinType1None      .           
## BsmtFinSF1            .           
## BsmtFinType2BLQ      -7.694026e-03
## BsmtFinType2GLQ       .           
## BsmtFinType2LwQ       .           
## BsmtFinType2Rec       .           
## BsmtFinType2Unf       .           
## BsmtFinType2None     -2.374994e-02
## BsmtFinSF2            .           
## BsmtUnfSF             .           
## TotalBsmtSF           2.725149e-05
## HeatingGasA           .           
## HeatingGasW           6.541262e-02
## HeatingGrav          -1.079210e-01
## HeatingOthW           .           
## HeatingWall           .           
## HeatingQCFa           .           
## HeatingQCGd          -2.803107e-03
## HeatingQCPo           .           
## HeatingQCTA          -1.893740e-02
## CentralAirY           6.505505e-02
## ElectricalFuseF       .           
## ElectricalFuseP       .           
## ElectricalMix         .           
## ElectricalSBrkr       .           
## X1stFlrSF             2.739845e-05
## X2ndFlrSF             .           
## LowQualFinSF          .           
## GrLivArea             1.960863e-04
## BsmtFullBath          4.369629e-02
## BsmtHalfBath          .           
## FullBath              2.498523e-02
## HalfBath              1.102765e-02
## BedroomAbvGr          .           
## KitchenAbvGr         -2.694865e-02
## KitchenQualFa         .           
## KitchenQualGd         .           
## KitchenQualTA        -7.346020e-03
## TotRmsAbvGrd          8.445023e-03
## FunctionalMaj2       -1.661537e-01
## FunctionalMin1        .           
## FunctionalMin2        .           
## FunctionalMod         .           
## FunctionalSev        -1.691674e-01
## FunctionalTyp         3.674019e-02
## Fireplaces            2.925835e-02
## GarageTypeAttchd      1.017935e-02
## GarageTypeBasment    -2.805227e-04
## GarageTypeBuiltIn     .           
## GarageTypeCarPort     .           
## GarageTypeDetchd      .           
## GarageTypeNone       -5.483381e-04
## GarageYrBlt           .           
## GarageFinishRFn       .           
## GarageFinishUnf      -3.452388e-03
## GarageFinishNone     -6.132096e-05
## GarageCars            5.606530e-02
## GarageArea            3.428652e-05
## GarageQualFa         -9.605669e-03
## GarageQualGd          1.663474e-02
## GarageQualPo          .           
## GarageQualTA          .           
## GarageQualNone       -2.467408e-04
## GarageCondFa         -1.397724e-02
## GarageCondGd          .           
## GarageCondPo          .           
## GarageCondTA          .           
## GarageCondNone       -2.685390e-06
## PavedDriveP           .           
## PavedDriveY           1.160351e-02
## WoodDeckSF            9.033916e-05
## OpenPorchSF           .           
## EnclosedPorch         2.864247e-06
## X3SsnPorch            1.227765e-05
## ScreenPorch           2.102909e-04
## PoolArea             -6.262337e-05
## MiscVal               .           
## MoSold                .           
## YrSold               -4.603035e-04
## SaleTypeCon           .           
## SaleTypeConLD         .           
## SaleTypeConLI         .           
## SaleTypeConLw         .           
## SaleTypeCWD           .           
## SaleTypeNew           7.765443e-02
## SaleTypeOth           .           
## SaleTypeWD            .           
## SaleConditionAdjLand  .           
## SaleConditionAlloca   .           
## SaleConditionFamily   .           
## SaleConditionNormal   3.497182e-02
## SaleConditionPartial  .
# SalePrice is NA in the test set, so model.matrix would drop those rows; set a placeholder value
test$SalePrice <- 1
test_x <- model.matrix(formula, test)

# Predict and write out the results
lm.pred <- predict(lm.lasso, newx = test_x, s = "lambda.min")
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "res_lasso.csv", row.names = FALSE)
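The coefficient table above lists (Intercept) twice. model.matrix already adds an intercept column, and glmnet then fits its own intercept, so the column coming from model.matrix is redundant and its coefficient is shown as ".". The following base R sketch, on made-up data with hypothetical names toy, x_full and x_noint, shows the dummy expansion and how that column could be dropped.

```r
# Made-up data with one numeric and one factor predictor
toy <- data.frame(
  SalePrice = c(100000, 150000, 200000, 120000),
  LotArea   = c(8000, 9500, 12000, 7000),
  BldgType  = factor(c("1Fam", "Twnhs", "1Fam", "Duplex"))
)

# model.matrix expands the factor into dummy columns and adds "(Intercept)"
x_full <- model.matrix(log(SalePrice) ~ ., toy)
colnames(x_full)

# Dropping the first column leaves only the predictors; a model that fits
# its own intercept (as glmnet does by default) needs nothing more
x_noint <- x_full[, -1]
colnames(x_noint)
```

The first factor level (1Fam here) has no column of its own; it is the baseline against which the other levels are measured.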

Regression modeling with a random forest.

library(randomForest)
library(caret)

# Set the seed
set.seed(223)

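# Set the resampling controls
# method = "cv" -- k-fold cross-validation
# number -- the k in k-fold cross-validation; number = 10 means 10-fold
# repeats -- number of repetitions; only takes effect with method = "repeatedcv"
# verboseIter -- print the training log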
ctrl <- trainControl(method = "cv", number = 10, repeats = 20, verboseIter = TRUE)

# Train the model
lm.rf <- train(log(SalePrice)~ .-Id, data = train,  method = "rf",  trControl = ctrl,  tuneLength = 3)

# Write out the results
#write_res(lm.rf, test, 'rf')

# Predict and write out the results
lm.pred <- predict(lm.rf, test)
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "res_rf.csv", row.names = FALSE)
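What the resampling set up by trainControl does can be sketched in a few lines of base R. This is a toy illustration on the built-in mtcars data with a plain linear model, not caret's implementation; the names k, folds, fold_rmse and cv_rmse are hypothetical.

```r
set.seed(223)
k <- 5

# Assign every row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(log(mpg) ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  sqrt(mean((log(mtcars$mpg[folds == i]) - pred)^2))
})

# The cross-validated error is the average over the folds
cv_rmse <- mean(fold_rmse)
cv_rmse
```

caret runs this kind of loop once per candidate tuning value (three values of mtry with tuneLength = 3) and keeps the value with the best average score.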

Building a model with gradient boosting (gbm).

lm.gbm <- train(log(SalePrice)~ .-Id, data = train,  method = "gbm",  trControl = ctrl)

# Predict and write out the results
lm.pred <- predict(lm.gbm, test)
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "res_gbm.csv", row.names = FALSE)

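Finally, we obtain the prediction residuals of each method and compute each method's RES value from them; comparing these values identifies the relatively best method. To improve performance further, model ensembling methods such as bagging and stacking can be used.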
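The comparison itself is not shown in the article. A minimal sketch of how it could look follows, assuming the score is the root mean squared error on the log scale (the metric this Kaggle competition uses) and that the simplest ensemble is an average of log predictions. All numbers are made up for illustration, and the names rmse_log, pred_a, pred_b and blend are hypothetical.

```r
# RMSE between log predictions and log actual prices
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up hold-out prices and two made-up sets of model predictions
actual <- c(200000, 150000, 300000)
pred_a <- c(210000, 140000, 310000)
pred_b <- c(190000, 155000, 280000)

# Simplest possible blend: average the two models on the log scale
blend <- exp((log(pred_a) + log(pred_b)) / 2)

# One score per candidate; lower is better
scores <- c(a = rmse_log(actual, pred_a),
            b = rmse_log(actual, pred_b),
            blend = rmse_log(actual, blend))
scores
```

Scores like these are only meaningful on rows the models did not see during training, such as a validation split held back from train, because the test set has no SalePrice to compare against.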