Lecture 9: Binary Outcome Models

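education (years of schooling). Consider the following model:

$$\text{work}_i = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{married}_i + \beta_3\,\text{children}_i + \beta_4\,\text{education}_i + \varepsilon_i$$

As a benchmark, first estimate a linear probability model (LPM) by OLS:

use womenwk1, clear (the original dataset is womenwk.dta)

reg work age married children education

probit work age married children education, nolog

mfx (computes the marginal effects of the probit model at the sample means, for comparison with the OLS coefficients)

estat classification (computes the percentage of correct predictions)

logit work age married children education, nolog

mfx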

estat classification
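To make the comparison between the OLS coefficients and the mfx output concrete, the standard marginal-effect formulas for a continuous regressor $x_k$, evaluated at the sample means $\bar{x}$, can be written as follows. The notation ($\phi$ for the standard normal density, $\Lambda$ for the logistic CDF) is an assumption introduced here and does not come from the lecture. For dummy regressors such as married, mfx reports by default the discrete change from 0 to 1 rather than this derivative.

```latex
\begin{aligned}
\text{LPM:}    \quad & \frac{\partial \Pr(y=1 \mid x)}{\partial x_k} = \beta_k \\
\text{Probit:} \quad & \left.\frac{\partial \Pr(y=1 \mid x)}{\partial x_k}\right|_{x=\bar{x}} = \phi(\bar{x}'\beta)\,\beta_k \\
\text{Logit:}  \quad & \left.\frac{\partial \Pr(y=1 \mid x)}{\partial x_k}\right|_{x=\bar{x}} = \Lambda(\bar{x}'\beta)\,\bigl[1-\Lambda(\bar{x}'\beta)\bigr]\,\beta_k
\end{aligned}
```

Because $\phi(\cdot) \le 0.4$ and $\Lambda(1-\Lambda) \le 0.25$, the raw probit and logit coefficients are not directly comparable with the OLS coefficients; the marginal effects are.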

hetprob work age married children education, het(age married children education) nolog (the p-value is 0.78, so the null hypothesis of homoskedasticity is not rejected)
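For reference, the heteroskedastic probit model that hetprob fits can be sketched as below. The symbols $x_i$, $z_i$ and $\gamma$ are notation assumed here; $z_i$ stands for the variables listed in het(), which in this example are the same four regressors.

```latex
\Pr(\text{work}_i = 1 \mid x_i, z_i) = \Phi\!\left(\frac{x_i'\beta}{\exp(z_i'\gamma)}\right),
\qquad H_0:\ \gamma = 0 \quad \text{(homoskedasticity)}
```

Under $H_0$ the denominator equals 1 and the model reduces to the ordinary probit, which is why a large p-value for the test of $\gamma = 0$ supports using plain probit.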

generate age2=age*age

generate agemari=age*married

generate agechr=age*children

quietly logit work age married children education age2 agemari agechr

test age2 agemari agechr (the null hypothesis that the added terms are jointly zero is not rejected)

quietly logit work age married children education

estimates store blogit

quietly probit work age married children education

estimates store bprobit

quietly regress work age married children education

estimates store bols

quietly logit work age married children education, vce(robust)

estimates store blogitr

quietly probit work age married children education, vce(robust)


estimates store bprobitr

quietly regress work age married children education, vce(robust)

estimates store bolsr

estimates table blogit blogitr bprobit bprobitr bols bolsr, t b(%7.3f) stfmt(%8.2f)

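Example: Space Shuttle Data

use shuttle, clear (data on 25 U.S. space shuttle flights, including Challenger's final launch in 1986, which ended in failure)

describe

(distress: damage at one or more booster joints; temp: booster joint temperature; date: elapsed days since January 1, 1960)

generate date=mdy(month,day,year)

tabulate distress

tabulate distress, nolabel

generate any=distress

replace any=1 if distress==2 (creates the dummy variable any: 0 means no damage, 1 means damage at one or more joints)

logistic any date (logistic reports odds ratios, e^b. An odds ratio is the factor by which the odds of the event (y=1) change when the independent variable increases by one unit, holding any other independent variables constant)

predict phat (obtains the predicted probabilities)

label variable phat "Predicted P(distress>=1)"

graph twoway connected phat date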

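estat classification (uses a probability of 0.5 as the default cutoff). The symbols in the output have the following meanings:

D: the event of interest actually occurred in the observation (y=1). In this example, D means that joint damage occurred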

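~D: the event of interest did not occur in the observation (y=0). In this example, ~D means that no joint damage occurred

+: the model's predicted probability is greater than or equal to the cutoff. In this example, + means that the predicted probability of damage is 0.5 or higher

-: the model's predicted probability is below the cutoff.

Pr(D|+) = 12/16 = 75% (correct prediction)

Pr(~D|+) = 4/16 = 25%

Pr(~D|-) = 5/7 = 71.43% (correct prediction)

Pr(D|-) = 2/7 = 28.57%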
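The percentages above can be checked by hand from the four cell counts of the classification table. The following is a minimal Python sketch, independent of Stata, that recomputes them. The function and variable names are made up for illustration; the counts 12, 4, 2 and 5 are the ones quoted above. The sensitivity, specificity and overall rate are derived from those same counts.

```python
# Cell counts read off the classification table quoted above (cutoff 0.5).
true_pos = 12   # D and +  : damage occurred, model predicted damage
false_pos = 4   # ~D and + : no damage, model predicted damage
false_neg = 2   # D and -  : damage occurred, model predicted none
true_neg = 5    # ~D and - : no damage, model predicted none


def classification_summary(tp, fp, fn, tn):
    """Return the usual summary rates of a 2x2 classification table."""
    total = tp + fp + fn + tn
    return {
        "Pr(D|+)": tp / (tp + fp),
        "Pr(~D|-)": tn / (tn + fn),
        "sensitivity Pr(+|D)": tp / (tp + fn),
        "specificity Pr(-|~D)": tn / (tn + fp),
        "correctly classified": (tp + tn) / total,
    }


summary = classification_summary(true_pos, false_pos, false_neg, true_neg)
for name, value in summary.items():
    print(f"{name}: {value:.2%}")
```

The overall rate counts the two kinds of correct prediction together: (12 + 5) out of the 23 classified observations.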

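logistic any date temp (adds the booster joint temperature temp)

According to the fitted model, each one-degree increase in joint temperature multiplies the odds of booster joint damage by 0.84; in other words, each additional degree reduces the odds of damage by 16%. The chi-squared test is more conclusive.

estat classification (the correct classification rate rises to 78.26%)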
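The arithmetic behind the odds-ratio interpretation can be sketched in a few lines of Python. This is an illustration only: the value 0.84 is the rounded odds ratio quoted above, the names are invented here, and the ten-degree change is an arbitrary example rather than a figure from the lecture.

```python
import math

odds_ratio_temp = 0.84  # rounded odds ratio for temp quoted in the text

# The logit coefficient is the natural log of the odds ratio.
b_temp = math.log(odds_ratio_temp)

# Percentage change in the odds for a one-degree increase in temp.
pct_change_per_degree = (odds_ratio_temp - 1) * 100


def odds_multiplier(delta_temp, odds_ratio=odds_ratio_temp):
    """Factor applied to the odds when temp changes by delta_temp degrees."""
    return odds_ratio ** delta_temp


print(f"b_temp = {b_temp:.4f}")
print(f"change in odds per degree = {pct_change_per_degree:.0f}%")
print(f"odds multiplier for +10 degrees = {odds_multiplier(10):.3f}")
```

Odds ratios compound multiplicatively, so a ten-degree increase multiplies the odds by 0.84 raised to the tenth power, not by 10 times 0.84.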

III. Conditional Effect Plots (conditional effect plots help show what a logistic model implies in terms of probabilities)

quietly logit any date temp

generate L1=_b[_cons]+_b[date]*8569+_b[temp]*temp

generate phat1=1/(1+exp(-L1))

(the 25th percentile of date is 8569; L1 is the predicted logit; phat1 is the corresponding predicted probability that distress>=1)

label variable phat1 "P(distress>=1) | date=8569"

generate L2=_b[_cons]+_b[date]*9341+_b[temp]*temp

generate phat2=1/(1+exp(-L2)) (the 75th percentile of date is 9341)

label variable phat2 "P(distress>=1) | date=9341"


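graph twoway mspline phat1 temp, bands(50) || mspline phat2 temp, bands(50) ||, ytitle("Probability of thermal distress") legend(label(1 "June 1983") label(2 "July 1985"))

(Challenger's launch temperature was 31, which places it at the top left of the plot. The analysis predicts that booster joint damage was almost certain)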
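The construction of phat1 and phat2 amounts to holding date fixed and passing the linear predictor through the inverse logit. The Python sketch below shows the same calculation outside Stata. The coefficient values are HYPOTHETICAL placeholders chosen only so that the code runs (the temp coefficient merely mirrors the 0.84 odds ratio quoted earlier); they are not the estimates from the shuttle data, and the function name is invented here.

```python
import math


def predicted_probability(cons, b_date, b_temp, date, temp):
    """Inverse logit of the linear predictor, as in phat1 and phat2."""
    logit = cons + b_date * date + b_temp * temp
    return 1.0 / (1.0 + math.exp(-logit))


# HYPOTHETICAL coefficients for illustration only, NOT the estimates
# from the shuttle data. Replace them with _b[_cons], _b[date] and
# _b[temp] from your own logit output.
cons, b_date, b_temp = -5.0, 0.002, math.log(0.84)

for temp in (31, 50, 70, 80):
    # date fixed at its 25th percentile (8569) and 75th percentile (9341)
    p_early = predicted_probability(cons, b_date, b_temp, 8569, temp)
    p_late = predicted_probability(cons, b_date, b_temp, 9341, temp)
    print(f"temp={temp}: P|date=8569 = {p_early:.3f}, P|date=9341 = {p_late:.3f}")
```

With a negative temp coefficient each curve falls as temperature rises, and with a positive date coefficient the later-date curve lies above the earlier one, which is the pattern the two mspline curves display.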

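IV. Diagnostic Statistics and Plots (not covered in lecture)

quietly logistic any date temp

predict phat3

label variable phat3 "Predicted probability"

predict dx2, dx2

label variable dx2 "Change in Pearson chi-squared"

predict db, dbeta

label variable db "Influence"

predict dd, ddeviance

label variable dd "Change in deviance"

graph twoway scatter dx2 phat3, mlabel(flight)

(plots the change in Pearson chi-squared against the predicted probability of damage; the plot highlights two poorly fit observations)

graph twoway scatter dx2 phat3 [aweight=db], msymbol(oh)

(the size of each plotting symbol is proportional to the observation's influence. The weighted plot reveals that the two worst-fit observations are also the most influential)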

list flight any date temp dx2 phat3 if dx2>5

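Summary: observations that are both poorly fit and influential deserve special attention, because they contradict the main pattern of the data and also pull the model estimates away from that pattern. A more thorough response is to ask why these outliers are unusual. Seeking the answer may lead the researcher to discover previously overlooked variables or to specify the model in a different way.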


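V. Logistic Regression for an Ordered Multi-Category Y (not covered in lecture)

logit and logistic can only fit models whose Y variable has two categories {0,1}. ologit fits an ordered logistic regression, in which y is an ordinal variable and larger values represent "higher" categories, for example {1="poor", 2="fair", 3="good"}.

use shuttle, clear (data on 25 U.S. space shuttle flights)

ologit distress date temp

predict none onetwo threeplus

list flight none onetwo threeplus if flight==25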
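The three variables created by predict are the predicted probabilities of the three distress categories. As a sketch of how an ordered logit produces them, with $\Lambda$ the logistic CDF and $\kappa_1 < \kappa_2$ the two estimated cutpoints (this notation is assumed here, not taken from the lecture):

```latex
\begin{aligned}
x_i'\beta &= \beta_{date}\,\text{date}_i + \beta_{temp}\,\text{temp}_i \\
\Pr(\text{none})      &= \Lambda(\kappa_1 - x_i'\beta) \\
\Pr(\text{onetwo})    &= \Lambda(\kappa_2 - x_i'\beta) - \Lambda(\kappa_1 - x_i'\beta) \\
\Pr(\text{threeplus}) &= 1 - \Lambda(\kappa_2 - x_i'\beta)
\end{aligned}
```

The three probabilities sum to 1 for every flight, and the model has no separate constant because the cutpoints play that role.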
