    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# Bin ages into labeled groups
for dataset in all_data:
    dataset['Age_bin'] = pd.cut(dataset['Age'], bins=[0, 14, 20, 40, 120], labels=['Children', 'Teenage', 'Adult', 'Elder'])

# Bin fares into labeled groups
for dataset in all_data:
    dataset['Fare_bin'] = pd.cut(dataset['Fare'], bins=[0, 7.91, 14.45, 31, 120], labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

# Drop the raw columns that the engineered features replace.
# Iterating over a DataFrame yields column names, not frames, so drop directly instead of looping.
traindf = train_df
drop_column = ['Age', 'Fare', 'Name', 'Ticket']
traindf.drop(drop_column, axis=1, inplace=True)

drop_column = ['PassengerId']
traindf.drop(drop_column, axis=1, inplace=True)

# One-hot encode the categorical columns
traindf = pd.get_dummies(traindf, columns=[…], prefix=[…])
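To see how the `pd.cut` binning and `pd.get_dummies` encoding used above behave, here is a minimal sketch on a small made-up frame. The toy values and the `Fare_type` prefix are assumptions for illustration only; they are not taken from the Titanic data or from the tutorial's own column lists.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "Fare": [0.0, 7.25, 14.45, 30.0, 512.33],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Same bin edges and labels as the Fare_bin step above.
toy["Fare_bin"] = pd.cut(
    toy["Fare"],
    bins=[0, 7.91, 14.45, 31, 120],
    labels=["Low_fare", "median_fare", "Average_fare", "high_fare"],
)

# Intervals are right-closed by default: (0, 7.91], (7.91, 14.45], ...
# so a fare of 0.0 and any fare above 120 fall outside every bin and become NaN.
print(toy)

# One-hot encode the categorical columns: each category becomes its own column.
# The prefix names are illustrative assumptions.
encoded = pd.get_dummies(toy, columns=["Sex", "Fare_bin"], prefix=["Sex", "Fare_type"])
print(encoded.columns.tolist())
```

With these bin edges, values outside the range (0, 120] end up as missing values, so it is worth checking `isna()` on the binned column before encoding.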
You have now finished creating all of the features. Next, let's look at how these features correlate with one another:
sns.heatmap(traindf.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)  # traindf.corr() --> correlation matrix
fig = plt.gcf()
fig.set_size_inches(20, 12)
plt.show()
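A correlation value close to 1 means a strong positive correlation, and a value close to -1 means a strong negative correlation. For example, the male and female sex columns are negatively correlated, because every passenger has to be identified as one sex or the other. You can also see that, apart from the columns created through feature engineering, no two features are highly correlated. This suggests we did things right.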
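The male/female example can be checked directly. The sketch below uses a small made-up frame (the values are assumptions, not Titanic data) and shows that the two dummy columns produced from one binary feature have a correlation of -1.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 3, 1, 2],
})

# dtype=int keeps the dummy columns numeric, like the other features.
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)

corr = encoded.corr()

# Sex_male is always 1 - Sex_female, so the pair is perfectly negatively correlated.
print(corr.loc["Sex_male", "Sex_female"])
```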
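What if some features are highly correlated? We can delete one of them: the second column gives the system no new information, because the two carry the same information.

Machine Learning with Python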
We have now reached the climax of this tutorial: machine learning modeling.
from sklearn.model_selection import train_test_split  # for splitting the data
from sklearn.metrics import accuracy_score  # for accuracy_score
from sklearn.model_selection import KFold  # for K-fold cross validation
from sklearn.model_selection import cross_val_score  # score evaluation
from sklearn.model_selection import cross_val_predict  # prediction
from sklearn.metrics import confusion_matrix  # for confusion matrix

all_features = traindf.drop(…)
Targeted_feature = traindf[…]
X_train, X_test, y_train, y_test = train_test_split(all_features, Targeted_feature, test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Sc