
for dataset in all_data:
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in all_data:
    dataset['Age_bin'] = pd.cut(dataset['Age'], bins=[0, 14, 20, 40, 120],
                                labels=['Children', 'Teenage', 'Adult', 'Elder'])

for dataset in all_data:
    dataset['Fare_bin'] = pd.cut(dataset['Fare'], bins=[0, 7.91, 14.45, 31, 120],
                                 labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

traindf = train_df
drop_column = ['Age', 'Fare', 'Name', 'Ticket']
traindf.drop(drop_column, axis=1, inplace=True)

drop_column = ['PassengerId']
traindf.drop(drop_column, axis=1, inplace=True)

# One-hot encode the categorical features; the exact column and prefix lists
# were lost in extraction, so the lists below are reconstructed from the
# features engineered above.
traindf = pd.get_dummies(traindf, columns=['Sex', 'Title', 'Age_bin', 'Embarked', 'Fare_bin'],
                         prefix=['Sex', 'Title', 'Age_type', 'Em_type', 'Fare_type'])
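Before moving on, it is worth a quick sanity check that the one-hot encoding produced the columns you expect; a small sketch:

# Inspect the encoded training frame: every categorical feature should now
# appear as a set of 0/1 dummy columns.
print(traindf.columns.tolist())
traindf.head()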

Now you have created all of the features. Next, let's look at how these features correlate with one another:

sns.heatmap(traindf.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)  # traindf.corr() --> correlation matrix
fig = plt.gcf()
fig.set_size_inches(20, 12)
plt.show()

A correlation value close to 1 means strongly positively correlated, and -1 means strongly negatively correlated. For example, male sex and female sex are negatively correlated, since every passenger must be identified as one sex (or the other). You can also see that, apart from the ones produced by feature engineering, no two features are highly correlated. This suggests we did things right.

What happens if some factors are highly correlated with each other? We can delete one of them: the extra column gives the system no new information, because the two are effectively identical.
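If you want to automate that pruning, here is a minimal sketch; the 0.95 threshold is an assumption for illustration, not a value from the original:

import numpy as np

# Absolute pairwise correlations, keeping only the upper triangle so each
# pair of features is inspected exactly once.
corr = traindf.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Columns correlated above the threshold with some earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
# traindf = traindf.drop(columns=to_drop)  # uncomment to actually drop them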

Machine Learning with Python

Now we have arrived at the climax of this tutorial: machine learning modeling.

from sklearn.model_selection import train_test_split   # for splitting the data
from sklearn.metrics import accuracy_score              # for accuracy_score
from sklearn.model_selection import KFold               # for K-fold cross validation
from sklearn.model_selection import cross_val_score     # score evaluation
from sklearn.model_selection import cross_val_predict   # prediction
from sklearn.metrics import confusion_matrix            # for confusion matrix

# 'Survived' is the target we want to predict
all_features = traindf.drop('Survived', axis=1)
Targeted_feature = traindf['Survived']
X_train, X_test, y_train, y_test = train_test_split(all_features, Targeted_feature,
                                                    test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

The Scikit-Learn library offers a wide range of algorithms for you to choose from:

Logistic Regression
Random Forest
Support Vector Machines
K-Nearest Neighbors
Naive Bayes
Decision Trees
AdaBoost
LDA
Gradient Boosting

You might feel overwhelmed trying to work out which is which. Don't worry: just treat them as black boxes and pick the one that performs best. (I will write a full article later on how to choose among these algorithms.)
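As a quick way of picking the best performer, you can score several candidates under the same cross-validation setup and compare; a minimal sketch, where the model choices and their default hyperparameters are assumptions for illustration:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Candidate models, all evaluated with the same 10-fold cross validation.
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=1),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, all_features, Targeted_feature, cv=10, scoring='accuracy')
    print(f'{name}: {scores.mean() * 100:.2f}%')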

Let's take my favorite, the random forest algorithm, as an example:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(criterion='gini', n_estimators=700,
                               min_samples_split=10, min_samples_leaf=1,
                               max_features='sqrt',  # 'auto' in older scikit-learn; equivalent to 'sqrt' for classifiers
                               oob_score=True, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

prediction_rm = model.predict(X_test)

print('--------------The Accuracy of the model----------------------------')
print('The accuracy of the Random Forest Classifier is',
      round(accuracy_score(prediction_rm, y_test) * 100, 2))

kfold = KFold(n_splits=10, shuffle=True, random_state=22)  # k=10, split the data into 10 equal parts
result_rm = cross_val_score(model, all_features, Targeted_feature, cv=kfold, scoring='accuracy')
print('The cross validated score for Random Forest Classifier is:',
      round(result_rm.mean() * 100, 2))
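Since the model was built with oob_score=True, the forest also carries a free out-of-bag estimate of its accuracy, which makes a handy third sanity check:

# Each tree is evaluated on the bootstrap samples it never saw during
# training, giving a built-in estimate of generalization accuracy.
print('The out-of-bag score of the Random Forest Classifier is',
      round(model.oob_score_ * 100, 2))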

y_pred = cross_val_predict(model, all_features, Targeted_feature, cv=10)
sns.heatmap(confusion_matrix(Targeted_feature, y_pred), annot=True, fmt='3.0f',
            cmap='RdYlGn')  # the original colormap name was lost; reusing the one from the correlation heatmap
plt.title('Confusion_matrix', y=1.05, size=15)
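From the same cross-validated predictions you can also read off precision, recall, and F1 for each class; a short sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1, computed from the same
# cross-validated predictions as the confusion matrix above.
print(classification_report(Targeted_feature, y_pred))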