Discovering Data
  • Home
  • Blog
  • become_a_data_scientist
  • Code-examples

#100DaysOfDataScience

Day 2 - comparing algorithms, Saturday July 7, 2018

7/7/2018


 
The code from day one can also be used with the Random Forest algorithm:

# ske is assumed to be the sklearn.ensemble alias used on day one
import sklearn.ensemble as ske

clf_rf = ske.RandomForestClassifier(n_estimators=50)
clf_rf = clf_rf.fit(X_train, y_train)
print(clf_rf.score(X_test, y_test))

This produces an accuracy score of around 55%.

Another algorithm is Gradient Boosting:

clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
clf_gb = clf_gb.fit(X_train, y_train)
print(clf_gb.score(X_test, y_test))

This gives an accuracy of around 59%.
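Since both classifiers share the same fit/score interface, the comparison above can be written as a single loop. This is only a rough sketch: the data here is a synthetic stand-in created with make_classification (an assumption, not the day 1 dataset), so the scores it prints will not match the 55% and 59% quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the day 1 data (an assumption, not the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The two algorithms from this post, with the same settings.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

Fixing random_state makes the run repeatable, which helps when the differences between algorithms are only a few percentage points.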

The Random Forest has long been popular because it is easy to use and can be used for both regression and classification problems. The problem described on day 1 was a classification problem, where the goal is to predict a category. Regression problems involve predicting a numeric value based on past patterns. A forest in this case is an ensemble of decision trees. Single decision trees are notorious for overfitting; generating many trees and then averaging over their predictions can reduce the overfitting problem.
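One way to see the overfitting point is to compare training and test accuracy for a single tree and for a forest. The sketch below uses noisy synthetic data (an assumption, not the day 1 dataset). A large gap between the training and test score is the usual sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomly reassigns a fraction of the labels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted single tree versus an ensemble of 50 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Store (training accuracy, test accuracy) for each model.
results = {}
for name, model in [("single tree", tree), ("random forest", forest)]:
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    train_acc, test_acc = results[name]
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f}")
```

The single tree is grown until it fits the training data almost perfectly, noise included, so its training score says little about how it will do on unseen data.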
Hyperparameters are settings chosen before training that can be adjusted to improve accuracy. For example, with the Random Forest you can increase the n_estimators parameter, which controls the number of trees built. K-nearest neighbours did not perform well on this problem (average accuracy = 49%). KNN does not work well with categorical data, for example gender, because it relies on distances between data points. It also performs less well as the number of data features increases.
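To get a feel for the n_estimators hyperparameter, you can try a few values and compare the cross-validated accuracy. This is a minimal sketch on synthetic data (an assumption, not the day 1 dataset); the values tried and the choice of 5-fold cross-validation are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the real dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Try several forest sizes and record the mean 5-fold accuracy for each.
cv_scores = {}
for n_trees in (5, 25, 50, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    cv_scores[n_trees] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"n_estimators={n_trees}: mean accuracy={cv_scores[n_trees]:.2f}")
```

Cross-validation is used here instead of a single train/test split so that the comparison between settings depends less on one lucky or unlucky split.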