Learning note from DataCamp: tree-based models

These are models whose results we can interpret (interpretability).

Tree-based learning models, ordered from easy --> complicated

Update: 21-Mar-2021

When working with categorical data, we have to convert it to a factor so that the values can be used in statistical computations.

Experiment with imbalanced data, PubChem fingerprint

print(model)

Call:
 randomForest(formula = factor(Name) ~ ., data = df3)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 20    ## default for classification is floor(sqrt(number of features))

        OOB estimate of  error rate: 5.88%  ## OOB = out of bag
Confusion matrix:                           ## this is based on the out-of-bag samples
         active inactive class.error
active     1558       23  0.01454775
inactive     85      172  0.33073930
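The error figures in this output can be re-derived from the four counts in the confusion matrix. The sketch below is only arithmetic on the printed numbers; it assumes rows are the true classes and columns the OOB predictions (which matches the printed class.error values), and the variable names cm, class_error and oob_error are made up for illustration.

```r
# Confusion matrix copied from the printed output
# (assumption: rows = true class, columns = OOB-predicted class).
cm <- matrix(c(1558,  23,
                 85, 172),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("active", "inactive"),
                             predicted = c("active", "inactive")))

# Per-class error: misclassified samples in a row / row total
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all off-diagonal counts / all samples
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 8))        # 0.01454775 and 0.33073930
print(round(100 * oob_error, 2))    # 5.88
```

The overall 5.88% looks good mainly because 1581 of the 1838 samples are active; about one third of the 257 inactive samples are misclassified, which is the effect of the class imbalance.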

What the output above means:

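Machine learning - randomForest --> first convert Name, which is a character column, to a factor so that it is treated as categorical data. Each unique string becomes its own level, numbered in sorted order, so a typo creates an extra class: with the values active, inactive and actives, the levels would be active = 1, actives = 2, inactive = 3. So do not mistype the labels.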
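A small base-R sketch of the point above, showing how a single typo becomes a third level. The vector name and its values are hypothetical, not the real df3 data.

```r
# Hypothetical label column containing one typo ("actives")
name <- c("active", "inactive", "active", "actives", "inactive")

f <- factor(name)
print(levels(f))       # "active" "actives" "inactive"
print(as.integer(f))   # 1 3 1 2 3
print(table(f))        # a quick way to spot unexpected levels

# Fix the typo, then build the factor with the two intended levels
name_clean <- ifelse(name == "actives", "active", name)
f_clean <- factor(name_clean, levels = c("active", "inactive"))
print(nlevels(f_clean))   # 2
```

Checking table() or levels() on the label column before fitting is a cheap way to catch this kind of mistake.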

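Number of trees (ntree): the number of trees grown in the forest. Each tree is built independently on its own bootstrap sample, and the predictions of all trees are combined by majority vote.
No. of variables tried at each split (mtry): a PubChem fingerprint has 881 features in total; at each split, 20 features are picked at random as candidates for the split decision. Note that the classification default is floor(sqrt(number of features)), which would be 29 for all 881 features, so a value of 20 suggests that df3 contained only about 400 to 440 feature columns.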
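The mtry value can be sanity-checked with a few lines of base R. This is a sketch of the arithmetic only, assuming the classification default floor(sqrt(p)) applied (the Call shows no mtry argument); the names p, default_mtry, p_range and candidates are illustrative.

```r
p <- 881                           # features in a full PubChem fingerprint
default_mtry <- floor(sqrt(p))     # classification default for p features
print(default_mtry)                # 29

# Which feature counts would give a default mtry of 20?
counts  <- 1:881
p_range <- range(counts[floor(sqrt(counts)) == 20])
print(p_range)                     # 400 440

# What "trying 20 variables at a split" means: draw 20 distinct
# feature indices at random as the candidates for that split.
set.seed(1)                        # arbitrary seed, for repeatability only
candidates <- sample(p_range[2], 20)
print(sort(candidates))
```

A fresh random draw like the last one is made at every split of every tree, which is what decorrelates the trees from each other.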

Random Records