Task2 Construct a classification tree
Split the data into training (70%) and testing (30%) sets.
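A minimal sketch of a 70/30 split, using only the Python standard library. The function name `train_test_split`, the seed, and the toy `rows` list are assumptions made for illustration and are not part of the exam materials.

```python
import random

def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))                    # stand-in for 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed matters in practice: it lets the same split be reproduced when models are compared later.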
Alter the minbucket, cp, and maxdepth parameters. Each of these parameters can be used to prevent overfitting. The "minbucket" parameter sets the smallest size for any leaf node in the tree. The "cp" parameter sets the minimum improvement needed in order to make a split. The "maxdepth" parameter sets the maximum number of levels for the splits.
Estimate the cp parameter using cross-validation. This involves dividing the training data up into folds, training the model on all but one of the folds, and measuring the performance on the remaining fold. This process is repeated, holding out each fold in turn, to develop a distribution of performance values for a given cp value. The cp value that yields the best accuracy is then selected.
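The cross-validation procedure can be sketched roughly as follows. To keep the example self-contained, the "tree" here is a hypothetical depth-1 stump on a single feature, and `cp` is treated simply as the minimum drop in misclassification rate required to split. That is a simplification and not the exact definition used by any particular tree library. All function names, the candidate cp values, and the synthetic data are assumptions.

```python
import random

def majority(labels):
    """Most common 0/1 label (ties go to 1)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0

def fit_stump(train, cp):
    """Fit a depth-1 tree on (x, y) pairs; split only if error drops by >= cp."""
    n = len(train)
    ys = [y for _, y in train]
    root_pred = majority(ys)
    root_err = sum(y != root_pred for y in ys) / n
    best = None                                  # (error, threshold, left, right)
    xs = sorted({x for x, _ in train})
    for lo, hi in zip(xs, xs[1:]):               # midpoints between distinct x values
        t = (lo + hi) / 2
        left = [y for x, y in train if x <= t]
        right = [y for x, y in train if x > t]
        lp, rp = majority(left), majority(right)
        err = (sum(y != lp for y in left) + sum(y != rp for y in right)) / n
        if best is None or err < best[0]:
            best = (err, t, lp, rp)
    if best is None or root_err - best[0] < cp:
        return lambda x: root_pred               # improvement too small: single leaf
    _, t, lp, rp = best
    return lambda x: lp if x <= t else rp

def cv_accuracy(rows, cp, k=5):
    """Mean held-out accuracy over k folds (assumes len(rows) >= k)."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit_stump(train, cp)
        scores.append(sum(model(x) == y for x, y in held_out) / len(held_out))
    return sum(scores) / k

# Synthetic data: y depends on x > 0.5, with about 10% of labels flipped.
rng = random.Random(0)
rows = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    rows.append((x, y))

candidates = [0.0, 0.01, 0.1, 0.6]
scores = {cp: cv_accuracy(rows, cp) for cp in candidates}
best_cp = max(scores, key=scores.get)            # cp with the best mean accuracy
print(scores, best_cp)
```

A cp of 0.6 can never be met here, because the root error of a two-class problem is at most 0.5, so that candidate always yields a single-leaf tree and serves as a baseline.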
Task3 Construct a random forest
Generally, variables that are used to make splits more frequently and earlier in the trees in the random forest are determined to be more important.
Task4 Consider a boosted tree
Random forests use bagging to produce many independent trees. Each tree is built from a bootstrap sample of the data, and when a split is to be considered, only a randomly selected subset of variables is considered. This approach is designed to reduce overfitting, which is likely when a single tree is used.
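The two sources of randomness in a random forest (bootstrap sampling of rows and random selection of candidate variables at each split) can be illustrated as below. This is not a full random forest, only a sketch of those two steps. The names `bootstrap_sample`, `candidate_features`, and `mtry`, along with the toy sizes, are assumptions for illustration.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one sample per tree)."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(n_features, mtry, rng):
    """Pick mtry distinct feature indices to consider at a single split."""
    return sorted(rng.sample(range(n_features), mtry))

rng = random.Random(1)
rows = list(range(10))                        # stand-in for 10 observations
sample = bootstrap_sample(rows, rng)          # some rows repeat, some are left out
oob = [r for r in rows if r not in sample]    # "out-of-bag" rows this tree never saw
feats = candidate_features(n_features=9, mtry=3, rng=rng)
print(sample, oob, feats)
```

Because each tree sees a different sample and each split sees a different subset of variables, the trees are less correlated with one another, which is what makes averaging them effective.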
Boosted trees are built sequentially. A first tree is built in the usual way. Then another tree is fit to the residuals (errors) and is added to the first tree. This allows the second tree to focus on observations on which the first tree performed poorly. Additional trees are then added with cross-validation used to decide when to stop (and prevent overfitting).
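The idea of fitting each new tree to the residuals can be sketched as follows, under several stated assumptions: the target is numeric, each "tree" is a one-split regression stump, the ensemble starts from the mean rather than a full first tree, and each tree's contribution is shrunk by a hypothetical `learning_rate`. The number of trees is fixed here rather than chosen by cross-validation. All names and data are illustrative.

```python
def fit_regression_stump(xs, targets):
    """Least-squares stump on one feature (needs at least 2 distinct x values)."""
    best = None                                   # (sse, threshold, left_mean, right_mean)
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Sequentially fit stumps to the current residuals and sum them up."""
    base = sum(ys) / len(ys)                      # start from the mean prediction
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

def mse(model, xs, ys):
    """Mean squared error of model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]                         # toy nonlinear target
model = boost(xs, ys)
print(mse(model, xs, ys))
```

A small learning rate with many trees is the "learn slowly" setting. The training error keeps falling as trees are added, which is why a held-out or cross-validated error is needed to decide when to stop.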
Both random forests and boosted trees are ensemble methods, which means that rather than creating a single tree, many trees are produced. Neither method is transparent, so they require variable importance measures to understand the extent to which each input variable is used.
Random forests do not use the residuals to build successive trees, so there is less risk of overfitting as a result of building too many trees. Boosted trees, on the other hand, tend to achieve lower bias. The best boosted trees learn slowly (use lots of trees) and thus can take longer than a random forest to train.