Task2 Construct a classification tree

 Split the data into training (70%) and testing (30%) sets.
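
 As a minimal sketch of the 70/30 split, assuming the observations sit in a plain list (the function name, the seed, and the stand-in data are all hypothetical and not taken from the task):

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it at the training fraction."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))                # stand-in for 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

 Every observation lands in exactly one of the two sets, so the test set stays unseen while the tree is being fit.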

 Alter the minbucket, cp, and maxdepth parameters. Each of these parameters can be used to prevent overfitting. The "minbucket" parameter sets the minimum number of observations allowed in any leaf node of the tree. The "cp" parameter sets the minimum improvement needed in order to make a split. The "maxdepth" parameter sets the maximum number of levels of splits.

 Estimate the cp parameter using cross-validation. This involves dividing the training data into folds, training the model on all but one of the folds, and measuring the performance on the remaining fold. The process is repeated so that each fold serves once as the held-out fold, which gives a distribution of performance values for a given cp value. The cp value that yields the best cross-validated accuracy is then selected.

 

 

 

Task3 Construct a random forest

 Generally, variables that are used to make splits more frequently and earlier in the trees in the random forest are determined to be more important.

 

 

 

Task4 Consider a boosted tree

 Random forests use bagging to produce many trees that are built independently of one another. Each tree is built from a bootstrap sample of the data. When a split is to be considered, only a randomly selected subset of variables is considered. This approach is designed to reduce overfitting, which is likely when a single tree is used.

 Boosted trees are built sequentially. A first tree is built in the usual way. Then another tree is fit to the residuals (errors) and is added to the first tree. This allows the second tree to focus on observations on which the first tree performed poorly. Additional trees are then added with cross-validation used to decide when to stop (and prevent overfitting).
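
 The sequential fitting to residuals can be sketched with one-split trees (stumps) on a single numeric predictor. This is an illustration under assumptions, not a real boosting library: the data, the function names, the learning rate of 0.3, and the fixed tree count (instead of cross-validated stopping) are all hypothetical.

```python
def fit_stump(xs, targets):
    """Return (threshold, left_mean, right_mean) for the split with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:     # candidate thresholds at distinct x values
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1], best[2], best[3]


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fit to the current residuals."""
    preds = [0.0] * len(ys)            # start at zero, so the first stump is fit to y itself
    mse_history = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        # Assumption: every stump, including the first, is shrunk by the learning rate.
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, mse_history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]
preds, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

 The training error never increases as stumps are added, which is exactly why a held-out or cross-validated error is needed to decide when to stop.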

 Both random forests and boosted trees are ensemble methods, which means that rather than creating a single tree, many trees are produced. Neither method is transparent, so they require variable importance measures to understand the extent to which each input variable is used.

 Random forests do not use the residuals to build successive trees, so there is less risk of overfitting as a result of building too many trees, while boosted trees will result in a lower bias. The best boosted trees learn slowly (use lots of trees) and thus can take longer than a random forest to train.

 

 

Task6

(a) GLM: for each of the Gaussian, Poisson and Gamma distributions, state the domain and an example target variable.

 Gaussian

 - The domain is all real values.

 - A target variable that could be modeled with the Gaussian distribution is the response time.

 Poisson

 - The domain is non-negative integers.

 - A target variable that could be modeled with the Poisson distribution is the number of calls in an hour.

 Gamma distribution

 - The domain is positive real values.

 - A target variable that could be modeled with the Gamma distribution is the turnout time plus one second, since it is continuous and positive.

 

Task9

(a) Explain how measures of impurity are related to information gain in a decision tree

 Information gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most information gain. Therefore, the choice of impurity measure (e.g., Gini or entropy) directly impacts information gain calculations.
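
 A small worked example may help here. The sketch below assumes the usual definition, where information gain is the parent impurity minus the size-weighted impurity of the two child nodes; the labels and the split are made up for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1        # hypothetical split: mostly "yes"
right = ["yes"] * 1 + ["no"] * 4       # mostly "no"

gain_gini = information_gain(parent, left, right, gini)
gain_entropy = information_gain(parent, left, right, entropy)
print(round(gain_gini, 4), round(gain_entropy, 4))  # 0.18 0.2781
```

 The same split scores 0.18 under Gini and about 0.278 under entropy, so the two measures rank candidate splits on different scales even when they agree on which split is best.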

 

 

(c) When concerned about model variance, recommend whether to use a random forest or a gradient boosting machine.

 I recommend a random forest, which tends to do well in reducing variance while having a similar bias to that of a basic tree model. The variance reduction arises from the use of many small trees and sampling of the data (bagging). Both practices hinder overfitting to the idiosyncrasies of the training data, and hence keep the variance low.

 Gradient boosting machines use the same underlying training data at each step. This is very effective at reducing bias but makes the model very sensitive to the training data (high variance).

 

Task10

(a) Explain why decision trees overfit to factor variables with many levels.

 In this task, the number of levels is 31. This means the number of ways to split day of the month into two groups is very large, making it likely that the tree will find a spurious split that happens to produce information gain for that particular training data.

 Decision trees tend to create splits on categorical variables with many levels because it is easier to find a split where the information gain is large. However, splitting on these variables is likely to lead to overfitting.
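
 To put a number on "very large": if any grouping of the levels is allowed, k levels can be divided into two non-empty groups in 2^(k-1) - 1 ways. The sketch below checks that closed form against brute-force enumeration for small k; it counts possible groupings only and makes no claim about how a particular tree implementation searches them.

```python
from itertools import combinations


def count_binary_partitions(k):
    """Each level goes left or right (2**k), minus the 2 one-sided cases, halved for symmetry."""
    return 2 ** (k - 1) - 1


def brute_force_count(k):
    """Enumerate every unordered pair of non-empty groups that together cover the k levels."""
    levels = range(k)
    seen = set()
    for size in range(1, k):
        for group in combinations(levels, size):
            rest = tuple(level for level in levels if level not in group)
            seen.add(frozenset([group, rest]))   # {group, rest} is the same split as {rest, group}
    return len(seen)


print(brute_force_count(5), count_binary_partitions(5))  # 15 15
print(count_binary_partitions(31))                        # 1073741823
```

 With 31 levels there are over a billion candidate groupings, so some of them will look informative on the training data purely by chance.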

 

(b) Describe the handling of categorical variables in linear models and tree-based models.

 Linear Models : 

 Each non-base level is represented by a binary (dummy) variable. The coefficient for each level represents the impact relative to the base level of the variable.

 Tree-Based Models : 

 The more levels the variable has, the more potential ways there are to split the categories into groups. The tree chooses the split that maximizes information gain. Decision trees naturally allow for interactions between categorical variables based on how the tree is created.
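
 For the linear-model side, the base-level idea can be illustrated with a hand-rolled dummy encoding. This is only a sketch: the helper name and the weekday values are hypothetical, and real modeling software builds these columns automatically.

```python
def dummy_encode(values, base_level):
    """Return (column names, rows) with one 0/1 column per non-base level."""
    levels = sorted(set(values) - {base_level})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


levels, rows = dummy_encode(["Mon", "Tue", "Wed", "Mon"], base_level="Mon")
print(levels)  # ['Tue', 'Wed']
print(rows)    # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

 The base level "Mon" gets no column of its own and is absorbed into the intercept, which is why each fitted coefficient reads as a difference from the base level.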

 

 

Task11

(c) Explain what led to the large difference in AUC values between the train and test datasets.

 An AUC of 1.0 indicates the model perfectly predicts the data while an AUC of 0.5 indicates the model performs as well as random chance.

 A large gap between the train AUC and the test AUC indicates that the models are overfit to noise in the training data. The likely cause is that the chosen hyperparameters create very deep trees, and therefore the hyperparameters need to be adjusted to create simpler trees, which are less prone to overfitting.
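
 The two reference points for AUC (1.0 and 0.5) can be checked with a small pairwise calculation. The sketch assumes the standard interpretation of AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as half; the labels and scores are made up.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))    # 1.0  (perfect ranking)
print(auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))    # 0.5  (no better than chance)
print(auc([0, 1, 0, 1], [0.1, 0.3, 0.35, 0.8]))   # 0.75 (one pair out of four misranked)
```

 A model that scores near 1.0 on the training data but much lower on the test data is ranking the training cases almost perfectly without that ranking carrying over to new data.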

 

 

Task13

(a) Explain how cost-complexity pruning works, including how complexity is optimized.

 Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits that do not reduce the model error by at least a fixed amount determined by the complexity parameter. We can use cross-validation to optimize the complexity parameter, which is the process of repeatedly training and testing models on different folds of the data. This is done for different values of the complexity parameter, and the value with the lowest cross-validation error is selected as the optimal choice. We then prune back our trained tree using the complexity parameter from the cross-validation.

 

(b) State why changing the random seed would affect the tree constructed using cost-complexity pruning.

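 Each time the cost-complexity pruning algorithm is run, the splits of data used in the cross-validation are randomly assigned. A different seed therefore gives different folds, which can change the cross-validation errors, the selected complexity parameter, and hence the pruned tree.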

 

 

Task8

(a) Compare and contrast stepwise selection with shrinkage methods.

Similarities

 - Both avoid overfitting to the data, especially when the number of observations is small compared to the number of predictors.

 - Both can be used for variable selection to reduce model complexity.

Differences

 - Stepwise selection takes iterative steps, adding or removing one variable at a time, until there is no further improvement as measured by AIC.

 - Shrinkage methods can reduce the size of coefficients without entirely eliminating variables.

 

(b) Explain why variables are standardized as part of the lasso model fitting procedure.

 Variables that are on a larger scale typically have smaller coefficients and vice versa. Because the lasso penalty depends on the size of the coefficients, without standardizing, the regularization will focus on shrinking the coefficients of variables on a smaller scale over those on a larger scale.
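
 A quick way to see what standardizing does: the same hypothetical income variable recorded in dollars and in thousands of dollars would need coefficients that differ by a factor of 1,000, yet both versions turn into identical columns once they are centered and scaled. The sketch below uses made-up values and the population standard deviation as the scaling choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


income_dollars = [30000, 45000, 60000, 75000]    # hypothetical values
income_thousands = [30, 45, 60, 75]              # same information, different units

z_dollars = standardize(income_dollars)
z_thousands = standardize(income_thousands)
print(z_dollars)
print(z_thousands)
```

 After standardizing, the penalty treats the variable the same way regardless of the units it was recorded in.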

 

(c) Describe the process of searching for the optimal value of the hyperparameter lambda in a lasso regression.

 The optimal value for lambda can be found using cross-validation. First, a grid of lambda values is chosen for the search. Then for each lambda value, a cross-validation error is calculated.

 The first step in calculating a cross-validation error is to partition the data into k folds. A single fold is removed for testing, and the remaining folds are used to train a lasso model with the current lambda value. This process is repeated for each of the k partitions, and the cross-validation result is calculated as the average of a performance measure (e.g. RMSE or AUC) across all k testing partitions.

 The optimal lambda value is the one with the best cross-validation performance (the lowest error for a measure such as RMSE, the highest value for a measure such as AUC).
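
 The search can be sketched end to end for the simplest possible case, a lasso with one predictor, where the fitted slope has a closed form (soft-thresholding). Everything here is an assumption made for illustration: the simulated data, the lambda grid, the objective (1/(2n)) * sum of squared errors + lambda * |slope|, and the interleaved fold assignment used in place of random folds.

```python
import random
from math import sqrt


def soft_threshold(z, lam):
    """Shrink z toward zero by lam, setting it to exactly zero inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(xs, ys, lam):
    """One-predictor lasso with an unpenalized intercept; returns (intercept, slope)."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / n
    sxx = sum((x - xm) ** 2 for x in xs) / n
    slope = soft_threshold(sxy, lam) / sxx
    return ym - slope * xm, slope


def cv_rmse(xs, ys, lam, k=5):
    """Average test-fold RMSE over k folds for one lambda value."""
    n = len(xs)
    fold_rmse = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))            # simplification: interleaved folds
        train = [(xs[i], ys[i]) for i in range(n) if i not in test_idx]
        a, b = fit_lasso_1d([x for x, _ in train], [y for _, y in train], lam)
        sq = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        fold_rmse.append(sqrt(sum(sq) / len(sq)))
    return sum(fold_rmse) / k


rng = random.Random(0)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]       # simulated data, true slope 2

grid = [0.0, 0.01, 0.05, 0.1, 0.5, 1.0]              # hypothetical lambda grid
errors = {lam: cv_rmse(xs, ys, lam) for lam in grid}
best_lambda = min(errors, key=errors.get)
print(best_lambda, errors[best_lambda])
```

 With several predictors the fit inside each fold needs an iterative solver, but the outer loop over the lambda grid and the folds keeps the same shape.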

 

https://www.youtube.com/watch?v=fSytzGwwBVw

StatQuest also has a cross-validation video, but I think the ISLR book explains this topic better.

 

(f) confusion matrix

pred\ref    negative    positive
negative    TN          FN
positive    FP          TP

sensitivity = TP/(TP+FN)

specificity = TN/(TN+FP)

 

https://www.youtube.com/watch?v=vP06aMoz4v8

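It did not really click when I read the text, but watching the video made it clear when to favor high sensitivity and when to favor high specificity. It also covers the cases with 3 and 4 classes.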

 

(g) lowering the cutoff threshold?

 Assess the consequences of this recommendation as it relates to the business problem.

 This will increase positive predictions (both TP and FP) while reducing negative predictions (both TN and FN), increasing sensitivity and decreasing specificity.
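
 The effect of the cutoff can be traced on a tiny made-up example. The predicted probabilities, the actual labels, and the two cutoffs below are hypothetical; sensitivity and specificity are computed as TP/(TP+FN) and TN/(TN+FP).

```python
def classify(probabilities, cutoff):
    """Predict positive (1) when the predicted probability reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def sensitivity_specificity(actual, predicted):
    """Count the four confusion matrix cells and return (sensitivity, specificity)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


actual = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1]

print(sensitivity_specificity(actual, classify(probs, 0.5)))    # (0.5, 0.75)
print(sensitivity_specificity(actual, classify(probs, 0.35)))   # (0.75, 0.5)
```

 Lowering the cutoff from 0.5 to 0.35 catches one more true positive at the price of one more false positive, so whether it is worthwhile depends on the relative cost of the two kinds of error.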

 

 

 

Task9

(a) Describe how bagging is used in the random forest algorithm and the advantage it gives random forests over a single decision tree in terms of the bias/variance trade-off.

 Random forests are created by applying bagging and taking random feature subsets to construct multiple trees, which are averaged to produce a prediction.

 Bagging is the process of training multiple models in parallel on different bootstrap samples (random samples drawn with replacement) of the data. Each individual tree is trained on a different training dataset. Variance refers to the sensitivity of the model to changes in the training dataset. Bagging reduces variance because each individual tree is trained on different data, and averaging the trees' predictions smooths out the fluctuations of any single tree.
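
 The variance argument can be illustrated with a toy simulation. As a deliberate simplification, the "model" below is just the mean of a bootstrap sample rather than a tree, and the spread is measured across repeated bootstrap draws from one simulated dataset; the data, the seed, and the counts (25 models, 200 repetitions) are all hypothetical.

```python
import random
from statistics import pstdev


def avg(values):
    return sum(values) / len(values)


def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement."""
    return rng.choices(data, k=len(data))


def single_model_prediction(data, rng):
    """Stand-in for one tree: a prediction computed from one bootstrap sample."""
    return avg(bootstrap_sample(data, rng))


def bagged_prediction(data, rng, n_models=25):
    """Average the predictions of n_models stand-in models."""
    return avg([single_model_prediction(data, rng) for _ in range(n_models)])


rng = random.Random(1)
data = [rng.gauss(10, 3) for _ in range(30)]          # simulated training data

single = [single_model_prediction(data, rng) for _ in range(200)]
bagged = [bagged_prediction(data, rng) for _ in range(200)]
print(pstdev(single), pstdev(bagged))
```

 The bagged predictions vary far less from run to run than the single-model predictions, which is the variance reduction described above; the random feature subsets in a random forest add to this by making the trees less alike.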
