Task6

(a) For a GLM, state the domain of each of the Gaussian, Poisson and Gamma distributions and a target variable that could be modeled with each.

 Gaussian

 - The domain is all real values.

 - A target variable that could be modeled with the Gaussian distribution is the response time.

 Poisson

 - The domain is non-negative integers.

 - A target variable that could be modeled with the Poisson distribution is the number of calls in an hour.

 Gamma distribution

 - The domain is positive real values.

 - A target variable that could be modeled with the Gamma distribution is the turnout time plus one second, since it is continuous and positive.
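 The three domains listed above can be checked against a candidate target variable before choosing a GLM family. The following is a minimal sketch, not a definitive implementation: the function name `fits_domain` and the sample values are hypothetical and are used only for illustration.

```python
import math

def fits_domain(values, distribution):
    """Return True if every value lies in the distribution's domain."""
    if distribution == "gaussian":
        # All (finite) real values are allowed.
        return all(math.isfinite(v) for v in values)
    if distribution == "poisson":
        # Non-negative integers only (counts).
        return all(v >= 0 and float(v).is_integer() for v in values)
    if distribution == "gamma":
        # Strictly positive real values only.
        return all(math.isfinite(v) and v > 0 for v in values)
    raise ValueError(f"unknown distribution: {distribution}")

# Hypothetical values, for illustration only.
calls_per_hour = [0, 3, 7, 2]
turnout_seconds = [0.0, 42.5, 61.0]
shifted_turnout = [t + 1 for t in turnout_seconds]

print(fits_domain(calls_per_hour, "poisson"))   # True
print(fits_domain(turnout_seconds, "gamma"))    # False: contains a zero
print(fits_domain(shifted_turnout, "gamma"))    # True after adding one second
```

 The last two lines show why the answer shifts the turnout time by one second: a turnout time of exactly zero falls outside the Gamma domain.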

 

Task9

(a) Explain how measures of impurity are related to information gain in a decision tree

 Information gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most information gain. Therefore, the choice of impurity measure (e.g., Gini or entropy) directly impacts information gain calculations.
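 To make the relationship concrete, the sketch below computes the gain of one split as the parent impurity minus the size-weighted impurity of the two children, once with Gini and once with entropy. The function names and the example labels are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_child = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted_child

# Hypothetical node with 5 "yes" and 5 "no", split into two groups of 5.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(information_gain(parent, left, right, gini))     # about 0.18
print(information_gain(parent, left, right, entropy))  # about 0.278
```

 The same split gets a different gain under each measure, which is why the choice of impurity measure can change which split the tree selects.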

 

 

(c) When concerned about model variance, recommend whether to use a random forest or a gradient boosting machine.

 I recommend a random forest, which tends to do well in reducing variance while having a similar bias to that of a basic tree model. The variance reduction arises from the use of many small trees and sampling of the data (bagging). Both practices hinder overfitting to the idiosyncrasies of the training data, and hence keep the variance low.

 Gradient boosting machines use the same underlying training data at each step. This is very effective at reducing bias, but the resulting model is very sensitive to the training data (high variance).

 

Task10

(a) Explain why decision trees overfit to factor variables with many levels.

 In this task, the number of levels is 31. This means the number of ways to split day of the month into two groups is very large (2^30 - 1, over a billion, if the levels are treated as unordered), making it likely that the tree will find spurious splits that happen to produce information gain for that particular training data.

 Decision trees tend to create splits on categorical variables with many levels because it is easier to find a split where the information gain is large. However, splitting on these variables is likely to lead to overfitting.
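 The growth in candidate splits can be counted directly. A factor with k unordered levels can be divided into two non-empty groups in 2^(k-1) - 1 ways. The sketch below assumes the tree considers every such grouping. The function name `binary_partitions` is hypothetical.

```python
def binary_partitions(levels):
    """Number of ways to split `levels` unordered categories into two non-empty groups."""
    return 2 ** (levels - 1) - 1

# A binary factor, days of the week, months of the year, days of the month.
for k in (2, 7, 12, 31):
    print(k, binary_partitions(k))
# 2 1
# 7 63
# 12 2047
# 31 1073741823
```

 With over a billion candidate groupings for day of the month, some grouping will show a large gain on the training data by chance alone.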

 

(b) Describe the handling of categorical variables in linear models and tree-based models.

 Linear Models : 

 Each non-base level is represented by a binary (dummy) variable. The coefficient for each level represents the impact relative to the base level of the variable.
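 As an illustration of the base-level idea, the sketch below builds the 0/1 indicator columns that a linear model would receive for a categorical variable. The function name `dummy_encode` and the weekday values are hypothetical, and real modeling software normally does this step automatically.

```python
def dummy_encode(values, base_level):
    """Return the non-base levels and one 0/1 indicator row per observation."""
    levels = sorted(set(values) - {base_level})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical factor with "Mon" chosen as the base level.
weekdays = ["Mon", "Tue", "Wed", "Mon"]
levels, rows = dummy_encode(weekdays, base_level="Mon")

print(levels)  # ['Tue', 'Wed']
for value, row in zip(weekdays, rows):
    print(value, row)
# Mon [0, 0]
# Tue [1, 0]
# Wed [0, 1]
# Mon [0, 0]
```

 The base level has no column of its own, so it is absorbed into the intercept, and each remaining coefficient measures the difference from it.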

 Tree-Based Models : 

 The levels of a categorical variable are split into two groups at a node. The more levels the variable has, the more potential ways there are to split the category into groups. The tree chooses the split that maximizes information gain. Decision trees naturally allow for interactions between categorical variables based on how the tree is created.

 

 

Task11

(c) Explain what led to the large difference in AUC values between the train and test datasets.

 An AUC of 1.0 indicates the model perfectly predicts the data while an AUC of 0.5 indicates the model performs as well as random chance.

 The large gap between the train and test AUC indicates that the models are overfit to noise in the training datasets. The likely cause is that the chosen hyperparameters create very deep trees, and therefore the hyperparameters need to be adjusted to create simpler trees which are less prone to overfitting.
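 The two reference points for AUC (1.0 for perfect ranking, 0.5 for chance) can be reproduced with a small pairwise calculation. This is a minimal sketch that treats AUC as the share of (positive, negative) pairs in which the positive case receives the higher score. The function name `auc` and the scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
print(auc(labels, [0.9, 0.8, 0.2, 0.1]))  # 1.0: every positive outranks every negative
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5: the scores carry no information
```

 A model that scores near 1.0 on the training data but much lower on the test data has learned the ranking of the training rows rather than a pattern that generalizes.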

 

 

Task13

(a) Explain how cost-complexity pruning works, including how complexity is optimized.

 Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits that do not reduce the model error by at least the amount set by the complexity parameter. We can use cross validation to optimize the complexity parameter, which is the process of repeatedly training and testing models on different folds of the data. This is done for a range of values of the complexity parameter, and the value with the lowest cross validation error is selected as the optimal choice. We then prune back our trained tree using the complexity parameter chosen by the cross validation.
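 One common way to express the trade-off is as a penalized cost: the tree's error plus the complexity parameter times the number of leaves. The sketch below assumes that form and uses hypothetical error values for four nested subtrees. It shows only how a larger penalty favors a smaller tree, not the cross validation step that picks the penalty.

```python
def best_subtree(subtrees, alpha):
    """Pick the subtree minimizing error + alpha * number_of_leaves."""
    return min(subtrees, key=lambda t: t["error"] + alpha * t["leaves"])

# Hypothetical nested subtrees: more leaves always lowers the training error.
subtrees = [
    {"leaves": 1, "error": 0.40},
    {"leaves": 3, "error": 0.25},
    {"leaves": 8, "error": 0.20},
    {"leaves": 20, "error": 0.18},
]

for alpha in (0.0, 0.005, 0.02, 0.1):
    print(alpha, best_subtree(subtrees, alpha)["leaves"])
# 0.0 20
# 0.005 8
# 0.02 3
# 0.1 1
```

 With no penalty the largest tree wins, and as the penalty grows the splits that buy only a small error reduction are pruned away.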

 

(b) State why changing the random seed would affect the tree constructed using cost-complexity pruning.

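 Each time the cost-complexity pruning algorithm is run, the splits of data used in the cross validation are randomly assigned. A different seed gives different folds, which can change the cross validation errors, the complexity parameter selected, and therefore the pruned tree.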
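 The dependence on the seed can be seen in the fold assignment itself. The sketch below is a hypothetical fold assignment, not the routine used by any particular library: it shuffles the row indices with a seeded generator and deals them into folds.

```python
import random

def assign_folds(n_rows, n_folds, seed):
    """Shuffle row indices with a seeded generator and deal them into folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]

folds_a = assign_folds(20, 5, seed=1)
folds_b = assign_folds(20, 5, seed=1)
folds_c = assign_folds(20, 5, seed=2)

print(folds_a == folds_b)  # True: the same seed reproduces the same folds
print(folds_a == folds_c)  # a different seed generally gives different folds
```

 Fixing the seed makes the cross validation, and hence the selected complexity parameter, reproducible from run to run.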

 

 
