Task5 k-means clustering

 The goal is to assign records to one of k groups, or clusters, such that members of each group are overall more similar to one another than they are to members of other groups. The number of groups, k, is specified at the beginning, and the group members are determined through an iterative process. k random centers are chosen, and the group assignment for each record is determined by which of these centers is closest. New centers for each group are calculated based on its members, and then group membership is redetermined based on these new centers. This process continues until the centers and group membership are stable or it is stopped by an iteration limit.
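
 The iterative process above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the exam materials: the function name kmeans, the toy points, and the seed are all made up for the example, and squared Euclidean distance is assumed as the similarity measure.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    # Start from k distinct records chosen at random as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins the group of its closest center.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Stop when group membership no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a group ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Hypothetical toy data: two visually separate groups of records.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

 Because the starting centers are random, a different seed can lead to a different final grouping, which is why k-means is usually run several times.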

 

 

Task7 Boosting

 Boosting builds up an ensemble model, taking the aggregate prediction of many individual models or learners, with each successive model building on the deficiencies of the prior model. Unlike bagging, another ensemble method, the individual models are not independent. The technique gains accuracy not from the particular predictions of any of its individual models, often called weak learners, but from its iterative process for improving the aggregate performance of the models in total.

 The first model is trained on unweighted data, and then the second model is trained on the residuals produced by the first model. The third model is trained on the residuals of the first two models taken together, and so on. The boosting process is typically stopped after a set number of iterations, and the sum of all model outputs is used.

 To prevent overfitting within this iterative process, a shrinkage parameter is applied to individual models so that the aggregate performance of the models approaches the training data in a controlled manner and avoids being overly sensitive to the structure of any one model.
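
 The residual-fitting loop and the shrinkage parameter described above can be sketched as follows. This is a simplified illustration under stated assumptions: the weak learner is a one-split regression stump on a single feature, the names fit_stump, boost, and mse are hypothetical, and the toy x and y values are invented.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump (one split) on a single feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_trees=50, shrinkage=0.1):
    """Sum of shrunken stumps, each fitted to the current residuals."""
    pred = [0.0] * len(y)
    for _ in range(n_trees):
        # Residuals of the ensemble built so far (the targets themselves at first).
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        # Shrinkage: only a fraction of each weak learner is added.
        pred = [pi + shrinkage * stump(xi) for xi, pi in zip(x, pred)]
    return pred

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

# Hypothetical toy data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 2.5, 3.1, 7.9, 8.2, 8.0, 12.1, 11.8]
mse_few = mse(y, boost(x, y, n_trees=5))
mse_many = mse(y, boost(x, y, n_trees=50))
print(mse_few, mse_many)
```

 On the training data the error keeps falling as trees are added, which is exactly why the number of iterations and the shrinkage need to be controlled to avoid overfitting.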

 Boosting is appropriate for this business problem because its predictions, by directly addressing the errors of prior model fittings, are typically more accurate than those of other predictive modeling techniques. Being a more complex ensemble method, it is difficult to gather insight into how the model is making these accurate predictions, but this seems relatively unimportant to ABC.

 A partial dependence plot shows, for each value of the variable of interest, the expected prediction when that value is paired with all observed values of the other variables in the training data. The yhat values can be compared to the overall mean bike usage of 189 per hour in the training data.
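
 As a rough sketch of how the values in a partial dependence plot are computed: the variable of interest is forced to one value for every training record, the model predicts, and the predictions are averaged. The function partial_dependence, the toy model toy_predict, and the column names temp and rain are all hypothetical and stand in for the fitted boosting model and the real bike data.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Pair this value with the other variables of every training record.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

# Hypothetical stand-in for a fitted model.
def toy_predict(row):
    return 100 + 5 * row["temp"] - 30 * row["rain"]

# Hypothetical training records.
rows = [
    {"temp": 10, "rain": 0},
    {"temp": 20, "rain": 1},
    {"temp": 30, "rain": 0},
    {"temp": 25, "rain": 1},
]
pd_temp = partial_dependence(toy_predict, rows, "temp", [10, 20, 30])
print(pd_temp)  # [135.0, 185.0, 235.0]
```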

 The higher shrinkage parameter, representing greater weight for each weak learner, produces a substantially better prediction as measured by mean squared error. With the higher shrinkage the model fits faster, and here it does so without overfitting. Because the higher shrinkage parameter gives more weight to each model fitting the successive residuals, the predictions tend to be more spread out given the same number of trees.

 

https://eatchu.tistory.com/entry/앙상블Ensemble-bias-variance-관점에서의-유형-정리-Voting-Bagging-Boosting

 


I found this blog through a Naver search, and it explains the topic better than the videos I found on YouTube. (Thanks to the blog owner...)

 

Task8 Compare distribution choices for a GLM

 The Poisson distribution with a log link function is a reasonable choice. The target variable only has non-negative integer values. The log link function allows the predicted mean to vary multiplicatively rather than linearly with the coefficients for each predictor variable, more naturally fitting the right-skewed distribution of the target variable.

 The gamma distribution with an inverse link function is also a reasonable choice. There is no material harm in applying the gamma distribution to integer-only values when the values span a large range. The data matches its support of strictly positive values, though it is conceivable that future data could include zero bike rentals, which the gamma distribution cannot accommodate. The inverse link function allows the predicted mean to vary hyperbolically rather than linearly with the coefficients for each predictor variable. Unlike the log link function, the inverse link function can result in negative predictions, which are typically massive and unusable when they occur.
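
 The difference between the two links can be seen by inverting them for a few values of the linear predictor. This is a small sketch, and the eta values below are made up for illustration: with the log link the mean is exp(eta) and stays positive, while with the inverse link the mean is 1 / eta, which becomes large and negative as soon as eta dips just below zero.

```python
import math

def mean_from_log_link(eta):
    # log link: log(mu) = eta, so mu = exp(eta), which is always positive
    return math.exp(eta)

def mean_from_inverse_link(eta):
    # inverse link: 1 / mu = eta, so mu = 1 / eta, negative whenever eta < 0
    return 1.0 / eta

# Hypothetical linear predictor values close to zero.
for eta in (0.01, 0.001, -0.001):
    print(eta, mean_from_log_link(eta), mean_from_inverse_link(eta))
```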

 

 

Task6

(a) For a GLM, the domain of each of the Gaussian, Poisson, and Gamma distributions, with an example target variable.

 Gaussian

 - The domain is all real values.

 - A target variable that could be modeled with the Gaussian distribution is the response time.

 Poisson

 - The domain is non-negative integers.

 - A target variable that could be modeled with the Poisson distribution is the number of calls in an hour.

 Gamma distribution

 - The domain is positive real values.

 - A target variable that could be modeled with the Gamma distribution is the turnout time plus one second, since it is continuous and positive.

 

Task9

(a) Explain how measures of impurity are related to information gain in a decision tree

 Information gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most information gain. Therefore, the choice of impurity measure (e.g., Gini or entropy) directly affects the information gain calculations.
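
 A short worked example of this relationship, assuming the usual definitions of Gini impurity and entropy and a size-weighted average over the child nodes. The function names and the toy labels are illustrative only.

```python
import math
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Hypothetical node with 4 positives and 4 negatives, split into two children.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [1, 0, 0, 0]
gain_gini = information_gain(parent, left, right, gini)
gain_entropy = information_gain(parent, left, right, entropy)
print(gain_gini, gain_entropy)
```

 The same split gets a different gain under each measure (0.125 with Gini, about 0.189 with entropy), so the measure chosen can change which split looks best.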

 

 

(c) When concerned about model variance, recommend whether to use a random forest or a gradient boosting machine.

 I recommend a random forest, which tends to do well in reducing variance while having a similar bias to that of a basic tree model. The variance reduction arises from the use of many small trees and sampling of the data (bagging). Both practices hinder overfitting to the idiosyncrasies of the training data, and hence keep the variance low.

 Gradient boosting machines use the same underlying training data at each step. This is very effective at reducing bias but makes the model very sensitive to the training data (high variance).

 

Task10

(a) Explain why decision trees overfit to factor variables with many levels.

In this task, the number of levels is 31. This means the number of ways to split day of the month into two groups is very large, making it likely that the tree will find spurious splits that happen to produce information gain for that particular training data.

 Decision trees tend to create splits on categorical variables with many levels because it is easier to find a split where the information gain is large. However, splitting on these variables is likely to lead to overfitting.
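
 To put a number on "very large": assuming day of the month is treated as an unordered factor, the count of distinct ways to divide L levels into two non-empty groups is 2^(L-1) - 1. The helper name below is made up for the illustration.

```python
def n_binary_splits(levels):
    """Ways to divide an unordered factor with `levels` levels into two non-empty groups."""
    return 2 ** (levels - 1) - 1

print(n_binary_splits(3))   # 3
print(n_binary_splits(31))  # 1073741823
```

 With over a billion candidate partitions for 31 levels, some will reduce impurity on the training data purely by chance.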

 

(b) Describe the handling of categorical variables in linear models and tree-based models.

 Linear Models : 

 The coefficient for each level represents the impact relative to the base level of the variable.

 Tree-Based Models : 

 The more levels the variable has, the more potential ways there are to split the categories into groups. Splits are chosen to maximize information gain. Decision trees naturally allow for interactions between categorical variables based on how the tree is created.

 

 

Task11

(c) Explain what led to the large difference in AUC values between the train and test datasets.

 An AUC of 1.0 indicates the model perfectly predicts the data while an AUC of 0.5 indicates the model performs as well as random chance.

 The large gap between the train and test AUC indicates that the models are overfit to noise in the training dataset. The likely cause is that the chosen hyperparameters create very deep trees, and therefore the hyperparameters need to be adjusted to create simpler trees, which are less prone to overfitting.
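
 For intuition about the two reference values of AUC, here is a small sketch that computes it as the share of (positive, negative) pairs in which the positive record gets the higher score, counting ties as half. The function name and the toy labels and scores are illustrative, not taken from the task.

```python
def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores that separate the classes perfectly.
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
# Hypothetical scores that carry no information at all.
print(auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # 0.5
```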

 

 

Task13

(a) Explain how cost-complexity pruning works, including how complexity is optimized.

 Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits that do not reduce the model error by at least a fixed amount determined by the complexity parameter. We can use cross validation to optimize the complexity parameter, which is the process of repeatedly training and testing models on different folds of the data. This is done for different values of the complexity parameter, and the one with the lowest cross validation error is selected as the optimal choice. We then prune back our trained tree using the complexity parameter from the cross validation.
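
 The trade-off behind the pruning step can be sketched with the standard penalized criterion, error + alpha x (number of leaves), where alpha plays the role of the complexity parameter. The subtree sizes and errors below are invented toy numbers, and the cross validation loop that would pick alpha is not shown.

```python
# Hypothetical nested subtrees of one large tree: (number of leaves, training error).
subtrees = [(1, 100.0), (2, 60.0), (3, 45.0), (5, 40.0), (9, 38.0)]

def best_subtree(subtrees, alpha):
    """Subtree minimizing error + alpha * leaves for a given complexity penalty."""
    return min(subtrees, key=lambda t: t[1] + alpha * t[0])

for alpha in (0.0, 1.0, 10.0, 50.0):
    leaves, error = best_subtree(subtrees, alpha)
    print(f"alpha={alpha}: keep {leaves} leaves (training error {error})")
```

 A penalty of zero keeps the full tree, and larger penalties prune back further because extra splits no longer pay for themselves.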

 

(b) State why changing the random seed would affect the tree constructed using cost-complexity pruning.

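 Each time the cost-complexity pruning algorithm is run, the splits of data used in the cross validation are randomly assigned. A different seed produces different folds, which can change the cross validation errors, the selected complexity parameter, and therefore the pruned tree.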

 

 

Task3

(a) Explain three differences between fitting a normal linear regression to log(X) compared to fitting a GLM with a log link function to the unaltered X variable.

 - former : lognormal model, latter : GLM with a log link

 (1) The normal linear regression has a log transformation applied to the response variable, and the GLM does not. The log transformation is reasonable for a variable that has right-skew.

 (2) The GLM has flexibility to select a probability distribution that best fits the shape of the response variable, whereas the normal linear regression model only allows for one distribution.

 (3) In the normal linear model the variance of the (transformed) response variable is constant, while in the GLM the variance can be a function of the mean. (Note to self: I am not yet comfortable with this point.)
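
 To make point (3) more concrete, here are the standard variance functions of three common GLM families, each stated up to a constant dispersion factor. The helper name is hypothetical, and the means 10 and 100 are arbitrary.

```python
def variance_function(family, mu):
    """Variance as a function of the mean, up to a constant dispersion factor."""
    if family == "gaussian":
        return 1.0        # constant: does not depend on the mean
    if family == "poisson":
        return mu         # grows linearly with the mean
    if family == "gamma":
        return mu ** 2    # grows with the square of the mean
    raise ValueError(f"unknown family: {family}")

for mu in (10.0, 100.0):
    print(mu, [variance_function(f, mu) for f in ("gaussian", "poisson", "gamma")])
```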

 

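(b) The task asks for an analysis of the residual plot... The analysis itself is done, so here is the related concept.

Homoscedasticity (constant variance) and heteroscedasticity (non-constant variance)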

Wikipedia

Task4

(a) Absolute error, bagging example

                 First Tree   Second Tree   Bagging Ensemble
Prediction       43000        40000         Avg(43000, 40000) = 41500
Actual           32084        32084         32084
Absolute Error   10916        7916          41500 - 32084 = 9416
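
The arithmetic in the table can be checked with a few lines of Python, using the same predictions and actual value. The variable names are just for this illustration.

```python
preds = [43000, 40000]   # first tree, second tree
actual = 32084

tree_errors = [abs(p - actual) for p in preds]
bagged_pred = sum(preds) / len(preds)      # bagging averages the tree predictions
bagged_error = abs(bagged_pred - actual)

print(tree_errors)                # [10916, 7916]
print(bagged_pred, bagged_error)  # 41500.0 9416.0
```

The ensemble error falls between the two individual errors here because both trees overpredict in the same direction.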

 

https://www.youtube.com/watch?v=tjy0yL1rRRU

DataMListic

For reference on bagging/boosting...

 
