Task5 k-means clustering

 The goal is to assign each record to one of k groups, or clusters, such that members of each group are, overall, more similar to one another than they are to members of other groups. The number of groups, k, is specified at the outset, and group membership is determined through an iterative process. First, k random centers are chosen, and each record is assigned to the group whose center is closest. New centers are then calculated for each group from its members, and group membership is redetermined using these new centers. The process continues until the centers and group membership are stable or an iteration limit is reached.
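The iterative process described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the function name `kmeans`, the toy `points`, the squared Euclidean distance, and the fixed `seed` are all hypothetical choices, not something the task prescribes.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups; hypothetical helper."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k random starting centers
    labels = None
    for _ in range(max_iter):                # iteration limit
        # Assign each record to its closest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:             # membership is stable: stop
            break
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                      # keep the old center if a group is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Toy data: two visibly separated groups (made-up values).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(points, k=2)
```

With groups this well separated the labels would be expected to split along the two visible groups, though k-means in general can settle on a local optimum depending on the random start.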

 

 

Task7 Boosting

 Boosting builds an ensemble model, taking the aggregate prediction of many individual models, or learners, with each successive model addressing the deficiencies of the prior ones. Unlike bagging, another ensemble method, the individual models are not independent. The technique gains accuracy not from the particular predictions of any individual model (these are often called weak learners) but from its iterative process of improving the aggregate performance of the models as a whole.

 The first model is trained on unweighted data, and the second model is then trained on the residuals produced by the first. The third model is trained on the residuals of the first two models taken together, and so on. The boosting process is typically stopped after a set number of iterations, and the sum of all model outputs is used.

 To prevent overfitting within this iterative process, a shrinkage parameter is applied to the individual models so that the aggregate performance of the models approaches the training data in a controlled manner and avoids being overly sensitive to the structure of any one model.
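The residual-fitting loop and the role of shrinkage can be sketched as follows. This is a simplified illustration, not the algorithm of any particular package: it assumes one-split regression stumps as the weak learners, squared error, a single numeric predictor, and a constant mean as the starting model. The names `fit_stump`, `boost`, `mse` and the toy data are hypothetical.

```python
def fit_stump(x, r):
    """Fit a one-split regression stump to targets r on 1-D inputs x (least squares)."""
    best = None
    for t in sorted(set(x))[:-1]:            # candidate split points
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees, eta):
    """Sum of shrunken stumps, each fit to the residuals of the ensemble so far."""
    base = sum(y) / len(y)                   # assumption: start from the overall mean
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):                 # fixed number of iterations
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        # Shrinkage: only a fraction eta of each new learner is added.
        pred = [pi + eta * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + eta * sum(s(xi) for s in stumps)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy data (made up): a curved relationship.
x = list(range(10))
y = [xi ** 2 for xi in x]
slow = boost(x, y, n_trees=20, eta=0.01)
fast = boost(x, y, n_trees=20, eta=0.5)
```

For the same number of trees, `mse(fast, x, y)` comes out lower than `mse(slow, x, y)` on the training data, because the smaller shrinkage value approaches the training data more gradually. Training error alone says nothing about overfitting, which would need held-out data to assess.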

 Boosting is appropriate for this business problem because its predictions, by directly addressing the errors of prior model fittings, are typically more accurate than those of many other predictive modeling techniques. Because boosting is a more complex ensemble method, it is difficult to gain insight into how the model arrives at these accurate predictions, but this seems relatively unimportant to ABC.

 Partial dependence plots show the expected value of the prediction at each displayed value of a variable, averaged over all values of the other variables as found in the training data. The yhat values can be compared to the overall mean bike usage of 189 per hour in the training data.
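The averaging behind a partial dependence value can be sketched in a few lines. This is a hedged illustration only: `partial_dependence`, the toy `predict` function, and the column names `temp` and `rain` are hypothetical and are not taken from the bike data or from any plotting package.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, force `feature` to that value in every training row
    and average the resulting predictions."""
    result = []
    for v in grid:
        preds = [predict({**row, feature: v}) for row in rows]
        result.append(sum(preds) / len(preds))
    return result

# Hypothetical fitted model and training rows, for illustration only.
predict = lambda r: 100 + 5 * r["temp"] - 20 * r["rain"]
rows = [{"temp": 10, "rain": 0}, {"temp": 20, "rain": 1}]

yhat = partial_dependence(predict, rows, "temp", grid=[0, 10])
# yhat == [90.0, 140.0]
```

Each yhat is an average over the training rows with only the chosen variable changed, which is why it can be read against the overall mean of the target.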

 The higher shrinkage parameter, which gives greater weight to each weak learner, produces a substantially better prediction as measured by mean squared error. With the higher shrinkage, the model fits the data faster without overfitting. Because the higher shrinkage parameter gives more weight to each model fitted to the successive residuals, the predictions tend to be more spread out for the same number of trees.

 

https://eatchu.tistory.com/entry/앙상블Ensemble-bias-variance-관점에서의-유형-정리-Voting-Bagging-Boosting

 

[ML] Ensemble - Voting, Bagging, Boosting


I found this blog through a Naver search; it explains the topic better than the material I found on YouTube. (Thanks to the blog owner...)

 

Task8 Compare distribution choices for a GLM

 A Poisson distribution with a log link function is a reasonable choice. The target variable takes only non-negative integer values. The log link function allows the predicted mean to vary multiplicatively rather than linearly with the coefficients for each predictor variable, which more naturally fits the right-skewed distribution of the target variable.

 The gamma distribution with an inverse link function is also a reasonable choice. There is no material harm in applying the gamma distribution to integer-only values when the values span a large range. The data matches its support of strictly positive values, though it is conceivable that future data could include zero bike rentals. The inverse link function allows the predicted mean to vary hyperbolically rather than linearly with the coefficients for each predictor variable. Unlike the log link function, the inverse link function can produce negative predictions, which are typically massive and unusable when they occur.
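A small numeric sketch may help contrast the two links. The coefficients below (`b0`, `b1`, `c0`, `c1`) are made up purely for illustration and do not come from any fitted model; the functions simply invert each link to recover the predicted mean from the linear predictor.

```python
import math

def mean_log_link(eta):
    # log(mu) = eta  ->  mu = exp(eta), always positive
    return math.exp(eta)

def mean_inverse_link(eta):
    # 1 / mu = eta   ->  mu = 1 / eta, negative whenever eta is negative
    return 1.0 / eta

# Log link: a one-unit increase in x multiplies the mean by exp(b1).
b0, b1 = 3.0, 0.2                            # hypothetical coefficients
ratios = [mean_log_link(b0 + b1 * (x + 1)) / mean_log_link(b0 + b1 * x)
          for x in range(3)]                 # each ratio equals exp(0.2)

# Inverse link: the mean is hyperbolic in the linear predictor and flips sign
# once the linear predictor crosses zero.
c0, c1 = 0.010, -0.004                       # hypothetical coefficients
inverse_means = [mean_inverse_link(c0 + c1 * x) for x in (0, 1, 2, 3)]
```

Here the first three inverse-link means grow rapidly as the linear predictor approaches zero, and the last one is negative with a large magnitude, which is the failure mode described above.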

 

 

Task6

(a) Explain what the variance and bias values indicate about the relative quality of predictions when comparing predictive models.

 The variance figures indicate how much the predictions vary depending on the training data used. As more predictors are used, the variance increases because the model fits the training data for each trial more precisely and becomes less generalized. The bias figures indicate how close the expected predictions are to the actual results on unseen data. Generally, as more predictors are used, the bias decreases because the expected predictions become more accurate.
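The two quantities are tied together by the standard decomposition of expected squared prediction error at a new point. The notation here is an assumption for illustration: the response is written as y_0 = f(x_0) + ε with mean-zero noise of variance σ² independent of the training data, and the expectation is taken over training sets and the noise.

```latex
\mathbb{E}\big[(y_0 - \hat f(x_0))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x_0) - \mathbb{E}[\hat f(x_0)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

A model is preferable when the sum of squared bias and variance is smaller, so adding predictors helps only while the drop in bias outweighs the rise in variance.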

 

 

Task7

(a) Explain, for a general audience, what cost complexity pruning does.

 Cost complexity pruning is part of a two-step approach to building a tree model. The first step is to build a large, complex decision tree, which is essentially a flow chart for deciding whether to try to transfer an animal.

 The second step is called pruning. Pruning reduces the size and complexity of the initial flow chart to produce a more useful one. It is called cost complexity pruning because of the technical tradeoff being made between how simple the flow chart is and how well it distinguishes whether animals can be transferred or not.
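For readers who want the tradeoff in numbers, it can be sketched as a penalized score: training error plus a penalty `alpha` for each leaf. The candidate trees and their error counts below are made up for illustration, and `best_tree` is a hypothetical helper rather than a function from any tree package.

```python
# Each candidate pruned tree: (number of leaves, misclassified training records).
# These figures are invented for illustration.
candidates = [(1, 400), (3, 250), (8, 200), (20, 185), (50, 180)]

def best_tree(candidates, alpha):
    """Pick the tree minimizing errors + alpha * leaves (the cost-complexity score)."""
    return min(candidates, key=lambda c: c[1] + alpha * c[0])

no_penalty = best_tree(candidates, alpha=0)      # (50, 180): the largest tree wins
moderate = best_tree(candidates, alpha=2)        # (8, 200): a mid-sized tree wins
heavy = best_tree(candidates, alpha=100)         # (1, 400): a single leaf wins
```

The larger the penalty per leaf, the simpler the selected flow chart, at the price of more training errors.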

 

 

Task8

(a) Boosting - Setting eta as high as possible?

 In boosting algorithms, which work by iteratively fitting a model to the residuals of a prior learner, eta, also called the learning rate or shrinkage parameter, slows down the model fitting process so that the residuals from the prior learner do not have too large an influence on the final model. With eta at its maximum of 1, each model iteration is the prior learner plus the model fitted to its residuals. While this will run quickly, it will be prone to high variance, overfitting the training data and not generalizing well to unseen data. Setting eta to less than 1 slows down the fitting process by adding only eta times the model fitted to the residuals to form the next learner, and this will substantially reduce the variance.

 

(b) Explain cross validation and how it can be used to set the eta hyperparameter

 Cross validation divides the available data into multiple folds for a series of model fitting runs. In each run, one fold is held out as test data and the model is fitted on the remaining folds, so that each fold is used as test data exactly once. The average test metric across the runs is the result of the cross validation.

 To use cross validation to set the eta hyperparameter, a series of reasonable values for eta would be chosen beforehand. Then, for each value of eta, cross validation would be performed, with every model fitting run using that same eta. The result is one average test metric from the cross validation for each value of eta. The value of eta with the best test metric, some measure of predictive power on unseen data, would be chosen for subsequent predictive modeling work.
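The selection procedure can be sketched end to end in plain Python. Everything here is an illustrative assumption: the stump-based booster stands in for whatever boosting package is actually used, the folds are formed by index modulo k rather than at random, and the data, the candidate `etas`, and the names `fit_stump`, `boost`, and `cv_mse` are hypothetical.

```python
def fit_stump(x, r):
    """One-split least-squares regression stump on 1-D inputs (weak learner)."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees, eta):
    """Simplified boosting: each stump fits the current residuals, scaled by eta."""
    base = sum(y) / len(y)
    pred, stumps = [base] * len(y), []
    for _ in range(n_trees):
        stump = fit_stump(x, [yi - pi for yi, pi in zip(y, pred)])
        stumps.append(stump)
        pred = [pi + eta * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + eta * sum(s(xi) for s in stumps)

def cv_mse(x, y, eta, k=5, n_trees=30):
    """Average held-out MSE over k folds; each record is test data exactly once."""
    fold_scores = []
    for f in range(k):
        test_idx = [i for i in range(len(x)) if i % k == f]
        train_idx = [i for i in range(len(x)) if i % k != f]
        model = boost([x[i] for i in train_idx], [y[i] for i in train_idx], n_trees, eta)
        errs = [(y[i] - model(x[i])) ** 2 for i in test_idx]
        fold_scores.append(sum(errs) / len(errs))
    return sum(fold_scores) / k

# Toy data (made up): a curve plus a deterministic wiggle standing in for noise.
x = [i / 2 for i in range(40)]
y = [xi ** 2 + (3 if i % 2 else -3) for i, xi in enumerate(x)]

etas = [0.05, 0.1, 0.3, 1.0]                 # candidate values chosen beforehand
scores = {eta: cv_mse(x, y, eta) for eta in etas}
best_eta = min(scores, key=scores.get)       # lowest average held-out MSE
```

Every candidate eta is scored on the same folds with the same number of trees, so the comparison isolates the effect of eta. In practice the number of trees would usually be tuned alongside it.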

 

 

 

Task3

(a) Explain three differences between fitting a normal linear regression to log(X) compared to fitting a GLM with a log link function to the unaltered X variable.

 - former: lognormal model; latter: GLM with a log link

 (1) The normal linear regression has a log transformation applied to the response variable, and the GLM does not. The log transformation is reasonable for a variable that is right-skewed.

 (2) The GLM has flexibility to select a probability distribution that best fits the shape of the response variable, whereas the normal linear regression model only allows for one distribution.

 (3) In the normal linear model the variance of the (transformed) response variable is constant, while in the GLM the variance can be a function of the mean. > I'm not yet comfortable with this point
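The three differences can be seen side by side by writing both models out. The notation is an assumption for illustration: the response (the X of the question) is written as Y_i, x_i is the vector of predictors, β the coefficients, φ the dispersion parameter, and V the variance function of the chosen GLM family.

```latex
\text{Log-transformed linear model:}\quad
  \log Y_i = x_i^\top \beta + \varepsilon_i,\ \ \varepsilon_i \sim N(0, \sigma^2)
  \;\Rightarrow\;
  \mathbb{E}[\log Y_i] = x_i^\top \beta,\quad
  \operatorname{Var}(\log Y_i) = \sigma^2

\text{GLM with log link:}\quad
  \log \mathbb{E}[Y_i] = x_i^\top \beta,\quad
  \operatorname{Var}(Y_i) = \phi\, V(\mu_i),\ \ \mu_i = \mathbb{E}[Y_i]
```

In the first model the log is applied to the data and the mean of Y_i on the original scale is exp(x_i^T β + σ²/2). In the second the log is applied to the mean, and the variance function depends on the family, for example V(μ) = μ for the Poisson and V(μ) = μ² for the gamma.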

 

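(b) The task asks for an analysis of the residual plot... The analysis itself is done; the related concepts are noted here.

Homoscedasticity (equal variance) and heteroscedasticity (unequal variance)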

Wikipedia

Task4

(a) Absolute error, bagging example

                 First Tree   Second Tree   Bagging Ensemble
Prediction       43000        40000         Avg(43000, 40000) = 41500
Actual           32084        32084         32084
Absolute Error   10916        7916          41500 - 32084 = 9416
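The arithmetic in the table can be checked with a short sketch. The figures are the ones given above; the variable names are only illustrative.

```python
predictions = [43000, 40000]     # first tree, second tree
actual = 32084

# Absolute error of each individual tree.
tree_errors = [abs(p - actual) for p in predictions]       # [10916, 7916]

# Bagging averages the individual predictions before comparing to the actual value.
bagged_prediction = sum(predictions) / len(predictions)    # 41500.0
bagged_error = abs(bagged_prediction - actual)             # 9416.0
```

With only two trees, the ensemble's absolute error falls between the errors of the two individual trees.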

 

https://www.youtube.com/watch?v=tjy0yL1rRRU

DataMListic

For reference on bagging/boosting...

 
