Task5 k-means clustering

 The goal is to assign each record to one of k groups, or clusters, such that members of each group are overall more similar to one another than they are to members of other groups. The number of groups, k, is specified at the beginning, and the group members are determined through an iterative process. First, k random centers are chosen, and each record is assigned to the group whose center is closest. New centers for each group are then calculated from its members, and group membership is redetermined based on these new centers. This process continues until the centers and group membership are stable or until it is stopped by an iteration limit.
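The loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `kmeans`, the toy `points`, the squared Euclidean distance, and the fixed random seed are all assumptions made for the example, not part of the exam task.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # k random records as initial centers
    assignment = None
    for _ in range(max_iter):                 # iteration limit
        # Assign each record to its closest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:      # membership is stable: stop
            break
        assignment = new_assignment
        # Recompute each center as the mean of its current members.
        for j in range(k):
            members = [p for p, g in zip(points, assignment) if g == j]
            if members:                       # keep the old center if a group is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Hypothetical toy data: two visibly separated clouds of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, groups = kmeans(points, k=2)
print(centers)
print(groups)
```

Because the starting centers are random, different seeds can lead to different final groupings, which is why k-means is often run several times in practice.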

 

 

Task7 Boosting

 Boosting builds up an ensemble model, taking the aggregate prediction of many individual models or learners, with each successive model addressing the deficiencies of the models before it. Unlike bagging, another ensemble method, the individual models are not independent. The technique gains accuracy not from the particular predictions of any of its individual models, often called weak learners, but from its iterative process for improving the aggregate performance of the models in total.

 The first model is trained on unweighted data, and then the second model is trained on the residuals produced by the first model. The third model is trained on the residuals of the first two models taken together, and so on. The boosting process is typically stopped after a set number of iterations, and the sum of all model outputs is used.

 To prevent overfitting within this iterative process, a shrinkage parameter is applied to individual models so that the aggregate performance of the models approaches the training data in a controlled manner and avoids being overly sensitive to the structure of any one model.
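A minimal sketch of boosting on residuals with a shrinkage parameter follows. The weak learner here is a one-split regression stump on a single predictor, and the names `fit_stump`, `boost`, `mse` and the toy data are assumptions for illustration only; real boosting packages use deeper trees and more options.

```python
def fit_stump(x, r):
    """Weak learner: best single split on x, predicting the mean of r on each side."""
    best = None
    for s in sorted(set(x))[1:]:              # candidate split points (both sides non-empty)
        left = [ri for xi, ri in zip(x, r) if xi < s]
        right = [ri for xi, ri in zip(x, r) if xi >= s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi < s else rm

def boost(x, y, n_trees=50, shrinkage=0.1):
    """Each new stump is fit to the residuals of the ensemble built so far."""
    pred = [0.0] * len(y)                     # the first stump therefore sees the raw target
    learners = []
    for _ in range(n_trees):                  # stop after a set number of iterations
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        learners.append(stump)
        # Shrinkage: only a fraction of each stump's output is added to the ensemble.
        pred = [pi + shrinkage * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: sum(shrinkage * f(xi) for f in learners)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical toy data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 12, 15, 30, 32, 35, 60, 62]
one_tree = boost(x, y, n_trees=1)
fifty_trees = boost(x, y, n_trees=50)
print(mse(one_tree, x, y), mse(fifty_trees, x, y))
```

The training error shrinks as trees are added; how quickly depends on the shrinkage value, which is the trade-off the paragraph above describes.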

 Boosting is appropriate for this business problem because its predictions, by directly addressing the errors of prior model fittings, are typically more accurate than those of other predictive modeling techniques. Being a more complex ensemble method, it is difficult to gather insight into how the model is making these accurate predictions, but this seems relatively unimportant to ABC.

 Partial dependence plots show the expected value of the prediction at each displayed value of a variable, averaged over all values of the other variables as found in the training data. The yhat values can be compared to the overall mean bike usage of 189 per hour in the training data.

 The higher shrinkage parameter, representing greater weight for each weak learner, produces a substantially better prediction as measured by mean squared error. With the higher shrinkage, the model fits the data faster without overfitting. Because the higher shrinkage parameter gives more weight to each model fitting the successive residuals, the predictions tend to be more spread out given the same number of trees.

 

https://eatchu.tistory.com/entry/앙상블Ensemble-bias-variance-관점에서의-유형-정리-Voting-Bagging-Boosting

 

[ML] Ensemble - Voting, Bagging, Boosting (eatchu.tistory.com)

A blog I found through a Naver search; it explains the topic better than what I found on YouTube. (Thanks to the blog owner...)

 

Task8 Compare distribution choices for a GLM

 The Poisson distribution with a log link function is a reasonable choice. The target variable only has non-negative integer values. The log link function allows the predicted mean to vary multiplicatively rather than linearly with the coefficients for each predictor variable, more naturally fitting the right-skewed distribution of the target variable.

 The gamma distribution with an inverse link function is also a reasonable choice. There is no material harm in applying the gamma distribution to integer-only values when the values span a large range. The data matches its support of strictly positive values, though it is conceivable that future data could include zero bike rentals, which the gamma distribution cannot accommodate. The inverse link function allows the predicted mean to vary hyperbolically rather than linearly with the coefficients for each predictor variable. Unlike the log link function, the inverse link function can result in negative predictions, typically massive and unusable when they occur.
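To see why the inverse link can produce negative and very large predictions while the log link cannot, it helps to invert each link for a few values of the linear predictor. The function names and the values of `eta` below are made up for illustration.

```python
import math

def mean_log_link(eta):
    # log(mu) = eta  =>  mu = exp(eta), which is positive for any eta
    return math.exp(eta)

def mean_inverse_link(eta):
    # 1/mu = eta  =>  mu = 1/eta, which is negative whenever eta is negative
    return 1.0 / eta

# Hypothetical linear predictor values near zero, on both sides.
for eta in (0.01, 0.001, -0.001):
    print(eta, mean_log_link(eta), mean_inverse_link(eta))
```

A linear predictor that dips only slightly below zero is turned into a large negative mean by the inverse link, which matches the "massive and unusable" behavior noted above.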

 

 

Task8

(a) Compare and contrast stepwise selection with shrinkage methods.

Similarities

 - both avoid overfitting to the data, especially when the number of observations is small compared to the number of predictors.

 - both can be used for variable selection to reduce model complexity.

Differences

 - Stepwise selection adds or removes one variable at a time in iterative steps until there is no further improvement, as measured by a criterion such as AIC.

 - Shrinkage methods can reduce the size of coefficients without entirely eliminating variables.

 

(b) Explain why variables are standardized as part of the lasso model fitting procedure.

 Variables that are on a larger scale typically have smaller coefficients, and vice versa. Because the lasso penalty depends on the size of the coefficients, without standardizing, the regularization will focus on shrinking the coefficients of variables on a smaller scale over those on a larger scale.
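A small numeric illustration of the scale effect: the same hypothetical predictor is expressed in kilometers and in meters, and an ordinary least-squares slope is computed for each. The data and the helper names `slope` and `standardize` are assumptions for the example; no lasso is fit here, only the coefficient sizes that the penalty would act on are shown.

```python
def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def standardize(x):
    """Center to mean 0 and scale to standard deviation 1."""
    m = sum(x) / len(x)
    sd = (sum((a - m) ** 2 for a in x) / len(x)) ** 0.5
    return [(a - m) / sd for a in x]

y = [2.0, 4.0, 6.0, 8.0]
x_km = [1.0, 2.0, 3.0, 4.0]                  # hypothetical distances in kilometers
x_m = [1000.0 * v for v in x_km]             # the same distances in meters

print(slope(x_km, y), slope(x_m, y))         # coefficient size depends on the unit
print(slope(standardize(x_km), y), slope(standardize(x_m), y))  # same after standardizing
```

After standardizing, the choice of unit no longer changes the coefficient, so the penalty treats the variable the same way either way.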

 

(c) Describe the process of searching for the optimal value of the hyperparameter lambda in a lasso regression.

 The optimal value for lambda can be found using cross-validation. First, a grid of lambda values is chosen for the search. Then for each lambda value, a cross-validation error is calculated.

 The first step in calculating a cross-validation error is to partition the data into k folds. A single fold is held out for testing, and the remaining folds are used to train a lasso model with the current lambda value. This process is repeated for each of the k folds, and the cross-validation error is calculated as the average of an error measure (e.g. RMSE; if a performance measure such as AUC is used instead, higher is better) across all k testing folds.

 The optimal lambda value is the one with the lowest cross-validation error.
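The search can be sketched for the simplest possible case: a lasso with one predictor and no intercept, where the fit has a closed form through soft-thresholding. Everything here is an assumption for illustration (the toy data, the lambda grid, the interleaved fold assignment, and the function names); real tools fit many predictors and choose the grid automatically.

```python
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_lasso_1d(x, y, lam):
    """Minimize (1/(2n)) * sum((y - b*x)^2) + lam * |b| for a single predictor."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    return soft_threshold(rho, lam) / (sum(a * a for a in x) / n)

def cv_error(x, y, lam, k=4):
    """Average RMSE over k held-out folds for one lambda value."""
    fold_rmse = []
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        test = [i for i in range(len(x)) if i % k == fold]    # the held-out fold
        beta = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], lam)
        mse = sum((y[i] - beta * x[i]) ** 2 for i in test) / len(test)
        fold_rmse.append(mse ** 0.5)
    return sum(fold_rmse) / k

# Hypothetical centered toy data with a slope of roughly 2.
x = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
y = [-6.2, -3.9, -2.1, -0.8, 1.1, 1.9, 4.2, 5.8]

grid = [0.0, 0.01, 0.1, 1.0, 10.0]            # grid of lambda values to search
errors = {lam: cv_error(x, y, lam) for lam in grid}
best_lam = min(errors, key=errors.get)        # lowest cross-validation error
print(errors)
print(best_lam)
```

With the largest lambda in this grid the coefficient is shrunk all the way to zero, so its cross-validation error is much worse than with no penalty.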

 

https://www.youtube.com/watch?v=fSytzGwwBVw

Cross-validation is also covered by StatQuest, but I find this topic easier to understand from the ISLR book.

 

(f) confusion matrix

pred\ref    negative    positive
negative    TN          FN
positive    FP          TP

sensitivity = TP/(TP+FN)

specificity = TN/(TN+FP)
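The two formulas can be checked on a small made-up example. The labels below are hypothetical (1 = positive, 0 = negative), and the function name `confusion_counts` is just for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical reference labels and model predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)   # share of actual positives that were caught
specificity = tn / (tn + fp)   # share of actual negatives that were cleared
print(tp, fp, tn, fn)            # 3 1 3 1
print(sensitivity, specificity)  # 0.75 0.75
```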

 

https://www.youtube.com/watch?v=vP06aMoz4v8

It did not click when I read about it as text, but the video made it clear when high sensitivity or high specificity should be preferred. It also covers the cases with 3 and 4 classes.

 

(g) lowering the cutoff threshold?

 Assess the consequences of this recommendation as it relates to the business problem.

 This will increase positive predictions (both TP and FP) while reducing negative predictions (both TN and FN), increasing sensitivity at the cost of lower specificity.
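The effect of a lower cutoff can be seen on a few made-up predicted probabilities. The numbers and the function name `counts_at_cutoff` are hypothetical; a record is predicted positive when its probability is at or above the cutoff.

```python
def counts_at_cutoff(actual, probs, cutoff):
    """Return (TP, FP, TN, FN) when predicting positive for probabilities >= cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical actual classes and predicted probabilities.
actual = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]

for cutoff in (0.5, 0.25):
    tp, fp, tn, fn = counts_at_cutoff(actual, probs, cutoff)
    print(cutoff, (tp, fp, tn, fn), tp / (tp + fn), tn / (tn + fp))
# cutoff 0.5  gives (2, 1, 2, 1)
# cutoff 0.25 gives (3, 2, 1, 0)
```

In this toy case the lower cutoff catches every actual positive but also flags more actual negatives, so whether it is worthwhile depends on the relative cost of a false positive versus a false negative in the business problem.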

 

 

 

Task9

(a) Describe how bagging is used in the random forest algorithm and the advantage it gives random forests over a single decision tree in terms of the bias/variance trade-off.

 Random forests are created by applying bagging and taking random feature subsets to construct multiple trees, which are averaged to produce a prediction.

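 Bagging is the process of training multiple models in parallel on different bootstrap samples of the data, that is, random samples drawn with replacement. Each individual tree is therefore trained on a different training dataset. Variance refers to the sensitivity of the model to changes in the training dataset. Bagging reduces variance because each individual tree is trained on different data and the trees' predictions are averaged, so the quirks of any one training sample have less influence on the final prediction.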
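The bagging step alone (without the random feature subsets that a random forest adds) can be sketched as follows. A one-split stump stands in for a full decision tree, and the data, seed, and function names are assumptions made for the example.

```python
import random

def fit_stump(x, y):
    """Stand-in for a tree: one split on x, predicting the mean of y on each side."""
    if len(set(x)) == 1:                      # degenerate sample: predict the overall mean
        mean = sum(y) / len(y)
        return lambda xi: mean
    best = None
    for s in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi < s else rm

def bagged_predict(x, y, x_new, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(x)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample: n draws with replacement
        model = fit_stump([x[i] for i in idx], [y[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / len(preds), preds            # ensemble prediction = average

# Hypothetical toy data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10.0, 12.0, 15.0, 30.0, 32.0, 35.0, 60.0, 62.0]
avg, preds = bagged_predict(x, y, x_new=4)
print(min(preds), max(preds), avg)
```

The individual predictions vary from one bootstrap sample to the next, while their average is a single, steadier value.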

Task3

(a) Explain three differences between fitting a normal linear regression to log(X) compared to fitting a GLM with a log link function to the unaltered X variable.

 - former: lognormal model; latter: GLM with a log link

 (1) The normal linear regression has a log transformation applied to the response variable, and the GLM does not. The log transformation is reasonable for a variable that has right-skew.

 (2) The GLM has flexibility to select a probability distribution that best fits the shape of the response variable, whereas the normal linear regression model only allows for one distribution.

 (3) In the normal linear model the variance of the (transformed) response variable is constant, while in the GLM the variance can be a function of the mean. > not yet familiar with this
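The contrast in points (1) and (3) can be written out compactly. This is a sketch in standard notation, writing the response as X as in the question; the symbols η (linear predictor), σ² (error variance), φ (dispersion) and V (variance function) are introduced here for illustration.

```latex
% Normal linear regression on log(X): the transformed response is modeled.
\log X \sim N(\eta, \sigma^2)
  \quad\Rightarrow\quad
  E[X] = \exp\!\left(\eta + \tfrac{\sigma^2}{2}\right),
  \qquad \operatorname{Var}(\log X) = \sigma^2 \ \text{(constant)}

% GLM with log link: the mean of the untransformed response is modeled.
\log E[X] = \eta,
  \qquad \operatorname{Var}(X) = \phi \, V(\mu), \quad \mu = E[X]
```

In the first model the log is taken before averaging, and in the second it is taken after, which is why exponentiating the linear predictor gives the mean of X only in the GLM.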

 

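(b) The task asks for an analysis of the residual plot... The analysis itself is done, so this just notes the related concept.

Homoscedasticity (constant variance) and heteroscedasticity (non-constant variance)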

Wikipedia

Task4

(a) Absolute error, bagging example

                  First Tree    Second Tree    Bagging Ensemble
Prediction        43000         40000          Avg(43000,40000)=41500
Actual            32084         32084          32084
Absolute Error    10916         7916           41500-32084=9416
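The arithmetic in the table can be reproduced directly; the variable names below are only for illustration.

```python
first_tree, second_tree, actual = 43000, 40000, 32084

# The bagging ensemble prediction is the average of the individual tree predictions.
ensemble = (first_tree + second_tree) / 2

abs_errors = {
    "first_tree": abs(first_tree - actual),
    "second_tree": abs(second_tree - actual),
    "ensemble": abs(ensemble - actual),
}
print(ensemble)     # 41500.0
print(abs_errors)   # {'first_tree': 10916, 'second_tree': 7916, 'ensemble': 9416.0}
```

In this example the ensemble error falls between the two individual errors: averaging does not beat the best tree here, but it protects against relying on the worse one.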

 

https://www.youtube.com/watch?v=tjy0yL1rRRU

DataMListic

For reference on Bagging/Boosting...

 
