Task5 Perform a principal components analysis

  Principal components analysis (PCA) is a method to summarize high-dimensional numeric data with fewer dimensions while preserving the spread of the data. It can be particularly helpful when variables are highly correlated. PCA finds orthogonal linear combinations of the input variables (which are typically centered and scaled), called principal components (PCs), that maximize variance to retain as much information as possible. The principal components are ordered according to their variance, and the sum of their variances equals the total variance of the data. It is then common to look at the proportion of variance explained by each principal component to decide how many PCs to use.
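
 As a small illustration of the proportion of variance explained, here is a minimal sketch for the special case of two variables. After centering and scaling, their correlation matrix is [[1, r], [r, 1]], whose eigenvalues 1 + |r| and 1 - |r| are the variances of the two PCs. The function name and the data are invented for this example.

```python
import math

def pca_two_variables(xs, ys):
    """Return the proportion of variance explained by PC1 and PC2
    for two variables after centering and scaling."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Correlation = covariance of the standardized variables.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = cross / ((n - 1) * sd_x * sd_y)
    # Eigenvalues of [[1, r], [r, 1]] are the PC variances, largest first.
    variances = [1 + abs(r), 1 - abs(r)]
    total = sum(variances)  # equals 2, the number of standardized variables
    return [v / total for v in variances]

# Invented data: two highly correlated columns.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [2.1, 3.9, 6.2, 7.8, 10.1]
proportions = pca_two_variables(hours, score)
```

 Because the two invented columns are almost perfectly correlated, the first PC carries nearly all of the variance, which is the situation where dropping the second PC loses little information.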

 Advantages

 PCA could allow us to build a simpler model with fewer features. PCA can help visualize high-dimensional data to explore relationships between variables. PCA can help identify latent variables.

 Disadvantages

 Using a subset of the principal components results in some information loss. The principal components will be less interpretable than the original input variables. Although PCA reduces dimensionality in the model, the original variables must still be collected for future predictions, so no data-collection efficiency is gained.

 

Task6 Construct a decision tree

  When a decision tree is trained, it can become very large and include splits that are not particularly valuable for predictions on new data. When examining the fit of a tree, it is a good idea to prune it if it has many splits that do not improve performance. Pruning reduces the size of the tree, ideally by removing the less valuable splits. This process reduces overfitting to the training data, can lead to better predictions, and results in a simpler, more interpretable tree.

 

 

 

Task9 Discuss the bias-variance tradeoff

 Bias is the expected loss caused by the model not being complex enough to capture the signal in the data. Variance is the expected loss from the model being too complex and overfitting to the training data.

 With high variance (overfitting), the model will perform better on the training set than on a test set. With high bias (underfitting), the model will perform poorly on both the training set and the test set.

 

Task2 Construct a classification tree

 Split the data into training (70%) and testing (30%) sets.

 Alter the minbucket, cp, and maxdepth parameters. Each of these parameters can be used to prevent overfitting. The "minbucket" parameter sets the smallest size for any leaf node in the tree. The "cp" parameter sets the minimum improvement needed in order to make a split. The "maxdepth" parameter sets the maximum number of levels for the splits.

 Estimate the cp parameter using cross-validation. This involves dividing the training data into folds, training the model on all but one of the folds, and measuring the performance on the remaining fold. The process is repeated with each fold held out in turn to develop a distribution of performance values for a given cp value. The cp value that yields the best accuracy is then selected.
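
 The cross-validation loop can be sketched in a few lines. This is a toy stand-in rather than the actual tree software: it fits a one-split regression tree (a stump) that keeps its split only when the relative reduction in squared error is at least cp, and it scores folds by mean squared error instead of classification accuracy. All function names and the data are invented for illustration.

```python
import random

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_stump(train, cp):
    """Fit a one-split regression tree to (x, y) pairs.
    The split is kept only if it cuts the root SSE by at least the fraction cp."""
    ys = [y for _, y in train]
    root_mean, root_sse = sum(ys) / len(ys), sse(ys)
    best = None  # (gain, threshold, left_mean, right_mean)
    xs = sorted({x for x, _ in train})
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in train if x <= t]
        right = [y for x, y in train if x > t]
        gain = root_sse - sse(left) - sse(right)
        if best is None or gain > best[0]:
            best = (gain, t, sum(left) / len(left), sum(right) / len(right))
    if best is None or root_sse == 0 or best[0] / root_sse < cp:
        return lambda x: root_mean  # no split is worth making
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def cv_error(data, cp, k=5, seed=0):
    """Mean squared error averaged over k held-out folds."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit_stump(train, cp)
        errors.append(sum((y - model(x)) ** 2 for x, y in held_out) / len(held_out))
    return sum(errors) / k

# Invented step-shaped data with small deterministic noise.
data = [(x, (0 if x < 10 else 10) + ((2 * x) % 5 - 2) * 0.1) for x in range(20)]
candidates = [0.0, 0.1, 1.0]
scores = {cp: cv_error(data, cp) for cp in candidates}
best_cp = min(scores, key=scores.get)
```

 In this sketch cp = 1.0 blocks every split, so its cross-validated error is higher than that of the smaller values; ties between candidates resolve to the first one listed.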

 

 

 

Task3 Construct a random forest

 Generally, variables that are used to make splits more frequently and earlier in the trees in the random forest are determined to be more important.

 

 

 

Task4  Consider a boosted tree

  Random forests use bagging to produce many independent trees. Each tree is built from a bootstrap sample of the data, and when a split is to be made, only a randomly selected subset of variables is considered. This approach is designed to reduce overfitting, which is likely when a single tree is used.
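
 The two sources of randomness in a random forest can be sketched as follows. This shows only the sampling steps, not the tree fitting itself, and the helper names and feature names are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(features, m, rng):
    """Pick the random subset of m features considered at one split."""
    return rng.sample(features, m)

rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training records
features = ["temp", "humidity", "hour", "weekday"]  # hypothetical names

sample = bootstrap_sample(rows, rng)          # some records repeat, some are left out
subset = candidate_features(features, 2, rng)  # 2 of the 4 features for this split
```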

 Boosted trees are built sequentially. A first tree is built in the usual way. Then another tree is fit to the residuals (errors) and is added to the first tree. This allows the second tree to focus on observations on which the first tree performed poorly. Additional trees are then added with cross-validation used to decide when to stop (and prevent overfitting).

 Both random forests and boosted trees are ensemble methods, which means that rather than creating a single tree, many trees are produced. Neither method is transparent, so they require variable importance measures to understand the extent to which each input variable is used.

 Random forests do not use the residuals to build successive trees, so there is less risk of overfitting as a result of building too many trees, while boosted trees will result in a lower bias. The best boosted trees learn slowly (use lots of trees) and thus can take longer than a random forest to train.

 

 

Task5 k-means clustering

 The goal is to assign each record to one of k groups or clusters such that members of each group are overall more similar to one another than they are to members of other groups. The number of groups, k, is specified at the beginning, and the group members are determined through an iterative process. First, k random centers are chosen and the group assignment for each record is determined by which of these centers is closest. New centers for each group are then calculated based on its members, and group membership is redetermined based on these new centers. This process continues until the centers and group membership are stable or it is stopped by an iteration limit.
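
 The iterative process described above can be sketched in plain Python. This is a minimal version under simple assumptions: squared Euclidean distance, starting centers drawn from the records themselves, and an empty cluster keeping its old center. The function names and the data are invented.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means: returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random records as starting centers
    labels = None
    for _ in range(max_iter):  # iteration limit
        # Assignment step: each record joins its closest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # membership is stable
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Invented data: two well-separated groups of four records each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = k_means(points, k=2)
```

 On this toy data the two centers settle at the middle of each group. On real data the result can depend on the random starting centers, so it is common to run the procedure several times.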

 

 

Task7 Boosting

 Boosting builds up an ensemble model, taking the aggregate prediction of many individual models or learners, with each successive model addressing the deficiencies of the prior models. Unlike bagging, another ensemble method, the individual models are not independent. The technique gains accuracy not from the particular predictions of any of its individual models, often called weak learners, but from its iterative process for improving the aggregate performance of the models in total.

 The first model is trained on unweighted data, and then the second model is trained on the residuals produced by the first model. The third model is trained on the residuals of the first two models taken together, and so on. The boosting process is typically stopped after a set number of iterations, and the sum of all model outputs is used.

 To prevent overfitting within this iterative process, a shrinkage parameter is applied to individual models so that the aggregate performance of the models approaches the training data in a controlled manner and avoids being overly sensitive to the structure of any one model.
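
 A minimal sketch of boosting on residuals with a shrinkage parameter is shown here. It makes several assumptions that are not in the notes: the weak learners are one-split regression stumps on a single feature, the loss is squared error, and the ensemble starts from the mean of the target. The names and the data are invented.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on a single feature."""
    best = None  # (sse, threshold, left_mean, right_mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left)
        err += sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, n_trees, shrinkage):
    """Return a predictor built from n_trees stumps fit to successive residuals."""
    base = sum(ys) / len(ys)  # assumption: start from the mean of the target
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # Each new stump is trained on what the ensemble so far gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrinkage scales down each stump's contribution.
        predictions = [p + shrinkage * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + shrinkage * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Invented curved target on ten points.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
slow = boost(xs, ys, n_trees=20, shrinkage=0.1)
fast = boost(xs, ys, n_trees=20, shrinkage=0.5)
```

 With the same number of trees, the larger shrinkage value fits the training data more closely here. Whether that also helps on new data has to be checked on a holdout set or by cross-validation.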

 Boosting is appropriate for this business problem because its predictions, by directly addressing the errors of prior model fittings, are typically more accurate than those of other predictive modeling techniques. Being a more complex ensemble method, it is difficult to gather insight into how the model is making these accurate predictions, but this seems relatively unimportant to ABC.

 Partial dependence plots show the expected value of the prediction at each displayed value of a variable, averaged over all values of the other variables as found in the training data. The yhat values can be compared to the overall mean bike usage of 189 per hour in the training data.
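
 The averaging behind a partial dependence plot can be sketched as below. The `predict` argument stands for any fitted model's prediction function; `toy_predict` and the rows are invented stand-ins, not the bike model.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction at each grid value of one feature,
    pairing that value with every training row's other features."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical stand-in for a fitted model: additive in two inputs.
def toy_predict(row):
    return 2.0 * row[0] + row[1]

train_rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
yhat = partial_dependence(toy_predict, train_rows, feature_index=0, grid=[0.0, 5.0])
# yhat == [20.0, 30.0]: 2 * value plus the mean of the second column (20.0)
```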

 The higher shrinkage parameter, which gives greater weight to each weak learner, produces a substantially better prediction as measured by mean squared error. With the higher shrinkage, the model learns faster without overfitting. Because the higher shrinkage parameter gives more weight to each model fitted to the successive residuals, the predictions tend to be more spread out given the same number of trees.

 

https://eatchu.tistory.com/entry/앙상블Ensemble-bias-variance-관점에서의-유형-정리-Voting-Bagging-Boosting

 

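[ML] Ensemble - Voting, Bagging, Boosting

Single models have never been able to escape the bias-variance trade-off. Building a more complex model to raise accuracy brings a risk of overfitting, and making the model simpler to fix that, in the end ...

eatchu.tistory.com

I found this blog through a Naver search, and it explains the topic better than anything I found on YouTube. (Thanks to the blog's owner...)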

 

Task8 Compare distribution choices for a GLM

 The Poisson distribution with a log link function is a reasonable choice. The target variable only has non-negative integer values. The log link function allows the predicted mean to vary multiplicatively rather than linearly with the coefficients for each predictor variable, more naturally fitting the right-skewed distribution of the target variable.

 The gamma distribution with an inverse link function is also a reasonable choice. There is no material harm in applying the gamma distribution to integer-only values when the values span a large range. The data matches its support of strictly positive values, though it is conceivable that future data could include zero bike rentals. The inverse link function allows the predicted mean to vary hyperbolically rather than linearly with the coefficients for each predictor variable. Unlike the log link function, the inverse link function can result in negative predictions, typically massive and unusable when they occur.
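
 The difference between the two link functions can be seen by converting a linear predictor (eta) back to a predicted mean. This is a minimal sketch with made-up eta values, not output from a fitted model.

```python
import math

def mean_log_link(eta):
    """Inverse of the log link: the mean is exp(eta), always positive."""
    return math.exp(eta)

def mean_inverse_link(eta):
    """Inverse of the inverse link: the mean is 1 / eta (undefined at eta == 0)."""
    return 1.0 / eta

# Hypothetical linear predictors, the last one just below zero.
etas = [0.5, 0.01, -0.01]
log_means = [mean_log_link(e) for e in etas]
inverse_means = [mean_inverse_link(e) for e in etas]
```

 With the log link every predicted mean stays positive. With the inverse link a linear predictor slightly below zero gives a large negative mean (about -100 here), which is the failure mode described above.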

 

 
