달리는 말킹

SOA/ASA/PA past exam questions and notes - 20.06 exam (for my records)

말킹 — Sat, 13 Apr 2024 16:24:25 +0900

Task5 Perform a principal components analysis

Principal components analysis is a method to summarize high dimensional numeric data with fewer dimensions while preserving the spread of the data. It can be particularly helpful when variables are highly correlated. PCA finds orthogonal linear combinations of the input variables (which are typically centered and scaled) called principal components (PCs) that maximize variance to retain as much information as possible. The principal components are ordered according to their variance. The sum of the variances of all PCs equals the total variance of the (scaled) input variables. It is then common to look at the proportion of variance explained by each principal component to decide how many PCs to use.
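To make the "proportion of variance explained" idea concrete, here is a minimal sketch for the simplest case of two standardized variables. It relies on the fact that the covariance matrix of two standardized variables is [[1, r], [r, 1]], whose eigenvalues are 1 + |r| and 1 - |r|. The data and function names are made up for illustration; with more variables a linear algebra library would be needed.

```python
import math

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pca_two_variables(x, y):
    """PCA on two standardized variables via the closed-form eigenvalues."""
    zx, zy = standardize(x), standardize(y)
    n = len(zx)
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)  # Pearson correlation
    variances = [1 + abs(r), 1 - abs(r)]              # PC1 variance, PC2 variance
    total = sum(variances)                            # equals 2: two scaled inputs
    proportions = [v / total for v in variances]
    return r, variances, proportions

# Hypothetical, highly correlated data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2]
r, variances, proportions = pca_two_variables(x, y)
print(f"correlation={r:.3f}, variance explained by PC1={proportions[0]:.1%}")
```

The more correlated the two inputs are, the closer PC1 gets to explaining all of the variance, which is why PCA helps most with highly correlated variables.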

Advantages

PCA could allow us to build a simpler model with fewer features. PCA can help visualize high-dimensional data to explore relationships between variables. PCA can help identify latent variables.

Disadvantages

Using a subset of the principal components results in some information loss. The principal components will be less interpretable than the original variable inputs. Although PCA reduces dimensionality in the model, the original variables must still be collected for future predictions, so no efficiency is obtained.

Task6 Construct a decision tree

When a decision tree is trained, it can become very large and include splits that are not particularly valuable for predictions on new data. When examining the fit of a tree, it is a good idea to try to prune a tree when it has many splits that do not improve performance. Pruning reduces the size of the tree, hopefully removing less valuable splits from the tree. This process reduces overfitting the tree on the training data, can lead to better predictions, and results in a simpler, more interpretable tree.

Task9 Discuss the bias-variance tradeoff

Bias is the expected loss caused by the model not being complex enough to capture the signal in the data. Variance is the expected loss from the model being too complex and overfitting to the training data.

With high variance (overfitting), the model will perform better on the training set than on a test set. With high bias (underfitting), the model will perform poorly on both the training set and the test set.

SOA/ASA/PA past exam questions and notes - 19.12.12 exam (for my records)

말킹 — Wed, 10 Apr 2024 18:09:00 +0900

Task2 Construct a classification tree

Split the data into training (70%) and testing (30%) sets.

Alter the minbucket, cp, maxdepth parameters. Each of these parameters can be used to prevent overfitting. The "minbucket" parameter sets the smallest size for any leaf node in the tree. The "cp" parameter sets the minimum improvement needed in order to make a split. The "maxdepth" parameter sets the maximum number of levels for the splits.

Estimate the cp parameter using cross-validation. Cross-validation divides the training data into folds, trains the model on all but one of the folds, and measures the performance on the remaining fold. This process is repeated to develop a distribution of performance values for a given cp value. The cp value that yields the best accuracy is then selected.

Task3 Construct a random forest

Generally, variables that are used to make splits more frequently and earlier in the trees in the random forest are determined to be more important.

Task4 Consider a boosted tree

Random forests use bagging to produce many independent trees. Each tree is built from a bootstrap sample of the data. When a split is to be considered, only a randomly selected subset of variables is considered. This approach is designed to reduce overfitting, which is likely when a single tree is used.

Boosted trees are built sequentially. A first tree is built in the usual way. Then another tree is fit to the residuals (errors) and is added to the first tree. This allows the second tree to focus on observations on which the first tree performed poorly. Additional trees are then added with cross-validation used to decide when to stop (and prevent overfitting).

Both random forests and boosted trees are ensemble methods, which means that rather than creating a single tree, many trees are produced. Neither method is transparent, so they require variable importance measures to understand the extent to which each input variable is used.

Random forests do not use the residuals to build successive trees, so there is less risk of overfitting as a result of building too many trees, while boosted trees will result in a lower bias. The best boosted trees learn slowly (use lots of trees) and thus can take longer than a random forest to train.

SOA/ASA/PA past exam questions and notes - 20.12.07 exam (for my records)

말킹 — Mon, 8 Apr 2024 22:33:33 +0900

Task5 k-means clustering

The goal is to assign records into one of k groups or clusters such that members of each group are overall more similar to one another than they are to members of other groups. The number of groups, k, is specified at the beginning and the group members are determined through an iterative process. k random centers are chosen and the group assignment for each record is determined by which of these centers is closest. New centers for each group are calculated based on its members, and then group membership is redetermined based on these new centers. This process continues until the centers and group membership are stable or stopped by an iteration limit.
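The iterative process above can be sketched in a few lines. This is a minimal one-dimensional version with made-up data; the function name, the seed argument, and the choice to keep the old center when a group becomes empty are my own assumptions, not part of the exam solution.

```python
import random

def k_means_1d(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of numbers; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k random records as initial centers
    assignment = None
    for _ in range(max_iter):                # iteration limit
        # Assign each record to its closest center
        new_assignment = [
            min(range(k), key=lambda j: abs(p - centers[j])) for p in points
        ]
        if new_assignment == assignment:     # membership is stable: stop
            break
        assignment = new_assignment
        # Recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                      # keep the old center if a group is empty
                centers[j] = sum(members) / len(members)
    return centers, assignment

# Hypothetical data with two obvious groups
points = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]
centers, assignment = k_means_1d(points, k=2)
print(sorted(round(c, 2) for c in centers), assignment)
```

With real, less well-separated data the result can depend on the random starting centers, so it is common to rerun with several seeds.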

Task7 Boosting

Boosting builds up an ensemble model, taking the aggregate prediction of many individual models or learners, each successive model building on the deficiencies of the prior model. Unlike bagging, another ensemble method, the individual models are not independent. The technique gains accuracy not by the particular predictions of any of its individual models, often called weak learners, but by its iterative process for improving the aggregate performance of the models in total.

The first model is trained on unweighted data, and then the second model is trained on the residuals produced by the first model. The third model is trained based on the residuals of the first two models taken together, and so on. The boosting process is typically stopped after a set number of iterations, and the sum of all model output is used.

To prevent overfitting within this iterative process, a shrinkage parameter is applied to individual models so that the aggregate performance of the models approaches the training data in a controlled manner and avoids being overly sensitive to the structure of any one model.

Boosting is appropriate for this business problem because its predictions, by directly addressing the errors of prior model fittings, are typically more accurate than those of other predictive modeling techniques. Being a more complex ensemble method, it is difficult to gather insight into how the model is making these accurate predictions, but this seems relatively unimportant to ABC.

Partial dependence plots use the expected value of the prediction at the variable value shown when paired with all values of the other variables as found in the training data. The yhat values can be compared to the overall mean bike usage of 189 per hour in the train data.
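A rough sketch of how one point on a partial dependence plot is computed: force the variable to the value of interest in every training row, predict, and average. The toy model and the "temp" and "hour" feature names are hypothetical stand-ins, not the model or data from the exam.

```python
def partial_dependence(model, rows, feature, value):
    """Average prediction when `feature` is set to `value` in every training row."""
    predictions = [model({**row, feature: value}) for row in rows]
    return sum(predictions) / len(predictions)

# Hypothetical stand-in for a fitted model: any function of a feature dict works.
def toy_model(row):
    return 2.0 * row["temp"] + row["hour"]

train_rows = [
    {"temp": 10.0, "hour": 1.0},
    {"temp": 20.0, "hour": 2.0},
    {"temp": 30.0, "hour": 3.0},
]
pd_at_15 = partial_dependence(toy_model, train_rows, "temp", 15.0)
print(pd_at_15)
```

Repeating this over a grid of values for the variable gives the yhat curve shown in the plot.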

The higher shrinkage parameter, representing greater weight for each weak learner, produces a substantially better prediction as measured by mean squared error. ~~The higher shrinkage get faster without overfitting.~~ Because the higher shrinkage parameter gives more weight to each model fitting the successive residuals, the predictions tend to be more spread out given the same number of trees.

https://eatchu.tistory.com/entry/앙상블Ensemble-bias-variance-관점에서의-유형-정리-Voting-Bagging-Boosting

[ML] Ensemble (앙상블) - Voting, Bagging, Boosting

I found this blog through a Naver search, and it explains things better than what I found on YouTube. (Thanks to the blog owner...)

Task8 Compare distribution choices for a GLM

Poisson distribution with log link function is a reasonable choice. The target variable only has non-negative integer values. The log link function allows the predicted mean to vary multiplicatively rather than linearly with the coefficients for each predictor variable, more naturally fitting the right-skewed distribution of the target variable.

The gamma distribution with inverse link function is also a reasonable choice. There is no material harm in applying the gamma distribution function to only integer values when the values span a large range. The data matches its support of strictly positive values, though it is conceivable that future data could include zero bike rentals. The inverse link function allows the predicted mean to vary hyperbolically rather than linearly with the coefficients for each predictor variable. Unlike the log link function, the inverse link function can result in negative predictions, typically massive and unusable when they occur.
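The difference between the two links can be seen by converting a linear predictor back into a predicted mean. A small sketch follows; the function names and the numbers are mine, chosen only to illustrate the behavior.

```python
import math

def mean_log_link(linear_predictor):
    """Log link: mean = exp(linear predictor), always positive."""
    return math.exp(linear_predictor)

def mean_inverse_link(linear_predictor):
    """Inverse link: mean = 1 / (linear predictor), sign follows the predictor."""
    return 1.0 / linear_predictor

# Log link: adding 0.1 to the linear predictor multiplies the mean by exp(0.1)
ratio = mean_log_link(1.1) / mean_log_link(1.0)
print(f"multiplicative effect under log link: {ratio:.4f}")

# Inverse link: a linear predictor slightly below zero gives a huge negative mean
print(mean_inverse_link(0.01), mean_inverse_link(-0.01))
```

A linear predictor that drifts just below zero flips the inverse-link prediction from very large positive to very large negative, which is the "massive and unusable" case described above.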

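<로니로티> - Cheonho Station restaurant, Italian

말킹 — Mon, 8 Apr 2024 00:00:23 +0900

Pork neck steak

Chicken mushroom cream risotto, cherry ade (?)

I often think back to 서가앤쿡 (Seoga & Cook), where I went a few times in college... While walking by I noticed this Italian chain with a similar feel, so I gave it a try. The menu is similar to 서가앤쿡's, which made me nostalgic.

Even though it is near the station, it was not crowded and I could eat comfortably.

The pork neck steak tasted as I expected and was great value for the price, and the risotto was excellent too. (The risotto was a bit rich, but I like rich food.)

I think I will be coming back from time to time.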

SOA/ASA/PA past exam questions and notes - 21.12.13 exam (for my records)

말킹 — Sun, 7 Apr 2024 20:13:59 +0900

Task6

(a) Explain what the variance and bias values indicate about the relative quality of predictions when comparing predictive models.

The variance figures indicate how much the predictions vary depending on the training data used. As more predictors are used, the variance increases because the model more precisely fits the training data for each trial and becomes less generalized. The bias figures indicate how close expected predictions and actual results are on unseen data. Generally, as more predictors are used, the bias decreases as more accurate predictions are made.

Task7

(a) Explain, for a general audience, what cost complexity pruning does.

Cost complexity pruning is part of a two-step approach to building a tree model. The first step is to build a large, complex decision tree, which is essentially a flow chart for deciding whether to try to transfer an animal.

A second step called pruning is then taken. Pruning reduces the size and complexity of the initial flow chart to a more useful one. The name cost complexity pruning refers to the technical tradeoff being made between how simple the flow chart is and how well it distinguishes whether animals can be transferred or not.

Task8

(a) Boosting - Setting eta as high as possible?

In boosting algorithms, which work by iteratively fitting a model to the residuals of a prior learner, eta, also called the learning rate or shrinkage parameter, slows down the model fitting process so that the residuals from the prior learner do not have too large an influence on the final model. With eta at its maximum of 1, each model iteration is the prior learner plus the model fitting its residuals. While this will run quickly, it will be prone to high variance, overfitting the training data and not generalizing well to unseen data. Setting eta to less than 1 slows down the fitting process by only adding eta times the model fitting the residuals to form the next learner and will substantially reduce the variance.
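The role of eta can be sketched with a tiny boosting loop that uses one-split regression trees (stumps) as weak learners. This is an illustration under my own assumptions (made-up data, stumps, squared error), not the xgboost implementation referred to in the exam.

```python
def fit_stump(x, residuals):
    """Least-squares stump: one split, predicting the mean residual on each side."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_trees, eta):
    """Return the training MSE after each boosting round."""
    prediction = [0.0] * len(y)
    mse_path = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        # Only eta times the new learner is added to the running model
        prediction = [pi + eta * stump(xi) for xi, pi in zip(x, prediction)]
        mse_path.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y))
    return mse_path

# Hypothetical data
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
fast = boost(x, y, n_trees=5, eta=1.0)
slow = boost(x, y, n_trees=5, eta=0.1)
print([round(m, 3) for m in fast])
print([round(m, 3) for m in slow])
```

With eta = 1 the training error collapses within a few rounds, which is exactly the fast-but-overfit behavior described above; with eta = 0.1 the model approaches the training data gradually and needs more trees.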

(b) Explain cross validation and how it can be used to set the eta hyperparameter

Cross validation divides the available data into multiple folds for a series of model fitting runs. Each fold is used as test data exactly once. The average test metric across the runs is the result of the cross validation.

To use cross validation to set the eta hyperparameter, a series of reasonable values for eta would be chosen beforehand. Then, for each value of eta, cross validation would be performed with each model fitting run using the same eta. The result is one average test metric result from each cross validation for each value of eta. The value of eta with the superior test metric, some measure of predictive power on unseen data, would be chosen for subsequent predictive modeling work.
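A minimal sketch of the fold mechanics described above. The `fit` and `score` callables are placeholders for whatever model and metric are being used; the mean-only "model" in the demo is just a stand-in so the example runs on its own.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is in the test fold exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(fit, score, rows, k=5, seed=0):
    """Average test metric across the k model fitting runs."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k, seed):
        model = fit([rows[j] for j in train_idx])
        scores.append(score(model, [rows[j] for j in test_idx]))
    return sum(scores) / len(scores)

# Demo with a stand-in model that always predicts the training mean
def fit_mean(train):
    return sum(train) / len(train)

def mse(model, test):
    return sum((y - model) ** 2 for y in test) / len(test)

rows = [2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0, 1.0, 9.0, 5.0]
cv_mse = cross_validate(fit_mean, mse, rows, k=5)
print(round(cv_mse, 3))
```

To tune a hyperparameter, the same `cross_validate` call would be repeated once per candidate value and the value with the best average metric kept.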

SOA/ASA/PA past exam questions and notes - 22.04.12 exam Task6~13 (for my records)

말킹 — Sat, 6 Apr 2024 16:10:57 +0900

Task6

(a) GLM, each of the Gaussian, Poisson and Gamma distribution.

Gaussian

- The domain is all real values.

- A target variable that could be modeled with the Gaussian distribution is the response time.

Poisson

- The domain is non-negative integers.

- A target variable that could be modeled with the Poisson distribution is the number of calls in an hour.

Gamma distribution

- The domain is positive real values.

- A target variable that could be modeled with the Gamma distribution is the turnout time plus one second since it is continuous and positive.

Task9

(a) Explain how measures of impurity are related to information gain in a decision tree

Information gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most information gain. Therefore, the choice of impurity measure (e.g., Gini or entropy) directly impacts information gain calculations.
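As a quick illustration, information gain is the parent's impurity minus the size-weighted impurity of the two child nodes, and the number changes with the impurity measure. The labels below are made up.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity):
    """Decrease in impurity from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]       # a perfect split
print(information_gain(parent, left, right, gini))
print(information_gain(parent, left, right, entropy))
```

The same perfect split scores 0.5 under Gini and 1.0 under entropy, so gains are only comparable within one impurity measure.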

(c) When concerned about the model variance, recommend whether to use a random forest or a gradient boosting machine.

I recommend a random forest, which tends to do well in reducing variance while having a similar bias to that of a basic tree model. The variance reduction arises from the use of many small trees and sampling of the data (bagging). Both practices hinder overfitting to the idiosyncrasies of the training data, and hence keep the variance low.

Gradient boosting machines use the same underlying training data at each step. This is very effective at reducing bias but is very sensitive to the training data (high variance).

Task10

(a) Explain why decision trees overfit to factor variables with many levels.

In this task, the number of levels is 31. This means the number of ways to split day of the month into two groups is very large, making it likely that the tree will find spurious splits that happen to produce information gain for that particular training data.

Decision trees tend to create splits on categorical variables with many levels because it is easier to choose a split where the information gain is large. However, splitting on these variables is likely to lead to overfitting.
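To put a number on "very large": a factor with L unordered levels can be divided into two non-empty groups in 2^(L-1) - 1 ways, whereas a numeric variable with the same number of distinct values only allows L - 1 cut points. The function names are my own.

```python
def n_binary_partitions(levels):
    """Ways to divide unordered category levels into two non-empty groups."""
    return 2 ** (levels - 1) - 1

def n_ordered_splits(distinct_values):
    """Cut points available for a numeric (ordered) variable."""
    return distinct_values - 1

print(n_binary_partitions(31))   # day of month treated as a factor
print(n_ordered_splits(31))      # day of month treated as numeric
```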

(b) Describe the handling of categorical variables in linear models and tree-based models.

Linear Models :

The coefficient for each level represents the impact relative to the base level of the variable.

Tree-Based Models :

The more levels the variable has, the more potential ways there are to split the categories into groups. Splits are chosen to maximize information gain. Decision trees naturally allow for interactions between categorical variables based on how the tree is created.

Task11

(c) Explain what led to the large difference in AUC values between the train and test datasets.

An AUC of 1.0 indicates the model perfectly predicts the data while an AUC of 0.5 indicates the model performs as well as random chance.

This indicates that the models are overfit to noise in the training datasets. The likely cause is that the chosen hyperparameters create very deep trees, and therefore the hyperparameters need to be adjusted to create simpler trees which are less prone to overfitting.

Task13

(a) Explain how cost-complexity pruning works, including how complexity is optimized.

Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits that do not reduce the model error by a fixed value determined by the complexity parameter. We can use cross validation to optimize the complexity parameter, which is the process of repeatedly training models and testing models on different folds of the data. This is done for different values for the complexity parameter, and the one with the lowest cross validation error is selected as the optimal choice. We then prune back our trained tree using the complexity parameter from the cross validation.
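One way to picture the tradeoff is the usual cost-complexity criterion, error + penalty x (number of leaves): for a given penalty, the subtree with the lowest total cost is kept. The candidate subtrees and error values below are invented for illustration, and the penalty here is a raw per-leaf penalty rather than the scaled cp value some packages report.

```python
# Hypothetical nested subtrees from pruning a large tree: (number of leaves, training error)
subtrees = [(1, 100.0), (2, 70.0), (4, 55.0), (8, 50.0), (16, 48.0)]

def best_subtree(candidates, penalty):
    """Pick the subtree minimizing error + penalty * leaves; returns its leaf count."""
    leaves, _ = min(candidates, key=lambda t: t[1] + penalty * t[0])
    return leaves

for penalty in (0.0, 1.0, 5.0, 20.0):
    print(penalty, best_subtree(subtrees, penalty))
```

A larger penalty keeps a smaller tree; cross validation is what decides which penalty gives the best error on held-out folds.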

(b) State why changing the random seed would affect the tree constructed using cost-complexity pruning.

Each time the cost-complexity pruning algorithm is run, the splits of data used in the cross validation are randomly assigned.

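<the.고로케> - pop-up at Hyundai Department Store, Cheonho

말킹 — Sat, 6 Apr 2024 00:00:11 +0900

Stopped by the basement floor of Hyundai Department Store on the weekend to pick up some snacks.

People were lining up for it, so I tried it. Since it is made with sandwich bread, the bready part is thin, and it tasted better than a regular croquette.

But I am not sure it is worth lining up for. If it is on my way I will probably eat it now and then, but I wonder whether it is worth the wait.

I was curious and searched for where the main store is, but could not find it. Maybe it only does pop-ups.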

SOA/ASA/PA past exam questions and notes - 22.10.11 exam Task6~12 (for my records)

말킹 — Fri, 5 Apr 2024 19:28:04 +0900

Task8

(a) cost-complexity pruning algorithm

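Pruning is a technique used to reduce the complexity of a decision tree and protect against overfitting. Starting from the full tree, the split that contributes the least improvement in fit relative to the complexity it adds is removed. This process is repeated for each remaining split until further pruning would result in decreased model accuracy.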

https://www.youtube.com/watch?v=D0efHEJsfHo

StatQuest

(b) Choosing a complexity parameter based on cross-validation results.

Choosing the value that results in the minimum cross-validation error.

Employing the one standard-error rule. This approach proposes using the complexity parameter for the smallest model within one standard-error of the minimum cross-validation error.
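The two choices can be compared on a small cross-validation table. The cp values, errors, and standard errors below are made-up numbers laid out like a typical cp table, not results from the exam data.

```python
# Hypothetical table rows: (cp, number of splits, CV error, CV standard error)
cv_table = [
    (0.100, 1, 0.80, 0.04),
    (0.050, 2, 0.62, 0.04),
    (0.020, 4, 0.55, 0.03),
    (0.010, 7, 0.53, 0.03),
    (0.005, 12, 0.54, 0.03),
]

def min_error_cp(table):
    """cp of the row with the minimum cross-validation error."""
    return min(table, key=lambda row: row[2])[0]

def one_se_cp(table):
    """cp of the smallest tree whose CV error is within one SE of the minimum."""
    best = min(table, key=lambda row: row[2])
    cutoff = best[2] + best[3]
    candidates = [row for row in table if row[2] <= cutoff]
    return min(candidates, key=lambda row: row[1])[0]

print(min_error_cp(cv_table), one_se_cp(cv_table))
```

Here the minimum-error rule picks the 7-split tree, while the one standard-error rule settles for the 4-split tree because its error is statistically indistinguishable from the minimum.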

Task9

(a) Explain the difference between accuracy and AUC in terms of overall model assessment.

Accuracy is measured by the ratio of the number of correct predictions to the total number of predictions made.

AUC measures the area under the ROC curve. It assesses the overall model performance by measuring how true positive rate and false positive rate trade off across a range of possible classification thresholds.

AUC measures performance across the full range of thresholds while accuracy measures performance only at the selected threshold.
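A small sketch of the contrast: accuracy needs a threshold, while AUC does not. AUC is computed here through its rank interpretation (the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, ties counting half), which equals the area under the ROC curve. The labels and scores are made up.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Share of correct predictions at one chosen threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(
        (1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(accuracy(y_true, scores, 0.5), accuracy(y_true, scores, 0.9), auc(y_true, scores))
```

Moving the threshold from 0.5 to 0.9 changes the accuracy, but the AUC stays the same because it summarizes all thresholds at once.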

(b) Explain why the ROC curve always goes through (0,0) and (1,1)

(0,0) : the true positive rate (sensitivity) is zero and the false positive rate (1 - specificity) is zero, i.e. specificity is 1. Nothing is classified as positive.

(1,1) : the true positive rate (sensitivity) is 1 and the false positive rate is 1, i.e. specificity is zero. Everything is classified as positive.
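The two endpoints can be checked by pushing the classification threshold to its extremes. A minimal sketch with made-up labels and scores; the function name is mine.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at a given threshold."""
    predicted_pos = [s >= threshold for s in scores]
    tp = sum(p and t == 1 for p, t in zip(predicted_pos, y_true))
    fp = sum(p and t == 0 for p, t in zip(predicted_pos, y_true))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_point(y_true, scores, float("inf")))    # threshold above every score
print(roc_point(y_true, scores, float("-inf")))   # threshold below every score
```

Whatever the model, a threshold above every score classifies nothing as positive and a threshold below every score classifies everything as positive, so every ROC curve shares these two points.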

https://www.youtube.com/watch?v=4jRBRDbJemM

(c) Gradient boosting machine tree model, Explain why model performance deteriorates as the number of trees increases.

A GBM iteratively builds trees fit to the residuals of prior trees. Depending on the hyperparameters, this model can produce a very complex model, which is susceptible to overfitting to patterns in the training data.

AUC on the testing data starts to drop, which indicates the model is overfit to the training data.

https://www.youtube.com/watch?v=LsK-xG1cLYA

https://www.youtube.com/watch?v=3CC4N4z3GJc

https://www.youtube.com/watch?v=2xudPOBz-vs

I don't fully understand this yet. Rewatch these whenever I have time.

(d) Describe two hyperparameters to improve model performance.

Early stopping : Early stopping criteria, such as improvement of the performance metrics in each subsequent tree, can stop training when it detects the improvement is marginal. This avoids overfitting.

Controlling learning rate : Learning rate controls the impact of subsequent trees to the overall model outcome. This reduces the extent to which a single tree is able to influence the model fitting process.

(e) How to tune a hyperparameter.

Tuning a hyperparameter requires first varying the hyperparameter across a range of possible values and performing cross validation at each value. Performance is then determined based on a cross-validation performance metric, for example AUC, and the hyperparameter value with best performance based on this metric is selected.

Task11

(d) How changing the link function in the GLM impacts the model fitting and how this can impact predictor significance.

The link function specifies a functional relationship between the linear predictor and the mean of the distribution of the outcome conditional on the predictor variables. Different link functions have different shapes and can therefore fit to different nonlinear relationships between the predictors and the target variable.

When the link function matches the relationship of a predictor variable, the mean of the outcome distribution (the prediction) will generally be close to the actual values for the target variable, resulting in smaller residuals and smaller (more significant) p-values.

Task12

(a) Proxy variable

Proxy variables are variables that are used in place of other information, usually because the desired information is either impossible or impractical to measure. For a variable to be a good proxy it must have a close relationship with the variable of interest.

SOA/ASA/PA past exam questions and notes - 22.10.11 exam Task1~5 (for my records)

말킹 — Fri, 5 Apr 2024 15:27:44 +0900

Task2

(c) Discuss the benefits of stratified sampling

Stratified sampling results in test and train datasets that are similar with respect to the stratification variables.
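A minimal sketch of a stratified 70/30 split: shuffle and cut within each stratum so both sets keep the same mix. The row format, the "target" key, and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(rows, strata_key, train_fraction=0.7, seed=0):
    """Split rows into train/test so each stratum keeps the same proportion."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[strata_key(row)].append(row)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical data: 20 rows of class "A" and 10 rows of class "B"
rows = [{"id": i, "target": "A"} for i in range(20)]
rows += [{"id": i, "target": "B"} for i in range(20, 30)]
train, test = stratified_split(rows, strata_key=lambda r: r["target"])

def share_a(part):
    return sum(r["target"] == "A" for r in part) / len(part)

print(len(train), len(test), share_a(train), share_a(test))
```

With a purely random split the class mix in a small test set can drift from the training set by chance; stratifying on the target removes that source of noise.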

Wikipedia

In statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations.

In computational statistics, stratified sampling is a method of variance reduction when Monte Carlo methods are used to estimate population statistics from a known population.[1]

Stratified sampling - Wikipedia


Task3

(c) best subset selection vs stepwise selection

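best subset selection > global minimum (every combination of predictors is fitted and compared)

stepwise selection > local minimum (predictors are added or dropped one at a time), computationally more efficient

> Not sure I will be able to explain each technique in English at the exam, lol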
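The efficiency gap can be quantified by counting fitted models: best subset considers all 2^p subsets of p predictors, while forward stepwise fits the null model plus p, p-1, ..., 1 candidates over its steps, i.e. 1 + p(p+1)/2 models. The function names are my own.

```python
def n_models_best_subset(p):
    """Every subset of p predictors, including the empty model."""
    return 2 ** p

def n_models_forward_stepwise(p):
    """Null model plus p + (p-1) + ... + 1 candidate fits."""
    return 1 + p * (p + 1) // 2

for p in (10, 20):
    print(p, n_models_best_subset(p), n_models_forward_stepwise(p))
```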

Task4

(a) Describe two ways impurity measures are used in a classification tree.

- which split in the decision tree should be made next.

- which branches of the tree to prune back after building a decision tree.

https://www.youtube.com/watch?v=_L39rN6gz7Y&t=348s

StatQuest: Gini impurity is covered from around the 6-minute mark

Task5

(a) Poisson regression vs Quasi-Poisson regression

An underlying assumption of Poisson regression is that the mean and variance are equal.

Quasi-Poisson regression is equipped to deal with the problem of overdispersion. The estimates of the coefficients are the same when compared to the Poisson output. However, the standard errors are all higher and fewer coefficients are statistically significant. If any further analysis is conducted such as deriving confidence intervals or conducting hypothesis tests, the quasi-Poisson distribution should be used.
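For the simplest case, an intercept-only model with log link, the comparison can be worked by hand: the coefficient is log(mean) under both models, the dispersion is estimated as the Pearson chi-square divided by the residual degrees of freedom, and the quasi-Poisson standard error is the Poisson one multiplied by the square root of the dispersion. The counts below are made up and deliberately overdispersed.

```python
import math

def intercept_only_fit(counts):
    """Poisson vs quasi-Poisson summary for an intercept-only log-link model."""
    n = len(counts)
    mu = sum(counts) / n
    coefficient = math.log(mu)                       # same under both models
    # Pearson chi-square / residual degrees of freedom (n - 1 here)
    dispersion = sum((y - mu) ** 2 / mu for y in counts) / (n - 1)
    se_poisson = math.sqrt(1.0 / (n * mu))           # assumes variance = mean
    se_quasi = math.sqrt(dispersion) * se_poisson    # variance = dispersion * mean
    return coefficient, dispersion, se_poisson, se_quasi

counts = [0, 1, 2, 10, 0, 15, 3, 1]                  # hypothetical counts
coefficient, dispersion, se_poisson, se_quasi = intercept_only_fit(counts)
print(round(coefficient, 3), round(dispersion, 2), round(se_poisson, 3), round(se_quasi, 3))
```

A dispersion well above 1 signals overdispersion, and the wider quasi-Poisson standard error is what makes fewer coefficients come out significant.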

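<달인즉석계란말이김밥-성내직영점> - Cheonho Station restaurant, bunsik (Korean snack food)

말킹 — Fri, 5 Apr 2024 00:00:21 +0900

A place I kept meaning to stop by whenever I passed. The interior is modest.

Takeout egg-roll gimbap and tteokbokki, plus udon cooked at home.

The egg-roll gimbap is the main item, but the tteokbokki was a success as well.

I was happy to find a good new snack shop in the neighborhood I just moved to.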