Task5 Perform a principal components analysis

  Principal components analysis (PCA) is a method for summarizing high-dimensional numeric data with fewer dimensions while preserving the spread of the data. It can be particularly helpful when variables are highly correlated. PCA finds orthogonal linear combinations of the input variables (which are typically centered and scaled), called principal components (PCs), that maximize variance to retain as much information as possible. The principal components are ordered by their variance, from largest to smallest, and the sum of their variances equals the total variance of the input variables. It is then common to look at the proportion of variance explained by each principal component to decide how many PCs to use.
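
 The description above can be sketched in a few lines. This is only an illustration, assuming NumPy is available; the simulated data, the variable names, and the choice to take eigenvalues of the correlation matrix are my own assumptions, not part of the exam material.

```python
import numpy as np

# Simulated data (an assumption for illustration): three noisy copies
# of one latent score, so the three columns are highly correlated.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.2 * rng.normal(size=n) for _ in range(3)])

# Center and scale each variable, as is typical before PCA.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix of the scaled data.
corr = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)

# eigh returns ascending eigenvalues; reorder so PC1 has the largest variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue is the variance of one PC; together they sum to the
# total variance (3 here, because each scaled variable has variance 1).
prop_var = eigvals / eigvals.sum()
scores = Z @ eigvecs  # the principal components (new, uncorrelated variables)

print("PC variances:          ", np.round(eigvals, 4))
print("proportion of variance:", np.round(prop_var, 4))
```

 Because the three simulated columns share one latent score, PC1 should carry almost all of the variance, which is the situation where keeping only the first PC is reasonable.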

 Advantages

 PCA could allow us to build a simpler model with fewer features. PCA can help visualize high-dimensional data to explore relationships between variables. PCA can help identify latent variables.

 Disadvantages

 Using a subset of the principal components results in some information loss. The principal components are less interpretable than the original input variables. Although PCA reduces the dimensionality of the model, the original variables must still be collected for future predictions, so no data-collection efficiency is gained.

 

Task6 Construct a decision tree

  When a decision tree is trained, it can become very large and include splits that are not particularly valuable for predictions on new data. When examining the fit of a tree, it is a good idea to try to prune a tree when it has many splits that do not improve performance. Pruning reduces the size of the tree, hopefully removing less valuable splits from the tree. This process reduces overfitting the tree on the training data, can lead to better predictions, and results in a simpler, more interpretable tree.
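
 One way to see pruning in action is cost-complexity pruning. The sketch below assumes scikit-learn and NumPy are installed and uses simulated data; the `ccp_alpha` value of 0.01 is an arbitrary choice for illustration, and in practice it would usually be picked by cross-validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated noisy data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# An unrestricted tree keeps splitting until it fits the noise.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# ccp_alpha > 0 turns on cost-complexity pruning: a subtree is collapsed
# when its impurity reduction does not justify its extra leaves.
pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=0.01).fit(X, y)

print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning: ", pruned_tree.get_n_leaves())
```

 The pruned tree keeps the splits that capture the overall shape of the signal and drops the many small splits that only chase noise.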

 

 

 

Task9 Discuss the bias-variance tradeoff

 Bias is the expected loss caused by the model not being complex enough to capture the signal in the data. Variance is the expected loss from the model being too complex and overfitting to the training data.

 With high variance (overfitting), the model will perform better on the training set than on a test set. With high bias (underfitting), the model will perform poorly on both the training set and the test set.
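
 A small simulation can make the two failure modes concrete. This is a sketch under my own assumptions (NumPy available, a sine-shaped signal, polynomial degrees 1, 3 and 9 standing in for "too simple", "about right" and "too complex"); it is not taken from the exam.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def make_data(n):
    """Simulated data: a sine signal plus noise (illustrative assumption)."""
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree on the training set.
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = mse(model, x_train, y_train)
    test_mse[degree] = mse(model, x_test, y_test)
    print(f"degree {degree}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

 The straight line (degree 1) cannot follow the sine shape, so it does poorly on both sets, which is the high-bias pattern. Training error always falls as the degree rises, so a gap between training and test error at the highest degree is the sign of high variance.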

 

Task2

(a) Explain how PCA is typically used.

 PCA is an unsupervised learning technique that creates new uncorrelated variables that maximize variance. Often, the first few principal components explain most of the variability in the original variables. These principal components can be used in place of the original variables to reduce dimensionality and create a simpler model.

(b) Characteristics of PCA

 PCA is effective when there is high dimensionality (many variables) which can make univariate and bivariate data exploration and visualization techniques less effective. PCA is used to summarize high-dimensional data into fewer composite variables while retaining as much information as possible.

 PCA attempts to maximize the variance or spread in our data distribution by linearly combining original variables.

 

 

Task3

(a) Assumptions for OLS

 - The residuals have a normal distribution.

 - The mean of the residuals is zero.

 - The residual variance is constant (homoscedasticity).
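
 These assumptions can be checked roughly from the residuals of a fitted model. The sketch below assumes NumPy and uses simulated data; the specific checks (comparing residual spread across low and high fitted values, and looking at skewness and kurtosis) are simple stand-ins I chose for the usual diagnostic plots.

```python
import numpy as np

# Simulated data that satisfies the assumptions (for illustration only).
rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Mean of residuals: zero up to rounding whenever an intercept is fitted.
print("mean residual:", resid.mean())

# Constant variance: residual spread should be similar for low and
# high fitted values.
low = resid[fitted <= np.median(fitted)]
high = resid[fitted > np.median(fitted)]
print("residual sd (low, high):", low.std(ddof=1), high.std(ddof=1))

# Normality: skewness and excess kurtosis should both be near 0.
z = (resid - resid.mean()) / resid.std()
print("skewness:", np.mean(z ** 3), "excess kurtosis:", np.mean(z ** 4) - 3)
```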

 

 

Task4 ridge/lasso/elastic net

(b) 

 Model1
 - Type: Ridge or Elastic Net
 - alpha: 0 <= alpha < 1
 - Benefit: Reduces variance by shrinking coefficients.

 Model2
 - Type: Elastic Net
 - alpha: 0 < alpha < 1
 - Benefit: Reduces variance by shrinking coefficients, can also be used to perform model selection, and is helpful in instances where there is high-dimensional data with few data points.

 Model3
 - Type: Lasso or Elastic Net
 - alpha: 0 < alpha <= 1
 - Benefit: Reduces variance by shrinking coefficients and can also be used to perform model selection and remove nonpredictive variables.
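
 The alpha ranges above suggest the glmnet-style parameterization, where alpha = 0 gives ridge and alpha = 1 gives lasso. Assuming that convention, the penalty added to the loss can be sketched as below; the function name and the example coefficients are hypothetical.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic-net penalty, assuming the glmnet-style form
    lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|))."""
    l1 = sum(abs(b) for b in coefs)     # lasso part: can shrink coefficients to 0
    l2_sq = sum(b * b for b in coefs)   # ridge part: shrinks but never zeroes
    return lam * ((1 - alpha) / 2 * l2_sq + alpha * l1)

coefs = [1.0, -2.0, 0.0]  # hypothetical coefficients
ridge = elastic_net_penalty(coefs, lam=1.0, alpha=0.0)  # pure ridge
lasso = elastic_net_penalty(coefs, lam=1.0, alpha=1.0)  # pure lasso
mixed = elastic_net_penalty(coefs, lam=1.0, alpha=0.5)  # elastic net

print(ridge, lasso, mixed)
```

 Only the absolute-value term can push a coefficient exactly to zero, which is why model selection shows up in the Benefit rows only when alpha > 0.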

 

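> The ridge/lasso/elastic-net topic also does not really sink in when read as text, but StatQuest explains it well with visualizations...

These are methods for preventing overfitting. I am not sure I could handle a question that goes beyond the ridge vs lasso difference (whether a coefficient can become 0).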

 

https://www.youtube.com/watch?v=Q81RR3yKn30

StatQuest - ridge

https://www.youtube.com/watch?v=NGf0voTMlcs

StatQuest - lasso

https://www.youtube.com/watch?v=Xm2C_gTAl8c

StatQuest - ridge vs lasso

https://www.youtube.com/watch?v=1dKRdX9bfIo

StatQuest - elastic-net

 

Task8

(a) Interpret standard deviation and proportion of variance in the output.

 A standard deviation of 1.6983 implies strong correlation among the three SAT features.

 The proportion of variance of 0.9614 (PC1) implies that the three SAT variables are highly correlated and that it is reasonable to use PC1 as a replacement for the three SAT scores in a predictive model.
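
 The two numbers are linked by simple arithmetic. The check below assumes the three SAT variables were standardized before PCA, so that the total variance is 3; that assumption is mine, but it reproduces the reported proportion.

```python
sd_pc1 = 1.6983   # standard deviation of PC1 from the output
n_vars = 3        # three SAT variables, assumed scaled to variance 1 each

var_pc1 = sd_pc1 ** 2          # variance of PC1
prop_pc1 = var_pc1 / n_vars    # share of the total variance

print(round(var_pc1, 4), round(prop_pc1, 4))  # 2.8842 0.9614
```

 If the variables were uncorrelated, each PC would have a standard deviation near 1, so a value of about 1.7 out of a maximum of sqrt(3) (about 1.732) is what signals the strong correlation.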

 

(b) Interpret the "Loadings of Principal Components" for PC1 and PC2

 The similar loading values in PC1 imply that the SAT scores in three-dimensional space fall near the line Writing SAT = Math SAT = Reading SAT. PC1 primarily represents the average score because it points in nearly the same direction as each of the SAT variables.

 PC2 shows that the residual variance not explained by PC1 can be mostly explained by the Reading SAT and the Math SAT being positively correlated with each other and negatively correlated with the Writing SAT. (Hmm?)

 

https://www.youtube.com/watch?v=FgakZw6K1QQ

StatQuest

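 Since this is a dimension-reduction method for when there are many explanatory variables, I could not grasp it intuitively from text alone, but this video seems to explain it well with visualizations.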

 

https://www.youtube.com/watch?v=HMOI_lkzW08

StatQuest

 A more compact version of the video above! I watched it to see what it covers, but if you have seen the one above, you can probably skip this one.

 

 

Task10

(a) Explain why tree-based models are resilient to outliers in predictor variables.

 By partitioning the outliers, their effect can be isolated from the other leaves, resulting in the body of the distribution being unaffected.
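
 A toy example shows the idea: a tree split depends only on the ordering of the predictor values, so making an outlier more extreme does not move the split. The data and helper functions below are hypothetical, and the one-split tree is a simplification of what a real tree algorithm does.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Threshold of the single split that minimizes total squared error."""
    pairs = sorted(zip(xs, ys))
    best_total, best_threshold = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        total = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if total < best_total:
            best_total = total
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold

def ols_slope(xs, ys):
    """Slope of a simple least-squares line, for comparison."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

ys = [1, 1, 1, 1, 5, 5, 5, 5]
xs_clean = [1, 2, 3, 4, 5, 6, 7, 8]
xs_outlier = [1, 2, 3, 4, 5, 6, 7, 8000]  # last predictor value is an outlier

split_clean = best_split(xs_clean, ys)
split_outlier = best_split(xs_outlier, ys)
slope_clean = ols_slope(xs_clean, ys)
slope_outlier = ols_slope(xs_outlier, ys)

print("tree split:", split_clean, split_outlier)
print("OLS slope: ", slope_clean, slope_outlier)
```

 The split stays at 4.5 in both cases, while the least-squares slope collapses toward zero once the outlier is introduced.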

(b) Recommend which metric to use for outliers in the target variable. RMSE vs MAE

 I recommend MAE because it is more robust to outliers. Outliers tend to result in large error terms, which have an outsized impact on RMSE, due to the squaring of the error term.
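
 The effect of squaring is easy to see with made-up error terms. The numbers below are hypothetical and chosen only so that one prediction misses badly.

```python
import math

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

typical = [1.0, -1.0, 1.0, -1.0, 1.0]        # all errors the same size
with_outlier = [1.0, -1.0, 1.0, -1.0, 10.0]  # one large error from an outlier

print("typical:      MAE", mae(typical), "RMSE", rmse(typical))
print("with outlier: MAE", mae(with_outlier), "RMSE", round(rmse(with_outlier), 3))
```

 With equal-sized errors the two metrics agree. After a single large error, MAE rises from 1.0 to 2.8, while RMSE rises from 1.0 to about 4.56, so the outlier dominates RMSE far more.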
