Machine Learning Final

Discuss with an example: The core of machine learning is pattern recognition

Life and the world are not random collections of atoms. There are patterns everywhere, which form, for example, the basis of science, and also the basis of human decision making. Many of these patterns are not easily detected by humans. The usefulness of

Why is Python a good language to use for machine learning?

Python offers a framework for machine learning, so if you run one machine learning analysis with one algorithm, it is easy to re-run the same code with another algorithm just by changing the name of the algorithm

Briefly explain the concept and purpose of a Jupyter notebook

The Jupyter notebook allows the user to interactively program and run analyses with full documentation.

What is a package manager? Why do we use a package manager for our Python work, and which package manager do we use?

Base Python by itself makes an excellent programming language. But one does not want to program everything; instead, use other, already developed software, such as for machine learning. This additional software is organized into packages of related functions. A package

What are the two types of cells in a Jupyter notebook? What is the purpose of each?

A code cell is for entering and running Python code. A markdown cell is for documentation.

Explain the concept of a current working directory

The Jupyter notebook needs a starting point for referencing files to read and write. That reference point is the current working directory. All file references are relative to this directory

What is the relationship of Python and Pandas?

Pandas is a package of functions that add pre-built data analysis capabilities to Python. The primary Pandas data structure is the data frame.

What is a data table and how are the data values organized?

A data table is a rectangular table of the data values subject to analysis. The first row contains the variable names, each other row contains the data for a single unit of analysis, such as a person or company. Each column contains the data values for a

What is a csv file? What are its properties, its primary advantage and its primary disadvantage (compared to an equivalent Excel file)?

A csv file is a comma separated values file, pure text, so readable by virtually every application that can read text. Its primary advantage is near-universal readability. Its primary disadvantage is that, unlike a worksheet, its columns are not aligned when vi

What is the distinction between categorical and continuous variables? Provide an example variable of each along with some sample values.

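The values of categorical variables are non-numeric categories, or integers that serve only as category codes, and there are relatively few unique values. Continuous variables are always numeric and have many possible values. For example, Gender is categorical, with values such as M and F, and Height is continuous, with values such as 64.5 or 71.2 inches.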

What is data wrangling? Why is it important?

Data almost never arrives ready for analysis. Many issues need to be addressed, such as inconsistent coding of responses, missing data, and superfluous variables. Even with those issues settled, the data usually needs pre-processing, including standardiza

Explain the concept of a row name in a data frame. Describe the default row names and the advantage of replacing them with a suitable column from the read data frame.

Each row has a unique identifier. By default, the identifiers are the consecutive integers, starting from zero. However, the data file may contain a unique identifier as one of the already existing columns, such as Name for a data file of employees. In th

What is a variable transformation and how do you perform one with Python Pandas?

A variable transformation defines a new variable, or creates new values for an existing variable, by performing an arithmetic operation on existing variables. To specify in Python, simply specify the arithmetic operation, realizing that variables are iden

What is the distinction between the Pandas .loc and .iloc methods? What is their purpose?

These two methods subset a data frame, by rows and/or columns. .loc subsets by row name or column name. .iloc subsets by index, i.e., the ordinal position of the row or column, starting with (unfortunately) 0.
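
As a minimal sketch of the distinction, assume a small hypothetical data frame df with employee names as the row names. The names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame(
    {"Salary": [50000, 62000, 58000], "Years": [2, 7, 5]},
    index=["Ana", "Raj", "Lee"],
)

# .loc subsets by row name and column name
by_name = df.loc["Raj", "Salary"]

# .iloc subsets by ordinal position, counting from 0
by_position = df.iloc[1, 0]

# Both expressions refer to the same cell: second row, first column
print(by_name, by_position)
```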

Write the Python expression for referring to variables x1, x2 and x3 in the df data frame

most general, df.loc[:, ['x1','x2','x3']], or, more briefly, df[['x1','x2','x3']]. In both forms the column names are passed as a list.

Explain the following Pandas code: data[rows, columns]

A data frame is a 2-D object, rows and columns. Any one data value in a data frame is identified by its row and column coordinates. This notation specifies the name of the data frame and then references data values in one or more rows and columns.

What does it mean to filter rows by data values?

Filtering subsets a data frame by rows, selecting only those samples that satisfy some logical criterion, such as Gender == 'F', which reduces a data frame down to only those rows of data marked with F as the gender
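
A short sketch of such a filter, assuming a hypothetical data frame with a Gender column. The data values are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame(
    {"Gender": ["F", "M", "F", "M"], "Salary": [61000, 52000, 58000, 49000]}
)

# The logical criterion yields one True/False value per row
is_female = df["Gender"] == "F"

# Keep only the rows for which the criterion is True
females = df[is_female]
print(females)
```

The filtered data frame keeps the original row names, so the retained rows can still be traced back to the full data.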

What is the purpose of the variable type category? When should it be used?

A Pandas object refers to a non-numerical object, which is necessarily a categorical variable. But categorical variables can also be integer variables. A category is a newer data type meant specifically to refer to any categorical variables. By default, n

What is the purpose of an indicator (or dummy) variable?

An indicator variable is a numerical representation of a categorical variable. The number of indicator variables formed is equal to the number of levels or categories. A dummy variable is an indicator variable that assumes the value 0 if the category leve

Consider an item on a survey with three possible responses D (disagree) N (neutral) A (agree). What indicator variables would be defined and how are the values of those variables determined?

Three indicators would be defined, one for each potential response. The values of these variables would be 0's and 1's, with the indicator variables getting a '0' where that response was not present in the initial categorical variable, and a '1' where tha

What are quartiles of a distribution and how are they computed?

For a given variable, the first quartile (Q1) is the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highe

How is the inter-quartile range analogous to the standard deviation in terms of both being summary statistics of a distribution of a continuous variable?

The more variable the values of a distribution, the more extreme are the first and third quartiles of the distribution. The IQR is the positive difference between the first and third quartiles. So the larger the variability of the values of a variable, th

Define outliers of a distribution in terms of its inter-quartile range.

The traditional definition is that an outlier lies more than 1.5 IQRs below the first quartile or above the third quartile of the distribution.
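
The rule can be sketched as follows. The data values are invented for illustration, and the quartiles come from NumPy's default interpolation, which is only one of several conventions for computing quartiles.

```python
import numpy as np

# Hypothetical values, invented for illustration only
x = np.array([2, 4, 4, 5, 6, 7, 8, 9, 40])

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Cutoffs at 1.5 IQRs beyond the first and third quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the cutoffs are flagged as outliers
outliers = x[(x < lower) | (x > upper)]
print(q1, q3, iqr, outliers)
```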

What does it mean to say that the median and inter-quartile range are robust to outliers?

If all the values of a distribution remain the same except that the largest value of the distribution changes from 10 to 10,000,000,000, the median and IQR remain unchanged. On the contrary, the mean and standard deviation will be drastically affected.
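
This can be checked with a small sketch using Python's standard statistics module. The values are invented for illustration, and the helper iqr is a hypothetical name defined here, not a library function.

```python
import statistics

def iqr(data):
    # Difference between the third and first quartiles
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

# Hypothetical values; the second list changes only the largest value
values = [1, 2, 3, 4, 5, 6, 7, 8, 10]
extreme = values[:-1] + [10_000_000_000]

for name, data in [("original", values), ("extreme", extreme)]:
    print(name, statistics.median(data), iqr(data),
          statistics.mean(data), statistics.stdev(data))
```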

What is standardization to z-scores?

A z-score indicates how many standard deviations the original value is from the mean of the distribution. The distribution of z-scores has the same shape as the original distribution, but with a different scaling.

What are the mean and standard deviation of a distribution of z-scores?

The mean of a distribution of z-scores is 0, with a standard deviation of 1
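
A brief check of this property, with invented data values. The sketch assumes the standard deviation is computed with n in the denominator, which is NumPy's default; using n-1 would give a standard deviation of the z-scores slightly different from 1.

```python
import numpy as np

# Hypothetical values, invented for illustration only
x = np.array([12.0, 15.0, 9.0, 20.0, 14.0])

# z-score: distance from the mean in standard deviation units
z = (x - x.mean()) / x.std()

# Mean is 0 and standard deviation is 1, apart from rounding error
print(z.mean(), z.std())
```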

What is their range if the distribution is normal?

For a normal distribution, a little more than 95% of the values fall within two standard deviations of the mean, so most z-scores lie between -2 and 2, and almost all lie between -3 and 3.

What is the only way to know for sure which rescaling is best for the data values of the predictor variables (features) - MinMax, Standardization, or Robust Scaling?

A theme of machine learning is to see what works. Keep testing data separate from training data, and do whatever you want with the data, choosing what ultimately works best. Most machine learning algorithms perform better, that is, more accurate forecasts

What is supervised machine learning?

Based on a prediction equation, or in more complex cases, a network of inter-related prediction equations, supervised machine learning forecasts unknown values of a variable of interest.

What are the two goals of supervised machine learning?

Two important goals are accomplished with regression analysis:
• Understand the relationship between a predictor variable (feature) and the response variable
• Forecast the unknown future values of a response variable
Adding relevant predictor variables w

What is a linear model? What are its parameters?

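A linear model is a weighted sum of variables plus a constant term. Its parameters are the weights (the slope coefficients) and the constant term (the y-intercept).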

What is the shape of the visualization of a linear model?

A flat surface. In two dimensions, that is a line. In three dimensions, a plane. And beyond, a hyperplane.

What is y^? What are the two primary situations in which it is applied?

Given a regression model, y^ is the value calculated from the values of the predictor variables. If applied to the data from which the model was estimated, y^ is the fitted value of y, that is, fit by the model for the associated values of the predictor variables, c

Graph of X with y vs the graph of X with y^.

The graph of X with Y^ is a single line (for a linear function to predict y). The graph of X and y is a scatter plot.

Meaning of the slope coefficient in y^ = b0 + b1X1

In this regression model with a single predictor (feature), b1 is the slope coefficient, estimated from the data. The slope coefficient determines, on average, how much y changes with an increase of 1 unit in X. When applied to a regression model, this cha

Meaning and interpretation of the hypothesis test of the slope coefficient.

Each predictor variable in the model is associated with a slope coefficient. The purpose of each t-test, as specified by its null hypothesis, is to evaluate the hypothesis that the relation, as specified by the population slope coefficient for the pred

Meaning and interpretation of the confidence interval of the slope coefficient.

Each predictor variable in the model is associated with a slope coefficient. The slope coefficient specifies the average change in y for a one-unit increase in the predictor variable. For each estimated slope coefficient, bj, that is, the sampl

Meaning of the residual variable e

The residual variable e is the difference between the actual value of y and the estimated value of y, y^. The residual or error represents the influences on the value of y not explained or accounted for by the model.

Criterion of ordinary least squares regression to obtain the estimated model.

The least squares criterion is the choice of the regression model that minimizes the sum of squared residuals across all the rows of data in the analysis. That is, this estimation process yields values of each BJ such that, as a set, yield the linear func

How is the least-squares regression model obtained with a gradient descent solution?

An initial, even arbitrary solution for the model parameters is given. Then, to minimize the squared errors across all the rows of data, the parameter values are changed. Then again. Then again, each time getting closer to the smallest possible sum of squ

Model Fit: The standard deviation of the residuals to interpret model fit

If the residuals are normally distributed as the result of a random process, as they usually are, then +2 and -2 standard deviations on either side of zero contain about 95% of the forecasting errors. How is the standard deviation of the residuals used

Why is R-squared called a relative index of fit?

R-sq literally compares the residuals from two models: the specified model, and the null model where the X's are unrelated to y so that the forecast is just the mean of y.
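
The comparison can be sketched with a one-predictor example. The data values are invented for illustration, and the line is fit with NumPy's polyfit as one convenient way to obtain the least-squares estimates.

```python
import numpy as np

# Hypothetical values, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Specified model: least-squares line y^ = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Sum of squared residuals for the specified model
ss_model = np.sum((y - y_hat) ** 2)

# Sum of squared residuals for the null model, which forecasts the mean of y
ss_null = np.sum((y - y.mean()) ** 2)

# R-squared: proportional reduction in squared error relative to the null model
r_squared = 1 - ss_model / ss_null
print(r_squared)
```

A value near 1 means the specified model leaves little of the null model's squared error, and a value near 0 means the predictors add almost nothing beyond the mean of y.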

What is model validation and what is the problem with using training data to validate a model?

Every data set sampled from a population differs from any other data set sampled from the same population. Every sample reflects the underlying population values, but every corresponding sample value, such as the mean, does not equal the corresponding pop

Briefly explain how multiple regression enhances the two primary purposes of regression analysis?

Predictor variables (features) that are relevant (correlate with y), and provide unique information (don't correlate with the other X's), lead to a) better, more accurate prediction, and b) a better understanding of how the variables are related to each o

Criteria that a potential feature (predictor variable) should satisfy before being added to a model.

Predictor variables (features) should be relevant (correlate with y), and provide unique information (do not correlate with the other X's).

What is the purpose and benefit of feature selection (i.e., select predictor variables for a model)

Not all potential features are relevant (correlate with y) and unique (do not correlate with other X's). As such, they contribute little, or even detract, from forecasting efficiency and model interpretability. Particularly for large data sets, they can a

One potential issue with multiple regression is collinearity. Describe the problem and how it can be addressed

Collinearity means that predictor variables (features) correlate substantially with each other. Collinearity increases the standard errors of the estimated collinear slope coefficients as the estimation algorithm cannot readily separate their effects (hol

How does a heat map facilitate feature selection?

The heat map is a visualization of a correlation matrix. An informal, but useful, feature selection technique is to delete some collinear features. The heat map can not only provide the correlations, but also color codes each according to the magnitude of

For a given set of customers, almost all weigh between 110 and 300 lbs. One customer, an outlier, reports a weight of 460 lbs. What is the basis for dropping the customer from the analysis? How does such an action change the reported results?

Either the data value is mis-entered, or, if correct, any generalizations would not properly apply to these people, and perhaps bias the model for the vast majority of customers. In practice, experiment with different deletion thresholds as perhaps a bett

What is generally the best statistic to use to identify outliers in regression analysis? What is its meaning?

Cook's Distance is an influence statistic. That is, it indicates the influence of a single observation, the values of the predictor variables in a single row of data, on the value of the estimated regression coefficients, the y-intercept and slope coeffic

Distinguish between training data and testing data

A core concept of machine learning is that forecasting efficiency cannot be evaluated on the data on which the model trained, i.e., the data from which the model coefficients are estimated. That evaluation can only occur by observing the errors of applyin

Why can a model not be properly evaluated on its training data?

Every data set sampled from a population differs from any other data set sampled from the same population. Every sample reflects the underlying population values, but every corresponding sample value, such as the mean, does not equal the corresponding pop

Define overfitting.

Overfitting is when a model is too complex, where the extra complexity takes advantage of random sampling fluctuations in the training data to increase fit.

What is the problem overfitting presents in evaluating model forecasting performance?

An overfit model fits the training data well, but has poor generalization to actual forecasting, that is, to new data (e.g., the testing data). The good fit to the training data is irrelevant.

How can the analyst determine if a model is overfit?

Compare the fit of the model from the training data to the testing data. If there is a big decrease, the model is overfit to the training data.
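
A sketch of that comparison with sklearn, under stated assumptions: the data are simulated with make_classification, with 20% of the labels randomly flipped to add noise, and the decision tree is left at its default unlimited depth so that it is free to overfit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simulated data with noisy labels (an assumption for illustration)
X, y = make_classification(
    n_samples=400, n_features=6, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Unlimited depth lets the tree fit the training noise
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Accuracy on the training data vs. the testing data
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(train_acc, test_acc)
```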

Define underfitting, and discuss the problem it presents for model development.

Underfitting means the model is too simple to capture all the information in the training data that is not random variation, but reflects stable aspects of the underlying population.

What does it mean to state that "A model should be made as simple as possible, but not simpler."?

Make the model as complex as it can be to capture the relevant information in the training data to avoid underfitting without so much complexity the model overfits.

What is a hold-out sample, and what is its purpose?

A hold-out sample is the testing data, data on which the model was not trained (fit). Its purpose is to evaluate the forecasting efficacy of the model in a true forecasting situation of which the model is "unaware" of the value of y, but the analyst is aw

How does k-fold cross-validation extend the concept of a hold-out sample?

With k-fold validation there are k hold-out samples and so k cross-validations.
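
A minimal sklearn sketch with k = 5, assuming simulated regression data. Each of the 5 folds serves once as the hold-out sample, and the score reported for each fold is R-squared, the default for LinearRegression.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data (an assumption for illustration)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# 5 folds: each fold is the hold-out sample for one cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

# One fit index per hold-out sample, plus the average across folds
print(scores, scores.mean())
```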

Why is k-fold cross-validation preferred to just splitting the original data into training and testing data, a train/test split?

Instead of just one arbitrary, usually randomly selected hold-out sample, there are k holdout samples. Any one train/test split may result, by chance, in a weird test sample or training sample. With k different such training/test splits the average perfor

Identify and briefly explain the two types of supervised machine learning regarding the nature of the target variable

Supervised machine learning trains a model to predict a target. The two types of supervised machine learning either forecast a continuous variable, such as linear regression, or forecast a classification into a category, such as logistic regression.

Why is binary prediction the process of classification?

Binary prediction is classification into only one of two categories. Distinguish classification into a category from measurement of a quantitative variable. For example, classify someone as Male or Female body type, but measure their height.

In machine learning the variable to be forecasted or predicted is called the target or the label. When is the term label most appropriate?

Label is most appropriate when forecasting the level or category of a categorical variable. The level is described by a label in the usual English definition of the word.

When predicting a binary outcome, what are the two ways to be correct?

There are two groups. The two ways to be correct are to correctly classify a sample into its correct group, one called the positive group and the other the negative group, so a true positive or a true negative.

When predicting a binary outcome, what are the two ways to be wrong?

A false negative occurs when the model predicts that a sample in the positive group is not in the group. A false positive occurs when the model predicts that a sample is in the positive group when it is not.

What is the accuracy of a binary prediction? When is it not of the most interest?

Accuracy is the percent of correct classifications, the number of true positives plus true negatives divided by the total number of samples, including the false classifications.

What is the purpose of the sensitivity (recall) metric?

Sensitivity assesses how many samples in the positive group are correctly classified as positive. It is applicable to situations where the concern is of missing something that exists, such as cancer in a medical diagnosis, a terrorist as a passenger on an

What is the purpose of the precision metric?

The purpose of precision is to determine how many of the samples classified into the positive group actually belong there, that is, how few samples in the negative group are incorrectly classified into the positive group. How many airline passengers were incorrectly identified initially as terrorists? How many patients were told they have canc

What metric balances sensitivity and precision? How does it accomplish the balance?

The value of the F1 metric lies between the sensitivity and precision values, as their harmonic mean.
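
The classification metrics can be worked through from the four counts of a confusion matrix. The counts below are invented for illustration.

```python
# Hypothetical confusion matrix counts, invented for illustration only
tp, fp, fn, tn = 40, 10, 20, 30

# Proportion of all samples classified correctly
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Sensitivity (recall): proportion of actual positives classified as positive
recall = tp / (tp + fn)

# Precision: proportion of predicted positives that are actually positive
precision = tp / (tp + fp)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, recall, precision, f1)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot obtain a high F1 by doing well on only one of sensitivity and precision.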

Why is the logit transformation of best fit more appropriate for binary classification than a straight line of best fit?

A straight line cannot effectively summarize a scatter plot of a target variable that only has two values. Instead of a cloud of points, there are two lines of points across the values of the x-variable. Instead of a straight line, an S-shaped curve of th

What is an iterative solution for model coefficients?

When a direct algebraic solution is not possible, then the method to compute the estimated parameter values relies upon iteration, the method of gradient descent. Start with a somewhat if not completely random guess as to the parameter values. Then, using

When in the sklearn Python machine learning environment, how similar is the code for doing k-fold validation for least-squares regression vs. logistic regression? What is the distinction?

This is a huge strength of the sklearn machine learning analysis environment. Simple code changes, such as instantiating another estimation module, can invoke an entirely different estimation algorithm. The analyst can easily test multiple algorithms and

When breaking data into training/test subsets, when forecasting a categorical variable why do we want the same proportion of people in each group in each subset as in the full data set?

To evaluate how good a model is, we need to compare to forecasting without the model. That forecast is from what is called the null model, which, for logistic regression, is the forecast to the group with the most members. If the proportion of members in

What is meant by the term homogeneous group in the context of classification?

A homogeneous group consists of samples that are in the same group, such as Male or Female body type.

What is one statistic for assessing homogeneity? What is its range and how is it interpreted?

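The Gini coefficient is a primary statistic with which to evaluate homogeneity of classification. A value of 0 indicates a perfectly homogeneous group, and the value rises as the groups become more mixed. In the impurity form used for decision tree splits, the maximum for two groups is 0.5, reached when the groups are equally represented, so that no benefit is obtained from the classification system.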
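
A sketch of the calculation, assuming the Gini impurity form commonly used for decision tree splits: one minus the sum of the squared group proportions. Its scaling may differ from other versions of the Gini coefficient. The function name gini_impurity is a hypothetical name chosen for this example.

```python
def gini_impurity(counts):
    # One minus the sum of squared group proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical group counts within a node, invented for illustration only
pure = gini_impurity([20, 0])     # all samples in one group
mixed = gini_impurity([10, 10])   # two groups equally represented
mostly = gini_impurity([18, 2])   # mostly one group

print(pure, mostly, mixed)
```

Under this form, the more homogeneous the group, the closer the value is to 0.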

What is the root node of a decision tree? What does it represent?

The root node is the beginning node, before any classification takes place. Its membership is the number of samples in each group as the analysis begins, such as the number of Men and Women in the analysis.

What is a leaf from a decision tree? What does it represent?

A leaf is a node at the bottom of the decision tree, a final classification in which no more splits

Consider the following scatterplot as related to forecasting body type according to Gender. To forecast Gender, would a decision tree algorithm choose the Waist or Shoe feature to make the first split? Why? About where would the split occur (the decision

The algorithm would choose the Shoe size feature because a split at about 8½ does a good, though not perfect, job of separating Male and Female body types. There is no split on the Waist feature that attains any decent accuracy of differentiation.

How is it that given a decision tree with enough levels of depth, the model can recover the correct class value (e.g., Gender) with perfect accuracy?

Increasing the complexity of the model ensures better fit on the training data. With a decision tree analysis, the analyst can add enough depth (splits) to the model to correctly classify everything. Of course, such classification will not generalize beyo

When conducting a machine learning analysis, how can the analyst detect overfitting?

The fit indices will look great on the training data, and much worse on the testing data, the indicator of real-world performance.

What is the distinction between a model parameter and a hyper-parameter? Give an example of each

A model parameter is a characteristic of a specific model estimated from the training data, such as the slope coefficients of a regression model. A hyper-parameter is a characteristic of any one model that is set by the analyst, but could vary across mode

How is hyper-parameter tuning related to fishing? When is OK to do so?

Hyper-parameter tuning is searching for the best parameter setting, such as the number of features in a model, without any real theoretical reason for choosing the value. Instead, use modern computers to grind away at a large range of possibilities, choos

A machine learning analyst investigates fit for a decision tree model with 2, 3 and 4 features at depths of 2 and 6, with a 3-fold cross-validation.
a. How many distinct models are analyzed?
b. How many analyses are performed?
c. How many hyper-paramete

Three settings for the number of features and two depths lead to the analysis of 3x2=6 models.
The 3-fold cross-validation subjects each model to three analyses, so 6x3=18 analyses.
Two hyper-parameters are investigated, tree depth and number of features.
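
The counts above can be verified by enumerating the combinations. The variable names are chosen for this example.

```python
from itertools import product

# Values of the two hyper-parameters under investigation
n_features = [2, 3, 4]
depths = [2, 6]
k_folds = 3

# Each combination of hyper-parameter values defines one distinct model
models = list(product(n_features, depths))
n_models = len(models)

# Each model is fit once per fold of the cross-validation
n_analyses = n_models * k_folds

print(n_models, n_analyses)
```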

What is the relation of a random forest estimator/model to a decision tree?

The random forest is an evaluation of many different decision trees. The algorithm constructs a series of decision trees where each tree is based on a different random sample of (a) the data, with replacement, and (b) the available features, with the fina

What is local optimization (such as regarding the decision tree solution)? What is its primary disadvantage?

Given the initial model configuration, such as the first split in a decision tree analysis, a different final model likely emerges than if another first split had been taken on a different variable. The problem is that the splitting process to move down

When classifying customers to identify those most likely to churn (exit as a customer), which of the three classification metrics is the most useful: accuracy, recall, or precision? Why?

The analysis of customer churn is primarily concerned with not losing existing customers. Unless the amount of resources dedicated to those customers predicted as likely to leave is not excessive, better to avoid the false positives. That is, OK to have s

What makes machine learning a 2nd-decade 21st century technology, as opposed to, say, the 1990's?

Computer power. Having massively more computer power allows more intensive analyses of algorithms that existed for decades, such as applying hyper-parameter tuning to multiple regression. Further, new estimation algorithms have been developed, such as ran

K-means cluster analysis minimizes the cluster inertia for each cluster. What is cluster inertia?

• Cluster inertia is the sum of the squared distances of each point from its assigned centroid.
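
A sketch of the calculation, assuming squared Euclidean distances, which is how sklearn's KMeans reports its inertia_ attribute. The points, cluster labels, and centroids are invented for illustration.

```python
import numpy as np

# Hypothetical points, cluster assignments and centroids
points = np.array([[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 2.0], [9.0, 8.0]])

# Squared distance of each point from its assigned centroid
sq_dist = np.sum((points - centroids[labels]) ** 2, axis=1)

# Inertia: the sum of those squared distances
inertia = sq_dist.sum()
print(sq_dist, inertia)
```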

A cluster should demonstrate cohesion and separation. What are these concepts?
What fit index simultaneously assesses these two concepts? How is it interpreted?

Cohesion describes how tight a cluster is, i.e., how closely the members of a cluster are together. Separation describes how distinct the clusters are from each other.
The silhouette metric assesses both concepts. Silhouette varies between 1 and -1, and we wan

Why is the Pythagorean theorem so important to cluster analysis?

• The Pythagorean theorem allows us to calculate the distance between two points of a cluster.

What is a cluster centroid and how is it computed?

A cluster centroid is the center of the cluster that is calculated as the mean of the corresponding feature values of each sample in the cluster for each feature
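
As a small sketch, with three invented samples measured on two features, the centroid is the column-by-column mean.

```python
import numpy as np

# Hypothetical cluster: 3 samples (rows) on 2 features (columns)
cluster = np.array([[2.0, 10.0], [4.0, 14.0], [6.0, 12.0]])

# Mean of each feature across the samples in the cluster
centroid = cluster.mean(axis=0)
print(centroid)
```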

Why is it good practice to investigate multiple initial configurations of centroids when pursuing a K-means cluster analysis?

It is good practice because different initial configurations can lead to a different cluster solution. By starting with different initial configurations, we can ultimately select that configuration which best balances cohesion with separation.

a. What is the distinction between supervised learning and unsupervised learning?
b. How can unsupervised learning be a preliminary step to supervised learning?

• In supervised machine learning we use a set of x variables to predict a y variable. In unsupervised machine learning, we use only x variables to identify patterns within the data.
• Through unsupervised learning, we can identify relationships wi

Express in words the calculation of Euclidean distance between two points calculated over p features. [Write the formula if you wish, but do describe verbally.]

• To calculate the Euclidean distance, we take the square root of the sum of the squared difference between the two points on the first feature, plus the squared difference on the second feature, up to the squared difference on the pth feature.
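
The verbal description translates directly into code. The function name euclidean and the two points are chosen for this example.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared differences across all p features
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two hypothetical points measured on p = 3 features
d = euclidean([1, 2, 3], [4, 6, 3])
print(d)  # sqrt(9 + 16 + 0) = 5.0
```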

Why is it important to standardize (or otherwise normalize) the data before pursuing a K-Means cluster analysis?

• We need to standardize the data before doing a K-Means cluster analysis because the distance between points is calculated across all the features, so the features need to be on the same scale. Otherwise, a feature measured in larger units dominates the distance calculation.

Describe the statistical procedure to assess the best number of clusters for a given data set.

• To find the best number of clusters, select a value that provides both a low inertia and a high silhouette.

What is the distinction between a parameter of a model, and a hyper-parameter? Give an example of each.

• A parameter of a model is a value that is estimated from the data, such as the slope coefficient. A hyper-parameter is a characteristic of the model that is set by the analyst, such as the number of clusters.