1.1.1. Types of statistical learning problems
Supervised Learning: Predicting Numerical Values
Supervised learning is about solving problems where we have a target variable, denoted as Y, and a set of features or covariates, often represented as X, which are multidimensional. Our goal is to build a predictive model or a “black box” that can ingest these features and produce predictions for Y.
For example, in Project 1, students are provided with real estate transaction data in Ames, Iowa; the objective is to predict the sale price of a house based on its features. In Project 2, students are asked to build predictive models for Walmart stores, forecasting sales at a department level based on historical sales information.
In the two examples (Projects 1 & 2) mentioned above, the target variable (Y) takes numerical values. We call these regression problems in supervised learning.
Supervised Learning: Classifying Categorical Data
Our second category within supervised learning deals with classification problems, where Y takes on categorical values, such as positive or negative, or even multiple categories.
Take Project 3 as an example, where students need to determine whether movie reviews, represented as text, are positive or negative. In a project assigned in prior semesters, students were asked to assess the likelihood that a borrower will miss a payment next month based on various borrower and loan characteristics.
In these scenarios, our goal remains unchanged: to craft predictive models, but what distinguishes these problems from the previous category is that the target variable (Y) is categorical. Instead of numerical values, it has categories such as “positive” or “negative.” These tasks fall under the classification problem within supervised learning.
Summary of Statistical Learning Problem Types
In recap, we’ve now classified our statistical learning problems into two main categories: supervised learning and unsupervised learning.
- Supervised learning aims to predict a target variable (Y) using features (X).
If the target variable is numerical, it’s a regression problem, and
if it’s categorical, it’s a classification problem.
Unsupervised learning involves analyzing data without a specific target prediction. Instead, we aim to uncover hidden structures, like clusters or associations, within the data.
Beyond the Basics
These categorizations serve as guides, and real-world problems can often blur these boundaries. For example, we may encounter semi-supervised learning, where we blend elements of both supervised and unsupervised learning. Recall that in supervised learning we aim to build a predictor model that requires labeled data (Y) for training. However, obtaining these labels can be expensive and time-consuming, such as manually rating movie reviews as positive or negative.
On the flip side, there’s a wealth of unlabeled data available on the Internet, like untagged movie reviews. Unsupervised learning operates in this realm, focusing on data (X) without predefined labels (Y).
Semi-supervised learning steps in to combine these two scenarios. It’s all about crafting algorithms that leverage both labeled and unlabeled data to create highly accurate predictor models. In essence, it’s a cost-effective way to achieve accurate predictions when labeled data is scarce or costly to obtain.
Another interesting area within machine learning, which we’ll explore in class toward the end of the semester, is recommender systems. Consider, for instance, that Walmart, beyond forecasting store sales, may also seek to create a system that recommends products to individual customers. In tackling challenges of this nature, it’s common to draw upon a blend of techniques from both supervised and unsupervised learning, sometimes even combining them for more robust solutions.
In practical scenarios, real-world problems often require a blend of both supervised and unsupervised learning approaches for effective solutions. However, it’s important to note that in this course, our primary focus is on supervised learning. While we will dedicate two to three weeks to unsupervised learning, the majority of our time will be spent delving into the realm of supervised learning.