1.1.1. Types of statistical learning problems

Supervised Learning: Predicting Numerical Values

Supervised learning is about solving problems where we have a target variable, denoted as Y, and a set of features or covariates, often represented as X, which are multidimensional. Our goal is to build a predictive model or a “black box” that can ingest these features and produce predictions for Y.

For example, in Project 1, students are provided with real estate transaction data in Ames, Iowa; the objective is to predict the sale price of a house based on its features. In Project 2, students are asked to build predictive models for Walmart stores, forecasting sales at a department level based on historical sales information.

In the two examples (Projects 1 & 2) mentioned above, the target variable (Y) takes numerical values. We call these regression problems in supervised learning.

Supervised Learning: Classifying Categorical Data

Our second category within supervised learning deals with classification problems, where Y takes on categorical values, such as positive or negative, or even multiple categories.

Take Project 3 as an example, where students need to determine whether movie reviews, represented as text, are positive or negative. In a project assigned in prior semesters, students were asked to assess the likelihood that a borrower will miss a payment next month based on various borrower and loan characteristics.

In these scenarios, our goal remains unchanged: to craft predictive models, but what distinguishes these problems from the previous category is that the target variable (Y) is categorical. Instead of numerical values, it has categories such as “positive” or “negative.” These tasks fall under the classification problem within supervised learning.

Unsupervised Learning: Discovering Hidden Patterns

Here is the third type of problems. For example, based on the housing data provided in Project 1, can we identify any home buying or selling trends? Where are the highly sought neighborhoods for first home buyers? What is the average price for houses purchased by the first home buyer? Further, can we identify any distinctive groups of buyers, so the real estate companies and the mortgage companies can design their service and financial products to meet their different needs? In this instance, we’re not aiming to predict a specific target variable (Y); rather, our goal is to extract patterns and structures from the data (X).

A similar problem could arise in the data analysis problem for Walmart data. For example, based on the transaction data at Walmart, can we recommend any marketing strategies based on associations between purchased items. This might involve using techniques like association rules or market segmentation through clustering algorithms. (For association rule, a classical algorithm designed to identify items that are often purchased together by consumers, you can read Chapter 14.2 of Elements of Statistical Learning.) Again, the goal is to identify patterns within the data (X) to inform decision-making, without necessarily predicting a specific target variable (Y).

This type of challenge aligns with unsupervised learning – where the focus is on uncovering hidden patterns, structures, and groupings within the data, rather than predicting a designated target.

Summary of Statistical Learning Problem Types

In recap, we’ve now classified our statistical learning problems into two main categories: supervised learning and unsupervised learning.

  • Supervised learning aims to predict a target variable (Y) using features (X).
    • If the target variable is numerical, it’s a regression problem, and

    • if it’s categorical, it’s a classification problem.

  • Unsupervised learning involves analyzing data without a specific target prediction. Instead, we aim to uncover hidden structures, like clusters or associations, within the data.

Beyond the Basics

These categorizations serve as guides, and real-world problems can often blur these boundaries. For example, we may encounter semi-supervised learning, where we blend elements of both supervised and unsupervised learning. Recall that in supervised learning we aim to build a predictor model that requires labeled data (Y) for training. However, obtaining these labels can be expensive and time-consuming, such as manually rating movie reviews as positive or negative.

On the flip side, there’s a wealth of unlabeled data available on the Internet, like untagged movie reviews. Unsupervised learning operates in this realm, focusing on data (X) without predefined labels (Y).

Semi-supervised learning steps in to combine these two scenarios. It’s all about crafting algorithms that leverage both labeled and unlabeled data to create highly accurate predictor models. In essence, it’s a cost-effective way to achieve accurate predictions when labeled data is scarce or costly to obtain.

Another interesting area within machine learning, which we’ll explore in class toward the end of the semester, is recommender systems. Consider, for instance, that Walmart, beyond forecasting store sales, may also seek to create a system that recommends products to individual customers. In tackling challenges of this nature, it’s common to draw upon a blend of techniques from both supervised and unsupervised learning, sometimes even combining them for more robust solutions.

In practical scenarios, real-world problems often require a blend of both supervised and unsupervised learning approaches for effective solutions. However, it’s important to note that in this course, our primary focus is on supervised learning. While we will dedicate two to three weeks to unsupervised learning, the majority of our time will be spent delving into the realm of supervised learning.