Distinguishing Between Supervised and Unsupervised Learning

Objective

After completing this lesson, you will be able to distinguish between supervised and unsupervised learning approaches

Key Differences Between Supervised and Unsupervised Learning

Supervised learning uses labeled training datasets, whereas unsupervised learning does not have a target variable. Unsupervised learning algorithms attempt to learn the data's intrinsic structure with no human supervision or instruction. Under unsupervised learning, unlabeled input data is provided to the algorithm for finding patterns within the dataset.

Supervised and unsupervised learning are the major approaches in machine learning. Supervised machine learning is appropriate for classification and regression tasks, such as weather forecasting, predicting stock prices, fraud detection, customer churn prediction, among others. Unsupervised learning on the other hand learns from data without human supervision and is frequently used for exploratory data analysis and clustering activities, such as anomaly detection, big data visualization.

Supervised learning modeling focuses on learning the relationships between input and output data items. For instance, classification attempts to predict the categorical values of the target variable given the input data, whereas regression attempts to understand the relationship between independent variable(s) and the dependent variable.

In contrast, for unsupervised learning, there are no labels but only data attributes, i.e., features, describing such block groups. Unsupervised learning supports uncovering patterns in datasets. For instance, activities such as clustering split data into collections of similar entities.

In summary, the key differences between supervised and unsupervised learning lay on the way the models are trained and the type of training datasets fed to the algorithms.

Supervised Learning Training Datasets

Supervised learning algorithms are trained on labeled dataset, that is, the dataset has a 'Target Variable'.

In a labeled dataset, each instance in the dataset is accompanied by a label or 'Target Variable' that denotes the correct output. This label is used by supervised learning algorithms to learn a function that maps inputs to desired outputs.

Supervised learning algorithms are trained on dataset, where both the input features and the correct outputs (labels) are provided. The model learns to predict the output from the input data. Unsupervised learning algorithms, on the other hand, work with unlabeled data and attempt to identify inherent patterns or groupings without prior knowledge of the outcomes. The duration and complexity of the training process for both types of learning can vary depending on factors such as the size of the dataset, the available computational resources, among others.

The California housing dataset, available on the California open data portal, (https://data.ca.gov/dataset) is used as reference in the below paragraphs.

The table below illustrates the sample input dataset for a supervised learning task, which aims to train a model to predict the 'Median house value' for California districts. In other words, the 'Target Variable' is 'Median house value'.

The columns are also known as 'Features'. Such features are data attributes that can help predict the 'Target Variable'. Intuitively, 'Median house age', 'Average number of rooms per household', and 'Average number of bedrooms per household' are characteristics of a house that influence its value. Lastly, 'Features' and 'Labels' are the input data for supervised model training.

An image showing a labeled dataset for a supervised learning task, including various features such as 'Population', 'Median income', 'Housing median age', and 'Total rooms', with the aim of training a model to predict the 'Median house value' for districts in California.

This dataset was derived from the 1990 U.S. census, and every row represents a census block group. A block group is the smallest geographical unit for which the U.S. census bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

A table illustrating the example of 'Target Variable' of the California housing dataset

For example, the values '4.526' and '3.585' are labels for the continuous 'Target Variable', meaning the variable can take any value between its minimum and maximum value.

Every row is an observation or example that the model will learn from.

Supervised Learning Approach - Model Prediction

After model training completes, new input data can be provided to the model.

In this scenario, a block group is used as input.

Recall that a block group is the smallest geographical unit for which the U.S. census bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

The features are given as inputs in the table, and the model delivers a prediction: the 'Median house value'.

The image depicts a diagram where the features (such as 'Population', 'Median income', 'Housing median age', and 'Total rooms') are provided as inputs to a model, that calculates as final result, the 'Median house value'.
Unsupervised Learning Training Datasets

For unsupervised learning, the dataset doesn't have labels but does have data attributes, i.e., 'Features', describing such block groups. Unsupervised learning supports uncovering patterns in datasets. For instance, activities such as clustering split data into collections of similar entities.

Unsupervised learning can be applied to identify the different varieties of housing markets.

The table below shows a census block and it is used by a clustering algorithm to generate groupings of regions based on the similarity of census block groups. For example, one category can be a housing market with high 'Median income' and an average number of rooms of at least 5.

The California Housing Dataset - Census Block

Median incomeMedian house ageAverage number of roomsAverage number of bedroomsPopulationAverage number of household membersLatitudeLongitudeMedian house value
15.00132.08.8450411.0351241318.02.72314037.44-122.225.00001
15.00143.05.6875000.75000058.03.62500037.46-121.875.00001
15.00152.07.9944751.027624483.02.66850837.79-122.445.00001
15.00117.08.5203761.0219441011.03.16927932.99-117.235.00001
15.00127.07.6519230.9807691351.02.59807734.10-118.405.00001
Unsupervised learning training datasets

It's important to note that these categories emerge through the unsupervised learning process, rather than being predetermined. The insights gained from clustering results can be valuable for home buyers and property investors.

Unsupervised Learning Approach - Model Prediction

After the model training is complete, it can be used to make predictions. Thus, by using the trained model, you can predict the best fit for the given input block group as shown below.

Predicting the 'Housing Region' based on the model built.

Unsupervised Learning Approach

Unsupervised learning, on the other hand, can always be relied upon to uncover patterns in unlabeled datasets. Data labeling can be a labor-intensive task, which might not always be possible. Additionally, you may not even know the meaningful categories or groups into which datasets can be split.

For example, discovering meaningful clusters of housing regions in California can help home buyers and property investors better understand the Californian housing market.

By identifying these patterns, unsupervised learning provides valuable insights that would otherwise be difficult to obtain.

An icon representing datasets when they are not labeled. An icon representing datasets when they are labeled.

Log in to track your progress & complete quizzes