Artificial Intelligence with Machine Learning in Java: Entropy and ID3 Algorithm

A summary of the AiML Oracle course: Part 3

--

In this post I’ll summarize the main topics of chapter 4: Entropy and ID3 Algorithm, which is the last chapter in the course.

Part 1 can be found here: Intro to AI & ML
Part 2 can be found here: Trees & Recursion

Things You’ll Learn:

※ What a decision tree algorithm is.
※ The 6 most popular decision tree algorithms.
※ Entropy & Information Gain.
※ How to solve a problem with the ID3 algorithm.

Table of Contents

● Decision Tree Algorithms
● Entropy
● ID3 Worked Example

Note: The content and the structure of the post are based on my understanding of the concepts provided by Oracle Academy AiML course.

Decision Tree Algorithms

What is a Decision Tree?
It belongs to the family of supervised learning algorithms and can be used for solving both regression and classification problems.

A decision tree algorithm constructs a model based on answers from the data. A model is the output of the algorithm; you can think of it as a function, or, the way I imagine it, as a big chain of if statements: if a condition is met, then output A.

A model constructed using a decision tree algorithm for the Titanic data might look like the image below. Instead of asking random questions, the algorithm structures the tree so that the most relevant question is asked first.

Titanic decision tree
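To make the "big if statements" idea concrete, here is a rough Java sketch of what a learned model boils down to. The attributes (sex, age) echo the Titanic example, but the rules and the age threshold are purely illustrative, not the real learned tree.

// A learned decision-tree model is, in effect, a chain of nested if statements.
// The rules and the age threshold below are illustrative only.
public class TitanicModelSketch {

    static String predict(String sex, int age) {
        if (sex.equals("female")) {        // the most relevant question is asked first
            return "survived";
        } else if (age < 10) {             // a less relevant question further down the tree
            return "survived";
        } else {
            return "did not survive";
        }
    }

    public static void main(String[] args) {
        System.out.println(predict("female", 30));  // survived
        System.out.println(predict("male", 40));    // did not survive
    }
}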

Most popular decision tree algorithms:
▢ CART ( Classification And Regression Tree ).
▣ ID3 ( Iterative Dichotomiser 3 ).
▢ C4.5 ( Successor of ID3 ).
▢ CHAID ( Chi-squared Automatic Interaction Detection ).
▢ Decision Stump.
▢ Conditional Decision Tree.

▶ ID3 is what the course focuses on.

Entropy

A decision tree algorithm will structure the tree such that the most relevant question is asked first.

How do we know which question is the most relevant (least random)? How did we know that asking whether the person is a woman was less random than asking about their age first? This is where entropy comes into play.

What is variance?

“The average of the squared difference from the mean”
In other words: it measures how far the values in the dataset are spread out.

Low variance: We are more confident about the values in the dataset.
High variance: We don’t know much about the dataset (data is random).
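As a small illustration (my own sketch, not course code), the quoted definition translates almost directly into Java:

// Variance: the average of the squared differences from the mean.
public class VarianceSketch {

    static double variance(double[] values) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;

        double sumSquaredDiff = 0;
        for (double v : values) sumSquaredDiff += (v - mean) * (v - mean);
        return sumSquaredDiff / values.length;      // population variance
    }

    public static void main(String[] args) {
        System.out.println(variance(new double[] {5, 5, 5, 5}));    // 0.0  -> low variance, very predictable
        System.out.println(variance(new double[] {1, 9, 2, 12}));   // 21.5 -> high variance, spread out
    }
}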

What is Entropy (Shannon Entropy)?
It measures the randomness in an event; a high level of randomness means a high level of uncertainty. Information entropy lets us quantify the best split in our decision tree by looking at how mixed up the data is, much like variance measures how spread out it is.

The course works with log base 2
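Written out, the standard Shannon formula the examples below rely on is: Entropy(X) = − Σ p(x) · log₂ p(x), where the sum runs over every possible outcome x of the event.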

Fair Coin Toss:
Flipping a fair coin has 2 outcomes (heads or tails); the probability of each is ½.

Coin toss example

Equal probabilities give the highest entropy: the outcome is as random as it can possibly be. The higher the entropy, the higher the uncertainty about the outcome.

Biased Coin Toss:
In this example the coin is biased. What the graph below tells us is that the entropy is 0 when the probability is either 1 or 0, and the entropy is at its maximum (1) when the probability is ½, because in that case the event is completely random and there is little chance of perfectly predicting the outcome.

Shannon entropy for a biased coin
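Here is a small Java sketch (my own illustration, not course code) of the binary entropy behind both coin examples; it reproduces the values the graph describes: 1 for a fair coin, 0 when the outcome is certain.

// Shannon entropy (base 2) of a coin toss with probability p of heads.
public class CoinEntropySketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // -p*log2(p) - (1-p)*log2(1-p), treating 0*log2(0) as 0.
    static double entropy(double p) {
        double h = 0;
        if (p > 0) h -= p * log2(p);
        if (p < 1) h -= (1 - p) * log2(1 - p);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(0.5));  // 1.0   -> fair coin, maximum uncertainty
        System.out.println(entropy(0.9));  // ~0.47 -> biased coin, much more predictable
        System.out.println(entropy(1.0));  // 0.0   -> outcome is certain, no randomness
    }
}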

ID3 Worked Example

What is information gain?
It measures how much the entropy of a dataset S is reduced when S is split on attribute A. If the split still leaves high entropy (high uncertainty), the information gain is low, and vice versa.
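In formula form: Gain(S, A) = Entropy(S) − Σ ( |Sv| / |S| ) · Entropy(Sv), where the sum runs over every value v of attribute A and Sv is the subset of S for which A has value v.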

ID3 is an algorithm used to generate a decision tree from a dataset. We will solve the “Play outdoor sport” problem using ID3.

Searching for the root

We will use the entropy formula and information gain to search for a candidate column (attribute) in the table and pick it as the root.

Play(Table)

We have 14 data samples, 9 of which are YES and 5 NO; the entropy of the complete dataset is 0.94.
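Plugging those counts into the entropy formula: Entropy(S) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.94.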

We will calculate the entropy and information gain of each attribute on the entire dataset.

Outlook(Table)

The Outlook attribute has 3 categorical values ( Sunny, Overcast, Rain ).
We do the same as above, but for each categorical value. Then we calculate the information gain: how much the entropy of the entire table is reduced when splitting on Outlook, which comes out to 0.246.
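Spelled out with the per-branch counts used later in the post (Sunny: 2 YES / 3 NO, Overcast: 4 YES / 0 NO, Rain: 3 YES / 2 NO): Gain(S, Outlook) = 0.94 − (5/14)·0.971 − (4/14)·0 − (5/14)·0.971 ≈ 0.246.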

We continue in the same manner for the rest of the attributes.

Temperature(Table)

Humidity(Table)

Windy(Table)
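Before picking the split, here is a small Java sketch (my own, not course code) of how these entropy and gain numbers can be computed from YES/NO counts; the dataset totals and the Outlook branch counts from above are used as the running example.

import java.util.List;

// Entropy and information gain from YES/NO counts.
public class InformationGainSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Entropy of a subset containing `yes` positive and `no` negative samples.
    static double entropy(int yes, int no) {
        int total = yes + no;
        double h = 0;
        if (yes > 0) h -= (yes / (double) total) * log2(yes / (double) total);
        if (no > 0)  h -= (no  / (double) total) * log2(no  / (double) total);
        return h;
    }

    // Gain = entropy of the whole set minus the weighted entropy of each branch.
    static double gain(int yesTotal, int noTotal, List<int[]> branches) {
        int total = yesTotal + noTotal;
        double weighted = 0;
        for (int[] b : branches) {                         // b[0] = YES count, b[1] = NO count
            weighted += ((b[0] + b[1]) / (double) total) * entropy(b[0], b[1]);
        }
        return entropy(yesTotal, noTotal) - weighted;
    }

    public static void main(String[] args) {
        // Whole dataset: 9 YES, 5 NO -> entropy ~0.94
        System.out.println(entropy(9, 5));
        // Outlook split: Sunny 2/3, Overcast 4/0, Rain 3/2 -> gain ~0.246
        System.out.println(gain(9, 5, List.of(
                new int[] {2, 3}, new int[] {4, 0}, new int[] {3, 2})));
    }
}

The same two helpers can be reused for every attribute and every branch in the rest of the worked example.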

Pick a Split

From the image below, the Outlook attribute has the highest information gain, which makes it the best candidate for the root of the tree.

This is what our tree currently looks like:

The Outlook attribute is the best candidate for the root; if it is Sunny we will branch to a different path than that of Overcast or Rain.

Now we search for the next best split in each of the paths ( Sunny, Overcast, Rain ).

Sunny

Play(Sunny)

Calculate the entropy only on the data samples that have Sunny as their Outlook value. There are 5 such data samples, 2 of which are YES and 3 NO.
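With 2 YES and 3 NO, this gives Entropy(Sunny) = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) ≈ 0.971.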

We do the same as before, except we only look at the data samples whose Outlook is Sunny.

Temperature(Sunny)

The Temperature attribute has 3 categorical values ( Hot, Mild, Cold ).
The Hot and Cold values have an entropy of 0, which means they are perfectly predictable: all Sunny samples where the temperature is Hot have Play Tennis as NO, and all Sunny samples where the temperature is Cold have Play Tennis as YES. However, the information gain is low compared to the Humidity attribute.

Humidity(Sunny)

The Humidity attribute has the highest information gain: all of its categorical values are perfectly predictable when the outlook is Sunny, so the gain equals the full entropy of the Sunny subset.
If it is Sunny and the humidity is High, the target attribute (Play Tennis) is NO. On the other hand, if it is Sunny and the humidity is Normal, the choice for playing tennis is YES.

Windy(Sunny)

Pick a Split

According to the image below, the Humidity attribute has the highest information gain.

This is what our tree looks like after the new split:

Now we search for the next best split along the Overcast path.

Overcast

Looking back at the table again, we can see that the target attribute for Overcast is perfectly predictable: regardless of the temperature, humidity, and wind, it is always YES.

This is what our tree looks like after the new split:

We are almost done; we only need to search for the best split along the Rain path.

Rain

Play(Rain)

We do the same thing we did previously, except we only care about the data samples whose Outlook is Rain. There are 5 such data samples, 3 of which are YES and 2 NO.
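With 3 YES and 2 NO, this gives Entropy(Rain) = −(3/5)·log₂(3/5) − (2/5)·log₂(2/5) ≈ 0.971, the same value as the Sunny subset.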

Temperature(Rain)

Humidity(Rain)

Windy(Rain)

All of Windy's categorical values are perfectly predictable when the outlook is Rain, and accordingly it has the highest information gain. If it is Rain and there is no wind, the target attribute (Play Tennis) is YES; otherwise it is NO.

Pick a Split

Final Tree split

We are finally done splitting our tree using the ID3 algorithm.

DONE (ノ◕ヮ◕)ノ*:・゚✧

--

Written by Nina Almaamary

(ノ◕ヮ◕)ノ*:・゚✧ Interested in A.I.
