Basics of Machine Learning - Classification
Narasimha Agaccha
The fourth incarnation of the god Vishnu is Narasimha, the lion-headed god. The pose the god takes in this statue is called Yoga Narasimha: a meditation belt runs from knee to knee, meant to keep the meditator in position.
The goddess Laxmi used to sit on Narasimha’s knee, but her figure was destroyed when the sultans of the south invaded Hampi in the 16th century. Her right arm is still visible in embrace.
Hovering over Narasimha’s head are 7 cobras.
A Childhood Game
Have you played the childhood game called 20 Questions? Someone has a “target” entity in mind (a person, a thing, or a literary character) and the others need to discover that entity by asking at most 20 questions.
- How does one create questions in the game?
- Categories?
- Numbers? How?
- Comparisons?
- What sort of answers can you expect for each question?
- If the “target” was the Narasimha avatara, what questions would you create?
20Q Game as a Play with Data…
Suppose the 20Q target is, say, a celebrity singer like Taylor Swift, or a cartoon character like Thomas the Tank Engine. What would the underlying “data structure” look like? To reach the target Taylor Swift, we might ask questions in the following order:
- Human? (Yes)
- Living? (Yes)
- Male? (No)
- Celebrity? (Yes)
- Music? (Yes)
- USA? (Yes)
Oh…Taylor Swift!!!
Let us try to construct the “datasets” underlying this game!
Name | Occupation | Sex | Living | Nationality | Genre | Pet |
---|---|---|---|---|---|---|
Taylor Swift | Singer | F | TRUE | USA | country/rock | Scottish Fold Cats |
Name | Type | Living | Human | Nationality | Colour | Material |
---|---|---|---|---|---|---|
Thomas the Tank Engine | Cartoon Character | FALSE | FALSE | UK | blue | metal |
It should be fairly clear that the questions we ask are based on the COLUMNS in the respective one-row datasets! The TARGET column in both cases is the Name column.
What is a Decision Tree?
Can you imagine how the 20 Questions Game can be shown as a tree?
Each question we ask, based on one of the FEATURE columns, begets a Yes/No answer, and we turn left or right accordingly. When we arrive at a leaf, we should be in a position to guess the answer!
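If it helps to see the same idea in code, here is a minimal Python sketch (an illustration only, not part of the game or the original text): each `if`/`else` is one question, and each `return` is a leaf of the tree.

```python
# A tiny hand-written "20 Questions" tree: every if/else is a question,
# every return is a leaf where we announce our guess.
def guess(entity):
    if entity["Human"]:                            # Question: Human?
        if entity["Occupation"] == "Singer":       # Question: Music/Singer?
            return "Taylor Swift"
        return "some other celebrity"
    else:
        if entity["Type"] == "Cartoon Character":  # Question: Cartoon?
            return "Thomas the Tank Engine"
        return "no idea!"

taylor = {"Human": True, "Occupation": "Singer"}
thomas = {"Human": False, "Type": "Cartoon Character"}
print(guess(taylor))   # -> Taylor Swift
print(guess(thomas))   # -> Thomas the Tank Engine
```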
Twenty times 20 Questions!!
What if the dataset we had contained many rows, instead of just one row? How would we play the same 20Q Game in this situation? Here is a sample of the famous penguins dataset:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Chinstrap | Dream | 50.6 | 19.4 | 193 | 3800 | male | 2007 |
Adelie | Torgersen | 38.7 | 19.0 | 195 | 3450 | female | 2007 |
Gentoo | Biscoe | 49.4 | 15.8 | 216 | 4925 | male | 2009 |
Gentoo | Biscoe | 46.7 | 15.3 | 219 | 5200 | male | 2007 |
Gentoo | Biscoe | 50.0 | 15.3 | 220 | 5550 | male | 2007 |
Chinstrap | Dream | 51.5 | 18.7 | 187 | 3250 | male | 2009 |
Gentoo | Biscoe | 45.5 | 14.5 | 212 | 4750 | female | 2009 |
Adelie | Dream | 42.2 | 18.5 | 180 | 3550 | female | 2007 |
Gentoo | Biscoe | 42.6 | 13.7 | 213 | 4950 | female | 2008 |
Adelie | Biscoe | 39.7 | 18.9 | 184 | 3550 | male | 2009 |
Adelie | Biscoe | 38.1 | 17.0 | 181 | 3175 | female | 2009 |
Adelie | Torgersen | 36.2 | 16.1 | 187 | 3550 | female | 2008 |
As before, we would need to look at the dataset as containing a TARGET column which we want to predict using several other FEATURE columns. Let us choose species as the target.
When we look at the FEATURE columns, we would need to formulate questions based on an entire column at a time. For instance:
- “Is the bill_length_mm greater than 45 mm?” considers the entire bill_length_mm FEATURE column.
- “Is the sex female?” considers the entire sex column.
- If the specific FEATURE column is a Numerical (N) variable, the question would use some “thresholding”, as shown in the first question above, to convert the Numerical variable into a Categorical one.
- If a specific FEATURE column is a Categorical (C) variable, the question would be like a filter operation in Excel.
Either way, each question leaves us with a smaller and smaller subset of rows of the dataset: the rows for which the answer is Yes. It is as if we were playing many 20 Questions games in parallel, since there are so many simultaneous “answers”!
Once we exhaust all the FEATURE columns, what remains is a subset of rows of the original dataset, and we read off the TARGET column, which should now contain a set of identical entries, e.g. “Adelie”. Thus we can extend a single-target 20Q game to a multiple-target one using a larger dataset. (Note how the multiple targets are all the same: “Adelie”, or “Gentoo”, or “Chinstrap”.)
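Here is a small, illustrative pandas sketch of the two kinds of questions — one threshold question on a numerical column, one filter question on a categorical column. (pandas is an assumption; the lesson itself does this with Orange. The CSV URL is the one used later in this lesson.)

```python
# Illustrative sketch: one "numerical" question and one "categorical" question,
# each of which shrinks the set of candidate rows.
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url).dropna()

# Numerical feature -> threshold question: "Is bill_length_mm greater than 45?"
subset = penguins[penguins["bill_length_mm"] > 45]

# Categorical feature -> filter question: "Is the sex female?"
subset = subset[subset["sex"] == "Female"]

# The TARGET column of the remaining rows is our set of "answers"
print(subset["species"].value_counts())
```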
This forms the basic intuition for a Machine Learning algorithm called a Decision Tree. Note that, we being human, our tree is drawn inverted: trunk on top and leaves below. Only God can make a tree.
What did we learn?
- The 20Q Game can be viewed as a “Decision Tree” of Questions and Answers,
- Each fork in the game is a Question.
- Depending upon whether the current answer is yes or no, we turn in one direction or the other. (Remember “binary choices” in our work on the Poisson Distribution! More shortly!)
- Each of our questions is based on the information available in one or other of the columns!!
- We arrive at a final “answer” or “target” after a particular sequence of yes/no answers. This is one of the leaf nodes in the Tree.
- The `island` and the `species` columns are categorical and are especially suited to being the targets for a 20 Questions Game.
- We can therefore use an entire column of data as our 20 Questions target, rather than just one entity or person.
- In doing so, we play the same 20Q Game many times over and obtain multiple answers from the target columns.
This is how we will use this Game as a Model for our first ML algorithm, classification using Decision Trees.
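Before we turn to Orange, here is a hedged scikit-learn sketch of the same idea, purely for reference — the lesson itself builds the tree visually in Orange, and the choice of features and depth below is just an illustration.

```python
# Sketch: a Decision Tree classifier on penguins with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url).dropna()

X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]          # the TARGET column

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Print the "questions" the tree has chosen to ask at each fork
print(export_text(tree, feature_names=list(X.columns)))
```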
Looking at Data in Orange
Let us visualize this Decision Tree in Orange. Look at the now famous `penguins` dataset, available here:
https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv
We see that there are three `species` of penguins, living on three islands. The measurements for each penguin are `flipper_length_mm`, `bill_length_mm`, `bill_depth_mm`, and `body_mass_g`.
Task 1: Create a few data visualizations for the variables, and for pairs of variables, from this dataset.
Task 2: Can you inspect the visualizations and imagine how this dataset can be used in a 20 Questions Game, to create a Decision Tree for this dataset as shown below?
Once you are comfortable with the interface and the way to generate graphs, then you can save this Orange Workflow file locally for reference.
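If you would rather attempt Task 1 in code, a minimal seaborn sketch could look like the one below. (seaborn and matplotlib are assumptions here, not part of the Orange workflow.)

```python
# Sketch: quick visualizations of the penguins variables with seaborn.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url).dropna()

# Pairwise scatter plots of the numerical measurements, coloured by species
sns.pairplot(penguins, hue="species",
             vars=["bill_length_mm", "bill_depth_mm",
                   "flipper_length_mm", "body_mass_g"])
plt.show()
```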
How do we Make Predictions using our Decision Tree in Orange?
Download this penguin tree file and open it in Orange.
Our aim is to make predictions. Predictions of what? When we are given new, unseen data in the same format, we should be able to predict the TARGET variable using the same FEATURE columns.
NOTE: The target is usually a class/category. (We CAN also predict a numerical value with a Decision Tree, but we will deal with that later.)
In order to make predictions with completely unseen data, we need to first check whether the algorithm works well with known data. The way to do this is to use a large portion of the data to design the tree, and then use the tree to predict some aspect of the remaining, but similar, data. Let us split the `penguins` dataset into two pieces: a `training set` to design our tree, and a `test set` to check how it is working.
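As a code-level companion to the Orange workflow, here is a minimal sketch of the split-then-predict idea using scikit-learn (an assumption; the lesson does this with Orange widgets).

```python
# Sketch: split penguins into a training set and a test set, fit a tree on the
# training part, then predict species on the held-out rows.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url).dropna()

X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)        # a 70:30 split

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
predictions = tree.predict(X_test)
print(predictions[:10])
```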
How good are the Predictions? What is the Classification Error Rate?
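One simple answer, continuing the scikit-learn sketch above (it reuses `y_test` and `predictions` from there): count the fraction of test-set rows the tree gets wrong.

```python
# Classification error = fraction of mismatches between prediction and truth.
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)   # y_test, predictions from the sketch above
error_rate = 1 - accuracy
print(f"accuracy = {accuracy:.3f}, classification error rate = {error_rate:.3f}")
```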
How Many Trees do we Need? Enter the Random Forest!
Check all your individual Decision Trees:
- Do they ask the same Questions?
- Do they fork pretty much in the same way?
Yes, they all seem to use the same set of parameters to reach the target. So they are capable of being “biased” and of making the same mistakes. So we ask: does it help to use more than one tree, if all the questions/forks in the Trees are similar?
No…we need different Trees that are able to ask different questions, based on different variables or features in the data. That will make the Trees as different as possible and so…unbiased. This is also what we saw when we played 20Q: offbeat questions opened up new avenues for predicting the answer/target.
A forest of such trees (🎞️ the Wild Wood) is called a Random Forest!
An Introduction to Random Forests
In the Random Forest method, we do as follows:
- Split the dataset into `training` and `test` subsets (a 70:30 proportion is very common). Keep aside the `test` dataset for final testing.
- Decide on the number of trees in the forest, say 100-500.
- Take the training dataset and repeatedly sample some of the rows in it. Rows can be repeated too; this is called `bootstrap sampling`.
- Give this sampled training set to each tree. Each tree develops a question from this dataset, in a random fashion, using a randomly chosen variable. E.g. with `penguins`, if our target is `species`, then some trees will use `island`, some others will use `body_mass_g`, and some others may use `bill_length_mm`.
- Each tree will “grow its questions” in a unique way!! Since the questions are possibly based on a different variable each time, the trees will grow in very different ways.
- Stop when the required accuracy has been achieved (the sets contain observations/rows predominantly from only one `species`).
- With the `test set`, let each tree vote on which `species` it has decided upon. Take the majority vote.
Phew!!
Let’s get a visual sense of how all this works:
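And as a code-level counterpart to the visual (scikit-learn assumed, not the Orange workflow), here is a minimal Random Forest sketch on the penguins data.

```python
# Sketch: the penguins task with a Random Forest.
# Each of the n_estimators trees sees a bootstrap sample of the training rows
# and a random subset of features at each split; predictions are a majority vote.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url).dropna()

X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```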
Are Random Forests an Example of Complexity?
Does the Random Forest somehow embody the “4A”s of Complexity (Agents, Actions, Again, Aggregate)? Can we try to interpret the algorithm as such? Here goes:
- Each little tree in the Random Forest is an independent Agent
- Each agent-tree has a vocabulary of Actions:
  - To randomly choose a subset of variables from the big dataset
  - To independently vote using their choice of variables
- And the “votes” are Aggregated to decide the final prediction
  - Which is greater (better, accurate with higher confidence) than that of each individual Agent-Tree
- 🎷 Do it Again when presented with fresh data
Not bad, is it, as a metaphor for Random Forests? Can 📃 More is Different be true here too?
Random Forest Classification in Orange
Let us see how to create a Random Forests ML model in Orange:
RF for Penguins Data
Let us develop a Random Forest Model from the ground up, using the penguins data downloaded earlier.
We will create a Random Forest Model for this dataset, and compare with the Decision Tree for the same dataset.
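For reference, a hedged scikit-learn sketch of that comparison (the actual exercise is done with Orange widgets) might look like this, using 5-fold cross-validation:

```python
# Sketch: comparing a single Decision Tree with a Random Forest on penguins.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url).dropna()
X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]

for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                    ("Random Forest", RandomForestClassifier(n_estimators=200, random_state=42))]:
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```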
RF for Heart Patients Data
Do you want to develop an ML model for heart patients? We have a dataset of heart patients from the University of California, Irvine (UCI) Machine Learning Repository:
Cleveland Heart Patient Data CSV.
We will download this data and use it with this Orange Workflow file. Random Forests Workflow File
What are the variables?
- (age): age in years
- (sex): 1 = male; 0 = female
- (cp): chest-pain type (4 types: 1/2/3/4)
- (trestbps): resting blood pressure (in mm Hg on admission to the hospital)
- (chol) : serum cholesterol in mg/dl
- (fbs): (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- (restecg): resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = LV hypertrophy)
- (thalach): maximum heart rate achieved
- (exang): exercise induced angina (1 = yes; 0 = no) (remember Puneet Rajkumar)
- (oldpeak): ST depression induced by exercise relative to rest
- (slope): the slope of the peak exercise ST segment
- Value 1: upsloping
- Value 2: flat
- Value 3: downsloping
- (ca): number of major vessels (0-3) colored by fluoroscopy
- (thal): 3 = normal; 6 = fixed defect; 7 = reversible defect
- (num) : the target attribute, diagnosis of heart disease (angiographic disease status)
- Value 0: < 50% diameter narrowing
- Value 1: > 50% diameter narrowing
(in any major vessel: attributes 59 through 68 are vessels)
We will create a Random Forest Model for this dataset, and compare with the Decision Tree for the same dataset.
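For those who prefer code, here is an illustrative scikit-learn sketch for this dataset. The filename is a placeholder (download the CSV linked above and point to your local copy), and it assumes the file carries the column names listed above, with `num` as the target; collapsing `num` to disease/no-disease is a common simplification, not something the workflow above prescribes.

```python
# Sketch: a Random Forest on the Cleveland heart data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

heart = pd.read_csv("cleveland_heart.csv")            # hypothetical local filename
# Some copies of this dataset mark missing values with "?"; coerce and drop them.
heart = heart.apply(pd.to_numeric, errors="coerce").dropna()

X = heart.drop(columns=["num"])                       # all FEATURE columns
y = (heart["num"] > 0).astype(int)                    # TARGET: any disease vs none

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
forest = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```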
How good is my Random Forest?
There are a good few metrics to state the performance of our Random Forest. We should know these:
`Classification Error`: How many mismatches? Simple enough.
`Gini Impurity`: Each group may end up containing observations from more than one class. The Gini Impurity index measures the total variance across the \(K\) classes within a group. If there are \(M\) groups identified from what are actually \(K\) classes, then \(p_{mk}\) is the proportion of observations in `Group-m` that belong to class \(k\). The Gini Impurity is defined as:
\[ G = \sum_{k=1}^{K} p_{mk} \, (1 - p_{mk}) \]
(Very similar to the formula for the variance of a binary variable.)
A little examination of the formula shows that when every proportion \(p_{mk}\) is very close to 0 or to 1, each term \(p_{mk}(1 - p_{mk})\) is small, and so the Gini index is small. This happens exactly when each group is dominated by a single class, i.e. when each leaf is highly “pure”; a large Gini index means the classes are thoroughly mixed within the group.
A small Gini index is a good thing!
`Cross Entropy`: In line with Claude Shannon’s idea of `information entropy`, we can define the cross-entropy as:
\[ D = -\sum_{k=1}^{K} p_{mk} \, \log(p_{mk}) \]
Again, this tells us how much a particular group has members from other classes…so a small Cross-Entropy is a good thing.
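To make both formulas concrete, here is a small sketch that computes the Gini impurity and the cross-entropy for one group/leaf; the class proportions are made up for illustration.

```python
# Sketch: Gini impurity and cross-entropy for one group/leaf, from the class
# proportions p_mk.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))            # G = sum p_mk (1 - p_mk)

def cross_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # avoid log(0)
    return -np.sum(p * np.log(p))         # D = -sum p_mk log(p_mk)

pure  = [0.98, 0.01, 0.01]                # leaf dominated by one species
mixed = [0.34, 0.33, 0.33]                # leaf with the three species thoroughly mixed

print(gini(pure), gini(mixed))                    # small vs large Gini
print(cross_entropy(pure), cross_entropy(mixed))  # small vs large cross-entropy
```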
References
The beauty of Random Forests: https://orangedatamining.com/blog/2016/12/22/the-beauty-of-random-forest/
Pythagorean Trees for Random Forests: https://orangedatamining.com/blog/2016/07/29/pythagorean-trees-and-forests/
data.tree sample applications, Christoph Glur, 2020-07-31. https://cran.r-project.org/web/packages/data.tree/vignettes/applications.html