Predicting Hepatitis Survival with Decision Tree and Random Forest

Part 1 - Decision Tree

Python code in this project: add_header_plot_missing.txt + decision_tree.txt
In this article, I look at a set of data on patients with acute hepatitis. The dataset is small, containing information on only 155 patients. What is very interesting about it is that at least three research papers have analyzed it, with different results: Breiman 2001; Cestnik, Kononenko, & Bratko 1987; and Diaconis & Efron 1983.

I will take a look at this dataset myself, beginning with fitting a decision tree to the data using the CART algorithm.

Data Description & Cleaning


The dataset comes from https://archive.ics.uci.edu/ml/datasets/hepatitis. According to the link, it was put together and generously contributed by Gail Gong at Carnegie-Mellon University. As mentioned before, this is a set of data on 155 patients with acute hepatitis. The key variable is "class", indicating whether the patient survived. In addition, there are 19 other attributes.



The binary attributes described by words (as shown in the figure above) have already been encoded numerically, with 1 representing the category on the left and 2 the category on the right. For example, the "steroid" variable is either 1 for "no" or 2 for "yes".

It is unclear exactly how the "class" variable is defined. My guess is that these patients were admitted to hospital for treatment, and that "live" means they survived their stay and were later discharged.

The dataset is missing a header, which I added. I also felt that calling the variable "class" was a little cryptic, so I used this opportunity to rename it to "outcome". Next, I looked at the missing values, all of which are represented by a "?" symbol in the dataset.



We can see from the plot above that there are quite a few missing values!

Handling Missing Values

Decision tree algorithms iteratively split a dataset into two subgroups, and so could technically ignore missing values by deciding the optimal split based only on the data points that are present. Unfortunately, the sklearn decision tree module does not support missing values!

Dealing with missing data and imputation is a big topic that I will (probably) cover in another article. For this set of data, I am going to try simply replacing missing values with an extreme value of -10000. If these values have an impact, the algorithm should make a cut at a large negative value and gather them into a node.

The python code for adding a header and plotting missing values can be found here: add_header_plot_missing.txt
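For reference, here is a minimal sketch of those cleaning steps. The file name and column names are assumptions based on the UCI attribute listing, so the actual code in add_header_plot_missing.txt may differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed column names, following the order of the UCI attribute listing,
# with "class" already renamed to "outcome".
columns = ["outcome", "age", "sex", "steroid", "antivirals", "fatigue",
           "malaise", "anorexia", "liver_big", "liver_firm", "spleen_palpable",
           "spiders", "ascites", "varices", "bilirubin", "alk_phosphate",
           "sgot", "albumin", "protime", "histology"]

# The raw file has no header row, and "?" marks missing values.
df = pd.read_csv("hepatitis.data", header=None, names=columns, na_values="?")

# Plot the number of missing values per attribute.
df.isna().sum().plot.barh()
plt.xlabel("number of missing values")
plt.tight_layout()
plt.show()

# Replace every missing value with the extreme sentinel -10000.
df = df.fillna(-10000)
```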

CART Decision Tree Algorithm


Now that the data has been cleaned, we are ready to fit a decision tree to it.

There are various methods for constructing decision trees. I chose to use the famous CART algorithm, introduced by Breiman, Friedman, Olshen, and Stone in their 1984 book, "Classification and Regression Trees". CART has been named one of the "top 10 algorithms in data mining", and the research paper of that name also contains a great description of how CART works.

Algorithm Description

The CART algorithm selects an attribute, using the criterion described below, and partitions the set of data into two subsets based on the value of that attribute.



For example, in the figure above, the 72 data points with attribute values \( x_{16} \le 3.85 \) are assigned to the left subset, while the 83 data points with \( x_{16} > 3.85 \) are assigned to the right subset. The same criterion is then applied independently to each of the two resulting subsets to split them further. This process is repeated until the subsets contain elements from only one class, or we run out of attributes to use.
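To make the partitioning concrete, here is a toy sketch of a single split, assuming the cleaned DataFrame df from the earlier snippet. The attribute "albumin" and the threshold 3.85 are placeholders for illustration, not necessarily the split the fitted tree actually chose.

```python
# One CART split sends every data point to exactly one of two child subsets.
feature, threshold = "albumin", 3.85   # placeholder attribute and cutoff
left = df[df[feature] <= threshold]    # data points sent to the left child
right = df[df[feature] > threshold]    # data points sent to the right child
print(len(left), len(right))
```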

Gini Impurity Splitting Criteria

Two important questions arise here: which attribute do we choose, and how do we choose the cutoff point that assigns data points to the left and right subsets?

CART answers both by selecting the attribute and split point that result in the greatest decrease in Gini impurity. This decrease is calculated by subtracting the weighted average of the two child nodes' Gini impurities (each child weighted by the fraction of data points it receives) from the Gini impurity of the parent. The Gini impurity is reported at each node; for example, the root node in the figure above has Gini impurity \( = 0.328 \).

Gini impurity is an interesting metric for how "mixed" a set of data is. It comes from the following thought experiment: pick a data point from the set at random, then assign it a random label according to the proportions of labels in the set; the Gini impurity is the probability that this assigned label is wrong.


For example, if the proportion of one label is \( p \) (and the other is \( 1-p \)), then the probability of a random data point getting the wrong label is \( p(1-p) + (1-p)p = 2p(1-p) \). This is maximized at \( p = 0.5 \), when the data is evenly split. The concept extends to sets with multiple labels by summing \( p_i (1-p_i) \) over each label class \( i \), which simplifies to \( 1 - \sum_i p_i^2 \).
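As a sanity check against the root node's reported value, assume the class counts at the root are the 32 "die" and 123 "live" patients given in the UCI documentation (an assumption, since the counts are not quoted above). Then

\[ 1 - \left(\tfrac{32}{155}\right)^2 - \left(\tfrac{123}{155}\right)^2 \approx 1 - 0.0426 - 0.6297 \approx 0.328, \]

which matches the figure.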

There is an alternative splitting criterion based on entropy, called information gain, but research on the differences between the two suggests that they are largely interchangeable.
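To make the splitting criterion concrete, here is a small, self-contained sketch of how the decrease in Gini impurity could be computed for one candidate split. The helper names are my own (not sklearn's); the weighting of each child by its share of the parent's data points mirrors how CART scores splits.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of labels: 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(y, split_mask):
    """Decrease in Gini impurity when a boolean mask splits y in two,
    each child weighted by the fraction of the parent's points it gets."""
    y_left, y_right = y[split_mask], y[~split_mask]
    weighted = (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / len(y)
    return gini(y) - weighted

# Toy example: a perfectly mixed parent separated into two pure children.
y = np.array([1, 1, 2, 2])
x = np.array([1.0, 2.0, 5.0, 6.0])
print(gini_decrease(y, x <= 3.85))   # 0.5 - 0.0 = 0.5
```

CART evaluates this quantity for every attribute and every candidate cutoff, and keeps the split with the largest decrease.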

My Decision Tree





The code for generating and plotting this decision tree can be found here: decision_tree.txt.
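For reference, here is a minimal sketch of what that code might look like with sklearn, assuming the cleaned DataFrame df from the earlier snippet; variable names and plotting details are my own and may differ from decision_tree.txt.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

X = df.drop(columns="outcome")
y = df["outcome"]

# criterion="gini" is the sklearn default and matches the CART description above.
# Without a fixed random_state, tied splits are broken at random (see the note below).
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X, y)

plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=list(X.columns),
          class_names=["die", "live"],   # assumes the coding 1 = die, 2 = live
          filled=True)
plt.show()
```

Passing a fixed random_state to DecisionTreeClassifier is one way to make the resulting tree reproducible.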

Note that the decision tree generated may differ slightly each time this code is run. This is because when two splits are tied for "best", one of them is chosen at random.

Score and Confusion Matrix


The sklearn decision tree classifier has a "score" method that returns the mean accuracy of my model, which is simply the percentage of data points that are correctly labeled.


The accuracy is surprisingly good, at around 92.25%! Next, let's take a look at the confusion matrix for my model. Luckily, sklearn has a function to compute it for us.


As shown above, it produces a matrix whose \( (i,j) \) entry contains the number of data points that have true label \( i \) but are classified as \( j \) by our model.

Note that the diagonal entries have coordinates \( (i,i) \) and therefore contain the number of data points that are correctly labeled by our model. Also notice that in the figure above, there are only \( 9 + 3 \) mislabeled data points off the diagonal!
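Here is a sketch of how both numbers could be obtained, assuming the fitted clf and the X, y from the previous snippet, and assuming evaluation on the same data used for fitting (the article does not describe a train/test split, so the actual code in decision_tree.txt may differ).

```python
from sklearn.metrics import confusion_matrix

# Mean accuracy: the fraction of data points whose predicted label
# matches the true label.
print(clf.score(X, y))

# Entry (i, j) counts the data points with true label i that the
# model classified as label j.
print(confusion_matrix(y, clf.predict(X)))
```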



My CART decision tree model turned out surprisingly well. In part 2 of this article, I will analyze this dataset with a random forest model, and take a deeper look at the variables that the model decided were important.