Given five patient attributes (age, sex, blood pressure, cholesterol, and sodium-to-potassium ratio), which drug should be prescribed? That's the problem PharmaTree solves. It is a Decision Tree Classifier trained on 200 patient records and wrapped in a Streamlit application that makes predictions interactive and interpretable.
The best ML model is the one your stakeholders can actually understand. A decision tree isn't always the most accurate model, but it is one of the easiest to explain, and that matters in healthcare contexts.
The Dataset
The UCI drug dataset contains 200 patients, each labeled with one of five drug types (DrugA, DrugB, DrugC, DrugX, DrugY). Features include age (integer), sex (M/F), blood pressure (LOW/NORMAL/HIGH), cholesterol (NORMAL/HIGH), and Na_to_K ratio (continuous). Categorical variables are encoded using scikit-learn's LabelEncoder before training.
from sklearn.preprocessing import LabelEncoder

# Keep one fitted encoder per column so every mapping can be inverted later.
# (Refitting a single shared encoder discards the mapping of each earlier column.)
encoders = {}
for col in ['Sex', 'BP', 'Cholesterol', 'Drug']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
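To make the encoding step concrete, here is a minimal sketch of what LabelEncoder does to a single column. The values are toy data chosen to mirror the BP categories, not rows from the dataset, and the name bp_encoder is only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values that mirror the BP column's categories (not rows from the dataset).
bp_values = ["HIGH", "LOW", "NORMAL", "LOW", "HIGH"]

bp_encoder = LabelEncoder()
bp_codes = bp_encoder.fit_transform(bp_values)

# classes_ holds the original labels in sorted order; a label's index is its code.
bp_mapping = {label: code for code, label in enumerate(bp_encoder.classes_)}
print(bp_mapping)         # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
print(bp_codes.tolist())  # [0, 1, 2, 1, 0]

# inverse_transform turns integer codes back into readable labels.
decoded = bp_encoder.inverse_transform([2, 0]).tolist()
print(decoded)            # ['NORMAL', 'HIGH']
```

Note that the codes follow alphabetical order (HIGH, LOW, NORMAL), not the clinical order (LOW, NORMAL, HIGH). A tree can still separate the categories, but it may need an extra split to do so.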
The Model
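Decision Tree Classifiers work by recursively splitting the data on the feature and threshold that most reduce impurity at each node. With scikit-learn's default criterion, impurity is measured by the Gini index; setting criterion='entropy' makes the choice equivalent to maximizing information gain. The tree grows until its nodes are pure or it hits a specified maximum depth or minimum number of samples required to split. Both limits are exposed as tunable parameters in the app.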
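As a rough illustration of how a candidate split is scored, the sketch below computes the impurity decrease for one hypothetical split, using both entropy (information gain) and Gini impurity. The labels and the split are made up for the example, and the functions are textbook definitions written from scratch, not scikit-learn's internal implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def impurity_decrease(parent, left, right, measure):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - weighted

# Hypothetical labels for eight patients and one made-up candidate split.
parent = ["drugY"] * 4 + ["drugX"] * 4
left = ["drugY"] * 4 + ["drugX"] * 1   # e.g. rows above some Na_to_K threshold
right = ["drugX"] * 3                  # e.g. rows at or below it

info_gain = impurity_decrease(parent, left, right, entropy)
gini_gain = impurity_decrease(parent, left, right, gini)
print(round(info_gain, 3))  # 0.549
print(round(gini_gain, 3))  # 0.3
```

A tree learner evaluates many such candidate splits at every node and keeps the one with the largest decrease.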
from sklearn.tree import DecisionTreeClassifier

# max_depth and min_samples_split are set from the app's tunable parameters
model = DecisionTreeClassifier(
    max_depth=max_depth,
    min_samples_split=min_samples_split,
)
model.fit(X_train, y_train)
The sodium-to-potassium ratio (Na_to_K) consistently emerges as the dominant feature, accounting for roughly 50% of the model's total feature importance. This aligns with clinical literature: Na_to_K is a strong biomarker for drug-class selection, particularly for diuretics.
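For readers unfamiliar with how such a percentage is obtained, here is a small sketch of reading feature importances from a fitted tree. It uses toy data in which the label depends only on the first column, so the numbers say nothing about the drug dataset; the column names are borrowed purely for readability.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data, NOT the drug dataset: the label depends only on the first column.
feature_names = ["Na_to_K", "Sex"]
X = [[5, 0], [10, 1], [12, 0], [20, 1], [25, 0], [30, 1]]
y = [0, 0, 0, 1, 1, 1]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ is normalized to sum to 1.0. Each entry is that
# feature's share of the total impurity reduction across all splits.
importances = dict(zip(feature_names, model.feature_importances_.tolist()))
for name, share in importances.items():
    print(f"{name}: {share:.0%}")
```

Because a single split on the first column separates the toy classes perfectly, that column receives all of the importance here. On real data the shares are spread across features and shift as the depth limit changes.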
The Application
The Streamlit app exposes four views: a live prediction interface with probability visualization, a data analysis tab showing age distributions by drug type, a model performance tab with accuracy metrics and real-time feature importance, and a technical explainer. Model parameters update dynamically — adjust max depth or minimum samples and the accuracy and feature importance charts recalculate immediately.
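import plotly.graph_objects as go

prediction = model.predict(input_data)
proba = model.predict_proba(input_data)

# Plot the model's actual class probabilities, not hardcoded values.
# The columns of proba follow the order of model.classes_ (the encoded drug labels).
fig = go.Figure()
fig.add_trace(go.Bar(
    x=[f'Drug {c}' for c in model.classes_],
    y=proba[0].tolist(),
))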
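The dynamic recalculation described above can be sketched as a single function that retrains and rescores the tree whenever a parameter changes. This is a minimal sketch, not the app's actual code: the data is a synthetic stand-in from make_classification, train_and_score is a hypothetical helper name, and in a Streamlit app its two arguments would typically come from widgets such as st.slider.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 5 features, 5 classes.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

def train_and_score(max_depth, min_samples_split):
    """Retrain the tree with the given limits; return test accuracy and importances."""
    model = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=0,
    )
    model.fit(X_train, y_train)
    return model.score(X_test, y_test), model.feature_importances_

# Simulate moving the max-depth control through a few settings.
for depth in (1, 3, 6):
    accuracy, importances = train_and_score(depth, min_samples_split=2)
    print(f"max_depth={depth}: accuracy={accuracy:.2f}, "
          f"importances={importances.round(2).tolist()}")
```

Keeping training behind one function makes it easy to call again on every parameter change, which is practical for a dataset this small because retraining takes a fraction of a second.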
What This Is — and What It Isn't
200 rows is not a clinical dataset. This model should not inform actual prescriptions. What it demonstrates is the full stack of a production ML application: data ingestion, preprocessing, model training, interactive inference, and real-time visualization — packaged and deployed. The same architecture scales to real clinical data with a dataset swap.
The next iteration would add model comparison (Decision Tree vs. Random Forest vs. XGBoost) and train on a larger clinical dataset to earn the credibility this version can't reach.
Try the live application — input patient attributes and see the prediction and probability breakdown in real time.
Built using the IBM Machine Learning with Python course on Coursera — the certificate that started this track.