Install prerequisites: you may have to do Runtime -> Restart runtime after the installation.
!pip install --upgrade pip
!pip install --upgrade setuptools wheel
!pip install --upgrade "mxnet<2.0.0"
!pip install autogluon
import autogluon
Mount Google Drive (you don't need to run this if you are running the notebook on your laptop)
from google.colab import drive
# The following command will prompt a URL for you to click and obtain the
# authorization code
drive.mount("/content/drive")
Mounted at /content/drive
# Set up data folder
from pathlib import Path
# Change this to where you put your hw2 files
DATA = Path("")
Problem 2: Classifying Skin Sample Source Using Transcriptomic Profile
In Problem 1, we performed differential expression analysis between sun-exposed and non-sun-exposed skin. Since these two types of samples show distinct transcriptomic patterns, we will now build a binary classification model that, given the transcriptomic profile of a skin sample, predicts whether the skin has been exposed to the sun.
We'll use the log2 CPM values generated from DGEList to build our model.
We'll also load the metadata, clean up some columns as we did in Problem 1, and extract the sample IDs that match the column names of the expression table.
import pandas as pd
gefile = DATA / "normal_gtex_subset_rnaseq_tmm_logcpm.txt"
metafile = DATA / "gtex_subset_sample_sheet.txt"
ge = pd.read_csv(gefile, sep="\t", index_col=0)
meta = pd.read_csv(metafile, sep="\t", index_col="SAMPID")
As in Problem 1, we are only interested in skin samples. We will also create a binary target variable indicating whether the sample was from sun-exposed skin.
# Align the samples in the two files, keeping only skin samples
meta = meta[
    meta.SMTSD.isin(
        {"Skin - Not Sun Exposed (Suprapubic)",
         "Skin - Sun Exposed (Lower leg)"}
    )
]
common_samples = list(set(meta.index) & set(ge.columns))
meta = meta.loc[common_samples]
ge = ge.loc[:, common_samples]

# Binary target: 1 = sun exposed, 0 = not sun exposed
meta["sun_exposed"] = 1
meta.loc[meta.SMTSD == "Skin - Not Sun Exposed (Suprapubic)", "sun_exposed"] = 0
# count samples per group
meta.groupby("sun_exposed").size()
sun_exposed
0    52
1    60
dtype: int64
Split test set
Next, we will split our data by extracting the 10 test subjects we defined here:
[skin_test_subjects.txt]
We will create an additional column in meta called subset, which indicates whether each sample belongs to the train or test set.
test_subject_id = pd.read_csv(
    DATA / "skin_test_subjects.txt", sep="\t", header=None
)[0]  # take the single column as a Series of subject IDs
test_subject_id
0 GTEX-11VI4
1 GTEX-11WQK
2 GTEX-1212Z
3 GTEX-12696
4 GTEX-1269C
5 GTEX-14PK6
6 GTEX-16BQI
7 GTEX-V1D1-
8 GTEX-X8HC-
9 GTEX-ZYT6-
Name: 0, dtype: object
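Note: the split below relies on a subject_id column in meta that would have been created during the Problem 1 cleanup. If it is not present, here is one way to derive it; the sketch assumes (based on the trailing dashes in the IDs above) that the subject ID is simply the first 10 characters of the GTEx sample ID.
# Assumption: subject ID = first 10 characters of the SAMPID
# (e.g. "GTEX-11VI4-xxxx-SM-xxxxx" -> "GTEX-11VI4"); shorter donor codes keep
# a trailing dash, matching the entries in skin_test_subjects.txt
meta["subject_id"] = meta.index.str[:10]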
# define train and test set
meta["subset"] = "train"
meta.loc[meta.subject_id.isin(test_subject_id), "subset"] = "test"
As a sanity check, let's see how many samples we have in each class after the split:
meta.groupby(["subset", "sun_exposed"]).size()
subset sun_exposed
test 0 10
1 10
train 0 42
1 50
dtype: int64
You may notice that we are working with a very small data set here. Ideally, in a machine learning use case we would like at least 100 samples per class for training, along with many more samples for testing. This exercise is just a toy example for you to practice preparing data and applying AutoML; it is definitely not an ideal use case. You might therefore notice that some model behaviors are counter-intuitive or that the outcomes are not meaningful.
Feature selection
To build our first model, let's use the differentially expressed genes we found in Problem 1 as the features. Use only genes with logFC > 1 or logFC < -1 (and FDR < 0.05, as we already selected). Create x_train and x_test from the RNA-seq data (with the selected genes as columns and the training and testing samples as rows, respectively), and the y_train and y_test vectors from the sun_exposed column of meta.
#=======================================================
# Your code here
# Create x_train, x_test, y_train, y_test
#=======================================================
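One possible sketch (not the official solution): it assumes the Problem 1 results are in a DataFrame called de_results (a hypothetical name) indexed by gene ID, already filtered to FDR < 0.05, with a logFC column.
# Sketch only; de_results is a hypothetical name for the Problem 1 DE table
selected_genes = de_results.index[de_results.logFC.abs() > 1]
selected_genes = [g for g in selected_genes if g in ge.index]

train_samples = meta.index[meta.subset == "train"]
test_samples = meta.index[meta.subset == "test"]

# The expression table is genes x samples, so transpose to samples x genes
x_train = ge.loc[selected_genes, train_samples].T
x_test = ge.loc[selected_genes, test_samples].T
y_train = meta.loc[train_samples, "sun_exposed"]
y_test = meta.loc[test_samples, "sun_exposed"]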
Training a classifier using AutoML
Again, follow what we did in class: using the features selected above, we will build a model with AutoML. You can use the good_quality_faster_inference_only_refit preset to save time and space.
#===========================================================================
# Your code here
# Train a classification model using AutoGluon TabularPrediction module
#===========================================================================
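One possible sketch (not the official solution), using the TabularPredictor API from recent AutoGluon releases; older releases expose a TabularPrediction task API instead, so adjust the import and preset name to your installed version.
from autogluon.tabular import TabularPredictor

# AutoGluon expects the label as a column of the training table
train_data = x_train.copy()
train_data["sun_exposed"] = y_train

predictor = TabularPredictor(label="sun_exposed").fit(
    train_data,
    presets="good_quality_faster_inference_only_refit",
)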
Once the model is trained, evaluate the best model from AutoML on the test set using the performance_scores function provided below, as we did in class.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             roc_auc_score, f1_score)

def performance_scores(y_true, y_pred_score, y_pred=None):
    # If hard predictions are not given, take the class with the highest
    # predicted score as the predicted class
    if y_pred is None:
        y_pred = y_pred_score.idxmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_pred_score.iloc[:, 1], average="weighted",
                               multi_class="ovr"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }
#===========================================================
# Your code here
# Calculate the test performance scores of your model
#===========================================================
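One possible sketch (not the official solution), assuming predictor, x_test, and y_test were created as above; in recent AutoGluon releases predict_proba returns a DataFrame with one column of scores per class.
y_pred_score = predictor.predict_proba(x_test)
y_pred = predictor.predict(x_test)
performance_scores(y_test, y_pred_score, y_pred)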
Now also plot the confusion matrix to show the correct and incorrect predictions on the test set.
#============================================================================
# Your code here
# Plot confusion matrix for the trained model
#============================================================================
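One possible sketch (not the official solution), using scikit-learn's ConfusionMatrixDisplay; any equivalent plotting approach works.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are true classes, columns are predicted classes
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title("Confusion matrix on the test set")
plt.show()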
Answer the following questions
2.2. Try to investigate your model's performance by extracting the following information: What is the top model in the AutoML leaderboard? What are the validation scores of the best AutoML model? What are the most important features in the AutoML model?
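One possible starting point (not the official solution): recent TabularPredictor releases expose both a leaderboard and permutation-based feature importance.
# Models ranked by validation score; the top row is the best model
predictor.leaderboard()

# Permutation feature importance; the data must include the label column
test_data = x_test.copy()
test_data["sun_exposed"] = y_test
predictor.feature_importance(test_data)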