Prognosis prediction on pediatric neuroblastoma
Submit your answer (in notebook or PDF) along with your code (in notebook or scripts) in a zip
file and submit to Coursework by Nov 7th, 23:59 PM.
Neuroblastoma (NBL) is a cancer that arises in immature nerve cells of the sympathetic nervous
system, primarily affecting infants and children. In this datathon, we will analyze the genomic
data of NBL samples from the Therapeutically Applicable Research to Generate Effective
Treatments (TARGET) project to identify predictive mutation / gene expression features that
could predict risk of patients, potentially informing the treatment strategy.
I have splitted the original data set into a training set and a test set. We will perform exploratory
analyses on the training set, build a machine learning model, and test its performance on the
PART 1: Identify risk-associated mutations (13 pts)
You can find the TARGET NBL somatic mutation data in MAF format on our share drive:
TARGET_NBL_WXS_somatic_verified.maf.txt
This is similar to what we analyzed in the class. You can find the patient ID in the Case_USI
column, e.g. TARGET-30-PANYGR. Use this data table to answer the following questions:
1.1. What is the most mutated gene in the data set? The most
mutated gene is defined as the gene that has at least one
mutation in the most number of patients. (3 pts)
1.2. What is the most prevalent mutation? Similarly, this is defined
as the mutation (not gene, but a unique mutation) observed in the
most number of patients. (2 pts)
Hint: A unique mutation is defined the same as a unique variant. We try to match identical
variants in HW1.
Next, we would like to investigate if the mutations in pediatric NBL are observed in other cancer
dataset. Cancer hotspots is a database collecting frequently observed mutations from public
cancer datasets. You can load the database as a file here:
cancerhotspots.v2.light.maf.gz
https://www.cancerhotspots.org/#/home
Note the alternate alleles in the MAF file can generally be found in the column
Tumor_Seq_Allele2.
Now let’s check how many mutations in TARGET NBL datasets are also hotspots observed in
other datasets.
1.3. How many unique mutations in the TARGET NBL mutation
file have been recorded in the cancer hotspots database? (3 pts)
Next, let’s investigate the mutation information with the clinical information. You can find the
clinical data of the data set here:
nbl_target_clinical.txt
You can find the patient ID in the `TARGET USI` column. There is a `5yr_efs` indicating
whether the patient belongs to the `high risk` group (progression, relapse, or death within
the 5 years of the genomic profiling) or the `low risk` (no event for more than 5 years). There
is also a `train_test` column indicating this patient belongs to the training set (`train`) or
the test set (`test`).
You will also notice there are `in_rnaseq` and `in_maf` columns indicating whether the patient
has rnaseq data or has mutation data. You’ll find the number of patients having both rnaseq and
mutation data is very low (N=53). Hence we can only build a model using rnaseq data only later
in the analysis, instead of both mutation and rnaseq.
For now, let’s focus on the full set of mutation data. Combining the `5yr_efs` information of the
patients, we’ll investigate if the most mutated gene has association with high risk of disease
development events.
Out of the 103 unique patients in the mutation MAF file (try to count it yourself!), subset the
clinical information table to contain these 103 patients only. Create an additional column to
indicate whether the patient has at least one mutation in the top-mutated gene (let’s name it
TMG here, you’ll use the real symbol when you do it) in 1.1. Then create a count table as the
following by combining the mutation info with the info from the `5yr_efs` column:
High risk Low risk
TMG mutated c11 c12
TMG not mutated c21 c22
(c11, c12, c21, c22) are patient counts that satisfy each condition. The sum of c11 + c12 + c21 +
c22 should be 103 (we will not consider the train / test split for this part of the analysis). Once
you have this table, you should also perform a statistical test to see if the TMG mutation is
associated with risk. You can use either Fisher’s exact test or the chi-square test for this.
1.4. Produce a patient count table as described above, perform a
statistical test to check if the top-mutated gene mutation has
significant association (P value < 0.05) with 5-year disease event
risk (5 pts).
PART 2: NBL gene expression profiles difference
between high-risk and low-risk patients (10 pts)
We are now moving into gene expression raw count data. You can find the training set RNASeq
raw count data here:
nbl_target_counts.train.txt
Note that in the TARGET RNASeq table, like TCGA, each column is a sample name, whose first
16-character prefix indicates the patient ID. For example, the first sample in the table has an ID
TARGET-30-PAPVRN-01A, meaning it’s the tumor sample from patient TARGET-30-PAPVRN. You
can use the prefixes to find whether the patient belongs to the high risk group or the low risk
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency
2.1. Perform unsupervised learning on the expression data and
visualize the result (e.g. MDS, PCA, hierarchical clustering,
choose your favorite, but note if using PCA or hierarchical
clustering, the expression value should be normalized), color the
samples by risk group (high risk vs low risk). Describe if you see
samples in different risk group can be separated in your plot. (2
2.2. Perform differential expression analysis between high risk
group vs low risk group. List the top 10 genes over-expressed in
the high risk group, and do a box plot showing the distribution of
top over-expressed gene between two groups (5 pts)
2.3. Perform reactome pathway analysis between high risk group
versus the low risk group. Report your result. (3 pts)
Now you should export all the significantly differentially expressed genes in your analysis in 2.2.
(everything with FDR < 0.05) to be used in PART 3 to build a model. Also, you should perform
the same normalization as you did on training set on the test set RNASeq data, which can be
found here:
nbl_target_counts.test.txt
You can either perform trimmed mean of M-values (TMM) normalization on concatenated
training and test set together, or you can separately normalize the test set.
Save the TMM-normalized Log2 CPM values of RNASeq training set, RNASeq test set, and the
top differentially expressed genes (found using ONLY TRAINING SET) as files to be used by
Colaboratory in PART 3.
PART 3: Predictive modeling using RNAseq for
disease risk prediction (13 pts)
3.1. Build a machine learning model using RNASeq training set.
Only input the significantly differentially expressed genes you
found in 2.2. as features. Report what the top model is based on
cross validation by autoML and what is its validation score (5 pts).
3.2. Evaluate your trained model on the TMM-normalized
RNASeq test set (4 pts). Report accuracy, balanced accuracy,
and F1 score (2 pts). Also produce a confusion matrix (2 pts).
BONUS (4 pts)
This question involves looking into cell line data in Module 3.
We are going back to the mutation data. Now we look at the cell lines from CCLE:
sample_info.csv
CCLE_mutations.csv
CRISPR_gene_effect.csv
We would like to find the neuroblastoma cell lines. You can identify the cancer type of the cell
line in the primary_disease column of the sample info table. Once we identify the
neuroblastoma cell lines, try to create a waterfall plot like we did in 2022-10-24’s class using the
CRISPR data targeting the TMG gene in 1.1. Highlight the cell lines that have TMG mutation,
and comment if you see the TMG-mutated cell lines are dependent on TMG, meaning if most
TMG-mutated neuroblastoma cell lines are concentrated to the lower-score end of the plot.
Identify neuroblastoma cell lines in CCLE. Create a waterfall plot
as in the 2022-10-24 lecture using the NBL cell lines. Highlight the
TMG (top mutated gene from 1.1.) mutated cell lines and
comment if they have high dependency on TMG (4 pts).