CBMFW4761 hw1

Install prerequisite this will take a while
# Install prerequisite, no need to run this if you are running the notebook from
# your laptop and you already install the packages in Anaconda

!pip install biopython
!pip install ipytree
!pip install scikit-allel
!pip install zarr

# install mafft commandline
!wget https://mafft.cbrc.jp/alignment/software/mafft_7.471-1_amd64.deb
!sudo dpkg -i mafft_7.471-1_amd64.deb

Mount Google Drive (You don’t need to run this if you are running notebooks on your laptop)

from google.colab import drive

# The following command will prompt a URL for you to click and obtain the
# authorization code

drive.mount(“/content/drive”)

Mounted at /content/drive

# Set up data folder
from pathlib import Path

# Change this to where you put your hw1 files
DATA = Path(“”)

Problem 1: SARS-CoV-2 Spike Protein Multiple Alignment¶

Severe acute respiratory syndrome coronavirus 2, or SARS-CoV-2, is responsible for the COVID-19 pandemic. The outbreak of the COVID-19 has brought together the global genomic research community to focus on understanding the virus. However, as scientists curate the virus genome and develop vaccines at an unprecedented speed, evidence of the SARS-CoV-2 evolution has been characterized by the emergence of sets of mutations that could potentially increase transmissibility or cause more severe disease. The Center of Disease Control and Prevention (CDC) has published a SARS-CoV-2 Variant Classifications and Definitions that categorizes the viral variants into three categories:

Variants being monitored (VBM): Variants designated as VBM include those where data indicates there is a potential or clear impact on approved or authorized medical countermeasures or that have been associated with more severe disease or increased transmission but are no longer detected, or are circulating at very low levels, in the United States. These variants do not pose a significant and imminent risk to public health in the United States.
Variant of Interest (VOI): A variant with specific genetic markers that have been associated with changes to receptor binding, reduced neutralization by antibodies generated against previous infection or vaccination, reduced efficacy of treatments, potential diagnostic impact, or predicted increase in transmissibility or disease severity, e.g. Eta, Iota.
Variant of Concern (VOC): A variant for which there is evidence of an increase in transmissibility, more severe disease (e.g., increased hospitalizations or deaths), significant reduction in neutralization by antibodies generated during previous infection or vaccination, reduced effectiveness of treatments or vaccines, or diagnostic detection failures, e.g. Alpha, Beta, Delta.
Variant of High Consequence (VOHC): A variant of high consequence has clear evidence that prevention measures or medical countermeasures (MCMs) have significantly reduced effectiveness relative to previously circulating variants. Currently no variant has been categorized into this class.

In this exercise, we will perform multiple protein alignment on the spike protein of different variants of SARS-CoV-2. The S surface glycoprotein is found on the outside of the virus particle and gives coronavirus viruses their crown-like appearance. It mediates attachment of the virus particle and entry into the host cell. As a result, S protein is an important target for vaccine development, antibody therapies and diagnostic antigen-based tests.

Load spike protein sequences¶
You can download all the spike protein sequences that have ever been submitted to NCBI as a data package from the NCBI Datasets website. We will be focusing on the following sequences for our analysis.

seq_id_of_interest = [
‘QRN78347.1’,
‘QRX39425.1’,
‘QUD52764.1’,
‘QWE88920.1’,
‘UFO69279.1’,
‘UOZ45804.1’,
‘UTM82166.1’,
‘YP_009724390.1′,

Note the YP_009724390.1 sequence is what we call a reference sequence. It is the base sequence we used to refer to all the mutations in the genome. For example, if we say a new virus genome has an N501Y mutation in its spike protein, it means that the virus has a Y(Tyrosine) at its 501st amino acid instead of an N(Asparagine), compared to the reference genome sequence.

As your first task, fetch these 8 sequences from NCBI as we did in the class, save them into a fasta file so we can perform multiple sequence alignment later.

#==============================================================================
# Write your code here
# Fetch the 8 sequences of interest of S protein and save them in a fasta file
#==============================================================================

Answer the following questions¶

1.1. When we look into the sequence information of SARS-CoV-2 on NCBI, you’ll notice that its first line is:¶

LOCUS NC_045512 29903 bp ss-RNA linear VRL 18-JUL-2020

What does the ss-RNA mean? Feel free to look it up on the internet and share your understanding. However, remember to include your information source.

Perform multiple sequence alignment¶

Now we have the sequences, let’s run multiple sequence alignment. Recent publications have been using MAFFT for aligning SARS-CoV-2 genome sequences. Here we will use MAFFT to perform multiple sequence alignment on S protein sequences. Similar to other multiple alignment algorithms, there is a biopython commandline wrapper for us to run easily: MafftCommandline under Bio.Align.Applications. Refer to the documentation in biopython.

#============================================================================
# Write your code here
# Perform multiple sequence alignment using MafftCommandline class and write
# the alignment result to another fasta file
#============================================================================

With the alignment done, we can construct a phylogenetic tree to quantify and visualize the genomic evolution of the virus taken from different locations. You can follow the document of biopython’s Phylo module.

First you will need to use the DistanceCalculator to generate distance matrix. As MAFFT by default uses BLOSUM62 as the scoring matrix, you will need to set the model parameter of the DistanceCalculator class to blosum62.

Once you generate the distance matrix using DistanceCalculator, we will build the phylogenetic tree using the UPGMA method, which can be called through the DistanceTreeConstructor class. Draw your phylogenetic tree to visualize your result like we did in the class.

#============================================================================
# Write your code here
# Generate distance matrix using DistanceCalculator with `blosum62` as the
# scoring matrix, then construct a phylogenetic tree using
# DistanceTreeConstructor using UPGMA method
#============================================================================

#============================================================================
# Write your code here
# Draw your phylogenetic tree
#============================================================================

Answer the following questions¶

1.2. From the phylogenetic tree you created on the sequences of interest, if we use a branch length cutoff of 0.005, how many clusters do you observe? list the sequence IDs in each cluster¶

Assign variant sequence identities¶

Based on the alignment, using the reference sequence YP_009724390.1 as the reference sequence, we can identify all mutations in spike proteins in the other 7 sequences. This can be done similarly to what we did for influenza hemagglutinin sequences in class.

Once you identify the mutations for each sequence in the spike protein, you can assign their variant identity (e.g. alpha, delta, omicron) based on the signature given in the following plot from :

#============================================================================
# Write your code here
# Identify mutations in each sequence compared to YP_009724390.1, and assign
# variant identity based on the figure in viralzone above
#============================================================================

Answer the following questions¶

1.3. From the mutations in each sequence and the figure from Viralzone, determine which variant each sequence belongs to. Note you only need to provide on a high level of variant (e.g. alpha, beta, omicron), no need for subvariant like BA2.5. Also you might not find the exact set of variants as described in the figure, but for each sequence you should be able to identify some unique mutations that only exist in one variant but not the others¶

Problem 2: Prostate Cancer Risk Screening Using Germline Variant Data¶

One of the applications of genetic variant calling is to identify possibly disease-causing variant. ClinVar is a well-curated clinical variants database for pathogenic disease variant annotation.

In this exercise, we will perform a retrospective screening for prostate cancer by identifying pathogenic germline variants that are likely to lead to prostate cancer in the 1000 genome project subjects.

Note we are using variant mapped to human reference genome version GRCh37 since the 1000 genome VCF file we have was also mapped to the same version. We can load the clinvar vcf file using allel.read_vcf and interact with it like we did in the class. You can also convert it to zarr. However, given the small size of the file it does not seem necessary.

# change the filename parameter to where you download the vcf.gz file
clinvar_file = “clinvar_20220910_grch37.vcf.gz”

#==========================
# Your code here
# Load the clinvar vcf file
#==========================

Now you’ve loaded clinvar, first we would like to find all the variants that are annotated by clinvar as Pathogenic to prostate cancer in chromosome X. Note we restrict the scope down to chromosome X for simplicity, in real life one would search for variants in all chromosomes.

To perform the search, you can find pathogenicity in the field variants/CLNSIG, and the disease name in variants/CLNDN. Put extra care into the fact that CLNDN field might contain more than one disease, e.g. it might be something like disease1|disease2 while one of the two diseases could be prostate cancer. Also prostate cancer might be annotated as either Prostate_cancer or Malignant_tumor_of_prostate. You should try to be inclusive first and check out all the possible names of prostate cancer and revise your query.

#===============================================================================
# Your code here
# Filter the clinvar variants to identify pathogenic variants to prostate cancer
# in chromosome X
#===============================================================================

Answer the following questions¶

2.1. How many pathogenic variants to prostate cancer do you find in chromosome X? Which genes do these variants belong to?¶

2.2. Can you find one example literature connecting the genes you found in 2.1. to prostate cancer? Briefly explain the genes’ function.¶

Identifying subjects carrying prostate cancer pathogenic variants¶

Now we’ve identified the prostate cancer pathogenic variants, we will perform the screening in the genetic data of 1000 genome project subjects. First given the variants of interest we identified in 2.1., let’s try to find how many of them we also observed in 1000 genome project. Basically if a variant does not exist in the VCF file, it means the variant was not observed in the cohort.

We’ll start with the 1000 genome VCF file. Load the VCF file or the zarr file you created in class and find how many variants of interest were observed in 1000 genome.

#============================================================================
# Your code here
# Read the VCF file using allel and identify how many variants of interest
# were observed in 1000 genome
#============================================================================

Answer the following questions¶

2.3. How many pathogenic variants to prostate cancer you found in 2.1. were observed in 1000 genome project?¶

Now we know which variants were observed in 1000 genome project, we just need to extract subjects with a non-homozygous-reference genotype to identify subjects carrying pathogenic variants of prostate cancer, namely identify in the genotype array of the VCF file subjects with genotypes that’s not 0/0 (for female) or 0/. (for male).

#==============================================================================
# Your code here
# Find 1000 genome project subjects that have non-homozygous-reference genotypes
# at prostate cancer pathogenic variant locations
#==============================================================================

Answer the following questions¶

2.4. How many subjects have the pathogenic variants of prostate cancer?¶

Finally, for these subjects, create a dataframe, where each row is a non-reference variant with the following columns:

subject_id: 1000 genome ID
gender: gender of the subject as in the Sex column in the 1kgn_phase3_population.csv file we used in the class
super_population_code: the Superpopultaion Code of the subject as in the 1kgn_phase3_population.csv file we used in the class
variant_of_interest: which variant of interest the subject has. You can just use the Location format as in the VEP file
genotype: the genotype of the variant (i.e. 0/1, 1/1 or 1/.)

You should merge the genotype information to the 1kgn_phase3_population.csv we used in class.

#=============================================================================
# Your code here
# Create a dataframe as described above
#=============================================================================

Answer the following questions¶

2.5. How many male subjects have prostate pathogenic variants? What superpopulations do these male subjects belong to?¶

Problem 3: mutation load¶

We mentioned that DNA has a mismatch repair (MMR) mechanism to fix errors that occur in DNA during replication in addition to proofreading. In this problem we will investigate how the damages in the MMR pathway would affect the number of mutations in cancer. The number of mutations in sample is called mutation load.

First following what we did in the lecture note, load the MAF files of the TCGA breast cancer data set, which you can download here:

Remember to filter the mutations leaving only the ones passing the filter.

#=========================================================
# Your code here to load the MAF file and filter mutations
#=========================================================

Now define an MMR gene set. Using the MAF file, identify which patients have at least one mutation with MODERATE / HIGH impact in the 11 MMR genes as defined in the Cortes-Ciriano et al.) paper.

MLH1, MLH3, MSH2, MSH3, MSH6, PMS1, PMS2, POLE, POLD1, PRKDC, APC

(for simplicity we’ll skip the BRAF mutation)

#==============================
# Define your MMR gene set here
#==============================

mmr = {“MLH1”, “MLH3”, “MSH2”, “MSH3”, “MSH6”, “PMS1”, “PMS2”, “POLE”, “POLD1”,
“PRKDC”, “APC”}

Now we can calculate the mutation load of each sample, and also identify samples with mutations in their MMR genes.

First, calculate the mutation load of each sample by aggregating the number of mutations (i.e. number of rows) in individual sample. Hint: use the groupby function in pandas.DataFrame

#===================================
# Your code here
# get number of mutations per sample
#===================================

Let’s identify the samples who have mutations in their MMR genes.

#===========================================================
# Your code here
# identify the samples that have mutation in their MMR genes
#===========================================================

We can then do a boxplot depicting the distribution of mutation load corresponding to whether the sample has mutation in their MMR gene. Because of the distribution of mutation load is highly distorted, you might want to set the y-axis to log-scale. But remember to align the IDs of the two variables you just calculated.

One way to create boxplots is to use the seaborn.boxplot function. Refer to its document fore more information.

#======================================
# Your code here
# align the sample IDs and do a boxplot
#======================================

To further confirm the observation is statistically significant, we can perform a Mann-Whitney U Test for the mutation_load and obtain the P value of the significance of the difference. Mann-Whtney U test is a non-parametric, rank-based test hence it is more robust to the distribution of the data. Given that the distribution of mutation load generally has a long tail and given our large sample size applying non-parametric test is appropriate here.

We can use the package scipy.stats.mannwhitneyu (see here for details.)

#==================================================================================
# Your code here
# perform t test between the mutation numbers in samples with mutation in MMR genes
# versus mutation numbers in samples with no mutation in MMR genes
#==================================================================================

Answer the following questions¶

3.1. Do you see a significant P value (P value < 0.05)? Describe in your own word the association between mutations in MMR genes and mutation load¶ 3.2. Provide your hypothesis why we observe the difference in mutation loads between samples with mutations in MMR genes and samples without mutation in MMR genes.¶