Mount Google Drive (You don’t need to run this if you are running notebooks on your laptop)
from google.colab import drive
# The following command will prompt a URL for you to click and obtain the
# authorization code
drive.mount("/content/drive")
Mounted at /content/drive
Biomarker Identification in Drug Development
The Cancer Dependency Map project
The Cancer Dependency Map (DepMap) project is an umbrella project that aims to provide high-quality genomic profiles of cancer cell lines (CCLE), their sensitivities to genetic perturbations (Achilles, DRIVE, DEMETER2, etc.), and their sensitivities to small-molecule perturbations (PRISM). Mining these data allows us to better understand cancer biology and potentially discover new genes for targeted therapies.
Cell line annotation file
Each cancer cell line is derived from a tumor. Information about the source of a cell line can help drug developers identify indications of interest by answering questions such as: in which cancer types are BRCA1 / BRCA2 mutations most prevalent?
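As a minimal sketch of how such a question could be answered, the example below uses two tiny made-up tables whose column names (DepMap_ID, primary_disease, Hugo_Symbol) mirror the DepMap files used in this notebook. The IDs and values are invented for illustration only; with the real files the same pattern would apply, but the numbers here say nothing about actual prevalence.

```python
import pandas as pd

# Toy stand-ins for the annotation and mutation tables (all values invented).
annot_toy = pd.DataFrame({
    "DepMap_ID": ["ACH-1", "ACH-2", "ACH-3", "ACH-4", "ACH-5"],
    "primary_disease": ["Ovarian Cancer", "Ovarian Cancer", "Breast Cancer",
                        "Breast Cancer", "Lung Cancer"],
})
maf_toy = pd.DataFrame({
    "DepMap_ID": ["ACH-1", "ACH-1", "ACH-3", "ACH-5"],
    "Hugo_Symbol": ["BRCA1", "TP53", "BRCA2", "KRAS"],
})

# Cell lines carrying at least one BRCA1 or BRCA2 mutation.
is_brca = maf_toy["Hugo_Symbol"].isin(["BRCA1", "BRCA2"])
brca_ids = set(maf_toy.loc[is_brca, "DepMap_ID"])

# Fraction of mutated cell lines within each disease.
annot_toy["BRCA_mut"] = annot_toy["DepMap_ID"].isin(brca_ids)
prevalence = (annot_toy.groupby("primary_disease")["BRCA_mut"]
              .mean()
              .sort_values(ascending=False))
print(prevalence)
```

Taking the mean of a boolean column gives the fraction of mutated lines per group, which is easier to compare across diseases than raw counts because the number of cell lines per disease varies widely.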
You can find the cell line sample info from DepMap in the shared Google Drive here:
[sample_info.csv]
Now let’s load it and do some simple manipulation.
# Set data file location
# If you are running notebooks on your laptop, change this to the directory
# where you put downloaded files
from pathlib import Path
DATA = Path("/content/drive/My Drive/2022 ECBM E4060/data/2022-10-24")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Use the full option names; abbreviated patterns can match several options
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
annot = pd.read_csv(DATA / "sample_info.csv")
annot.head()
DepMap_ID cell_line_name stripped_cell_line_name CCLE_Name alias COSMICID sex source Achilles_n_replicates cell_line_NNMD culture_type culture_medium cas9_activity RRID WTSI_Master_Cell_ID sample_collection_site primary_or_metastasis primary_disease Subtype age Sanger_Model_ID depmap_public_comments lineage lineage_subtype lineage_sub_subtype lineage_molecular_subtype
0 ACH-000001 NIH:OVCAR-3 NIHOVCAR3 NIHOVCAR3_OVARY OVCAR3 905933.0 Female ATCC NaN NaN NaN NaN NaN CVCL_0465 2201.0 ascites Metastasis Ovarian Cancer Adenocarcinoma, high grade serous 60 SIDM00105 NaN ovary ovary_adenocarcinoma high_grade_serous NaN
1 ACH-000002 HL-60 HL60 HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE NaN 905938.0 Female ATCC NaN NaN NaN NaN NaN CVCL_0002 55.0 haematopoietic_and_lymphoid_tissue Primary Leukemia Acute Myelogenous Leukemia (AML), M3 (Promyelo… 35 SIDM00829 NaN blood AML M3 NaN
2 ACH-000003 CACO2 CACO2 CACO2_LARGE_INTESTINE CACO2, CaCo-2 NaN Male ATCC NaN NaN NaN NaN NaN CVCL_0025 NaN Colon NaN Colon/Colorectal Cancer Adenocarcinoma NaN SIDM00891 NaN colorectal colorectal_adenocarcinoma NaN NaN
3 ACH-000004 HEL HEL HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE NaN 907053.0 Male DSMZ 2.0 -3.079202 Suspension RPMI + 10% FBS 52.4 CVCL_0001 783.0 haematopoietic_and_lymphoid_tissue NaN Leukemia Acute Myelogenous Leukemia (AML), M6 (Erythrol… 30 SIDM00594 NaN blood AML M6 NaN
4 ACH-000005 HEL 92.1.7 HEL9217 HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE NaN NaN Male ATCC 2.0 -2.404409 Suspension RPMI + 10% FBS 86.6 CVCL_2481 NaN bone_marrow NaN Leukemia Acute Myelogenous Leukemia (AML), M6 (Erythrol… 30 SIDM00593 NaN blood AML M6 NaN
annot[annot.stripped_cell_line_name.str.lower() == "hela"]
DepMap_ID cell_line_name stripped_cell_line_name CCLE_Name alias COSMICID sex source Achilles_n_replicates cell_line_NNMD culture_type culture_medium cas9_activity RRID WTSI_Master_Cell_ID sample_collection_site primary_or_metastasis primary_disease Subtype age Sanger_Model_ID depmap_public_comments lineage lineage_subtype lineage_sub_subtype lineage_molecular_subtype
1084 ACH-001086 NaN HELA HELA_CERVIX HeLa 1298134.0 Female ATCC NaN NaN NaN NaN NaN CVCL_0030 1799.0 cervix NaN Cervical Cancer Carcinoma NaN SIDM00846 NaN cervix cervical_carcinoma NaN NaN
annot.groupby("primary_disease").size().sort_values(ascending=False)
primary_disease
Lung Cancer 337
Skin Cancer 282
Brain Cancer 248
Leukemia 197
Lymphoma 152
Colon/Colorectal Cancer 137
Sarcoma 119
Breast Cancer 119
Ovarian Cancer 112
Bone Cancer 112
Head and Neck Cancer 96
Gastric Cancer 92
Neuroblastoma 87
Pancreatic Cancer 84
Engineered 83
Kidney Cancer 79
Esophageal Cancer 77
Endometrial/Uterine Cancer 52
Myeloma 48
Unknown 48
Bile Duct Cancer 46
Fibroblast 44
Bladder Cancer 40
Liver Cancer 32
Non-Cancerous 29
Eye Cancer 29
Rhabdoid 27
Cervical Cancer 26
Thyroid Cancer 25
Liposarcoma 24
Prostate Cancer 20
Embryonal Cancer 12
Gallbladder Cancer 11
Teratoma 4
Prostate cancer 4
Adrenal Cancer 2
Thymic Cancer 1
Acute Myeloid Leukemia 1
Oesphageal Adenocarcinoma 1
Immortalized 1
Carcinoma 1
Breast cancer 1
lung cancer 1
dtype: int64
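The counts above contain a few labels that differ only in capitalization (for example "Prostate Cancer" and "Prostate cancer", or "Lung Cancer" and "lung cancer"). A minimal sketch of one way to merge such labels before counting is shown below on a toy Series; whether title-casing is appropriate for every label in the real file is an assumption that should be checked, and it will not merge spelling variants such as "Oesphageal Adenocarcinoma".

```python
import pandas as pd

# Toy labels copied from the kinds of variants seen in the output above.
labels = pd.Series(
    ["Lung Cancer", "lung cancer", "Prostate Cancer", "Prostate cancer",
     "Breast cancer"]
)

# Strip stray whitespace and title-case so case variants collapse together.
normalized = labels.str.strip().str.title()
label_counts = normalized.value_counts()
print(label_counts)
```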
# Get cell line count for each site and histology
annot.groupby(["lineage", "lineage_subtype"]).size()
lineage lineage_subtype
adrenal_cortex adrenal_carcinoma 2
bile_duct cholangiocarcinoma 46
gallbladder_adenocarcinoma 11
blood ALL 69
AML 73
...
uterus endometrial_squamous 3
endometrial_stromal_sarcoma 1
mullerian_carcinoma 1
uterine_carcinosarcoma 2
uterus_mixed 1
Length: 128, dtype: int64
Drug sensitivity data
The Profiling Relative Inhibition Simultaneously in Mixtures (PRISM) project within DepMap uses high-throughput multiplexed screening on hundreds of human cancer cell lines. It allows researchers to validate the mechanisms of action (MoA) of drugs or to identify potential novel targets of existing drugs for drug repurposing. You can find the screening data from PRISM here:
[secondary-screen-dose-response-curve-parameters.csv]
prism = pd.read_csv(DATA / "secondary-screen-dose-response-curve-parameters.csv")
prism.head()
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (14,15) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
broad_id depmap_id ccle_name screen_id upper_limit lower_limit slope r2 auc ec50 ic50 name moa target disease.area indication smiles phase passed_str_profiling row_name
0 BRD-K71847383-001-12-5 ACH-000879 MFE296_ENDOMETRIUM HTS002 1 2.122352 -0.022826 -0.026964 1.677789 8.415093e+06 NaN cytarabine ribonucleotide reductase inhibitor POLA1, POLB, POLD1, POLE hematologic malignancy acute lymphoblastic leukemia (ALL), chronic ly… Launched True ACH-000879
1 BRD-K71847383-001-12-5 ACH-000320 PSN1_PANCREAS HTS002 1 1.325174 -0.237504 -0.147274 1.240300 9.643742e+00 NaN cytarabine ribonucleotide reductase inhibitor POLA1, POLB, POLD1, POLE hematologic malignancy acute lymphoblastic leukemia (ALL), chronic ly… Launched True ACH-000320
2 BRD-K71847383-001-12-5 ACH-001145 OC316_OVARY HTS002 1 2.089350 -0.302937 0.193893 1.472333 2.776687e-02 NaN cytarabine ribonucleotide reductase inhibitor POLA1, POLB, POLD1, POLE hematologic malignancy acute lymphoblastic leukemia (ALL), chronic ly… Launched True ACH-001145
3 BRD-K71847383-001-12-5 ACH-000873 KYSE270_OESOPHAGUS HTS002 1 1.311820 -0.209393 -0.005460 1.207160 2.654701e+00 NaN cytarabine ribonucleotide reductase inhibitor POLA1, POLB, POLD1, POLE hematologic malignancy acute lymphoblastic leukemia (ALL), chronic ly… Launched True ACH-000873
4 BRD-K71847383-001-12-5 ACH-000855 KYSE150_OESOPHAGUS HTS002 1 1.369799 -0.277530 0.132818 1.229332 5.889041e-01 NaN cytarabine ribonucleotide reductase inhibitor POLA1, POLB, POLD1, POLE hematologic malignancy acute lymphoblastic leukemia (ALL), chronic ly… Launched True ACH-000855
In this dataset, each row is a cell line-drug pair, where the depmap_id column gives the cell line ID used in the cell line info table above. The compound (drug) name is in the name column. If the molecule is designed to target a certain gene, that is annotated in the target and moa columns. This allows us to investigate how inhibition of a certain pathway affects the viability of cancer cell lines with a given mutation profile.
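Because depmap_id in the screening table and DepMap_ID in the annotation table refer to the same identifiers, the two tables can be joined. The sketch below shows one way to do that on toy data; the AUC values and the ID "ACH-999999" are invented, and the `indicator=True` check is only a suggested way to spot screened cell lines that lack an annotation row.

```python
import pandas as pd

# Toy screening rows (AUC values invented; the last ID is hypothetical).
prism_toy = pd.DataFrame({
    "depmap_id": ["ACH-000001", "ACH-000002", "ACH-999999"],
    "name": ["cytarabine", "cytarabine", "cytarabine"],
    "auc": [1.2, 0.7, 0.9],
})
# Toy annotation rows using lineages shown in the sample info preview above.
annot_toy = pd.DataFrame({
    "DepMap_ID": ["ACH-000001", "ACH-000002"],
    "lineage": ["ovary", "blood"],
})

# Left join keeps every screened pair and marks where annotation is missing.
merged = prism_toy.merge(
    annot_toy, left_on="depmap_id", right_on="DepMap_ID",
    how="left", indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "depmap_id"].tolist()
print(merged[["depmap_id", "name", "auc", "lineage"]])
print("No annotation for:", unmatched)
```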
Biomarker identification: target gene
To understand the potential target subpopulation of a drug, the easiest path is to follow the biology. The goal is to find the subpopulation (defined by a specific genomic feature such as mutations, copy number variation, or over- / under-expression) in which the drug has the highest potency. We’ll take the EGFR inhibitor erlotinib as an example.
Erlotinib is a compound targeting the epidermal growth factor receptor (EGFR). In multiple cancer types, such as lung cancer and pancreatic cancer, EGFR is often mutated or overexpressed, which leads to uncontrolled growth of the cells. We expect to see increased sensitivity to the drug in cell lines with these genomic profiles.
You can download the expression profiles and mutation information here:
[CCLE_expression.csv]
[CCLE_mutations.csv]
Let’s first check the association between erlotinib activity and EGFR mutations in cell lines.
# Load the mutation annotation (MAF) file
maf = pd.read_csv(DATA / "CCLE_mutations.csv")
maf.head()
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (3,19,22,27,28,29,30,31) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Hugo_Symbol Entrez_Gene_Id NCBI_Build Chromosome Start_position End_position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 dbSNP_RS dbSNP_Val_Status Genome_Change Annotation_Transcript DepMap_ID cDNA_Change Codon_Change Protein_Change isDeleterious isTCGAhotspot TCGAhsCnt isCOSMIChotspot COSMIChsCnt ExAC_AF Variant_annotation CGA_WES_AC HC_AC RD_AC RNAseq_AC SangerWES_AC WGS_AC
0 VPS13D 55187 37 1 12359347 12359347 + Nonsense_Mutation SNP C A NaN NaN g.chr1:12359347C>A ENST00000358136.3 ACH-000001 c.6122C>A c.(6121-6123)tCa>tAa p.S2041* True False NaN False 0.0 NaN damaging 34:213 NaN NaN NaN 34:221 NaN
1 AADACL4 343066 37 1 12726308 12726322 + In_Frame_Del DEL CTGGCGTGACGCCAT – rs58218425|rs139261871|rs369427733|rs560787141 byFrequency g.chr1:12726308_12726322delCTGGCGTGACGCCAT ENST00000376221.1 ACH-000001 c.786_800delCTGGCGTGACGCCAT c.(784-801)tcctggcgtgacgccatc>tcc p.WRDAI263del False False NaN False 3.0 NaN other non-conserving 57:141 NaN NaN NaN 9:0 28:32
2 IFNLR1 163702 37 1 24484172 24484172 + Silent SNP G A NaN NaN g.chr1:24484172G>A ENST00000327535.1 ACH-000001 c.1011C>T c.(1009-1011)ggC>ggT p.G337G False False NaN False 0.0 NaN silent 118:0 NaN NaN 10:0 118:0 18:0
3 TMEM57 55219 37 1 25785018 25785019 + Frame_Shift_Ins INS – A NaN NaN g.chr1:25785018_25785019insA ENST00000374343.4 ACH-000001 c.789_790insA c.(790-792)aaafs p.K264fs True False 0.0 False 0.0 NaN damaging NaN NaN NaN 6:28 NaN NaN
4 ZSCAN20 7579 37 1 33954141 33954141 + Missense_Mutation SNP T G NaN NaN g.chr1:33954141T>G ENST00000361328.3 ACH-000001 c.494T>G c.(493-495)gTg>gGg p.V165G False False NaN False 0.0 NaN other non-conserving 28:62 NaN NaN NaN 27:61 NaN
This MAF file uses slightly different annotations from the TCGA files we worked with before. Quality filtering has already been applied (as we did earlier with FILTER == PASS), and instead of an IMPACT column there are two separate annotations: one algorithm flags whether a mutation is deleterious in the isDeleterious column, and another flags whether it is damaging in the Variant_annotation column. We’ll use the union of the two, plus the mutations flagged in isCOSMIChotspot, which indicates whether the mutation has been observed as a hotspot in the COSMIC database. Hotspot mutations recur frequently in tumors, which usually suggests that they give cancer cells an evolutionary advantage. We will test whether the EGFR inhibitor suppresses cancer cell lines harboring these EGFR mutations.
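The union filter described above can be sketched on a toy table. The DtypeWarning shown earlier suggests that some columns were read with mixed types; if the flag columns are among them (an assumption, not something verified here), converting them to clean booleans first makes the `|` combination behave predictably. The helper `to_bool` is hypothetical and written only for this illustration.

```python
import pandas as pd

# Toy MAF-like table; isDeleterious deliberately mixes bools, strings and None.
maf_toy = pd.DataFrame({
    "Hugo_Symbol": ["EGFR", "EGFR", "EGFR", "KRAS"],
    "isDeleterious": [True, "False", None, True],
    "Variant_annotation": ["damaging", "other non-conserving", "silent", "damaging"],
    "isCOSMIChotspot": [False, True, False, True],
})

def to_bool(col):
    """Map True / 'True' to True; anything else (False, 'False', missing) to False."""
    return col.astype(str).str.lower().eq("true")

# Union of the three annotations, restricted to EGFR.
mask = (
    maf_toy["Hugo_Symbol"].eq("EGFR")
    & (to_bool(maf_toy["isDeleterious"])
       | maf_toy["Variant_annotation"].eq("damaging")
       | to_bool(maf_toy["isCOSMIChotspot"]))
)
egfr_toy = maf_toy[mask]
print(egfr_toy)
```

Treating missing flags as False is a design choice made for this sketch: it keeps a mutation only when at least one annotation positively supports it.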
egfr = maf[(maf.Hugo_Symbol == "EGFR") &
           (maf.isDeleterious | (maf.Variant_annotation == "damaging") | maf.isCOSMIChotspot)]
egfr
Hugo_Symbol Entrez_Gene_Id NCBI_Build Chromosome Start_position End_position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 dbSNP_RS dbSNP_Val_Status Genome_Change Annotation_Transcript DepMap_ID cDNA_Change Codon_Change Protein_Change isDeleterious isTCGAhotspot TCGAhsCnt isCOSMIChotspot COSMIChsCnt ExAC_AF Variant_annotation CGA_WES_AC HC_AC RD_AC RNAseq_AC SangerWES_AC WGS_AC
3306 EGFR 1956 37 7 55242466 55242480 + In_Frame_Del DEL GAATTAAGAGAAGCA – rs121913438|rs121913439|rs397517099|rs39751709… NaN g.chr7:55242466_55242480delGAATTAAGAGAAGCA ENST00000455089.1 ACH-000012 c.2101_2115delGAATTAAGAGAAGCA c.(2101-2115)gaattaagagaagcadel p.ELREA701del False True 8.0 True 1571.0 NaN other non-conserving 239:73 966:51 NaN 161:72 37:0 521:151
9521 EGFR 1956 37 7 55242466 55242480 + In_Frame_Del DEL GAATTAAGAGAAGCA – rs121913438|rs121913439|rs397517099|rs39751709… NaN g.chr7:55242466_55242480delGAATTAAGAGAAGCA ENST00000455089.1 ACH-000029 c.2101_2115delGAATTAAGAGAAGCA c.(2101-2115)gaattaagagaagcadel p.ELREA701del False True 8.0 True 1571.0 NaN other non-conserving 1951:337 NaN NaN 206:52 NaN NaN
10022 EGFR 1956 37 7 55242465 55242479 + In_Frame_Del DEL GGAATTAAGAGAAGC – rs121913438|rs121913439|rs397517099|rs39751709… NaN g.chr7:55242465_55242479delGGAATTAAGAGAAGC ENST00000455089.1 ACH-000030 c.2100_2114delGGAATTAAGAGAAGC c.(2098-2115)aaggaattaagagaagca>aaa p.ELREA701del False True 13.0 True 1576.0 NaN other non-conserving 28:35 28:10 NaN 161:59 39:0 73:19
10955 EGFR 1956 37 7 55242465 55242479 + In_Frame_Del DEL GGAATTAAGAGAAGC – rs121913438|rs121913439|rs397517099|rs39751709… NaN g.chr7:55242465_55242479delGGAATTAAGAGAAGC ENST00000455089.1 ACH-000035 c.2100_2114delGGAATTAAGAGAAGC c.(2098-2115)aaggaattaagagaagca>aaa p.ELREA701del False True 13.0 True 1576.0 NaN other non-conserving 27:57 77:102 NaN 69:137 27:0 35:33
20996 EGFR 1956 37 7 55221822 55221822 + Missense_Mutation SNP C A rs149840192 NaN g.chr7:55221822C>A ENST00000455089.1 ACH-000067 c.731C>A c.(730-732)gCc>gAc p.A244D False True 28.0 True 23.0 NaN other non-conserving 29:48 47:141 NaN 107:256 28:49 NaN
40727 EGFR 1956 37 7 55242465 55242466 + Frame_Shift_Del DEL GG – rs121913422|rs121913421|rs397517094|rs12191342… NaN g.chr7:55242465_55242466delGG ENST00000275493.2 ACH-000150 c.2235_2236delGG c.(2233-2238)aaggaafs p.E746fs True True 13.0 True 1560.0 NaN damaging NaN NaN NaN 41:180 NaN NaN
40728 EGFR 1956 37 7 55242470 55242485 + Frame_Shift_Del DEL TAAGAGAAGCAACATC – rs397517100|rs121913438|rs121913439|rs39751709… NaN g.chr7:55242470_55242485delTAAGAGAAGCAACATC ENST00000275493.2 ACH-000150 c.2240_2255delTAAGAGAAGCAACATC c.(2239-2256)ttaagagaagcaacatctfs p.LREATS747fs True False 0.0 True 2229.0 NaN damaging NaN NaN NaN 41:232 NaN NaN
47638 EGFR 1956 37 7 55242470 55242487 + In_Frame_Del DEL TAAGAGAAGCAACATCTC – rs397517100|rs121913438|rs121913439|rs39751709… NaN g.chr7:55242470_55242487delTAAGAGAAGCAACATCTC ENST00000275493.2 ACH-000176 c.2240_2257delTAAGAGAAGCAACATCTC c.(2239-2259)ttaagagaagcaacatctccg>tcg p.747_753LREATSP>S False False 0.0 True 2232.0 NaN other non-conserving NaN 37:101 NaN 73:314 12:0 NaN
47639 EGFR 1956 37 7 55242494 55242494 + Missense_Mutation SNP C A rs121913463|rs397517100|rs397517099 NaN g.chr7:55242494C>A ENST00000275493.2 ACH-000176 c.2264C>A c.(2263-2265)gCc>gAc p.A755D False False 0.0 True 41.0 NaN other non-conserving NaN 36:83 NaN NaN NaN NaN
199965 EGFR 1956 37 7 55249071 55249071 + Missense_Mutation SNP C T rs121434569 NaN g.chr7:55249071C>T ENST00000455089.1 ACH-000587 c.2234C>T c.(2233-2235)aCg>aTg p.T745M False False NaN True 109.0 0.000041 other non-conserving 103:30 398:147 NaN 368:114 105:30 40:8
199966 EGFR 1956 37 7 55259515 55259515 + Missense_Mutation SNP T G rs121434568 NaN g.chr7:55259515T>G ENST00000455089.1 ACH-000587 c.2438T>G c.(2437-2439)cTg>cGg p.L813R False True 28.0 True 1491.0 NaN other non-conserving 103:36 10:3 NaN 321:80 109:38 48:16
232460 EGFR 1956 37 7 55221822 55221822 + Missense_Mutation SNP C T rs149840192 NaN g.chr7:55221822C>T ENST00000455089.1 ACH-000655 c.731C>T c.(730-732)gCc>gTc p.A244V False True 28.0 True 23.0 NaN other non-conserving 113:274 NaN NaN 155:221 14:37 NaN
306149 EGFR 1956 37 7 55210093 55210093 + Frame_Shift_Del DEL C – NaN NaN g.chr7:55210093delC ENST00000455089.1 ACH-000784 c.203delC c.(202-204)accfs p.T68fs True False NaN False 0.0 NaN damaging 65:93 43:55 NaN 75:269 50:19 20:17
361647 EGFR 1956 37 7 55242482 55242482 + Missense_Mutation SNP C T rs121913438|rs397517100|rs397517099|rs12191342… NaN g.chr7:55242482C>T ENST00000455089.1 ACH-000817 c.2117C>T c.(2116-2118)aCa>aTa p.T706I False False NaN True 383.0 0.000008 other non-conserving 20:55 31:99 NaN NaN 20:58 7:47
372536 EGFR 1956 37 7 55221766 55221766 + Nonsense_Mutation SNP C A NaN NaN g.chr7:55221766C>A ENST00000455089.1 ACH-000831 c.675C>A c.(673-675)taC>taA p.Y225* True False NaN False 0.0 NaN damaging 31:55 50:92 NaN NaN NaN NaN
377236 EGFR 1956 37 7 55219054 55219054 + Splice_Site SNP A C NaN NaN g.chr7:55219054A>C ENST00000455089.1 ACH-000838 c.492A>C c.(490-492)aaA>aaC p.K164N True False NaN False 0.0 NaN damaging 30:28 38:49 NaN NaN 43:35 NaN
400156 EGFR 1956 37 7 55249005 55249005 + Missense_Mutation SNP G T rs146024686|rs121913465|rs397517108 NaN g.chr7:55249005G>T ENST00000455089.1 ACH-000865 c.2168G>T c.(2167-2169)aGc>aTc p.S723I False True 6.0 True 29.0 NaN other non-conserving 51:165 10:37 NaN 252:682 51:168 28:81
409397 EGFR 1956 37 7 55259524 55259524 + Missense_Mutation SNP T A rs121913444 NaN g.chr7:55259524T>A ENST00000455089.1 ACH-000873 c.2447T>A c.(2446-2448)cTg>cAg p.L816Q False True 8.0 True 59.0 NaN other non-conserving 56:175 NaN 134:583 167:544 58:185 NaN
440538 EGFR 1956 37 7 55242502 55242502 + Missense_Mutation SNP G A rs121913463|rs397517100 NaN g.chr7:55242502G>A ENST00000455089.1 ACH-000899 c.2137G>A c.(2137-2139)Gaa>Aaa p.E713K False False NaN True 32.0 NaN other non-conserving 30:70 25:72 NaN NaN NaN NaN
478626 EGFR 1956 37 7 55259515 55259515 + Missense_Mutation SNP T G rs121434568 NaN g.chr7:55259515T>G ENST00000275493.2 ACH-000924 c.2573T>G c.(2572-2574)cTg>cGg p.L858R False True 28.0 True 2267.0 NaN other non-conserving NaN NaN NaN 14:242 NaN NaN
508753 EGFR 1956 37 7 55270249 55270249 + Nonsense_Mutation SNP C T NaN NaN g.chr7:55270249C>T ENST00000455089.1 ACH-000938 c.3067C>T c.(3067-3069)Cga>Tga p.R1023* True False NaN False 0.0 0.000008 damaging 77:79 27:43 NaN NaN 77:79 NaN
511290 EGFR 1956 37 7 55240709 55240709 + Frame_Shift_Del DEL G – NaN NaN g.chr7:55240709delG ENST00000455089.1 ACH-000939 c.1818delG c.(1816-1818)gtgfs p.V606fs True False NaN False 0.0 NaN damaging 33:29 NaN 317:354 14:78 25:8 13:10
570033 EGFR 1956 37 7 55241707 55241707 + Missense_Mutation SNP G A rs28929495 NaN g.chr7:55241707G>A ENST00000455089.1 ACH-000958 c.2020G>A c.(2020-2022)Ggc>Agc p.G674S False False NaN True 49.0 NaN other non-conserving 25:37 6:23 364:579 90:213 25:37 20:32
592800 EGFR 1956 37 7 55221822 55221822 + Missense_Mutation SNP C T rs149840192 NaN g.chr7:55221822C>T ENST00000455089.1 ACH-000965 c.731C>T c.(730-732)gCc>gTc p.A244V False True 28.0 True 23.0 NaN other non-conserving NaN 11:156 NaN 80:447 NaN 3:25
608528 EGFR 1956 37 7 55231511 55231511 + Nonsense_Mutation SNP G T NaN NaN g.chr7:55231511G>T ENST00000455089.1 ACH-000969 c.1582G>T c.(1582-1584)Gga>Tga p.G528* True False NaN False 0.0 NaN damaging 29:27 NaN NaN 29:44 29:28 14:12
672381 EGFR 1956 37 7 55221822 55221822 + Missense_Mutation SNP C T rs149840192 NaN g.chr7:55221822C>T ENST00000455089.1 ACH-000984 c.731C>T c.(730-732)gCc>gTc p.A244V False True 28.0 True 23.0 NaN other non-conserving 20:50 43:120 NaN 82:219 NaN NaN
679677 EGFR 1956 37 7 55242430 55242430 + Nonsense_Mutation SNP G T rs121913420 NaN g.chr7:55242430G>T ENST00000275493.2 ACH-000985 c.2200G>T c.(2200-2202)Gaa>Taa p.E734* True False 0.0 False 2.0 NaN damaging NaN 23:178 NaN NaN NaN NaN
764225 EGFR 1956 37 7 55229284 55229284 + Nonsense_Mutation SNP C T NaN NaN g.chr7:55229284C>T ENST00000455089.1 ACH-000994 c.1456C>T c.(1456-1458)Cga>Tga p.R486* True False NaN False 0.0 NaN damaging 36:143 16:72 131:399 NaN NaN NaN
916391 EGFR 1956 37 7 55268010 55268010 + Splice_Site SNP C T NaN NaN g.chr7:55268010C>T ENST00000455089.1 ACH-001333 c.2715C>T c.(2713-2715)tgC>tgT p.C905C True False NaN False 0.0 NaN damaging 122:148 NaN NaN NaN 68:86 NaN
979640 EGFR 1956 37 7 55211157 55211157 + Nonsense_Mutation SNP G T NaN NaN g.chr7:55211157G>T ENST00000455089.1 ACH-001522 c.400G>T c.(400-402)Gag>Tag p.E134* True False NaN False 0.0 NaN damaging 55:245 NaN NaN NaN NaN NaN
993471 EGFR 1956 37 7 55223639 55223639 + Splice_Site SNP G A NaN NaN g.chr7:55223639G>A ENST00000455089.1 ACH-001550 c.871G>A c.(871-873)Gtg>Atg p.V291M True False NaN False 0.0 NaN damaging 110:109 NaN NaN NaN 24:23 NaN
1000385 EGFR 1956 37 7 55242473 55242473 + Missense_Mutation SNP G A rs121913438|rs121913439|rs397517098|rs12191342… NaN g.chr7:55242473G>A ENST00000455089.1 ACH-001566 c.2108G>A c.(2107-2109)aGa>aAa p.R703K False False NaN True 1547.0 NaN other non-conserving 255:273 NaN NaN NaN NaN NaN
Next we extract the drug screening results. Note that for erlotinib there are two screen_id values representing different screening sets. We’ll use screen_id == "HTS002" as it contains more cell lines.
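One way to check which screen covers more cell lines for a given compound is to count unique cell lines per screen_id. The sketch below does this on toy data; the second screen name "SCREEN_B" and all IDs are hypothetical placeholders, not values taken from the PRISM file.

```python
import pandas as pd

# Toy screening table (IDs and the second screen name are invented).
prism_toy = pd.DataFrame({
    "name": ["erlotinib"] * 5 + ["cytarabine"],
    "screen_id": ["HTS002", "HTS002", "HTS002", "SCREEN_B", "SCREEN_B", "HTS002"],
    "depmap_id": ["ACH-1", "ACH-2", "ACH-3", "ACH-1", "ACH-2", "ACH-1"],
})

# Number of distinct cell lines screened per screen for one compound.
per_screen = (prism_toy[prism_toy["name"] == "erlotinib"]
              .groupby("screen_id")["depmap_id"]
              .nunique())
best_screen = per_screen.idxmax()
print(per_screen)
print("Screen with most cell lines:", best_screen)
```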
# Keep erlotinib results from screen HTS002 that passed STR profiling and
# come from cell lines with mutation calls available
plotdata = prism.loc[
    (prism.screen_id == "HTS002") &
    (prism.passed_str_profiling) &
    (prism.name == "erlotinib") &
    prism.depmap_id.isin(set(maf.DepMap_ID)),
    ["depmap_id", "auc", "ec50", "ic50"]
].copy()  # copy so the assignments below do not act on a view of prism
plotdata["EGFR_mut"] = False
plotdata.loc[plotdata.depmap_id.isin(set(egfr.DepMap_ID)), "EGFR_mut"] = True
# Some EC50 outliers have extremely high values, so we cap them at an
# upper limit of 10
plotdata.loc[plotdata.ec50 > 10, "ec50"] = 10
plotdata.head()
depmap_id auc ec50 ic50 EGFR_mut
37303 ACH-000879 0.778186 0.481569 0.603944 False
37304 ACH-000320 1.421381 0.378348 NaN False
37305 ACH-001145 1.096977 0.009129 NaN False
37306 ACH-000873 0.775731 0.556030 0.767366 True
37307 ACH-000855 0.749209 0.365400 0.594647 False
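The capping step above can also be written with `Series.clip`, which leaves missing values untouched. A small sketch on invented numbers, showing that it gives the same result as the `.loc` assignment:

```python
import pandas as pd

# Invented EC50 values, including one extreme outlier and one missing value.
ec50 = pd.Series([0.48, 0.009, 8.4e6, 2.65, float("nan")], name="ec50")

# Approach used in the notebook: overwrite values above the limit.
capped_loc = ec50.copy()
capped_loc[capped_loc > 10] = 10

# Equivalent one-liner: clip at the upper limit (NaN stays NaN).
capped_clip = ec50.clip(upper=10)

print(capped_clip)
```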
from scipy.stats import mannwhitneyu
import seaborn as sns
# Compare AUC between EGFR-mutant and EGFR wild-type cell lines
mwu, pval = mannwhitneyu(plotdata.loc[plotdata.EGFR_mut, "auc"],
                         plotdata.loc[~plotdata.EGFR_mut, "auc"])
ax = sns.boxplot(x="EGFR_mut", y="auc", data=plotdata, fliersize=0)
ax = sns.swarmplot(x="EGFR_mut", y="auc", data=plotdata, color="k",
                   alpha=0.5, ax=ax)
ax.text(1, 1.6, "M-W U P-value = {}".format(np.round(pval, 4)), ha="center", va="center")
/usr/local/lib/python3.7/dist-packages/seaborn/categorical.py:1296: UserWarning: 21.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
warnings.warn(msg, UserWarning)
Text(1, 1.6, 'M-W U P-value = 0.0387')
from scipy.stats import mannwhitneyu
import seaborn as sns
# Compare (capped) EC50 between EGFR-mutant and EGFR wild-type cell lines
mwu, pval = mannwhitneyu(plotdata.loc[plotdata.EGFR_mut, "ec50"],
                         plotdata.loc[~plotdata.EGFR_mut, "ec50"])
ax = sns.boxplot(x="EGFR_mut", y="ec50", data=plotdata, fliersize=0)
ax = sns.swarmplot(x="EGFR_mut", y="ec50", data=plotdata, color="k",
                   alpha=0.5, ax=ax)
ax.text(1, 8, "M-W U P-value = {}".format(np.round(pval, 4)), ha="center", va="center")
/usr/local/lib/python3.7/dist-packages/seaborn/categorical.py:1296: UserWarning: 51.9% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
warnings.warn(msg, UserWarning)
Text(1, 8, 'M-W U P-value = 0.2233')
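A p-value alone does not say which group has the lower values or how large the difference is. The sketch below, on simulated numbers rather than PRISM data, passes `alternative` explicitly to `mannwhitneyu` (so the result does not depend on the installed SciPy version's default) and computes a simple effect size: the fraction of mutant/wild-type pairs in which the mutant value is lower. The group sizes and distributions are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Simulated AUC values (not real data): mutants drawn with a lower mean.
rng = np.random.default_rng(0)
auc_mut = rng.normal(0.8, 0.2, size=20)
auc_wt = rng.normal(1.0, 0.2, size=200)

# Two-sided test: are the distributions different in either direction?
_, p_two = mannwhitneyu(auc_mut, auc_wt, alternative="two-sided")
# One-sided test: are mutant values shifted lower than wild-type values?
_, p_less = mannwhitneyu(auc_mut, auc_wt, alternative="less")

# Effect size: share of (mutant, wild-type) pairs where the mutant AUC is lower.
frac_mut_lower = np.mean(auc_mut[:, None] < auc_wt[None, :])

print("two-sided p:", p_two)
print("one-sided p:", p_less)
print("fraction of pairs with mutant lower:", frac_mut_lower)
```

A fraction near 0.5 would indicate no separation between groups, while values closer to 1 indicate that mutant lines tend to have lower AUC.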
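The distribution of area under the curve (AUC) is lower in the EGFR-mutant cell lines than in the EGFR wild-type cell lines (Mann-Whitney U p-value = 0.0387). AUC is the complement of the activity area, so a lower AUC means greater sensitivity. EC50 also tends to be lower in the mutant cell lines, but that difference is not statistically significant (p-value = 0.2233). Next, let’s see if we can further discriminate the cell lines by including the gene expression data.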
ge = pd.read_csv(DATA / "CCLE_expression.csv", index_col=0)
# Remove the Entrez ID suffix from gene names, keeping the text before " ("
ge.columns = [x.split(" (")[0] for x in ge.columns]
ge.head()
TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR CFH FUCA2 GCLC NFYA STPG1 NIPAL3 LAS1L ENPP4 SEMA3F CFTR ANKIB1 CYP51A1 KRIT1 RAD52 BAD LAP3 CD99 HS3ST1 AOC1 WNT16 HECW1 MAD1L1 LASP1 SNX11 TMEM176A M6PR KLHL13 CYP26B1 ICA1 DBNDD1 ALS2 CASP10 CFLAR TFPI NDUFAF7 RBM5 MTMR7 SLC7A2 ARF5 SARM1 POLDIP2 PLXND1 AK2 CD38 … FAM240A FAM95C LITAFD PRRT1B BX276092.9 ETDC LMLN2 MYOCOS HSFX3 VSIG10L2 PRSS50 CPHXL AC131160.1 TPTEP2-CSNK1E GNG14 SLURP2 AC069544.2 SCO2 C2orf81 PERCC1 THSD8 LYNX1-SLURP2 OR8B3 OR4F16 OR8B2 TMEM247 SMIM38 OR8S1 OR4F29 EEF1AKMT4 AC022414.1 TBCE SMIM41 AC008397.1 GCSAML-AS1 CCDC39 EEF1AKMT4-ECE2 AP000812.4 UPK3BL2 AC093512.2 ARHGAP11B AC004593.2 AC090517.4 AL160269.1 ABCF2-H2BE1 POLR2J3 H2BE1 AL445238.1 GET1-SH3BGR AC113348.1
ACH-001113 4.990501 0.000000 7.273702 2.765535 4.480265 0.028569 1.269033 3.058316 6.483171 5.053980 3.456806 4.415488 4.766595 2.280956 3.237258 0.000000 5.125982 6.636770 5.638364 3.881665 5.156639 4.775051 5.904966 0.097611 0.111031 0.042644 2.847997 3.336283 5.371210 4.313971 0.000000 7.536830 5.207893 2.965323 1.922198 2.049631 4.478972 2.077243 5.101818 0.056584 4.614710 5.286142 0.545968 1.613532 7.381197 2.611172 5.929081 5.293885 6.730640 0.176323 … 0.0 2.724650 0.042644 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.748461 0.042644 1.782409 0.356144 0.650765 0.411426 2.895303 0.765535 1.731183 0.056584 0.695994 0.070389 0.0 0.070389 0.0 0.0 0.389567 0.0 0.070389 4.578335 0.000000 5.761019 0.0 0.000000 0.0 0.028569 0.000000 1.464668 5.234961 4.139961 1.214125 0.000000 0.111031 0.150560 1.427606 5.781884 0.0 0.000000 0.799087 0.000000
ACH-001289 5.209843 0.545968 7.070604 2.538538 3.510962 0.000000 0.176323 3.836934 4.200850 3.832890 1.910733 3.374344 4.861955 3.625270 1.275007 0.028569 5.177121 7.130313 5.061776 3.023255 5.542258 6.305423 6.641546 0.084064 0.014355 0.275007 0.189034 2.903038 4.955127 4.421560 0.042644 7.133091 2.861955 0.124328 3.513491 4.056584 4.286142 0.333424 4.520422 0.070389 3.987321 6.192391 3.163499 2.185867 7.792530 2.427606 6.269407 3.785551 7.327059 0.137504 … 0.0 0.918386 0.000000 0.042644 0.000000 0.0 0.0 0.000000 0.111031 0.0 0.000000 0.000000 1.799087 0.526069 0.000000 0.000000 4.202418 0.298658 0.765535 0.000000 4.173927 0.189034 0.0 0.014355 0.0 0.0 0.028569 0.0 0.014355 2.182692 0.042644 5.771357 0.0 0.000000 0.0 1.090853 0.000000 1.490570 0.941106 4.107688 1.835924 0.000000 0.310340 0.000000 0.807355 4.704319 0.0 0.000000 0.464668 0.070389
ACH-001339 3.779260 0.000000 7.346425 2.339137 4.254745 0.056584 1.339137 6.724241 3.671293 3.775051 2.895303 3.613532 4.300856 0.799087 0.275007 0.042644 4.149747 5.655352 4.858976 2.675816 4.560715 6.170125 8.182245 0.389567 0.084064 0.084064 0.084064 5.733625 6.274262 4.407353 0.214125 7.361417 0.137504 0.454176 2.301588 3.317594 3.746313 1.232661 4.750607 3.914565 3.723559 5.008541 0.545968 1.400538 6.864558 2.241840 6.123087 7.242793 8.119875 0.310340 … 0.0 0.000000 0.000000 0.000000 0.028569 0.0 0.0 0.028569 0.028569 0.0 0.014355 0.000000 0.992768 0.201634 0.000000 0.014355 3.209453 0.344828 0.555816 0.000000 2.952334 0.070389 0.0 0.014355 0.0 0.0 0.028569 0.0 0.014355 3.012569 0.000000 4.744699 0.0 0.000000 0.0 0.000000 0.000000 0.985500 1.124328 2.313246 1.823749 0.084064 0.176323 0.042644 1.384050 4.931683 0.0 0.028569 0.263034 0.000000
ACH-001538 5.726831 0.000000 7.086189 2.543496 3.102658 0.000000 5.914565 6.099716 4.475733 4.294253 2.472488 4.573496 5.314697 3.488001 2.980025 0.028569 3.872829 6.176921 3.714795 2.726831 5.565293 5.230357 6.811728 2.657640 0.084064 0.124328 0.367371 4.161888 6.703211 3.541019 0.070389 7.208478 0.014355 1.124328 3.9467