PART 1: Python Primer¶
Variable definition, manipulation¶
String manipulation¶
For example, let us consider DNA, taking the opportunity to also start using some elementary aspects of Python. Each chain of DNA nucleotides is called a DNA strand, and we refer to the four possible DNA nucleotides using the symbols A G T and C each corresponding to a specific well-defined arrangement of atoms. In Python, we can jointly represent the elements of a DNA sequence with a symbol, say s, which represents the a string of characters:
s = ‘ACAGTAGC’
print(“The length of s is”, len(s))
print(“s is”, s)
The length of s is 8
s is ACAGTAGC
Note that we have not assigned any numerical values to A, T, C and G, but we simply treat them as four distinct abstract “characters.” The array “s”, being an ordered succession of characters, is said to be a “character string,” We say that every character of the string s is drawn from an alphabet, which, in this case, is the set {A, G, T, C} containing the four “letters” A, G, T, C.
To recap: To each DNA chain consisting of N nucleotides there corresponds a character string consisting of the elements s[n], where n = 0, 1, 2, 3, … , N-1, and each s[n] takes values from the set {A, G, T, C}. So, in our case, Python’s response to, say, s[3] is:
indicating that the fourth element of the array s is a G, and the response to len(s) , which returns the length of the sequence, is:
indicating that the number of nucleotides in the DNA strand, denoted simply by len(s), is, in our case, N = 8. In reality, the value of N for DNA sequences stored in computers can be a very large number.
A substring of a string s is defined as another string formed by consecutive characters of s, in the same order as they appear on s. For example, CAG is a substring of the string s = ‘ACAGTAGC’. Python has a useful function, called s.find(s2), which returns the starting indices (locations) of any occurrences of the shorter of the two strings (s and s2) in the longer one. So, we can verify that CAG is a substring of s, by asking Python:
s2 = ‘CAG’
s_idx = s.find(s2)
print(s_idx,s[s_idx:s_idx+len(s2)])
This response tells us that CAG is indeed a substring of s starting from its second location s[1]. If we ask: s.find(‘CCC’), the answer will be: -1 indicating that the set of valid answers is “empty,” i.e., there is no such substring in the string s.
print(s.find(‘CCC’))
If we ask: s.find(‘AG’), the answer will be 2, which is the first occurrence of the substring. To find all occurrence of AG (at index 2 and 5), you will have to iteratively reduce your search range by excluding the substrings that have already been found:
# A list to store all found indices of `AG`
ind_arr = []
while index < len(s):
index = s.find(s2, index)
# if not found, exit
if index == -1:
print('s2 found at', index)
ind_arr.append(index)
index += len(s2) # +2 because len('AG') == 2
print(len(ind_arr))
s2 found at 2
s2 found at 5
Another way to do this is to utilize the regular expression module in Python by importing the re module for regular expression:
# A list to store all found indices of `AG`
ind_arr = []
iterator = re.finditer(s2, s)
for idx in iterator:
ind_arr.append(idx.start())
print('Found s2 at index', idx.start())
Found s2 at index 2
Found s2 at index 5
The re.finditer function does something similar to our while loop above, it gives you an callable_iterator object for you to iterate through all occurrence of the found substring.
Note that each string of length N has N-L+1 substrings of length L, the first such substring
occupying locations (1, 2, ... , L), and the last occupying locations (N-L+1, N-L+2, ... , N).
Matrix manipulation¶
Another data type we would frequently encounter is the gene expression data. The gene expression data is the quantification of the different RNA expression levels. For example we can represent the expression levels of different genes of a subject as an array:
import numpy as np
exp = np.array([1.23, 4.56, 7.89, 10.11, 12.13])
print("The expression levels of the subject is ", exp)
The expression levels of the subject is [ 1.23 4.56 7.89 10.11 12.13]
Here we use the numpy library (https://numpy.org/) to perform array manipulation. It is one of the most essential library in Python. Here each element in the array records the expression level of each gene. The first gene is 1.23, second gene 4.56, etc..
Usually, we would look at gene expression levels of multiple subjects together, which can be represented as a 2-dimensional array :
exp_set = np.array([
[2.13, 5.46, 8.79, 11.10, 13.12],
[3.12, 6.45, 9.78, 10.11, 12.13]
print(exp_set)
[[ 1.23 4.56 7.89 10.11 12.13]
[ 2.13 5.46 8.79 11.1 13.12]
[ 3.12 6.45 9.78 10.11 12.13]]
The first row of the array is the expression levels of genes of subject 1, while the second row is that of subject 2, third row subject 3.
Having a gene expression matrix allows us to perform some simple statistics. For example, if we want to know the mean expression values of gene 3 across all three patients:
print("The mean expression value of gene 3 across three patients is: ", exp_set[:, 2].mean())
The mean expression value of gene 3 across three patients is: 8.82
numpy is very powerful in terms of row-wise or column-wise calculating compared with normal Python loop. For example, if we want to calculate the variance of all genes, as can simply do
print("Variance of genes are: ", exp_set.var(axis=0))
Variance of genes are: [0.5958 0.5958 0.5958 0.2178 0.2178]
print("Variance of samples are: ", exp_set.var(axis=1))
Variance of samples are: [15.171664 15.457 10.077736]
To compare the calculation time using numpy function versus Python loop:
exp_set.var(axis=0)
36.7 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
[g.var() for g in exp_set.T]
93.3 µs ± 742 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We can see the built-in numpy calculation is at least 4 or 5 times faster than for loop. When we are dealing with large data set in the class, it is important to look for existing built-in functions provided by libraries than writing your own for loop.
PART 2: Pairwise Sequence Alignment Using Dynamic Programming¶
Pairwise Biomolecular Sequence Alignment¶
For now, we only want to consider the problem of alignment of homologous sequences from a quantitative viewpoint, and to understand precisely the method of alignment shown below.
The problem is one of optimization: Out of the numerous ways, in which two sequences can be put one above the other, find the one that maximizes some score. If we allow for insertions and deletions in large sequences, then the number of such possible ways is astronomically large, and so the "exhaustive search" way of solving the problem cannot work. Let us see how a method called "dynamic programming" can come to the rescue.
ATTCAGGTCAAGGC-TTAGCAGTCTAAACGGATCG
ATTC-GGTCAATGCATTAGCAGTCTAAA-GGATC-
import numpy as np
def global_alignment(s1, s2, match_score=1, mismatch_score=-1, gap_score=-2):
This function performs global alignment of two character strings x and y
First character string to be aligned
Second character string to be aligned
match_score : numeric
Score when the characters match
mismatch_score : numeric
Score when the characters mismatch
gap_score : numeric
Score when an indel is created
xalign : str
Aligned first string
yalign : str
Aligned second string
F : np.ndarray
Score matrix
I : np.ndarray
Alignment operation matrix for backtrace. 'd' for diagonal; 'v' for
vertical; 'h' for horizontal
M = len(s1)
N = len(s2)
F = np.zeros((M + 1, N + 1), dtype=np.int16)
I = np.empty_like(F, dtype=str)
# Initial Conditions for matrices F and I:
# First column
F[:, 0] = gap_score * np.arange(M + 1)
I[1:, 0] = "v"
# First row
F[0, :] = gap_score * np.arange(N + 1)
I[0, 1:] = "h"
# compare sequenses x and y, find maximum-score matrix F
# and the straight-line segments for traceback of optimal path in matrix I
for i in range(1, M + 1):
for j in range(1, N + 1):
if s1[i - 1] == s2[j - 1]:
w = match_score
w = mismatch_score
options = [F[i - 1,j - 1] + w, # diagonal
F[i, j - 1] + gap_score, # horizontal
F[i - 1, j] + gap_score] # vertical
F[i, j] = np.max(options)
I[i, j] = ['d', 'h', 'v'][np.argmax(options)]
#traceback:
s1rev = ''
s2rev = ''
while I[i, j] != "": #the process will stop at i=j=0 because I[0,0] = ""
if I[i, j] == "d":
s1rev += s1[i]
s2rev += s2[j]
elif I[i, j] == 'h':
s1rev += '-'
s2rev += s2[j]
s1rev += s1[i]
s2rev += '-'
#reverse arrays for proper display
s1align = s1rev[::-1]
s2align = s2rev[::-1]
return s1align, s2align, F, I
s1 = 'ATTCAGGTCAAGGCTTAGCAGTCTAAACGGATCG'
s2 = 'ATTCGGTCAATGCATTAGCAGTCTAAAGGATC'
s1align, s2align, F, I = global_alignment(s1, s2)
print("{}\n{}".format(s1align, s2align))
ATTCAGGTCAAGGC-TTAGCAGTCTAAACGGATCG
ATTC-GGTCAATGCATTAGCAGTCTAAA-GGATC-
s1 = 'ACTGCTA'
s2 = 'CTGA'
s1align, s2align, F, I = global_alignment(s1, s2)
print("{}\n{}".format(s1align, s2align))
array([[ 0, -2, -4, -6, -8],
[ -2, -1, -3, -5, -5],
[ -4, -1, -2, -4, -6],
[ -6, -3, 0, -2, -4],
[ -8, -5, -2, 1, -1],
[-10, -7, -4, -1, 0],
[-12, -9, -6, -3, -2],
[-14, -11, -8, -5, -2]], dtype=int16)
array([['', 'h', 'h', 'h', 'h'],
['v', 'd', 'd', 'd', 'd'],
['v', 'd', 'd', 'd', 'd'],
['v', 'v', 'd', 'h', 'h'],
['v', 'v', 'v', 'd', 'h'],
['v', 'd', 'v', 'v', 'd'],
['v', 'v', 'd', 'v', 'd'],
['v', 'v', 'v', 'v', 'd']], dtype='