Point Mutations In Kmers in Python

Point Mutations In Kmers in Python Homework Sample

You need to write a generic search for point mutations which can handle a variety of values of k and N. You should not rely on external libraries during the ksearch and nsearch functions, such as np.array, np.sort and np.where or other non built in libraries. You need to find frequently occurring kmers inside a DNA sequence. The DNA sequence CATTAG consists of 4 3-mers, CAT, ATT, TTA, and TAG. Add code to find the sequences that appears at least f times in the sequence. Find point mutations for these sequences (a point mutation of CAT might be AAT). For more Python programming assignments contact us for a quote.

Solution:

p1.py

“””M345SC Homework 1, part 1
Your name and CID here
“””

def preprocessing(S, k):
“””
function creates a sorted dictionary for all k-mers with the
corresponding indices values within the string
“””
dic = {}

for idx in range(0, len(S)-k+1):
seq = S[idx: idx+k]
dic.setdefault(seq, []).append(idx)

return dic

def ksearch(S, k, f, x):
“””
Search for frequently-occurring k-mers within a DNA sequence
and find the number of point-x mutations for each frequently
occurring k-mer.
Input:
S: A string consisting of A,C,G, and Ts
k: The size of k-mer to search for (an integer)
f: frequency parameter — the search should identify k-mers which
occur at least f times in S
x: the location within each frequently-occurring k-mer where
point-x mutations will differ.

Output:
L1: list containing the strings corresponding to the frequently-occurring k-mers
L2: list containing the locations of the frequent k-mers
L3: list containing the number of point-x mutations for each k-mer in L1.

Discussion: first we build dictionary for all subsequences of length k and with corresponding
index, this step taked O(N*k) as N length of string and k is the length of substrings, after this
step we could know the frequency of each sequence and the locations by retriving the dictionary
one time with linear time O(N), for last part we compare every substring from the
frequently-occurring k-mers with the last strings in the all k-mers, and if the x element doesn’t match
and all other elemnts match we consider it as mutation, it will take(M*N) as M is the number of frequent
elemnts and N in the number of all k-mers.
“””
L1, L2, L3 = [], [], []

k_mers = preprocessing(S, k)

for string in k_mers:
if len(k_mers[string]) >= f:
L1.append(string)
L2.append(k_mers[string])

for seq in L1:
temp = []
for cand in list(k_mers.keys()):
if seq[:x]+seq[x+1:] == cand[:x]+cand[x+1:] and seq[x] != cand[x]:
temp.append(cand)

L3.append(len(temp))

return L1, L2, L3

if __name__ == ‘__main__’:
#Sample input and function call. Use/modify if/as needed
S=’CCCTATGTTTGTGTGCACGGACACGTGCTAAAGCCAAATATTTTGCTAGT’
#k=3
#x=2
#f=2
#L1,L2,L3=ksearch(S,k,f,x):
L1, L2, L3 = ksearch(S, 3, 2, 2)
print(L1, L2, L3)

p2.py

“””M345SC Homework 1, part 2
Your name and CID here
“””

def binary_search(alist, target, first=False):
“””
Given sorted list apply the binary search algorithm and returns
the index of first occurrence of the target or last occurrence
“””
istart = 0
iend = len(alist) – 1
result = -1
while istart <= iend:
imid = int((istart+iend)/2)
if target == alist[imid]:
result = imid
if first:
iend = imid – 1 # find first occurrence
else:
istart = imid + 1 # find last occurrence

elif target < alist[imid]:
iend = imid-1
else:
istart = imid+1
return result

def linear_search(alist, target):
“””
Given list apply linear search to find all indices
of the target in the list
“””
indx = []
for i, y in enumerate(alist):
if y == target:
indx.append(i)

return indx

def nsearch(L, P, target):
“””Input:
L: list containing *N* sub-lists of length M. Each sub-list
contains M numbers (floats or ints), and the first P elements
of each sub-list can be assumed to have been sorted in
ascending order (assume that P<M). L[i][:p] contains the sorted elements in
the i+1th sub-list of L
P: The first P elements in each sub-list of L are assumed
to be sorted in ascending order
target: the number to be searched for in L

Output:
Lout: A list consisting of Q 2-element sub-lists where Q is the number of
times target occurs in L. Each sub-list should contain 1) the index of
the sublist of L where target was found and 2) the index within the sublist
where the target was found. So, Lout = [[0,5],[0,6],[1,3]] indicates
that the target can be found at L[0][5],L[0][6],L[1][3]. If target
is not found in L, simply return an empty list (as in the code below)
“””
Lout = []

for i, sublist in enumerate(L):
ord_sub = sublist[: P]
unord_sub = sublist[P:]

if target > ord_sub[-1]:
indx = linear_search(unord_sub, target)
if len(indx) > 0:
for idx in indx:
Lout.append([i, P+idx])

else:
first_ord = binary_search(ord_sub, target, first=True)
last_ord = binary_search(ord_sub, target, first=False)

if first_ord != -1:
for idx in range(first_ord, last_ord+1):
Lout.append([i, idx])

indx = linear_search(unord_sub, target)
if len(indx) > 0:
for idx in indx:
Lout.append([i, P+idx])

return Lout

def nsearch_time(plt_len, num_sorted, target, step):
“””Analyze the running time of nsearch.
Add input/output as needed, add a call to this function below to generate
the figures you are submitting with your codes.

Discussion: considering the outer for loop the function iterates
through all the sublists with running time O(N) as *N* is the number
of the sublists inside L. for every sublist, taking advantage of the fact
that the first p elements are sorted in ascending order, we compare the
pth-1 elemnet (last element in sorted part) with the target,
if it is smaller than the target we can omit all elements in
the sorted part, then we apply linear search for the unsorted part
with O(M-p + k) as M is the length of the subset, p is the length of
sorted part and k number of occurence of the target in sublist,
the second case would be that last element in the sorted part is
bigger than the target value, we will use different approaches
to deal with the sorted or unsoreted parts, for the sorted part
we will take advantage of the fact that it’s sorted and we will
use binary search two times to find the first occurrence and the
last occurrence of the target with O(log(p) + k) as p is the length of
the sorted part and k number of occurence of the target in this part,
then we will apply the linear search for unsorted part as in the first case.
this make the running time of function close to O(N * (M-p+log(p)+k))

fig1 : by increasing the size of list L along with the size of sublists
and measure time for each time then plotting the size of list in x-axis
and the running time in y-axis we notice that the function running time
takes behaviour so close to quadratic function as the running time
is considered as multiple of the size of list and size of sublists.
“””
import time
import numpy as np
import matplotlib.pyplot as plt

x = []
y = []
for i in range(100, plt_len, step):
array = np.random.randint(1, 3000, size=(i*3, i))
array[:, :num_sorted].sort(axis=1)
t1 = time.time()
nsearch(array, num_sorted, target)
t2 = time.time()
x.append(i*3)
y.append(t2-t1)

plt.plot(x, y)
plt.ylabel(‘Time’)
plt.xlabel(‘size of list’)
plt.savefig(‘fig1.png’)
return None # Modify as needed

if __name__ == ‘__main__’:

# add call(s) to nsearch here which generate the figure(s)
# you are submitting
nsearch_time(1000, 50, 100, 100) # modify as needed