# Point Mutations In Kmers in Python Homework Sample

You need to write a generic search for point mutations which can handle a variety of values of k and N. You should not rely on external libraries during the ksearch and nsearch functions, such as np.array, np.sort and np.where or other non built in libraries. You need to find frequently occurring kmers inside a DNA sequence. The DNA sequence CATTAG consists of 4 3-mers, CAT, ATT, TTA, and TAG. Add code to find the sequences that appears at least f times in the sequence. Find point mutations for these sequences (a point mutation of CAT might be AAT). For more Python programming assignments contact us for a quote.

Solution:

p1.py

“””M345SC Homework 1, part 1

Your name and CID here

“””

def preprocessing(S, k):

“””

function creates a sorted dictionary for all k-mers with the

corresponding indices values within the string

“””

dic = {}

for idx in range(0, len(S)-k+1):

seq = S[idx: idx+k]

dic.setdefault(seq, []).append(idx)

return dic

def ksearch(S, k, f, x):

“””

Search for frequently-occurring k-mers within a DNA sequence

and find the number of point-x mutations for each frequently

occurring k-mer.

Input:

S: A string consisting of A,C,G, and Ts

k: The size of k-mer to search for (an integer)

f: frequency parameter — the search should identify k-mers which

occur at least f times in S

x: the location within each frequently-occurring k-mer where

point-x mutations will differ.

Output:

L1: list containing the strings corresponding to the frequently-occurring k-mers

L2: list containing the locations of the frequent k-mers

L3: list containing the number of point-x mutations for each k-mer in L1.

Discussion: first we build dictionary for all subsequences of length k and with corresponding

index, this step taked O(N*k) as N length of string and k is the length of substrings, after this

step we could know the frequency of each sequence and the locations by retriving the dictionary

one time with linear time O(N), for last part we compare every substring from the

frequently-occurring k-mers with the last strings in the all k-mers, and if the x element doesn’t match

and all other elemnts match we consider it as mutation, it will take(M*N) as M is the number of frequent

elemnts and N in the number of all k-mers.

“””

L1, L2, L3 = [], [], []

k_mers = preprocessing(S, k)

for string in k_mers:

if len(k_mers[string]) >= f:

L1.append(string)

L2.append(k_mers[string])

for seq in L1:

temp = []

for cand in list(k_mers.keys()):

if seq[:x]+seq[x+1:] == cand[:x]+cand[x+1:] and seq[x] != cand[x]:

temp.append(cand)

L3.append(len(temp))

return L1, L2, L3

if __name__ == ‘__main__’:

#Sample input and function call. Use/modify if/as needed

S=’CCCTATGTTTGTGTGCACGGACACGTGCTAAAGCCAAATATTTTGCTAGT’

#k=3

#x=2

#f=2

#L1,L2,L3=ksearch(S,k,f,x):

L1, L2, L3 = ksearch(S, 3, 2, 2)

print(L1, L2, L3)

p2.py

“””M345SC Homework 1, part 2

Your name and CID here

“””

def binary_search(alist, target, first=False):

“””

Given sorted list apply the binary search algorithm and returns

the index of first occurrence of the target or last occurrence

“””

istart = 0

iend = len(alist) – 1

result = -1

while istart <= iend:

imid = int((istart+iend)/2)

if target == alist[imid]:

result = imid

if first:

iend = imid – 1 # find first occurrence

else:

istart = imid + 1 # find last occurrence

elif target < alist[imid]:

iend = imid-1

else:

istart = imid+1

return result

def linear_search(alist, target):

“””

Given list apply linear search to find all indices

of the target in the list

“””

indx = []

for i, y in enumerate(alist):

if y == target:

indx.append(i)

return indx

def nsearch(L, P, target):

“””Input:

L: list containing *N* sub-lists of length M. Each sub-list

contains M numbers (floats or ints), and the first P elements

of each sub-list can be assumed to have been sorted in

ascending order (assume that P<M). L[i][:p] contains the sorted elements in

the i+1th sub-list of L

P: The first P elements in each sub-list of L are assumed

to be sorted in ascending order

target: the number to be searched for in L

Output:

Lout: A list consisting of Q 2-element sub-lists where Q is the number of

times target occurs in L. Each sub-list should contain 1) the index of

the sublist of L where target was found and 2) the index within the sublist

where the target was found. So, Lout = [[0,5],[0,6],[1,3]] indicates

that the target can be found at L[0][5],L[0][6],L[1][3]. If target

is not found in L, simply return an empty list (as in the code below)

“””

Lout = []

for i, sublist in enumerate(L):

ord_sub = sublist[: P]

unord_sub = sublist[P:]

if target > ord_sub[-1]:

indx = linear_search(unord_sub, target)

if len(indx) > 0:

for idx in indx:

Lout.append([i, P+idx])

else:

first_ord = binary_search(ord_sub, target, first=True)

last_ord = binary_search(ord_sub, target, first=False)

if first_ord != -1:

for idx in range(first_ord, last_ord+1):

Lout.append([i, idx])

indx = linear_search(unord_sub, target)

if len(indx) > 0:

for idx in indx:

Lout.append([i, P+idx])

return Lout

def nsearch_time(plt_len, num_sorted, target, step):

“””Analyze the running time of nsearch.

Add input/output as needed, add a call to this function below to generate

the figures you are submitting with your codes.

Discussion: considering the outer for loop the function iterates

through all the sublists with running time O(N) as *N* is the number

of the sublists inside L. for every sublist, taking advantage of the fact

that the first p elements are sorted in ascending order, we compare the

pth-1 elemnet (last element in sorted part) with the target,

if it is smaller than the target we can omit all elements in

the sorted part, then we apply linear search for the unsorted part

with O(M-p + k) as M is the length of the subset, p is the length of

sorted part and k number of occurence of the target in sublist,

the second case would be that last element in the sorted part is

bigger than the target value, we will use different approaches

to deal with the sorted or unsoreted parts, for the sorted part

we will take advantage of the fact that it’s sorted and we will

use binary search two times to find the first occurrence and the

last occurrence of the target with O(log(p) + k) as p is the length of

the sorted part and k number of occurence of the target in this part,

then we will apply the linear search for unsorted part as in the first case.

this make the running time of function close to O(N * (M-p+log(p)+k))

fig1 : by increasing the size of list L along with the size of sublists

and measure time for each time then plotting the size of list in x-axis

and the running time in y-axis we notice that the function running time

takes behaviour so close to quadratic function as the running time

is considered as multiple of the size of list and size of sublists.

“””

import time

import numpy as np

import matplotlib.pyplot as plt

x = []

y = []

for i in range(100, plt_len, step):

array = np.random.randint(1, 3000, size=(i*3, i))

array[:, :num_sorted].sort(axis=1)

t1 = time.time()

nsearch(array, num_sorted, target)

t2 = time.time()

x.append(i*3)

y.append(t2-t1)

plt.plot(x, y)

plt.ylabel(‘Time’)

plt.xlabel(‘size of list’)

plt.savefig(‘fig1.png’)

return None # Modify as needed

if __name__ == ‘__main__’:

# add call(s) to nsearch here which generate the figure(s)

# you are submitting

nsearch_time(1000, 50, 100, 100) # modify as needed