CS2201 Assignment No. 1
Purpose: Gain experience in using and manipulating arrays, particularly partially-filled arrays.
Background on DNA, Restriction Enzymes, and PCR:
This background is interesting, but not really needed to do the assignment. There are some good stories here, but if you want to get to the assignment, you can skip this stuff.
In this assignment you’ll simulate cleavage (cutting) of DNA by restriction enzymes. DNA is made up of sequences of molecules known as nucleotides that are linked together to form a DNA strand. Biochemists use the letters ‘A’, ‘C’, ‘T’, and ‘G’ (standing for adenine, cytosine, thymine, and guanine) to represent a linked sequence of nucleotides within a DNA strand. Restriction enzymes are molecules that recognize and cut very specific “target” nucleotide sequences. These target sequences are commonly referred to as restriction sites.
Three scientists shared the Nobel Prize in 1978 for the discovery of restriction enzymes. They’re also an essential part of the process called PCR polymerase chain reaction which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993.
You can see animations and explanations of both restriction enzymes and PCR at DnaTube and Cold Spring Harbor Dolan DNA Learning Center.
Restriction Enzymes PCR: Polymerase Chain Reaction
(source: http://www.roche.com/pages/facets/1/pcr1.jpg
(source: http://www.astbury.leeds.ac.uk/gallery/leedspix.html)
Kary Mullis, the inventor of PCR, is an interesting character. To see more about him see this archived copy of a 1992 interview in Omni Magazine, this 1994 interview as part of virus myth, his personal website which includes information about his autobiography Dancing Naked in the Mind Field, though you can read this free Nobel autobiography as well.
The simulation is a simplification of the chemical process, but provides us a nice starting point for our exploration of C++ and also provides us with a basis for future assignments.
The Assignment:
This assignment is a simplification of the chemical process, but provides an example of the DNA manipulation using simple arrays. We will use a static array containing characters, such as the letters ‘A’, ‘C’, ‘T’, and ‘G’, to represent the sequence of nucleotides within a DNA strand.
Sample representation of a DNA molecule: ACTTGATTGGGTTGCTTGCC???
Note that the array may be larger than the strand that we are representing, and thus there may be unused array elements after the nucleotides within the strand. Thus not only must we know the size of the array, we must also keep track of the size of the strand inside the array [this is known as a partially-filled array; in this example we have an array that can hold 23 characters which is currently only holding 20 characters, the question marks indicate unused array elements whose values we don’t know or care about].
Cleavage by a restriction enzyme will simply mean removing a specified sequence from our DNA strand. Cleaving requires a target nucleotide pattern. The sequence to be removed will be the nucleotides that reside between matching pairs of the target pattern. A cleave will remove all nucleotides beginning after the first occurrence of the target pattern and continue through the end of the second (matching) target pattern. We will write 2 functions cleave and cleaveAll, which will implement cleaving in 2 different forms. Cleave will cleave only the first such sequence while cleaveAll will remove all such sequences between matching pairs within the DNA strand.
Thus, for e.g. if we invoke cleave(“TTG”) on the above DNA strand, we would remove the first set of highlighted characters and the resultant DNA strand will be :
ACTTGGGTTGCTTGCC??????? Notice how the data was shifted down to fill the hole created by the deleted sequence.
In comparison, if we invoke cleaveAll(“TTG”) on the original DNA strand above, we would remove both sets of highlighted characters and the resultant DNA strand will be :
ACTTGGGTTGCC???????????
The nucleotides in our DNA are character values, so we will use an array of type char. The array will be of size MAX_DNA (a constant variable set to 50 – a value too small for real DNA but it makes our testing easier) and so we will be limited to holding DNA with a maximum number of nucleotides of MAX_DNA. Our class will also have a private instance variable called mySize which will keep track of the size of the strand (i.e., how many elements we are using of the partially-filled array).
DNA_Strand.h describes all the functions to be implemented.
Functional Specifications:
You will be supplied the class declaration file: DNA_Strand.h. The file contains the declaration of a set of functions to manipulate arrays representing DNA. Your task will be to implement all the functions in the class definition file: DNA_Strand.cpp. Changes to DNA_Strand.h are limited to only adding private helper methods (which are optional; helper methods are not required). You have also been provided with an initial test program: DNAtest.cpp. You should add code to the DNAtest.cpp file to fully test your DNA_Strand class (copy code from the DNAtest.cpp file you created for Project #0).
Implementation details:
Here are a few notes that might be helpful:
1. You are to download a zip file that is provided with this specification. The zip file contains a CLion project that
includes the starter code for this project. You must unzip/extract the files before you work on them. Once you
unzip the file, you can open the project by starting CLion, then specify that you want to “Open Project” (do not make a new project or new project from sources), and then navigate to the folder you just extracted. When you open the project, let CLion do its initialization work and load symbols. The code as provided compiles cleanly and the program should run as given – though the results are incorrect since required functionality is missing. Again, you must unzip/extract the files before you work on them. For more details on opening a provided project, review the instructions at the end of the Project 0 specification.
2. Please see “Opening a CLion project” step #6 in the project #0 spec and ensure that your project #1 CMakeLists.txt file has all the correct compiler flags. This is important to ensure your code compiles when it is graded. Also make sure that your project settings/preferences specify the use of the clang compiler under the CMake options (as per the CLion/clang installation instructions posted to Brightspace).
3. Write your code in terms of MAX_DNA and not the constant 50. We may want to change the size of the array in the future, and such a change should only require changing one line of code.
4. Note that you do not need to implement the methods in the order they appear in the file. You can implement them in any order. In particular, it can be helpful to have methods call other methods when those other methods can do some of the work for you. So THINK first.
5. It is highly recommended that you write your code incrementally. You should implement one or two methods at a time and fully test them before moving on to other methods. You can simply comment out code in the DNAtest.cpp program that is testing methods that you have not yet implemented, then uncomment the code when you want to reactivate those tests (CLion has an option to quickly comment/uncomment highlighted code under the Code drop-down menu). You may also want to enhance your test code as you go along, so that it does a better job of testing “edge conditions” – conditions that are valid but are out of the norm.
6. You will likely be performing some operations with C++ strings. C++ string operations are described here. You will only need a few of the available operations in this assignment.
. In particular, when you are asked to compare two DNA_Strands or find nucleotides sequences, you must operate directly on your array. For
the most part, your usage of strings should be limited to getting their length (with the length() method) or accessing individual characters (using array indexing with [ ]). One caveat, there is a nice string constructor that you are allowed to use in your DNA_Strand’s toString() method.
7. The length() method of the string class returns a value of type size_t (an unsigned integer type). We also use variables and parameters of type size_t whenever we expect the value to be non-negative (e.g., MAX_DNA and mySize are both declared to be size_t variables). The compiler may give you conversion warning messages if you attempt to assign/compare these values to a value of type int. Your homework submissions are expected to have clean compiles (no errors or warnings). To eliminate this particular warning, you can cast the size_t values to an int, or cast the int values to type size_t if you know the integer values are not negative – be sure to make conscious decisions as you use these operations.
8. Your DNA_Strand class constructors are required to use the base member initialization list as much as possible.
9. The DNA_Strand.h specifies that you are supposed to throw an exception if someone attempts to access a part of
your DNA_Strand that is not within the bounds of the DNA strand that you are currently representing. To throw an exception in C++, you can use the follow statement: throw std::out_of_range(“Index out of range”); This statement requires that you include the following at the top of your DNA_Strand.cpp file: #include
10. Since we are using a static array in the DNA_Strand class, the C++ compiler takes care of creating the array for you whenever a DNA_Strand object is created. Your constructors should not have any “new” operations (I state this for the benefit of the Java programmers in the class).
11. We are using a character array to represent the strand.
. Rather than marking the end of the DNA strand with a null terminator, we are using the mySize instance variable to keep track of how
many elements of the array we are currently using.
convert your array of chars into a string object and then operate on that string object
Note that when you write your DNA_Strand
methods, you are expected to directly manipulate the array of chars that you have created. You are not allowed to
We are not using this character array as a cstring (a null-
terminated, character array), nor are you allowed to do so while working on this assignment
12. The array of characters used to represent the strand will be a partially-filled array. See the information on partially-filled arrays below. Note that you will lose points if your class methods do unnecessary work; i.e., if you process elements of the array that are not currently being used to represent the strand.
13. Even though we plan to use this class to represent DNA strands, we will not restrict the user to only using the characters ‘A’, ‘C’, ‘G’, and ‘T’; rather we will allow any characters to be stored in the DNA strand. Your code is expected to be case sensitive, thus the strand “AGT” is different than the strand “agt”.
14. If you do not understand how a method should work from its description, then can go back to project #0 and try things out – you were given a working DNA_Strand class after all.
15. In CS2201, we do not utilize “using namespace std;” in our programs, and you are not allowed to use it on any of our programming projects.
Partially filled arrays: Here is some additional information on what is meant by a “partially filled array”.
In C++ (and Java too), once an array is created its size is set. You cannot make it bigger, nor can you make it smaller. If you ever wanted the array to be bigger/smaller, you would have to reallocate a completely new array (something we will see in lecture very soon). In this assignment, we are never reallocating a new array (nor can we due to the way the array was declared in C++).
In C++, once an array is created every element of the array contains a data item of the declared type of the array. I.e., every element of an integer array contains an integer; and every element of a char array contains a character. This is also true for arrays of class type; each element of such an array contains an object when the array is created.
Now, there are times when we don’t know ahead of time how many items we will want to store in our array. In such cases, we have to make a reasonable guess at the maximum size needed and allocate an array of that size. In this situation, we will not be using all the array elements all the time, so we have to keep track of the number of data items that we are actually storing in the array and we will store those items in contiguous array elements starting at index zero. This is known as a partially filled array. To make this work we need to keep track of three separate things:
1) the array where we are storing data (e.g., myDNA)
2) the size of the array (e.g., MAX_DNA)
3) the number of elements of the array that are currently being used (e.g., mySize); again, the mySize data items are stored in the array from index 0 to index mySize-1.
In this partially filled array, you cannot set array elements to some special value to indicate that the element is currently “not used” or “deleted”. What value would you use? It would have to be some value that could not be a valid value added by the user — but any value you can create to indicate “not used” the user could also potentially create to add to the array.
If you ever need to know if an array element is currently being used, you only need to consider the value of mySize. All elements from index 0 to index mySize-1 are used, and all elements from index mySize to index MAX_DNA-1 are not used.
To add items to a partially filled array (assuming there is space available), you need to increase the count of used elements and place the new items in the appropriate position of the array (shifting existing data up if necessary). To delete items from a partially filled array, you need to decrease the count of used elements and shift data down to fill in the hole left by the deleted elements. Deleting will cause some elements of the array to change from being in the “used” portion of the array to being in the “unused” portion of the array. There is no need to set these elements to some special value — since they are now in the unused portion of the array we should not be accessing those values any longer. Note for this assignment, none of the methods require you to add new items to the partially filled array, but some methods require you to delete items.
When processing the data of a partially filled array, you only process the array elements that lie in the portion that is currently being used.
Final write-up: When you have completed the assignment, create a README document (a README.txt text file or a README.docx Word document), and in it answer the following questions:
1. State your name and email address.
2. After reviewing this spec and the .h file, please estimate/report how many hours you think it will take you to complete this project. [This is just an estimate and does not affect your grade.]
3. How many hours did you actually spend total on this assignment? [This information will not affect your grade.]
4. Did you encounter any impediments that prevented you from making progress on this assignment?
5. What did you like or hate about this assignment?
6. Do you have any suggestions for improving this assignment?
Submission for grading:
When you have completed your work on this assignment, please submit the following four files for grading: DNA_Strand.cpp, DNA_Strand.h, and DNAtest.cpp, plus your final README write-up document. Submit ONLY the four files; as you should not have changed any of the other source files (and you likely did not change DNA_Strand.h). Please do not submit a zip file containing your entire project. You can submit the files by visiting the assignment page in Brightspace (click on the assignment name), scroll down to the “Submit Files” section, and add the files by clicking on the “Add a file” button and finding the files to attach.
After submitting your homework, it is good practice to verify your submission. Revisit the assignment page in Brightspace and make sure that your files were successfully submitted. Then click on each file to open it up so that you can verify that you submitted the correct file, as opposed to some older version of the file. It is your responsibility to insure the correctness of your submission.
If you need to resubmit any of the files, you need to do a full resubmission of all the files, as only your last submission is saved.
Grading:
This project is worth 50 points. Your grade on this project will be based on the following:
1. The correctness of your methods implemented in the DNA_Strand.cpp file.
2. The use of good programming style. No lines of code past column 100, proper indentation, proper use of curly
braces, etc. See the style document posted to Brightspace under Course Resources.
3. Eliminating redundancy in your code. If you fail to utilize other methods that you have written and have repeated
functionality in multiple methods, you will lose points.
4. The thoroughness of your testing performed in the DNAtest.cpp file. Note: the thoroughness of your testing often
has a great impact on the correctness of your code (see item #1 above).
5. Appropriate responses to all questions in the README file. Again, the file must be named README and it
should be a .txt text file or a .docx Word file.
You should also review the syllabus regarding the penalties for late programming assignments.
We recommend that you get started on this assignment immediately (start thinking, not coding). Do not wait for your project #0 submissions to be graded – you can fix that stuff up later.
Acknowledgements:
This assignment is loosely based on a project by Owen Astrachan at Duke University which is in the collection of ACM SIGCSE Nifty Assignments.