CS107 Assignment 1: Reassemble
tag. -->
26 captures<br>13 Aug 2014 - 25 Jul 2024
Feb<br>MAR<br>Apr
11
2014<br>2015<br>2016
success
fail
About this capture
COLLECTED BY
Organization: Alexa Crawls
Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.
Collection: Alexa Crawls
Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.
TIMESTAMPS
The Wayback Machine - https://web.archive.org/web/20150311174405/http://web.stanford.edu:80/class/cs107/assign1.html
Toggle navigation
CS107
Home
Course materials
Admin
Syllabus
Labs
Assignments
Exams
Gradebook
Getting help
FAQ
Office hours
Forum & email
Other resources
!-- end Direct use -->
Assignment 1: Reassemble
Due date: Mon Jan 19 11:59 pm - Hard deadline: Wed Jan 21 11:59 pm
Assignment by Julie Zelenski, with ideas borrowed from Rich Pattis (UC Irvine) and Owen Astrachan (Duke)
[Quick Links: Implementation details, Advice page, Grading ]
Learning goals
Completing this assignment will jump-start your proficiency in:
editing, compiling, testing, and debugging C programs under Unix
writing code to manipulate C-strings
using the C facilities for dynamic memory management (malloc & free)
The problem
You love The C Programming Language text so much that you own 10 copies of it, but your evil nemesis knows how dear it is to you and has taken a pair of scissors to every one of your copies and randomly cut each into thousands of pieces. Amidst your tears, you attempt to reassemble the fragments to recreate the original. Little did you know that solving this problem would prep you for a career in computational biology and international espionage.
For the assignment, you are to write a program that reads a input of text fragments and uses them to reconstruct the original document. The fragments were created by duplicating the original document many times over and chopping each copy into pieces. The process of reconstructing the original will rely on finding overlapping portions and aligning them to match.
This task is based on a technique used for genome shotgun sequencing in the Human Genome Project. Current sequencing technology can only sequence chunks of 500 or so base pairs, yet the human genome is billions of pairs long. Thus, the DNA to be sequenced is cloned several times and the clones are randomly cut into sequenceable-sized chunks. The reassembly step aligns those chunks to reconstruct the original strand. Nova had a fascinating show on Cracking the Code of Life about the race to decode the human genome. In addition to biology, a number of intelligence gathering tasks involve reassembly. The carpet weavers piecing together photos in Argo are based in truth, archivists are reassembling Stasi trash, and the team "All Your Shreds are Belong to U.S." (great name!) just won the DARPA-sponsored shredder challenge. All of these tasks have in common the process of matching fragments to piece them into a unified whole, just as you will do for this assignment.
Greedy match and merge
The program reads the collection of fragments and then processes the fragments in a series of rounds. A round does a single match/merge operation. To match, it searches the collection to find the pair of fragments with the longest-length overlap. To merge, the pair is aligned at their point of overlap and joined into a superstring that has the two original fragments as substrings. Each round decreases the total number of fragments by one and the process continues until only one fragment remains. In each round, it is the the pair of fragments with the longest-length overlap that is chosen for merging.
Consider a collection with fragments shown below (extra spaces were inserted between letters for clarity):
a l l i s w e l l<br>e l l t h a t e n<br>h a t e n d<br>t e n d s w e l l
On the first round, the longest overlap found is a 6-character overlap between the middle two fragments when aligned as below:
e l l t h a t e n<br>h a t e n d
These two fragments would be removed and replaced with their merged result:
e l l t h a t e n d
The merged fragment is a candidate for future rounds; the two fragments it was composed from are discarded. On the next round, the longest overlap is 5 characters when these two fragments are aligned as below:
e l l t h a t e n d<br>t e n d s w e l l
These fragments would be removed and replaced with their merged result:
e l l t h a t e n d s w e l l
The last round merges the remaining two fragments aligned at their maximal overlap of 3 characters:
a l l i s w e l l<br>e l l t h a t e n d s w e l l
The one fragment remaining is the final result:
a l l i s w e l l t h a t e n d s w e l l
A match is also possible when one fragment is completely contained within another. Consider:
s w e l l...