Reassemble

tosh1 pts0 comments

CS107 Assignment 1: Reassemble

tag. -->

26 captures<br>13 Aug 2014 - 25 Jul 2024

Feb<br>MAR<br>Apr

11

2014<br>2015<br>2016

success

fail

About this capture

COLLECTED BY

Organization: Alexa Crawls

Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.

Collection: Alexa Crawls

Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.

TIMESTAMPS

The Wayback Machine - https://web.archive.org/web/20150311174405/http://web.stanford.edu:80/class/cs107/assign1.html

Toggle navigation

CS107

Home

Course materials

Admin

Syllabus

Labs

Assignments

Exams

Gradebook

Getting help

FAQ

Office hours

Forum & email

Other resources

!-- end Direct use -->

Assignment 1: Reassemble

Due date: Mon Jan 19 11:59 pm - Hard deadline: Wed Jan 21 11:59 pm

Assignment by Julie Zelenski, with ideas borrowed from Rich Pattis (UC Irvine) and Owen Astrachan (Duke)

[Quick Links: Implementation details, Advice page, Grading ]

Learning goals

Completing this assignment will jump-start your proficiency in:

editing, compiling, testing, and debugging C programs under Unix

writing code to manipulate C-strings

using the C facilities for dynamic memory management (malloc & free)

The problem

You love The C Programming Language text so much that you own 10 copies of it, but your evil nemesis knows how dear it is to you and has taken a pair of scissors to every one of your copies and randomly cut each into thousands of pieces. Amidst your tears, you attempt to reassemble the fragments to recreate the original. Little did you know that solving this problem would prep you for a career in computational biology and international espionage.

For the assignment, you are to write a program that reads a input of text fragments and uses them to reconstruct the original document. The fragments were created by duplicating the original document many times over and chopping each copy into pieces. The process of reconstructing the original will rely on finding overlapping portions and aligning them to match.

This task is based on a technique used for genome shotgun sequencing in the Human Genome Project. Current sequencing technology can only sequence chunks of 500 or so base pairs, yet the human genome is billions of pairs long. Thus, the DNA to be sequenced is cloned several times and the clones are randomly cut into sequenceable-sized chunks. The reassembly step aligns those chunks to reconstruct the original strand. Nova had a fascinating show on Cracking the Code of Life about the race to decode the human genome. In addition to biology, a number of intelligence gathering tasks involve reassembly. The carpet weavers piecing together photos in Argo are based in truth, archivists are reassembling Stasi trash, and the team "All Your Shreds are Belong to U.S." (great name!) just won the DARPA-sponsored shredder challenge. All of these tasks have in common the process of matching fragments to piece them into a unified whole, just as you will do for this assignment.

Greedy match and merge

The program reads the collection of fragments and then processes the fragments in a series of rounds. A round does a single match/merge operation. To match, it searches the collection to find the pair of fragments with the longest-length overlap. To merge, the pair is aligned at their point of overlap and joined into a superstring that has the two original fragments as substrings. Each round decreases the total number of fragments by one and the process continues until only one fragment remains. In each round, it is the the pair of fragments with the longest-length overlap that is chosen for merging.

Consider a collection with fragments shown below (extra spaces were inserted between letters for clarity):

a l l i s w e l l<br>e l l t h a t e n<br>h a t e n d<br>t e n d s w e l l

On the first round, the longest overlap found is a 6-character overlap between the middle two fragments when aligned as below:

e l l t h a t e n<br>h a t e n d

These two fragments would be removed and replaced with their merged result:

e l l t h a t e n d

The merged fragment is a candidate for future rounds; the two fragments it was composed from are discarded. On the next round, the longest overlap is 5 characters when these two fragments are aligned as below:

e l l t h a t e n d<br>t e n d s w e l l

These fragments would be removed and replaced with their merged result:

e l l t h a t e n d s w e l l

The last round merges the remaining two fragments aligned at their maximal overlap of 3 characters:

a l l i s w e l l<br>e l l t h a t e n d s w e l l

The one fragment remaining is the final result:

a l l i s w e l l t h a t e n d s w e l l

A match is also possible when one fragment is completely contained within another. Consider:

s w e l l...

fragments overlap assignment original round match

Related Articles