8BBS: A Forgotten Primary Source
Many people are vaguely aware that the word ‘hacker’ did not always refer to a computer burglar. Once you begin to ask about the details, however, things start to break down. When did it go from meaning computer tricksterism to meaning computer trespass? I never seem to get the same answer twice. Some cite Steven Levy’s Hackers as having ruined the term by popularizing it for a generation of teenage punks. More informed respondents tell me that the 414 Gang corrupted ‘hacker’ through their antics. In my own research, one incredible primary source I keep coming back to is a bulletin board that existed circa 1980 called 8BBS. 8BBS was an open forum that ended up being used primarily by phreakers to discuss the art of phone and computer intrusion. Given that it was one of the major primary sources for Katie Hafner’s Cyberpunk, you’d think I would hear about it more often.
The basic reason I don’t is that it wasn’t available on the web until recently, and even now it’s only available in a nasty PDF format. For my own personal use this is highly inconvenient, but for getting others interested it’s basically a showstopper. Since nobody else seems likely to fix it, and this is a valuable source that deserves to be on the open web, I’d like to sketch a plan for restoring it to a decent web-based home.
The Problem: It’s A Scan…Of A Printout
The original medium this source came in had search and good indexing. It was easy to read a thread or find posts by a certain user. The printout format strips away search and mangles the indexing a bit, since page number and post ID are correlated but not quite the same. Worse, the printout was hole-punched at some point so it could go into a binder, and this destroyed some of the information. The scan adds a further layer of obfuscation, as the flaws in the paper documents are magnified by scanning. Subtle details lost in the scanning process make the resulting document highly unpleasant to read.
Restoring Indexing
The archive.org material includes a 500 MB archive of individual JP2 images of the scanned pages. These could easily be imported into a wiki, at which point it would be fairly easy to tag which post IDs appear on each page. This way you’d be able to browse the posts by their ID number, even if that just means seeing the post as an image.
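The tagging idea above boils down to a page-to-post mapping that gets inverted for browsing. Here is a minimal sketch in Python, assuming the tags are collected as a dictionary of page numbers to post IDs. The function name `build_post_index` and the sample numbers are made up for illustration and are not taken from the actual printout.

```python
from collections import defaultdict


def build_post_index(page_tags):
    """Invert a page -> post IDs mapping into post ID -> list of pages."""
    index = defaultdict(list)
    # Walk pages in order so each post's page list comes out sorted.
    for page, post_ids in sorted(page_tags.items()):
        for post_id in post_ids:
            index[post_id].append(page)
    return dict(index)


# Hypothetical tags: scanned page number -> post IDs visible on that page.
page_tags = {
    1: [100, 101],
    2: [101, 102],  # post 101 runs over onto page 2
    3: [103],
}

post_index = build_post_index(page_tags)
print(post_index[101])  # the pages you need to pull up to read post 101
```

A post that spills across a page boundary simply shows up under more than one page, which is exactly the case where page number and post ID stop lining up.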
Restoring Readability and Search
To improve readability through CSS, the posts have to exist as actual text rather than as pictures of text. This means the text must be extracted from the images somehow.
Text Extraction Methods
OCR - I haven’t really been able to get OCR to work very well. So far the packages I’ve used have produced output of sufficiently low quality that it was faster to type. The closest I’ve come to usable results is with this tool, which is specifically designed to recognize fixed-width fonts.
Human Transcription - This is what I’ve had the most success with so far, but it’s labor intensive and slow. One hope is that by doing things on a wiki platform I can enlist the assistance of others in transcribing images.
Mechanical Turk - A subpoint of the above: it occurs to me that I could pay an online crowdsourcing service like Mechanical Turk to transcribe the pages. I’m not sure how much I’d have to pay to get something accurate, but perhaps the fact that the participant is making history might allow me to get services at a lower rate. (So as not to sound callous: I’m given to understand that Turkers are sensitive to that kind of thing, since the wages are low enough that many participants are there to have something to do during breaks at their job. For a ‘normal’ transcription service I would assume it’s just business.)
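Whichever of these methods supplies the text, accuracy is the open question. One common way to check it, which is an assumption on my part and not something the plan above commits to, is to have the same page transcribed twice independently and only look closely at the lines where the two versions disagree. A rough sketch using Python’s standard `difflib`, with a hypothetical `disagreements` helper and made-up sample text:

```python
import difflib


def disagreements(transcript_a, transcript_b):
    """Return (line number in A, lines from A, lines from B) for each
    span where two transcriptions of the same page differ."""
    a = transcript_a.splitlines()
    b = transcript_b.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Line numbers are reported 1-based for human readers.
            flagged.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return flagged


# Made-up sample text, not real 8BBS content.
first = "line one\nHELLO WORLD\nline three"
second = "line one\nHELL0 WORLD\nline three"

for line_no, from_a, from_b in disagreements(first, second):
    print(line_no, from_a, from_b)
```

Lines where both transcribers agree can be accepted without review, so a human only has to go back to the scan for the flagged spans.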
Once the images are transcribed, search is just a matter of getting the text into a platform with decent search. PmWiki has good search, so I’m much less worried about this aspect than I am about being able to get the posts into text at all.
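To show how little search needs once the text exists, here is a toy inverted index in Python. It is only an illustration of the idea and says nothing about how PmWiki’s search works internally. The function names and the sample posts are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_search_index(posts):
    """Map each word to the set of post IDs whose text contains it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            index[word].add(post_id)
    return index


def search(index, query):
    """Return the sorted IDs of posts containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Invented sample posts keyed by a made-up post ID.
posts = {
    1: "Anyone know the dial-up number?",
    2: "The number changed last week.",
    3: "Thanks for the help.",
}

index = build_search_index(posts)
print(search(index, "the number"))
```

None of this is possible against page images, which is why the transcription step is the real bottleneck.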
This blog is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.