21 September 2009
These are my notes from the four-day DocEng 2009 conference in Munich. What follows are my own thoughts on selected talks. This does not attempt to be an objective record by any means.
The conference started with a working session on document versioning. The most common versioning systems currently in existence are devoted to tracking computer code. CVS, Subversion and Git are prime examples. Programmers depend very heavily on these tools. Yet version control tools are not widespread beyond this field. Microsoft Word does have a change tracking feature, but otherwise the field is barren. This lead to several open questions which are connected to each other:
This was a very interesting discussion. The general feeling was that version control is a tool with far greater potential than its current reach. This working session dissolved when most of the Silicon Valley participants succumbed to jet lag and retired for the day.
Efficient Change Control of XML Documents
Sebastian Rönnau presented a pair of diff and merge/patch algorithms for XML. An apparently simple change (such as bolding a word) to an XML document (such as ODF or OOXML) can result in a disproportionate number of changes to the XML tree. Many diff algorithms will result in large deltas. These deltas span many nodes and result in collisions when patched onto trees which have been modified. Sebastian's diff algorithm is designed to create smaller diffs and to create them faster.
Matching nodes between two trees is done on the basis of a contextual fingerprint. Although I'm still chewing my way through the paper, one question which jumps out is whether adding unique IDs to each node would vastly improve the performance (view the source of any Google Docs file to see where this idea came from). Clearly introducing ID attributes is beyond the scope of a stand-alone diff algorithm, however it is within the scope of an integrated system.
Several charts were shown showing the performance of this diff algorithm when compared with other implementations currently available. A few of the charts produced laughter as they showed Microsoft XML Diff producing bizarrely bad performance.
Despite having both "Efficient" and "XML" in the same title, this paper won the best paper award. For me personally, this was the most interesting talk of the conference since Sebastian's diff and patch algorithms appear to be suitable for plugging directly into my differential synchronization algorithm, thus allowing MobWrite to deal with rich text. I can safely say on behalf of everyone that Sebastian was the life of the conference, devoting his limitless energy to organizing every detail, making sure everyone was taken care of and being an outstanding host. I can't think of many computer scientists who would run into a crowded street and start directing traffic when our bus attempted an ill-advised U-turn.
Differential Synchronization [PDF]
I presented the differential synchronization algorithm which powers MobWrite. First, the audience was invited to participate in a shared MobWrite document. As expected the system handled the load without problems. Unexpectedly this document remained active throughout the rest of the day and acted as an unofficial back-channel for people to exchange random notes.
Next, I demonstrated the dangers of a divergent system where any mutation results in a runaway forking of the document. One demonstration showed a dual-stack Tetris game which behaved properly when each stack was given identical inputs, but which quickly fell apart when a single rotate command was injected into one stack but not the other. The other demonstration involved injecting a similar error into EtherPad, a real-time collaborative web editor which relies on passing all edits to remain consistent.
As an alternative to divergent edit-passing systems, I walked through the differential synchronization algorithm. As a convergent system, it guarantees consistent results regardless of programming errors, network errors or memory errors.
Aesthetic Measure of Alignment and Regularity
Helen Balinsky of Hewlett Packard presented an algorithm which determined how good (or bad) a document looked. In particular the algorithm analyzed whether the objects making up the document were aligned. Photos should generally be the same width, be flush with the left or right margins, etc.
This sounds pretty trivial. But the problem is much more complicated that it appears. Consider an image such as the HP logo to the right. Your web browser has aligned it at the edge of the image (line D). But the image content (ignoring my A/B/C/D annotations) doesn't begin until the ® symbol (line C). But aligning to the symbol doesn't look correct; ideally the logo itself should be aligned to the page edge, and the symbol should overhang. That leaves the open question of whether the logo's shadow (line B) or the logo proper (line A) should be the place to set the alignment against. The presented algorithm used a Hough transform[?] to determine good candidate lines in images.
Another issue is the non-linear nature of alignment. Two objects properly aligned is perfect. A few pixels out of alignment is also fine, since it is imperceptible. But once the misalignment becomes perceptible, it gets the worst possible score since the result looks very wrong. Then as the misalignment increases, the score improves since it appears more intentional and less like a sloppy accident.
An E-Writer for Documents plus Strokes
Richo presented a hardware device for filling out forms electronically. While looking superficially like an Amazon Kindle, the E-Writer differs in that it is designed to accept handwriting via stylus input. The Kindle's screen proved unusable for this task, since its slow update speed would make writing appear a couple of seconds after one had written it. Amongst other things, this lag would make it difficult for cursive writers to cross a 't' or dot an 'i' since the word would not be visible for a while.
Unlike with a traditional pen and paper form where only the final state is visible, the E-Writer records the date of each stroke thus allowing one to query the state at any given time. This could reveal previous versions of a form which is filled out in several sessions or a prolonged period.
One of the findings of the initial research was that most forms are not structured correctly to capture all the information needed, thus leading to notes scribbled in the margins. Allowing a free-form stylus input and additional blank pages after each form provides users with the familiar method of adding extra information.
The E-Writer also includes a camera. Pictures taken may be attached to forms.
A completed document is composed of the base image of the form, the list of strokes, and any attached images. This represents an edit-passing system, however unlike the ones in my earlier demonstration, the E-Writer does not suffer from catastrophic divergence in the event of a mutated edit since all edits are absolutely positioned on the page and subsequent edits do not depend on previous edits for context.
User testing revealed a preference for using pen and paper over this device, however there was confidence that as improvements are made (such as reducing weight) the E-Writer would eventually win out.
Cost was not mentioned. There was also no mention of automated interpretation of the resulting documents.
Movie Script Markup Language
A group of researchers from Ghent University have devised a format for encoding movie scripts as XML. A traditional movie script might look like:
INT. JEEP JOE drives recklessly. ANDI sits next to him. ANDI (shouting) How much longer? JOE Couple o' hours. You okay?
Whereas their XML version might be:
<scene name="Jeep" INT> <action> <char name="Joe"\> drives recklessly. <char name="Andi"\> sits next to him. </action> <dialog character="Andi"> <actionspan class="shouting"> <dialogspan>How much longer?</dialogspan> </actionspan> </dialog> <dialog character="Joe"> Couple o' hours. You okay? </dialog> </scene>
This change of format offers several advantages.
While it is not particularly newsworthy to take an existing data format and recode it in XML, there is something unique about Movie Script Markup Language: it is the only XML format where the rendering engine is made of people.
From Rhetorical Structures to Document Structure: Shallow Pragmatic Analysis for Document Engineering
This talk started at 8:30 in the morning to a mostly empty room due to the scrumptious Bavarian banquet thrown for us the previous night.
A Panlingual Anomalous Text Detector
Ashok Popat of Google described a problem faced by Google Book Search's large-scale scanning of books from every language. In addition to the usual OCR errors of individual letters, there are periodic massive errors generated by a variety of causes. Ashok showed some very interesting examples of these errors:
The obvious solution of running the text against a dictionary and rejecting anything with a very high rate of mis-spellings fails in some other interesting cases:
An efficient algorithm was presented which scores content on whether it appears to follow the patterns of a natural language. Validating the accuracy of the solution seemed like a bigger problem than creating the algorithm. At issue is that human raters have a significant error rate, and in some cases the algorithm appears to have exceeded human ability in scoring the text quality.
Linguistic Editing Support
Michael Piotrowski discussed creating an editor which parses the text so that it can offer higher level assistance in editing tasks. Sort of like Microsoft Word's grammar checker and suggestion engine on steroids, but without all the false-positives.
However one of the questions asked afterwards which I found most interesting. Computer code is far easier to read when syntax highlighted than not. Could something similar be done to improve the legibility of natural languages? Maybe something along the lines of colouring verbs green, nouns blue, etc. One by one we realized that to a certain extent this had already occurred. German capitalizes all Nouns which makes them stand out in a Document. As Michael said, "Quotes from other people are often italicized." which helps separate the contexts. Punctuation such as question marks are usually redundant, but help flag questions. And paragraphs separate logical blocks.
It would be interesting to see if there are additional conventions that might help. For instance, a grammar-aware editor could automatically underline objective noun phrases and overline subjective noun phrases. The focus of this presentation was what sort of assistance and indicators would be helpful in an editor, but a question with larger potential impact would be what indicators might be helpful to a reader.