Neil's News

DocEng 2009

21 September 2009

These are my notes from the four-day DocEng 2009 conference in Munich. What follows are my own thoughts on selected talks. This does not attempt to be an objective record by any means.

Document Versioning

The conference started with a working session on document versioning. The most common versioning systems currently in existence are devoted to tracking computer code. CVS, Subversion and Git are prime examples. Programmers depend very heavily on these tools. Yet version control tools are not widespread beyond this field. Microsoft Word does have a change tracking feature, but otherwise the field is barren. This lead to several open questions which are connected to each other:

  • What fields currently do not have version control, would benefit from it, but don't know it yet? A simple example might be a supermarket checkout. Currently a cashier can only process one order at a time. If the transaction is suspended due to a clerk needing to do a price check or fetch some non-broken eggs, everyone else in the queue has to wait. There is no provision for saving the current state, then starting a new transaction for someone else. Furthermore, once complete, a supermarket transaction cannot be reopened in case of later adjustment. A return or error involves creating additional transactions.
  • Versioning currently has a very poor user interface. It is not surprising that version control tools are only used by programmers if the UI for these tools consist of largely command-line interaction. How could version control be made more intuitive for non-technical users? What would a hypothetical iVersion application by Apple look like?
  • Useful version control allows for automatic merging between branches. A feature apparently not currently available is the ability to define constraints in the result. Such as reference constraints. Or the requirement that the resulting code compile.
  • Ultimately, does the version control field need to fragment? Different fields have very different requirements for versioning. Code version control might be better if it had awareness of where in the product cycle the code was, so that conflicting merges could be resolved automatically on a best-effort basis early on, but get progressively more paranoid as the scheduled release approaches. Version control for lawyers, doctors and musicians might similarly have their own domain specific requirements.

This was a very interesting discussion. The general feeling was that version control is a tool with far greater potential than its current reach. This working session dissolved when most of the Silicon Valley participants succumbed to jet lag and retired for the day.

Efficient Change Control of XML Documents

Sebastian Rönnau presented a pair of diff and merge/patch algorithms for XML. An apparently simple change (such as bolding a word) to an XML document (such as ODF or OOXML) can result in a disproportionate number of changes to the XML tree. Many diff algorithms will result in large deltas. These deltas span many nodes and result in collisions when patched onto trees which have been modified. Sebastian's diff algorithm is designed to create smaller diffs and to create them faster.

Matching nodes between two trees is done on the basis of a contextual fingerprint. Although I'm still chewing my way through the paper, one question which jumps out is whether adding unique IDs to each node would vastly improve the performance (view the source of any Google Docs file to see where this idea came from). Clearly introducing ID attributes is beyond the scope of a stand-alone diff algorithm, however it is within the scope of an integrated system.

Several charts were shown showing the performance of this diff algorithm when compared with other implementations currently available. A few of the charts produced laughter as they showed Microsoft XML Diff producing bizarrely bad performance.

Despite having both "Efficient" and "XML" in the same title, this paper won the best paper award. For me personally, this was the most interesting talk of the conference since Sebastian's diff and patch algorithms appear to be suitable for plugging directly into my differential synchronization algorithm, thus allowing MobWrite to deal with rich text. I can safely say on behalf of everyone that Sebastian was the life of the conference, devoting his limitless energy to organizing every detail, making sure everyone was taken care of and being an outstanding host. I can't think of many computer scientists who would run into a crowded street and start directing traffic when our bus attempted an ill-advised U-turn.

Differential Synchronization [PDF]

I presented the differential synchronization algorithm which powers MobWrite. First, the audience was invited to participate in a shared MobWrite document. As expected the system handled the load without problems. Unexpectedly this document remained active throughout the rest of the day and acted as an unofficial back-channel for people to exchange random notes.

Next, I demonstrated the dangers of a divergent system where any mutation results in a runaway forking of the document. One demonstration showed a dual-stack Tetris game which behaved properly when each stack was given identical inputs, but which quickly fell apart when a single rotate command was injected into one stack but not the other. The other demonstration involved injecting a similar error into EtherPad, a real-time collaborative web editor which relies on passing all edits to remain consistent.

As an alternative to divergent edit-passing systems, I walked through the differential synchronization algorithm. As a convergent system, it guarantees consistent results regardless of programming errors, network errors or memory errors.

Finally, I addressed the surprising finding that in JavaScript a minimal O(n) checksum which would be required to use a divergent system is too slow to handle a one megabyte document, whereas a more complicated diff algorithm executes in O(log n) time. It is counter-intuitive that diff could be faster than a simple checksum, but JavaScript operations do not have conventional performance characteristics.

Aesthetic Measure of Alignment and Regularity

[HP] Helen Balinsky of Hewlett Packard presented an algorithm which determined how good (or bad) a document looked. In particular the algorithm analyzed whether the objects making up the document were aligned. Photos should generally be the same width, be flush with the left or right margins, etc.

This sounds pretty trivial. But the problem is much more complicated that it appears. Consider an image such as the HP logo to the right. Your web browser has aligned it at the edge of the image (line D). But the image content (ignoring my A/B/C/D annotations) doesn't begin until the ® symbol (line C). But aligning to the symbol doesn't look correct; ideally the logo itself should be aligned to the page edge, and the symbol should overhang. That leaves the open question of whether the logo's shadow (line B) or the logo proper (line A) should be the place to set the alignment against. The presented algorithm used a Hough transform[?] to determine good candidate lines in images.

Another issue is the non-linear nature of alignment. Two objects properly aligned is perfect. A few pixels out of alignment is also fine, since it is imperceptible. But once the misalignment becomes perceptible, it gets the worst possible score since the result looks very wrong. Then as the misalignment increases, the score improves since it appears more intentional and less like a sloppy accident.

An E-Writer for Documents plus Strokes

Richo presented a hardware device for filling out forms electronically. While looking superficially like an Amazon Kindle, the E-Writer differs in that it is designed to accept handwriting via stylus input. The Kindle's screen proved unusable for this task, since its slow update speed would make writing appear a couple of seconds after one had written it. Amongst other things, this lag would make it difficult for cursive writers to cross a 't' or dot an 'i' since the word would not be visible for a while.

Unlike with a traditional pen and paper form where only the final state is visible, the E-Writer records the date of each stroke thus allowing one to query the state at any given time. This could reveal previous versions of a form which is filled out in several sessions or a prolonged period.

One of the findings of the initial research was that most forms are not structured correctly to capture all the information needed, thus leading to notes scribbled in the margins. Allowing a free-form stylus input and additional blank pages after each form provides users with the familiar method of adding extra information.

The E-Writer also includes a camera. Pictures taken may be attached to forms.

A completed document is composed of the base image of the form, the list of strokes, and any attached images. This represents an edit-passing system, however unlike the ones in my earlier demonstration, the E-Writer does not suffer from catastrophic divergence in the event of a mutated edit since all edits are absolutely positioned on the page and subsequent edits do not depend on previous edits for context.

User testing revealed a preference for using pen and paper over this device, however there was confidence that as improvements are made (such as reducing weight) the E-Writer would eventually win out.

Cost was not mentioned. There was also no mention of automated interpretation of the resulting documents.

Movie Script Markup Language

A group of researchers from Ghent University have devised a format for encoding movie scripts as XML. A traditional movie script might look like:

  JOE drives recklessly.  ANDI sits next to him.

  How much longer?

  Couple o' hours.  You okay?

Whereas their XML version might be:

  <scene name="Jeep" INT>
      <char name="Joe"\> drives recklessly.
      <char name="Andi"\> sits next to him.
    <dialog character="Andi">
      <actionspan class="shouting">
        <dialogspan>How much longer?</dialogspan>
    <dialog character="Joe">
      Couple o' hours.  You okay?

This change of format offers several advantages.

  • Shooting and lighting instructions can be included in a unified script as opposed to being hand annotated on the director's copy.
  • Keeps better track of time due to a timing model.
  • Once shot, each piece of film can be labeled with the scene's ID, thus creating a database of video linked to the XML file.
  • Enables semantic repurposing so that content may be changed for cinema vs TV versions (similar to CSS which may specify different output for different media).
  • Drives rough CGI previsualizations of scenes (currently these are sometimes done with Ken and Barbie dolls in paper models of the sets).
  • It is easy to filter the XML for only the points of interest for each actor or staff member.

While it is not particularly newsworthy to take an existing data format and recode it in XML, there is something unique about Movie Script Markup Language: it is the only XML format where the rendering engine is made of people.

From Rhetorical Structures to Document Structure: Shallow Pragmatic Analysis for Document Engineering

This talk started at 8:30 in the morning to a mostly empty room due to the scrumptious Bavarian banquet thrown for us the previous night.

A Panlingual Anomalous Text Detector

Ashok Popat of Google described a problem faced by Google Book Search's large-scale scanning of books from every language. In addition to the usual OCR errors of individual letters, there are periodic massive errors generated by a variety of causes. Ashok showed some very interesting examples of these errors:

  • Chinese characters OCRed against the Latin character set, resulting in completely random letters.
  • Chinese written in top to bottom columns but interpreted as left to right rows. Every character was correct but they were in a scrambled order.
  • An image scanned as text. Instead of recognizing the block as an image and creating a bitmap, it was interpreted as text and returned random letters.

The obvious solution of running the text against a dictionary and rejecting anything with a very high rate of mis-spellings fails in some other interesting cases:

  • Obscure language like Saxon. Even if the scan is fine the whole book would be declared a mis-scan because there's no Saxon dictionary available.
  • Many works are heavy with jargon which while correct is not in any dictionary. Adding these terms to the global dictionary makes the dictionary useless since the total number of jargon words and acronyms is vast.

An efficient algorithm was presented which scores content on whether it appears to follow the patterns of a natural language. Validating the accuracy of the solution seemed like a bigger problem than creating the algorithm. At issue is that human raters have a significant error rate, and in some cases the algorithm appears to have exceeded human ability in scoring the text quality.

Linguistic Editing Support

Michael Piotrowski discussed creating an editor which parses the text so that it can offer higher level assistance in editing tasks. Sort of like Microsoft Word's grammar checker and suggestion engine on steroids, but without all the false-positives.

However one of the questions asked afterwards which I found most interesting. Computer code is far easier to read when syntax highlighted than not. Could something similar be done to improve the legibility of natural languages? Maybe something along the lines of colouring verbs green, nouns blue, etc. One by one we realized that to a certain extent this had already occurred. German capitalizes all Nouns which makes them stand out in a Document. As Michael said, "Quotes from other people are often italicized." which helps separate the contexts. Punctuation such as question marks are usually redundant, but help flag questions. And paragraphs separate logical blocks.

It would be interesting to see if there are additional conventions that might help. For instance, a grammar-aware editor could automatically underline objective noun phrases and overline subjective noun phrases. The focus of this presentation was what sort of assistance and indicators would be helpful in an editor, but a question with larger potential impact would be what indicators might be helpful to a reader.

[DocEng 09]

< Previous | Next >

Legal yada yada: My views do not necessarily represent those of my employer or my goldfish.