Merge: A General Transclusion Facility


Paul Haeberli
Silicon Graphics Computer Systems
30 June 1996

Introduction

This work was inspired by a conversation with Carola Salis during a visit to CRS4 in Cagliari, Sardinia, in October of 1995.
Here's a facility that lets people easily quote parts of HTML documents on the web. This can be used to implement transclusion of sections of HTML from any number of different sources.
For a nice description of the history and applications of this technology please see this recent article by Michael Sippey.

From "Hypermedia Unified by Transclusion"
"The central idea has always been what I now call transclusion, or reuse with original contexts available, through embedded shared instances (rather than duplicate bytes). Thus the user may intercompare contexts of what is reused, both for personal work (keeping track of reuse) and publication (for deep comprehension and study. (footnote 1) ) Transclusion brings to electronic publishing a copyright method that makes republication fair and clean: Each user buys each quotation from its own publisher, assuring proper payment (and encouraging exploration of the original).
Contexts of transclusions must be visually comparable on screen (figure 1 and 2)..."
Theodor Holm Nelson

The Current Implementation
To build a simple transclusion facility, I set up a CGI program on our server that takes a URL of a text file as input. Anyone in the world can run this CGI script with a URL that refers to a .txt file of their choice. This can be added as a link to any page you author.
The composite URL looks like this:
http://reality.sgi.com/grafica/merge.cgi?http://www.here.com/~bob/doc.txt
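To make the URL structure concrete, here is a minimal Python sketch of both ends of that link. It assumes the whole query string is the script URL, with no name=value pairs, as the example above suggests. The function names are made up for illustration and are not part of merge.cgi.

```python
MERGE_CGI = "http://reality.sgi.com/grafica/merge.cgi"


def composite_url(script_url, cgi_url=MERGE_CGI):
    """Build the link an author would put on a page."""
    # Assumption: the script URL is appended verbatim after the "?".
    return cgi_url + "?" + script_url


def script_url_from_environ(environ):
    """Recover the script URL inside a CGI program.

    A CGI program receives everything after the "?" in the
    QUERY_STRING environment variable.
    """
    return environ.get("QUERY_STRING", "")
```

In a real CGI program you would pass os.environ to script_url_from_environ.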
The file doc.txt has this structure:
#
# An HTML merge script
#
maxage 1 hour
puthtml <html>
puthtml <head>
puthtml <title>Merged Docs</title>
puthtml </head>
puthtml <body bgcolor="#ffffff" text="#000000">

puthtml <p><center>Here is an excerpt of Paul's pages</center>

getfile 1 "http://www.sgi.com/grafica/index.html"
putlink 1 green
putfile 1 pat "<img src=go.gif" 0 pat "ts</h3></center>" 15

puthtml <p><center>Here is an excerpt from 
puthtml <a href="http://www.suck.com/daily/dynatables/96/06/28/">
puthtml suck</a></center>

getfile 2 "http://www.suck.com/daily/dynatables/96/06/28/"
putlink 2 red
puthtml <center><b>
putfile 2 pat "<font size=5><i>Wi" 0 pat "<p>day." 6
puthtml </b></center>

puthtml </body>
puthtml </html>
puthtml 
The Commands
The merge.cgi program reads this set of merge instructions and returns a composite HTML file, which the browser then displays.
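As a rough illustration of the reading step, here is a minimal Python sketch that splits a merge script into commands. It is not the actual merge.cgi code. It assumes that lines starting with # are comments, that puthtml takes the rest of its line verbatim, and that every other command uses whitespace-separated arguments with double quotes around strings, as in doc.txt. The name parse_script is hypothetical.

```python
import shlex


def parse_script(source):
    """Return a list of (command, arguments) pairs from a merge script."""
    commands = []
    for raw in source.splitlines():
        stripped = raw.strip()
        # Assumption: blank lines and lines starting with "#" are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        name, _, rest = raw.lstrip().partition(" ")
        if name == "puthtml":
            # The rest of the line is HTML and is kept exactly as written.
            commands.append((name, [rest]))
        else:
            # shlex.split honors the double quotes around URLs and patterns.
            commands.append((name, shlex.split(rest)))
    return commands
```

A putfile line then comes back as a flat list such as ["1", "pat", "<img src=go.gif", "0", "pat", "ts</h3></center>", "15"], which a later step would group into an id and two locations.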
Here's a brief description of the commands:
maxage n minutes|hours|days|weeks|months|years
Specifies how long a cached copy may be kept before it is reloaded when a getfile command is given. The default maxage value is 1 hour. (This is ignored for now.)

getfile id URL
Gives the URL of an HTML document to fetch. The document can then be referenced in putfile using the given id.

puthtml textline
The textline is copied to the output of the merge.cgi program.

putlink id black|white|cream|red|yellow|green|blue
Creates a centered image with a pointer to the original document. The image color may be black, white, cream, red, yellow, green, or blue.

putfile id location location
The id selects one of the input HTML URLs given on a getfile line. The two locations specify the start and end locations in the input HTML file to be copied to the output of the merge.cgi program.
A location can be described in three ways:

pos offset
offset gives a character position in the input HTML file. 0 refers to the first character. If the offset is negative, it is the position from the end of the file.

pat pattern offset
pattern is a string of characters. offset is an offset from the beginning of the matched string. This offset can be positive or negative.

line number
number specifies a line number in the file. If the number is negative, it is the position from the end of the file. Line number -1 is the last line.
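Here is a minimal Python sketch of how the three kinds of location might be turned into character positions. It is not the actual merge.cgi code, and it rests on several assumptions. The end location is treated as inclusive, because in doc.txt the offset 15 lands on the last character of the 16-character pattern "ts</h3></center>". Lines are numbered from 1. A pattern matches its first occurrence in the file. A line used as an end location runs to the end of that line. The names resolve and extract, and the tuple form of a location, are made up for illustration.

```python
def resolve(text, location, is_end=False):
    """Turn ("pos", n), ("pat", s, n) or ("line", n) into a character index."""
    kind = location[0]
    if kind == "pos":
        offset = location[1]
        # Negative offsets count back from the end; -1 is the last character.
        return offset if offset >= 0 else len(text) + offset
    if kind == "pat":
        pattern, offset = location[1], location[2]
        start = text.find(pattern)  # assumption: first occurrence wins
        if start < 0:
            raise ValueError("pattern not found: %r" % pattern)
        return start + offset
    if kind == "line":
        number = location[1]
        lines = text.splitlines(keepends=True)
        # Assumption: lines are numbered from 1, and -1 is the last line.
        index = number - 1 if number > 0 else len(lines) + number
        if not 0 <= index < len(lines):
            raise ValueError("no such line: %d" % number)
        begin = sum(len(line) for line in lines[:index])
        # As an end location, a line extends through its final character.
        return begin + len(lines[index]) - 1 if is_end else begin
    raise ValueError("unknown location type: %r" % kind)


def extract(text, start_location, end_location):
    """Copy the characters from start to end, both inclusive."""
    start = resolve(text, start_location)
    end = resolve(text, end_location, is_end=True)
    return text[start:end + 1]
```

With this sketch, putfile 2 line 3 line 20 would correspond to extract(page, ("line", 3), ("line", 20)).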

What do you think so far? Here's a merge.txt file. Why not give it a try?
Applications

Naturally, this can be used and abused in many interesting ways. For instance:
USE

Add annotations to a view of someone else's document. Make a system that allows anyone to add annotations where they want.
Add a bit of context around an automatically generated press release.
Add your own name to a list of "movers and shakers".
Implement a threaded discussion system. I dare you.
Make a single HTML file to print a multipage linked document.
Make a multipage linked version of a single long document.
Change the background and text colors of existing documents.
Quote a bit of a site along with a pointer to it.
Surround an automatically generated complaint letter with some context. Actually, this is how I got started on this kick...

ABUSE?

Restructure parts of a commercial site.
Make a FrameFree version of a framed site.
Remove advertising from a commercial site.
Add Netscape Frames to an existing site.
Replace advertising on a commercial site with other ads.
Add advertising to a non-commercial site.

Can you think of other applications, good or bad? Please let me know, and I'll add 'em to the lists!
Performance

The performance of this prototype may degrade because of server load or network bandwidth constraints. The current implementation also never caches documents, so every request fetches each source document again. If caching were used, performance could improve dramatically. In that case, getting a transcluded excerpt of a document could take less time than getting the original, if the source document is stored on a slow server. Caching should also reduce the load on the network.
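Here is a minimal Python sketch of the kind of cache that maxage could drive. The prototype does not do this. The names PageCache and maxage_seconds are hypothetical, and the lengths chosen for a month (30 days) and a year (365 days) are assumptions, since the command description doesn't define them. The fetch function and the clock are passed in so the sketch can be tried without a network.

```python
import time

# Assumption: a month is 30 days and a year is 365 days.
UNIT_SECONDS = {
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}


def maxage_seconds(count, unit):
    """Convert the arguments of a maxage command, e.g. ("1", "hour")."""
    return int(count) * UNIT_SECONDS[unit.rstrip("s")]


class PageCache:
    """Keep fetched documents until they are older than max_age seconds."""

    def __init__(self, fetch, max_age=3600, clock=time.time):
        self._fetch = fetch      # function: url -> document text
        self._max_age = max_age  # default of 1 hour, like maxage
        self._clock = clock
        self._entries = {}       # url -> (time fetched, document text)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]      # still fresh: no network traffic
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

A getfile command would then call cache.get(url) instead of fetching directly, so repeated excerpts from a slow server are served from the local copy.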
What's Missing?

The most obvious thing that's missing from this proposal is a payment method for authors whose work is quoted, when this falls outside the domain of fair use. If you have ideas how a royalty scheme could be implemented, please let me know.
Discussion

Richard Lee suggested an alternative to the commands described above. Here's what he said:
Cool! I have a suggestion though. It would be cool if the document that people referenced looked more like HTML with directives in it to do inclusion of other pages. That way, people could use traditional HTML authoring tools (like cosmocreate) to do the main page and just put additional markup in to do the referencing (using view/edit source or insert unknown markup functionality)
For example...something like this for the doc.txt file
<html>
<head>
<title>Merged Docs</title>
<meta quoted-page>
</head>
<body bgcolor="#ffffff" text="#000000">
<h3>Here is an excerpt of Paul's pages</h3>
<quote src="http://www.sgi.com/grafica/index.html">
<pat match="<img src=go.gif" offset=0>
<pat match="ts</h3></center>" offset=15>
</quote>
<h3>Here is an excerpt from Suck</h3>
<center><b>
<quote src="http://www.suck.com/daily/dynatables/96/06/28/">
<pat match="<font size=5><i>Wi" offset=0>
<pat match="<p>day." offset=6>
</quote>
</b></center>
</body>
</html>
Then you would take this file, search out the <quote> markup, and do the appropriate replacements, while still allowing the user to easily do all the other things their authoring tool does well.
I like Richard's suggestion; I'm considering implementing it. I'll probably make the following slight change to the syntax of the quote tag though.
<html>
<head>
<title>Merged Docs</title>
<meta quoted-page>
</head>
<body bgcolor="#ffffff" text="#000000">
<h3>Here is an excerpt of Paul's pages</h3>
<quote src="http://www.sgi.com/grafica/index.html" 
    startpat="<img src=go.gif" startoffset=0
    endpat="ts</h3></center>" endoffset=15>
<h3>Here is an excerpt from Suck</h3>
<center><b>
<quote src="http://www.suck.com/daily/dynatables/96/06/28/"
    startpat="<font size=5><i>Wi" startoffset=0 
    endpat="<p>day." endoffset=6>
</b></center>
</body>
</html>
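Since this syntax isn't implemented yet, here is only a rough Python sketch of how the single-tag form of quote might be found and replaced. The pattern strings contain < and > themselves, so the sketch uses a regular expression that skips over quoted attribute values instead of stopping at the first >. The names find and expand, the regular expressions, and the excerpt_for callback are all made up for illustration.

```python
import re

# A quote tag: any number of name=value attributes, where a value is either
# double-quoted (and may contain < or >) or a bare run of characters.
QUOTE_TAG = re.compile(
    r'<quote\b((?:\s+[a-z]+=(?:"[^"]*"|[^\s>]+))*)\s*>', re.IGNORECASE)
ATTRIBUTE = re.compile(r'([a-z]+)=(?:"([^"]*)"|([^\s>]+))', re.IGNORECASE)


def parse_attributes(text):
    """Return the attributes of one quote tag as a dict of strings."""
    attributes = {}
    for name, quoted, bare in ATTRIBUTE.findall(text):
        attributes[name.lower()] = quoted if quoted else bare
    return attributes


def expand_quotes(html, excerpt_for):
    """Replace each quote tag with whatever excerpt_for returns for it.

    excerpt_for is a hypothetical callback that receives the attribute
    dict (src, startpat, startoffset, endpat, endoffset) and returns the
    excerpt to insert in place of the tag.
    """
    def replace(match):
        return excerpt_for(parse_attributes(match.group(1)))
    return QUOTE_TAG.sub(replace, html)
```

Everything outside the quote tags passes through untouched, which is the point of Richard's suggestion: the rest of the page stays ordinary HTML.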
Kevin Hughes wants to be able to specify locations in HTML documents using line numbers. I like this idea. Here's what Kevin said:

I'd like to be able to specify a piece of a file by line count, since it's easy for many people to figure out the line number they're on in a word processor, etc.
putfile 2 line 3 line 20
...this would incorporate lines 3 to 20 inclusive from file 2.
I'm surprised nobody has done this already. This would make a useful tool for people who manage content on complicated sites, and would cause lots of headaches for copyright lawyers! But I think that's pretty inevitable anyway...
I just added this feature!

Alan J. Flavell expresses a concern about copyright violation. I think that many types of inclusion of a part of another document could be considered fair use. But then again I'm not a lawyer either. This is a very interesting issue that deserves some informed discussion.
Article: 84782 of comp.infosystems.www.authoring.html
Newsgroups: comp.infosystems.www.authoring.html
From: "Alan J. Flavell" <flavell@mail.cern.ch>
Subject: Re: including another page's contents in one page (w/o frames)
Sender: news@news.cern.ch (USENET News System)
Organization: speaking for myself and not for CERN
Date: Wed, 3 Jul 96 03:30:15 1996
Lines: 17

On 2 Jul 1996, Paul Haeberli wrote:

> Here's the idea that lets
> people easily quote HTML documents around the web.
> This can be used to implement transclusion of sections
> of HTML from any number of different sources.

Although it's perfectly OK to point to other documents by using their URL, it's a violation of copyright to include all or part of someone else's document into one's own. (I'm not a lawyer, but people would be well advised to hunt down the various copyright FAQs before contenancing this kind of technique).

best regards
Hey! Did Alan just include a part of my original posting in his reply? Perhaps someone could interpret this copyright FAQ? In the meantime, this article has a nice description of some Rules for Responsible Transclusion.
Andrew Pam just let me know about this nice document that discusses several ways of implementing transclusions. It makes for good reading. Andrew also mentioned this discussion of Transcopyright.

Please contact me if you have other ideas to contribute.
Paul Haeberli