Re: [blogite] SmartReferer 0.1.2


From: lenz (lenzoink@libero.it)
Date: Sat Dec 28 2002 - 21:40:42 GMT


Hi all,
this is a collection of technical notes on SR design, in reply to Ian's
message.

1. There should not really be a lot of 404's.
At most a single 404 error is generated when a non-SR sender site links to a
SR receiver site, no matter how many users traverse that link. Though not
required, a good SR implementation will avoid repeatedly querying a host
that is already known not to be a SR sender.
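
Just to make that concrete, here is a small sketch - my own illustration,
not spec text - of a receiver that remembers hosts whose SR base port came
back 404, so each non-SR sender is queried once at most (the /srport.txt
base port path follows the examples later in this message):

  # Sketch only: cache hosts known not to be SR senders, so a non-SR site
  # is queried once and never again.
  import urllib.request
  import urllib.error
  from urllib.parse import urlsplit

  non_sr_hosts = set()

  def fetch_sr_base_port(referer_uri):
      """Return the SR base port body for the referer's host, or None."""
      host = urlsplit(referer_uri).netloc
      if host in non_sr_hosts:
          return None          # known non-SR sender: no extra 404s
      try:
          url = "http://%s/srport.txt" % host
          with urllib.request.urlopen(url, timeout=10) as response:
              return response.read()
      except urllib.error.HTTPError as err:
          if err.code == 404:
              non_sr_hosts.add(host)   # remember the single 404
          return None
      except urllib.error.URLError:
          return None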

2. "The autodiscovery process based on a single URL is very poor design"
The delegation mechanism implicit in SR ports makes it easy to delegate SR
ports on any system. The SR base port need not itself describe any resource
on the site; it only tells you where to find the correct SR port for a given
resource. Imagine a domain shared among three users, Alice, Bob and Zelda.
The SR base port will simply say (a sketch follows the list):
  - if the URI looks like sitename/home/alice, the correct SR port is
sitename/home/alice/srport.txt
  - if the URI looks like sitename/home/bob, the correct SR port is
sitename/home/bob/srport.php
  - if the URI looks like sitename/home/zelda, the correct SR port is
sitename/home/zelda/srport/
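
To make the delegation concrete, this is the lookup such a base port
expresses, written as a small sketch (the dictionary is purely illustrative;
only the prefix-to-port mapping matters):

  # Sketch only: a base port that just delegates, mapping a URI prefix to
  # the SR port responsible for that part of the site.
  DELEGATIONS = {
      "/home/alice/": "/home/alice/srport.txt",
      "/home/bob/":   "/home/bob/srport.php",
      "/home/zelda/": "/home/zelda/srport/",
  }

  def sr_port_for(path):
      """Return the SR port for a requested path, or None if not delegated."""
      for prefix, port in DELEGATIONS.items():
          if path.startswith(prefix):
              return port
      return None

  sr_port_for("/home/alice/2002/12/entry.html")   # -> "/home/alice/srport.txt"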

It is an open question how to avoid the autodiscovery process if:
  - you can consistently do URL-rewriting on your external links and want to
skip autodiscovery altogether, or
  - you cannot place a SR base port in the root directory of your site but
would still like to participate as a SR sender.

I am thinking about some sort of standard GET parameter to append to an
HTTP GET request, but I'm not sure yet.

3. Case-insensitive longest matching subsequence
URI paths are treated case-insensitively on - at least - Win/IIS systems.
This means that on such systems capitalization errors - both during HTML
development and in manually typed URLs - go unnoticed and are, in my
experience, quite common.
That's why I would lowercase everything in order to make the matching
process easier.

As of 0.1.2, all DOMAIN tags are treated as directory prefixes:
http://foo/bar and http://foo/bar/ are considered exactly alike, and neither
will match http://foo/barcode.php . I am not very enthusiastic about this,
but have no better ideas at the moment.
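
In practice the matching rule I have in mind looks like this sketch (my
reading of the 0.1.2 behaviour, not normative text): lowercase both sides
and treat every DOMAIN entry as a directory prefix.

  # Sketch only: "/bar" and "/bar/" match the same URIs, and neither
  # matches "/barcode.php"; comparison is case-insensitive.
  def domain_matches(domain_entry, uri):
      prefix = domain_entry.lower().rstrip("/") + "/"
      uri = uri.lower()
      return uri == prefix.rstrip("/") or uri.startswith(prefix)

  domain_matches("http://foo/bar", "http://foo/Bar/page.html")   # True
  domain_matches("http://foo/bar", "http://foo/barcode.php")     # False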

4. If the returned SmartPort URL has a querystring part, it should be left
untouched and no res= and from= parameters should be added.

This is meant to make it possible to do URL rewriting, or to add extra
parameters, on dynamic back-end sites, while still passing the original
referer information along on static sites. I agree it is far from elegant.
I'm open to better ideas.
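
A sketch of what I mean (illustrative only; I am assuming here that res=
carries the requested resource and from= carries the original referer):

  # Sketch only: add res= and from= to the SmartPort URL only when it has
  # no querystring of its own; otherwise leave it untouched.
  # (What res/from carry is my assumption for this example.)
  from urllib.parse import urlencode

  def build_port_request(smartport_url, requested_uri, referer_uri):
      if "?" in smartport_url:
          return smartport_url
      params = urlencode({"res": requested_uri, "from": referer_uri})
      return smartport_url + "?" + params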

5. Spam protection

The spam protection rules are not mandatory; they are just a proposed
scheme, and I'm sure there are far better ones. I am modifying the spec to
make this clear (currently it is not).

6. XML format of SR port files

I do not agree with you on relying on DC metadata. Yes, it could be added,
but the whole point of the protocol is to make a working and workable system
that is quite easy to implement and understand. Defining exactly and simply
how data should be returned - albeit quite naively - makes life easier for
SR receiver role implementors.
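
For the record, this is the kind of document I have in mind. The element
names below are the ones already mentioned in this thread (DOMAIN, owner,
description, title-long, icon), but the exact nesting and casing are only my
own example, not the normative schema:

  <!-- illustrative only, not the normative SR port schema -->
  <smartport>
    <owner>Alice</owner>
    <description>Alice's weblog</description>
    <title-long>Alice's notes on the web</title-long>
    <icon>http://example.org/home/alice/icon.png</icon>
    <domain>http://example.org/home/alice</domain>
  </smartport>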

7. The many-different-names scenario

Let's see how SR handles the following case in a real-life situation. As
you say, you can get a number of slightly different referers, like:

    http://example.org
    http://example.org/
    http://www.example.org/
    http://www.example.org/index.html
    http://example.org/?lastModified=2089420986

I agree with you - the easiest and most logical way to obtain the
information would be to query the referring URI itself. But this means every
URI on our website must be able to answer such a question, which makes that
information redundant. I don't want to change my existing PHP-Nuke setup,
and I don't want to edit anything. So IMHO this is not the way to go if you
want to offer a low-cost, simple conversion path.

The second way would be to query a central repository for the SR port. But
I can't imagine who would be willing to run such a service for free if SR
became widely used, and I don't like the idea of such a system going down
and stopping the whole SR web. All centralized systems pay a price in
scalability and fault tolerance, so I don't like this option much.

The third way is to query the site itself at a standard URI, and this is
what the SR base port is for. And since I don't want to query multiple base
ports, I stick to a text file. You want it to be a PHP file? That's fine.
Either you play with your HTTPD settings or you - more simply - redirect all
SR port queries from the SR base port to a SR port written in PHP. This is a
form of delegation. I believe your webserver can handle two more hits for
each link someone makes to you, no matter how many people traverse it.

Anyway, the simplest way to make your site a SR sender is to place, as your
SR base port, a single resource describing whatever the site contains. Be it
http://example.org/srport.txt or http://www.example.org/srport.txt, the file
will be downloaded correctly. As of 0.1.2, you don't have to specify a
domain name to describe a resource. And you are done.
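
To show how cheap this is in the many-names scenario above, here is a
sketch (my own illustration) of how a receiver derives the base port URL:
whichever of the slightly different referer forms comes in, the scheme and
host plus the well-known path from the examples above give the same file.

  # Sketch only: all the example.org referer variants above lead to the
  # same base port URL.
  from urllib.parse import urlsplit

  def base_port_url(referer_uri):
      parts = urlsplit(referer_uri)
      return "%s://%s/srport.txt" % (parts.scheme, parts.netloc)

  base_port_url("http://example.org")                           # http://example.org/srport.txt
  base_port_url("http://example.org/index.html")                # http://example.org/srport.txt
  base_port_url("http://example.org/?lastModified=2089420986")  # http://example.org/srport.txt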

8. Why "sender" and "receiver" roles as different entities?

The reason is simple, and it's more practical than technical.
It's really easy to add a SR base port to your web site.
Think about it: when I want to link to somebody else's site, I never know
what to write, how to spell the URL, or what description to use.
A simple SR base port gives people better information on how to link back to
you, and it's simple and free. It takes five minutes. Why not? No software
to add, no time to lose - same as the robots.txt file. And a simple SR
sender role is done.

After that, you can think about implementing a SR receiver role. I believe
this will be for a minority: it's harder, you need a dynamic back end, and
that's not everybody's lunch. But it has advantages, and I believe two-way
links make for a definitely better web than today's.

Thanks for your time and patience,
l.

PS. I guess I can reuse my own words from what I post here in SR's FAQ
section; am I right, or is there any limit on this? Thanks.

At 03.41 28/12/02, you wrote:

>On Fri, 27 Dec 2002, lenz wrote:
> >
> > http://www.oinko.net/smartreferer/
>
>The idea is intriguing. (For those who haven't read it: it's basically a
>referrer sanitisation system: given a requested URI and a referer URI, it
>will effectively give you the permalink of the referring page.)
>
>Comments:
>
>: SmartReferer [...] SmartPort
>
>"Port" is the wrong technical term; and SmartXXXX makes this sound like
>non-technical marketing-speak. I recommend avoiding the invention of new
>trade names in specifications, and sticking with technically accurate
>pre-existing jargon. (e.g. "referrer authentication" or "canonical
>referrer determination".)
>
>
>: If links are the economics of the web, SmartReferer makes it easier and
>: neater to make a precise balance of who is linking you
>
>That sounds like marketing-speak, and doesn't belong in a spec.
>
>
>: The SR autodiscovery process
>
>The autodiscovery process given (relying on a fixed URI) is a very poor
>design. Many sites on the net are limited to subdirectories. Furthermore,
>administrators are very easily annoyed by repeated 404s appearing in their
>logs (witness the fuss behind the favicon.ico or P3P systems).
>
>
>: case-insensitive longest matching subsequence
>
>This is, IMHO, a poor design. URIs are explicitly case sensitive, and two
>URIs that differ only by a trailing slash, e.g.
>
> http://www.example.com/foo
>
>...and
>
> http://www.example.com/foo/
>
>...are NOT the same resource.
>
>
>: writing special-case XML parsers from scratch
>
>That is to be discouraged. XML parsers are complicated things, and
>reducing the total number of them in the world is a good thing.
>
>Any XML resource should be able to use <![CDATA[ ]]> blocks or whatever,
>without having to worry about running into limitations of custom parsers.
>
>
>: If the returned SmartPort URL has a querystring part, it should be left
>: untouched and no res= and from= parameters should be added.
>
>I disagree that "SmartPort URL"s are the way to go here, but if they are,
>then I think this part of the spec contradicts the part of the spec that
>wants this to work well with static backends.
>
>
>: No single SmartPort document shall be longer than 32k in size. If it is,
>: it should be truncated accordingly.
>
>That appears to be an arbitrary limitation. Also, truncating an XML file
>makes it illformed, and XML processors MUST refuse to handle illformed XML
>files.
>
>
>: Spam protection
>
>In my opinion, it is in the interest of the market to leave spam
>protection at the informative (non-normative) level, and let different
>implementations develop their own systems. If you explicitly state what
>protections are to be used, then spammers will know exactly what to avoid
>doing.
>
>Also, blacklists and whitelists are a maintenance nightmare.
>
>
>: XML format of SmartPort files
>
>This really shouldn't be an appendix.
>
>
>: <owner> ... <description> ... <title-long> ... <icon>
>
>Specifications should pick one area, and only try to address that one
>area. In this case, metadata should not be addressed by a referrer
>authentication system. Leave metadata to the RDF or Dublin Core guys,
>don't try to mix it in with your own spec. (Trackback made this mistake.)
>
>
>: You have a right to develop software based on this document provided
>: that such software will be distribuited as freeware under the GNU
>: General Public Licence.
>
>The whole point of specifications is that they should be freely
>implementable by anyone. Why limit it to a tiny subset of the population?
>
>
>: In order to implement the SmartReferer 'receiver role' protocol
>: (i.e. the autodiscovery mechanism) in a commercial software or in a
>: paid-for environment, you will have to hold a signed licence for doing
>: so.
>
>I am not a lawyer, but unless you own a patent on this stuff, I think this
>is not a restriction you can levy.
>
>
>Generally, I'm not convinced this is the way to go. As I understand it,
>the problem is this:
>
> Given a requested URI and a referer for that URI, determine the
> canonical URI for the referring resource for the purposes of a link
> back to the referring resource from the requested resource.
>
>In this respect, it appears to be very similar to pingback, where pingback
>is a way for the referring resource to specifically announce the existence
>of a link on the referring resource to the requested resource.
>
>As I see it there are three types of referring resources:
>
> 1. Those that are dynamically generated.
>
> 2. Those that are static pages but automatically generated, typically
> resulting in having multiple pages that contain a particular link,
> but only one canonical URI for that link.
>
> 3. Those that are static pages with unique URIs.
>
>The first category can cope with any canonical referer discovery system,
>since it can be programmed to respond as required.
>
>The second category poses the most trouble, but it can easily cope with
>any system that only requires modifying the referring pages or providing a
>single static response file.
>
>The third is simple: the canonical referrer is the actual referrer, minus
>any fragment identifier.
>
>
>Note that the most common scenario is where a site's main page, or a
>content aggregator, has grouped many resources under one URI, with the
>result that sites get multiple hits from subtly different URIs, e.g.:
>
> http://example.org
> http://example.org/
> http://www.example.org/
> http://www.example.org/index.html
> http://example.org/?lastModified=2089420986
>
>The last one is especially common, and illustrates one problem, which is
>that many of these are URIs that the site itself doesn't know about.
>
>
>In the problem scenario, we have two URIs:
>
> the requested URI
> the referrer URI
>
>No assumptions can be made; the referrer might not be HTTP, for example,
>so the only possible way of determining the canonical referrer URI is to
>ask the URI we have available.
>
>The logical next step, therefore, is to request the referrer URI.
>
>At this point, we have several options as far as a spec goes. Pingback's
>mechanism is probably the simplest: provide either an HTTP header or a
>specially formatted <link> element pointing to a source for further
>details on the canonical URI for this resource, given the information that
>it should contain a link to the requested URI.
>
>
>Note that there might be several. For example, if
>
> http://ln.hixie.ch/
>
>...links to
>
> http://www.example.net/
>
>...in two blog entries, then there are two canonical versions of the
>referrer URI. (As far as I can tell the current spec doesn't deal with
>this, by the way.)
>
>
>The obvious next step is to make the HTTP header or <link> element point
>to an XML-RPC server, which can then be communicated with to get a list,
>using an interface such as:
>
> pingback.getCanonicalURI(referrerURI, requestedURI) : array of URIs
>
>However, this does not cater for the static case.
>
>
>I don't know how we can truly cope with the static case. My own Web log,
>for example, can be accessed through at least 6 separate domain names,
>with any number of different arguments... and it only has one URI if you
>ignore the query part, since all the permalinks are merely the domain with
>a query part added on the end. And a defaulting rule can't be used,
>because it also has some other files in a /resources/ directory that are
>unrelated to the Web log material. (The current spec doesn't cope with
>this either.)
>
>
>I'll let you know if I can think of a solution for the static case.
>
>
>Note: You may only use these ideas if you agree not to limit their use.
>
>--
>Ian Hickson )\._.,--....,'``. fL
>"meow" /, _.. \ _\ ;`._ ,.
>http://index.hixie.ch/ `._.-(,_..'--(,_..'`-.;.'
>
>Message sent over the Blogite mailing list.
>Archives: http://www.aquarionics.com/misc/archives/blogite/
>Instructions: http://www.aquarionics.com/misc/blogite/

Message sent over the Blogite mailing list.
Archives: http://www.aquarionics.com/misc/archives/blogite/
Instructions: http://www.aquarionics.com/misc/blogite/


