Respec: [META] Implement cross-references feature

Created on 11 May 2018 · 13 Comments · Source: w3c/respec

Today, creating a cross-reference in ReSpec requires Editors to manually search for the id of a term by going into another spec and copying the URL. This means that Editors need to write, for example:

<a data-cite="HTML/webappapis.html#eventhandler">EventHandler</a>

For most Editors, this can be quite labor intensive and error prone. Also, the id being referenced might be removed or changed, causing cross references to break.

Ultimately, spec Editors should be able to just write this markup:

<a>EventHandler</a>

And ReSpec, together with some hints, should know what the user means (i.e., the EventHandler typedef, as defined in the HTML specification).


This project aims to make it easy for Editors to link terms defined in other specs. ReSpec should be able to work out by itself where EventHandler is defined and automatically add the link (and a citation of the linked spec) to it.

This feature will extend the present data-cite attribute. Initially, this feature will link to dfn and IDL fragments only (later stages may support headings and other components).

In case ReSpec is not able to find a reference, or if there is some ambiguity, it will inform Editors so they can take appropriate action.

Markup

<a>EventHandler</a>

<!-- really mean the IDL -->
<a data-lt="EventHandler">event handler</a>

<!-- all references are normative by default. -->
<!-- explicit non-normative references: -->
<a class="informative">EventHandler</a>
<a class="informative" data-lt="EventHandler">event handler</a>

<!-- shorthand syntax. (in future) -->
The {{EventHandler}} typedef.

Keep in mind that EventHandler (type = typedef) is not the same as event handler (type = dfn).

The above will also be applicable to <dfn> (except the shorthand syntax).

All references are normative by default, unless used inside some non-normative parent (like being nested in some .note, .ednote, figure, .example) or having .informative themselves.
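The normative-by-default rule can be sketched as a walk up the ancestor chain. This is only an illustration: `SketchNode` is a hypothetical stand-in for a DOM element, and the list of non-normative containers contains just the ones named above (the proposal says "like", so the real list may be longer).

```typescript
// Hypothetical stand-in for a DOM element; not a ReSpec API.
interface SketchNode {
  tag: string;
  classes: string[];
  parent?: SketchNode;
}

// Non-normative containers named in the proposal (assumed non-exhaustive).
const NON_NORMATIVE_CLASSES = ["note", "ednote", "example"];
const NON_NORMATIVE_TAGS = ["figure"];

function hasAnyClass(node: SketchNode, names: string[]): boolean {
  return node.classes.some((c) => names.indexOf(c) !== -1);
}

// A reference is normative unless it carries .informative itself
// or is nested inside a non-normative container.
function isNormative(link: SketchNode): boolean {
  if (hasAnyClass(link, ["informative"])) {
    return false;
  }
  let ancestor: SketchNode | undefined = link.parent;
  while (ancestor !== undefined) {
    const inContainer =
      hasAnyClass(ancestor, NON_NORMATIVE_CLASSES) ||
      NON_NORMATIVE_TAGS.indexOf(ancestor.tag) !== -1;
    if (inContainer) {
      return false;
    }
    ancestor = ancestor.parent;
  }
  return true;
}
```

A plain `<a>` inside a `<section>` would come out normative here, while the same `<a>` inside a `.note` or with `class="informative"` would not.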

Handling Ambiguity

In case of ambiguity, the spec writer should be able to resolve it by providing additional information.

<a>URL parser</a>
<!--
  is ambiguous, as defined at:
  - https://www.w3.org/TR/appmanifest/#dfn-parse
  - https://html.spec.whatwg.org/multipage/infrastructure.html#url-parser
  - https://url.spec.whatwg.org/#concept-url-parser
-->

The following markup may be used to provide additional information:

<!-- following are equivalent in terms of what the resultant link is -->
<!-- each says: look for "URL parser" in spec with shortname `url` -->
<a data-cite="url">URL parser</a>
<a data-cite="url" data-lt="URL parser">URL parsing</a>
<p data-cite="url"> 
  <a>URL parser</a> <!-- unless a local dfn for URL parser exists -->
</p>

<!-- overall markup pattern -->
<a>TERM</a>
<a data-cite="SPEC">TERM</a>
<a data-cite="SPEC" data-lt="TERM">ALT TERM</a>

Using data-cite to provide more information

The data-cite attribute can be defined in the following ways
(_in order of increasing precedence (locality) and decreasing risk of ambiguity; each higher precedence overrides the lower ones_):

body: a space separated list of specification ids (as per SpecRef or W3C Short Names for specs). The terms in entire spec will be searched in these specs. The risk of an unresolved ambiguity is maximum here.

<body data-cite="spec1 spec2 spec3">

section: a space separated list of short names. The terms in this section and its subsections will be resolved in these specs (unless an override is given in a subsection).

<section data-cite="spec1 spec2">
  <a>TERM</a> is searched in spec1,spec2
  but not in spec3 which is defined at a level of lower precedence.
</section>

That is, data-cite can be defined at any level (<p> and <span> are also valid), with the most local data-cite overriding those above it. The data-cite of the closest parent is the one applied to the TERM.

a or dfn: a single specification short name in which the current element's term will be searched for. The risk of an ambiguity is minimum here.

<a data-cite="spec">TERM</a>

An empty data-cite shall be used to denote a local reference explicitly. Otherwise, all terms whose definitions couldn't be found locally shall be looked up externally. The closest empty [data-cite] ancestor will be considered for local references.
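The precedence rules above (closest data-cite wins, an empty data-cite means "local only") can be sketched as follows. `CiteNode`, `Scope` and `resolveScope` are hypothetical names used only for illustration, and the sketch assumes data-cite holds a space separated list of short names; it does not handle the existing `spec/path#hash` form.

```typescript
// Hypothetical stand-in for an element and its ancestors.
interface CiteNode {
  // Value of the data-cite attribute; undefined when the attribute is absent.
  dataCite?: string;
  parent?: CiteNode;
}

type Scope =
  | { kind: "local" }                    // closest data-cite is empty
  | { kind: "specs"; specs: string[] }   // closest data-cite names specs
  | { kind: "unscoped" };                // no data-cite on the element or any ancestor

// Walk from the element itself up to the root and stop at the first data-cite.
function resolveScope(node: CiteNode): Scope {
  let current: CiteNode | undefined = node;
  while (current !== undefined) {
    if (current.dataCite !== undefined) {
      const specs = current.dataCite.split(/\s+/).filter((s) => s.length > 0);
      return specs.length === 0 ? { kind: "local" } : { kind: "specs", specs };
    }
    current = current.parent;
  }
  return { kind: "unscoped" };
}
```

With `<body data-cite="spec1 spec2 spec3">` wrapping `<section data-cite="spec1 spec2">`, an `<a>` inside the section resolves to `spec1` and `spec2` only, which matches the section example above.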

Examples

Let term be "request". It is defined in the following specs as:

Let specs be the value defined by the closest parent's data-cite.

If specs is null, it is ambiguous.
If specs is "webusb", it is unambiguous.
If specs is "webusb service-workers", result is ambiguous (defined in both).
If specs is "service-workers", result is ambiguous (defined twice)
If specs is "fetch", result is ambiguous (defined thrice)
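The cases above can be reproduced with a small lookup. The definition list below is hypothetical: only the per-spec counts (one in webusb, two in service-workers, three in fetch) are taken from the cases listed, and the URIs are placeholders.

```typescript
interface Definition {
  spec: string;
  uri: string;
}

// Hypothetical data: counts mirror the listed cases, URIs are placeholders.
const REQUEST_DEFINITIONS: Definition[] = [
  { spec: "webusb", uri: "#placeholder-1" },
  { spec: "service-workers", uri: "#placeholder-2" },
  { spec: "service-workers", uri: "#placeholder-3" },
  { spec: "fetch", uri: "#placeholder-4" },
  { spec: "fetch", uri: "#placeholder-5" },
  { spec: "fetch", uri: "#placeholder-6" },
];

type Verdict = "unambiguous" | "ambiguous" | "not found";

// specs === null means no data-cite applies, so every spec is searched.
function classify(specs: string[] | null, defs: Definition[]): Verdict {
  let matches: Definition[];
  if (specs === null) {
    matches = defs;
  } else {
    const wanted = specs;
    matches = defs.filter((d) => wanted.indexOf(d.spec) !== -1);
  }
  if (matches.length === 0) {
    return "not found";
  }
  return matches.length === 1 ? "unambiguous" : "ambiguous";
}
```

Under these assumptions, only `classify(["webusb"], REQUEST_DEFINITIONS)` yields a single match; every other case listed above yields more than one.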

Handling ambiguous results

Reducing ambiguity on back end

Only retrieve terms that have the export attribute

For example, "utf-8 encode" is defined in https://encoding.spec.whatwg.org/#utf-8-encode and https://html.spec.whatwg.org/multipage/infrastructure.html#utf-8-encode but the encoding spec is clearly the rightful candidate here.

If we only query the exported terms, we can reduce the chances of an ambiguity significantly (possibly eliminate them altogether).

Make use of spec from data-cite data
as explained above.

Resolving ambiguity on client side

Current specs have higher precedence over snapshots

Make use of IDL definitions
Consider "Request" being used as:

<pre class="idl">
partial interface Request {
  // Other stuff
};
</pre>
<p>The <dfn data-cite="fetch">Request</dfn> object is defined in fetch.</p>

Here, Request is used in IDL (as an interface), not as a dfn. Hence, the ambiguity is resolved to https://fetch.spec.whatwg.org/#request.
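A minimal sketch of this client-side step: when the term is also used in an IDL block, keep only the candidates whose type is an IDL type. `ASSUMED_IDL_TYPES` is a guess (IDL_TYPES is still to be defined, see the TODO list), and `Candidate` and `preferIdl` are hypothetical names.

```typescript
interface Candidate {
  spec: string;
  type: string;
  uri: string;
}

// Assumption: the real IDL_TYPES list is not defined yet.
const ASSUMED_IDL_TYPES = ["interface", "typedef"];

// If the term appears in IDL, narrow the candidates to IDL types.
// Falls back to the full list when narrowing would leave nothing.
function preferIdl(candidates: Candidate[], usedInIdl: boolean): Candidate[] {
  if (!usedInIdl) {
    return candidates;
  }
  const idlOnly = candidates.filter((c) => ASSUMED_IDL_TYPES.indexOf(c.type) !== -1);
  return idlOnly.length > 0 ? idlOnly : candidates;
}

// Illustrative candidates for "Request" in fetch; the dfn entry is an assumed example.
const fetchCandidates: Candidate[] = [
  { spec: "fetch", type: "interface", uri: "https://fetch.spec.whatwg.org/#request" },
  { spec: "fetch", type: "dfn", uri: "https://fetch.spec.whatwg.org/#concept-request" },
];

const resolved = preferIdl(fetchCandidates, true);
```

Here `resolved` holds the single interface entry, so the reference is no longer ambiguous.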

Unresolvable ambiguity

In the end, if an ambiguity can't be resolved, an error will be reported. The author may fall back to a manual data-cite that provides a hash, like data-cite="spec#hash".

The Web Service (Shepherd)

Request (proposed API end-point)

We need to send a {term, specList} pair for each term. The following format may be used for the request:

POST /xrefs
Content-Type: application/json
Accept: application/json

{
  "keys": [
    {
      "term": "foo bar",
      "specs": ["spec1", "spec2"],
      "types": ["dfn", "interface"]
    },
    {
      "term": "baz"
    }
  ]
}
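As a rough illustration, the request body above could be assembled like this. `XrefKey` and `buildXrefBody` are hypothetical names; the sketch only builds the JSON string and leaves out the actual POST.

```typescript
// Shape of one entry in "keys", following the proposed format.
interface XrefKey {
  term: string;
  specs?: string[];
  types?: string[];
}

// Build the JSON body for POST /xrefs, dropping empty optional fields.
function buildXrefBody(keys: XrefKey[]): string {
  const cleaned = keys.map((key) => {
    const out: XrefKey = { term: key.term };
    if (key.specs !== undefined && key.specs.length > 0) {
      out.specs = key.specs;
    }
    if (key.types !== undefined && key.types.length > 0) {
      out.types = key.types;
    }
    return out;
  });
  return JSON.stringify({ keys: cleaned });
}

const exampleBody = buildXrefBody([
  { term: "foo bar", specs: ["spec1", "spec2"], types: ["dfn", "interface"] },
  { term: "baz", specs: [] },
]);
```

The second key is sent as just `{"term":"baz"}`, since specs and types are optional.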

Some requirements on API:

  • The back-end search for a term should be case-insensitive by default. The search should be case-sensitive only if the given types belong to IDL_TYPES.
  • All data (as mentioned above) matching the term should be returned, unless some info to disambiguate is provided.
  • The specs and types are optional but recommended attributes for each term.
  • The linking_text attribute in Shepherd data should be treated the same as the title attribute and be available for search as a term.
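The matching requirements above could look roughly like this on the back end. Several things are assumptions: `ASSUMED_IDL_TYPES` stands in for the undefined IDL_TYPES, the search is treated as case-sensitive only when every given type is an IDL type, and `AnchorEntry` is a guessed shape for a Shepherd record with a title and a list of linking texts.

```typescript
// Assumption: placeholder for the not-yet-defined IDL_TYPES.
const ASSUMED_IDL_TYPES = ["interface", "typedef"];

// Guessed shape of one Shepherd anchor record.
interface AnchorEntry {
  title: string;
  linking_text: string[];
}

// Case-sensitive only when types are given and all of them are IDL types.
function isCaseSensitive(types?: string[]): boolean {
  if (types === undefined || types.length === 0) {
    return false;
  }
  return types.every((t) => ASSUMED_IDL_TYPES.indexOf(t) !== -1);
}

// linking_text is searched exactly like title.
function matchesTerm(entry: AnchorEntry, term: string, types?: string[]): boolean {
  const sensitive = isCaseSensitive(types);
  const normalize = (s: string): string => (sensitive ? s : s.toLowerCase());
  const wanted = normalize(term);
  const texts = [entry.title].concat(entry.linking_text);
  return texts.some((t) => normalize(t) === wanted);
}
```

With an entry titled "EventHandler", a query for "eventhandler" matches by default but not when `types` is `["typedef"]`.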

Response

We expect a JSON response of the form:

{
  "data": {
    "baz": [
      { uri: "#baz", type, spec: "foo", for: [], normative: true }
    ],
    "bar": [
      { uri: "webappapis.html#bar", type, spec: "html", for, normative },
      { uri, type, spec, for, normative }
    ],
    "biz": []
  }
}

Client Side

  1. A <a>TERM</a> is found.
  2. ReSpec tries to find a local id for this reference.
  3. If local id found, then done.
  4. Else: Send a request to /xrefs with the TERM as term and the spec from the data-cite attribute of the element itself or its closest parent.
  5. Cache the response (store in IndexedDB).
  6. If ambiguity, try to resolve as explained above.
  7. If unresolved ambiguity, error.
  8. If no ambiguities, the data-cite will be converted to (or added as) data-cite=spec#uri. This will then be handled by ReSpec as is presently handled.
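The steps above can be sketched end to end as below. This is a simplified illustration, not ReSpec code: the lookup is a synchronous, injected function standing in for the POST to /xrefs, a plain object stands in for the IndexedDB cache, the disambiguation of step 6 is left out, and joining spec and uri with "/" when the uri does not start with "#" is an assumption about how `spec#uri` is formed.

```typescript
interface XrefResult {
  uri: string;
  type: string;
  spec: string;
  normative: boolean;
}

// Stand-in for the network request to /xrefs (hypothetical signature).
type Lookup = (term: string, specs: string[]) => XrefResult[];

type Outcome =
  | { status: "local"; href: string }
  | { status: "linked"; dataCite: string }
  | { status: "error"; message: string };

function linkTerm(
  term: string,
  specs: string[],
  localIds: Record<string, string>,
  cache: Record<string, XrefResult[]>,
  lookup: Lookup
): Outcome {
  // Steps 2-3: a local definition always wins.
  if (Object.prototype.hasOwnProperty.call(localIds, term)) {
    return { status: "local", href: "#" + localIds[term] };
  }
  // Steps 4-5: query the service once per (term, specs) pair and cache the response.
  const cacheKey = term + "|" + specs.join(" ");
  if (!Object.prototype.hasOwnProperty.call(cache, cacheKey)) {
    cache[cacheKey] = lookup(term, specs);
  }
  const results = cache[cacheKey];
  // Steps 6-7: more than one result is reported as ambiguous (no disambiguation here).
  if (results.length === 0) {
    return { status: "error", message: "No definition found for '" + term + "'" };
  }
  if (results.length > 1) {
    return { status: "error", message: "Ambiguous reference '" + term + "'" };
  }
  // Step 8: produce the data-cite value that ReSpec already knows how to handle.
  const hit = results[0];
  const separator = hit.uri.charAt(0) === "#" ? "" : "/";
  return { status: "linked", dataCite: hit.spec + separator + hit.uri };
}
```

A real implementation would batch all terms into one request and run the disambiguation rules before giving up with an error.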

TODO

  • Define IDL_TYPES, DFN_TYPES
  • Work more on resolving ambiguities, mainly client side
  • Look into data-for (example {{Event.preventDefault()}})
  • What is scope field in Shepherd data?
  • Look into case sensitivity of terms (in general - case sensitive for IDL_TYPES terms and case insensitive otherwise)

Bonus I : A Search UI

With the web service set, we can create a search UI on top of it in ReSpec (similar to specref search interface).

One can search for a term (and optionally name the specs in which to search) and get a list of matched terms.

It would be even better if each term in the results had a "copy" button that lets the user copy the markup required to add that reference. This would provide an easier workflow in case there are ambiguities that can't be resolved by providing more information.

Bonus II : Auto generate list of external dependencies

The list of terms defined externally should be auto-generated.

Checklist

  1. [x] Create the web service https://github.com/w3c/respec/issues/1757
  2. [x] Expand support for data-cite and data-lt in ReSpec.
  3. [x] Integrate Shepherd API https://github.com/w3c/respec/pull/2158
  4. [x] Caching and other performance improvements
  5. [ ] Auto generate list of external dependencies
  6. [x] Expand fancy syntax like {{foo}} to valid cross references. https://github.com/w3c/respec/pull/1765

Abandoned:

  • Add the search interface. (Out of scope of ReSpec, can be done as a separate micro-project)
xref

All 13 comments

Can we work with CSSWG people so that Shepherd can provide a filter API?

Or how about a command-line tool that inserts/updates cross-reference data from Shepherd into the target file?

Also, this is an interesting note... From SVGWG:

Shepherd is a test suite manager that includes issue tracking, etc. The functionality of Shepherd has been replaced by Github and the SVG WG will not be using Shepherd.

Edit: It seems the functionality only refers to the issue tracking feature.

Can we work with CSSWG people so that Shepherd can provide a filter API?

Yep, we are already on it :) It might be we don't end up using Shepherd at all, but just BikeShed's data (which is based on Shepherd's data).

Or how about a command-line tool that inserts/updates cross-reference data from Shepherd into the target file?

We will see how to best contribute to BikeShed's data - and figure out how to best get ReSpec data into BikeShed's data.

BikeShed's data 1) still requires a web service as it's still BIG 2) is in its own format. I think getting raw Shepherd data in JSON will be easier then, especially with a cache.

Current plan is for @sidvishnoi to dig a bit deeper into the data, sizes, formats ... and into the problem itself. He is planning to have a full outline for us to review on the 1st of June (he is currently heads down doing his final exams).

Note that Shepherd already has all the anchor data in a MySQL database and it gets updated frequently throughout the day. It would be simple to add to Shepherd's existing API to only return specific anchor data for a given set of linking texts (e.g ReSpec could make a single http request with a list of all cross references for the current spec). No need to setup yet another database or try to scrape data that has already been scraped from the primary database.

I'd be happy to implement the Shepherd API side, just let me know what you need.

Thank you @plinss. I'll get back on this as soon as my exams are over and let you know :)

@plinss, @saschanaz, @sidvishnoi has updated the proposal. Would you mind having a look?

@plinss, we are going to try to build a prototype using static data first. In the proposal above, please see the section "The Web Service (Shepherd)". We are going to try to work out exactly what fields we need, but would like your early input on whether something like what we are proposing there is possible.

The proposed API looks fine, and should be easily doable. You'll probably also want to be able to send the anchor type information (and allow the client to specify the anchor type) to further reduce ambiguity. e.g. query for an element, vs an attribute, vs ...

Also, I expect some queries will have a large number of search terms, so you might want to have a POST method as well, containing a JSON payload of search terms (and options).

POSTing JSON sounds like a great idea - probably better than the GET approach entirely. Much nicer grouping too. Any precedent for the data structure to send or should we roll our own?

Go ahead and roll your own data structure

Closing as the only task left in this is now at https://github.com/w3c/respec/issues/2560
