Archive for markup

When is semantic html not important?

For many web developers semantic markup is very important. It makes your javascript and css code more maintainable, increases accessibility, helps search engines rank you better, and especially if you use microformats or RDFa, makes your data machine-readable.

What if you are transforming to html from a more semantic markup language like TEI? In this circumstance, isn’t HTML your presentation layer? How concerned with semantics should you be at this stage, when you are trying to prepare the material for human usage? Well, it is still easier to maintain CSS files than changing inline styles, font tags and misused tags in your XSL files. And you still get the accessibility benefits of using semantic html.

But should you care about machine readability if you are also publishing the more-semantic TEI source? Should you bother about putting RDFa into your html? Well, arguably you should. As I see it, TEI is for describing the elements of a document; transforming to HTML with RDFa can add a meaningful interpretative layer, stating what the document is saying about the external world. In addition, the semantics of RDF and HTML microformats are more widely understood by machines than the semantics of TEI.

But what about if you are providing the information in a (linked) RDF document as well? Is it still a beneficial thing to add these extra semantics to your html document? After all, why give machines the trouble of extracting RDF from your html if you are giving it to them pure and free anyway? Are, or will there be, user agents (for example, browser plugins, screenreaders) that will find these inline semantic statements more useful than pure RDF in a separate document?


Comments (7)

Accessibility vs. Semantic Markup?

I came across a post about semantic markup and accessibility citing a remark I had made about how, for all the talk about semantic markup in the web-dev community, HTML isn’t a very semantic markup language.

The post goes so far as to say:

[…] when you mark up a page in HTML you shouldn’t get too hung up on the semantic meaning of the elements.[…] What you should be concerned about […] is describing your page elements in such a way as to make them easier to use by screen readers, keyboard-based browsers etc. For example, don’t ask ‘is this set of elements really an unordered list?’ but do ask ‘if I mark up this set of elements as an unordered list, does that make my page more accessible and easier to use?’

However, I feel this has got things backwards – accessibility should, and will be, a consequence of good semantic markup.

Ideally, accessibility is a game for two: you provide the document in as semantic a form as you can, the user agent interprets that document as intelligently as it can. And if the user agent isn’t smart enough to handle all the semantics of your document today, then it will be tomorrow. Admittedly, in practice, a lot of things have to be dumbed down for Internet Explorer – though these tend to be of the bells and whistles rather than semantic variety, but it is usually better to aim at solid principles than the moving target of particular user agents.

The post does make a valuable point about how HTML, besides having to describe a document’s structure, also has to be used as an application interface markup language – which, aside from a rather limited set of form widgets, it isn’t really equipped to do, semantically at least. So we have to make do with the semantically bland div tag spiced up with plenty of javascript.

In theory, there’s lots of ways we can markup user interfaces – XUL, XBL, XForms, ZAML
– all of these hugely inaccessible compared to HTML (even HTML and javascript), because cross-browser support just isn’t there for anything else.

But the div doesn’t have to be bland anymore.

The Role Attribute

Yes, the role attribute is going to save the day.

Not only can we use it to to add semantics to html with RDFa, but this mozilla tutorial shows how we can use that added semantic power to make javascripted widgets accessible as well.

You can read more about how wonderful the role attribute is at Mark Birbeck’s blog.

Comments (3)

A Model for Semantically Encoded Scholarship ?

Three stages:

  1. Source
  2. Interpretation
  3. Statement

For the source, we have TEI. This lets us mark-up different parts of the source document according to what they are. What is also needed is some way of encoding what inferences you draw from the source document about the world – a way of saying “I think this part of this document means that…”. In other words, a schema for encoding statements of knowledge. In other words, RDF.


Then, what is needed is a way of mapping between Source and Statement, a way of saying:

This part (or ‘these parts’) of this document (or ‘these documents’) lead me to make this (or these) Statement(s) of Knowledge.

As a language designed for transforming one XML document into another, XSL seems pretty ideal for this. Of course, you might use any scripting language to do this, but the interpretation stage is both data, and a set of instructions, so the fact that XSL is XML too, lends it to this task. This means you can use the same set of tools for creating, storing and querying Source, Interpretation, and Statement.

An example of how this might work in practice:

You have a TEI encoded version of a Book Auction Catalogue. It consists of lists (listBibl) of books (bibl). You think this list means something, so you write a template:

<xsl:template match="tei:listBibl/bibl">
<xsl:text> This Book, </xsl:text>
<xsl:text>, was sold on </xsl:text>
<xsl:value-of select="tei:date[@xml:id ='this-is-the-date-of-the-auction' ]"/>
</xsl:template >

Note: This is a simplified example. The idea is to output RDF rather than plain text (which I have used here because I am not exactly sure yet how an equivalent machine readable statement would best be expressed). You would also want to include many more details in your statement, pulled in from other parts of the document, as well as other documents perhaps – details identifying you, the source(s), place, probability. The point here, is that using XSL, you can define your interpretation of the document(s).

XSLT simultaneously produces and documents your interpretation. If you then discover that all the books in the catalogue were previously owned by a Rev. John Smith, you can edit the interpretive XSLT file to represent that too (dutifully documenting your change in interpretation in some way). And because you are (hopefully) generating your Statements of Knowledge dynamically from your interpretative XSLT file, when your interpretation changes, your statements change accordingly.

Statements of Knowledge

Ok, we use RDF for representing statements of knowledge – but how? what might these statements look like?

Well, one starting point (at least) could be to look at the W6 vocabulary invented by Danny Ayers in 2004.
The W6 is

an ontology to allow resource descriptions and reasoning based on the 6 questions : who, why, what, when, where, how

Does that cover any statement you would want to make about something?

As I see it, by nesting statements, you have quite a powerful syntax of expression. The top level statement could encompass the act of the interpreter interpreting the source (refrenced within the What element) into the child statement in the How element.

What might need to be added, is a way to identify the source and the interpreter for the statements – although if there’s a neat way I’m not seeing to do that within the current W6 syntax, then all the better. The syntax needs to let us specify, for example, one source for What happened, and another for When it happened.

… But I don’t want to get into too much detail here.
So to recap:

  1. encode source in TEI
  2. write XSL to transform (read: ‘interpret’) source into:
  3. RDF statements of knowledge

Tools for semantically encoding scholarship?

It’s probably too much to expect every practising scholar to be thoroughly conversant with XML and TEI, happily hacking away at XSL stylesheets in vim.

What would an application that lets you make machine-readable statements of knowledge, without hurting your eyes with angle brackets, look like?

Well the Mangrove project at the University of Washington have an interesting graphical tagger application that lets you semantically tag up existing documents.

I imagine something like that. You load in your TEI, and you see it in the window, probably with some CSS applied so that you don’t see the tags, but the elements are styled according to whether they refer to a person, thing, place, or date – each is a different colour, and in bold (for example). Just some easy visual way for you to pick out these elements from the rest of the text. Then when you select these elements, you get a dialogue box for building the statement. The statement is maybe already partly filled out because you set up some template statements for the document(s) you’re working with. And one of the W6 questions is answered depending on which type of element you selected.
You also need to be able to set up macros, for repetitive interpretations of consistently structured documents. For instance, you might have a diary, and want to set up a tentative rule that says the writer of the diary knew every person mentioned within. You could then go through each instance (in a spell checker-esque interface) adding further information on each relationship, and telling it to ignore mentions of people that the writer knew only by repute.

Well, that’s it one broad vague idea for a three step model for digital scholarship, with the odd over-specific digression thrown in for good measure.

Comments (1)

Authoring Born TEI

I wrote my thesis in TEI. Before I began, I searched google (mainly in vain) for advice and examples of writing born-TEI – that is, documents originally written in TEI, not encoded in it afterwards. So, for the benefit of others who are also thinking of authoring in TEI, here is some of what I took away from the experience.

  • Writing (and thinking) Digitally and Semantically Can Be Quite Different from Writing (and thinking) in the conventions of the Printed Page

    For one thing, I was tempted down the path of DRY. So, for example, when citing a book, instead of having a footnote with bibliographic details, I had a <ptr/> element with a target attribute pointing to the id of the <bibl/> in my bibliography. Not having to repeat yourself is nice.

    Another thing you can do is write your notes inline with the text, and then transform them when it comes to presentation, replacing the text with a reference number, and moving the text to the foot of the page or the end of the document. Not only is this breaking out of the mindset of print, it is easier than having to shoot down to a notes section in the document every time you want to write one.

    You could also do the same thing with citations I suppose. Instead of targeting items in a bibliography section, you might simply write the bibliographic details out inline with your main text, targeting back in subsequent citations, and moving everything down into a bibliography (and notes) in the presentational stage. However, you may want to have a bibliography in the TEI as well, for those books and articles that you don’t refer to directly, but nonetheless want to acknowledge. You may also prefer to write your bibliography before you start writing the main text. It depends how you work.

  • You (probably) Still Have to Present It in the Conventions of the Printed Page

    It can be pretty annoying having to batter your born-digital document back into the typographical conventions you tried so hard to think and write outside of. The great advantage of course, is that you can present your text in many different forms without touching the original document. Unfortunately, most new documents, such as university dissertations, only really have to be presented in one form, so this advantage didn’t really console me much.

  • TEI offers too many different ways to fulfil common tasks

    Not that we need less choice, but it would be good if there were ‘microformats’ for authoring in TEI, so that you didn’t have to develop so many mini principles of best practice as you wrote.

    An example: in your bibliography, you have some urls. Scholarly practice dictates that you include a ‘last accessed on’ date, but how do you mark it up? This is a situation where you have to follow a convention anyway, so it would be really useful if you could follow a conventional way to mark it up. If we all do <date type=”lastAccessed>2004-10-16</date> then we can share stylesheets and other tools. And that would be nice.

  • HTML is pretty un-semantic

    There is a lot of talk in the web-dev community about the importance of semantic (x)html. And it’s true, html authors should try to do it as semantically as possible. Transforming from a really semantic mark-up language like TEI though, you realise how little the amount of meaning you can give text with html really is. Of course, it is a good thing that html has a far smaller tag set – imagine the success of the web if every homepage-jockey had to wade through the TEI guidlines to publish their poetry and pet photography. But it really puzzles me why in html we have so many tags for presenting programming stuff – kbd, samp, var, code – but not tags for marking-up the stuff that programmers really care about, like dates and names.

    So, if you are going to transform your TEI into html, you also have to decide how semantic your html is going to be, and how much presentation you are going to do with XSLT (or scripting language of your choice), and how much you are going to do with CSS. This probably depends heavily on the browsers you need to support. CSS3 is quite powerful, but it ain’t gonna work in Internet Explorer. CSS is also, I find, a bit easier to read and work with than XSLT, but you will need to stop-gap html’s small tag set with plenty of classed spans and divs, and it can get quite time consuming switching between xsl and css files trying to locate and solve various presentational glitches (did I do this in css, or xsl?).

    One answer is to skip the html stage. Style your TEI with css, and use only a mere sprinkling of scripting/xslt to re-order and copy chunks of content. This has the advantage that your document will retain its semantics right up till it hits the printer ribbon. The disadvantage is that it loses the functionality of html – you won’t have hyperlinks, and it will only really work in the newest most standard compliant browsers, so won’t be terribly accessible.

Mapping TEI to HTML

One of the annoying differences between TEI and valid (x)html is that in tei, lists and quotes can occur in paragraphs, but in html they can’t. So I thought it might be helpful to put my solution to this here as well. The following template assumes that quotes longers than 130 characters are blockquotes, whilst shorter quotes will be inline. Lists that are part of a paragraph’s text (ie: a comma separated list) cannot be transformed to an <ul> or an <ol> (well, they maybe can if you split it into two paragraphs and fiddle enough with the css, but it’s probably less semantic than to transform it into plain text). I have marked up these lists in the TEI with @type=’inline’.

<xsl:template match="tei:p[child::tei:list[not(@type='inline')]|child::tei:cit[string-length(tei:quote) > 130]|child::tei:listBibl]">
<xsl:if test="@xml:id">
<xsl:attribute name="id">
<xsl:value-of select="@xml:id"/>
<xsl:attribute name="class">
<xsl:value-of select="string('preblock')"/>
<xsl:for-each select="node()[following-sibling::tei:cit[string-length(tei:quote) > 130]|following-sibling::tei:list[not(@type='inline')]|following-sibling::tei:listBibl]">
<xsl:apply-templates select="current()"/>
<xsl:apply-templates select="tei:list|tei:cit[string-length(tei:quote) > 130]|tei:listBibl"/>
<p class="postblock">
<xsl:for-each select="node()[preceding-sibling::tei:list[not(@type='inline')]|preceding-sibling::tei:listBibl|preceding-sibling::tei:cit[string-length(tei:quote) > 130]]">
<xsl:apply-templates select="current()"/>

NB: XHTML 2.0, when it comes, will allow lists within paragraphs.
Also, if you don't already know, #tei-c at is a good place to ask, argue and discuss TEI.

Comments are (as always), most welcome.

Comments (3)