just like there’s a distinction between non-information resources and information resources, or between binary resources and text resources, maybe there should be a distinction between descriptor documents and content documents

trwnh@mastodon.social

i guess this is basically the distinction between embedded metadata and sidecar metadata, is what i was trying to get at

trwnh@mastodon.social

what i’m thinking is that sidecar metadata can be stored 1:1 or 1:n — if it’s 1:1 you might as well embed it if you can, either as some kind of frontmatter or inline with rdfa. but having frontmatter means every single processor that touches your content needs to be aware of the existence of that frontmatter (and strip it). so frontmatter isn’t as portable as i would like. basically a document with frontmatter is no longer that content type; it is a new media type for each combination.

trwnh@mastodon.social

example: markdown is text/markdown but if you add frontmatter it is now something different. but there isn’t a standard type for this; instead, every application implements frontmatter parsing independently. there isn’t consensus on the delimiter or on the format. the definition of a new media type should include the delimiter and the format; for example, “delimit with three dashes and serialize frontmatter as yaml” or “delimit with three pluses and serialize frontmatter as toml”
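a rough sketch of what "the media type defines the delimiter and the format" could look like in practice. the `CONVENTIONS` table and `sniff_frontmatter` are made-up names for illustration, and the two entries are just the two examples above, not any real registry:

```python
from typing import Optional

# hypothetical table: opening delimiter line -> header serialization.
# these two pairs are only the examples from the post, not a standard.
CONVENTIONS = {"---": "yaml", "+++": "toml"}


def sniff_frontmatter(text: str) -> Optional[str]:
    """Return the header format implied by the first line, or None."""
    lines = text.splitlines()
    if not lines:
        return None
    delimiter = lines[0].strip()
    fmt = CONVENTIONS.get(delimiter)
    # only count it as frontmatter if the same delimiter closes the block
    if fmt is not None and any(l.strip() == delimiter for l in lines[1:]):
        return fmt
    return None


print(sniff_frontmatter("---\ntitle: x\n---\n# hello"))
print(sniff_frontmatter("# hello\n\njust markdown"))
```

a processor that only knows text/markdown has no such table, which is why it would treat the header as content.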

trwnh@mastodon.social

earlier i said that html’s head-body split is not the same as the metadata-content split i am after; after some further thought, this isn’t really true. i think what i am trying to model here is a way to be able to detect and handle arbitrary header data, by unwrapping it to get at the body content. but i’m realizing that this body content may itself have its own nested headers and body…

trwnh@mastodon.social

more precisely there is a format to the header data and there is a format to the body content

an http request/response can be serialized as a text file which has http headers and http body, and then that http body can be of a certain content type like html, which itself has html headers and html body. the html body content is often also of type html

you can progressively wrap or unwrap “body content” with “header data” in different formats. i’m not sure how best to describe this…
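one way to picture the progressive unwrapping, as a sketch only: each layer is a function that takes text and returns (header, body), and you apply layers in order. `unwrap_http`, `unwrap_html` and `unwrap_all` are hypothetical helpers, and the html one is a deliberately crude regex stand-in, since real html needs a real parser:

```python
import re
from typing import Callable, List, Tuple

Layer = Callable[[str], Tuple[str, str]]


def unwrap_http(message: str) -> Tuple[str, str]:
    # http headers end at the first blank line (CRLF CRLF)
    head, _, body = message.partition("\r\n\r\n")
    return head, body


def unwrap_html(document: str) -> Tuple[str, str]:
    # crude assumption: a plain <head>...</head><body>...</body> pair
    m = re.search(r"<head>(.*?)</head>\s*<body>(.*?)</body>",
                  document, re.S | re.I)
    if m is None:
        return "", document  # no header layer found; pass content through
    return m.group(1).strip(), m.group(2).strip()


def unwrap_all(text: str, layers: List[Layer]) -> Tuple[List[str], str]:
    """Peel off one header per layer, outermost first."""
    headers = []
    for layer in layers:
        header, text = layer(text)
        headers.append(header)
    return headers, text


message = ("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
           "<html><head><title>t</title></head><body><p>hi</p></body></html>")
headers, content = unwrap_all(message, [unwrap_http, unwrap_html])
print(headers)
print(content)
```

the list of layers is doing the job that a chain of content types would do: each layer's body is the next layer's input.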

trwnh@mastodon.social

how can we generalize this header+content pattern, basically

i’m fairly sure you need to at least define header type, delimiter type, content type

example:
- header = toml
- delimiter = +++ to start, +++ to end
- content = html

is this enough to describe a canonical data format?
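to make the three-part description concrete, here is a minimal sketch. `WrappedFormat` and its field names are invented for illustration; it only records the header type, the delimiters and the content type, and does not actually parse the toml or the html:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WrappedFormat:
    header_type: str   # e.g. "toml"
    open_delim: str    # line that starts the header
    close_delim: str   # line that ends the header
    content_type: str  # e.g. "html"

    def wrap(self, header: str, content: str) -> str:
        return f"{self.open_delim}\n{header}\n{self.close_delim}\n{content}"

    def unwrap(self, text: str) -> Tuple[str, str]:
        lines = text.split("\n")
        if lines[0] != self.open_delim:
            raise ValueError("text does not start with the opening delimiter")
        try:
            # only the first closing delimiter ends the header;
            # later delimiter-looking lines stay in the content
            end = lines.index(self.close_delim, 1)
        except ValueError:
            raise ValueError("closing delimiter not found") from None
        return "\n".join(lines[1:end]), "\n".join(lines[end + 1:])


TOML_HTML = WrappedFormat("toml", "+++", "+++", "html")
doc = TOML_HTML.wrap('title = "x"', "<p>hi</p>")
print(TOML_HTML.unwrap(doc))
```

one gap this sketch makes visible: if the header itself contains a line equal to the closing delimiter, the split lands in the wrong place, so a real definition would probably also need an escaping rule.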

trwnh@mastodon.social

side note: i wish there was a distinction between html content and a full html document… if you try to render html content in a browser and it isn’t a full html document, weird things might happen

trwnh@mastodon.social

revisiting: i discovered the iana media type multipart/mixed which could basically be this, just with a little modification https://www.iana.org/assignments/media-types/#multipart

the thing is the "boundary" parameter in multipart media types is always prefixed with -- in the body, so you can't express +++ at all, and the typical --- (boundary="-") collides with content (a markdown horizontal rule --- would get parsed as a multipart boundary)

still there's probably some inspiration to be had there, you could define an application/subtype that does similar

trwnh@mastodon.social

it would probably be more correct to define application/mdx or whatever (since the typical intent is to be processed by something like an MDX processor), but i haven't really looked into the particulars of doing this properly and making it modular (instead of hardcoding semantics of "toml frontmatter, --- separator, markdown body")

oblomov@sociale.network

@trwnh multipart/* types have to specify the boundary signature, so why would the classic markdown --- be identified as one if not indicated as the boundary separator?

trwnh@mastodon.social

@oblomov i mean if you say boundary="-" then the separator becomes --- but your markdown content might include --- as what gets rendered into an <hr> element

something like

```
---
foo: bar
---

stuff.

---

more stuff.
```

could get parsed as 3 parts instead of 2
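a quick sketch of that collision. `split_on_boundary` is a hypothetical, heavily simplified splitter: it treats every line equal to "--" + boundary as a delimiter and ignores the closing delimiter and preamble/epilogue handling of real multipart parsers:

```python
from typing import List


def split_on_boundary(text: str, boundary: str) -> List[str]:
    """Split text on every line equal to '--' + boundary."""
    delimiter = "--" + boundary
    parts: List[str] = []
    current: List[str] = []
    seen = False  # anything before the first delimiter is dropped
    for line in text.split("\n"):
        if line == delimiter:
            if seen:
                parts.append("\n".join(current))
            seen = True
            current = []
        else:
            current.append(line)
    if seen:
        parts.append("\n".join(current))
    return parts


sample = "---\nfoo: bar\n---\n\nstuff.\n\n---\n\nmore stuff."
parts = split_on_boundary(sample, "-")
print(len(parts))
```

the splitter has no way to know that the third --- was meant as a horizontal rule, so the body gets cut in two.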
