just like there’s a distinction between non-information resources and information resources, or between binary resources and text resources, maybe there should be a distinction between descriptor documents and content documents

trwnh@mastodon.social

it might not be a problem with smaller content, like say an as:Note, but stuffing the full html structured contents of an entire article into even rss or atom seems like it could get out of hand really quickly. this is why feeds are often limited to 10-20 items or otherwise only include a summary, right?

so maybe it makes sense to treat even text content as a separate thing, just like we do with binary resources.

trwnh@mastodon.social

html kinda doesn’t make this distinction. there’s a head-body split but that’s not the same as a metadata-content split. you can embed metadata into body content just as easily as you can embed it in head tags (example: rdfa)

trwnh@mastodon.social

i guess this is basically the distinction between embedded metadata and sidecar metadata, is what i was trying to get at

joelving@mastodon.joelving.dk

@trwnh imagine how easy on your server sharing links across the Fediverse would be, if you could query a URL (either separately via a well-known URI or as an HTTP verb) for a (signed) OpenGraph document instead of extracting it from the full payload.

trwnh@mastodon.social

@joelving there is oEmbed, which might be what you are looking for?

trwnh@mastodon.social

what i’m thinking is that sidecar metadata can be stored 1:1 or 1:n — if it’s 1:1 you might as well embed it if you can, either as some kind of frontmatter or inline with rdfa. but having frontmatter means every single processor that touches your content needs to be aware of the existence of that frontmatter (and strip it). so frontmatter isn’t as portable as i would like. basically a document with frontmatter is no longer that content type; it is a new media type for each combination.

trwnh@mastodon.social

example: markdown is text/markdown but if you add frontmatter it is now something different. but there isn’t a standard type for this; instead, every application implements frontmatter parsing independently. there isn’t consensus on the delimiter or on the format. the definition of a new media type should include the delimiter and the format; for example, “delimit with three dashes and serialize frontmatter as yaml” or “delimit with three pluses and serialize frontmatter as toml”
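
to make the "every application implements frontmatter parsing independently" point concrete, here is a minimal sketch of what each processor ends up doing. the `FRONTMATTER_STYLES` table and the `split_frontmatter` name are hypothetical; they only encode the two conventions mentioned above (three dashes for yaml, three pluses for toml) and are not taken from any standard.

```python
# hypothetical table: delimiter line -> header format it implies
FRONTMATTER_STYLES = {
    "---": "yaml",
    "+++": "toml",
}


def split_frontmatter(text):
    """Return (header_format, header_text, body).

    header_format is None when no known frontmatter is detected,
    in which case the whole input is returned as the body.
    """
    lines = text.splitlines(keepends=True)
    if not lines:
        return None, "", text
    delimiter = lines[0].strip()
    header_format = FRONTMATTER_STYLES.get(delimiter)
    if header_format is None:
        return None, "", text
    # look for the matching closing delimiter
    for i in range(1, len(lines)):
        if lines[i].strip() == delimiter:
            header = "".join(lines[1:i])
            body = "".join(lines[i + 1:])
            return header_format, header, body
    # opening delimiter without a closing one: treat it all as content
    return None, "", text
```

note that this is guessing from the first line alone: a plain markdown file that happens to open with a `---` thematic break and contains another one later would be misread as having yaml frontmatter. that ambiguity is the reason to want the delimiter and format declared by the type instead of sniffed.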

alice@gts.void.dog

@trwnh it should be specified by markdown (variants) probably,,

trwnh@mastodon.social

@alice im thinking more along the lines of like. what if yaml frontmatter + markdown content = application/markdown-content-with-yaml-frontmatter or whatever. and it had a registered extension .mdyaml or whatever.

i’d be mainly interested in html content and toml frontmatter but saying that this is .html and text/html is not accurate. i can’t directly serve such a file to a web browser; it needs to be processed/converted/whatever by an application first

alice@gts.void.dog

@trwnh if it needs so much pre processing is it a markdown template?

trwnh@mastodon.social

@alice not really, it’s “content” plus some “header”; you *can* render it against some template or layout, but the main goal here is portability. i want to be able to know ahead-of-time that this text file is not just html/md/etc, it also has some junk i may or may not care about at the start. if i want just the content then i need to strip that junk first

trwnh@mastodon.social

earlier i said that html’s head-body split is not the same as the metadata-content split i am after; after some further thought, this isn’t really true. i think what i am trying to model here is a way to be able to detect and handle arbitrary header data, by unwrapping it to get at the body content. but i’m realizing that this body content may itself have its own nested headers and body…

trwnh@mastodon.social

more precisely there is a format to the header data and there is a format to the body content

an http request/response can be serialized as a text file which has http headers and http body, and then that http body can be of a certain content type like html, which itself has html headers and html body. the html body content is often also of type html

you can progressively wrap or unwrap “body content” with “header data” in different formats. i’m not sure how best to describe this…
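
one rough way to picture the progressive unwrapping is below. this is only a sketch: `unwrap_http` and `unwrap_html` are hypothetical helper names, the http split assumes a simple serialized response with CRLF line endings, and the html split uses a crude regex rather than a real html parser.

```python
import re


def unwrap_http(message):
    """Split a serialized http message at the first blank line."""
    head, _, body = message.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"start_line": start_line, "headers": headers}, body


def unwrap_html(document):
    """Crudely split an html document into head markup and body markup.

    If there is no <body> element, the input is treated as bare html content.
    """
    flags = re.S | re.I
    head = re.search(r"<head(?:\s[^>]*)?>(.*?)</head>", document, flags)
    body = re.search(r"<body(?:\s[^>]*)?>(.*?)</body>", document, flags)
    return (head.group(1) if head else ""), (body.group(1) if body else document)


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><head><title>hi</title></head><body><p>hello</p></body></html>"
)

# layer 1: http headers around an http body
http_header, http_body = unwrap_http(raw)
# layer 2: the http body is html, which has its own head and body
html_head, html_body = unwrap_html(http_body)
```

each layer has its own header format and its own way of marking where the body starts, and the inner body (`html_body` here) is still html content that could be wrapped again.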

trwnh@mastodon.social

how can we generalize this header+content pattern, basically

i’m fairly sure you need to at least define header type, delimiter type, content type

example:
- header = toml
- delimiter = +++ to start, +++ to end
- content = html

is this enough to describe a canonical data format?
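
as a way to test that question, here is a small sketch that describes a format with exactly those three pieces and tries to round-trip a document through it. the `WrappedFormat` class and its field names are hypothetical, made up for illustration; the header is kept as opaque text and never parsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WrappedFormat:
    header_type: str   # e.g. "toml"
    start: str         # delimiter line that opens the header
    end: str           # delimiter line that closes the header
    content_type: str  # e.g. "html"

    def wrap(self, header, content):
        """Put header text in front of content, between the delimiters."""
        return f"{self.start}\n{header}\n{self.end}\n{content}"

    def unwrap(self, text):
        """Return (header, content), or raise if the delimiters are missing."""
        first, _, rest = text.partition("\n")
        if first != self.start:
            raise ValueError("missing start delimiter")
        lines = rest.split("\n")
        for i, line in enumerate(lines):
            if line == self.end:
                return "\n".join(lines[:i]), "\n".join(lines[i + 1:])
        raise ValueError("missing end delimiter")


# the example from above: toml header, +++ delimiters, html content
toml_html = WrappedFormat(header_type="toml", start="+++", end="+++", content_type="html")

wrapped = toml_html.wrap('title = "hi"', "<p>hello</p>\n")
header, content = toml_html.unwrap(wrapped)
```

for this simple case the three fields are enough to round-trip. one gap it exposes: if the header itself contains a line equal to the end delimiter (for instance inside a toml multi-line string), the split lands in the wrong place, so a full definition would probably also need an escaping rule or a constraint on the header.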

trwnh@mastodon.social

side note: i wish there was a distinction between html content and a full html document… if you try to render html content in a browser and it isn’t a full html document, weird things might happen

NodeBB-ActivityPub Bridge Test Instance
