Citation links #15
Proposed order of operations in general:
For this post, I am shortcutting
In terms of spec features that could be re-used (I'm sure I'm missing some):
One potential mapping that might feel appropriate, showing
A couple of questionable decisions I made above
|
An idea I gave on the Discord server to account for @vhyrro 's proposal:
And later we figured that an example of what a citation style library could look like:
|
I've not read this carefully at all, but I'll just repeat the essence of what I earlier posted on a citation issue here: Look at the pandoc citation model, and the newer org-cite one. The former has been around longer, and been battle-tested by a lot of users, and the latter in turn learned from that (as well as BibLaTeX). Having a model that is conceptually close will not only get you needed features without having to reinvent the wheel, but also can ensure it's easy to losslessly convert back and forth. As for styling, I'm obviously biased, since I created CSL (used for styling in both), but I think we collectively solved a lot of challenges in this domain. But it's really difficult, with trade-offs. If you design an approach organized around natbib styling, that leaves out better, more general, options in the TeX world (namely BibLaTeX), but also newer solutions like CSL, that work outside the TeX world. I think pandoc is the best balance; it's rich enough (its citation model is richer than natbib's AND more concise), but can export to key TeX styles, while also supporting native formatting in the CSL-based citeproc processor. In The org-biblatex module actually has a variable that allows the users to define their own mappings, if they prefer, somewhat like the macro idea here. But the default value is curated so that the styles map more-or-less consistently to other backends. Pandoc has that as well, but supports a more limited range of local citation styles (just the equivalent of citet, citep, and I guess citea). |
Thanks for offering an experienced viewpoint in this domain!
Can you explain what aspects of On thinking about it more, I tend to like a generalized macro approach with |
Skimmed through The basic idea seems to be that
|
@vhyrro, would this |
I should indeed clarify that. I only meant the
That design is really powerful and flexible, but I was thinking "cite" and such wasn't really consistent with the syntax approach in neorg. |
The default insert processor uses minibuffer completion. My citar package provides a richer alternative for that, and also a cmp-like completion at point (see two screenshots at the top). https://github.com/emacs-citar/citar The latter does use org for finding the citations, but is otherwise independent of org-cite. The minibuffer completion is provided as an insert processor, so that it will be used when calling It's a really well-designed system that makes that easy to do on my end. The insert processor is just a few lines of code. Basically the insert and follow processor frameworks allow you to plug in custom functions to standard org commands. |
@d-r-a-b - just looking at your earlier note a bit more closely.
+1
I only casually follow neorg, so forgive me: WDYM by "macro" in this context? It sounds like maybe your suggestion would be coupling citation syntax to a particular output backend? Or am I misreading? |
Depending on how you define an "output backend", yes. To put it in On the other hand, I am advocating for some conceptual separation of |
It's probably better to excerpt from the spec for this. It's a bit longish, but a quick skim overall.
The specific
See also
Macros are an active focus of neorg development right now. At the moment, I believe that the "macro" system is completely implemented in lua. However, there are experiments to embed Janet as a very lightweight LISP-like that comes with Parsing Expression Grammars built in as part of the standard library. My understanding is that this would then form the basis of most macros, but you'd really have to pick Vhyrro's brain for a better explanation of how the responsibilities are going to be delegated. |
edited for clarity
OK, I get it. So in effect, you are meaning a default citation syntax that should be good enough for the vast majority of cases, but that's not hard-coded? If yes, that seems reasonable. By "backend" I was meaning TeX (natbib vs biblatex, etc.) vs a CSL solution (including the lua-based one for TeX) vs a newer example like Typst. Aside: have you all identified the list of requirements? I would suggest this as one of them:
IMO, the org-ref/org-cite thing is a problem ATM; it fragments the development ecosystem, and forces users to choose between incompatible approaches (you can't mix the two syntaxes in the same document, or things will break), all because some people insist on "style" names that look like natbib's command names, which is more of a UI issue than anything. But I guess that's neither here-nor-there; users and developers do have different priorities, and that's unavoidable at some level. The natbib citation model, BTW, is more limited than all the alternatives we've raised here. |
I think I am advocating for a This means that Pro/cons as I see it revolve around
|
Fragmentation is always a problem. To be fair, I don't expect to use both Zotero and Endnote in my word docs and for them to play well together. At least with these markup languages it's easy to see at a glance which package is being used.
If I understand correctly,
For the discussion, I think it's useful to be explicit. I mentally hold distinct the following 3 concepts and these are how I use these terms.
It's fair to say that certain syntaxes limit what models are supported, and if you want to convert from a package that only supports a .bib backend to one that only supports a sqlite backend there are some squicky conversions that probably need to happen outside of I more strongly align to your earlier statement that much of this is a UI issue. Is there really a difference in the model just because the syntax looks a bit different? I will say the |
Some possible requirements to get us going, all of which are up for debate Model
Syntax
Package
Compelling User story
EDITED: add genitive citations |
Absolutely not. I do indeed mean model; the abstractions behind the syntax.
Yes, this is what I meant; that's an arbitrary limitation that makes it impractical in many fields; one that pandoc, biblatex, org-cite don't have. Well, maybe not "arbitrary" exactly; I am guessing one has to do some gymnastics to support it in TeX that aren't necessary elsewhere, which would explain why biblatex has two different syntaxes for single vs multiple.
That's correct; it's why the org-cite code is completely agnostic on it, with just some best-attempt reasonable defaults, which allow documents to work pretty consistently across different output targets. For example Org-cite history, etc ... Don't want to get too focused on this; I really just raised this issue to encourage you all to avoid this yourselves, which you have an opportunity to do since this is a pretty new project.
I'd say more like the development of I was only involved in the six months or so of it, long after the syntax/model discussions were settled. But the org-ref folks had every opportunity to make their case there; either they didn't, or they failed to convince. In

    (current-citation (if (eq 'citation (org-element-type datum)) datum
                        (org-element-property :parent datum)))
    (current-ref (when (eq 'citation-reference (org-element-type datum)) datum))
    (refs (org-cite-get-references current-citation))

I do know some of the org-cite syntax decisions were guided by technical reasons; for example, parsing. |
Understood and thank you for your input. Is there anything you've run into that the model generated by citations using
Thank you for the example! Is there any technical reason that this code example would require first-class object support, instead of just depending on the macro library that would provide the same AST? From a theoretical standpoint my somewhat limited understanding is that the primary difference is whether a
In what cases do you find a unified AST provides useful functionality over and above the 2nd class model? I'm struggling a bit to come up with strong use cases for this, but that doesn't mean there isn't a good strong use case. |
Another part of step 1 - collect examples of citation styles we want to support. https://www.overleaf.com/learn/latex/Questions/How_do_I_create_a_possessive_or_genitive_citation%3F
This is essentially a style variant in line with parenthetical, in-text, etc. |
I said:
Pandoc defines a citation in its AST (https://hackage.haskell.org/package/pandoc-types). 1st class definition of a citation in Anyone else have specific pros/cons for 1st class vs 2nd class citation objects? |
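To make "1st class citation object" a little more concrete, here is a rough Python mirror of the per-key citation record in pandoc-types. The field and mode names follow the Haskell `Citation` type as documented in the package linked above, but this is only a sketch under simplifying assumptions: prefixes and suffixes are plain strings here (pandoc stores lists of inline elements), and the note-number and hash fields are left out.

```python
from dataclasses import dataclass
from enum import Enum


class CitationMode(Enum):
    AUTHOR_IN_TEXT = "AuthorInText"      # e.g. Smith (1995)
    SUPPRESS_AUTHOR = "SuppressAuthor"   # e.g. (1995)
    NORMAL_CITATION = "NormalCitation"   # e.g. (Smith 1995)


@dataclass
class Citation:
    citation_id: str                     # the cite key
    prefix: str = ""                     # simplified: pandoc stores inline elements
    suffix: str = ""                     # simplified: pandoc stores inline elements
    mode: CitationMode = CitationMode.NORMAL_CITATION


# A "Cite" node in the document tree groups several per-key records.
# The keys below are made up for illustration.
cite = [
    Citation("smith1995", prefix="see", suffix="pp. 5-12"),
    Citation("walters2000", mode=CitationMode.SUPPRESS_AUTHOR),
]
keys = [c.citation_id for c in cite]
```

With a first-class node like this, any tool that walks the document tree can find every citation and its keys without knowing which macro produced it; a second-class (macro-only) approach would have to agree on that shape by convention.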
You guys have been productive! I think @d-r-a-b 's list is a good starting point. Some random thoughts:
Overall, I think we should aim for something not too huge to implement, but that can be extended afterwards, |
For sure. My understanding is that the CSL specification only defines the style component in the diagram at https://docs.citationstyles.org/en/stable/primer.html, which essentially states
I haven't found a place that really describes how a CSL processor is meant to implement things like per-key prefix/affixes within https://citationstyles.org, nor a strong description of what we have been terming a model. These are things that I think correspond to "Citation Details" in the diagram referenced above. For Citation Details/Model level details, the best references that I have seen are pandoc.types and citeproc.types. My sense is that CSL as a spec leaves open the interpretation of a citation model to be implementation-specific for a processor. There is also a "test suite maintained by Frank Bennett for testing of [the citation processor] citeproc-js. The test suite can be used by authors of other CSL processors, but contains tests that go beyond the scope of the CSL specification." @bdarcus I would really appreciate your input on whether my understanding of what CSL defines is accurate here.
+1
I would consider adding support for
I don't think it's rare at all. I fully agree that a lightweight syntax for the simplest use-case is desirable, but I think the main benefit is it makes documents more readable because the intent is clearer. Reducing syntax-related pain points also helps to preserve a writing flow that is focused on content instead of syntax. |
Yeah, there is no real CSL API, though I think we have enough experience and implementations to define one.
Yes, because when we published the first version and docs, we weren't sure, and other priorities took over. But then citeproc-js came along, which Zotero used but was independent from, and Frank developed a couple of JSON Schemas that described a kind of API between the two. The CSL schemas repo now hosts versions we adapted from that work. https://github.com/citation-style-language/schema/tree/master/schemas/input Newer implementations certainly studied that, and evolved it. I'd say pandoc and its citeproc is a good place to look, since it's much newer, and very well designed. https://github.com/jgm/citeproc See the JSON CLI server, for example. https://github.com/jgm/citeproc/blob/master/man/citeproc.1.md#notes As I say, I think we should better formalize those, and an API. I'm actually working on an experiment that may address all this, including the API. https://github.com/bdarcus/csl-next.js But it's very tentative ATM.
This is the test suite pretty much all the CSL projects use, which was adapted from Frank's, but now is a broader community effort. We've mostly removed tests specific to citeproc-js and CSL-M. https://github.com/citation-style-language/test-suite
A citation that only allowed a single key wouldn't be usable for people in many fields.
I'd expect many would want to be able to publish finished manuscripts in PDF or OpenDocument from their norg documents, at least in time. But even if one is only using norg for note-taking, a key part of that is properly citing while doing that. A single key without additional metadata is basically a non-starter for people in many fields. For example, many fields in the humanities and social sciences do a lot of quotation of source material. If one is doing that, they must include the page number(s) or other "locators" (what we call them in CSL land). So not sure your distinction between basically two modes of citation holds. |
No. As I said, it's just an iterative improvement on the pandoc model, which works pretty well. The only differences:
Hard to say, since as I said, I don't follow norg or neovim much at all. I can just say that org-cite makes it really easy to write functional integration. |
Amazing, thank you so much for your comments and clarification. It's also exciting to hear about developments happening in this space for CSL!
Just to clarify, was this comment interleaved correctly? As I understand them, supporting locators does not imply a limitation to single key. For example, @Klafyvel, I do understand your concerns about raising the scope too far, especially for an initial implementation/minimum viable product. They are very reasonable and cogent concerns. I hope that I haven't been giving the impression that all the features being discussed need to be implemented early (or even at all). My goal in enumerating all of the various features that a user of citations might want is to make sure they are being considered so that 1 of 2 decisions can be made. 1) we would eventually like to support so-and-so feature and so we should make design decisions that will not require kludges later. 2) we say that some feature is explicitly never supported (because we think it's harmful, because it creates some form of ambiguity in parsing or output, because it's way too complicated and you should use some other tool if you want that, etc) and then we don't feel bad later if some citation doesn't work with Parsing |
To illustrate informal, incomplete grammar for what seems like the current iteration of the citation syntax under discussion:
Considering how much the current proposal mirrors the The (cite/variant) part, as currently proposed, will just be a convention for naming the cite macros provided by a given output processor. I imagine Janet has something similar to Lua |
hmmm... If multiple locators are supported within the AST, the syntax for that needs to be figured out. Would correspond to output of something like "(Jones 1988, pp 12-15, 30-34, 88 for more details)". To some extent, if an author needs multiple locators they can utilize item-suffix to cover 90% of cases. However, the main use cases for an AST representation of a locator that come quickly to mind are for consistency (always put "see" in front of the locator in the output citation), for localization of that "see" string, and for being able to support a style directive that removes all locators in the output citations. That style directive will act very poorly in the 1 locator case because it will look like it works for most citations and then the author will be surprised by the multiple locator situation that only hides the first locator in the output. Of course in the 0 locator case a style directive would do nothing, but there are no surprises. Decision: 1 locator, multiple locators, 0 locators in AST My impression is that explicitly 1 locator in the AST is worse than 0 or multiple locators for the reasons listed above. |
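The surprise described above can be sketched in a few lines of Python. This compares a "hide all locators" style directive applied to two hypothetical AST shapes: one that stores exactly one locator and leaves any others in the suffix string, and one that stores a list. All class and function names are invented for illustration and do not come from any existing citation processor.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Locator:
    label: str   # e.g. "pp"
    value: str   # e.g. "12-15"


@dataclass
class SingleLocatorItem:
    key: str
    locator: Optional[Locator] = None
    suffix: str = ""          # extra locators can only live here, as plain text


@dataclass
class MultiLocatorItem:
    key: str
    locators: List[Locator] = field(default_factory=list)
    suffix: str = ""


def render_single(item, hide_locators=False):
    parts = [item.key]
    if item.locator and not hide_locators:
        parts.append(f"{item.locator.label} {item.locator.value}")
    if item.suffix:
        parts.append(item.suffix)
    return ", ".join(parts)


def render_multi(item, hide_locators=False):
    parts = [item.key]
    if not hide_locators:
        parts += [f"{loc.label} {loc.value}" for loc in item.locators]
    if item.suffix:
        parts.append(item.suffix)
    return ", ".join(parts)


single = SingleLocatorItem(
    "Jones 1988", Locator("pp", "12-15"), "30-34, 88 for more details"
)
multi = MultiLocatorItem(
    "Jones 1988",
    [Locator("pp", "12-15"), Locator("pp", "30-34"), Locator("p", "88")],
    "for more details",
)

hidden_single = render_single(single, hide_locators=True)
hidden_multi = render_multi(multi, hide_locators=True)
```

In the single-locator shape, `hidden_single` still contains "30-34, 88" because those pages were never locators as far as the AST knows; in the list shape, `hidden_multi` drops all three.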
It was awkward; just meant to agree with you in that, and suggest there's no reason to allow only one.
The existing implementations I am aware of (elisp, Haskell, JS, rust) accept citation-reference suffix strings as input and parse them into lists of locators, assuming a standardized syntax. I don't have the details handy, but they're documented in those implementations. EDIT: See the very precise English description in the oc-csl.el commentary. He derived it from citeproc-org, which likely borrowed from pandoc :-)
I don't want to overstate it: it's currently a personal experiment. I am, however, starting to document the model (I adapted the locators docs I link above to a docstring there), and have added a |
I really dislike their parse method if this is correct. It leads to surprising parse behavior if you have an org-cite citation like
That looks really awesome! |
Thank you for the clarification! |
Pandoc has some additional options to handle the problems you note, I believe, using TeX-like brackets. These data issues are really tricky. You need to support well by far the most important case here, which is page numbers, but not foreclose other options. E.g. the old make the common easy and the complex possible. When we were working on enhancements to CSL early in the pandemic, we actually converted it to an array of objects. But that's awfully complex for the common case. |
I think we all agree that the ideal scenario would be a syntax that allows explicit specification of multiple locators and still degrades nicely to a simple/easy syntax for the simple or no locator case. I'll be putting some thought into how to achieve that. What I really, really dislike is when the software makes the complex impossible. When things that are meant to automate and ease your life turn into a slogfest of trying to just get the thing to do what you want, it's much worse than never automating it at all. Consider an author who writes their citation with multiple logical locators
If the processing of locators can be wrong, then there needs to be a simple way to turn locator processing off or to provide it in a more verbose syntax. If that cannot be done, then I'd rather have no explicit support and say that authors will have to manually fix their suffixes if they switch style guides and now need "pages" to show up as "pps.".
In essence, I think the AST for the locator should look like an array of objects. To achieve the ideal case, my sense is that we would have to figure out how to create a syntax that hides this underlying model in the simple case of 0 or 1 locators. It would be really cool if we can hide it in the multiple locator case too, although I wonder how feasible that will be. The current org-cite solution seems to achieve the goal of hiding the model for the simple case, but breaks down in a lot of specific cases that an author might want. You mentioned that pandoc has some optional facilities to deal with this. I'll see if I can track down that syntax for ideas. EDITS: many for formatting. I'm a klutz this morning. |
See here, in para that starts "In complex cases ...".
Edit: Of course, you can play with |
Still thinking about the locator syntax problem, but something I came across: https://list.orgmode.org/orgmode/[email protected]/ From the org-mode mailing list, relating to macros and citations and citation processors. There is more to see in that thread and another 2 threads which reference it, but a summary is that figuring out how to make their CSL export processor interoperate with Org elements in the prefix/suffix elements is a tricky problem to solve because CSL has its own concept of formatted text that does not always map to what org-mode believes in (i.e. a smallcaps format). I wanted to highlight the user who is trying to make a macro work in the prefix.
This is precisely what I mean about having to fight with the citation system, although in this case I would say that it originates more from the package than from the syntax. The user encountered the problem in Oct 2022 and was trying to submit patches to get it fixed through Jan 2023. I didn't see a resolution, but maybe in another thread. It's worth making sure that there is at least a fallback that minimally provides this kind of hacky solution, but it would be nice if we either had a better fallback, or made it very clear that you could always ask the processor for an individual bibliographic element, ideally formatted according to the main rules of the CSL style. |
FWIW, here's what we came up with for the JSON schema model for CSL v1.1. Here's an example:

    {
      "locators": [
        { "page": 23 },
        { "begin": { "page": 25 }, "end": { "page": 28 } }
      ]
    }

We basically concluded our priority for these input files, which aren't likely to be touched by users, is correctness, and in this case wanting to allow processing of those lists. A possibly reasonable alternative could be something like this, but it raises other issues (like, it would rely on a processor treating the value as a plain string):

    {
      "locators": [
        { "page": "23, 25-28" }
      ]
    }

PS - not clear what the future of that 1.1 branch is, which is why I'm also experimenting with the alternative. |
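As a rough illustration of the gap between the two shapes above, a small Python helper can expand the human-friendly string form into the array-of-objects form. This is a sketch only: the function name is made up, it assumes purely numeric pages with a hyphen for ranges, and it would raise a `ValueError` on inputs such as roman numerals, which is exactly the kind of case that makes string parsing fragile.

```python
import json


def expand_pages(text, label="page"):
    """Turn a string like "23, 25-28" into single and begin/end locator objects."""
    locators = []
    for chunk in text.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate stray commas
        if "-" in chunk:
            begin, end = (part.strip() for part in chunk.split("-", 1))
            locators.append({"begin": {label: int(begin)}, "end": {label: int(end)}})
        else:
            locators.append({label: int(chunk)})
    return {"locators": locators}


expanded = expand_pages("23, 25-28")
print(json.dumps(expanded, indent=2))
```

A markup syntax could accept the short string from authors and store the expanded form, keeping the common case easy while leaving the processor with structured data.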
Thanks for the input! Just making sure I am remembering correctly - CSL has nothing to say on author-supplied affixes right? These are purely implementation-defined by whichever citeproc variant is parsing the CSL? |
Correct. But they all have settled on a similar approach, so seems past time to standardize. Aside: suffice to say, the success of CSL is sometimes a bit of a challenge. Imagine ten different implementations of neorg! Worth keeping in mind, though, there are two broad groups here:
In both cases, you have to expose a sane UI to users, whether in the form of GUI field(s), or a markup syntax. That tension between machine-friendly and human-friendly is the nub of the challenge. |
Standardize all the things! But also, I don't like the existing implementations in terms of the functionality they currently offer around more ad-hoc citations, mostly around these bugbears:
These support things like
The citation syntaxes and resulting models that I've seen, especially in the lightweight markup world, don't seem to deal with these citations well. I would be happy to have counter-examples of syntaxes that do cover these cases well and can produce the appropriate output as citation styles change. Some of this is modeling and syntax and needs to be fixed at those levels, some of it is just the difficult problem of software interop, but these examples show some of the real edge cases that authors want to be able to express. My hope for this thread is for us to come to some conclusion about whether or not we want to support authors who want these things and if so, how to do it in a way that parses cleanly and limits the introduction of new markup unless absolutely needed. |
I suppose the other thing is there is a weird asymmetry in all the syntaxes. Why isn't there any support for the concept of a citation like
? I know that I would never write that, but there are many things I wouldn't do that other authors might want or have a business need to do. |
I agree, but the answer is because it's never come up AFAIK. We also haven't talked about "rich" markup there, which has come up. |
Still very interested in this, but balancing a number of priorities. See https://gist.github.com/d-r-a-b/e359904b2e8f1bd4e9eca2574b8e6265 for a flash-frozen state of my thoughts at the moment. The only really useful bit of it is the list of terms, but other bits might be scavenged either to see online resources related to citations or for sentences that might be useful for an actual draft document. It's also missing the idea I'm about to suggest. I think I have a viable suggestion for how to model a citation (abstract), and from that some ideas on how it might be realized syntactically.
Continuing from the basic grammar in #15 (comment), I am suggesting conceptually the same
The exact syntax to achieve this is up to debate, but this might be more easily understood if:
This would allow the citation (consult Smith, 1995, pp. 5-12 on the bottom sections of the pages; for an example, consult Walters, 2000; consult pages 8-9 in Adams, 2015; the day of the week is Monday; I hate that day and that is why this citation is ugly) to be represented as
This illustrates several points about the proposed model
What do people think of the general idea (model or syntax)? It allows you to do something simple like |
I've only quickly read this @d-r-a-b, but I like it. I think the only reason org didn't go with the wrapper for locator is some technical reason. And as you note, the really common simple case remains simple. Except, why does the simple example not include the wrapper for the locator? Did I miss some exception? I guess I should mention, since I don't think we discussed it, and it could impact details: some styles require resorting and grouping of multi-reference citations for output. It's one reason why a distinction between global and local affixes is useful. So ...
... might become:
|
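The point about re-sorting can be sketched as follows. This is a minimal Python illustration, assuming a hypothetical model in which each key carries its own (local) affixes while the citation as a whole carries (global) ones; sorting by key stands in for whatever ordering a real style would prescribe, and the keys are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CiteItem:
    key: str
    prefix: str = ""   # local affix: travels with this key when items are re-sorted
    suffix: str = ""


@dataclass
class Citation:
    items: List[CiteItem] = field(default_factory=list)
    prefix: str = ""   # global affix: stays at the start of the whole citation
    suffix: str = ""   # global affix: stays at the end of the whole citation


def render(citation, sort_items=False):
    items = citation.items
    if sort_items:
        items = sorted(items, key=lambda item: item.key)
    rendered = "; ".join(
        " ".join(part for part in (item.prefix, item.key, item.suffix) if part)
        for item in items
    )
    body = " ".join(
        part for part in (citation.prefix, rendered, citation.suffix) if part
    )
    return f"({body})"


citation = Citation(
    items=[CiteItem("walters2000", prefix="also"), CiteItem("adams2015", suffix="p. 8")],
    prefix="see",
    suffix="for details",
)
unsorted_text = render(citation)
sorted_text = render(citation, sort_items=True)
```

After sorting, "also" still precedes walters2000 and "p. 8" still follows adams2015, while "see" and "for details" remain at the edges. If the leading "see" had instead been stored as a local prefix on the first key, it would have moved into the middle of the citation.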
To my mind, the truly simple case is to reduce the amount of magic that the citation processor will do without prompting; hence by default the simple case that I chose to show as
Unless I'm missing something, I believe the syntax I discussed still has the concept of "global" or per-citation affixes and "local" or per-key affixes. The primary difference is that the structure that represents any of them is identical: a list of
|
Right, so the simple example is maybe a little too simple for practical use. But I think that's fine. As we discussed, there are no magic bullets here that nicely balance all priorities, and if you want to have rigorous parsing without magic, that's a totally cool design decision.
It's really, at least in the CSL world, the style that governs whether it's a colon, comma, etc.
Yes, I wasn't meaning to suggest otherwise. It just occurred to me worth mentioning in this context. |
I would argue that depends on the user, but it's true that it makes locator magic opt-in and that does make the simplest version undesirable for peer-reviewed publishing (as opposed to personal or even school-assignment level citation needs, where the pure string version could serve many people well). I suppose the other option is to provide a set of delimiters that would actually make something explicitly a string instead, thus making locators more opt-out. Tbh, I'm not sure which decision is more elegant or practical. There's a certain purity to requiring a delimiter to "promote" a string into a locator that is very nice and reduces the surprises a lot. It also makes it clear that the user expects this next token to be a locator-unit, so a mistyping can trigger a macro error instead of silently failing to convert into a locator AST. It also seems potentially a bit weird to require a delimiter to make something a string when the surrounding invariant-data is already treated as string without requiring any such delimiters. There's a consistency and principle of least surprise that feels like it works well with the rest of the norg-spec philosophy. OTOH, if most people would prefer the magic, there is an elegance to making the syntax for it as straightforward as possible and relegating the less common option to the more cumbersome syntax. |
In the change I just pushed, the YAML representation would be: suffix: [see, page: 23, section: V] ... which is reasonably elegant for human writer and also machine parser. Edit: except it's biased towards English speakers, since all the symbols are English. Do you have any conventions for that sort of thing in neorg? |
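A short Python sketch of how a processor might consume that mixed list: plain strings pass through as text, and one-entry mappings are treated as locators. The list literal below is what a YAML parser would be expected to produce for the `suffix` value above; the label table and function names are hypothetical, and the English-only labels illustrate the localization concern raised in the edit.

```python
from typing import Dict, List, Union

AffixPart = Union[str, Dict[str, object]]

# Expected parse of:  suffix: [see, page: 23, section: V]
suffix: List[AffixPart] = ["see", {"page": 23}, {"section": "V"}]

# Hypothetical English-only label table; a real style or locale would supply this.
LABELS = {"page": "p.", "section": "sec."}


def render_affix(parts):
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)                      # plain text passes through
        else:
            for label, value in part.items():     # one-entry mapping = locator
                out.append(f"{LABELS.get(label, label)} {value}")
    return " ".join(out)


def strip_locators(parts):
    """Drop every locator, keeping only the free-text parts."""
    return [part for part in parts if isinstance(part, str)]


rendered = render_affix(suffix)
text_only = render_affix(strip_locators(suffix))
```

Because locators and free text are distinguishable by type, a style can relabel or remove every locator without guessing at the contents of a string.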
Nice to see the "affix as array of option<locator,string>" idea get a concrete commit! Would commas in the strings be escaped?
Conventions for what exactly? Lists? Key-value pairs? Tags are one of the mechanisms for calling macros; they can take a series of space-delimited parameters:
Attributes are also in the spec, and are closest to the idea of a key-value pair. Multiple items are
Is this what you were asking about? |
Hey! Is there any update on this topic? I think this could be helpful. It is a simple-to-use telescope plugin that allows you to render and pick BibTeX citations from .bib files. You can find it here: Telescope-BibTex BibTeX/BibLaTeX is widely used in science, and I believe it's one of the most commonly used formats. |
This is very much a work in progress, and is meant to centralize discussion about the syntax for citing references.