Span.start_char has unexpected return value #5541

elben10 · 2020-06-04T09:52:53Z

How to reproduce the behaviour

from spacy.lang.en import English
text = """Fast and Precise Type Checking for JavaScript AVIK CHAUDHURI, Facebook Inc., USA PANAGIOTIS VEKRIS, University of California, San Diego, USA SAM GOLDMAN, Facebook Inc., USA MARSHALL ROCH, Facebook Inc., USA GABRIEL LEVI, Facebook Inc., USA In this paper we present the design and implementation of Flow, a fast and precise type checker for JavaScript that is used by thousands of developers on millions of lines of code at Facebook every day. Flow uses sophisticated type inference to understand common JavaScript idioms precisely. This helps it find non-trivial bugs in code and provide code intelligence to editors without requiring significant rewriting or annotations from the developer. We formalize an important fragment of Flow’s analysis and prove its soundness. Furthermore, Flow uses aggressive parallelization and incrementalization to deliver near-instantaneous response times. This helps it avoid introducing any latency in the usual edit-refresh cycle of rapid JavaScript development. We describe the algorithms and systems infrastructure that we built to scale Flow’s analysis. CCS Concepts: • Theory of computation → Type structures; Program analysis; Additional Key Words and Phrases: Type Systems, Type Inference, JavaScript 1 INTRODUCTION JavaScript is one of the most popular languages for writing web and mobile applications today. The language facilitates fast prototyping of ideas via dynamic typing. The runtime provides the means for fast iteration on those ideas via dynamic compilation. This fuels a fast edit-refresh cycle, which promises an immersive coding experience that is quite appealing to creative developers. However, evolving and growing a JavaScript codebase is notoriously challenging. 
Developers spend a lot of time debugging silly mistakes—like mistyped property names, out-of-order arguments, references to missing values, checks that never fail due to implicit conversions, and so on—and worse, unraveling assumptions and guarantees in code written by others. In many other languages, this overhead is mitigated by having a layer of types over the code and building tools for the developer that use type information. For example, types can be used to identify common bugs and to document interfaces of libraries. Our aim is to bring such type-based tooling to JavaScript. 1.1 Goals In this paper, we present the design and implementation of Flow, a static type checker for JavaScript we have built and have been using at Facebook for the past three years. The idea of using types to manage code evolution and growth in JavaScript (and related languages) is not new. In fact, several useful type systems have been built for JavaScript in recent years. The design and implementation of Flow are driven by the specific demands of real-world JavaScript development we have observed at Facebook and the industry at large. • The type checker must be able to cover large parts of the codebase without requiring too many changes in the code. Developers want precise answers to code intelligence queries (the type of an expression, the definition reaching a reference, the set of possible completions at a point). Relatedly, they want to catch a large number of common bugs with few false positives. arXiv:1708.08021v2  [cs.PL]  30 Aug 2017"""
nlp = English()
for idx, token in enumerate(nlp(text)):
    print(idx, token.doc[token.idx:token.idx+1].start_char)

It seems spaCy handles the first 109 tokens correctly, but for every token after that, start_char is returned as 0.

Info about spaCy

spaCy version: 2.2.4
Platform: macOS-10.15.5-x86_64-i386-64bit
Python version: 3.8.2

Answered by adrianeboyd

adrianeboyd · 2020-06-04T12:21:14Z

In doc[i], i is the token index, not the character offset in the text string. You also don't need a span to get a token's character position: it's available as Token.idx, so a shorter way to do this is:

for idx, token in enumerate(nlp(text)):
    print(idx, token.idx) # (also: doc[idx].idx)

Token.i is the token position and Token.idx is the character offset, which is admittedly a bit confusing because it's not consistent with the Span API.
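
To make the distinction concrete, here is a minimal sketch. The sample sentence and the variable names (span, checker) are made up for illustration and are not from the thread. It prints Token.i next to Token.idx, shows that slicing a Doc takes token indices, and uses Doc.char_span for the case where you start from character offsets.

```python
from spacy.lang.en import English

nlp = English()  # tokenizer-only pipeline, as in the question
doc = nlp("Flow is a type checker for JavaScript.")  # illustrative sentence

for token in doc:
    # token.i is the position in the Doc; token.idx is the character offset in doc.text
    print(token.i, token.idx, token.text)

# Slicing a Doc takes token indices, so this span covers tokens 5 and 6.
span = doc[5:7]
print(span.text)                       # for JavaScript
print(span.start, span.end)            # token indices: 5 7
print(span.start_char, span.end_char)  # character offsets: 23 37

# Starting from character offsets instead, Doc.char_span builds the span.
# It returns None if the offsets do not line up with token boundaries.
checker = doc.char_span(15, 22)
print(checker.text, checker.start, checker.end)  # checker 4 5
```

In the original snippet, token.idx (a character offset) was passed where a token index was expected, so the slice only lined up with the intended token by coincidence.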

elben10 · 2020-06-04T13:34:58Z

Thanks for the clarification.
