Span.start_char has unexpected return value #5541
-
How to reproduce the behaviour
from spacy.lang.en import English
text = """Fast and Precise Type Checking for JavaScript AVIK CHAUDHURI, Facebook Inc., USA PANAGIOTIS VEKRIS, University of California, San Diego, USA SAM GOLDMAN, Facebook Inc., USA MARSHALL ROCH, Facebook Inc., USA GABRIEL LEVI, Facebook Inc., USA In this paper we present the design and implementation of Flow, a fast and precise type checker for JavaScript that is used by thousands of developers on millions of lines of code at Facebook every day. Flow uses sophisticated type inference to understand common JavaScript idioms precisely. This helps it find non-trivial bugs in code and provide code intelligence to editors without requiring significant rewriting or annotations from the developer. We formalize an important fragment of Flow’s analysis and prove its soundness. Furthermore, Flow uses aggressive parallelization and incrementalization to deliver near-instantaneous response times. This helps it avoid introducing any latency in the usual edit-refresh cycle of rapid JavaScript development. We describe the algorithms and systems infrastructure that we built to scale Flow’s analysis. CCS Concepts: • Theory of computation → Type structures; Program analysis; Additional Key Words and Phrases: Type Systems, Type Inference, JavaScript 1 INTRODUCTION JavaScript is one of the most popular languages for writing web and mobile applications today. The language facilitates fast prototyping of ideas via dynamic typing. The runtime provides the means for fast iteration on those ideas via dynamic compilation. This fuels a fast edit-refresh cycle, which promises an immersive coding experience that is quite appealing to creative developers. However, evolving and growing a JavaScript codebase is notoriously challenging. 
Developers spend a lot of time debugging silly mistakes—like mistyped property names, out-of-order arguments, references to missing values, checks that never fail due to implicit conversions, and so on—and worse, unraveling assumptions and guarantees in code written by others. In many other languages, this overhead is mitigated by having a layer of types over the code and building tools for the developer that use type information. For example, types can be used to identify common bugs and to document interfaces of libraries. Our aim is to bring such type-based tooling to JavaScript. 1.1 Goals In this paper, we present the design and implementation of Flow, a static type checker for JavaScript we have built and have been using at Facebook for the past three years. The idea of using types to manage code evolution and growth in JavaScript (and related languages) is not new. In fact, several useful type systems have been built for JavaScript in recent years. The design and implementation of Flow are driven by the specific demands of real-world JavaScript development we have observed at Facebook and the industry at large. • The type checker must be able to cover large parts of the codebase without requiring too many changes in the code. Developers want precise answers to code intelligence queries (the type of an expression, the definition reaching a reference, the set of possible completions at a point). Relatedly, they want to catch a large number of common bugs with few false positives. arXiv:1708.08021v2 [cs.PL] 30 Aug 2017"""
nlp = English()
for idx, token in enumerate(nlp(text)):
    print(idx, token.doc[token.idx:token.idx+1].start_char)
...
103 3149
104 3173
105 3187
106 3214
107 3235
108 3264
109 0
110 0
111 0
112 0
113 0
114 0
115 0
...
It seems spaCy handles the first 109 tokens correctly, but then returns 0 as the start_char for all remaining tokens.
Info about spaCy
Replies: 2 comments
-
Thanks for the clarification.
In doc[i], the i refers to the token index, not the character offset in the text string. You don't need to access a span to get the character position of the token, either: it's available as Token.idx, so a shorter way to do this is to read Token.idx directly. Token.i is the token position and Token.idx is the character offset, which is admittedly a bit confusing because it's not consistent with the Span API.
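To make the difference between a token position and a character offset concrete, here is a minimal sketch in plain Python. It does not use spaCy: the `tokenize` helper and its whitespace splitting are assumptions made for illustration, with `i` standing in for Token.i and `idx` for Token.idx. It shows why slicing by a character offset first returns the wrong token and then, once the offset exceeds the number of tokens, an empty slice.

```python
import re


def tokenize(text):
    """Hypothetical whitespace tokenizer returning (i, idx, word) triples.

    i   -> position of the token in the token list (plays the role of Token.i)
    idx -> character offset of the token in the string (plays the role of Token.idx)
    """
    return [(i, m.start(), m.group())
            for i, m in enumerate(re.finditer(r"\S+", text))]


text = "Flow is a fast and precise type checker for JavaScript"
tokens = tokenize(text)

for i, idx, word in tokens:
    # Correct: the character offset stored with the token points at its text.
    assert text[idx:idx + len(word)] == word

    # Mistake from the question: using the character offset as a token index.
    # The slice picks an unrelated token, or is empty once idx >= len(tokens).
    wrong = tokens[idx:idx + 1]
    print(i, idx, word, wrong[0][2] if wrong else None)
```

In spaCy itself the corresponding values are read straight from the token as `token.i` and `token.idx`; a span for a single token would be sliced with the token index, as in `doc[token.i:token.i + 1]`.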