3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)

With great scientific breakthroughs come solid engineering and open communities. The Natural Language Processing (NLP) community has benefited greatly from the open culture in sharing knowledge, data, and software. The primary objective of this workshop is to further the sharing of insights on the engineering and community aspects of creating, developing, and maintaining NLP open source software (OSS), which we seldom talk about in scientific publications. Our secondary goal is to promote synergies between different open source projects and encourage cross-software collaborations and comparisons.

We use Natural Language Processing OSS as an umbrella term that covers not only traditional syntactic, semantic, phonetic, and pragmatic applications but also task-specific applications (e.g., machine translation, information retrieval, question-answering systems), low-level string processing that carries valid linguistic information (e.g., Unicode creation for new languages, language-based character set definitions), and machine learning/artificial intelligence frameworks with functionality focused on text applications.

There are many workshops focusing on the creation and curation of open language resources and annotations (e.g., BUCC, GWN, LAW, LOD, WAC). Moreover, we have the flagship LREC conference dedicated to linguistic resources. However, the engineering aspects of NLP-OSS are overlooked and under-discussed within the community. There are open source conferences and venues (such as FOSDEM, OSCON, Open Source Summit) where discussions range from operating system kernels to air traffic control hardware, but NLP-related presentations are rare. In the Machine Learning (ML) field, the Journal of Machine Learning Research - Machine Learning Open Source Software (JMLR-MLOSS) is a forum for discussion and dissemination of ML OSS topics. We envision the Workshop for NLP-OSS becoming a similar avenue for NLP-OSS discussions.

A decade ago, there was also the SETQA-NLP (Software Engineering, Testing, and Quality Assurance for Natural Language Processing) workshop that raised awareness of the need for good software engineering practices in NLP. In the earlier days of NLP, linguistic software was often monolithic and the learning curve to install, use, and extend the tools was steep and frustrating. More often than not, NLP-OSS developers/users interact in siloed communities within the ecologies of their respective projects. In addition to engineering aspects of NLP software, the open source movement has brought a community aspect that we often overlook in building impactful NLP technologies.

More recently, there have been successful workshops that examine and promote open science in NLP. While important and complementary, the goals of these workshops are distinct from those of NLP-OSS, which focuses more on the community of practice in open-source software in support of NLP and language technologies. We expect many who participated in the BigScience workshop to participate in NLP-OSS, as many of its participants served as PC members in past editions of NLP-OSS. Another grassroots community movement, Eleuther AI, started with researchers attempting to replicate commercial language models and has since grown into an active decentralized community of volunteer researchers, engineers, and developers focused on AI alignment, scaling, and open source AI research.

With the rise of open source startups like Hugging Face, the democratization of NLP gives researchers and the general public easy access to language models once available only to a handful of industrial research labs. This accelerating availability of NLP tools creates new synergies with cloud platforms, e.g., Hugging Face x AWS SageMaker, that allow engineers and researchers to train and deploy live applications with minimal infrastructure setup. Standing on the shoulders of giants, the scikit-learn and Hugging Face ecosystems are now interoperable under the skops framework.

We want to highlight these emergent communities and synergies in the third NLP-OSS workshop and promote future collaborations among like-minded open source NLP researchers. The first NLP-OSS workshop, co-located with ACL 2018, was the first workshop in recent years to focus more on building quality software for NLP, open sourcing, and developing useful engineering practices, and less on scientific novelty or state-of-the-art development; a second edition followed in 2020. We hope that the 3rd NLP-OSS workshop can also be hosted at an *ACL conference as an intellectual forum to collate this type of knowledge, announce new software and features, and promote an open source culture and best practices that go beyond the conferences.

Call for Papers

We invite full papers (8 pages) or short papers (4 pages) on topics related to NLP-OSS broadly categorized into (i) software development, (ii) scientific contribution and (iii) NLP-OSS case studies.

  • Software Development

    • Designing and developing NLP-OSS
    • Licensing issues in NLP-OSS
    • Backwards compatibility and stale code in NLP-OSS
    • Growing, maintaining and motivating an NLP-OSS community
    • Best practices for NLP-OSS documentation and testing
    • Contribution to NLP-OSS without coding
    • Incentivizing OSS contributions in NLP
    • Commercialization and Intellectual Property of NLP-OSS
    • Defining and managing NLP-OSS project scope
    • Issues in API design for NLP
    • NLP-OSS software interoperability
    • Analysis of the NLP-OSS community
  • Scientific Contribution

    • Surveying OSS for specific NLP task(s)
    • Demonstrations, introductions and/or tutorials of NLP-OSS
    • Small but useful NLP-OSS
    • NLP components in ML OSS
    • Citations and references for NLP-OSS
    • OSS and experiment replicability
    • Gaps between existing NLP-OSS
    • Task-generic vs task-specific software
  • Case studies

    • Case studies of how a specific bug is fixed or feature is added
    • Writing wrappers for other NLP-OSS
    • Writing open-source APIs for open data
    • Teaching NLP with OSS
    • NLP-OSS in the industry

Demographic Diversity

Organizers: We have 5 organizers with representation from industrial NLP/ML labs, a government organization, and academic institutes.

PC members: We strive to have a balance of academic and industrial PC members from diverse gender and geolocation demographics. We extended the PC member list from the NLP-OSS 2018 edition by inviting a subset of the WiNLP members on the BIG directory; invitees who accepted have since joined us in NLP-OSS 2020 and have been reinvited for the proposed 2023 edition.

Misc

Estimated no. of Attendees: 50

Shared Task: No

Special Requirements / Technical Needs: No

Preferred Venue:

  1. EMNLP
  2. ACL
  3. EACL

Previous Workshop:

Expected no. of submissions: 30-40 submissions

Organizers

  • Geeticka Chauhan, Massachusetts Institute of Technology

    Geeticka Chauhan is a Ph.D. student at MIT, working on NLP for healthcare, advised by Prof. Peter Szolovits. Her master's thesis focused on revealing the reproducibility and generalizability problems in Relation Extraction, and experimentally showed the importance of streamlining evaluation methods in NLP challenges.

  • Dmitrijs Milajevs, Grayscale AI

    Dmitrijs Milajevs is a data scientist at KPMG. Previously, he evaluated information retrieval systems at the National Institute of Standards and Technology (NIST). He has defended a Ph.D. thesis on the evaluation of compositional models in distributional semantics.

  • Elijah Rippeth, University of Maryland

    Elijah Rippeth is a Ph.D. student at the University of Maryland in the Department of Computer Science. His work spans natural language processing broadly, with a focus on multilingual NLP, cross-lingual transfer, and machine translation.

  • Jeremy Gwinnup, Air Force Research Laboratory

    Jeremy Gwinnup is a Research Computer Scientist in the Airman Systems Directorate of the Air Force Research Laboratory located in Dayton, Ohio, USA. His research focuses on multimodal machine translation, which is also the topic of his studies as a Doctor of Engineering student at Johns Hopkins University.

  • Liling Tan, Rakuten Institute of Technology

    Liling Tan is a research scientist at Rakuten Institute of Technology working on Machine Translation and developing applications using language technologies. He has been actively involved in corpora creation/maintenance, Asian NLP and machine translation. He co-organized the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2014-16).

Programme Committee (Confirmed from pre-proposal survey)

  • Aakanksha Naik, Allen Institute for Artificial Intelligence
  • Aitor Soroa, HiTZ Center - Ixa, University of the Basque Country UPV/EHU
  • Alexander Rush, Cornell, Hugging Face
  • Aline Paes, Universidade Federal Fluminense
  • Amittai Axelrod, Apple AI
  • Anish Mohan, Nvidia
  • Arun Balajiee Lekshmi Narayanan, University of Pittsburgh
  • Atnafu Lambebo Tonja, Instituto Politécnico Nacional
  • Atul Kr. Ojha, University of Galway
  • Cassandra Jacobs, University at Buffalo
  • Christoph Teichmann, Bloomberg LP
  • Daniel Braun, University of Twente
  • Dave Howcroft, Edinburgh Napier University
  • Diana Maynard, University of Sheffield
  • Flammie A Pirinen, University of Norway
  • Gérard Dupont, Mavenoid
  • Jack Morris, Cornell University
  • Jörg Tiedemann, University of Helsinki
  • Karin Sim, Language Weaver
  • Kevin Cohen, University of Colorado
  • Lane Schwartz, University of Alaska Fairbanks
  • Leo Boytsov, Amazon
  • Lucy Park, Upstage
  • Maarten van Gompel, Radboud University
  • Maheshwar Ghankot, Indira Gandhi National Open University
  • Mallika Singh, Harvard Medical School
  • Marco Cognetta, Tokyo Institute of Technology, Google
  • Marzieh Fadaee, Zeta Alpha Vector
  • Matt Post, Microsoft
  • Micah Shlain, Allen Institute for Artificial Intelligence
  • Michael Wayne Goodman, LivePerson, Inc.
  • Mohd Sanad Zaki Rizvi, University of Edinburgh
  • Nelson F. Liu, Stanford University
  • Ogundepo Odunayo, University of Waterloo
  • Pasquale Lisena, EURECOM
  • Phu Mon Htut, AWS AI Labs
  • Raeid Saqur, Princeton University
  • Raphael Tang, Comcast Applied AI
  • Sagnik Ray Choudhury, University of Michigan
  • Shilpa Suresh, Harvard Medical School
  • Sina Ahmadi, George Mason University
  • Steve DeNeefe, RWS Language Weaver
  • Steven Bethard, University of Arizona
  • Taha Zerrouki, Bouira University Algeria
  • Tenzin Bhotia, Delhi Technological University
  • Thomas Kober, Zalando SE
  • Tomas Mikolov, Czech Institute of Informatics
  • Tommaso Teofili, Roma Tre University
  • Vlad Niculae, University of Amsterdam
  • Won Ik Cho, Seoul National University
  • Zaid Alyafeai, King Fahd University of Petroleum and Minerals
  • Ziv Litmanovitz, University of Haifa

Programme Committee (Previous PC, To follow up with)

  • Abigail See
  • Akiko Eriguchi
  • Amandalynne Paullada
  • Anca Dumitrache
  • Andreas Mueller
  • Arwen Griffioen
  • Brendan O'Connor
  • Carolina Scarton
  • Chris Hokamp
  • Christian Federmann
  • Christopher Manning
  • Dan Simonson
  • David Przybilla
  • Delip Rao
  • Denny Britz
  • Ehsan Khoddam
  • Eleftherios Avramidis
  • Emiel van Miltenburg
  • Emily Dinan
  • Eva Maria Vecchi
  • Fabio Kepler
  • Francis Bond
  • Frédéric Blain
  • Graham Neubig
  • Grzegorz Chrupała
  • Hal Daumé III
  • Ian Soboroff
  • Ignatius Ezeani
  • James Bradbury
  • Jason Baldridge
  • Jiatao Gu
  • Joel Grus
  • Joel Nothman
  • Jon Dehdari
  • Karin Sim Smith
  • Kheng Hui Yeo
  • Kyunghyun Cho
  • Madison May
  • Marcel Bollmann
  • Marcos Zampieri
  • Mary Ellen Foster
  • Matthew Honnibal
  • Moshe Wasserblat
  • Muthu Kumar Chandrasekaran
  • Paul Pu Liang
  • Philipp Koehn
  • Pontus Stenetorp
  • Rachael Tatman
  • Radim Rehurek
  • Rico Sennrich
  • Sandya Mannarswamy
  • Sang Phan
  • Shamil Chollampatt
  • Sharat Chikkerur
  • Shubhanshu Mishra
  • Stephen Sloto
  • Svitlana Vakulenko
  • Taku Kudo
  • Tareq Al-Moslmi
  • Tilahun Abedissa Taffa
  • Varun Kumar
  • Vered Shwartz
  • Yusuke Miyao
  • Yves Peirsman