GSOC 2024 ideas list
We strongly believe in open source and offer interaction with a diverse community. JabRef aims to provide a welcoming experience to open source newcomers. We have three years of Google Summer of Code (GSoC) participation with great results. All of them are huge steps towards a well-usable research tool.
Participants will grow their technical, coding, and open source experience. They will also receive a stipend from Google. Finally, participants will expand their professional network.
Below are some project ideas that serve as a starting point for what could be done within a GSoC project. First, some links with more background information.
GSoC timeline
- latest proposal deadline: April 2nd, 18:00 UTC
- coding until: August 26th, 18:00 UTC (can be extended under conditions)
- GSoC stipends: starting at 750 USD, depending on the country.
- Checklist for items contained in the proposal: https://github.com/JabRef/jabref/wiki/GSOC-Application
(All summarized information is tentative. The definitive information is on the linked pages)
This page lists a number of ideas for potential projects to be carried out by participants in Google Summer of Code 2024. This is by no means a closed list, so potential contributors should feel free to propose alternative activities related to the project (the list of feature requests and the GitHub issue tracker might serve as additional sources of inspiration). Students are strongly encouraged to discuss their ideas with the developers and the community to improve their proposal before submission (e.g., using the Gitter Channel or the forum). It's also a good idea to start working on one of the smaller issues to familiarize yourself with the contribution process.
JabRef, a comprehensive literature management software, currently handles both metadata and text-based PDF documents. However, a significant limitation arises with scanned PDFs, particularly historical articles, which are not text-searchable due to their image-based format. This project aims to bridge this gap by integrating advanced OCR (Optical Character Recognition) technology, enabling full-text search in scanned PDFs.
Useful links:
- A Document AI Package: https://github.com/deepdoctection/deepdoctection
- Hand-written text recognition in historical documents: https://github.com/githubharald/SimpleHTR#handwritten-text-recognition-with-tensorflow
- Java OCR with Tesseract: Baeldung Guide
- OCRmyPDF Installation and Usage: GitHub Repository
- ChatOCR and ChatGPT Integration: Blog Article
- AI-Powered OCR: Addepto Blog
- Tika OCR Integration: Apache Tika Wiki
- Tesseract OCR Library: Official Documentation
- Surya, an AI-powered state-of-the-art OCR toolkit (better than Tesseract, but written in Python): https://github.com/VikParuchuri/surya
Some aspects:
- Add an option to call an OCR engine from JabRef, e.g., a cloud-based service or a local installation
- Define a common interface to support multiple OCR engines
- Provide a good default set of settings for the OCR engines
- Support expert configuration of the settings
- Add the extracted text as a layer to the PDF so that Lucene can parse it
- Add an option to further process the text with Grobid for training and metadata extraction
Expected outcome:
A) Develop a common interface within JabRef to accommodate multiple OCR engines, ensuring flexibility and expandability.
B) Enable expert users to fine-tune OCR settings, catering to specific needs or document formats.
C) Incorporate the OCR-extracted text as a searchable layer in PDFs, allowing Apache Lucene to index and search the content.
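To make outcome A) more concrete, here is a minimal sketch of what a common OCR interface could look like. Every type name below (`OcrEngine`, `OcrSettings`, `OcrResult`, `OcrEngineRegistry`, `FakeOcrEngine`) is hypothetical and does not exist in JabRef; the default values `"eng"` and `300` are illustrative assumptions, and the fake engine performs no real OCR.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types exist in JabRef.

/** Settings handed to an engine; expert users could override the defaults. */
record OcrSettings(String language, int dpi) {
    static OcrSettings defaults() {
        return new OcrSettings("eng", 300); // assumed defaults, for illustration only
    }
}

/** Text recognized for one PDF file. */
record OcrResult(Path pdf, String text) {}

/** Unchecked so that callers decide where to handle engine failures. */
class OcrException extends RuntimeException {
    OcrException(String message) {
        super(message);
    }
}

/** Common contract every OCR backend (local install or cloud service) would implement. */
interface OcrEngine {
    String name();

    OcrResult recognize(Path pdf, OcrSettings settings);
}

/** Registry so that a preferences dialog could list engines and pick one by name. */
class OcrEngineRegistry {
    private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

    void register(OcrEngine engine) {
        engines.put(engine.name(), engine);
    }

    Optional<OcrEngine> find(String name) {
        return Optional.ofNullable(engines.get(name));
    }
}

/** Stand-in engine that only exercises the interface; it returns placeholder text. */
class FakeOcrEngine implements OcrEngine {
    @Override
    public String name() {
        return "fake";
    }

    @Override
    public OcrResult recognize(Path pdf, OcrSettings settings) {
        return new OcrResult(pdf,
                "[" + settings.language() + "@" + settings.dpi() + "dpi] placeholder text");
    }
}
```

A caller would register one engine per backend and then run `registry.find("fake").orElseThrow().recognize(pdf, OcrSettings.defaults())`. Keeping the settings in a separate record is one possible way to offer good defaults while still allowing expert configuration.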
Skills required:
- Proficiency in Java programming.
- A keen interest and curiosity in document processing and AI technologies.
Possible mentors:
Project size:
175h (medium)
This project aims to revolutionize the way researchers interact with academic literature in JabRef, utilizing the power of Artificial Intelligence (AI) to enhance user experience and efficiency. The goal is to implement an AI feature allowing users to a) request summaries of PDF documents directly within JabRef and b) ask questions based on the "knowledge" inside the local PDFs. Ideally, the solution should work locally without any external cloud service.
More ideas: Support ChatGPT-powered search. See https://oa.mg/chatgpt.
Useful links:
remote:
- https://scholarai.io/ -- also provides "projects" with a collection of PDFs
- https://smallpdf.com/ai-pdf
- https://www.scholarcy.com/
- https://ai.google.dev/tutorials/rest_quickstart
- https://docs.mistral.ai/api/
- Scholar GPT (a ChatGPT Plugin)
local:
- https://github.com/kherud/java-llama.cpp
- https://github.com/kermitt2/grobid (Optimized for parsing scientific papers; See JabRef Grobid integrations 1 and 2)
- https://github.com/neuml/txtai?tab=readme-ov-file (optimized for RAG; Has Java bindings)
other:
- https://github.com/langchain4j/langchain4j - library to access both local and remote models
- https://github.com/deepset-ai/haystack (Remote, Local, RAG, Rest API with Docker)
- https://github.com/lifan0127/ai-research-assistant?tab=readme-ov-file (Aria; a Zotero plugin)
- https://www.baeldung.com/java-ai (List of AI projects in Java)
Popular libraries/frameworks/applications that have been considered, but that don't offer relevant functionality as Rest API or Java bindings:
- https://github.com/run-llama/llama_index (framework specialised for RAG, but offers no Java bindings or Rest API)
- semantic-kernel (library to interact with local and remote models (by Microsoft), but no RAG for Java yet. See feature matrix breakdown by programming language)
- https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/java (local RAG, Only Python bindings are maintained)
- https://github.com/stanfordnlp/dspy (The framework for programming—not prompting—foundation models)
Expected outcome:
Phase 1 (90h): Develop a module to connect JabRef with configurable online AI services that can generate summaries of academic papers and answer questions. Ensure this feature is user-friendly, allowing for seamless interaction (summary, asking questions) and customization according to user preferences. It has to be possible to ask questions covering selected (or even all) PDF files of a local library (.bib file with attached .pdf files).
Phase 2 (+90h): Develop a module to connect JabRef to a local AI service that can generate summaries of academic papers and answer questions. Ensure this feature is user-friendly, allowing for seamless interaction (summary, asking questions) and customization according to user preferences. There must not be any remote connection. It has to be possible to ask questions covering all PDF files of a local library (.bib file with attached .pdf files).
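One way to keep Phase 1 and Phase 2 interchangeable is to hide the model behind a small interface and to select relevant text before asking. The sketch below is only an illustration under assumptions: `LanguageModel`, `Passage`, and `PaperAssistant` are hypothetical names, and the retrieval step uses plain word overlap as a stand-in for the embedding-based retrieval that the RAG libraries listed above would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical abstraction over a remote (Phase 1) or local (Phase 2) model. */
interface LanguageModel {
    String complete(String prompt);
}

/** A chunk of text extracted from a PDF attached to one library entry. */
record Passage(String citationKey, String text) {}

class PaperAssistant {
    private final LanguageModel model;

    PaperAssistant(LanguageModel model) {
        this.model = model;
    }

    String summarize(Passage passage) {
        return model.complete("Summarize the following text:\n" + passage.text());
    }

    /** Ranks passages by word overlap with the question and sends the top ones to the model. */
    String ask(String question, List<Passage> library, int topK) {
        Set<String> queryTerms = terms(question);
        List<Passage> ranked = new ArrayList<>(library);
        ranked.sort(Comparator
                .comparingInt((Passage p) -> overlap(queryTerms, p.text()))
                .reversed());

        StringBuilder prompt = new StringBuilder("Answer using only these excerpts:\n");
        for (Passage p : ranked.subList(0, Math.min(topK, ranked.size()))) {
            prompt.append('[').append(p.citationKey()).append("] ")
                  .append(p.text()).append('\n');
        }
        prompt.append("Question: ").append(question);
        return model.complete(prompt.toString());
    }

    /** Crude tokenization: lower-cased runs of ASCII word characters. */
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        result.remove(""); // split() yields an empty token if the text starts with punctuation
        return result;
    }

    /** Number of distinct words the text shares with the query terms. */
    static int overlap(Set<String> queryTerms, String text) {
        Set<String> shared = terms(text);
        shared.retainAll(queryTerms);
        return shared.size();
    }
}
```

Because `LanguageModel` has a single method, a remote client and a local runtime could both be plugged in without changing `PaperAssistant`; for experiments, a lambda such as `prompt -> prompt` serves as a fake model.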
Possible Mentors:
@koppor, @Siedlerchr, @ThiloteE
Project size:
- Phase 1 only: 175h (medium)
- Phases 1 and 2: 350h (large)
This project aims to create an engaging and informative first start screen for JabRef, enhancing the initial user experience and showcasing the best features of the software. This screen will differ from the standard interface displayed when no database is open, providing a tailored introduction for new users.
Hints
- Configuration of Paper Directory: Implement a feature allowing users to easily set up and manage their paper directory, as detailed in Issue #41.
- Integration of Online Services: Include options for update checks, connecting with online services like Grobid (referencing Issue #566), fetchers, and full-text search capabilities.
- Incorporate telemetry features with a clear and concise privacy statement.
- Creation of Example Library: Develop a feature to create an example library, helping new users quickly understand JabRef's functionality.
- Community Engagement Tools: Add links to the JabRef forum for support and Mastodon for community interaction.
- Donation Prompt: Encourage support for JabRef through a tastefully integrated donation option.
- User Group-Specific Defaults: Offer pre-configured default preferences catering to different user groups, such as "relaxed users" wanting all features, and "pro-users" who prefer managing BibTeX files without additional features (as per Issue #9491).
(These are just ideas; they need to be refined during the project.)
Expected Outcome:
A welcome dialog with a pleasant and welcoming UX.
Examples:
- The welcome dialog should ask for: Configuration of Paper Directory, Integration of Online Services (Grobid, Telemetry), Creation of Example Library, Community Engagement Tools, Link to Donation page
- The welcome dialog should offer some sensible User Group-Specific Defaults: pre-configured default preferences catering to different user groups, such as "relaxed users" wanting all features, and "pro-users" who prefer managing BibTeX files without additional features (as per Issue #9491).
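The user-group-specific defaults could, for instance, be modeled as named presets that the welcome dialog applies in one step. This is a rough sketch only: the `Feature` and `UserProfile` enums and their constants are hypothetical and do not correspond to existing JabRef preference classes.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical optional features a welcome dialog could switch on or off. */
enum Feature { GROBID, ONLINE_FETCHERS, FULLTEXT_SEARCH, UPDATE_CHECK }

/** Hypothetical presets for the two user groups named in the hints. */
enum UserProfile {
    RELAXED(EnumSet.allOf(Feature.class)),  // "relaxed users" wanting all features
    PRO(EnumSet.noneOf(Feature.class));     // "pro-users" without additional features

    private final EnumSet<Feature> enabled;

    UserProfile(EnumSet<Feature> enabled) {
        this.enabled = enabled;
    }

    /** Returns a copy so that callers cannot modify the preset itself. */
    Set<Feature> enabledFeatures() {
        return EnumSet.copyOf(enabled);
    }

    boolean isEnabled(Feature feature) {
        return enabled.contains(feature);
    }
}
```

The dialog would then only need to ask for a profile and write `profile.enabledFeatures()` into the preferences; individual switches could still be changed afterwards.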
Skills required:
- Java, JavaFX
Possible Mentors:
Project size:
- 175h (medium)
Description:
With the ever-growing number of publications in computer science and other fields of research, conducting secondary studies becomes necessary to summarize the current state of the art. For software engineering research, Kitchenham popularized the systematic literature review (SLR) method to address this issue. The main idea is to systematically identify and analyze the majority of relevant publications on a specific topic. This is usually an activity that takes extensive manual effort. Some tool support does exist, but the full potential of tools has not been exploited yet. JabRef also offers basic functionality for systematic literature reviews that is used by a number of researchers to systematically "harvest" related work based on the fetching capabilities of JabRef. While using the feature, various additional feature requests came up. For instance, created search queries are currently transformed internally by JabRef to the query format of the publisher. It should also be possible to directly input a query at the publisher site, e.g., for IEEE or ACM. More information: Dominik Voigt, Oliver Kopp, Karoline Wild: Systematic Literature Tools: Are we there yet? ZEUS 2021: 83-88
One key aspect would be improving the fetcher infrastructure in JabRef so that it adapts better to new and changing publisher/journal websites and offers a more direct integration. For inspiration, see BibDesk.
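The two query paths mentioned in the description (internal transformation versus entering a query directly in the publisher's syntax) could be modeled roughly as follows. This is a sketch under assumptions: `SearchQuery`, `GenericQuery`, `NativeQuery`, `QueryTranslator`, and `AndJoiningTranslator` are hypothetical names, and the `AND`-joined output format is only a placeholder, not the syntax of any particular publisher.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Either a publisher-independent query or one already written in a publisher's syntax. */
sealed interface SearchQuery permits GenericQuery, NativeQuery {}

/** Terms that must all match; translated per publisher. */
record GenericQuery(List<String> allOfTerms) implements SearchQuery {}

/** Query typed by the researcher in the publisher's own syntax; never rewritten. */
record NativeQuery(String publisherSyntax) implements SearchQuery {}

/** Hypothetical per-publisher translation step; real JabRef fetchers differ. */
interface QueryTranslator {
    String toPublisherQuery(SearchQuery query);
}

/** Placeholder translator that quotes each term and joins the terms with AND. */
class AndJoiningTranslator implements QueryTranslator {
    @Override
    public String toPublisherQuery(SearchQuery query) {
        if (query instanceof NativeQuery nativeQuery) {
            return nativeQuery.publisherSyntax(); // pass through untouched
        }
        GenericQuery generic = (GenericQuery) query;
        return generic.allOfTerms().stream()
                .map(term -> "\"" + term + "\"")
                .collect(Collectors.joining(" AND "));
    }
}
```

Making the distinction explicit in the type would let the SLR feature record which kind of query was used for each publisher, which helps reproducibility of the review.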
Expected outcome:
Advanced SLR functionality that supports researchers in carrying out a systematic literature review.
We did an initial project organization at https://github.com/users/koppor/projects/2.
Skills required:
- Java, JavaFX
Possible mentors:
@koppor, @Siedlerchr, @calixtus
Project size: 350h (large) - Can also be stripped down to medium.
Description:
JabRef can connect to LibreOffice to offer premier reference management for LibreOffice. Currently, custom styles are supported. In this project, this support should be extended to offer support for the "Citation Style Language" files. A user should be able to choose the CSL style for the reference list and the citation style. Then, the LibreOffice document should adapt accordingly. For more information on CSL refer to https://citationstyles.org/. [Details: #8893]
In the LaTeX world, .bst is still popular. JabRef has BST support, but it is currently not visible in the UI. In LibreOffice, it should be possible to select a .bst file, which is then used for rendering. [Details: #624]
The internal format of references is currently a JabRef-custom format. It should be changed to a format used by Zotero. See the discussion at https://github.com/JabRef/jabref/issues/2146#issuecomment-891432507 for details. This includes: i) implementation of that format, ii) implementation of a converter from the "old" JabRef-Format to the new one. The converter could be implemented within OpenOffice (similar to JabRef_LibreOffice_Converter).
Finally, one can work on improving the JabRef-LibreOffice-Plugin. See https://github.com/JabRef/jabref/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3Aopenoffice%2Flibreoffice for ideas. For instance, it should be possible to have footnote-based citations (see https://docs.jabref.org/cite/openofficeintegration#known-issues).
Expected outcome:
- It is possible to select and change a CSL style for a LibreOffice document.
- It is possible to select a .bst file.
- The internal format of citations is changed to the Zotero format.
Possible Mentors:
@koppor, @Siedlerchr, @calixtus
Project size:
- 90h (small) (if only CSL style selection and work on Zotero format)
- 175h (medium) (CSL + .bst + Zotero + other issues fixed)
You can also propose other projects. JabRef offers a variety of places where it can be improved. Think as a user or talk to other users. The following places are a good start:
- Feature requests prioritized: https://github.com/orgs/JabRef/projects/6
- General list of feature requests: http://discourse.jabref.org/c/features
- Candidates of university projects, the large ones: https://github.com/orgs/JabRef/projects/3/views/3?filterQuery=status%3A%22free+to+take%22+size-of-project%3Alarge&sortedBy%5Bdirection%5D=desc&sortedBy%5BcolumnId%5D=8246261