Download a PubMed data file; the demo uses pubmed24n1219.xml.gz.
Unzip it to ./data-source/pubmed24n1219.xml, then run the demo:
yarn tsx src/es-test.ts
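If no gunzip tool is at hand, the unzip step can also be done from Node. The following is a minimal sketch, not part of the repository: `gunzipTo` is a hypothetical helper, and the archive location `./pubmed24n1219.xml.gz` is an assumption about where the download was saved.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as zlib from "node:zlib";

// Hypothetical helper (not part of the repository): decompress a .gz file
// to `target`, creating the target directory if needed.
// It reads the whole archive into memory, which is acceptable for a single
// demo file; a streaming pipeline would suit larger batches better.
function gunzipTo(source: string, target: string): void {
  fs.mkdirSync(path.dirname(target), { recursive: true });
  const compressed = fs.readFileSync(source);
  fs.writeFileSync(target, zlib.gunzipSync(compressed));
}

// Target path taken from the setup step; the archive path is an assumption.
const archive = "./pubmed24n1219.xml.gz";
if (fs.existsSync(archive)) {
  gunzipTo(archive, "./data-source/pubmed24n1219.xml");
}
```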
The image provides an overview of the system architecture. Let's break it down:
Data Sources:
- PubMed XML
- Twitter JSON
Parsing:
- XML Parser for PubMed
- JSON Parser for Twitter
Each data source has its own parser so that its format is handled correctly.
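One way to keep the two parsers separate is to put them behind a common interface. This is a sketch under stated assumptions: `RawRecord`, `SourceParser`, and both class names are hypothetical, and the real files in src/parsers/ may look different. The XML side uses a toy regular-expression extractor over the `PubmedArticle`, `PMID`, and `ArticleTitle` elements purely for illustration; real code would more likely rely on a proper XML parsing library.

```typescript
// Hypothetical shapes; the real parsers in src/parsers/ may differ.
interface RawRecord {
  source: "pubmed" | "twitter";
  id: string;
  text: string;
}

interface SourceParser {
  parse(input: string): RawRecord[];
}

// Toy XML extraction with regular expressions, for illustration only.
class PubmedXmlParser implements SourceParser {
  parse(input: string): RawRecord[] {
    const records: RawRecord[] = [];
    const articlePattern = /<PubmedArticle>([\s\S]*?)<\/PubmedArticle>/g;
    for (const match of input.matchAll(articlePattern)) {
      const body = match[1];
      const id = /<PMID[^>]*>(\d+)<\/PMID>/.exec(body)?.[1];
      const title = /<ArticleTitle>([\s\S]*?)<\/ArticleTitle>/.exec(body)?.[1] ?? "";
      if (id) records.push({ source: "pubmed", id, text: title.trim() });
    }
    return records;
  }
}

// Assumes the input is one JSON array of objects with `id` and `text` fields.
class TwitterJsonParser implements SourceParser {
  parse(input: string): RawRecord[] {
    const tweets = JSON.parse(input) as Array<{ id: string | number; text?: string }>;
    return tweets.map((t): RawRecord => ({
      source: "twitter",
      id: String(t.id),
      text: t.text ?? "",
    }));
  }
}

// One parser per source, selected by source name.
const parsers: Record<RawRecord["source"], SourceParser> = {
  pubmed: new PubmedXmlParser(),
  twitter: new TwitterJsonParser(),
};
```

Because both parsers return the same record shape, later stages do not need to know which format a record came from.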
Structured Data:
The parsed data is converted into a unified structured format with fields for Doc, Meta, Twitter, and Pubmed.
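The unified format could be modelled roughly as follows. The four top-level fields come from the description above; every property inside them, as well as the `fromTweet` mapper, is an assumption made for illustration and may not match src/models/.

```typescript
// Field names Doc, Meta, Twitter, and Pubmed come from the description above;
// every property inside them is an assumption made for illustration.
interface Doc { id: string; title: string; body: string }
interface Meta { source: "pubmed" | "twitter"; ingestedAt: string }
interface TwitterFields { handle: string }
interface PubmedFields { pmid: string; journal?: string }

interface StructuredData {
  doc: Doc;
  meta: Meta;
  twitter?: TwitterFields; // set only for Twitter records
  pubmed?: PubmedFields;   // set only for PubMed records
}

// Hypothetical mapper from a parsed tweet to the unified shape.
function fromTweet(
  tweet: { id: string; text: string; handle: string },
  now: Date = new Date(),
): StructuredData {
  return {
    doc: { id: `twitter:${tweet.id}`, title: tweet.text.slice(0, 80), body: tweet.text },
    meta: { source: "twitter", ingestedAt: now.toISOString() },
    twitter: { handle: tweet.handle },
  };
}
```

Keeping the source-specific fields optional lets both sources share one document type while still carrying their own details.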
Worker:
Processes the structured data, performing tasks such as text normalization and entity extraction.
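As a rough illustration of the two tasks named above, the sketch below normalizes text and pulls out simple entities. Both functions are hypothetical, and the hashtag and mention pattern is only a stand-in for whatever extraction the real worker in src/workers/ performs.

```typescript
// Normalization sketch: Unicode NFKC, collapsed whitespace, lower case.
function normalizeText(text: string): string {
  return text.normalize("NFKC").replace(/\s+/g, " ").trim().toLowerCase();
}

// Toy entity extraction (an assumption): unique hashtags and @mentions,
// lower-cased, in order of first appearance.
function extractEntities(text: string): string[] {
  const matches = text.match(/[#@]\w+/g) ?? [];
  return [...new Set(matches.map((m) => m.toLowerCase()))];
}

// Hypothetical worker step combining both tasks.
function processText(text: string): { normalized: string; entities: string[] } {
  return { normalized: normalizeText(text), entities: extractEntities(text) };
}
```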
Document Services:
- Create Doc Service
- Get Doc Service
These services handle document creation and retrieval.
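The split between creation and retrieval might look like the sketch below. The in-memory `Map` is a stand-in chosen for illustration; the source does not say where documents are actually stored, and the `StoredDoc` shape and sequential IDs are assumptions.

```typescript
// Assumed document shape for this sketch.
interface StoredDoc { id: string; title: string; body: string }

// Creates documents in a shared store (here a Map, as a stand-in).
class CreateDocService {
  private nextId = 1;
  private readonly store: Map<string, StoredDoc>;

  constructor(store: Map<string, StoredDoc>) {
    this.store = store;
  }

  create(input: { title: string; body: string }): StoredDoc {
    const doc: StoredDoc = { id: String(this.nextId++), title: input.title, body: input.body };
    this.store.set(doc.id, doc);
    return doc;
  }
}

// Reads documents back from the same store.
class GetDocService {
  private readonly store: Map<string, StoredDoc>;

  constructor(store: Map<string, StoredDoc>) {
    this.store = store;
  }

  get(id: string): StoredDoc | undefined {
    return this.store.get(id);
  }
}

const docStore = new Map<string, StoredDoc>();
const createDocService = new CreateDocService(docStore);
const getDocService = new GetDocService(docStore);
```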
Search Services:
- Search Service
- Index Doc Service
These services handle search and indexing requests and interact with Elasticsearch.
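The requests these services send to Elasticsearch could be built along the following lines. This sketch only constructs request bodies: a `multi_match` query for searching and a newline-delimited body for the bulk API for indexing. The function names, the `title` and `body` fields, and the boost on `title` are assumptions.

```typescript
// Search request body: a multi_match query over assumed `title` and `body`
// fields, with matches in `title` weighted twice as heavily.
function buildSearchBody(query: string, size = 10) {
  return {
    size,
    query: {
      multi_match: { query, fields: ["title^2", "body"] },
    },
  };
}

// Bulk indexing body: one action line plus one source line per document,
// newline-delimited and terminated by a final newline.
function buildBulkIndexBody(
  index: string,
  docs: Array<{ id: string } & Record<string, unknown>>,
): string {
  if (docs.length === 0) return "";
  const lines = docs.flatMap(({ id, ...source }) => [
    JSON.stringify({ index: { _index: index, _id: id } }),
    JSON.stringify(source),
  ]);
  return lines.join("\n") + "\n";
}
```

Building the bodies as plain data keeps them easy to inspect and test without a running cluster.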
Elasticsearch:
The core search engine, configured with analyzers, tokenizers, and mappings.
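An index definition covering those three pieces might look like this. It is a sketch of the body for the create index API, not the contents of src/config/elasticsearch.ts: the analyzer name `doc_text` and all field names are assumptions, while `standard`, `lowercase`, and `asciifolding` are built-in Elasticsearch components.

```typescript
// Hypothetical index definition: one custom analyzer plus field mappings.
const indexDefinition = {
  settings: {
    analysis: {
      analyzer: {
        // Custom analyzer: standard tokenizer, then lowercasing and
        // folding of accented characters to their ASCII equivalents.
        doc_text: {
          type: "custom",
          tokenizer: "standard",
          filter: ["lowercase", "asciifolding"],
        },
      },
    },
  },
  mappings: {
    properties: {
      title: { type: "text", analyzer: "doc_text" },
      body: { type: "text", analyzer: "doc_text" },
      source: { type: "keyword" },  // exact-match filter field
      ingestedAt: { type: "date" },
    },
  },
};
```

Full-text fields share one analyzer so that queries and documents are tokenized the same way, while `source` stays a `keyword` for exact filtering.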
Controller:
Manages incoming search requests from the client.
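A controller of this kind can be kept independent of any web framework by treating it as a function from a request to a response. The sketch below assumes query parameters named `q` and `size`, a default of 10 results, and a cap of 100; none of these come from the source, and `createSearchController` is a hypothetical name.

```typescript
interface SearchRequest { query: Record<string, string | undefined> }
interface SearchResponse { status: number; body: unknown }

// The search function is injected, so the controller does not depend on
// how (or whether) Elasticsearch is reached.
type SearchFn = (q: string, size: number) => Promise<unknown[]>;

function createSearchController(search: SearchFn) {
  return async function handle(req: SearchRequest): Promise<SearchResponse> {
    const q = (req.query.q ?? "").trim();
    if (q === "") {
      return { status: 400, body: { error: "Missing query parameter 'q'" } };
    }
    // Clamp the requested page size to the range 1..100 (assumed limits).
    const requested = Number.parseInt(req.query.size ?? "10", 10);
    const size = Number.isNaN(requested) ? 10 : Math.min(Math.max(requested, 1), 100);
    const hits = await search(q, size);
    return { status: 200, body: { hits } };
  };
}
```

Rejecting empty queries and clamping `size` in the controller keeps malformed requests away from the search service.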
Potential areas for consideration:
- Data validation and error handling are not explicitly shown
- Caching mechanism for frequent searches is not visible
- No visible load balancing for high traffic scenarios
- Security measures are not depicted (e.g., authentication, authorization)
- Monitoring and logging components are not shown
src/
├── parsers/
│ ├── PubmedParser.ts
│ └── TwitterParser.ts
├── services/
│ ├── CreateDocService.ts
│ ├── IndexDocService.ts
│ ├── GetDocService.ts
│ └── SearchService.ts
├── models/
│ ├── Doc.ts
│ ├── Meta.ts
│ └── StructuredData.ts
├── workers/
│ └── DataProcessor.ts
├── controllers/
│ └── SearchController.ts
├── utils/
│ └── types.ts
├── config/
│ └── elasticsearch.ts
└── index.ts