Skip to content

Kisioj/WikipediaPathfinder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Wikipedia Pathfinder

Finding shortest path between two Wikipedia articles.

Usage

python3 main.py start_word end_word lang

Examples

$ python3 main.py Słońce Ziemia pl
...
Run time: 0.25s
Shortest path: Słońce -> Ziemia
$ python3 main.py Polska Gustaw_III pl
...
Run time: 3.79s
Shortest path: Polska -> Szwecja -> Gustaw_III
$ python3 main.py Brainfuck Scanline_rendering en
...
Run time: 34.46s
Shortest path: Brainfuck -> Programming_paradigm -> Computer_graphics -> Scanline_rendering

Performance problem

As you can see above, finding paths between two articles is rather slow. The problem is that there isn't much that can be done to fix it. To understand why, you need to know that there are 2 main tasks done for every checked article:

  1. HTTP request in order to get article's HTML
  2. Extracting and adding to queue urls to other articles from provided HTML

As of now, extracting urls from HTML does use regular expressions and is rather fast (takes 0.01 s and less). Most of the run time is actually requesting HTML from Wikipedia, and that's beyond our control.

Can I make it faster?

Yes. You could make it super fast. In order to do that you need to preprocess Wikipedia's articles or... just download Wikipedia's database and use it for this task. To read more about downloading and running Wikipedia offline, just go to: https://en.wikipedia.org/wiki/Wikipedia:Database_download

About

Finding shortest path between two wikipedia articles

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages