Finding the shortest path between two Wikipedia articles.
Usage: python3 main.py start_word end_word lang
$ python3 main.py Słońce Ziemia pl
...
Run time: 0.25s
Shortest path: Słońce -> Ziemia
$ python3 main.py Polska Gustaw_III pl
...
Run time: 3.79s
Shortest path: Polska -> Szwecja -> Gustaw_III
$ python3 main.py Brainfuck Scanline_rendering en
...
Run time: 34.46s
Shortest path: Brainfuck -> Programming_paradigm -> Computer_graphics -> Scanline_rendering
As you can see above, finding a path between two articles is rather slow, and there isn't much that can be done to fix it. To understand why, note that two main tasks are performed for every checked article (see the sketch after this list):
- an HTTP request to fetch the article's HTML
- extracting URLs to other articles from that HTML and adding them to the queue
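As a rough illustration of these two steps, here is a minimal breadth-first-search sketch. It is not the actual main.py; the URL pattern, the regular expression and the function names are assumptions made for illustration only.

```python
# Minimal BFS sketch of the two per-article tasks described above.
# Not the real main.py; names and the regex are illustrative assumptions.
import re
from collections import deque

import requests

LINK_RE = re.compile(r'href="/wiki/([^":#?]+)"')

def neighbours(title, lang):
    # Task 1: HTTP request to fetch the article's HTML.
    html = requests.get(f"https://{lang}.wikipedia.org/wiki/{title}").text
    # Task 2: extract links to other articles with a regular expression.
    return set(LINK_RE.findall(html))

def shortest_path(start, end, lang):
    # Breadth-first search over article links; each queue entry is a path.
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        for title in neighbours(path[-1], lang):
            if title == end:
                return path + [title]
            if title not in visited:
                visited.add(title)
                queue.append(path + [title])
    return None
```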
As of now, extracting URLs from the HTML uses regular expressions and is rather fast (0.01 s or less). Most of the run time is spent requesting the HTML from Wikipedia, and that is beyond our control.
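You can verify this split yourself with a quick timing check like the one below; the article URL and the regex are again just illustrative assumptions.

```python
# Rough timing sketch: where does the time go for a single article?
import re
import time

import requests

t0 = time.perf_counter()
html = requests.get("https://en.wikipedia.org/wiki/Brainfuck").text
t1 = time.perf_counter()
links = re.findall(r'href="/wiki/([^":#?]+)"', html)
t2 = time.perf_counter()

print(f"HTTP request:    {t1 - t0:.3f}s")
print(f"Link extraction: {t2 - t1:.3f}s  ({len(links)} links)")
```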
That said, you could make it much faster. To do that you would need to preprocess Wikipedia's articles or... just download Wikipedia's database and use it for this task. To read more about downloading and running Wikipedia offline, see: https://en.wikipedia.org/wiki/Wikipedia:Database_download
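For example, if the dump were preprocessed into a local link graph (here assumed to be a hypothetical JSON file mapping each article title to its outgoing links), the search would need no network requests at all:

```python
# Hypothetical offline variant: assumes the Wikipedia dump has already been
# preprocessed into a JSON file {"Title": ["Linked_title", ...], ...}.
import json
from collections import deque

def shortest_path_offline(start, end, graph_file="links.json"):
    with open(graph_file, encoding="utf-8") as f:
        graph = json.load(f)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        for title in graph.get(path[-1], []):
            if title == end:
                return path + [title]
            if title not in visited:
                visited.add(title)
                queue.append(path + [title])
    return None
```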