Bochica

Bochica is primarily about building an inverted index. Through it we aim to explore concepts behind Information Retrieval and the use of a crawler.
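To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is an illustration rather than the project's actual implementation: the function names `tokenize` and `build_inverted_index` and the sample documents are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical documents, not taken from the crawled data.
docs = {
    1: "Eleições movimentam a política",
    2: "Política e economia em debate",
}
index = build_inverted_index(docs)
print(index["política"])
```

Looking up a term then reduces to a dictionary access, which is what makes the structure useful for retrieval: the cost of a query depends on the number of matching documents, not on the size of the collection.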

One of its objectives is to gather at least 100 Brazilian news articles published from 01/01/2018 onward and export them as a CSV file with the layout below.

Field	Type	Description
title	String
sub_title	String
author	String
date	Datetime	dd/mm/yyyy hh:mi:ss
section	String	Esportes, Saúde, Política, etc
text	String
url	String
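The layout above could be modelled as a small record type. The sketch below assumes a hypothetical `NewsRecord` dataclass (the project may well use a Scrapy item instead) and shows how the `dd/mm/yyyy hh:mi:ss` date format maps to a `strftime` pattern. The sample values are invented.

```python
from dataclasses import asdict, dataclass
from datetime import datetime

# strftime equivalent of dd/mm/yyyy hh:mi:ss
DATE_FORMAT = "%d/%m/%Y %H:%M:%S"


@dataclass
class NewsRecord:
    """One news article, with the fields listed in the layout (hypothetical type)."""
    title: str
    sub_title: str
    author: str
    date: datetime
    section: str
    text: str
    url: str

    def to_row(self):
        """Return the record as a dict of strings, ready to be written as a CSV row."""
        row = asdict(self)
        row["date"] = self.date.strftime(DATE_FORMAT)
        return row


# Invented sample values, for illustration only.
record = NewsRecord(
    title="Título de exemplo",
    sub_title="Subtítulo de exemplo",
    author="Autor Desconhecido",
    date=datetime(2018, 1, 1, 9, 30, 0),
    section="Política",
    text="Corpo da notícia.",
    url="https://example.com/noticia",
)
row = record.to_row()
print(row["date"])
```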

Project layout

Readers unfamiliar with Scrapy should read its basic documentation first.

The project is laid out in four main directories:

frontier
bochica
seeds
output

The seeds directory holds a JSON file with the seeds of the crawling algorithm, in other words, the starting links used by the crawler. The code operates on copies of these files in the frontier directory. The bochica directory holds the project itself.
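The relationship between seeds and frontier can be sketched as follows. This is only an illustration under stated assumptions: the helper names `prepare_frontier` and `load_start_urls`, the file name `brasil_elpais.json`, the sample URL, and the idea that a seed file is a plain JSON array of URLs are all hypothetical, since the actual file structure is not described here.

```python
import json
import shutil
import tempfile
from pathlib import Path


def prepare_frontier(seeds_dir, frontier_dir):
    """Copy every seed JSON file into the frontier directory and return the copies."""
    frontier_dir.mkdir(parents=True, exist_ok=True)
    copies = []
    for seed_file in sorted(seeds_dir.glob("*.json")):
        target = frontier_dir / seed_file.name
        shutil.copy(seed_file, target)
        copies.append(target)
    return copies


def load_start_urls(frontier_file):
    """Read the starting links from a frontier file (assumed: a JSON array of URLs)."""
    with frontier_file.open(encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration in a temporary directory, so the real seeds stay untouched.
with tempfile.TemporaryDirectory() as tmp:
    seeds = Path(tmp) / "seeds"
    frontier = Path(tmp) / "frontier"
    seeds.mkdir()
    # Hypothetical seed file and URL.
    (seeds / "brasil_elpais.json").write_text(
        json.dumps(["https://brasil.elpais.com/"]), encoding="utf-8"
    )
    copies = prepare_frontier(seeds, frontier)
    start_urls = load_start_urls(copies[0])

print(start_urls)
```

Working on copies means the crawler can modify its frontier while it runs without altering the original seeds, so a fresh run can always start from the same links.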

Commands to execute project

make run # runs the crawler for the site brasil_elpais
make export # exports the crawler's JSON results to CSV format
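As a rough idea of what the export step involves, the sketch below converts JSON results into CSV with the standard library. It is not the code behind `make export`: the function name `export_to_csv`, the assumption that the results are a JSON array of objects, and the sample item are hypothetical.

```python
import csv
import io
import json

# Column order taken from the CSV layout described earlier.
CSV_COLUMNS = ["title", "sub_title", "author", "date", "section", "text", "url"]


def export_to_csv(json_text, out_stream):
    """Write crawler results (assumed: a JSON array of objects) as CSV rows."""
    items = json.loads(json_text)
    writer = csv.DictWriter(
        out_stream,
        fieldnames=CSV_COLUMNS,
        extrasaction="ignore",  # drop keys that are not part of the layout
        restval="",             # leave missing fields empty
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return len(items)


# Invented sample item; "sub_title" and "author" are deliberately missing.
sample = json.dumps([
    {
        "title": "Título de exemplo",
        "date": "01/01/2018 09:30:00",
        "section": "Esportes",
        "text": "Corpo da notícia.",
        "url": "https://example.com/noticia",
    }
])
buffer = io.StringIO()
count = export_to_csv(sample, buffer)
print(buffer.getvalue())
```

Using `csv.DictWriter` keeps the column order fixed regardless of the key order in the JSON, and quoting of commas or line breaks inside the article text is handled by the `csv` module.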

