- Flume as the ingestion tool
- Kafka as the messaging backbone
- Spark Streaming for processing
- Redis as the lookup service and bot registry
This is a Flume event interceptor that cleans up the incoming JSON string and removes unnecessary symbols.
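For orientation, a minimal sketch of what such an interceptor can look like. The class name and the concrete cleanup rule are assumptions; the real logic lives in the project source. Flume's `Interceptor` and `Interceptor.Builder` contracts are the actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical interceptor: the actual cleanup rules are in the project source.
public class JsonFilterInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // no state to set up
    }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        // Assumed cleanup rule: keep only the outermost JSON object,
        // dropping any stray symbols around it.
        int start = body.indexOf('{');
        int end = body.lastIndexOf('}');
        if (start >= 0 && end > start) {
            event.setBody(body.substring(start, end + 1).getBytes(StandardCharsets.UTF_8));
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        events.forEach(this::intercept);
        return events;
    }

    @Override
    public void close() {
        // nothing to release
    }

    // Flume instantiates interceptors through a Builder.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new JsonFilterInterceptor();
        }

        @Override
        public void configure(Context context) {
            // no custom properties in this sketch
        }
    }
}
```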
- Build the project:

```sh
mvn clean package
```

- Copy the jar file to the `/opt/flume/lib` directory inside the container:

```sh
docker cp ./target/flume-json-filter-1.0-SNAPSHOT.jar <container_name>:/opt/flume/lib
```
BotDetectorV1 is implemented using DStreams; BotDetectorV2 is implemented using Structured Streaming. In general, the workflows of BotDetectorV1 and BotDetectorV2 are similar (see the sketch after this list):

- take events from Kafka every `window` period of time
- process the events, gather statistics and identify bots
- put the statistics (ip -> click, view, event rates, etc.) and bot info into Redis
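For the Structured Streaming variant, that loop could look roughly like the sketch below. The event schema (`ip`, `type`, `unix_time`), the 60-second window, and the bot threshold are all assumptions made to keep the example self-contained; the real job writes to Redis instead of the console.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.*;

public class BotDetectorSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("bot-detector-sketch")
                .master("local[*]")
                .getOrCreate();

        // Assumed event schema; the real one is defined by the log generator.
        StructType schema = new StructType()
                .add("ip", DataTypes.StringType)
                .add("type", DataTypes.StringType)      // e.g. "click" or "view"
                .add("unix_time", DataTypes.LongType);

        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "bot-logs")
                .load();

        Dataset<Row> events = raw
                .select(from_json(col("value").cast("string"), schema).alias("e"))
                .select(
                        col("e.ip"),
                        col("e.type"),
                        col("e.unix_time").cast("timestamp").alias("ts"));

        // Per-ip statistics over a window; the 60s length is an assumption.
        Dataset<Row> stats = events
                .withWatermark("ts", "60 seconds")
                .groupBy(window(col("ts"), "60 seconds"), col("ip"))
                .agg(
                        sum(when(col("type").equalTo("click"), 1).otherwise(0)).alias("clicks"),
                        sum(when(col("type").equalTo("view"), 1).otherwise(0)).alias("views"),
                        count(lit(1)).alias("events"));

        // The console sink keeps the sketch self-contained.
        stats.writeStream()
                .outputMode("update")
                .format("console")
                .start()
                .awaitTermination();
    }
}
```

A real implementation would replace the console sink with `foreachBatch`, flag ips whose rates cross the configured thresholds, and write the results to Redis from there.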
BotDetectorV1 puts data into 2 different Redis sets:

- `statistic` - contains key/value pairs with statistics for the last `window` period of time
- `bots` - contains all identified bots
The problem with this approach is that a TTL cannot be set for an individual key/value pair in a set; it can only be set for the whole set.
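As an illustration (a minimal sketch assuming the Jedis client, which the project may or may not use): `EXPIRE` operates on whole keys, so the only thing that can be aged out is the entire set, never a single member.

```java
import redis.clients.jedis.Jedis;

public class SetTtlDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // All per-ip records end up as members of one set...
            jedis.sadd("statistic", "10.0.0.1 -> clicks=42", "10.0.0.2 -> clicks=7");
            // ...and EXPIRE can only target the key, i.e. the whole set.
            // Redis has no command that expires an individual set member.
            jedis.expire("statistic", 600);
        }
    }
}
```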
BotDetectorV2 uses a different approach: it puts data into Redis as individual key/value pairs with a specified TTL. To tell statistical records and bots apart, every key gets a prefix:

- `stat_` for statistics
- `bot_` for bots

So the end user can filter keys by prefix and get the required information.
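In Jedis terms (again an assumption about the client; the key contents are made up for illustration), the write and lookup sides of this scheme could look like:

```java
import java.util.Set;

import redis.clients.jedis.Jedis;

public class PrefixedKeysDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Each record is its own key, so it can carry its own TTL.
            jedis.setex("stat_10.0.0.1", 600, "clicks=42,views=128");
            jedis.setex("bot_10.0.0.2", 600, "rate=250/min");

            // End users filter by prefix; prefer SCAN with MATCH over KEYS in production.
            Set<String> bots = jedis.keys("bot_*");
            bots.forEach(k -> System.out.println(k + " -> " + jedis.get(k)));
        }
    }
}
```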
Config file path: `spark-streaming-kafka/src/main/resources/application.conf`

The Redis and Kafka properties should not be modified unless you change the Kafka and Redis configuration. The rest of the properties define the application logic and business rules, and should be adjusted to match your requirements.
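For orientation, such a config typically looks roughly like the sketch below; every key name here is hypothetical, so check the actual `application.conf` for the real ones.

```
# hypothetical layout -- the real keys live in application.conf
kafka {
  bootstrap-servers = "localhost:9092"
  topic = "bot-logs"
}

redis {
  host = "localhost"
  port = 6379
}

app {
  window = 60s          # how often events are taken from Kafka
  bot-threshold = 250   # requests per window that flag an ip as a bot
}
```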
- start the containers:

```sh
cd spark-streaming-kafka/src/main/resources
docker-compose up -d
```
- create a Kafka topic for logs (run from inside the Kafka container):

```sh
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic bot-logs
```
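To double-check that the topic exists, you can list topics from the same place:

```sh
bin/kafka-topics.sh --list --zookeeper localhost:2181
```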
- copy the Flume interceptor jar (see Flume JSON filter -> how to use -> step 2)
- restart the Flume container, so that Flume picks up the newly added interceptor (see the command below)
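A plain `docker restart` is enough for that last step; the container name below is a placeholder, so take the real one from `docker ps` or from `docker-compose.yml`:

```sh
docker restart <flume_container_name>
```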
Check out `docker-compose.yml` to find out the ports of containers like Redis UI or Kafka Topics UI.
To generate logs, run `gridu-bd-streaming/spark-streaming-kafka/src/main/resources/botgen.py`.
Example:

```sh
cd gridu-bd-streaming/spark-streaming-kafka/src/main/resources
python3 botgen.py -b 100 -u 1000 -n 100 -d 60 -f ./full-stack/logs/logs-1_minute_4.json
```
Don't modify the output path, because:

- this path is mounted into the Flume docker container
- Flume is configured to pick up new files from that directory

Check out the script for more details.