The academy consisted of 10 lectures, each followed by a homework task. The lectures were conducted and the homework was graded by Sofascore employees. The academy took place from March 2024 to June 2024. It was organized in a hybrid format, with lectures held in person at the Sofascore office in Zagreb, Croatia, and homework completed remotely.
Data Engineering: The goal of the assignment was to practice event-based analytics, assess data quality, and explore data pipelines using SQL. The SQL database we worked with was ClickHouse, a high-performance, open-source SQL DBMS used for OLAP (Online Analytical Processing).
Data Engineering: The goal of this task was to build an ETL pipeline in Python. The pipeline connected to the database host over SSH using the Paramiko library, then either downloaded and processed data files or queried the database directly. The processed data was stored in a new database.
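A minimal sketch of the extract-transform-load flow described above, assuming SQLite from the Python standard library as a stand-in for both the source and the target database. The table names `raw_events` and `clean_events`, their columns, and the cleaning rules are hypothetical, and the Paramiko connection step is left out so the example stays self-contained.

```python
import sqlite3


def extract(conn):
    """Read all raw rows from the (hypothetical) source table."""
    return conn.execute("SELECT user_id, event, ts FROM raw_events").fetchall()


def transform(rows):
    """Drop rows with missing fields and normalize event names."""
    cleaned = []
    for user_id, event, ts in rows:
        if user_id is None or event is None:
            continue  # skip incomplete records
        cleaned.append((user_id, event.strip().lower(), ts))
    return cleaned


def load(conn, rows):
    """Write cleaned rows into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clean_events "
        "(user_id INTEGER, event TEXT, ts TEXT)"
    )
    conn.executemany("INSERT INTO clean_events VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(source, target):
    """Run extract, transform, and load; return the number of rows loaded."""
    rows = transform(extract(source))
    load(target, rows)
    return len(rows)
```

Keeping the three stages as separate functions makes each one easy to test in isolation and to swap out, for example replacing `extract` with a query against a real remote database.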
Data Engineering: We delved deeper into ClickHouse, exploring Table Engines (Engine Families), Compressions and Codecs, Log Engine Families, Dictionaries, Views, Materialized Views, and Integration Engines.
Software Engineering in AI: This task involved dockerizing the second homework using Docker and Docker Compose, preparing it for both development and production environments. It was open-ended, requiring us to define the setup for each environment and the pipeline.
Data Analytics: The lectures familiarized us with growth and product metrics, showed how to conduct product analysis (product, cohort, and funnel analysis), and introduced Apache Superset and SQL Lab for creating informative visualizations.
The task was to build Superset visualizations of product and growth metrics from SQL queries and to produce a monthly report for stakeholders highlighting the underlying data trends and growth metrics.
A/B Testing: The goal was to understand A/B testing, design and conduct tests, comprehend statistical significance, and interpret test results.
The homework involved running an A/B test that compared a control group with a test group that was shown a newly added button. The task was to define the test goal, conduct the test, and analyze the results.
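One common way to analyze such a test is a two-proportion z-test on conversion rates. The sketch below is an illustration under that assumption, not the method used in the homework. The function name and the choice of conversions as the goal metric are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a / n_a: conversions and users in the control group.
    conv_b / n_b: conversions and users in the test group.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value (for example below a significance level of 0.05 chosen before the test) suggests that the observed difference between the groups is unlikely to be due to chance alone.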
Predictive Analytics: The lectures covered predictive analytics, including data acquisition, modeling, and deployment. Topics included time series data, time series decomposition, time series forecasting, stationarity, differencing, ARIMA and SARIMA models, unit root testing, autoregressive models, and more.
The homework task was to create, load, decompose, and test the stationarity of a time series dataset, use ARIMA and exponential smoothing for predictions, and visualize the data using Apache Superset.
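Two of the building blocks mentioned above, differencing and exponential smoothing, can be sketched in plain Python. This is a simplified illustration, assuming simple (single) exponential smoothing with a fixed, hand-picked `alpha`. In practice a library such as statsmodels would typically be used to fit these models.

```python
def difference(series, lag=1):
    """Return the lagged differences y[t] - y[t - lag].

    Differencing removes trend (lag=1) or seasonality (lag=season length)
    and is a common step toward making a series stationary.
    """
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels of the series.

    Each level is alpha * observation + (1 - alpha) * previous level.
    The last level serves as the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    smoothed = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        smoothed.append(level)
    return smoothed
```

For a series with a linear trend, a single round of differencing yields a constant series, which is the intuition behind the "d" parameter in ARIMA.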
Machine Learning: The lectures covered core machine learning concepts and problem definition, Logistic Regression (with the one-vs-rest (OvR) and one-vs-one (OvO) multiclass strategies), Linear SVM, Random Forests, XGBoost, the bias-variance tradeoff, and more.
The homework task was to develop a machine learning model predicting whether a given user, described by a set of attributes, is organic or not. The data was partially labeled and uncertain, with correct labels available only for one group.
I achieved the highest F1 score on this homework.
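For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for binary labels. Treating the label `1` as the positive class ("organic") is an assumption made for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute the F1 score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because F1 ignores true negatives, it is a common choice when the positive class is the one of interest and the classes are imbalanced.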
Machine Learning: Lectures introduced recommender systems, covering Content-based Recommendation, Collaborative Filtering, and hybrid approaches. Metrics such as Cumulative Gain (CG), Discounted Cumulative Gain (DCG), Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP) were defined.
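The ranking metrics named above can be sketched as follows, assuming the common formulation in which the item at rank i (starting from 1) is discounted by log2(i + 1). The input is a list of graded relevance scores in the order the recommender ranked the items. The function names are illustrative.

```python
import math


def cumulative_gain(relevances, k=None):
    """CG@k: sum of relevance scores in the top k positions."""
    return sum(relevances[:k])


def dcg(relevances, k=None):
    """DCG@k: relevance scores discounted by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k=None):
    """NDCG@k: DCG divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

NDCG equals 1.0 when the most relevant items are ranked first and decreases as relevant items are pushed further down the list.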
The homework goal was to create personalized recommendations for football matches (events) for users over a one-week period, building our own recommender system.
Prescriptive Analytics: The lectures defined prescriptive analytics, types of models, causal modeling, treatment effects, etc.
The homework had two parts. The first was to write a brief summary of findings on the correlation between the number of sport events available on a user's first day in the app and Week 4 retention. The second was to write a quarterly report for the management board covering Q1 2023 key metrics and performance indicators, detailing which platform led to higher click-through rates and odds impressions, and explaining why.
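For the first part, one simple way to quantify such a relationship is the Pearson correlation coefficient between the first-day event count and a binary Week 4 retention flag (1 if the user was retained, 0 otherwise). The sketch below assumes both are already available per user. The variable names are hypothetical, and a correlation alone does not establish a causal effect.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-user data: events available on day one, retained in Week 4.
first_day_events = [5, 12, 30, 45, 60]
retained_week4 = [0, 0, 1, 1, 1]
r = pearson(first_day_events, retained_week4)
```

A positive `r` here would indicate that users who saw more events on their first day were retained more often in this sample, which is the kind of observation that the causal modeling topics from the lectures help to interpret carefully.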