Data Engineer - Technical Assessment

In this section of the interview at Beyond Finance, you will be assessed on your ability to perform several Data Engineering tasks. To perform well on these tasks, you will demonstate competence in the following areas:

preprocessing data to prepare for a database load
understanding entity relationships in a database
merging data from different tables
filtering data to relevant subsets
calculating aggregations and descriptive statistics

It will be pretty difficult to complete all questions in the allotted time. Your goal is not to speed through the answers, but to come up with answers that demonstrate your knowledge. It's more about your thought process and logic than getting the right answer or your code.

Getting Started

This exercise will be broken into 2 parts

Data Processing
Data Wrangling

Data Processing

In this section you will take files from the ./raw_data/ subfolders, combine them into a single newline-delimited json.gz file per subfolder, and place that CSV file in a processed_data/ directory. You may have to do some light investigation into the data files to understand their file formats and delimiters

Example

Files

./raw_data/tracks/tracks_0.csv
./raw_data/tracks/tracks_1.json
./raw_data/tracks/tracks_2.csv
etc...

should be combined into a single file ./processed_data/tracks.json.gz

What we look for

Can you handle all subfolders in a single pass over the raw data files?
How can you limit memory consumption? (hint chunksize)

Data Wrangling

For this section, we'll pretend you loaded the raw data plus additional tables into a small SQLite database containing roughly a dozen tables. We've provided this database for you so don't worry about loading it yourself. If you are not familiar with the SQLite database, it uses a fairly complete and standard SQL syntax, though does not many advanced analytics functions. Consider it just a remote datastore for storing and retrieving data from.

Data Processing (40 minutes)

Process the files in ./raw_data/ to combine them into a single json.gz file per subfolder

Example

Files

./raw_data/tracks/tracks_0.csv
./raw_data/tracks/tracks_1.json
./raw_data/tracks/tracks_2.csv
etc...

should be combined into a single file ./processed_data/tracks.json.gz

Data Wrangling (20 minutes)

How many different customers are there?
How long is the longest track in minutes?
Which genre has the shortest average track length?
Which artist shows up in the most playlists?
What album had the most purchases?
Which customer has the highest number of sales in terms of dollars?
Count of customers who have dollar sales more than $40

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
db		db
raw_data		raw_data
.gitignore		.gitignore
README.md		README.md
assessment.ipynb		assessment.ipynb
db-diagram.png		db-diagram.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Engineer - Technical Assessment

Getting Started

Data Processing

Data Wrangling

Data Processing (40 minutes)

Data Wrangling (20 minutes)

About

Releases

Packages

Contributors 3

Languages

Beyond-Finance/dataeng-de-technical-assessment

Folders and files

Latest commit

History

Repository files navigation

Data Engineer - Technical Assessment

Getting Started

Data Processing

Data Wrangling

Data Processing (40 minutes)

Data Wrangling (20 minutes)

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages