Questions 1 (TODO)

Data Al Dente

Created April 5, 2023

Updated April 5, 2023

Outline
Intro
Q 0.1
Q 0.2
Q 0.3

Intro

Note that each question is labeled with “Q X.Y” and the corresponding solution is labeled with “S X.Y”. Most people who have some professional/industry software engineering experience can skip this section of questions and move directly to Part I.

Q 0.1

What makes an application data-intensive vs compute-intensive?

S 0.1
An application is data-intensive if data is its primary challenge: the quantity of data, the complexity of data, or the speed at which it is changing. In a compute-intensive application, by contrast, CPU cycles are the bottleneck.
Source: DDIA Preface p. xiii
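
One rough way to make this distinction concrete is to estimate where a job's time goes: moving data or doing arithmetic. The sketch below is a back-of-the-envelope illustration, not a definition from DDIA. The function name, its parameters, and the example numbers are all assumptions, and it only captures data quantity, not data complexity or rate of change.

```python
def likely_bottleneck(data_bytes, cpu_ops, io_bytes_per_sec, cpu_ops_per_sec):
    """Crude estimate: compare time spent moving data with time spent computing.

    All four inputs are hypothetical, order-of-magnitude figures.
    """
    io_seconds = data_bytes / io_bytes_per_sec
    cpu_seconds = cpu_ops / cpu_ops_per_sec
    return "data" if io_seconds > cpu_seconds else "compute"


# Scanning 1 TB of logs with light per-byte work (assumed numbers).
scan = likely_bottleneck(
    data_bytes=1e12, cpu_ops=1e12, io_bytes_per_sec=1e9, cpu_ops_per_sec=1e10
)

# A numeric simulation on 1 MB of input with heavy arithmetic (assumed numbers).
simulate = likely_bottleneck(
    data_bytes=1e6, cpu_ops=1e15, io_bytes_per_sec=1e9, cpu_ops_per_sec=1e10
)

print(scan, simulate)
```

With these made-up figures the log scan spends most of its time on I/O while the simulation spends most of its time on the CPU. Real systems rarely split this cleanly.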

Q 0.2

Which of the following systems do you consider to be data-intensive? Why?

Training a large machine learning (ML) model like ChatGPT.

Google search crawling and indexing the internet.

High-frequency trading.

Amazon’s shopping cart.

Computing millions of digits of pi.

Twitter’s home timeline feed.

A small, infrequently read static website like a blog.

S 0.2
This is a bit of a dumb question, since “data-intensive” is not a rigorous definition and, depending on how you look at it, almost any system could be considered data-intensive. The point of the question is just to get you thinking about systems in terms of how they store, retrieve, and process data, so let’s talk about that for each system above.
- Training a large machine learning model like ChatGPT: The actual training of a machine learning model is often compute-intensive (depending on the size of your training dataset and how large the model is). I would generally consider training a large machine learning model more compute-intensive than data-intensive per se, but the data systems that support training are data-intensive, and these systems are essential for training. So I think either answer is correct here. The point is to recognize that training a large ML model requires distributed computation for the training itself and also some way to store and access the training dataset efficiently.
- Google search crawling and indexing the internet: This is definitely a data-intensive system. There are billions of web pages on the internet, and keeping track of all of them as they change, then storing and indexing them, requires massive storage.
- High-frequency trading: I’d consider this data-intensive because a system like this has to receive, store, and react to price updates extremely quickly, and it potentially stores a large amount of data. However, I’m not familiar enough with these systems to know how the work splits between compute and storage. If someone reading this knows, contact me.
- Amazon’s shopping cart: In general, I would not consider a virtual shopping cart for a small e-commerce site a data-intensive system. At Amazon’s scale, however, the shopping cart could be considered data-intensive, since it deals with data from an enormous number of customers and needs to be highly available.
- Computing millions of digits of pi: This seems the most compute-intensive of all the systems listed here, and I was expecting the data needs to be relatively simple. However, after skimming this Google Cloud article on calculating 100 trillion digits of pi, it looks like data storage was a big part of the work. So perhaps any sufficiently interesting system ends up being data-intensive in some way.
- Twitter’s home timeline feed: This is certainly a data-intensive system, since it has to show real-time updates for a large number of users and posts.
- A small, infrequently read static website like a blog: I would say this is not particularly data-intensive given that the site is small and infrequently read. If it were a large, frequently read static website then you would have some interesting data challenges.
Source: Nate’s brain…
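
To get a feel for why computing digits of pi turns into a storage problem as the digit count grows, here is a rough estimate of how many bytes it takes just to hold the final result, packed as one binary integer. This is a sketch under stated assumptions: the function name is made up for illustration, the digit counts are example values, and it ignores intermediate working storage, which would typically be larger than the result itself.

```python
import math


def packed_size_bytes(decimal_digits):
    """Bytes needed to hold this many decimal digits as one binary integer."""
    bits = decimal_digits * math.log2(10)  # about 3.32 bits per decimal digit
    return math.ceil(bits / 8)


# Example digit counts (assumed for illustration).
for label, digits in [("1 million", 10**6), ("100 trillion", 10**14)]:
    size = packed_size_bytes(digits)
    print(f"{label} digits: about {size / 1e12:.6f} TB")
```

A million digits fits in well under a megabyte, so the "millions of digits" version of the problem really is mostly about compute. At 100 trillion digits the result alone is on the order of tens of terabytes, which is where storage starts to matter.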
