“big data”
First and foremost, principle 1: security is everyone’s responsibility
A methodology for building software-as-a-service apps
Large Scale Hadoop Upgrade At Pinterest
OLTP vs. OLAP: on the surface both look similar, since both offer SQL
| | OLTP | OLAP |
|---|---|---|
| read pattern | small number of records, fetched per key | aggregate over large volumes |
| write pattern | random-access, low-latency writes | bulk import or stream of events |
| primary use | end user | decision support |
| type of data | latest state of the data, current point in time | history of events |
| size | gigabytes to terabytes | terabytes to petabytes |
Star schema, dimension and fact tables
SELECT
P.Brand,
S.Country AS Countries,
SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON (F.Date_Id = D.Id)
INNER JOIN Dim_Store S ON (F.Store_Id = S.Id)
INNER JOIN Dim_Product P ON (F.Product_Id = P.Id)
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY
P.Brand,
S.Country
By default, OLTP systems store data in row-oriented databases.
For OLAP, a column store is much better suited to the query patterns.
Column-oriented storage stores each column in its own (sorted) file, which makes the data easy to compress, as sketched below.
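To make this concrete, here is a minimal sketch in plain Python (not a real storage engine; column names are borrowed from the fact table above): the same records laid out row-wise and column-wise, plus a simple run-length encoding of one column.

```python
# Illustrative only: row layout vs. column layout, and why columns compress well.
rows = [
    {"date_id": 1, "product_id": 31, "units_sold": 3},
    {"date_id": 1, "product_id": 31, "units_sold": 1},
    {"date_id": 2, "product_id": 42, "units_sold": 5},
]

# Row store: one record after another (good for "fetch this record by key").
row_store = rows

# Column store: one "file" per column (good for aggregating one column over many rows).
column_store = {col: [r[col] for r in rows] for col in rows[0]}

def run_length_encode(values):
    """Collapse repeated values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

print(column_store["product_id"])                     # [31, 31, 42]
print(run_length_encode(column_store["product_id"]))  # [(31, 2), (42, 1)]
```

Because a column holds many similar values next to each other, encodings like this shrink it far more than compressing whole rows would.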
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Lambda architecture: two paths for data flow; all data coming into the system goes through both paths (a batch path and a streaming path).
The kappa architecture is an alternative to the lambda architecture. It has the same basic goals, but with an important distinction: all data flows through a single path, using a stream processing system.
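As one possible illustration of that single path (not part of the original material; it uses Spark Structured Streaming, which is introduced later, and the built-in rate test source instead of a real event log such as Kafka):

```python
# Sketch of a single-path (kappa-style) pipeline with Spark Structured Streaming.
# The built-in "rate" source generates test events; a real pipeline would read
# from a log such as Kafka, but all data would still flow through this one path.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Continuously maintain an aggregate over the stream (here: events per minute).
counts = events.groupBy(F.window("timestamp", "1 minute")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")       # a real job would write to a serving store
         .start())
query.awaitTermination()
```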
HDFS + YARN + a collection of JARs …
A bold statement: “Hadoop opened up the philosophy of dumping data into HDFS and only making sense of it later”, in contrast to data warehouses, which require upfront modelling.
Resource management and scheduling + monitoring of applications
Shared nothing architecture, horizontal scaling - scaling out
Each machine running the (database) software is called a node.
Each node has its own CPU, RAM and disks.
Keeping a copy of the same data on several nodes -> redundancy (replication)
Splitting a big database/dataset into smaller subsets called partitions, so that different partitions can be assigned to different nodes
Where is the data now, on which node? See the sketch below.
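A toy sketch in plain Python of how hash partitioning answers that question, with a replica kept on the next node for redundancy (the node names are hypothetical):

```python
# Toy example: assign keys to nodes by hashing, keep a replica on the next node.
from hashlib import sha256

NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names

def partition(key: str, num_partitions: int = len(NODES)) -> int:
    """Stable hash of the key -> partition number."""
    return int(sha256(key.encode()).hexdigest(), 16) % num_partitions

def nodes_for(key: str):
    """Primary node plus one replica (the next node in the ring)."""
    p = partition(key)
    return [NODES[p], NODES[(p + 1) % len(NODES)]]

for key in ["customer-42", "customer-43", "order-7"]:
    print(key, "->", nodes_for(key))
```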
Published in 2004, MapReduce is a low-level programming model. It enables us to write code that runs over large datasets.
MAP task: its main purpose is to READ ALL the DATA.
Goal: achieve data locality (run the map task on the node that holds the data).
EACH reducer asks EACH mapper for the key/value pairs designated for that reducer -> partitioning
EACH reducer reads its key/value pairs; the values are aggregated into a collection and the entire input to the reducer is SORTED by key: the shuffle/sort phase.
The REDUCE task typically contains the business logic.
IF you have X reducers, you will have X result files (a word-count sketch follows below).
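The word-count sketch below is plain Python, not Hadoop; it only serves to make the map -> shuffle/sort -> reduce flow concrete. In a real job each phase runs distributed across the cluster.

```python
# Conceptual MapReduce word count: map -> shuffle/sort (group by key) -> reduce.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# MAP: read all the data, emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# SHUFFLE/SORT: route each key to "its" reducer and group the values per key.
num_reducers = 2
reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
for key, value in mapped:
    reducer_inputs[hash(key) % num_reducers][key].append(value)

# REDUCE: the business logic; one result "file" (here: dict) per reducer.
results = [{key: sum(values) for key, values in sorted(r.items())}
           for r in reducer_inputs]
print(results)
```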
Teradata existed long before MapReduce.
Hive is not a database, but it uses a database (called the metastore) to store the definitions of the tables you create.
A Hive table consists of a schema stored in the metastore and data stored in HDFS (or S3, or ADLS, or …).
Hive converts HiveQL commands into MapReduce jobs (or Tez, …).
HiveQL is similar to SQL; see the sketch below.
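To make this concrete, a hedged sketch (not from the course material) using PySpark with Hive support, which is introduced just below: the CREATE EXTERNAL TABLE statement only registers a schema in the metastore, while the data stays in the files at the given (made-up) location.

```python
# Sketch: schema lives in the metastore, data lives in files (HDFS/S3/...).
# Table name, columns and path are made up for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-sketch")
         .enableHiveSupport()      # talk to a Hive metastore
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS races (
        race_id INT,
        year    INT,
        name    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3a://some-bucket/formula1/races/'   -- data stays in these files
""")

# HiveQL/SQL over the files; an engine (MapReduce, Tez, Spark) does the work.
spark.sql("SELECT year, COUNT(*) FROM races GROUP BY year").show()
```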
A unified analytics engine for large-scale data processing
A 4-core / 8 GB VM costs $0.36/hour.
AWS has CodeCatalyst in preview: https://codecatalyst.aws/explore/dev-environments; a 4 vCPU / 8 GB environment costs $0.23/hour + $0.01/min.
We are running in the cloud on Ubuntu, and Docker is installed.
It worked on my machine!
Mastering Docker is a must for data engineering.
Mastering Docker is the first step towards Kubernetes (k8s).
Docker Desktop is licensed as part of a free (Personal) or paid Docker subscription (Pro, Team or Business).
Docker Desktop may be used for free as part of a Docker Personal subscription for personal use, education, non-commercial open-source projects and small businesses.
Container Management and Kubernetes on the Desktop
An open-source desktop application for Mac, Windows and Linux. Rancher Desktop runs Kubernetes and container management on your desktop. You can choose the version of Kubernetes you want to run.
docker run -p 5432:5432 -e POSTGRES_PASSWORD=SuperSecure -d public.ecr.aws/docker/library/postgres:latest
@thierryturpin ➜ /workspaces/data-intensive-applications (main) $ docker run -p 5432:5432 -e POSTGRES_PASSWORD=SuperSecure -d public.ecr.aws/docker/library/postgres:latest
Unable to find image 'public.ecr.aws/docker/library/postgres:latest' locally
latest: Pulling from docker/library/postgres
7a6db449b51b: Pull complete
b4f184bc0704: Pull complete
606a73c0d34a: Pull complete
c39f1600d2b6: Pull complete
31f42f92b0fe: Pull complete
c8b67d2b0354: Pull complete
31107b8480ee: Pull complete
b26434cf8bfa: Pull complete
36220bd76bfa: Pull complete
b79e75c4a0c2: Pull complete
cc1ab699dda5: Pull complete
37312064dd9b: Pull complete
4bce56fcbfe5: Pull complete
Digest: sha256:befb4cdc1d944bd89784b9caa287cf025f0720f9a02436038124163accd177dc
Status: Downloaded newer image for public.ecr.aws/docker/library/postgres:latest
71e9e6d94fa80c700fc5abbb4ed9600b1fbb026b439e2347ae789f2ae933e215
@thierryturpin ➜ /workspaces/data-intensive-applications (main) $ pstree
docker-init─┬─containerd-shim─┬─postgres───6*[postgres]
│ └─10*[{containerd-shim}]
├─dockerd─┬─containerd───8*[{containerd}]
│ ├─2*[docker-proxy───5*[{docker-proxy}]]
│ └─11*[{dockerd}]
├─sleep
└─sshd
Where are we now?
cat /etc/*release*
Is Docker installed?
docker -v
Do we have docker-compose?
docker-compose -v
docker run -p 5432:5432 -e POSTGRES_PASSWORD=SuperSecure -d public.ecr.aws/docker/library/postgres:latest
-p  port forward host:container
-e  sets the environment variable POSTGRES_PASSWORD inside the container
-d  runs the container in detached mode (background)
public.ecr.aws/docker/library/postgres:latest  the Docker image

docker ps
Note your container ID (we didn't name it)
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e22abdc669c2 public.ecr.aws/docker/library/postgres:latest "docker-entrypoint.s…" 5 minutes ago Up 5 minutes 0.0.0.0:5432->5432/tcp, :::5432->5432/tcp sleepy_snyder
Replace the container ID with yours
docker exec -it e22abdc669c2 bash
root@e22abdc669c2:/#
cat the release files
root@e22abdc669c2:/# cat /etc/*release*
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@e22abdc669c2:/#
exit the container
root@e22abdc669c2:/# exit
exit
Install a Postgres client, for example the CLI client psql
Update packages
sudo apt-get update
Install client
sudo apt-get install -y postgresql-client
Connect to the database
psql -h localhost -d postgres -U postgres
Password for user postgres:
psql (12.12 (Ubuntu 12.12-0ubuntu0.20.04.1), server 14.5 (Debian 14.5-1.pgdg110+1))
WARNING: psql major version 12, server major version 14.
Some psql features might not work.
Type "help" for help.
To see a list of all tables, use \dt *.*
To check the users created in the database, query pg_user
postgres=# \dt *.*
postgres=# select * from pg_user;
usename | usesysid | usecreatedb | usesuper | userepl | usebypassrls | passwd | valuntil | useconfig
----------+----------+-------------+----------+---------+--------------+----------+----------+-----------
postgres |
To quit, use \q
postgres=# \q
@thierryturpin ➜ /workspaces/data-intensive-applications (main) $
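As a side note (not part of the walkthrough), the same check can be done from Python with psycopg2, assuming psycopg2-binary is installed:

```python
# Optional: the same check from Python instead of psql.
# Assumes `pip install psycopg2-binary`; not part of the original walkthrough.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="postgres", user="postgres", password="SuperSecure",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT usename, usesuper FROM pg_user;")
    print(cur.fetchall())        # expected: [('postgres', True)]
conn.close()
```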
docker logs e22abdc669c2 -f
Press CTRL-C to stop following the log
Get the details with inspect
docker inspect sleepy_snyder
[
{
"Id": "e22abdc669c2622b87f0489fdcbcff2676803d9bc05114bae141d71d268123b1",
"Created": "2022-08-26T08:21:18.697009815Z",
"Path": "docker-entrypoint.sh",
"Args": [
"postgres"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 5438,
"ExitCode": 0,
"Error": "",
"StartedAt": "2022-08-26T08:21:19.119424024Z",
"FinishedAt": "0001-01-01T00:00:00Z"
...
docker kill e22abdc669c2
cd docker
Let’s build a docker image named mypyspark based on the file Dockerfile
docker buildx build -t mypyspark .
Note the Dockerfile has 16 layers
Verify which docker images we have
docker images
First line: we start from amazoncorretto, a production-ready distribution of the Open Java Development Kit (OpenJDK).
Spark requires Java.
FROM public.ecr.aws/amazoncorretto/amazoncorretto:11
If we only intend to use the PySpark API, since Spark 2.2 we can install it as a Python package (see the smoke test below).
Last line of the requirements/requirements.txt file:
pyspark==3.1.2
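Once pyspark is installed as a package, a quick local smoke test (no cluster needed) could look like this:

```python
# Quick check that the pip-installed PySpark works in local mode.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")       # use all cores of this single machine
         .appName("smoke-test")
         .getOrCreate())

print(spark.version)               # e.g. 3.1.2, matching requirements.txt
spark.range(10).selectExpr("sum(id)").show()
spark.stop()
```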
The AWS account has been integrated with the Auth0 Identity Provider (IdP).
Doing so limits the number of AWS IAM users that need to be managed in the AWS account.
Create a profile with the parameters in the file README.md
On the host, start docker-compose in detached mode
The .env file is used by docker-compose to set the MongoDB credentials.
Example of the .env file:
export MONGODB_USERNAME=training
export MONGODB_PASSWORD=training
cd docker
source .env
docker-compose up -d
This will start the containers defined in docker-compose.yml (MongoDB and the mypyspark container) in the background.
On the host, start a docker shell
Update your AWS keys (lines 3,4 and 5) in the file docker/code/docker-exec-with-aws-keys.sh
Source docker-exec-with-aws-keys.sh to start a shell in the mypyspark container, passing the AWS credentials.
cd code
source docker-exec-with-aws-keys.sh
In the container, run the PySpark script with python3.
The code directory is mapped to /opt inside the container.
The .env file contains the variables for PySpark to connect to MongoDB.
Example of the .env file:
export host=mongo
export user=training
export password=training
export database=dev
The PySpark script takes a single input parameter: the S3 path to the races.csv file (a sketch of such a script follows the commands below).
cd /opt/
source .env
python3 s3_to_mongo.py --s3_path=s3://tt-dp-dev-datalake/formula1/races.csv
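The actual script lives in the repo; purely as an illustration, a script like s3_to_mongo.py might look roughly as follows (the argument name is taken from the command above, the connector options are based on the MongoDB Spark connector 3.x, everything else is an assumption):

```python
# Hypothetical sketch of an S3 -> MongoDB PySpark job; the real s3_to_mongo.py
# in the repo may differ. Credentials come from the sourced .env variables.
# Assumes the MongoDB Spark connector is available on the Spark classpath.
import argparse
import os

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--s3_path", required=True, help="s3 path to the races.csv file")
args = parser.parse_args()

mongo_uri = "mongodb://{user}:{password}@{host}/{database}.races".format(
    user=os.environ["user"],
    password=os.environ["password"],
    host=os.environ["host"],
    database=os.environ["database"],
)

spark = (SparkSession.builder
         .appName("s3-to-mongo")
         .config("spark.mongodb.output.uri", mongo_uri)
         .getOrCreate())

# Read the CSV from S3 and write it to the "races" collection.
races = spark.read.csv(args.s3_path, header=True, inferSchema=True)
races.write.format("mongo").mode("overwrite").save()
```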
The Spark UI (port 4040 inside the container) is exposed via the port mapping defined in the docker-compose.yml file:
pyspark:
container_name: mypyspark
image: mypyspark
ports:
- "8080:4040"
Job: a job breaks down into a series of stages (a DAG).
Stage: a group of tasks that can be executed together.
Partition: represents how the data is physically distributed during execution.
Shuffle: a physical repartitioning of the data.
In contrast to a “normal” Python DataFrame, a PySpark DataFrame can span multiple machines; see the sketch below.
Input file split
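A small illustration of partitions and of a shuffle triggered by a wide transformation (assuming a running SparkSession, e.g. inside the mypyspark container):

```python
# Partitions and shuffles, illustrated on a tiny DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

df = spark.range(1_000_000)                      # distributed over partitions
print(df.rdd.getNumPartitions())                 # how the data is physically split

# A wide transformation: rows with the same key must end up together,
# so Spark has to shuffle (physically repartition) the data between tasks.
agg = df.groupBy((F.col("id") % 10).alias("bucket")).count()
agg.explain()                                    # the plan shows an Exchange (= shuffle)
agg.show()
```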
On the host, install the package:
wget https://downloads.mongodb.com/compass/mongodb-mongosh_1.0.5_amd64.deb
sudo apt-get install ./mongodb-mongosh_1.0.5_amd64.deb
Establish a connection
mongosh mongodb://localhost:27017/dev -u training
Show the collections; this should list the races collection.
dev> show collections
Find records; this will return the filtered DataFrame records.
dev> db.races.find()
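Optionally, the same check can be done from Python with pymongo (an assumption, not used in the course material):

```python
# Optional check with pymongo (assumes `pip install pymongo`).
# Credentials and database match the .env used by docker-compose.
from pymongo import MongoClient

client = MongoClient("mongodb://training:training@localhost:27017/dev")
db = client["dev"]
print(db.list_collection_names())        # should include "races"
print(db.races.count_documents({}))      # number of rows written by the Spark job
```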
Besides the standalone mode, Spark can run on a cluster.
A cluster managed by YARN or Kubernetes.
spark-submit
spark-submit does one thing: it lets you send your code to a cluster.
example:
spark-submit --master yarn --deploy-mode client twitter.py
AWS EMR offers some additional capabilities compared to traditional clusters.
EMR Serverless can use a Python venv to package your Python libraries.
This needs to be done on Amazon Linux 2.