Discover ClickHouse: a high-performance Column-Oriented DBMS for OLAP (Online Analytical Processing)
- Introduction to Databases
- Fundamentals of Column-Oriented Databases
- Introduction to ClickHouse
- Architecture of ClickHouse
- Getting Started with ClickHouse
- Getting Started with ClickHouse Cloud
- chDB - An Overview
- Demo - Google Colab
- References:
Introduction to Databases
What is a Database?
A database is a system used to store, organize, and access information. It allows users to input, manage, and retrieve data efficiently and securely. Databases are structured to handle large amounts of information by allowing data to be stored in tables that can be searched, manipulated, and maintained easily.
Types of Databases
Relational: Stores data in structured tables with rows and columns, allowing complex queries using SQL. Examples: HR and CRM databases.
Non-relational: Handles unstructured and semi-structured data without a fixed table structure, using formats like key-value, document, and graph. Examples: e-commerce platforms, content management, IoT, and real-time analytics.
Introduction to Database Management Systems (DBMS)
Software that allows users to define, create, maintain, and control access to the database. It manages data storage, manipulation, and retrieval, facilitating interactions between multiple users and various databases.
Fundamentals of Column-Oriented Databases
Row-Oriented vs. Column-Oriented Storage
Row-Oriented Databases:
- Structure: Stores data by rows, making it efficient for writing operations and transactional processing where access to complete records is common.
- Benefits: Optimal for OLTP systems (Online Transaction Processing) like customer relationship management where fast updates and inserts are crucial.
Column-Oriented Databases:
- Structure: Stores data by columns, enhancing read operations and data analytics where aggregations over large datasets are required.
- Benefits: Ideal for OLAP systems (Online Analytical Processing) such as data warehousing and big data analytics, offering high performance for queries accessing only a subset of columns.
Advantages:
- Row-Oriented: Quick data retrieval by complete row, simpler implementation, efficient for entry and updating of records.
- Column-Oriented: Better compression, efficient storage leading to faster read of column data, optimized for querying large datasets, which reduces I/O operations.
Column-Oriented Database
graph TD
subgraph "Column-Oriented Database"
subgraph "ID Column"
ID1[1]
ID2[2]
ID3[3]
end
subgraph "Name Column"
Name1[John]
Name2[Alice]
Name3[Bob]
end
subgraph "Age Column"
Age1[30]
Age2[25]
Age3[35]
end
subgraph "City Column"
City1[New York]
City2[Los Angeles]
City3[San Francisco]
end
end
Row-Oriented Database
graph TD
subgraph "Row-Oriented Storage"
A["Row 1: ID | Name | Age | City"]
B["Row 2: ID | Name | Age | City"]
C["Row 3: ID | Name | Age | City"]
end
Introduction to ClickHouse
What is ClickHouse?
ClickHouse is an open-source column-oriented database management system (DBMS) designed for online analytical processing (OLAP). It allows users to generate analytical reports using SQL queries in real-time, offering high performance and scalability for processing large volumes of data. Key features include:
- Column-oriented storage
- Data compression
- Distributed query processing
- SQL support
- High performance for analytical queries
- Ability to handle petabytes of data
- Fault tolerance through replication
ClickHouse is particularly well-suited for applications requiring fast analytics on massive datasets, such as business intelligence, observability, and real-time data warehousing.
Key Features of ClickHouse & Speed
- Column-Oriented Storage: Often, only a few of the hundreds of columns in source data are needed for a report. Column-oriented storage improves efficiency by only accessing the necessary columns, reducing costly disk reads.
- Indexes: ClickHouse utilizes in-memory data structures to read only the required columns and specific row ranges, optimizing data retrieval.
- Data Compression: Grouping the same column's values enhances compression compared to row-oriented systems, as similar values are common in adjacent rows. ClickHouse also employs specialized codecs for even greater data compactness.
- Vectorized Query Execution: ClickHouse processes data by columns, improving CPU cache efficiency and enabling the use of SIMD CPU instructions for faster operations.
- Scalability: ClickHouse maximizes resource utilization by leveraging all available CPU cores and disks, not just on a single server but across a cluster, enhancing performance for complex queries.
- Stockage orienté colonne : Souvent, seules quelques-unes des centaines de colonnes des données sources sont nécessaires pour un rapport. Le stockage orienté colonne améliore l'efficacité en accédant uniquement aux colonnes nécessaires, réduisant ainsi les coûteuses lectures de disque.
- Index* : ClickHouse utilise des structures de données en mémoire pour lire uniquement les colonnes requises et les plages de lignes spécifiques, optimisant ainsi la récupération des données.
- Compression des donnĂ©es : Regrouper les valeurs d'une mĂȘme colonne amĂ©liore la compression par rapport aux systĂšmes orientĂ©s ligne, car les valeurs similaires sont courantes dans les lignes adjacentes. ClickHouse utilise Ă©galement des codecs spĂ©cialisĂ©s pour une compacitĂ© encore plus grande des donnĂ©es.
- ExĂ©cution de requĂȘtes vectorisĂ©es : ClickHouse traite les donnĂ©es par colonnes, amĂ©liorant l'efficacitĂ© du cache CPU et permettant l'utilisation d'instructions CPU SIMD pour des opĂ©rations plus rapides.
- ScalabilitĂ© : ClickHouse maximise l'utilisation des ressources en exploitant tous les cĆurs de CPU et les disques disponibles, non seulement sur un seul serveur mais sur tout un cluster, amĂ©liorant ainsi les performances pour les requĂȘtes complexes.
Primary Key, Index & Granule
Primary Key:
- Determines the physical sorting order of data on disk
- Used to create a sparse index
Index:
- A sparse index entry (mark) is written for every N rows (default 8192), i.e. one entry per granule
- Stored in the primary.idx file
- Used for quickly locating relevant data blocks
Granules:
- A granule is a logical group of rows (default 8192 rows)
- Data is split into multiple parts, and each part is divided into granules
How they work together:
- When a query is executed, ClickHouse uses the primary key to search the sparse index
- The index helps identify which granules potentially contain relevant data
- ClickHouse then loads only the matching granules into memory
- Data within the granules is processed to find the exact matching rows
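The interaction above can be sketched as follows (table and column names are hypothetical):

```sql
-- The ORDER BY clause defines the primary key: the physical sort order
-- on disk and the columns covered by the sparse index.
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    action     String
)
ENGINE = MergeTree
ORDER BY (event_date, user_id);

-- This filter matches a prefix of the primary key, so ClickHouse can use
-- the sparse index to skip whole granules instead of scanning every row.
SELECT count()
FROM events
WHERE event_date = '2024-01-01' AND user_id = 42;
```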
Architecture of ClickHouse
Overview of ClickHouse Architecture
- Engine: A ClickHouse table engine determines how and where data is stored, how queries are executed, and the concurrency and replication capabilities of the table. Each engine is optimized for specific use cases, providing functionality such as data partitioning, indexing, and support for different query types.
ClickHouse Engines
MergeTree Family:
- MergeTree: The most versatile and commonly used engine, designed for high-performance read and write operations, supporting data partitioning, primary keys, and data replication.
- ReplacingMergeTree: Allows for deduplication of rows during merges based on the sorting key.
- SummingMergeTree: Sums up numeric columns for rows with the same sorting key during merges.
- AggregatingMergeTree: Aggregates data during merges using specified aggregate functions.
- CollapsingMergeTree: Collapses rows with the same sorting key during merges, based on a special 'Sign' column.
- VersionedCollapsingMergeTree: Extends CollapsingMergeTree with version control for rows.
- GraphiteMergeTree: Specialized for storing time series data.
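As an illustration, a SummingMergeTree table might look like this (the schema is hypothetical):

```sql
-- Rows sharing the same sorting key are summed during background merges.
CREATE TABLE daily_sales
(
    day     Date,
    product String,
    amount  UInt64
)
ENGINE = SummingMergeTree
ORDER BY (day, product);

-- Merges are asynchronous, so queries should still aggregate explicitly
-- to get correct totals regardless of the current merge state.
SELECT day, product, sum(amount) AS total
FROM daily_sales
GROUP BY day, product;
```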
Integration Engines:
- ODBC: Allows querying external databases using ODBC.
- JDBC: Enables querying external databases using JDBC.
- MySQL: For querying MySQL databases.
- MongoDB: Allows querying MongoDB databases.
- HDFS: For working with data stored in Hadoop Distributed File System.
- S3: Enables working with data stored in Amazon S3.
- Kafka: For working with Apache Kafka streams.
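For example, a table backed by the MySQL engine forwards queries to a remote MySQL server (the host, database, and credentials below are placeholders):

```sql
-- Reads and writes on this table are proxied to the remote MySQL table.
CREATE TABLE mysql_orders
(
    id     UInt32,
    status String
)
ENGINE = MySQL('mysql-host:3306', 'shop', 'orders', 'user', 'password');

SELECT status, count() FROM mysql_orders GROUP BY status;
```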
Getting Started with ClickHouse
Installation and Setup - on macOS or Linux
- Install the binary:
curl https://clickhouse.com/ | sh
- Run the following command to start the ClickHouse server:
./clickhouse server
- Open a new terminal and connect to the server:
clickhouse-client
- Query remote files:
--Check the file
SELECT * FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/ontime/csv_by_year/*.csv.gz', CSVWithNames) limit 10;
--Basic Count
SELECT count() FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/ontime/csv_by_year/*.csv.gz', CSVWithNames);
--The number of delays by carrier for 2007
SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*)
FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/ontime/csv_by_year/*.csv.gz', CSVWithNames)
WHERE DepDelay>10 AND Year=2007
GROUP BY Carrier
ORDER BY count(*) DESC limit 5;
Results
┌─Carrier─┬─count()─┐
│ WN      │  296293 │
│ AA      │  176203 │
│ MQ      │  145630 │
│ US      │  135987 │
│ UA      │  128174 │
└─────────┴─────────┘
5 rows in set. Elapsed: 420.586 sec. Processed 204.84 million rows, 11.66 GB (487.03 thousand rows/s., 27.72 MB/s.) Peak memory usage: 1.18 GiB.
Running Your First Query
Create table
CREATE TABLE `ontime`
(
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`Reporting_Airline` LowCardinality(String),
`DOT_ID_Reporting_Airline` Int32,
`IATA_CODE_Reporting_Airline` LowCardinality(String),
`Tail_Number` LowCardinality(String),
`Flight_Number_Reporting_Airline` LowCardinality(String),
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` LowCardinality(String),
`OriginState` FixedString(2),
`OriginStateFips` FixedString(2),
`OriginStateName` LowCardinality(String),
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` LowCardinality(String),
`DestState` FixedString(2),
`DestStateFips` FixedString(2),
`DestStateName` LowCardinality(String),
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` LowCardinality(String),
`DepTimeBlk` LowCardinality(String),
`TaxiOut` Int32,
`WheelsOff` LowCardinality(String),
`WheelsOn` LowCardinality(String),
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` LowCardinality(String),
`ArrTimeBlk` LowCardinality(String),
`Cancelled` Int8,
`CancellationCode` FixedString(1),
`Diverted` Int8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` Int8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` Int16,
`TotalAddGTime` Int16,
`LongestAddGTime` Int16,
`DivAirportLandings` Int8,
`DivReachedDest` Int8,
`DivActualElapsedTime` Int16,
`DivArrDelay` Int16,
`DivDistance` Int16,
`Div1Airport` LowCardinality(String),
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` Int16,
`Div1TotalGTime` Int16,
`Div1LongestGTime` Int16,
`Div1WheelsOff` Int16,
`Div1TailNum` LowCardinality(String),
`Div2Airport` LowCardinality(String),
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` Int16,
`Div2TotalGTime` Int16,
`Div2LongestGTime` Int16,
`Div2WheelsOff` Int16,
`Div2TailNum` LowCardinality(String),
`Div3Airport` LowCardinality(String),
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` Int16,
`Div3TotalGTime` Int16,
`Div3LongestGTime` Int16,
`Div3WheelsOff` Int16,
`Div3TailNum` LowCardinality(String),
`Div4Airport` LowCardinality(String),
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` Int16,
`Div4TotalGTime` Int16,
`Div4LongestGTime` Int16,
`Div4WheelsOff` Int16,
`Div4TailNum` LowCardinality(String),
`Div5Airport` LowCardinality(String),
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` Int16,
`Div5TotalGTime` Int16,
`Div5LongestGTime` Int16,
`Div5WheelsOff` Int16,
`Div5TailNum` LowCardinality(String)
) ENGINE = MergeTree
ORDER BY (Year, Quarter, Month, DayofMonth, FlightDate, IATA_CODE_Reporting_Airline);Load the data - loaded 1M records and 2002 only
INSERT INTO ontime SELECT *
FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/ontime/csv_by_year/*2001*.csv.gz', CSVWithNames)
WHERE Year=2001 limit 1000000
SETTINGS max_insert_threads = 40;
- Number of delays by airport
SELECT Origin, count(*) AS c
FROM ontime
WHERE DepDelay>10
GROUP BY Origin
ORDER BY c DESC
LIMIT 10;
Query Local Data - the file is local
Using the NYPD Complaints dataset.
describe table file("/Users/[you]/user_files/nypd_complaint.tsv")
select * from file("/Users/[you]/user_files/nypd_complaint.tsv") limit 10
Quick analysis
- Count of Complaints by Crime type
SELECT IFNULL(NYPD_Complaint_Data_Current_Year_To_Date.`OFNS_DESC`, 'Unknown') AS Crime_Type,
COUNT(*) AS Complaint_Count
FROM url('https://data.cityofnewyork.us/api/views/5uac-w243/rows.csv?accessType=DOWNLOAD', CSVWithNames)
AS NYPD_Complaint_Data_Current_Year_To_Date
GROUP BY Crime_Type ORDER BY Complaint_Count DESC limit 10
- Most Common Crimes at Night
SELECT IFNULL(NYPD_Complaint.`OFNS_DESC`, 'Unknown') AS Crime_Type,
COUNT(*) AS Complaint_Count
FROM url('https://data.cityofnewyork.us/api/views/5uac-w243/rows.csv?accessType=DOWNLOAD', CSVWithNames)
AS NYPD_Complaint
WHERE (NYPD_Complaint.`CMPLNT_FR_TM` >= '18:00:00' OR NYPD_Complaint.`CMPLNT_FR_TM` <= '06:00:00')
GROUP BY Crime_Type ORDER BY Complaint_Count DESC limit 10
Getting Started with ClickHouse Cloud
chDB - An Overview
Overview
chDB is an in-process SQL OLAP (Online Analytical Processing) engine powered by ClickHouse. It's designed to be an embedded database that brings the power of ClickHouse to applications without the need for a separate server installation. chDB is developed by ClickHouse, Inc. and open-source contributors.
Key features of chDB include:
- Serverless operation - no need to install or run ClickHouse services
- Support for multiple programming languages including Python, Go, Rust, NodeJS, and Bun
- Ability to work with various data formats like Parquet, CSV, JSON, Arrow, and ORC
- Custom User Defined Functions (UDFs)
Use Cases - Why Use chDB
Here are some practical use cases for chDB in real-world applications:
- Data analysis in Jupyter notebooks: chDB allows users to run ClickHouse queries directly in Jupyter notebooks, making it convenient for data scientists and analysts to work with large datasets without needing a separate ClickHouse server.
- Embedded analytics in applications: Developers can embed chDB into their applications to provide fast analytical capabilities without requiring a separate database server.
- Processing large datasets locally: chDB enables processing of large datasets (e.g. Parquet, CSV, JSON files) on a local machine without needing to set up a full database system.
- Rapid prototyping: Data engineers and analysts can quickly prototype and test queries on large datasets without deploying a full ClickHouse cluster.
- Serverless environments: chDB can be used in serverless functions to provide analytical capabilities without managing database infrastructure.
- Data pipeline processing: chDB can be integrated into data pipelines for fast data transformations and aggregations.
