I recently started building a web application using SQLite, and needed to seed the database with some existing data. As is often the case, this was supplied in a CSV. I have recently been building DuckDB into my toolset a lot more and thought this task should be an easy win. A lot of existing docs/blogs start with creating a table in SQLite and specifying schemas… I wanted something more quick and dirty.
CSV to SQLite table as fast as possible (the tl;dr)
We start with our CSV file, shown here as a table.
sodor.csv
Name | Type | Color | Number |
---|---|---|---|
Thomas | Tank Engine | Blue | 1 |
Gordon | Tender Engine | Blue | 4 |
Percy | Tank Engine | Green | 6 |
James | Mixed-Traffic Engine | Red | 5 |
Cranky | Crane | Yellow | |
Let's build a SQLite database with this as our first table. We don't care too much about the schema; this is for development, and if you're using SQLite, schemas are less important.
$ duckdb -c "ATTACH 'app.db' as sqlite_db (TYPE SQLITE); \
CREATE TABLE sqlite_db.personnel AS SELECT * FROM 'sodor.csv';"
That's it.
Did it work? Let's use the SQLite shell tool to check it.
$ sqlite3 app.db
sqlite> select * from personnel;
Thomas|Tank Engine|Blue|1
Gordon|Tender Engine|Blue|4
Percy|Tank Engine|Green|6
James|Mixed-Traffic Engine|Red|5
Cranky|Crane|Yellow|
sqlite> .schema personnel
CREATE TABLE personnel("Name" VARCHAR, "Type" VARCHAR, Color VARCHAR, Number BIGINT);
It did! And DuckDB inferred that our Number column would be better as a BIGINT (okay, an INT would be more efficient, but it's better than a VARCHAR).
Finally, we can make the shell command even shorter (though maybe that's harder to understand on first reading).
$ duckdb -c "ATTACH 'app.db' (TYPE SQLITE); \
CREATE TABLE app.personnel AS FROM 'sodor.csv';"
In this shorter version, DuckDB uses the app in app.db as the alias for the SQLite database, so we can create tables in it like app.personnel.
How did that work?
Let's break that down into the different features of DuckDB that were useful here.
CSV Autodetection
For me, CSV autodetection is one of the best features in DuckDB. In this example, we start with the string 'sodor.csv', and DuckDB uses the filename to work out that it should use the read_csv_auto function. This function works out the delimiter and the data types for the columns, and pulls out the header names. So we already have a SQL-like table just by doing select * from 'sodor.csv'.
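The filename shorthand is just sugar for calling the function yourself; both of these return the same table:
D select * from 'sodor.csv';
D select * from read_csv_auto('sodor.csv');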
Create Table As Select (CTAS)
CTAS is a feature, not exclusive to DuckDB, that allows us to create a table with the schema of a select statement. It's especially useful in this situation, as CSV autodetection has already given us a table-like object with a schema (you can run describe from 'sodor.csv' to see it, as shown below). So we don't need to respecify a table schema if we already have a suitable one; obviously, if you want more control over the database schema, you can create a table "like normal" and insert into that with any transformations you want to do.
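For illustration, that describe looks something like this (sketched from the schema DuckDB inferred above; the exact output columns can vary between DuckDB versions):
D describe from 'sodor.csv';
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ Name        │ VARCHAR     │ YES     │         │         │         │
│ Type        │ VARCHAR     │ YES     │         │         │         │
│ Color       │ VARCHAR     │ YES     │         │         │         │
│ Number      │ BIGINT      │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘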
DuckDB SQLite Extension
DuckDB has extensions for SQLite, PostgreSQL, MySQL, etc., which allow us to work directly with external databases.
DuckDB does this in an intelligent way: data isn't copied into the DuckDB process, but we still get the benefits of DuckDB's query engine. With Postgres, it uses the binary transfer mode.
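As a sketch (the connection details here are made up), attaching a Postgres database looks much the same as the SQLite example below:
ATTACH 'dbname=mydb user=postgres host=127.0.0.1' AS pg_db (TYPE POSTGRES);
SELECT * FROM pg_db.some_table;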
You can attach multiple databases to a DuckDB session. In our case we want to attach a new SQLite database, and create a table with data in it.
ATTACH 'app.db' as sqlite_db (TYPE SQLITE);
By default a database is created in the native DuckDB format, but we can specify what storage type we want.
We can change the current "main" database with a use statement, like you can in MySQL. But if we don't, we access tables with the namespace prefix, like sqlite_db.t1.
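For example, after the ATTACH above (t1 is a throwaway table name):
D USE sqlite_db;
D CREATE TABLE t1 AS FROM 'sodor.csv'; -- same table as sqlite_db.t1, no prefix needed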
With SQLite, once attached, DuckDB will create the database file on disk, and as we perform operations they are written to disk.
Taking more control
The main focus of this write-up is the one-liner shell CSV-to-SQLite-table command. Using the DuckDB CLI's -c flag, commands are executed in non-interactive mode, which is useful for simple tasks like this or inside pipelines.
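For instance, the CLI's output flags combine with -c; something like the following (using the -csv output mode flag) prints plain CSV rows that the next command in a pipe can consume:
$ duckdb -csv -c "SELECT Name, Number FROM 'sodor.csv' WHERE Color = 'Blue'"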
In interactive mode, we can make any changes we need if the defaults are not what we want.
Starting an interactive session:
$ duckdb
D select * from 'sodor.csv';
┌─────────┬──────────────────────┬─────────┬────────┐
│  Name   │         Type         │  Color  │ Number │
│ varchar │       varchar        │ varchar │ int64  │
├─────────┼──────────────────────┼─────────┼────────┤
│ Thomas  │ Tank Engine          │ Blue    │      1 │
│ Gordon  │ Tender Engine        │ Blue    │      4 │
│ Percy   │ Tank Engine          │ Green   │      6 │
│ James   │ Mixed-Traffic Engine │ Red     │      5 │
│ Cranky  │ Crane                │ Yellow  │        │
└─────────┴──────────────────────┴─────────┴────────┘
Let's assume we are really worried about storage space and want Number to be stored in a smaller integer type.
D select * from read_csv('sodor.csv', types={'Number':'TINYINT'});
┌─────────┬──────────────────────┬─────────┬────────┐
│  Name   │         Type         │  Color  │ Number │
│ varchar │       varchar        │ varchar │  int8  │
├─────────┼──────────────────────┼─────────┼────────┤
│ Thomas  │ Tank Engine          │ Blue    │      1 │
│ Gordon  │ Tender Engine        │ Blue    │      4 │
│ Percy   │ Tank Engine          │ Green   │      6 │
│ James   │ Mixed-Traffic Engine │ Red     │      5 │
│ Cranky  │ Crane                │ Yellow  │        │
└─────────┴──────────────────────┴─────────┴────────┘
Alternatively, we might want to manually create a table, either in DuckDB or in another database, and then move data into it. We could also use the COPY command (sketched after the example below).
D CREATE TABLE personnel (nom varchar, nombre decimal);
D INSERT INTO personnel SELECT Name as nom, Number as nombre from 'sodor.csv';
D SELECT * FROM personnel;
┌─────────┬───────────────┐
│   nom   │    nombre     │
│ varchar │ decimal(18,3) │
├─────────┼───────────────┤
│ Thomas  │         1.000 │
│ Gordon  │         4.000 │
│ Percy   │         6.000 │
│ James   │         5.000 │
│ Cranky  │               │
└─────────┴───────────────┘
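And the COPY route mentioned above might look like this (a sketch, assuming a target table whose columns line up with the CSV; engines is a hypothetical name):
D CREATE TABLE engines (name VARCHAR, type VARCHAR, color VARCHAR, number TINYINT);
D COPY engines FROM 'sodor.csv' (FORMAT CSV, HEADER);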
There's no limit to what we can do; DuckDB is very flexible with these operations. But the nice bit is that we can normally get pretty far with the defaults and very short commands.
Aside: converting to Parquet
Converting to Parquet is even easier. I'm just adding it here because Parquet is such a useful format, and you can get massive benefits if you're only using CSV at the minute.
$ duckdb -c "COPY (FROM 'sodor.csv') TO 'sodor.parquet' (FORMAT PARQUET)"
It's worth looking at the Parquet options. The defaults are fine, but changing the compression settings might help if you have huge datasets: do you want to optimise for speed or for storage?
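For example, a sketch using the zstd codec, which typically trades a little write speed for smaller files:
$ duckdb -c "COPY (FROM 'sodor.csv') TO 'sodor.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"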