Saturday, July 3, 2010

pg_sample: extract a sample dataset from a larger PostgreSQL database

pg_sample is a PostgreSQL utility for making smaller versions of large databases.

download pg_sample 0.01

When you have a relatively large database (tables with, say, millions or billions of rows), it can be difficult to generate smaller datasets to work with, especially if foreign keys are heavily used.

That's where this script comes in. It will create smaller instances of each table along with any additional rows needed to satisfy foreign key constraints (circular dependencies are supported).

The script's operation closely resembles that of pg_dump. For example, assuming we have a large database named largedb, a smaller version could be produced with:

createdb smalldb
pg_sample largedb | psql smalldb
The smalldb would then contain a subset of largedb's data.

Here are the command-line options (many of which mirror pg_dump):

-a
--data-only
Output only the data, not the schema (data definitions).

-E *encoding*
--encoding=*encoding*
Use the specified character set encoding. If not specified, uses the
environment variable PGCLIENTENCODING, if defined; otherwise, uses
the encoding of the database.

-f *file*
--file=*file*
Send output to the specified file. If omitted, standard output is
used.

--force
Drop the sample schema if it exists.

--keep
Don't delete the sample schema when the script finishes.

--limit=*number*
The maximum number of rows to initially copy from each table
(defaults to 100). Note that sample tables may end up with
significantly more rows in order to satisfy foreign key constraints.

--random
Randomize the rows initially selected from each table. May
significantly increase the running time of the script.

--schema=*name*
The schema name to use for the sample database (defaults to
_pg_sample).

--trace
Turn on Perl DBI tracing. See the DBI module documentation for
details.

--verbose
Output status information to standard error.

The following options control the database connection parameters.

-h *host*
--host=*host*
The host name to connect to. Defaults to the PGHOST environment
variable if not specified.

-p *port*
--port=*port*
The database port to connect to. Defaults to the PGPORT environment
variable, if set; otherwise, the default port is used.

-U *username*
--username=*username*
User name to connect as.

-W *password*
-password=*password*
Password to connect with.
See also: pg_sample Github source repository