MLA Wire: pg_sample: extract a sample dataset from a larger PostgreSQL database

pg_sample is a PostgreSQL utility for making smaller versions of large databases.

download pg_sample 0.01

When you have a relatively large database (tables with, say, millions or billions of rows), it can be difficult to generate smaller datasets to work with, especially if foreign keys are heavily used.

That's where this script comes in. It will create smaller instances of each table along with any additional rows needed to satisfy foreign key constraints (circular dependencies are supported).

The script's operation closely resembles that of pg_dump. For example, assuming we have a large database named largedb, a smaller version could be produced with:


  createdb smalldb
  pg_sample largedb | psql smalldb

The smalldb would then contain a subset of largedb's data.

Here are the command-line options (many of which mirror pg_dump):


    -a
    --data-only
        Output only the data, not the schema (data definitions).

    -E *encoding*
    --encoding=*encoding*
        Use the specified character set encoding. If not specified, uses the
        environment variable PGCLIENTENCODING, if defined; otherwise, uses
        the encoding of the database.

    -f *file*
    --file=*file*
        Send output to the specified file. If omitted, standard output is
        used.

    --force
        Drop the sample schema if it exists.

    --keep
        Don't delete the sample schema when the script finishes.

    --limit=*number*
        The maximum number of rows to initially copy from each table
        (defaults to 100). Note that sample tables may end up with
        significantly more rows in order to satisfy foreign key constraints.

    --random
        Randomize the rows initially selected from each table. May
        significantly increase the running time of the script.

    --schema=*name*
        The schema name to use for the sample database (defaults to
        _pg_sample).

    --trace
        Turn on Perl DBI tracing. See the DBI module documentation for
        details.

    --verbose
        Output status information to standard error.

    The following options control the database connection parameters.

    -h *host*
    --host=*host*
        The host name to connect to. Defaults to the PGHOST environment
        variable if not specified.

    -p *port*
    --port=*port*
        The database port to connect to. Defaults to the PGPORT environment
        variable, if set; otherwise, the default port is used.

    -U *username*
    --username=*username*
        User name to connect as.

    -W *password*
    -password=*password*
        Password to connect with.

MLA Wire

Saturday, July 3, 2010

pg_sample: extract a sample dataset from a larger PostgreSQL database

No comments:

Archive