Monday, August 20, 2007

iconv: file too large

The iconv utility is used to convert file encodings. I'm using it to convert a PostgreSQL database from LATIN1 to UTF8.

However, the standard iconv program slurps the entire file into memory, which doesn't work for large data sets (such as database exports). You'll see errors like:

iconv: unable to allocate buffer for input: Cannot allocate memory
iconv: cannot open input file `database.txt': File too large

This script is just a wrapper that processes the input file in manageable chunks and writes the result to standard output: iconv-chunks
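The idea behind the wrapper can be sketched in plain shell (a minimal illustration, not the actual iconv-chunks script; the file names and chunk size here are made up):

```shell
# Minimal sketch of the chunking idea (not the real iconv-chunks script).
# Assumption: the input is line-oriented (like a pg_dump), so splitting on
# line boundaries never cuts a multibyte character in half.
printf 'caf\xe9\n' > database.txt      # sample input: "café" in LATIN1

split -l 100000 database.txt chunk_    # break the input into line-based chunks
for f in chunk_*; do
  iconv -f LATIN1 -t UTF-8 "$f"        # convert each small chunk separately
done > database.utf8.txt               # concatenate the converted output

rm -f chunk_*                          # clean up the temporary chunks
```

Because each chunk fits comfortably in memory, iconv never has to slurp the whole dump at once.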


compass2k said...

iconv-chunks fails with "Bad file descriptor" at line 62 of the code, around line 57893 of a 110 MB (or 285 MB) clean db dump.
That is where subroutines are called in the code.
The command I am running is ./iconv-chunks datafile -f utf-8 -t utf-8 > dataonly_cleaned
Any ideas?

mla said...

Hmmm. What OS are you using?

The iconv program runs fine for you on smaller files?

Line 62 looks like the external call to iconv is failing.

Unknown said...

You probably need to add the -c option to skip characters that are not convertible.
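For the record, here is what -c does, with a tiny made-up example (0xFF is never a valid byte in UTF-8):

```shell
# -c tells iconv to silently discard bytes that cannot be converted,
# instead of aborting with "illegal input sequence".
printf 'ok \xff end\n' > bad.txt               # contains one invalid byte
iconv -f UTF-8 -t UTF-8 -c bad.txt > clean.txt # invalid byte is dropped
```

Note that -c silently loses data, so it's a last resort; if the "invalid" bytes are really Latin1, fixing the -f option is the better cure.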

nicola said...

My file was 4.2 GB in size.
Here is my solution:

uconv -f UTF-16LE -t UTF-8 < data.csv > data_utf8.csv

Maybe it will work with iconv too.

nicola said...

And uconv supports callbacks for invalid characters. See the man page.

R.M said...

How can I get this script?
From where?

mla said...

The script is available here:

Anonymous said...

With my perl, this script leaves a tmp file in /tmp. Unfortunately, I discovered this when /tmp filled up completely.

I appended this line to fix it:

unlink $tmp;
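On the shell side, the same guarantee can be sketched with a trap (a hypothetical demo, not part of iconv-chunks): the temp file gets removed however the work ends, so a crash partway through can't leave junk behind in /tmp.

```shell
# Demo: an EXIT trap set inside a subshell removes the temp file when
# the subshell exits, whether the work in between succeeded or died.
tmp_path=$(
  t=$(mktemp)
  trap 'rm -f "$t"' EXIT           # fires when this subshell exits
  printf 'converted data\n' > "$t" # stand-in for the real conversion work
  echo "$t"                        # report the path so cleanup is checkable
)
# By the time the subshell returns, the trap has already deleted the file.
```

The Perl equivalent is to let File::Temp create the handle so the file is unlinked automatically, rather than relying on reaching an explicit unlink at the end.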

mla said...

Fixed the removal of the temp file and moved the script to github:

Артем said...

[root@db1 scripts]# ./iconv-chunks /root/scripts/hist1.dmp -f utf8 -t utf32 > /root/scripts/hist2.dmp
iconv: illegal input sequence at position 535322
command 'iconv -f utf8 -t utf32 /tmp/44RA8vXwBe' failed: Inappropriate ioctl for device at ./iconv-chunks line 63, <> line 3295.

Can you help?

mla said...

Are you sure hist1.dmp is encoded as utf8? Maybe it's really Latin1? Try "-f latin1" instead.
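A quick way to test that guess: a strict round-trip through iconv fails at the first invalid sequence, so a successful utf8-to-utf8 pass means the file really is UTF-8 (sample file name made up here):

```shell
# 0xE9 is "é" in LATIN1 but an invalid sequence in UTF-8.
printf 'caf\xe9\n' > hist_sample.txt

# Strict validation pass: fails if the file is not well-formed UTF-8.
if ! iconv -f UTF-8 -t UTF-8 hist_sample.txt >/dev/null 2>&1; then
  echo "not valid UTF-8; try -f latin1"
fi

# Reading it as Latin1 instead converts cleanly.
iconv -f LATIN1 -t UTF-8 hist_sample.txt
```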