Scanning Files for OCR to Various Bitmap Formats

  • Thread starter: johndavidwood

johndavidwood

I'm using the Opticbook 3600 Scanner to scan some pages from some
books, which I will send to OCR and convert to PDF. (For those
unfamiliar with the scanner, it is specially designed to scan books
without shadows, distorted text, etc.)

Here is my question:

I usually scan 300 dpi B+W into an image file and then send to OCR
(ABBYY Finereader 8).

The scanner comes with software that allows scanning to four
different file formats: BMP, TIFF, PNG, and JPG. The software doesn't
allow me to change any settings for these formats.

I know JPG is a compressed, lossy format, but aren't the other three
lossless bitmap formats?

In other words -- if all three should in theory produce a perfect
image, why not just scan to PNG which is the smallest file size, and
then send to OCR?

Where am I flawed in my reasoning??

JDW
 
Use LZW TIFF if the OCR software can handle the compression.
LZW TIFF is lossless, as PNG is (PNG uses Deflate rather than LZW), and TIFF
is a universal image format. Most if not all image software can use TIFF. PNG
is not as universal.

BMP is usually not compressed, makes large files, and is mainly a Windows format.

A list of the different file formats and their use.
http://www.scantips.com/basics09.html

Scanning for OCR and Line art.
http://www.scantips.com/basics04.html

Read the above page plus 5 more pages.
 
You are right, it shouldn't matter, assuming the OCR software can read
any of those formats. BMP has two modes: indexed color can use RLE
compression (lossless), and 24-bit RGB is not compressed.
PNG uses lossless compression. TIF has several possible compressions;
generally all are lossless (but a few programs can put JPG compression
into TIF files). TIF LZW, G3, and G4 are all lossless.
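
To make "lossless" concrete, here is a minimal Python sketch using only the standard library. It builds a made-up 1-bit page, compresses it with zlib's Deflate (the algorithm PNG uses internally), and checks that decompression gives back identical bytes. The page size and the bar pattern are assumptions invented for illustration; this is not a real scan and not what any scanner software does.

```python
import zlib

WIDTH, HEIGHT = 2550, 3300      # assumed page: 8.5 x 11 in at 300 dpi
ROW_BYTES = (WIDTH + 7) // 8    # 1 bit per pixel, each row padded to whole bytes


def synthetic_page():
    """Return a fake 1-bit page: white background with black horizontal bars."""
    white_row = bytes([0xFF]) * ROW_BYTES
    black_row = bytes([0x00]) * ROW_BYTES
    rows = []
    for y in range(HEIGHT):
        # Every 50 rows, a 10-row black bar stands in for a line of text.
        rows.append(black_row if y % 50 < 10 else white_row)
    return b"".join(rows)


raw = synthetic_page()
packed = zlib.compress(raw, 9)       # Deflate at maximum compression level
restored = zlib.decompress(packed)

print(len(raw), "bytes uncompressed")
print(len(packed), "bytes after Deflate")
print("bit-for-bit identical:", restored == raw)
```

A lossy format such as JPG would fail the last check, which is why it is the odd one out for OCR work.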

JPG compression is lossy, and is NOT the best choice for OCR work.

Scan mode is important too (line art, grayscale, or color). OCR is
often line art (black or white, no gray). JPG cannot store that mode at
all, but the others can.
 
I'm using the Opticbook 3600 Scanner to scan some pages from some
books, which I will send to OCR and convert to PDF.

Gilding the lily, innit? Text is much more adaptable than PDF. I'd go
straight text unless you have a bunch of pictures or complex layouts.

I usually scan 300 dpi B+W into an image file

I assume you mean "1-bit black and white" here.

different file formats: BMP, TIFF, PNG, and JPG. The software doesn't
allow [me] to change any settings for these formats. If all [except
JPEG] should in theory produce a perfect image, why not just scan to
PNG which is the smallest file size, and then send to OCR? Where am I
flawed in my reasoning??

Er. If you're scanning in black and white, Group4 TIFF will be a *lot*
smaller than PNG. A fairly standard 8.5x11 page is 45K in Group4, 137K
in PNG. You say the software you're using won't let you change
compression settings, so get some software that *does*. If you're stuck
with 'Doze (Windows), go Google IrfanView, which will let you save a 1-bit
image as Group4 TIFF.
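
For a sense of scale, a small Python sketch of the arithmetic behind those figures. It computes the uncompressed size of an 8.5x11 page at 300 dpi and compares it with the 45K and 137K numbers quoted above. Reading "K" as 1024 bytes is an assumption, and actual file sizes depend heavily on page content.

```python
DPI = 300
WIDTH_IN, HEIGHT_IN = 8.5, 11

width_px = round(WIDTH_IN * DPI)     # 2550
height_px = round(HEIGHT_IN * DPI)   # 3300


def raw_bytes(bits_per_pixel):
    """Uncompressed pixel data size, with each row padded to whole bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


one_bit = raw_bytes(1)    # 1-bit black and white
rgb = raw_bytes(24)       # 24-bit color, for comparison

# Sizes quoted in the post; "K" taken as 1024 bytes (assumption).
G4_BYTES = 45 * 1024
PNG_BYTES = 137 * 1024

print(f"1-bit uncompressed:  {one_bit:,} bytes")
print(f"24-bit uncompressed: {rgb:,} bytes")
print(f"Group4 TIFF is about {one_bit / G4_BYTES:.0f}x smaller than raw 1-bit")
print(f"PNG is about {one_bit / PNG_BYTES:.1f}x smaller than raw 1-bit")
```

Both compressed formats are far smaller than the raw bitmap; the point of the post is that Group4 is smaller again by roughly a factor of three on this kind of page.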

Also, TIFF has tags that store resolution data. PNG does as well, but
my fiddling with ImageMagick and GIMP makes me think the tags are
ignored or just not used much in PNG. (This figures; PNG was sort of
designed for Web use, not print.) If it's important that an image be
300 DPI, as it might be if you want to reproduce the image later, TIFF
is the way to go. HTH,
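
For reference, PNG stores resolution in an optional pHYs chunk, expressed as pixels per meter rather than DPI. The sketch below, in standard-library Python, builds a tiny 8x8 1-bit PNG with a pHYs chunk and reads the value back. The helper names (`chunk`, `tiny_png`, `read_dpi`) are hypothetical and made up for this illustration. Whether a given program writes or honors the chunk is a separate question that this does not settle.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254


def chunk(kind, data):
    """Assemble one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png(dpi):
    """Build an 8x8 all-white, 1-bit grayscale PNG tagged with the given DPI."""
    width, height = 8, 8
    pixels_per_meter = round(dpi / METERS_PER_INCH)
    # IHDR: width, height, bit depth 1, color type 0 (grayscale),
    # compression 0, filter 0, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 1, 0, 0, 0, 0)
    # pHYs: x and y pixels per unit, unit specifier 1 = meter.
    phys = struct.pack(">IIB", pixels_per_meter, pixels_per_meter, 1)
    # Each scanline: filter-type byte 0, then one byte holding 8 white pixels.
    scanlines = (b"\x00" + b"\xff") * height
    return (PNG_SIGNATURE
            + chunk(b"IHDR", ihdr)
            + chunk(b"pHYs", phys)
            + chunk(b"IDAT", zlib.compress(scanlines))
            + chunk(b"IEND", b""))


def read_dpi(png):
    """Return (x_dpi, y_dpi) from the pHYs chunk, or None if absent or unitless."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        length, kind = struct.unpack(">I4s", png[pos:pos + 8])
        data = png[pos + 8:pos + 8 + length]
        if kind == b"pHYs":
            x_ppu, y_ppu, unit = struct.unpack(">IIB", data)
            if unit != 1:            # unit 0 means aspect ratio only
                return None
            return (x_ppu * METERS_PER_INCH, y_ppu * METERS_PER_INCH)
        pos += 12 + length           # 4 length + 4 type + data + 4 CRC
    return None


x_dpi, y_dpi = read_dpi(tiny_png(300))
print(round(x_dpi), round(y_dpi))
```

Because the value is stored as whole pixels per meter, 300 DPI comes back as a hair under 300 before rounding, which may be one reason tools seem to disagree about PNG resolution.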
 
I see you've received a few replies to your questions. Would you mind
answering a few I have? I was looking for something like this 2 years
ago, but ended up getting an Epson (which works fine, overall, but
isn't really that great at scanning books). Is the scanner sturdy? Is
there any noticeable distortion? Is it as fast as other scanners you
may have experience with? Are you satisfied with it?

I am considering getting one of these "just" for scanning books.

Charlie Hoffpauir
http://freepages.genealogy.rootsweb.com/~charlieh/
 
I have a LOT to say!!! NONE of the reviews I have seen on the internet
are adequate IMHO... especially from the editors of the major PC
magazines who review this specialized book scanner and whine about how
it doesn't scan photos well, photos don't scan properly, why can't it
scan photos, where is the photo scanning software, etc. etc.!! I mean
WTF is wrong with them??

I was going to do a very detailed review and set up a website with pics,
shots, etc. -- I still plan on doing this within a couple of weeks when
I get some time. I will provide a brief synopsis though sooner, within
a day or two, to the group when I get a chance...

I own an Epson Photo Scanner 1670... IMHO for the price, one isn't
going to get a much better home scanner (though this model is one or
two years old). Nothing more I could ask for, except maybe x64 drivers
(I'm sure there's a workaround out there but I've never bothered).

JDW
 
Go for TIFF in its B&W flavour, just like fax machines. Otherwise PNG.
JPEG is no good for text or sharp, high-contrast edges; JPEG is good
for continuous-tone images such as photos.
 
I'm using the Opticbook 3600 Scanner to scan some pages from some
books, which I will send to OCR and convert to PDF.

Why convert to PDF?
PDF is designed for "locked" documents that look the way they would when
printed. Ebooks should be free-flow text or, if they have special formatting
or pictures, RTF/HTML or PDB (FineReader will output in any of these).

These will flow to fit the screen when you read them and are usable on
small-screen devices (Pocket PC, cyBook, etc.), where "locked" 8 1/2 x 11
PDFs require a lot of left-right scrolling, versus just the page-down you
get with free-flow text.

I usually scan 300 dpi B+W into an image file and then send to OCR
(ABBYY Finereader 8).

You really want to scan FROM FineReader and use the FineReader commands.
This lets FineReader make its own automatic adjustments to the brightness,
which makes black-and-white scans work a lot better.

FineReader can use either WIA or TWAIN to talk to the scanner, and either
its own command set or the scanner's software. You want to test both WIA and
TWAIN to see which is faster. Normally one is just run through a
translation DLL to the other one and is slower.

The scanner software comes with software which allows scanning to four
different file formats: BMP, TIFF, PNG, and JPG. The software doesn't
allow to change any settings for these formats.

Black and white should be TIFF or GIF.

Where am I flawed in my reasoning??

If you use the FineReader interface to scan, you can scan a custom size and
exclude the headers and footers, so your text flows correctly without lots of
post-OCR editing.
 