allyk
Pony Clubber
Posts: 184
|
Post by allyk on Aug 11, 2013 5:21:00 GMT 1
(A lot of this is not original to me and is adapted from various available tutorials.)
(If there's anything, ah, 'missing' from this tutorial that you happen to need to scan, be sure to PM me.)
As always, there are several ways to do this, so feel free to adapt any information here to your methodology. That said, here is how I do it.
----------------
Hardware
There are 3 main ways of scanning books:
- Sheet-fed scanner: Quick and gives the best quality scans, but requires chopping the book up, which I just can't bring myself to do.
- Camera rig: The 'safest' for books and also very quick, but requires more effort to build than I'm willing to put in. If you're interested in this, check out www.diybookscanner.org
- Flatbed: The simplest and cheapest way, and my preferred choice. It does require pressure on the spine to keep the book flat, but I've never had problems damaging books like this. YMMV.
Post by allyk on Aug 11, 2013 5:29:33 GMT 1
Software
There are any number of programs out there you can use to scan and then OCR books, but there's only one I recommend:
>>Abbyy FineReader 11<<
It easily handles book scanning, does the best OCR and just works.
The rest of this tutorial will be based on FineReader, though some stuff can be adapted to other programs.
Doing the Actual Scan
The FIRST thing to do when starting a new project in FineReader is to go to File->Save FineReader document. Otherwise it will save scanned pages off into some temp location and cause you all sorts of grief.
Next go to Tools->Options and look at the Scan/Open tab:
- Do not read and analyze acquired page images automatically
- Enable image preprocessing
- For most people, Split facing pages (I don't, and you shouldn't if there are images that span pages)
- You should see your scanner listed in the 'Driver' dropdown. If you don't, either you don't have the drivers loaded or your scanner isn't connected.
There are differing opinions on the best resolution and color mode to scan at. I prefer 300dpi grayscale; others like 600dpi black/white. For color illustrations or covers you can enable color just for those images.
It is important to press down HARD on the spine to eliminate the spine shadow as much as possible.
After a few test scans to make sure everything is working right, you can set it to scan multiple pages so you can just flip pages without hitting any buttons.
If you need a little more time to flip the page, you can add a delay after each page.
Depending on the speed of your scanner and the quality of the book, you can get anywhere from 2.5 to 10+ pages/minute. Most reasonably sized novels should scan in less than an hour, and many in closer to 30 minutes.
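To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The function name and the 215-page book are only illustrations (nothing here comes from FineReader), and it assumes a steady rate with no pauses.

```python
def scan_minutes(total_pages: int, pages_per_minute: float) -> float:
    """Estimated scanning time in minutes at a steady rate."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return total_pages / pages_per_minute


# A hypothetical 215-page book at the slow and fast ends of the range.
slow = scan_minutes(215, 2.5)
fast = scan_minutes(215, 10)
print(f"{slow:.0f} min at 2.5 pages/min, {fast:.1f} min at 10 pages/min")
```

At the slow end a book that size runs past the hour, so it is worth timing a few test pages before you commit to an evening of scanning.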
Congratulations, you just finished the easy part of converting a book to an ebook.
Post by allyk on Aug 11, 2013 5:46:43 GMT 1
Optional: Image Prep
Page Flipping and Splitting - If you didn't get your options set right or you decided to manually split pages, now is the time.
Page -> Edit Page Image - There are options on the side to do different things to the image such as Rotate&Flip and Split. If all the pages need to be flipped, set the Selection to 'All pages'. If you want to manually split all pages, click the first icon below 'Split' (Add Vertical Separator), set the Selection to 'Current Page', click to position the split and then click 'Split by Line'.
Hint: You can use alt-n to split and then page-down to go to the next page so you don't have to move your mouse back and forth, saving time. Or if you want to be really clever, start at the last page and use page-up to go to the previous page (you'll see why).
Page Cropping - While not strictly necessary, I like to do this for several reasons:
1) It's neater
2) It reduces OCR issues
3) When proofing, it allows the text to be larger (easier to see) when zooming to the full page
Access Crop the same way (Page -> Edit Page Image). Drag around and resize the crop rectangle with the mouse until it just covers the text, then double-click on the crop rectangle to apply it and page-down to move to the next page.
Make Sure You Have All the Pages
Now that all the pages are split, make sure you actually got all the pages.
- Check the total number of pages: If there are 210 numbered pages plus 3 unnumbered pages at the front plus 2 covers, there should be 215 total pages.
- Spot check: It's possible to both skip pages and double up pages such that the total number of pages is still correct. I know, I have actually managed to do this. If page 1 of the book is page 6 of the FR document, then you have an offset of 5. So go to page 50 of your FR document and make sure it's page 45 (50-5) of the book. Then check that page 100 is 95, 150 is 145, 200 is 195, etc.
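If you'd rather not do the arithmetic in your head, the two checks can be sketched in a few lines of Python. The function names are made up for this example and the numbers are the ones from the checks above; you still have to look at the pages yourself.

```python
def expected_total(numbered: int, unnumbered_front: int, covers: int) -> int:
    """Total page images the FineReader document should contain."""
    return numbered + unnumbered_front + covers


def spot_checks(offset: int, total: int, step: int = 50) -> list[tuple[int, int]]:
    """Pairs of (FR document page, book page number expected on it)."""
    return [(fr_page, fr_page - offset) for fr_page in range(step, total + 1, step)]


total = expected_total(numbered=210, unnumbered_front=3, covers=2)
offset = 6 - 1  # book page 1 sits on FR page 6
print(f"Expect {total} pages in the FR document")
for fr_page, book_page in spot_checks(offset, total):
    print(f"FR page {fr_page} should show book page {book_page}")
```

If any pair is off, the skipped or doubled page is somewhere between that spot and the last one that matched.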
Post by allyk on Aug 11, 2013 5:55:48 GMT 1
Optional: Accents
If you're scanning a book with a lot of accented characters, you may want those recognized automatically.
(Of course if the book is completely in a foreign language, you should use that as the recognition language. This is for English with a smattering of accents thrown in)
Tools > Language Editor
New -> Create a new language based on an existing one: 'English'
Language name: English with accents
Source language: English (United States)
Dictionary: Built-in dictionary
Alphabet -> click the [...] button at the end
Unicode subrange: Latin-1 Supplement
Drag select rows 2 and 3 and then individually click additional characters you don't want included (÷) or do want included (¡¿)
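To see roughly what that selection amounts to, here is a small Python sketch. It ASSUMES rows 2 and 3 of the dialog correspond to U+00C0 through U+00FF (the letter part of Latin-1 Supplement); check that against what the dialog actually shows. Note that filtering on letters also drops the multiplication sign, which the steps above don't mention.

```python
# Assumption: "rows 2 and 3" cover U+00C0..U+00FF. Keep only the letters,
# which leaves out the division sign (and the multiplication sign too).
letters = [chr(cp) for cp in range(0xC0, 0x100) if chr(cp).isalpha()]

# Inverted punctuation that is added back by hand.
extras = ["¡", "¿"]

alphabet = letters + extras
print("".join(alphabet))
print(len(alphabet), "characters")
```

This is only a way to list the characters; the alphabet itself still has to be set in the Language Editor dialog.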
Click Ok and Ok and then on the toolbar you should see the 'Document language:' dropdown now set to 'English with accents'
Post by allyk on Aug 11, 2013 6:01:48 GMT 1
OCR
OCR stands for Optical Character Recognition and transforms the page images into editable text.
Fortunately the software does all the hard work for you: just click the 'Read' button (or the 'Document > Read' menu option).
This will take a little bit of time as it analyzes each page.
(If you ever need/want to read just a selected page instead of all pages, you can select the desired page in the thumbnail pane and right click > 'Read Selected Pages' or 'Page > Read Page' on the menu.)
Post by allyk on Aug 11, 2013 6:04:56 GMT 1
Option 1: Stop Here
In general there are 2 main ways of producing a scan: either as a series of page images or as (formatted) text that can easily be converted to a number of mobile formats.
Page images can typically be saved as a series of jpgs or as an image-based pdf. Image files have the advantage of being fast and easy to produce and always 100% correct. The disadvantage is that they have a large file size and often don't work well on mobile devices or small screens.
Formatted text is far more flexible and can be easily converted to any number of formats, whether epub, mobi or even html, but it takes a lot more work to produce.
If you're just interested in scanning as many books as possible, if you simply don't care for the work involved in proofing, or if you simply aren't qualified to check for spelling and punctuation issues (hi trixie), you can save all your scans as image files and stop there.
What I would actually recommend is producing TWO files. The first is a searchable-image PDF (siPDF), which contains the page images plus the raw text from the OCR underneath the image. The second is the output of the raw OCR saved in an editable format like rtf, docx (NOT doc) or html. This second file is usually tagged with a (UC) at the end for 'UnCorrected' (for instance: 'Anna Sewell - Black Beauty (UC).html').
What this does is allow someone else the opportunity of proofing your scan and converting it to epub or whatever, even if they don't have FR.
For information on producing an siPDF and other formats, skip down to the post on 'Saving'.
Post by Claire on Aug 11, 2013 10:40:04 GMT 1
Thanks very much for this Ally. Just a quick question. If you already have the book scans done, I assume you can then open them with FineReader to do the OCR? I have a few I have scanned in the past (and I suspect other people will be the same) which I wanted to try and convert. I don't have the books any longer to re-scan.
Post by allyk on Aug 11, 2013 17:32:57 GMT 1
Absolutely, FineReader will open most any existing set of image files or even pdf files.
Just follow the same general steps: save your FineReader document, make sure the options are set, then drop the images on FineReader.
Post by allyk on Aug 11, 2013 18:47:42 GMT 1
Option 2: Producing Formatted Text
While the OCR is usually pretty good, there are always problems with it. Which is why the next step in producing formatted text is:
Proofing
This is the real meat of the operation and where you'll spend the majority of your time. It's also where we separate the pretenders from the contenders.
Screen Layout
You should be in the 'standard' 4 pane layout with thumbnails on the far left, page image on the left, page text on the right and zoom window at the bottom.
If you aren't, check that the following options are set in the menu:
- View -> Pages Window -> Show Pages Window
- View -> Pages Window -> Thumbnails
- View -> Pages Window -> Left
- View -> Images/Text Window -> Show Page Image and Page Text
- View -> Images/Text Window -> Highlight uncertain characters
- View -> Zoom Window -> Show Zoom Window
- View -> Zoom Window -> Dock Bottom
Setup
On the 'Save' button on the toolbar is a small down triangle. Make sure it is set to Word or PDF EVEN IF YOU WANT A DIFFERENT FORMAT. After changing the save format, it will prompt you to go ahead and save the file, but just cancel out of it.
Beside the 'Save' button is a dropdown box. Make sure it is set to 'Exact Copy' and not 'Editable Copy' or 'Formatted Text' or 'Flexible Layout'. What this does is allow lines on the text side to exactly match lines on the image side, which makes comparing them while proofing much easier.
Underneath the dropdown box are 2 buttons:
- set 'Keep pictures' on
- set 'Keep headers and footers' off
If you go to a page with headers/footers, you should see an empty green box where the header is on the text pane. This shows you exactly what it is removing from that page.
Proofing
This isn't complex but it does require good concentration and is absolutely vital to producing a quality scan.
Read the book in the text window, constantly referring to the image window if you have a question. Correct any errors as you find them.
The blue highlights are characters that the program isn't certain of. The blue won't show up in the final save (unless you want it to).
Don't worry so much about formatting, just go for textual correctness.
IF YOU NOTICE AN ERROR THAT WOULD PASS SPELLCHECK (for instance 'the' sometimes gets recognized as 'die') be sure to make a separate note of it so that when you are finished you can do a final check to see if the same error occurred in any other places that you missed when proofing.
It's just a good idea in general to keep a list of notes to yourself about special formatting (like smallcaps or lists or poetry or whatever) you want to revisit later to make sure it got formatted correctly.
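For that final check on errors that pass spellcheck, a whole-word search over the saved text does the job. Here is a minimal Python sketch; the function name, the sample sentence and the context width are all invented for illustration, and you still have to read each hit and decide whether it is really an error.

```python
import re


def find_suspects(text: str, suspects: list[str], width: int = 30) -> list[tuple[str, str]]:
    """Return (matched word, surrounding context) for every whole-word hit."""
    if not suspects:
        return []
    # \b keeps 'die' from matching inside 'died' or 'diet'.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, suspects)) + r")\b", re.IGNORECASE
    )
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append((m.group(1), text[start:end].replace("\n", " ")))
    return hits


sample = "She rode to die stable, and the pony followed. Nobody would die today."
for word, context in find_suspects(sample, ["die"]):
    print(f"{word!r}: ...{context}...")
```

Both hits are shown with their context, which is the point: the first is an OCR error for 'the' and the second is legitimate, and only a human can tell them apart.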
After you have finished reading it, click the 'Verification' button in the Text window or 'Tools > Verification...'
Split words - FineReader is usually fairly intelligent about putting words back together that have been split at the end of a line. You can tell whether it will combine words by looking for the line-continuation character (like a dash except it has a 'tail' that drops down on the right side). If you see the line-continuation character you know that it will join the words when saving. If this is incorrect, simply replace it with a regular dash. If you don't see a line-continuation character where there should be one, you will have to manually join the word yourself.
Congrats, you've now completed the most difficult part!
Post by allyk on Aug 11, 2013 19:13:53 GMT 1
Saving
Whether you are doing Option 1 or Option 2, you should ALWAYS PRODUCE BOTH A TEXT AND AN IMAGE FILE.
If you are just doing Option 1 and are focused on the image file, the UC (UnCorrected) text file will allow someone else to finish the process and produce a nicely formatted text file.
If you are doing Option 2 and are producing an epub (for instance), you STILL need to produce an image file for reference's sake, so if there is ever any question about the OCR, there is something to go back and compare it to.
siPDF (searchable image PDF)
Click the Save button on the toolbar or 'File > Save Document As > PDF Document'.
Click the 'Options' button on the bottom right of the Save dialog and make sure the settings are like this:
- Default paper size: Use original image size
- Save mode: Text under the page image
- Use Mixed Raster Content: Checked
- All other checkboxes: Unchecked
- Image settings: Best quality (source image resolution)
- Font settings: Use predefined fonts
Click OK to go back to the save dialog, name your file (for instance 'Anna Sewell - Black Beauty (siPDF).pdf') and click Save.
HTML
I use HTML, but if you prefer .docx (not .doc, Calibre doesn't handle those) or .rtf, the steps are similar.
Click 'File > Save Document As > HTML Document'.
Click the 'Options' button on the bottom right of the Save dialog and make sure the settings are like this:
- Retain layout: Formatted text
- Keep pictures: I process them separately, but whatever you want
- Picture settings (if you check 'Keep pictures'): Best quality (source image resolution)
- All other options: Unchecked (make sure to uncheck 'Use CSS' too)
Click 'OK' to close the options window and then 'Save' to create your file.
The main annoyance here is that setting retain layout to 'Formatted text' affects the view back in the main FineReader window. When proofing you want it on 'Exact Copy' so you can easily match the text to the image. But when saving you want 'Formatted text' because you don't want to preserve linebreaks and stuff like that. So once you finish saving, be sure to go and switch the view back to 'Exact Copy'. If you don't see 'Exact Copy', be sure to set the format on the save button back to PDF.
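Since every book ends up with both an image file and a text file, it helps to name the pair consistently. Here is a small Python sketch of the 'Author - Title (tag)' naming used in this tutorial; the helper and its parameter names are hypothetical, and it assumes the uncorrected (UC) case from Option 1.

```python
def output_names(author: str, title: str, text_ext: str = "html") -> dict[str, str]:
    """Build the image and text file names for one book."""
    if text_ext not in {"html", "docx", "rtf"}:
        raise ValueError("text_ext should be html, docx or rtf (not doc)")
    base = f"{author} - {title}"
    return {
        "image": f"{base} (siPDF).pdf",
        "uncorrected_text": f"{base} (UC).{text_ext}",
    }


names = output_names("Anna Sewell", "Black Beauty")
print(names["image"])             # Anna Sewell - Black Beauty (siPDF).pdf
print(names["uncorrected_text"])  # Anna Sewell - Black Beauty (UC).html
```

For a proofed file you would simply drop the (UC) tag.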
Post by allyk on Aug 11, 2013 19:19:15 GMT 1
Polishing - General
1. Go through your notes - Check all spots that need special formatting and check for any OCR errors you noted that might evade spellcheck
2. Check for common OCR issues
Search for instances of '1' and other special characters like:
< > * / \ # @ | ^
3. Spellcheck in Word. Yes, you spellchecked in FineReader which is good, but Word has some different/better proofing tools which will catch things you missed.
DON'T EDIT HTML DIRECTLY IN WORD! It creates an unholy mess.
Instead create a copy of the html file and open the COPY in Word. Then use the html editor of your choice to manually fix errors in the original as you find them in Word.
4. Paragraph check
This is a simple check that can make a big difference in the quality of your final output. On the left half of your screen put FineReader with basically just the image and thumbnail pane showing. On the right half of the screen open your document. Scroll through both, checking that the paragraphs match exactly. FineReader will sometimes mess up where it puts paragraph breaks, especially at page boundaries, where it likes to either unnecessarily split paragraphs or incorrectly join 2 separate paragraphs.
While this is mainly for paragraphs, keep your eyes open. I never fail to catch at least one other issue while going through it, whether it's related to special formatting or something else.
This will also make sure all section breaks show up.
5. Hyphen check
This is completely optional, but I like to find every instance of a hyphen (-) to check:
a) if there are words that should be joined
b) if they should actually be em-dashes
c) if they are just stray OCR marks
6. Final read through
This is another optional check and we're certainly reaching the point of diminishing returns, but I like to do a read through of the document in its final format to make sure I didn't miss anything.
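The special-character search (step 2) and the hyphen check (step 5) are easy to trip over in an HTML file, because a plain search for < or > or / hits every tag. Here is a hedged Python sketch that pulls out the visible text first and then lists the suspects; the class and function names are my own, and it only reports candidates, it doesn't fix anything.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text of an HTML document, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


SUSPECT = re.compile(r"[1<>*/\\#@|^]")  # characters from step 2
HYPHEN = re.compile(r"\w*-\w*")          # hyphens from step 5, with attached word parts


def report(html: str) -> dict[str, list[str]]:
    text = html_text(html)
    return {
        "suspect_chars": SUSPECT.findall(text),
        "hyphens": HYPHEN.findall(text),
    }


sample = "<p>1 saw the pony - a well-bred grey | trot past.</p>"
print(report(sample))
```

In the sample, the '1' is a misread 'I', the '|' is a stray mark, the lone '-' probably wants to be an em-dash, and 'well-bred' is fine as it is.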
Post by allyk on Aug 11, 2013 20:41:21 GMT 1
Processing Images
This is where I'm weakest, so I'm not going to give a specific workflow, but generally if you scan in grayscale you can increase the contrast some (but not too much) to clean up most of the background and then do some spot erasing to remove stray specks. Then I resize the image so it's no wider than 400 pixels for full-page images. Double-page images may get more, smaller images get less.
Creating epub/mobi
I have some tools to work on the html and I manually create epub and mobi files from that. The whole process is rather more complicated and involved than anyone here probably wants to mess with.
Fortunately there is an easier way. Once you have finished formatting your html/docx/rtf, you can use Calibre to automatically convert it for you. Now I don't use Calibre, so unfortunately I can't offer any help on this part, but I understand it is fairly straightforward and it has a large support community of people who would be happy to answer any of your questions about it.
You may notice that there's an option to save directly to epub from FR, but again I have no idea how well that works.
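For the 400 pixel width limit mentioned under image processing, the only thing to get right is keeping the aspect ratio. Here is a minimal Python sketch of just that arithmetic; the function name and the example scan size are assumptions, and the actual resizing would be done in whatever image editor you use.

```python
def fit_width(width: int, height: int, max_width: int = 400) -> tuple[int, int]:
    """Scale (width, height) down to max_width, keeping the aspect ratio.

    Images that are already narrow enough are returned unchanged.
    """
    if width <= max_width:
        return width, height
    return max_width, max(1, round(height * max_width / width))


# A hypothetical full-page scan: 6 x 9 inches at 300 dpi.
print(fit_width(1800, 2700))  # (400, 600)
print(fit_width(320, 480))    # (320, 480)
```

For double-page images you would pass a larger max_width, and small images pass through untouched.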
Post by allyk on Aug 11, 2013 20:44:02 GMT 1
Ok, that's all I have for now, if you have any questions be sure to let me know. I'd also be happy to review any work and offer suggestions if I can.
Post by allyk on Aug 12, 2013 20:03:52 GMT 1
And if you have existing image files you want to process but can't be bothered with FineReader, I can do that side of it for you and produce the siPDF and UC format of your choice (html/docx/rtf). Granted you do make it slightly more difficult by not having the side-by-side proofing in FineReader, but it's not bad.
Post by Claire on Aug 13, 2013 15:08:52 GMT 1
Wow, thanks for this scanning tutorial Ally - it's really excellent and comprehensive. I am going to give it a go later on in the week when I have some spare time.
Just a thought - you may want to change the title to something like scanning and conversion to e-book - if people are looking to do the conversion part only they may pass over the thread.
Post by allyk on Aug 13, 2013 17:00:18 GMT 1
Claire wrote: "Wow thanks for this scanning tutorial Ally - its really excellent and comprehensive. I am going to give it a go later on in the week when I have some spare time."
Thanks and good luck!
Claire wrote: "Just a thought - you may want to change the title to something like scanning and conversion to e-book - if people are looking to do the conversion part only they may pass over the thread."
How's that? I tried to make it more generic, but of course feel free to change it to whatever you want.
Post by Claire on Aug 14, 2013 21:42:41 GMT 1
Thanks Ally, that's great, the title sums it up perfectly. Will let you know how I get on. I will start with a short story.
Post by allyk on Aug 19, 2013 6:26:51 GMT 1
Updated the crop instructions slightly to be quicker and easier.