The strong search engine included in SharePoint is one of the reasons that enterprises around the world embrace SharePoint. Instead of relying on navigation to look for an item or document, users now rely on search every day, not only in the enterprise but in everyday personal use as well. When was the last time you navigated categories on eBay, Wikipedia, or Craigslist?
However, many documents, especially PDFs and scanned documents, are not searchable: even though they are in PDF format, they are simply an image inside a PDF. All this valuable information is invisible to search, and search-based features, including the new DLP in SharePoint 2016, cannot function on those documents. To search inside those documents you need an OCR (Optical Character Recognition) solution, and that’s what we will be reviewing today: a product called Aquaforest Searchlight, which works with SharePoint 2010, SharePoint 2013, and Office 365. Its main features:
- Audit document stores to determine which documents require processing.
- Document Stores are monitored to deal with new and updated documents.
- Dashboard provides a convenient summary of the state of all managed stores.
- Provides detailed conversion reporting.
- High Performance Multi-Core Support.
- Convenient GUI that enables management of all stores via a single interface.
- OCR support for over 100 languages, including Chinese, Korean and Japanese.
Aquaforest Searchlight is a client-side application, meaning that you don’t install anything on the SharePoint server; all the hard work is done on a client computer. Before diving into the product, let’s take a look at my goal for this review. I uploaded a TIFF file named Dracula that contains an extract of the novel by Bram Stoker.
After leaving it a few hours, I could find the file by the title in Office 365, but when searching for “Munich”, Office 365 returned nothing! Let’s try to fix that by using Aquaforest Searchlight.
After installing the application and its prerequisites, we need to add a Library in Aquaforest Searchlight. A “Library” here is not the same as a SharePoint document library; it can be an entire Site Collection.
After we click the “Add New Library” button we are guided through a wizard so we can select the exact settings we want for our Library.
On the Library Settings page we have multiple choices:
- Whether the Library we want to add is a SharePoint On-Premises, Office 365 or File Share Library.
- Whether we want to Audit only, or Audit and OCR. Audit means that Searchlight will analyze how many documents are not searchable, while Audit and OCR will find those documents and then make them searchable.
- The number of cores that we want the application to use. The application uses one core per document, so if we give it 10 cores, it can process 10 documents simultaneously. If you plan to OCR thousands or millions of documents on the first run, it may be a good idea to run the initial “transformation” on a virtual server and then move to a lower-performing machine for day-to-day processing.
- Since the application modifies each document in order to make it searchable, we can choose whether to turn versioning on if it’s off, whether to publish a major version with the new searchable document, and what the check-in comment should say. The “original” version will be kept as a past version, depending on the versioning rules you have on the library.
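The core setting lends itself to a quick back-of-the-envelope estimate. The Python sketch below assumes one document per core at a time, as described above, and, as a simplification, that every document takes the same time to OCR. The function name and the 30-second figure are hypothetical placeholders, not numbers measured from Searchlight.

```python
import math


def estimate_run_hours(doc_count: int, cores: int, seconds_per_doc: float) -> float:
    """Rough wall-clock estimate, assuming one document per core at a time."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Documents are handled in "waves" of up to `cores` documents each.
    waves = math.ceil(doc_count / cores)
    return waves * seconds_per_doc / 3600


if __name__ == "__main__":
    # 30 seconds per document is an assumed figure for illustration only.
    for cores in (1, 4, 10):
        hours = estimate_run_hours(100_000, cores, seconds_per_doc=30)
        print(f"{cores:>2} cores: ~{hours:.1f} hours")
```

Plugging in your own document count and a timing taken from a small test run gives a feel for whether the initial pass justifies a bigger machine.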
We then go to the Document Settings where we can specify the behavior for each document type and filter which documents get OCR’d.
- For PDFs we can select which ones to process, depending on whether they are already fully searchable, partially searchable, or not searchable at all.
- For TIFF files, we can select whether to process them and whether to delete the original, as the Searchlight application converts them to PDF files.
- We have the same settings for BMP, JPEG and PNG Files.
- We can select where the Temp Folder location is. The temp folder is where Searchlight will download files while it does the magic to make them searchable.
- We can select a date range for the library, so we don’t OCR all the old documents that provide no additional value in the Search Engine.
- This is a setting I personally loved seeing there: we can choose to retain all the original metadata on the document. So if a document gets downloaded and re-uploaded as a searchable version, those columns will remain the same! The “Check in Comment” we selected previously shows up in the Version History, and Modified / Modified By did not change even though the OCR took place a few days later!
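To make the filtering rules above concrete, here is a minimal Python sketch of how such a decision could be expressed. Every name in it (`DocumentRules`, `should_ocr`, the `"full"` / `"partial"` / `"none"` state labels) is hypothetical and only mirrors the wizard options; it is not Searchlight's configuration format or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical helper: treat common extension variants as the same type.
ALIASES = {"tif": "tiff", "jpg": "jpeg"}


@dataclass(frozen=True)
class DocumentRules:
    # PDF searchability states that should be sent to OCR (assumed labels).
    pdf_states: frozenset = frozenset({"none", "partial"})
    # Image types that should be converted to searchable PDFs.
    image_types: frozenset = frozenset({"tiff", "bmp", "jpeg", "png"})
    modified_from: Optional[date] = None  # inclusive lower bound
    modified_to: Optional[date] = None    # inclusive upper bound


def should_ocr(rules: DocumentRules, extension: str, modified: date,
               pdf_state: Optional[str] = None) -> bool:
    """Return True if a document passes the type, state and date filters."""
    ext = extension.lower().lstrip(".")
    ext = ALIASES.get(ext, ext)
    if rules.modified_from is not None and modified < rules.modified_from:
        return False
    if rules.modified_to is not None and modified > rules.modified_to:
        return False
    if ext == "pdf":
        return pdf_state in rules.pdf_states
    return ext in rules.image_types


if __name__ == "__main__":
    rules = DocumentRules(modified_from=date(2015, 1, 1))
    print(should_ocr(rules, ".tif", date(2016, 2, 1)))
    print(should_ocr(rules, "pdf", date(2016, 2, 1), pdf_state="full"))
```

Thinking of the settings as one yes/no decision per document makes it easier to predict which files a job will touch before running it.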
After the Document Settings are in place, we can select where we put our Archives, if we decide to keep them of course. The archives are all the original documents before Searchlight made them Searchable.
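Because both the temp folder and the archive location hold copies of documents, it is worth checking free disk space before a large run. The sketch below uses Python's standard `shutil.disk_usage`; the function name, the average document size, and the safety factor are assumptions chosen for illustration.

```python
import shutil


def has_room(folder: str, doc_count: int, avg_doc_bytes: int,
             safety_factor: float = 1.5) -> bool:
    """Check whether `folder`'s drive can hold the expected documents.

    `safety_factor` is an assumed margin for output files that end up
    larger than their originals.
    """
    needed = doc_count * avg_doc_bytes * safety_factor
    free = shutil.disk_usage(folder).free
    return free >= needed


if __name__ == "__main__":
    # Assumed example: 5,000 documents averaging 2 MB each.
    print(has_room(".", 5_000, 2 * 1024 * 1024))
```

Pointing `folder` at the temp and archive locations in turn shows whether either one should be moved to a larger drive.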
We then go to the OCR Settings, where we can choose between two engines: the Aquaforest OCR engine or the IRIS OCR engine.
I asked Aquaforest what the difference is, and the main one is that the Extended (IRIS) OCR engine supports more languages than the Aquaforest OCR engine. So if you need to OCR documents in more languages, make sure to select the extended choice. Both engines offer options such as rotating the image or deskewing the document.
After we select our OCR properties, we can move on to create a schedule for the library.
We can either run this job manually, or run it every day or hour, to keep our documents always searchable! After this, we select our Email settings if we want to receive emails when a job is done, or fails.
After we finish the wizard and start the job, the Aquaforest Searchlight tool first audits the document library and reports on its searchability (what percentage of the library is indexable), and then starts transforming the documents into searchable ones.
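The searchability figure is simply the share of documents that search can already index. As a rough illustration, the Python sketch below computes it from a list of per-document states. The state labels and the choice to count only fully searchable documents are my assumptions, not necessarily how Searchlight calculates its number.

```python
from collections import Counter
from typing import Iterable


def searchability_report(states: Iterable[str]) -> dict:
    """Summarize an audit from per-document states.

    Assumed labels: "full", "partial", "none". Only "full" counts as
    searchable here; an empty library reports 0.0 percent.
    """
    counts = Counter(states)
    total = sum(counts.values())
    searchable = counts["full"]
    percent = 100.0 * searchable / total if total else 0.0
    return {"total": total, "searchable": searchable, "percent": round(percent, 1)}


if __name__ == "__main__":
    print(searchability_report(["full", "none", "partial", "full"]))
```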
After the job was done and the Office 365 crawler had crawled the Site Collection, I could successfully search for the “Dracula” document and find text from inside it.
In this blog post we had an overview of the Aquaforest Searchlight tool, which allows enterprises to make their PDF and image documents searchable and so get additional value out of SharePoint. I found the Searchlight application really easy to use, and the 10 or so documents I uploaded were transformed pretty fast, even though I only gave it one core to do all the OCR. One thing you will need to be careful about is making sure that your temp and archive folders have enough free space if you need to OCR thousands of documents, as the downloaded documents can fill up the C: drive pretty fast.
The thing that I loved most is that the application can turn on versioning while making sure that important metadata such as “Created By” / “Modified By” and “Created” / “Modified” does not change after a document is transformed. Losing that metadata would have been a deal breaker for most companies, where those four columns are of significant importance.
I didn’t really find anything I didn’t like in the Aquaforest Searchlight tool, as it does everything that it says it does. I am happy to see that it works with SharePoint 2010, SharePoint 2013 and SharePoint Online, and it has been tested with the SharePoint 2016 RC, so I am sure it will work with SharePoint 2016 once the product is released!
With Data Loss Prevention becoming a more important topic for every company, and with the DLP features in SharePoint 2016 / SharePoint Online relying on search, having your documents in a fully searchable format is a real plus. If you’re looking for an OCR solution for SharePoint, make sure to check out Aquaforest Searchlight.