Indexing MDI files from Office Document Imaging  
RolfHa





PostPosted: SharePoint - Search, Indexing MDI files from Office Document Imaging Top

Hi, Microsoft Office Document Imaging (MODI) is a nice tool for getting paper scans into SharePoint. In the 2003 version, MDI files (the output of Microsoft Document Imaging) were included in search results by a default iFilter that scanned the metadata added by the OCR part of MODI. It seems that this has been discontinued in MOSS 2007. Losing this capability is a real problem. Does anybody have a workaround, such as installing the iFilter DLL on the server with some hacks or similar tricks?

Rolf




 
 
PsharkAuburn






The functionality is already available as long as you have MODI installed (either version 11.0 => Office 2003 or version 12.0 => Office 2007). For the past several months I've been able to get TIF and MDI indexing to work in both the WSS 3.0 betas and the WSS 3.0 final release, in the free framework installation as well as the full-blown MOSS 2007 installation. If this is still a current issue for you, please holler and post here; if this is a dead thread, just disregard. The necessary tweaking is pretty interesting, as it's very much buried, especially in the WSS 3.0 release (non-MOSS installations).
 
 
RolfHa






I have it installed and it is working on XP and SPS 2003 for me. I really would like to know how to make it work on Vista and SPS 2007. For me, this is the reason not to upgrade!

If you can help I would like to know the tweaks!

Thanks

Rolf



 
 
Ed






Hi, I'd love to see some info on how to make that work in WSS 3.

Also it seems that Document Imaging does not work on Vista-64 bit.

Does anyone have any info on that? It installed fine, but as soon as I open a TIF it crashes.



 
 
PsharkAuburn






[Upon previewing, I know the fonts are all screwy here; just copy and paste, or squint. I'll work on reformatting this after I get some sleep. I've written this for WSS 3.0, but if you're using MOSS 2007 the steps are virtually identical. You might just need to do a little Ctrl+F in regedit to get to the appropriate key location (I think it's under OfficeServer instead of Shared Tools, something like that).]

To install support for PDF / MDI / TIF / XPS indexing in SharePoint 3.0

Sorry for the delay. Yeah, MDI indexing was crucial for me as well, since the file sizes are much more compressed. The search filter settings live in 3 different locations, but they are all virtually identical in nature: they all relate to specific registry keys.

First off, you need to make sure the MDI filter is installed in SQL Server. Install Microsoft Document Imaging from the Office 2003 or Office 2007 setup (it is under Additional Tools if you do a custom installation). Then, in SQL Server, run the following query to list the installed filters: select document_type, class_id, version from sys.fulltext_document_types. This returns the extensions that are filtered, along with the CLSID of the component handling each one. If you have an entry for .mdi files, copy the CLSID somewhere you can reference it later. If you do not see .mdi listed but you have MODI installed, run the following SQL to enable the iFilter (something about Microsoft not signing the component, so SQL Server blocks it by default):

Code Snippet

exec sp_fulltext_service 'load_os_resources', 1;

exec sp_fulltext_service 'verify_signature', 0;
go

After this, re-run the first query to verify that .mdi files are now associated with the correct component (the CLSID should be 62160CBE-AFCB-4795-9B68-DDE5BA6D2524). With this done, it's time to modify the registry.

With WSS 3.0 installed, navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0 and export everything under this key for backup purposes.

Navigate to the subkey HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\89a9d417-5231-45b7-a9f5-08d4e043d8ce\Gather\Search\Extensions\ExtensionList. Your GUID may differ from mine, but the structure should be similar. In the extension list you will see entries named by number, with extension types as the data. Add the extensions you're interested in indexing (I added entries 38, 39 and 40 with data of "pdf", "mdi" and "xps" respectively).

Next, navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Default. Here there is an entry called "DefaultExtList" that tells which extensions are currently handled by SharePoint's default iFilter. I made sure to remove "Tif" and "Tiff" from this list, as they need to be handled through the MODI filter.

Next, navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension. This is where SharePoint maps extensions to the components that handle them. Insert keys following the existing layout, adding your extensions along with the CLSID of your MODI component (should be 62160CBE-AFCB-4795-9B68-DDE5BA6D2524). On a side note, if you pulled the SQL iFilter list in the first step, feel free to set up PDF indexing along the way: just point the PDF extension at the CLSID found in your SQL Server. Also, take care to point TIF and TIFF files to the MODI component as well.

Last step here: I went to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\Filters and followed the existing structure to tie my PDF and TIF extensions to the appropriate MIME types ('application/pdf' and 'image/tiff' respectively).
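
As a rough illustration of the ExtensionList step above, here is a minimal Python sketch that builds a .reg fragment adding new numbered entries after the highest existing one. The function name `build_extension_fragment` is made up for this example, the entries are assumed to be plain string (REG_SZ) values, and the sample "existing" entries are placeholders rather than real defaults. Check the output against an export of your own key before merging anything.

```python
# Hypothetical helper: builds a .reg fragment that appends extensions to ExtensionList.
# Assumption: entries are string (REG_SZ) values named by number, as described in the post.

EXTENSION_LIST_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions"
    r"\12.0\Search\Applications\89a9d417-5231-45b7-a9f5-08d4e043d8ce"
    r"\Gather\Search\Extensions\ExtensionList"
)


def build_extension_fragment(existing, wanted, key=EXTENSION_LIST_KEY):
    """Return .reg text that adds every extension in `wanted` not already listed.

    existing: dict mapping the numeric entry name (int) to its extension string.
    wanted:   extensions to index, without the leading dot.
    key:      full registry path; the GUID part will differ per installation.
    """
    present = {ext.lower() for ext in existing.values()}
    next_number = max(existing, default=0) + 1
    lines = ["Windows Registry Editor Version 5.00", "", "[" + key + "]"]
    for ext in wanted:
        ext = ext.lower()
        if ext in present:
            continue  # already in the list, nothing to add
        lines.append('"{}"="{}"'.format(next_number, ext))
        present.add(ext)
        next_number += 1
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Placeholder data: pretend the highest existing entries are 36 and 37.
    current = {36: "doc", 37: "xls"}
    print(build_extension_fragment(current, ["pdf", "mdi", "xps"]))
```

With the placeholder input, the new entries come out numbered 38, 39 and 40, which matches the numbering mentioned above. Pass your own GUID path through the `key` parameter.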

With this done, do another backup of this entire part of the registry in case things need to be redone. While I originally thought these changes would do the trick, after a server reboot I found that all the registry settings I had modified were automatically reverted to the original settings (I was pretty pissed). If this happens, you can simply execute the backed-up .reg file to merge your changes right back in.

After my registry was put back to normal, I figured there must be saved copies of this part of the registry somewhere (SharePoint must do some comparison against factory default settings and overwrite if it notices changes/corruption). The copy of the registry is actually held in your SQL Server database for SharePoint, as a binary field in a table. Specifically, in your Search database, the table MSSConfiguration has a record with Name = 'RegistryBlob' whose BigValue field holds the binary data containing the original registry information.

I wrote a little .NET application that retrieved the bytes from this record and wrote them to the file system. Open the file in a text editor and you will see it is just another .reg file. You need to make the same updates in this file that you made to the actual registry, so that upon comparison SharePoint will not find any differences. This is where cutting and pasting from the backup .reg files I mentioned previously makes things very easy. After finishing the file updates, modify your application to read the bytes from the file and update the existing record with the new binary data.

Lastly, searching around the hard drive I found another file, located around Program Files\Common Files\Microsoft Shared\etc..., called 'registryblob.reg', which was a third copy of the same registry information. I simply overwrote this file (after making a backup) with the same file I was uploading in binary back to the MSSConfiguration table. With this done, you should be good to go.

Again, we're making the same registry modifications in 3 different locations: once in the actual registry, once in a .reg file on the hard drive, and once in the binary data held in the Search database table MSSConfiguration. Reboot the server, upload a .mdi/.tif/.pdf file, wait 10 minutes or so, and run a search against content in the file. Should be jackpot. The following is the code I used to write the binary into/out of the table (pretty basic stuff):

Code Snippet

'Needs at the top of the source file:
Imports System.Data
Imports System.Data.SqlClient
Imports System.IO

'SqlConnection4 is a SqlConnection pointing at the SharePoint Search database.
'Both routines go in the class or module that owns it.

'Read binary to local file
Sub ExportRegistryBlob()
    Dim sql As String = "SELECT BigValue FROM MSSConfiguration WHERE Name = 'RegistryBlob'"
    Dim sqlcom As New SqlCommand(sql, SqlConnection4)
    sqlcom.Connection.Open()
    Dim bytes() As Byte = CType(sqlcom.ExecuteScalar(), Byte())
    sqlcom.Connection.Close()
    File.WriteAllBytes("c:\db_original.reg", bytes)
    MsgBox("done")
End Sub

'Write binary back to database
Sub ImportRegistryBlob()
    'An UPDATE is needed here; @data carries the bytes of the edited file
    Dim sql As String = "UPDATE MSSConfiguration SET BigValue = @data WHERE Name = 'RegistryBlob'"
    Dim sqlcom As New SqlCommand(sql, SqlConnection4)
    Dim data() As Byte = File.ReadAllBytes("c:\db_modified.reg")
    'Size -1 means "max", so blobs over 8000 bytes are not truncated
    sqlcom.Parameters.Add("@data", SqlDbType.VarBinary, -1).Value = data
    Try
        sqlcom.Connection.Open()
        sqlcom.ExecuteNonQuery()
        MsgBox("done")
    Catch ex As Exception
        MsgBox(ex.Message)
    Finally
        sqlcom.Connection.Close()
    End Try
End Sub
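
Since the same registry data ends up in several places, a quick byte-for-byte comparison of the copies can help before rebooting. Below is a minimal Python sketch, assuming you have the blob exported from the database and the registryblob.reg from disk available as ordinary files. The function names are made up for this example and the file paths are passed on the command line rather than guessed.

```python
import hashlib
import sys
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copies_in_sync(paths):
    """Return True when every file in `paths` has identical bytes."""
    digests = {file_digest(p) for p in paths}
    return len(digests) <= 1


if __name__ == "__main__":
    files = sys.argv[1:]
    if len(files) < 2:
        print("usage: python compare_blobs.py FILE FILE [FILE ...]")
    else:
        for name in files:
            print(file_digest(name), name)
        print("in sync" if copies_in_sync(files) else "copies differ")
```

The comparison works on raw bytes on purpose: a .reg file saved in a different text encoding would count as a different file, which is probably what you want to know before writing it back to the table.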


 
 
Jerry






I have used scanned TIFs in my document archive for years, and up until WDS 3 I was able to search on them. Now that I have installed Vista and Office 2007, the search on the OCR-ed text is gone. I would appreciate any advice on how to get it back.
 
 
Steven Thong






I managed to get WSS 3.0 to index the TIF files after following the walkthrough provided here. However, I have to manually run "Recognize Text with OCR" in MODI before uploading a file to WSS in order to search its content. Previously, in WSS 2.0, the MODI component automatically OCR-ed any TIF file uploaded to WSS, so we could search the content of the image, but that is not working here despite following the steps provided. Any idea how to get the OCR part to work?
 
 
andrewwiseman






I'm running both SQL commands, but nothing is getting identified for .mdi. Any ideas?
 
 
Bob






I am having the same trouble with .mdi not appearing in the document_type list. I'm using SQL 2005; could that be the problem?
 
 
Jerry_D






I don't have SQL Server installed. Is there a similar approach for computers without SQL Server?
 
 
PsharkAuburn






No, SQL Server 2005 handles this task quite nicely. The iFilters necessary for indexing .tif, .mdi and other extensions are bundled with Microsoft Office, specifically with the document imaging software, which you need to make sure is installed. The easiest way to guarantee this is to go to your Office installation, choose a customized install, and make sure all components under Office Shared AND Office Tools are selected for installation. With this done, run the database stored procedures:

exec sp_fulltext_service 'load_os_resources', 1;

exec sp_fulltext_service 'verify_signature', 0;
go

Now, dump the SQL Server document types table and check that the .mdi extension is being indexed through SQL:

select document_type, class_id, version from sys.fulltext_document_types
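
If you want to check the result of that query from a script, the idea can be sketched in Python as below. This is only an illustration: `missing_filters` is a made-up helper, it works on rows you have already fetched by whatever means, and it assumes document_type values come back with a leading dot (for example ".mdi").

```python
def missing_filters(rows, wanted=(".mdi", ".tif", ".tiff", ".pdf")):
    """Return the wanted extensions that have no row in the query result.

    rows: iterable of (document_type, class_id, version) tuples, as returned by
          select document_type, class_id, version from sys.fulltext_document_types
    """
    installed = {str(row[0]).strip().lower() for row in rows}
    return [ext for ext in wanted if ext.lower() not in installed]


if __name__ == "__main__":
    # Made-up sample rows; the .mdi CLSID is the one quoted earlier in the thread.
    sample = [
        (".doc", "(some clsid)", "(some version)"),
        (".mdi", "62160CBE-AFCB-4795-9B68-DDE5BA6D2524", "(some version)"),
    ]
    print("still missing:", missing_filters(sample))
```

An empty list means every extension you care about has a filter registered in SQL Server. Anything still listed after running the two sp_fulltext_service calls suggests the filter itself is not installed on that machine.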


 
 
Sharedot






I did all these things but did not get the content in search results. The file name of the TIFF shows up in search, but the content does not.

For PDF I get all the content in search, but I need the TIFF content to show up in search as well.

Any help?


 
 
Sachin Dekate






Follow all the steps mentioned by PsharkAuburn, but modify the registry entries under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0

instead of HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0

It will then search the TIFF content.



 
 
PsharkAuburn






Good call. That sounds about right: I have WSS 3.0, not MOSS 2007. I'd expect that for Microsoft Office SharePoint Server (MOSS) there would be some slight deviation in the registry key location, but the overall layout should be pretty darn similar. Make sure that you register both .TIFF and .TIF in the procedures above. Out of curiosity, do files with the .mdi extension get OCR-indexed correctly?
 
 
andrewwiseman






Should the path for SQL 2005 be used in all instances? I don't have the path for adding keys 38, 39, 40 under Office 12, but it does appear under

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\89a9d417-5231-45b7-a9f5-08d4e043d8ce\Gather\Search\Extensions\ExtensionList

On a test server I am running, this location is missing, but it is there on our live server. Obviously I would like to test it on the test server first.