Blaise Information Systems (Blaise) executed a turnkey project for CLJ Legal Network Sdn Bhd (CLJLN) to provide a Legal Search Service on the Net.
The project had two components:
1. Writing the software for Web-based search and retrieval of Case Laws, Legislations and other relevant data.
2. Initial data preparation and data loading.
This document explains the process of Data Preparation and Data Loading.
The primary data belonged to two categories: (a) Case Laws dating back to 1900 and (b) all Legislations of the Federal Government, numbering over 600.
(a) Case Laws:
CLJLN had cases from the 1900s to 1999 at the time of implementation. Because the cases spanned more than 90 years, the data was available in many different formats:
(a) 1994-1999 cases were in PageMaker Version 5.x format.
(b) 1993-1998 cases were in PageMaker Version 4.x format.
(c) A partial set of cases was in RTF format.
(d) A partial set of cases was in printed books, amenable to scanning.
(e) A partial set of cases was in very old printed books.
The objective was to convert all these files into SOIF format with multiple fields (Citation, Court, Case Title, Case Number, Judges, Case Date, Head Notes, Cases Referred, Legislations Referred, Other References, Counsels, Judgment Text) to enable search and retrieval. The files were also to be converted simultaneously into PDF files of one page each, representing the exact image of the published page.
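To illustrate the target format, the following is a minimal sketch of how one case could be written out as a SOIF-style record. It assumes the usual SOIF layout of a template line followed by `Name{size}:<tab>value` attributes. The function name, the record identifier and all field values are placeholders invented for the example, not data or code from the project.

```python
def to_soif(identifier, fields, template="FILE"):
    """Serialize one record as a SOIF-style object.

    Each attribute line carries the byte length of its value in braces,
    so a reader can consume multi-line values without escaping.
    """
    lines = [f"@{template} {{ {identifier}"]
    for name, value in fields.items():
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Placeholder values only; a real record would carry all twelve fields.
record = to_soif(
    "case/example-0001",
    {
        "Citation": "EXAMPLE CITATION",
        "Court": "High Court",
        "Case Title": "A v. B",
    },
)
print(record)
```

Carrying the value length in each attribute line is what lets a long field such as Judgment Text span many lines without any quoting.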
The approach adopted for converting each of the above is detailed below:
(a) 1994-1999 cases in PageMaker Version 5.x format.
The 1994-1999 cases were in three different journals: (i) Current Law Journal, (ii) Business Law Journal and (iii) Industrial Law Reports. Each journal had its own formatting style, and the style had also changed over the years. There were about 10,000 cases, averaging 7-8 pages each.
Our approach was first to analyze the differing styles of content and presentation in each of the journals and prepare a comprehensive rule list. Preparing the rule set was the most important step. The rule set took into account the general formatting style, specific keywords, keywords at defined positions, and the order of contents, read both forward and backward. It was flexible enough to allow for minor variations in punctuation and spacing, and for spelling mistakes.
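A rule of this kind can be sketched as a keyword pattern that tolerates variations in case, spacing and punctuation. The two rules below are hypothetical examples written for illustration; the actual rule set was far larger and was implemented in C and filter programs, not in Python.

```python
import re

# Hypothetical rules: field name -> tolerant keyword pattern.
# \s* and the optional punctuation class absorb minor typesetting variations.
RULES = {
    "Case Number": re.compile(r"^\s*case\s*no\s*[.:]?\s*(.+)$", re.IGNORECASE),
    "Judges": re.compile(r"^\s*(?:before|coram)\s*[:.]?\s*(.+)$", re.IGNORECASE),
}


def apply_rules(lines):
    """Return the first value found for each field, scanning lines in order."""
    fields = {}
    for line in lines:
        for name, pattern in RULES.items():
            match = pattern.match(line)
            if match and name not in fields:
                fields[name] = match.group(1).strip()
    return fields


sample = ["Case  No : ABC-123", "CORAM. Judge A", "some judgment text"]
fields = apply_rules(sample)
print(fields)
```

Scanning the same lines in reverse order with a second rule table would be one way to use the backward ordering of contents mentioned above.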
The PageMaker files were converted to PostScript by generating simple scripts and using them to run PageMaker and produce PostScript files. A C program was written to read the PostScript files and apply part of the rule set to generate intermediate files. The intermediate files were then passed through a series of filters, each based on decision trees, to generate SOIF files, which are similar to XML.
Simultaneously, a PDF page was generated for each printed page using the PostScript files, Ghostscript and custom-built script files.
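As a rough sketch of the one-page-per-file step, the helper below builds a Ghostscript command line that uses the `pdfwrite` device with a `%d` page counter in the output file name. The file names and the helper itself are assumptions made for the example, and the exact options should be checked against the installed Ghostscript version; the scripts used in the project are not reproduced here.

```python
def ghostscript_command(ps_path, out_pattern="page-%04d.pdf", gs="gs"):
    """Build a Ghostscript command that writes one PDF file per page.

    The %04d in the output name is expanded by Ghostscript to the page
    number, so a 7-page input yields page-0001.pdf to page-0007.pdf.
    """
    return [
        gs,
        "-dBATCH",            # exit after the last input file
        "-dNOPAUSE",          # do not prompt between pages
        "-sDEVICE=pdfwrite",  # PDF output device
        f"-sOutputFile={out_pattern}",
        ps_path,
    ]


cmd = ghostscript_command("volume1.ps")  # hypothetical input file
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where Ghostscript is installed.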
The process of generating the rule set and writing the filter programs was iterative. It took about five iterations to attain an accuracy level of over 99.99%.
(b) 1993-1998 cases in PageMaker Version 4.x format.
For these files the approach was similar to that for the previous set. The path differed significantly up to the generation of the PostScript files, and the rule set was then altered to suit the formats used in these years.
(c) Partial set of cases in RTF format.
These were comparatively easier. There was one common format across all these RTF files and a uniform style had been used throughout. This made development of the filter much easier.
(d) Partial set of cases in printed books, amenable to scanning:
These pages were scanned and converted using OCR software. A team of data entry operators and proofreaders formatted the pages and verified them for content and for exact alignment with the original page. The pages were prepared in PageMaker 5. Once the typesetting was done, the conversion was a smooth process using the filters developed earlier.
(e) Partial set of cases in very old printed books.
These pages were not amenable to scanning. They were entered afresh and formatted in PageMaker. A team of proofreaders verified the content and formatting. Once the pages were available in PageMaker, the conversion process was smooth.
In both (d) and (e), the formatting was done using PageMaker so that the software written earlier for conversion as well as PDF generation could be reused. Because of the volume of work, items (d) and (e) took the longest to complete, over six months.
(b) Legislations:
All legislations were entered from the published books of the Government. They were keyed in and proofread. Programs were written to separate the legislations by section and to include a Table of Contents. This was a straightforward process without many complications.
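The section-splitting step can be sketched as follows. The sketch assumes that each section begins on its own line with a number, a full stop and a heading; that layout and the sample text are assumptions for illustration, and the programs used in the project may have relied on other cues.

```python
import re

# Assumed heading layout: "12. Heading text" or "12A. Heading text".
SECTION_HEADING = re.compile(r"^(\d+[A-Z]?)\.\s+(.+)$")


def split_sections(text):
    """Split the text of an Act into a list of sections."""
    sections = []
    current = None
    for line in text.splitlines():
        match = SECTION_HEADING.match(line.strip())
        if match:
            current = {"number": match.group(1), "title": match.group(2), "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(line)
    return sections


def table_of_contents(sections):
    """Build one Table of Contents entry per section."""
    return [f"{s['number']}. {s['title']}" for s in sections]


sample_act = (
    "1. Short title\n"
    "This Act may be cited as the Example Act.\n"
    "2. Interpretation\n"
    "In this Act, unless the context otherwise requires...\n"
)
sections = split_sections(sample_act)
toc = table_of_contents(sections)
print(toc)
```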
D&B’s associate in the Philippines was maintaining the local data. D&B wanted to merge the data with the database in Singapore. All the data available in Singapore was in RTF format and had to be converted to a structured format suitable for import into Access.
There were over 15,000 files, each averaging 6-8 pages of A4 size. The arrangement of the documents followed some pattern, but it was not uniform throughout. There were variations in content, positioning and style.
The problem was to identify the information content to be taken into Access. By going through a random sample of about 100 files, a rule set was defined that combined keywords, positioning within a region, and formatting such as bold, italics and font size.
A Perl program was written incorporating the rule set, and over three iterations the program was improved until it could process all the files. Simultaneously, name-matching techniques were used to eliminate duplicate entries.
The main challenge here was the time factor. The entire job had to be completed within 10 days, on-site in the Philippines. Our team of two completed the conversion as well as the duplicate removal within the 10-day period.
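The name matching used for duplicate removal can be illustrated with a small sketch. It normalizes company names and compares them with a similarity ratio from Python's standard `difflib` module. The suffix list, the 0.9 threshold and the sample names are assumptions chosen for the example; the original work was done in Perl and its matching rules are not documented here.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal-form words to ignore when comparing names.
SUFFIXES = {"inc", "corp", "co", "ltd", "company", "corporation",
            "incorporated", "limited"}


def normalize(name):
    """Lower-case, drop punctuation and strip legal-form suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def is_duplicate(a, b, threshold=0.9):
    """Treat two names as duplicates when their normalized forms are close."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def deduplicate(names, threshold=0.9):
    """Keep the first occurrence of each distinct name."""
    kept = []
    for name in names:
        if not any(is_duplicate(name, k, threshold) for k in kept):
            kept.append(name)
    return kept


names = ["Acme Trading Co., Inc.", "ACME TRADING COMPANY", "Bayan Foods Corp."]
unique = deduplicate(names)
print(unique)
```

Comparing every name against every kept name is quadratic, which is acceptable for a sketch; at 15,000 files a real run would more likely group names by a normalized key first.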
NCS is a collection agency based in Atlanta, to which we provide regular data services.
NCS sends the data of defaulters as scanned images using Internet/FTP. On average, one person’s data runs to about 7-8 pages of US Letter size. The pages are a mix of handwritten pages, typewritten pages, and typewritten pages with handwritten comments. We receive 200-300 files on average in a given batch.
These files are downloaded and sent for typing. Each defaulter’s data is entered into one or more rows in a spreadsheet, depending on whether the defaulter is an individual or a group of people. The relevant details can appear on any of the pages, so the data entry operators visually scan all pages to identify the data to be typed. A set of calculations is applied to itemise the amount owed. Each data entry operator enters 10 files at a time and, on completion, sends these 10 files to the next stage for verification.
The typed data is then processed by a filter program, which verifies the codifications, the telephone numbers, the area codes against the City/Region, the US postal codes and quite a few business rules. It prints an error list of definite errors as well as potentially doubtful cases, along with a neatly formatted half-page printout of the information in each spreadsheet row.
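The split between definite errors and doubtful cases can be sketched as below. The column names, the phone and ZIP patterns, and the small area-code table are assumptions made for the example; the business rules of the actual filter program are not shown.

```python
import re

PHONE = re.compile(r"^\(?(\d{3})\)?[ .-]?\d{3}[ .-]?\d{4}$")
ZIP = re.compile(r"^\d{5}(-\d{4})?$")

# Sample lookup only; a real table would cover every city in the data.
AREA_CODES_BY_CITY = {"Atlanta": {"404", "678", "770"}}


def check_row(row):
    """Return (definite errors, doubtful cases) for one spreadsheet row."""
    errors, doubts = [], []
    if not ZIP.match(row.get("zip", "")):
        errors.append("zip: not a valid US ZIP code")
    phone = PHONE.match(row.get("phone", ""))
    if not phone:
        errors.append("phone: not a valid 10-digit number")
    else:
        known = AREA_CODES_BY_CITY.get(row.get("city", ""))
        if known is None:
            doubts.append("city: no area-code data, verify manually")
        elif phone.group(1) not in known:
            doubts.append(f"phone: area code {phone.group(1)} does not match city")
    return errors, doubts


good = check_row({"phone": "(404) 555-0123", "zip": "30303", "city": "Atlanta"})
bad = check_row({"phone": "212-555-0123", "zip": "3030", "city": "Atlanta"})
print(good)
print(bad)
```

A malformed ZIP code is reported as a definite error, while an area code that merely disagrees with the city is left as a doubtful case for the verifier.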
The printed pages then go to the verifier, who once again checks the data against the scanned image, the printed page and the error list generated by the filter program. After verification, the verifier corrects any omissions or mistakes in the data.
All the smaller batches of 10 files are merged by a merging program into one big spreadsheet. This is again sent through the filter program for further checking. All the potentially doubtful cases are once again verified manually.
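The merging step amounts to concatenating the rows of each small batch under a single header. The sketch below assumes the batches are exchanged as CSV text with identical column headers; the column names and sample rows are invented, and the spreadsheet format actually used is not stated here.

```python
import csv
import io


def merge_batches(batch_texts):
    """Merge CSV batches that share one header row into a single table."""
    header, merged = None, []
    for text in batch_texts:
        reader = csv.reader(io.StringIO(text))
        batch_header = next(reader, None)
        if batch_header is None:
            continue  # empty batch
        if header is None:
            header = batch_header
        elif batch_header != header:
            raise ValueError("batch columns differ from the first batch")
        # Skip blank rows left behind by data entry.
        merged.extend(row for row in reader if any(cell.strip() for cell in row))
    return header, merged


batch_1 = "name,amount\nPerson A,10\n"
batch_2 = "name,amount\nPerson B,20\n\n"
header, rows = merge_batches([batch_1, batch_2])
print(header, rows)
```

Rejecting a batch whose columns differ catches a misaligned spreadsheet before the merged file goes back through the filter program.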
During the process, comments are prepared wherever there are doubts, a decision could be made in more than one way, or the data is insufficient to cover all the business rules. This comment list is sent along with the completed spreadsheet.
Based on the general volumes, we have committed resources such that any batch of up to 250 files is processed and sent back the same day.
© 2003 Blaise Information Systems. All Rights Reserved.