The view of the IT team was that if the source code to 4thWrite, the underlying database technology, could be obtained, these problems could be resolved. Additional functionality could then be added at the 'C' level, protecting the investment in INFOSTAT for a considerable time, and the database technology could be migrated to a lower-cost platform, giving us a development environment, dual running during the relocation, and cheaper maintenance. These requirements led to the start of the Walton Centre Linux project.
THE ESCROW AGREEMENT, SOURCE CODE AND LINUX
When the INFOSTAT HISS system was purchased in 1992 the Walton Centre had sought assurances from CHC(UK) Ltd. that, in the event of the company's failure, the investment in INFOSTAT would be protected. CHC assured the Walton Centre that they had entered into a software escrow agreement and that tapes containing the software source code had been placed in the hands of a third party, the National Computing Centre (NCC). However, two important things were neglected: (1) CHC did not opt for the NCC's verification service, which would have confirmed the contents of the tapes; and (2) the Walton Centre did not check the validity of the legal agreement between CHC and the NCC.
When the new IT management came to apply for access to the source code in April 1995, it was discovered that the escrow agreement was legally flawed and that, although other CHC clients had received copies of the source code tapes, the Walton Centre's right to the source code was unclear. The responsibility for this lay with CHC, which had failed to lodge the correct documentation; the NCC was neither responsible nor at fault.
Eventually in March 1996, after complex legal discussions between the Walton Centre and CHC's receivers, KPMG, the Walton Centre received copies of the two source code tapes.
At this point we should introduce the two main protagonists:
- Eric Taylor (Senior Analyst/Programmer). Eric is the Walton Centre's UNIX, 'C' and 1960s guru who thought it was about time PC users got themselves a proper operating system.
- Neil Spencer-Jones (Head of IT), at the time a committed PC LAN person who thought UNIX was for ex-hippies like Eric.
The first tape, containing the INFOSTAT source code already in the Walton Centre's possession, was easily read by Eric, being in simple UNIX cpio format. The second tape, containing the 'C' source code for 4thWrite (Version 8), could not be read. It was not a tape error, as the contents could be read raw by the UNIX utility dd; the problem was that no-one knew how the tape had been written. Some ex-CHC staff and staff in the NCC's escrow department assisted, but no solution could be found. Two external data recovery agencies were sent copies of the UNIX dd output files, but to no avail. Hex dumps of the tape contents led us to believe the tapes had been written on a DEC Alpha running OSF/1. We asked for assistance in various Internet newsgroups, and a kind person in the USA offered to read the dd file on his DEC Alpha, but this failed. Eric then spent a few weeks looking at hex dumps of the dd file, trying to work out the data structures, as they clearly weren't streamed. It looked like a file system image, but what file system?
In the meantime Neil was looking for an alternative platform to run the application on. The original thought was SCO UNIX. However, during the Easter holiday break, Neil was looking for something to play with while his wife, Claire, was revising for her MA final exams, so he downloaded the basic Slackware 3.0 Linux system from the Internet. He loaded it on a spare 486DX2-66 PC which he was building for Claire, and he was hooked.
Over the next few weeks Linux CDs were obtained and books borrowed from the local library. It was difficult to believe that here was a true multi-user operating system that was more robust than Windows95, and had better connectivity than any LAN operating system, yet was developed by enthusiasts and available for free. Neil soon had his own TCP/IP network at home using his PC running Windows95, the 486DX2-66 PC (Claire never got it; she still has to use an old 8086) and a 68030-based Amiga running Workbench.
During April Neil installed Linux on an ACER ACROS P75 at the Walton Centre and even Eric was impressed.
[For further information on Linux, the GNU UNIX clone, see ]
To allow us to investigate the source code tape contents further, the output file from the tape dd was moved to this new Linux PC. Some days later, for some reason, Neil ran the 'file' command on the dd output file and, to his amazement, Linux identified it as a little-endian 'new' file system dump (so it looked as though the DEC Alpha guess was right). The HP9000 box we had been working on did have the dump and restore programs, but these wouldn't touch our file or tape regardless of command line options, as these are big-endian boxes. A recent version of the BSD-derived dump and restore source code was obtained and compiled for the Linux box, but to our dismay this reported errors and wouldn't process the file. After a few days' head scratching we discovered that the HP-UX dd command was reading 2048 byte blocks off the tape but only outputting the first 512 bytes of each to the dd output file and throwing the rest away! Once we had this cracked we could read the source code tape on the HP9000 DAT, FTP it to the Linux PC and process it with our Linux version of BSD restore. This produced some 80Mb of source code. The real fun could now start.
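As an aside, the dd pitfall is worth a sketch. Below is a minimal illustration of the record-preserving copy that was needed; the device path and record size are assumptions for illustration, not the exact values used:

    /* copy fixed-size tape records without truncating them to 512 bytes */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define RECSZ 2048                    /* full tape record size */

    int main(void)
    {
        char buf[RECSZ];
        ssize_t n;
        int in  = open("/dev/rmt/0m", O_RDONLY);  /* hypothetical DAT device */
        int out = open("tape.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (in < 0 || out < 0) { perror("open"); return 1; }
        while ((n = read(in, buf, RECSZ)) > 0)    /* read a whole record... */
            if (write(out, buf, n) != n) {        /* ...and write all of it */
                perror("write"); return 1;
            }
        close(in); close(out);
        return 0;
    }

The equivalent dd incantation is to force matching input and output block sizes rather than relying on the defaults.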
PORTING THE CODE
Looking at our live system we calculated we could squeeze a copy of the basic database configuration into about 2GB, plus space for the executables and source code. As we were still in the very early stages of development, and are a publicly funded organisation, we did not want to spend any money until we were more sure of the project's success. So we decided to begin the development on the ACER P75 desktop, which we upgraded with 32Mb of RAM and four 800Mb IDE drives using parts from the IT department's service stock. And guess what? It fell over whenever the disks on the second IDE interface were accessed. By now we had installed Linux on nearly all the IT staff's home machines and knew that this was unusual. After some playing around we decided that the problem must be the ACER, so we looked around for another PC. The only machines available were some multimedia Compaq Presario 7106s (486DX4-100s). These have a slim-line case and are not very expandable, but with some sticky tape, bubble wrap and a soldering iron we managed to squeeze in the four drives and 32Mb of RAM, and development could begin. We were using Linux Slackware 3.0 with the 1.2.13 production kernel, as this was known to be stable. We did not want to use any of the later development kernels, even those that appeared stable, as that would introduce too many variables into our development.
At this point it is worth noting the specification of the HP-UX machine we were hoping to migrate from: an HP9000/H30 with dual PA-RISC processors, 4GB of disk and 160Mb of RAM. Could we equal this power on a PC platform?
We put the 'C' code on our Compaq and began examining it. First, there was a directory called documentation that was, of course, empty. So we had to work totally blind with just the source code, which contained some sparse and cryptic comments.
It was clear from the code that over time the system had been compiled for various UNIX machines/flavours, including SCO, DGUX, MIPS, HP and SUN, and there were compile flags for these versions. So our spirits were lifted somewhat, as the code looked pretty portable. People may be surprised at our scepticism, but our experience with CHC software had shown us that they sometimes did some wacky things! As there was a makefile lying about, we decided to run make with the SCO flags, this being an 80x86 UNIX flavour. Did it compile? Did it heck! The libraries and headers were all over the place, and there were lots and lots of tiny libraries. Eric took the source home and spent a weekend sorting out the libraries and headers, removing duplication and generally tidying it up. The following Monday we tried again.
The 4thWrite application comprises some 30 executables. At our next attempt only two programs wouldn't compile, and on investigation these turned out to be some irrelevant junk left behind by the original programmers. So we copied some data off the live system, along with all the dictionaries and application configuration files, and began testing.
The first thing that has to be done is to start the backend database. This depends on a script that calls three executables which read the (undocumented) configuration files and set up the database accordingly. The first of these executables sets up and configures a shared memory segment for use by various database caches, but it would not run under Linux. On the HP9000 this shared memory segment was 10Mb; investigation showed the executable was failing because Linux only supports a maximum of 4Mb per shared memory segment. So we altered our source and the configuration files to support multiple segments, using different segments for the various caches, and ended up defining three 4Mb segments. The first executable would now run and define and configure the shared memory.
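A minimal sketch of the multi-segment approach follows; the keys and segment count are illustrative assumptions (the real executable reads its values from the configuration files):

    /* create several 4Mb shared memory segments in place of one 10Mb one */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define SEGSZ (4 * 1024 * 1024)   /* the 1.2.x kernel's per-segment limit */
    #define NSEGS 3

    int main(void)
    {
        int i;
        for (i = 0; i < NSEGS; i++) {
            /* hypothetical well-known keys, so the other database
               executables can attach to the same segments */
            int id = shmget((key_t)(0x4000 + i), SEGSZ, IPC_CREAT | 0600);
            if (id < 0) { perror("shmget"); return 1; }
            printf("cache segment %d: key 0x%x, id %d\n", i, 0x4000 + i, id);
        }
        return 0;
    }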
When we tried the second start-up executable, which initialises the database domain, it failed with a 'cannot access shared memory' error. We knew this read the same configuration file as the first executable and should have been OK, but we could see no problem. Time was spent running the debugger (gdb), but nothing obvious was spotted. One thing that was slightly unusual was that we had created three segments and given them shared memory ids of 0, 1 and 2, but when examined with the ipcs command they had ids of 0, 128 and 256. We considered this a quirk of Linux. However, one night it dawned on Neil, while investigating the problem on his home Linux system, that this was a little-endian/big-endian problem; Intel systems are little-endian. He suggested this to Eric who, sorting through the source code, found a little-endian compile option. We tried this and, bingo, the backend database could be initialised and all the management tools worked, including SQL.
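The trap is easy to demonstrate. Here is a minimal sketch, using an assumed 32-bit key value, of what a program written for big-endian machines sees on Intel:

    /* the value 1, as written to disk by a big-endian machine */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static uint32_t swap32(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0x0000ff00UL) |
               ((v << 8) & 0x00ff0000UL) | (v << 24);
    }

    int main(void)
    {
        unsigned char disk[4] = { 0x00, 0x00, 0x00, 0x01 };
        uint32_t key;

        memcpy(&key, disk, 4);    /* a naive read on a little-endian CPU */
        printf("naive read  : %lu\n", (unsigned long)key);          /* 16777216 */
        printf("byte-swapped: %lu\n", (unsigned long)swap32(key));  /* 1 */
        return 0;
    }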
We now turned our attention to the front end executables and fired up our application. To our amazement and joy, up popped our application sign-on screen. We went into the application and could see the data, and everything appeared to be working. However, although we could manipulate existing records, the application fell over whenever we attempted to write a new record into the database. Interestingly, we could quite happily add records using the backend management tools. What was the problem?
Eric spent several days with the debugger to find the problem. The cause was that, when a database record is written by the front end applications, the status of the file systems is checked, and some of the system calls used to return file system information were returning invalid structures. The default data structures defined by the headers being used (statfs.h) are wrong for the Linux ext2 file system; the correct structures for ext2 are the same as those for HP-UX (vfs.h), for which there was a compile option. The relevant library source was hacked to replicate the HP-UX flag as a LINUX flag and the system re-compiled. Everything in the front end now worked.
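A minimal sketch of the call in question, using the Linux/HP-UX-style vfs.h header; the fields shown are the standard ones rather than anything specific to 4thWrite:

    /* query file system status via the vfs.h structure, as on HP-UX */
    #include <stdio.h>
    #include <sys/vfs.h>

    int main(void)
    {
        struct statfs fs;

        if (statfs("/", &fs) != 0) { perror("statfs"); return 1; }
        printf("block size %lu, free blocks %lu\n",
               (unsigned long)fs.f_bsize, (unsigned long)fs.f_bfree);
        return 0;
    }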
We now turned our attention to the print system. 4thWrite uses its own print system and spooler, which is a bit cumbersome but certainly superior to the standard UNIX spooler. It features print routing based on user id and/or terminal and/or report name. This is very flexible: a report can follow a user who logs on at a different terminal, but a report that needs special stationery will still come out on the printer with that stationery loaded. The print spooler also directly supports various remote printing and socket printing protocols. However, the source code we had was for 4thWrite version 8 and we were currently running version 7 on the HP9000. The major difference was a revised print spooler, so we had to re-define all our printers; not a problem, just a bit tedious, as we have about 30 printers and hundreds of routing definitions. When this was done, though, the spooler refused to start. Digging around with the debugger we found that the spooler initialisation executable starts the various printer processes by starting nice'd, nohup'd background shells (bad programming or what!). The problem was simply that the command line parameters for nice hard-coded into the executable differed from those of GNU nice on Linux. A few lines of code were changed and everything compiled and ran fine.
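A hypothetical sketch of the shape of the fix; the flag strings here are assumptions for illustration, not the actual values from the 4thWrite source:

    /* start a printer process as a nice'd, nohup'd background shell */
    #include <stdio.h>
    #include <stdlib.h>

    #ifdef LINUX
    #define NICE "nice -n 10"   /* GNU nice takes -n <adjustment> */
    #else
    #define NICE "nice -10"     /* the sort of syntax hard-coded for other UNIXes */
    #endif

    int main(int argc, char **argv)
    {
        char cmd[512];

        if (argc < 2) {
            fprintf(stderr, "usage: %s printer-process\n", argv[0]);
            return 1;
        }
        snprintf(cmd, sizeof cmd, "nohup %s %s >/dev/null 2>&1 &", NICE, argv[1]);
        return system(cmd);
    }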
Quite often during the porting we came across code we didn't like, such as the nice'd background processes instead of daemons, but we took an early decision to alter as little as possible: we had a robust application on the HP that we wanted to replicate on the Linux system. We would look at enhancements at a later date, when we could be sure we were not introducing additional unknowns into the project.
MOVING TO A BIGGER PROCESSOR
Now that we had the applications installed and running on our Compaq we did some tests. Surprisingly, our 100MHz DX4 with 32Mb of RAM appeared a bit faster than our dual-processor HP9000. The application on the HP9000 runs 24 hours a day, 365 days a year (the database technology supports on-line backups), so we could never get all the users off the HP for a real head-to-head test, but at night, when there were only a few users, the PC was faster. We did some testing in the IT department, getting all the IT department PCs logged on plus some spare ones in finance, and with 16 users the performance seemed fine, although the machine was just starting to use the swap file. Our HISS application has over 150 users, but both the 24-hour nature of the organisation and the type of the applications mean 20 concurrent users is fairly normal. We therefore felt that moving to a fast Pentium with a bit more RAM would be more than sufficient.
So two things began: Anne Billinge, our systems manager, began intensive testing of the application, comms, gateways etc., while Ian Porter, our IT project manager, set about finding a suitable processor. We read all the hardware FAQs and also posted to the Internet Linux newsgroups, and received quite a bit of advice, some of which we used, some we didn't. We decided to build our own systems rather than buy ready-made ones.
The final spec was:
- 133MHz Intel Pentium
- Intel Triton PCI Motherboard with 512k Cache
- 4 x Maxtor 2GB EIDE drives
- Adaptec 2940 PCI SCSI Controller
- Sony SCSI CD-ROM drive
- Archive 8GB SCSI DAT drive
- 64MB RAM
- Fast NE2000 Plus compatible ethernet card
- Full Tower case
Ian built one of these machines, plus an additional one for development with only 32Mb of RAM, three disk drives, no DAT drive and an IDE CD-ROM. Linux was loaded on both machines and they were named MARX and ENGELS. We chose the EIDE disk drives because 8GB gave us plenty of capacity, enough at current expansion rates for about eight years. The EIDE drives are cached (512K), fast (10ms), guaranteed for five years and cost about £200 each. EIDE will easily outperform SCSI-2 drives, and only with (Ultra) Fast-Wide SCSI would we better the performance, but that would have cost considerably more and could not be justified. We originally wanted PCI ethernet cards but found the support was patchy in the 1.2.13 kernel, so we opted for reliability above performance; our applications are not network intensive anyway.
When the machines were built we moved the application off the Compaq and copied the full databases from the HP9000. We then had over two months of further testing. A few minor bugs had been noticed in the earlier testing, requiring two re-compiles, but after that the system testing ran perfectly. We took advantage of the time to improve our backup systems and enhance the terminal definitions to use colours. All went exactly to plan. Because of the nature of our organisation there were one or two things, such as the gateway to the nursing systems, we couldn't test fully without extended downtime, which we are loath to have (all systems at the Walton Centre have achieved 99.2% availability over the last year).
During the testing we wrote up the required project information for our systems management procedures log book, including instructions on how to rebuild the Linux system, plus all applications and data, after a failure. We also made two complete sets of recovery floppy disks to be stored in our data safe along with a Slackware Linux CD-ROM. To test these procedures we trashed the system, and each member of the IT staff then had to re-build it with only the instructions and no other help.
On Tuesday 15th September at 17:00 the HP9000 was shut down and the live database and dictionaries were copied to the Linux system. The IP addresses of the two boxes were swapped, so as far as all the terminal servers, X-terminals and the Novell client-server access were concerned, the Linux box was now the old HP and vice versa. The plan was for the system to be live again by 20:00. However, when we brought everything back up, the gateway to the nursing systems, one of the few things we couldn't test, refused to work.
This was eventually diagnosed as being due to the entries in the /etc/hosts.equiv file. The underlying technology behind this gateway is 'rsh', and the 'rsh' sessions were failing due to security restrictions. Once this was diagnosed and the correct entries put in /etc/hosts.equiv, the gateway worked and the whole system went live at 22:30.
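For reference, hosts.equiv entries are simply one trusted host per line, optionally followed by a user name; the hostname below is illustrative, not the real gateway's name:

    # /etc/hosts.equiv on MARX -- hosts trusted for rsh without a password
    nursing-gw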
Over the next few days there were a few obscure minor problems, mainly associated with the new version 8 printer spooler using a different method of printing over TCP to printers on terminal servers. These were all resolved. The switch-over was therefore very smooth and has not been noticed by the users, except that a few people have spotted that the node name has changed to MARX. The system runs beautifully and is noticeably faster than the HP9000. Interestingly, the day we went live we had, by pure chance, a peak workload with about a 20% increase in concurrent access, and there were no problems whatsoever.
BENEFITS
- The Walton Centre now has a live and a development machine, meaning development can be done off-line.
- A cost saving of about £5,000 a year will be realised in reduced maintenance.
- When relocating to the new site there will be a saving of £35,000 in rental charges for dual running: an overall saving of about £12,000 a year over five years.
- There is now the option to keep the backend database, drop the 4thWrite 4GL, and re-develop the front end in 'C' and/or develop further LAN-based client-server applications.
- The Hospital's main systems now run in a truly open environment.