SECTION II DATA BASE Kenneth H. Cook and Lynn Trump

This section describes the development, organization, and growth of three separate data bases used by SUPARS/DPS II. To give a broad overview of how data flows through the system, Appendix A presents a description of the utilization of data as it is channeled through the various programs. The three data bases that were developed were: (1) Document Data Base: bibliographic citations and abstracts (documents) from Psychological Abstracts (January 1969 - June 1971); (2) Vocabulary Data Base: all unique terms identified through the free-text processing of documents and stored in the DPS inverted file; and (3) Search Data Base: search inquiries originally entered by users to search the document data base, including the search words and Boolean operators. A data base of documents was developed during the previous year's experimental work in 1969 - 71; the vocabulary and search data bases were developed as new searching algorithms for the 1970 - 71 research.

1. DOCUMENT DATA BASE

The document data base was developed from machine-readable tapes of Psychological Abstracts rented from the American Psychological Association. The general rules for exclusion of common words, special character handling, and sentence and paragraph endings were the same as reported in the July, 1971 report. (2) Because of formatting changes and character set changes made to the 1971 tapes by the American Psychological Association, modifications had to be made to the SUPARS/DPS programs in order to process the data. First, program TRANSLATE had to be modified to deal with the standard IBM scientific character train, which uses upper case characters, rather than the previously used special multiple alphabet set of characters.
Second, program REFORMAT had to be modified to accommodate (a) changes of document "fields" by APA and (b) the new use of fixed length fields as pointers to the actual variable length fields of data, rather than the previous use of all variable length fields. A description of the general logic and a flowchart of the TRANSLATE and REFORMAT programs are given in Appendix B. A chart of the fourteen fields (author, title, source, abstract, etc.) used to organize the data for each document is also given in Appendix B. Figure 1 of Appendix B lists the fields as found on the original APA tapes before translation; Figure 4 of Appendix B gives the translated fields that have been reformatted and


are ready to be processed (loaded) by the SUPARS/DPS programs. After passing through a VALIDATE program, which checked each document against the maximum allowable length and deleted those which were too long, documents were ready for processing, or loading. Documents from January 1969 through November 1970 were processed through the two-phase DPS program that developed an inverted file of free-text words and documents. The loading history of the SUPARS/DPS II document data base is shown in Table I. A total of 46,828 documents were loaded into the data base. The number of unique words and terms derived from free-text indexing of these 46,828 documents amounted to 106,702. Documents were loaded in six separate batches, as shown in the six rows of Table I, and covered the months of January 1969 - June 1971. A graphic representation of the growth of the data base is shown in Figure 6, "Growth Rate of SUPARS/DPS Data Base." Each of the six numbered points on the chart represents a separate batch of documents that was loaded, and the points correspond to the six batches of documents shown in Table I.

Another indication of the size of the document data base is the number of tracks used on the main storage devices, the 2314 and the 3330 disk packs. The track usage is shown in Table II. Three separate files are maintained in SUPARS/DPS: (a) Dictionary, containing each unique word and the document postings, (b) Vocabulary, containing unique words and all document numbers which contain the word, and (c) Master, containing a coded representation of the entire document. For each file, comparisons are given of the number of tracks allocated and the number of tracks used on the 2314 disk pack (used with the 360 operating system) and the 3330 disk pack (used with the 370 operating system).

During two separate periods in the growth of the data base, the vocabulary file had to be reduced in size, or restructured, in order to fit into the available storage room on the 2314 disk packs. A restructure program allows for a more compact and efficient storage of the large strings of document numbers listed after each unique word in the vocabulary. The first restructure was accomplished while the 360/50 operating system and the 2314 disk packs were in use, and was done on the data base of January 1969 - April 1970. As a result of this restructure, the amount of storage space needed on the disk packs for the vocabulary was reduced by 45%. The smaller document number strings contributed to both an increased search efficiency and a greater amount of available work space. A second restructure was necessary when the 370/155 operating system was installed in December, 1971 and the 360/50 system was removed. The relative track addresses on the new 3330 disk packs differ from those on the old 2314 packs, which necessitated a restructure of data for the vocabulary, dictionary, and master files. The reduction in size of the three files from the 2314 to the 3330, in terms of the number of tracks used, is given in Table II. For


[Table I. Loading history of the SUPARS/DPS II document data base, in six batches: Jan-June 1969, July-Sept 1969, Oct-Dec 1969, Jan-Apr 1970, May-Sept 1970, and Oct 1970-June 1971. The table was printed sideways and its figures are not recoverable from this copy.]

[Figure 6. Growth Rate of SUPARS/DPS II Document Data Base. Horizontal axis: DOCUMENTS** (IN THOUSANDS), marked 10 to 50. Plotted points, one per batch: Jan-June 1969, July-Sept 1969, Oct-Dec 1969, Jan-Apr 1970, May-Sept 1970, Oct 1970-June 1971.]

* SCALES ARE NOT IN PROPORTION
** SEE TABLE I FOR EXACT FIGURES
[Table II. Number of tracks allocated and used by the Dictionary, Vocabulary, and Master files on the 2314 and 3330 disk packs. The table was printed sideways and its figures are not recoverable from this copy.]

example, in the dictionary, for the documents of January 1969 - April 1970, the number of tracks used on the 2314 was 286, while the number of tracks used on the 3330 disk pack was only 156 after the restructure. By the time the total restructure was completed, only 1 1/2 3330 disk packs, as compared with four 2314 disk packs, were used to store the document data base, the search monitor and DPS search modules, and the search data base (explained below in sub-section 3, "The Search Data Base").
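The relationship among the three files described in this sub-section can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the DPS implementation: the tokenizer, the common-word list, the use of a running integer as word identifier, and the names `COMMON_WORDS`, `free_text_words`, and `build_files` are all hypothetical, and the real files are disk records rather than in-memory dictionaries.

```python
import re
from collections import defaultdict

# Hypothetical common-word list; the actual DPS exclusion list is not given here.
COMMON_WORDS = {"A", "AN", "AND", "IN", "OF", "OR", "THE"}


def free_text_words(text):
    """Split text into upper-case words, dropping the assumed common words."""
    words = re.findall(r"[A-Z0-9]+", text.upper())
    return [w for w in words if w not in COMMON_WORDS]


def build_files(documents):
    """Build toy stand-ins for the Dictionary, Vocabulary, and Master files.

    documents: dict mapping document number -> document text.
    """
    dictionary = {}                 # word -> word identifier (assumed: running integer)
    vocabulary = defaultdict(list)  # word -> ascending document numbers containing it
    master = {}                     # document number -> coded document (word identifiers)
    for doc_no in sorted(documents):
        coded = []
        for word in free_text_words(documents[doc_no]):
            wid = dictionary.setdefault(word, len(dictionary) + 1)
            coded.append(wid)
            postings = vocabulary[word]
            if not postings or postings[-1] != doc_no:
                postings.append(doc_no)
        master[doc_no] = coded
    return dictionary, dict(vocabulary), master
```

In this toy form, looking up a word in `vocabulary` returns the document numbers that contain it, which is the lookup an inverted file exists to make fast.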

2. VOCABULARY DATA BASE

A second major data base that was made available on-line to users of SUPARS/DPS was the entire free-text vocabulary. The vocabulary contains all the unique words taken from the free-text processing of documents and stored in the inverted file. The vocabulary, called the "dictionary" in the original DPS programs, is actually a by-product of the free-text processing. The total size of the vocabulary reached 106,702 words by the time documents from January 1969 - June 1971 were processed. Figure 7, "Cumulative Growth Rate of the SUPARS/DPS Vocabulary Data Base," shows the cumulative growth of the vocabulary data base for each batch of documents processed. In order to make the words contained in the vocabulary accessible in an on-line mode for users to query, special search programs had to be developed. The term "delta V" is used to describe these types of searches, because the characters "delta" and "V" are typed by the on-line user to access the vocabulary data base.

a. Special Programs Developed for a Vocabulary Data Base

Two new search modules had to be developed to handle the Vocabulary Data Base. The programs were necessary in order to access the vocabulary as a data base rather than have it operate in its normal capacity as a special handling procedure for the DPS Dictionary file. The vocabulary control program is derived, with a great deal of modification, from the existing programs of the standard DPS search routine, and is called by the search control program for all Delta V searches. The vocabulary control program reads user input line-by-line, storing the first user-entered search word. If the user has not established a limit to the number of words to be printed as output, the program sets it to 100; that is, a maximum of 100 items (words) will be output. The program then calls the interface program, which is a modification of the DPS program that actually locates the words in the file of all free-text words. Two output options are currently available. The first simply calls for a frequency count of documents in which the user-entered word appears and returns the record address of the word to the control program. In the case of a search where the user enters a word prefix, or root, and wants all words containing the root to be printed out, a maximum of 100 words at a time are located that contain the root. The string of data record addresses of these


[Figure 7. Cumulative Growth Rate of the SUPARS/DPS Vocabulary Data Base. Vertical axis: FREE-TEXT VOCABULARY SIZE (IN THOUSANDS). One plotted value is legible in this copy: 77,626.]

words is then returned to the control program. In both cases, when control is returned to the vocabulary program, it reformats the output for the user and collects statistics that are stored for later retrieval. A third option, which would allow the user to request the printing of words preceding and following a specified word in the vocabulary, was never implemented; programming was begun, but time restrictions did not allow the work to be completed. The major difficulty in implementation rests in the complicated structure of the DPS dictionary file. The embedded master and block index records greatly complicate the problem of reading and collecting words preceding a specified word, although the task of reading words following the word is not as difficult. In addition to the problems of developing new search modules to access the vocabulary data base on-line, another major programming effort in the development of the vocabulary data base was the expansion of the total word capacity of the vocabulary.

b. Expanding the Capacity of the Vocabulary Data Base

One limiting factor in the original DPS program was that the size of the vocabulary (dictionary) file was limited to 65,534 free-text words because of the half-word, or 16-bit, coding scheme (2^16 = 65,536). When this limit was reached, a new file had to be defined. However, the new file would collect all new words, including those in the first file. Each file of documents would have to be searched separately with its own vocabulary. If a larger capacity could be developed for the vocabulary, only one file would be necessary, which would make for less cumbersome and more efficient searching of the data base. A second advantage would be the opportunity for the user to access a complete file of all free-text words that would be available to him in developing search inquiries. To increase the size of the vocabulary and maintain it in one file, the half-word, or 16-bit, coding of a word identifier was changed to full-word coding of 32 bits. This change increased the capacity of the vocabulary file from approximately 65,000 words to over 4 billion words, as 2^32 equals 4,294,967,296. In order to process 32-bit word identifiers, the format of all data records containing the word identifier field (WID), and all programs creating and referencing those records, had to be altered. The WID fields in the Dictionary, Master, and Master Identification Update records, and those in the dictionary record area in the search DSECT, were increased from two bytes (half-word) to four bytes (full-word). Consequently, the relative addresses of all subsequent fields in these records were displaced. The loading programs (PRELOAD, SORT and DCftD), two search modules (KEYWORD and POSITIONAL NOTATION PROCESSOR), and all three versions of the dictionary interface program (DICTIO) had to be changed to handle the new record formats.
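The effect of widening the word identifier can be sketched with Python's standard `struct` module. The toy record layout below (a WID followed by one 2-byte count field) is an assumption made only for illustration; the actual DPS record formats are not reproduced here, and the names `pack_record`, `offset_after_wid`, and `max_identifiers` are hypothetical.

```python
import struct

HALFWORD_WID = ">H"  # 2-byte unsigned integer, big-endian
FULLWORD_WID = ">I"  # 4-byte unsigned integer, big-endian


def max_identifiers(wid_format):
    """Number of distinct bit patterns a WID of this width can hold."""
    return 2 ** (8 * struct.calcsize(wid_format))


def offset_after_wid(wid_format):
    """Displacement of the first field that follows the WID in a record."""
    return struct.calcsize(wid_format)


def pack_record(wid, postings_count, wid_format):
    """Pack a toy record: WID, then a 2-byte count (hypothetical layout)."""
    return struct.pack(wid_format, wid) + struct.pack(">H", postings_count)
```

With the half-word format, `pack_record(106702, 3, HALFWORD_WID)` raises `struct.error` because 106,702 does not fit in 16 bits, while the full-word format accepts it. The field after the WID moves from displacement 2 to displacement 4, which is the kind of shift that forced the displacement changes in every program touching these records.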
Programming included converting pertinent half-word instructions to full-word instructions, changing the displacement values for references to all fields following WID in the data records, and altering record-length calculation routines to take into account the new field length.

In addition to the changes necessitated by the change to full-word word identifiers, substantial changes were made to all search modules. First, unnecessary coding was deleted to save space. All instructions and variables referring to the synonym, equivalent, and text files were removed, as were unused routines handling weighted keywords and unlabelled search statements. Second, in order to save space in the search monitor and facilitate handling the many different output requirements, the output formatting formerly handled by the search monitor was incorporated into the search programs, and new formatting routines were written where necessary. Finally, all data entered by the user or written to him by the search monitor program were written into separate intermediate disk files by the search and monitor programs, and file access routines were added to all affected programs.

3. THE SEARCH DATA BASE

The third major data base developed for SUPARS/DPS II was the search data base, which contained the collection of search inquiries that had been previously submitted to the system and subsequently stored during October-December, 1970, and November-December, 1971. The development of this data base simply meant processing each search through the SUPARS/DPS loading programs in the same manner as the documents had been. This loading process created a data base containing its own dictionary, vocabulary, and master file. The searches were loaded in two separate batches: those of the 1970 period of system operation and those from the 1971 period. A summary of searches loaded and the number of free-text words generated from those searches is shown in Table III. Table IV gives an indication of the size of the three files contained in the complete search data base by the number of tracks used on the two different disk storage packs.
The DPS Dictionary file contains all unique words; the Vocabulary file contains each unique word followed by a string of the document numbers in which it appears; and the Master file contains a coded representation of the entire document as entered into the system.

TABLE IV
GROWTH OF SEARCH DATA BASE

Batch    Dates         Searches   Total Searches   New Words   Total New
                       Loaded     Loaded           Added       Words Added
#1       Oct-Dec '70   2,409      2,409            12,016      12,016
#2       Nov-Dec '71   1,826      4,235             5,477      17,493
TOTALS                 4,235                       17,493

[Table: number of tracks allocated and used by the Dictionary, Vocabulary, and Master files of the search data base on the two disk packs, for the 1970 and 1971 batches. The table was printed sideways and its figures are not recoverable from this copy.]

The development of this search data base was made in three stages, which are discussed in more detail in the paragraphs below: (1) definition of a data base description, including conversion specifications for document data fields and punctuation conventions; (2) modification of the DPS search module and reformat programs in order to load and search the data base of searches in an on-line, interactive mode; and (3) specification of the various forms of search output available to the user.

a. Definition of Data Base Description

Definition of this data base description (DBD), especially the character handling statements for the loading programs, required special care to ensure that routine SUPARS/DPS processing of input records did not result in the loss of essential data. Specifically, this essential data included the Boolean operators AND and OR, which DPS defines as common words, and certain punctuation marks (i.e. , ; ()), which standard processing defines out of existence. These problems were solved by (a) adding special characters to the words, or converting the symbols we wanted to retain, so that the system would store them, and (b) deleting the extra characters and reconverting the symbols at output time so they appeared normal to the user. Example 1, Figure 8, contains a sample of a user's previously entered search before any input processing. Example 2 shows how the same search would be stored by DPS if no special reformatting were performed before input processing. Note that all parentheses and Boolean operators would have been deleted and the relationships among the search words would be lost. In Example 3, the search appears in the form it would have taken when the reformat program was run before input processing. In the bibliographic field, dummy characters, which are unprintable, were inserted around Boolean (AND, OR) and other logical operators (?, +, ;) to preserve them from automatic elimination in the standard DPS processing. These dummy characters are converted to parentheses and blanks at output time, as seen in Example 4. In the text field, which contains the actual characters to be searched on, all the added characters are brackets. The adding of brackets around Boolean and other operators forms a new word, i.e. [OR] instead of OR. In this way, these new bracketed words become acceptable search words that can be entered in a search inquiry by the user.
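The bracketing technique for the text field can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the reformat program itself: the common-word list, the tokenizing pattern, and the function names `standard_processing` and `bracket_operators` are hypothetical, and the unprintable dummy characters used in the bibliographic field are not modeled.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR"}
# Assumed common-word list; DPS is said to treat AND and OR as common words.
COMMON_WORDS = {"AND", "OR", "THE", "OF"}


def standard_processing(line):
    """Mimic standard input processing: drop punctuation and common words."""
    words = re.findall(r"\[[A-Z]+\]|[A-Z0-9]+", line.upper())
    return [w for w in words if w not in COMMON_WORDS]


def bracket_operators(line):
    """Wrap Boolean operators in brackets so they survive as words (OR -> [OR])."""
    def protect(match):
        word = match.group(0)
        return f"[{word}]" if word.upper() in BOOLEAN_OPERATORS else word
    return re.sub(r"[A-Za-z0-9]+", protect, line)
```

Without the reformat step, `standard_processing("(ANXIETY OR STRESS)")` keeps only the two search words; after `bracket_operators`, the operator is stored as the searchable word `[OR]`.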
This development of new searchable words allows searching of the stored-search data base using Boolean operators as keywords, and allows the system operator to study and analyze the various types of logic used by previous users. The complete Data Base Description can be found in Figure 1 of Appendix III.

b. Modifying Search Module and Reformat Programs

The search reformat program was written as two modules which are combined by the linkage editor to form a single load module. Appendix III contains flowcharts and program descriptions of each module. In addition, the output record description found in Figure 3 (Appendix III) gives an annotated list of all fields in the search data base records.


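[Figure 8. Examples 1 through 4 of a stored search, shown before input processing, after standard processing, after the reformat program, and at output time. The content of the figure is not recoverable from this copy.]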

In the flowchart of Appendix III, the main reformat program, BIBFLDS, reads as input the search statistics records collected by the STATPAC program. It formats all bibliographic fields except NDOCPR and those fields which must be extracted from the user's search entries: keyword and list statements, logic, and a list of all keywords entered. These fields are prepared when the second module, FORMSRCH, is called. This program also creates the text format of the keyword lines and list statement. Control is then returned to the main program, which writes this first output record and then formats and writes the second record, consisting of the remaining text data.

c. Specification of Output Forms

Output processing of the two forms of output available to the user proceeds under the control of specially written routines. These routines allow the user to request (a) LIST SEARCH, which yields a list of all search inquiries that were retrieved, and (b) LIST WORDS, which yields a listing of all the unique words contained in the searches that were retrieved. The output of the LIST SEARCH option in a search inquiry of the search data base is designed to closely approximate its input format. Because each labeled line of input, such as L1, L2, etc. in Figure 8, appears as a separate line in the printed output, it becomes necessary to check the SRCHA bibliographic field to determine the internal end-of-line indicators which are inserted by the reformat program. The LIST WORDS option prints for the user, in order of decreasing frequency, all words that were found in the searches retrieved by a search inquiry. For this option a sort routine accumulates the WRDA and WRDB fields, sorts the words in frequency order, deletes duplicates, and formats output lines for printing.
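The LIST WORDS processing described above amounts to counting, de-duplicating, and sorting. A minimal Python sketch follows; the input shape (one list of words per retrieved search), the alphabetical tie-break, and the name `list_words` are assumptions made for illustration, not details taken from the DPS sort routine.

```python
from collections import Counter


def list_words(retrieved_searches):
    """Return (word, frequency) pairs in order of decreasing frequency.

    retrieved_searches: iterable of word lists, one list per retrieved search.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter()
    for words in retrieved_searches:
        counts.update(words)  # accumulating into a Counter also removes duplicates
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For three retrieved searches containing STRESS three times, ANXIETY twice, and [OR] once, the function returns the words in that order with their counts.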

4. SUMMARY

This section has described three major data bases that were developed for the SUPARS/DPS II research. On-line access was made available to (1) a document data base, (2) a vocabulary data base, and (3) a search data base. Data were presented to show the growth rate of the units contained in each of the three data bases: documents (bibliographic citations and abstracts) for the document data base, free-text words for the vocabulary data base, and search inquiries for the search data base. In addition, the programming changes that were necessary to make the vocabulary and search data bases accessible and searchable in an interactive, on-line mode were explained.

5. IMPLICATIONS AND PROJECTIONS

The successful development and implementation of interactive algorithms based on SEARCH and VOCABULARY data bases has implications for the future use of human intelligence to augment the user's searching negotiations.


For example, under the present experimental system, a user has the ability to search through the list of free-text terms in the vocabulary for potential search words. Once the user has a sufficient number of terms, search inquiries are entered and documents retrieved. One alternative that could be considered is the development of what is known as a synonym-equivalent list (SEL). In standard DPS, terms considered to be synonyms are appended to the free-text terms and are used in retrieval much as an individual manually uses "See also" references in a controlled index. For example, in the batch-mode version of DPS, the SEL would take a user's search word in an inquiry, together with its equivalent or synonym words initially supplied by the system operator, and process the inquiry using not only the user-submitted terms but the equivalent terms as well. However, this process requires the intervention of a human indexer or subject expert to determine synonym words and build a list of equivalent words. One projection of the synonym-equivalent idea, using human intelligence without the intervention of a decision by a selected group of indexers or experts, would be the following. Because all search inquiries previously submitted by users are stored in the search data base, the word associations in this body of inquiries could be used as input data to automatically generate a synonym-like equivalent list. For example, if an inquiry originally entered the intersection of the term "Achievement" and the term "Motivation," then a synonym-like term related to "Achievement" would be "Motivation," since the user originally decided to associate the two words.

These word associations from inquiries could be automatically incorporated into the SUPARS/DPS inverted file for the document data base using the available SEL module. The user would then have the option of entering an inquiry and selecting or not selecting the synonym-equivalent function. If the synonym-equivalent words were kept in order of decreasing frequency, the user could also select a "cutoff" figure of the first "x" terms available in the list, rather than having his search inquiry processed on the basis of all the terms in the listing. Initially, batch-mode updates of the equivalent list would be made periodically from the search inquiries of the search data base. Another related projection of this idea of an automatically generated synonym-equivalent list based on human intelligence would be the ability of the user to interactively query the list. In effect this option would form a "synonym-equivalent data base" that could be interrogated by the user while developing alternative inquiries, or could be employed as an automatic means of extending the original search inquiry. Some of the advantages of a synonym-equivalent list developed in this way would be that (1) human intermediaries, experts, or indexers would not be needed to make decisions about word associations and links, (2) the equivalent list would represent a current and up-to-date option based on the cumulative intelligence of the user population, and (3) the user would have the ability to interactively query the equivalent list as an additional source of potential keywords for a search inquiry. The potential disadvantages of a humanly generated equivalent list would be the necessity for continual updating of the list, and the fact that terms in the list might not be considered realistic synonyms by some users. The capability for developing and implementing these projections of the current work is available in terms of the programming and systems changes necessary. The larger issue in the overall system is the cost-performance level of the equivalent list in relation to the other interactively available algorithms in the SUPARS/DPS system.
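The projected equivalent list can be sketched in Python as a co-occurrence count over stored inquiries. This is only a sketch of the idea under assumptions: each inquiry is reduced to the list of terms it intersected, association strength is taken to be the number of inquiries in which two terms appear together, and the names `build_equivalent_list` and `expand_term` are hypothetical. It does not describe the existing SEL module.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_equivalent_list(inquiries):
    """Map each term to its co-occurring terms, most frequent first.

    inquiries: iterable of term lists, one list per stored search inquiry.
    """
    pair_counts = defaultdict(Counter)
    for terms in inquiries:
        # Count every ordered pair of distinct terms once per inquiry.
        for term, other in permutations(sorted(set(terms)), 2):
            pair_counts[term][other] += 1
    return {
        term: sorted(others, key=lambda o: (-others[o], o))
        for term, others in pair_counts.items()
    }


def expand_term(term, equivalent_list, cutoff):
    """Return the term plus its first `cutoff` equivalents (the user's cutoff "x")."""
    return [term] + equivalent_list.get(term, [])[:cutoff]
```

With stored inquiries pairing ACHIEVEMENT with MOTIVATION twice and with TESTS once, a cutoff of 1 expands ACHIEVEMENT to ACHIEVEMENT and MOTIVATION only, which is the selective use of the list that the cutoff figure is meant to allow.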


APPENDIX I. SYSTEM OVERVIEW


[Figure 1. System Overview. Block diagram; legible box labels: USER INPUT/OUTPUT (TERMINAL); DPS MONITOR; SPECIAL SEARCH PROGRAMS; SEARCH STATISTICS; SUPARS/DPS SEARCH; DATA BASE DESCRIPTION; TRANSLATE; REFORMAT; VALIDATE; LOAD; PSYABSTS DICTIONARY, VOCABULARY, and MASTER files; SEARCH DICTIONARY, VOCABULARY, and MASTER files. The diagram is divided into a LOADING PHASE and a SEARCH PHASE.]

APPENDIX II

TRANSLATE PSYCHOLOGICAL ABSTRACTS

TRANSLATE is a BAL program written to run on the IBM S/360 Model 50. It examines tapes of Psychological Abstracts records on a character-by-character basis for membership in a selected subset of valid characters. When possible, invalid characters are converted to valid ones and the record is written to an output file; when this is not possible, the record is written to an error file.
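The byte-by-byte logic of TRANSLATE can be sketched in Python as follows. The sketch is hedged throughout: it uses ASCII bytes rather than the tape's actual character code, the valid set and the replacement table are invented for illustration, and the names `VALID`, `REPLACEMENTS`, `translate_record`, and `translate_tape` are hypothetical. The original is an assembler (BAL) program whose tables are not reproduced in this report.

```python
# Assumed valid character set and conversion table (illustrative only).
VALID = set(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:()-'")
REPLACEMENTS = {ord(c): ord(c.upper()) for c in "abcdefghijklmnopqrstuvwxyz"}


def translate_record(record):
    """Return (translated bytes, error count) for one input record."""
    out = bytearray()
    errors = 0
    for byte in record:
        if byte in VALID:
            out.append(byte)
        elif byte in REPLACEMENTS:
            out.append(REPLACEMENTS[byte])  # convertible: make the replacement
        else:
            out.append(byte)                # not convertible: count an error
            errors += 1
    return bytes(out), errors


def translate_tape(records):
    """Split records into a translated list and an error list."""
    translated, rejected = [], []
    for record in records:
        result, errors = translate_record(record)
        if errors:
            rejected.append(record)   # written unchanged to the error file
        else:
            translated.append(result)
    return translated, rejected
```

A record such as `b"Anxiety (1970)"` comes out as `b"ANXIETY (1970)"` on the translated list, while a record containing a byte outside both tables goes to the error list with its error count.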

Computer Definition
1. IBM S/360 Model 50
2. Three 2400 tape drive facilities and 9-track tapes
3. Model 1403 Printer
4. Core requirements:
   a. Assembler 140K
   b. Linkage Editor 128K
   c. Program execution 34K

System Description
1. Operating System: Syracuse University Operating System (SUOS)
2. Assembler Level F translator program
3. Linkage Editor Level F program

Program Description
This program reads an input tape of Psychological Abstracts records and examines each byte to ensure that it is a member of the character set defined as valid by the American Psychological Association. Certain characters are translated to more meaningful APL characters. Records containing invalid characters are written to an error tape and a message to that effect is printed. Valid records are written to a tape of translated documents.

Input
Each tape contains a copyright statement as the first record. It is followed by the data records and an end-of-file mark.

Data records are composed of 4 different types of fields: fixed length fields, directory fields which reference variable length fields, variable numbers of fixed length fields, and the variable length fields themselves.

Output
1. TRANSLATED tape - see Input record description

Displacement   Data Item                             Comments

Fixed Fields
0-3            Record length                         Binary control field
4-5            Generation code
6-9            Year
10-11          Volume number
12-13          Issue number
14-18          Abstract number
19-20          Type of publication
21-26          Journal title code
27-30          Language
31-34          Availability

Other comments printed beside the fixed fields (their alignment with individual fields is not recoverable): "Numeric code for 1971 documents only; blanks for 1970 documents" and "blank or FRGN".

Directory Fields (ten 4-byte fields at displacements 35-38, 39-42, 43-46, 47-50, 51-54, 55-58, 59-62, 63-66, 67-70, 71-74)
               Subject index codes
               Subject index phrase
               Author subdirectory
               Designator other than first author
               Affiliation of first author
               Publication title
               Source document title
               Source document description
               Abstract
               Abstractor's name

All directory fields are right justified numerics, blank-filled on the left. The fields contain the displacement of the first byte of the corresponding variable length data field relative to the first data byte (generation code) of the record.

Variable Number of Fixed Length Fields
75-76          Number of classification codes        This is the last field which begins at a fixed displacement.
               Classification codes                  Each is a 6-digit code
               Number of subject index codes
               Subject index codes                   Each is a 5-digit code

Variable Length Fields (all are left justified)
               Subject index phrase
               Author subdirectory                   2 characters: right justified count of number of authors; then 2 characters each: number of characters in each author's name
               Author(s)
               Designator
               Affiliation
               Publication title
               Source document title
               Source document description
               Abstract
               Abstractor's name

Figure 1. Input Record Description

[Figure 2. Translate Psychological Abstracts Logic Diagram. Legible box labels: GETREC; TRANSLATE: TEST EACH BYTE; CNVRS: CONVERT TO VALID CHARACTER (YES/NO branches); PRINT ERROR; FINISH.]

[Figure 3. Translate Psychological Abstracts Flow Chart (two pages). Legible box labels: BEGIN; PARAMETER CARD; READ IN VOLUME AND ABSTRACT RANGE; ERROR MESSAGE PRINTED; TRANSLATE EACH BYTE OF INPUT; ADD 1 TO ERROR COUNTER; MAKE REPLACEMENT, SET UP TO CONTINUE TRANSLATION; SAVE ABSTRACT NUMBER, SET UP TO CONTINUE TRANSLATION; FINISH.]

2. ERROR tape - see Input record description
3. PRINTED OUTPUT
The following messages may be printed:
(a) TAPE HAS NO COPYRIGHT MESSAGE
(b) ABSTRACT NUMBER nnnnnn HAS xxxxxx ERROR(S)
where nnnnnn and xxxxxx are the abstract number and error count as computed by the processing program.

Program Name: ABSTRACT REFORMAT

REFORMAT is a BAL program written to run on the IBM S/360 Model 50. Its purpose is to reformat the data contained in each Psychological Abstracts record into a format that is compatible with the SUPARS/DPS input record description. It accomplishes this by rearranging the fields, inserting termination characters, truncating fields which exceed the maximum acceptable length, and designating sentence terminators.

Computer Definition
1. IBM S/360 Model 50
2. Two 2400 tape drive facilities and 9-track tapes
3. Model 1403 Printer
4. Core requirements:
   a. Assembler 140K
   b. Linkage Editor 128K
   c. Program execution 20K

System Description
1. Syracuse University Operating System (SUOS)
2. Assembler Level F translator program
3. Linkage Editor Level F program

Program Description
This program takes as input the TRANSLATED Psychological Abstracts tape. It processes one input record at a time and produces either two or three output records for each. Each document is assigned a DPS accession number. Then the fields are broken down and reconstructed into a format suitable for DPS processing. The first record contains all bibliographic fields with their termination identifiers and, if there is room, the reformatted abstract. The abstract is rewritten so that the character handling statements will process the punctuation properly. If the abstract will not fit in the first record, it is output as the second record for that document number. If the abstract is too long for a record, it is truncated to the maximum length allowable for an output record, 1646 characters. The last record for each document is the text portion, paragraph B, sentences 1 through 4.

Input
See input record description for Translate Program (Figure 1).

Output
1. Printed output
   The following messages may be printed when this program is run:
   COPYRIGHT STATEMENT MISSING
   LENGTH ERROR FOR ABST NUMBER nnnnn - if a field or the entire record exceeds limits set by DPS
   ERROR IN DIRECTORY FIELD OR AUTHOR SUB - if a non-numeric field is found
   END OF PROCESSING - for successful termination of job
   LAST DOCUMENT NUMBER ASSIGNED WAS xxxxxx

2. REFORMATTED DATA
   For each input record, 2 or 3 output records are produced. If there are 2 output records, the first contains the bibliographic data and Text Paragraph A; otherwise the bibliographic data and Text Paragraph A data are separate records. Text Paragraph B is always the last output record. Each output record is preceded by a 4-byte control field containing the record length. (See Figure 4.)
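The 2-or-3 record rule and the 4-byte control field can be illustrated with a short sketch. This is written in Python for readability and is not the BAL implementation. The function names are hypothetical, and two details are assumptions because the text does not state them: that the control field is a big-endian binary integer, and that the length it holds counts only the data that follows it.

```python
import struct

MAX_RECORD = 1646  # maximum output record length given in the Program Description


def frame(payload: bytes) -> bytes:
    """Prefix a payload with a 4-byte control field holding its length.

    Assumption: big-endian binary length that counts the payload only.
    """
    return struct.pack(">I", len(payload)) + payload


def build_records(biblio: bytes, para_a: bytes, para_b: bytes) -> list[bytes]:
    """Return 2 or 3 framed output records for one document."""
    # An abstract longer than one record is truncated to the maximum length.
    para_a = para_a[:MAX_RECORD]
    if len(biblio) + len(para_a) <= MAX_RECORD:
        # The abstract fits after the bibliographic data: 2 records.
        bodies = [biblio + para_a, para_b]
    else:
        # The abstract becomes its own second record: 3 records.
        bodies = [biblio, para_a, para_b]
    # Text Paragraph B is always the last record.
    return [frame(body) for body in bodies]
```

Under these assumptions, a short abstract yields two records and a long one yields three, with Text Paragraph B last in both cases.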


Field Length     Data Item

                 Bibliographic Data
6                DPS Ascension Number
4*               Year
2*               Volume Number
5*               Abstract Number
Variable***      Author
[illegible]      Editor
[illegible]      Affiliation
[illegible]      Article Title
[illegible]      Source document title
[illegible]      Source document description
[illegible]      Language
[illegible]      Type of Work**
[illegible]      Random Number
[illegible]      Abstract

                 Text Paragraph A
4                Paragraph indicator
Variable         Abstract

                 Text Paragraph B
6                Document Number
Variable***      Article Title
Variable***      Source Doc. Title
Variable***      Author
Variable***      Affiliation

*Field is terminated with the character [illegible] to meet DPS requirements. This character is not included in the listed field length.
**Field length range is 1 to 255 characters. No field entry in the input document is indicated by one blank as the bibliographic entry; input fields exceeding the maximum are truncated to 254 characters and an asterisk is added as the final character to signal truncation.
***Entry is followed by a sentence indicator.

Figure 4. Reformatted Data
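The variable-length field rule in the second footnote (one blank for a missing entry, truncation to 254 characters plus an asterisk) can be sketched as follows. This is an illustrative Python sketch, not the original BAL code, and the function name is hypothetical.

```python
MAX_FIELD = 255  # upper bound of the field length range given in the footnote


def normalize_field(value: str) -> str:
    """Apply the footnote's rule for a variable-length bibliographic field."""
    if not value:
        # No entry in the input document: a single blank stands in for it.
        return " "
    if len(value) > MAX_FIELD:
        # Keep the first 254 characters and flag the truncation with '*'.
        return value[:MAX_FIELD - 1] + "*"
    return value
```

A field of exactly 255 characters passes through unchanged; only longer fields are cut and flagged.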


[Figure 5. Reformat Psychological Abstracts Logic Diagram. The diagram is not recoverable from the source. Legible box labels, in approximate order: BEGIN (tape generated by TRANSLATE program); COMPUTE CURRENT DOCUMENT NUMBER; SET UP LENGTH AND ADDRESS FOR VARIABLE LENGTH FIELDS USING DIRECTORY; MOVE FIXED FORMAT FIELDS TO OUTPUT BUFFER; MOVE VARIABLE LENGTH FIELDS TO OUTPUT BUFFER; WRITE BIBLIOGRAPHIC RECORD; INITIALIZE OUTPUT RECORD; MOVE TEXT FORMAT OF ABSTRACT TO OUTPUT; CREATE TEXT RECORD WITH REMAINING FIELDS; PRINT LAST DOCUMENT # ASSIGNED; FINISH.]

[Figure 6. Reformat Psychological Abstracts Flowchart. The flowchart is not recoverable from the source. Legible box labels, in approximate order: BEGIN (tape generated by TRANSLATE program); TAKE VALUE OF LAST DOC. # ASSIGNED FROM PARM FIELD; READ INPUT; PRINT ERROR MESSAGE; COMPUTE CURRENT DOCUMENT NUMBER; MOVE EACH FIXED FIELD OR NULL INDICATOR TO OUTPUT; INSERT TERMINATOR AFTER EACH FIELD ENTRY; FILL FIELD LENGTHS AND DISPLACEMENTS IN VARIABLE FIELD TABLE; MOVE NAME(S) AND TERMINATOR TO OUTPUT FIELD; MOVE NULL FIELD INDICATOR TO OUTPUT; MOVE ALL REMAINING VARIABLE FIELDS TO OUTPUT; MOVE LANGUAGE AND TYPE FIELDS TO OUTPUT; CALCULATE RANDOM NUMBER AND MOVE IT TO OUTPUT; MOVE FIRST 254 CHARACTERS OF ABSTRACT TO OUTPUT; INITIALIZE OUTPUT RECORD; MOVE ABSTRACT IN TEXT FORMAT TO OUTPUT; MOVE ABSTRACT INDICATOR TO OUTPUT; MOVE TITLE, SOURCE TITLE, AUTHOR, AFFIL AND SENTENCE INDICATORS TO OUTPUT; PRINT LAST DOCUMENT # ASSIGNED; FINISH. TEXTOUT routine: ENTER; TEST EACH BYTE OF ABSTRACT; MOVE FIELD TRANSLATED TO OUTPUT; ADD 2 BLANKS AT END OF SENTENCE; SET UP TO CONTINUE WITH NEXT BYTE; RETURN TO MAIN PGM.]
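The text-format step of the flowchart (test each byte of the abstract, add 2 blanks at the end of a sentence) can be sketched roughly as below. This is a Python illustration only. The set of sentence-ending characters and the rule that an ender must be followed by a blank or by the end of the text are assumptions; the source does not list the characters REFORMAT actually tests for.

```python
SENTENCE_ENDERS = ".?!"  # assumption: the source does not name the characters


def text_format(abstract: str) -> str:
    """Copy the abstract, leaving exactly two blanks after each sentence."""
    out = []
    i = 0
    n = len(abstract)
    while i < n:
        ch = abstract[i]
        out.append(ch)
        i += 1
        # Treat the character as a sentence end only when a blank or the
        # end of the text follows, so "3.5" is left alone (assumed rule).
        if ch in SENTENCE_ENDERS and (i == n or abstract[i] == " "):
            while i < n and abstract[i] == " ":
                i += 1  # drop the blanks already present
            out.append("  ")  # add 2 blanks at end of sentence
    return "".join(out)
```

Scanning one character at a time mirrors the byte-by-byte test in the flowchart; a production version would need the real terminator set from the DPS sentence definition.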

APPENDIX III

SEARCH REFORMAT

Program Name: SRCFRMAT

ABSTRACT

SRCFRMAT is composed of two BAL programs written to run on both the IBM S/360 Model 50 and the IBM S/370 Model 155. The modules reformat data collected by the SUPARS monitor into the format defined by the SEARCHES data base description, a format acceptable as input to the SUPARS/DPS loading and search programs.

Computer Definition
1. IBM S/360 Model 50 or S/370 Model 155
2. Two 2400 tape drive facilities and 9-track tapes
3. Model 1403 Printer
4. Core requirements:
   a. Assembler 140K
   b. Linkage Editor 128K
   c. Program Execution

System Description
1. Syracuse University Operating System (SUOS)
2. Assembler Level F translator program
3. Linkage Editor Level F Program

Program Description
This program is two BAL modules combined into one load module by the linkage editor. BIBFLDS is the main program. It reads as input the statistic records collected for it by the STATPAC programs, processing one input record at a time and producing 2 output records for each. Each document is assigned a DPS ascension number. Then the bibliographic fields are extracted from the raw data and moved to the output record in DPS format. Control is passed to the second module, SRCFRMAT, which reformats the remaining bibliographic fields and Paragraph A of the text portion of the record. Control is returned to the main program and output Record 1 is written. The remaining text data is formatted and written as output Record 2. Processing continues until all documents have been reformatted. Before terminating, the program writes a message giving the last document number assigned.


Figure 1, if used in conjunction with the DPS Program Description and Operations Manual (H20-0477-1), pages 27-47, gives the complete data base description for the search data base.


CONCORD  SEARCHES  Y
FILE     SEARCHES  V  1650  9999
FID      DOCNO     EBCD  6
FLD      SSN       9     TER  ->  1
FLD      LOGN      6     TER  ->  1
FLD      TERN      3     TER  ->  1
FLD      DATE      8     TER  ->  1
FLD      ECPU      6     TER  ->  1
FLD      ECLOCK    6     TER  ->  1
FLD      LCOST     7     TER  ->  1
FLD      MAXDOF    6     TER  ->  1
FLD      SRCHA     255   TER  ->  1
FLD      SRCHB     255   TER  ->  1
FLD      LOGIC     24    TER  ->  1
FLD      WRD       255   TER  ->  1
FLD      WRDA      255   TER  ->  1
FLD      NDOCPR    5     TER  ->  1
FLD      TXT       30    TXT  1
SPCL     075 ;
SENT     080 094 107 ;
TRNS     NONE
DMDBD       DD  DSNAME=[illegible],DISP=OLD,UNIT=2314
DMINPUT     DD  DDNAME=READR
DMREFUD     DD  UNIT=2314,SPACE=(TRK,(20,10)),DSNAME=SRCREF,VOL=SER=LB0005
DMCNCRD     DD  UNIT=2314,SPACE=(TRK,(50,10)),DSNAME=[illegible],VOL=SER=LB0005
DMTEXTS     DD  DUMMY
DMDICTN     DD  UNIT=2314,DSNAME=SRDICTN,DISP=OLD,VOL=SER=LB0005
DMVOCAB     DD  UNIT=2314,DSNAME=SRVOCAB,DISP=OLD,VOL=SER=[illegible]
DMMASTR     DD  UNIT=2314,DSNAME=SRMASTR,DISP=OLD,VOL=SER=[illegible]
DMWRKT1     DD  UNIT=2314,SPACE=(TRK,(50,10)),DSNAME=SRCWR1,VOL=SER=LB0005
DMWRKT2     DD  UNIT=2314,SPACE=(TRK,(50,10)),DSNAME=SRCWR2,VOL=SER=LB0005
DMWRKT3     DD  UNIT=2314,SPACE=(TRK,(50,10)),DSNAME=SRCWR3,VOL=SER=LB0005
DMSORTWK01  DD  UNIT=2314,DISP=OLD,DSNAME=SYS1.UT1
DMSORTWK02  DD  UNIT=2314,DISP=OLD,DSNAME=SYS1.UT2
DMSORTWK03  DD  UNIT=2314,DISP=OLD,DSNAME=SYS1.UT3
DMSORTWK04  DD  UNIT=2314,DISP=OLD,DSNAME=SYS1.UT4
DMSORTLIB   DD  DSNAME=SYS1.SORTLIB,DISP=[illegible],DCB=(BLKSIZE=3265,RECFM=U)
DMSYSOUT    DD  SYSOUT=A
DMWORK1     DD  UNIT=2314,SPACE=(TRK,(50,10)),DSNAME=[illegible],VOL=SER=LB0005
DMWORK2     DD  UNIT=2314,SPACE=(TRK,(50,10)),DSNAME=SRCWS2,VOL=SER=LB0005
DMWORK3     DD  UNIT=2314,SPACE=(TRK,(50,10)),DSNAME=SRCWS3,VOL=SER=LB0005
END

Figure 1. Data Base Description


Displacement   Data Item                     Type

0-3            Record length                 Binary
4-8            Social Security Number        Packed Decimal
9-11           Log number                    Packed Decimal
12-15          Date (YYMMDD)                 Packed Decimal
16-19          Elapsed CPU (1/300 sec.)      Binary
20-23          Elapsed CLOCK (1/300 sec.)    Binary
24-27          Maximum Documents Dropped     Packed Decimal
28-29          List Length                   Binary
30             Terminal Number               Binary
31             A Type                        Binary
32-35          Cost                          Binary
36-37          User I/P Length               Binary
38             Error Flag                    Binary
39-41          No. Docs Printed              Packed
42             List type                     Binary
43-n           User Input                    Character
Var            User Output                   Character
Var            List                          Character

Figure 2. Input Record Description
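Reading the first fields of this input record can be sketched as below. This is a Python illustration, not the BAL code of BIBFLDS, and the function and key names are hypothetical. It assumes the packed fields use the standard IBM packed decimal layout (two digits per byte, with the sign in the low-order half of the last byte) and that binary fields are big-endian, as on the S/360; the record layout itself is taken from Figure 2.

```python
def unpack_packed_decimal(raw: bytes) -> int:
    """Decode IBM packed decimal: two digits per byte, sign in the last nibble."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)  # last byte holds one digit and the sign
    sign = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    # Sign nibbles B and D mark a negative value; the others are positive.
    return -value if sign in (0x0B, 0x0D) else value


def parse_header(record: bytes) -> dict:
    """Extract the first six fields using the displacements in Figure 2."""
    return {
        "record_length": int.from_bytes(record[0:4], "big"),
        "ssn": unpack_packed_decimal(record[4:9]),
        "log_number": unpack_packed_decimal(record[9:12]),
        "date_yymmdd": unpack_packed_decimal(record[12:16]),
        "cpu_ticks": int.from_bytes(record[16:20], "big"),    # 1/300 sec.
        "clock_ticks": int.from_bytes(record[20:24], "big"),  # 1/300 sec.
    }
```

The five-byte Social Security Number field holds exactly nine digits plus a sign, which matches the field widths shown in the figure.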


Output: Two variable length records for each log number.

Record 1. Bibliographic data and first paragraph of text data.

Field Name  Displacement  Length    Data item and comments

            0             4         Record length
DOCNO       4             6         DPS Ascension Number
SSN         10            9         Social security number (in alphabetic code)
LOGN        20            6         Log number
TERN        27            3         Terminal number (port number for hardwires, code for dial-ups)
DATE        31            8         Date (YY-MM-DD)
ECPU        40            6         Elapsed CPU time (mmm:ss)
ECLOCK      47            6         Elapsed clock time ([format illegible])
LCOST       54            7         Search cost (*ddd.cc)
SRCHA       62            variable  Labels, keywords, and operators
SRCHB       variable      variable  LIST statement
LOGIC       variable      24        Frequency of occurrence of each operator, in the same order as the first 7 entries of the LOGIC list for Record 2
WRD         variable      variable  Length count and keywords from search
WRDA        variable      variable  Overflow field for WRD; used only if keywords exceed 254 characters
NDOCPR      variable      5         Number of documents printed
            variable      4         Paragraph A indicator
            variable      variable  Search, from L1 through Ln in text format, or NONESRCH
            variable      variable  LIST statement

Record 2. Second paragraph of text data.

Displacement  Length    Data item and comments

0             4         Record length
4             6         DPS Ascension Number
10            4         Paragraph B indicator
14            9         Social security number
23            variable  L+LOGN+L or LNONE
variable      variable  T+TERN+T or TNONE
variable      variable  D+MMDDYY+D or DNONE
variable      variable  LIST type - LSTBRIEF, LSTKEYWORD, LSTOTHER, or NOLST
variable      variable  LOGIC - from 1 to 7 entries depending on search
variable      variable  Completeness of search
variable      variable  User output in format VnnAnnnnn or V00ANONE

Figure 3. Output Record Description
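The fixed displacements listed for Record 1 follow from the field lengths if each terminated field is followed by a one-byte termination character. The Python sketch below checks that arithmetic. The one-byte terminator and the choice of which fields carry one are assumptions inferred from the listed displacements and the TER entries in the data base description; the names are hypothetical.

```python
# (name, length, terminated) for the fixed-position fields of Record 1.
# Assumption: record length and DOCNO carry no terminator; the rest do.
FIELDS = [
    ("record length", 4, False),
    ("DOCNO", 6, False),
    ("SSN", 9, True),
    ("LOGN", 6, True),
    ("TERN", 3, True),
    ("DATE", 8, True),
    ("ECPU", 6, True),
    ("ECLOCK", 6, True),
    ("LCOST", 7, True),
]


def displacements(fields):
    """Return [(name, offset)] and the offset of the first byte after them."""
    offset = 0
    table = []
    for name, length, terminated in fields:
        table.append((name, offset))
        # A terminated field occupies its length plus one terminator byte.
        offset += length + (1 if terminated else 0)
    return table, offset
```

With these inputs the computed offsets are 0, 4, 10, 20, 27, 31, 40, 47 and 54, and the next free byte is 62, which agrees with the displacement listed for SRCHA.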


[Figure 4. Reformat Searches Module 1 - BIBFLDS. The flowchart is not recoverable from the source. Legible box labels, in approximate order: input tape generated by STATPAC program; CONVERT SOCIAL SECURITY # TO ALPHABETIC CODE IN OUTPUT; MOVE LOG # & TERMINAL # TO OUTPUT; COMPUTE CPU & CLOCK TIME AND COST, MOVE TO OUTPUT; MOVE DATE AND MAXIMUM DOCS. FOUND TO OUTPUT; COMPUTE LENGTH OF USER OUTPUT; SET SEARCH COMPLETENESS INDICATOR; EXCLUDE [illegible] EXPERT SEARCHES; COMPUTE DPS DOCUMENT #, MOVE IT TO OUTPUT; EXIT TO MODULE 2 (FORMSRCH); PRINT ERROR MESSAGE; MOVE DOC # & PARAGRAPH INDICATOR TO OUTPUT; MOVE SOCIAL SECURITY # TO OUTPUT; MOVE TEXT FORMAT OF LOGN, TERN, DATE TO OUTPUT; MOVE ALL VOLUME-ABSTRACT PAIRS TO OUTPUT; MOVE LIST & LOGIC CODES, ERROR INDICATOR TO OUTPUT; PRINT LAST DOCUMENT # ASSIGNED.]
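The input record carries elapsed CPU and clock time in units of 1/300 second, while the output record lists elapsed CPU time as a six-character mmm:ss field. A minimal Python sketch of that conversion follows. The function name is hypothetical, and dropping the fractional second rather than rounding is an assumption; the source does not say which BIBFLDS does.

```python
TICKS_PER_SECOND = 300  # input times are in units of 1/300 sec.


def format_elapsed(ticks: int) -> str:
    """Convert 1/300-second ticks to a six-character mmm:ss string."""
    seconds = ticks // TICKS_PER_SECOND  # assumption: fractions are dropped
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:03d}:{secs:02d}"
```

For example, 37500 ticks is 125 seconds, which this sketch formats as 002:05.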

[Figure 5. Reformat Searches Module 2 - FORMSRCH. The flowchart is not recoverable from the source. Legible box labels, in approximate order: ENTER; GET OUTPUT AREA ADDRESS FROM PARAMETER LIST; MOVE LIST STATEMENT TO OUTPUT; MOVE LOGIC COUNTERS, INPUT WORDS, # DOCS. PRINTED TO OUTPUT; TRANSLATE USER INPUT FIELD; MOVE PARAGRAPH INDICATOR & USER INPUT IN TEXT FORMAT TO OUTPUT; EXECUTE FORMAT AND MOVE ROUTINES DEPENDING ON CHARACTER; TRUNCATE FIELD IF EXCEEDS MAXIMUM; MOVE SEARCH COMPLETENESS INDICATOR & LOGIC SYMBOLS TO PARAMETER LIST; RETURN.]

[Figure 6. (Continued) The flowchart is not recoverable from the source. Legible box labels from the TEXTOUT routine: ADD 2 BLANKS AT END OF SENTENCE; MOVE AS MUCH OF FIELD AS POSSIBLE TO OUTPUT; MOVE FIELD TRANSLATED TO OUTPUT; RETURN TO MAIN PGM.]