Please see the slightly updated version of this paper:

Allyson Carlyle and Joel Summerlin.  “Transforming Catalog Displays:  Record Clustering for Works of Fiction.”  Cataloging & Classification Quarterly.  v. 33, no. 3/4 (2002):  13-25.

 

Allyson Carlyle and Joel Summerlin

University of Washington, School of Library & Information Science, Seattle, WA, USA

 

 

Transforming Catalog Displays:  Record Clustering for Works of Fiction

 

 

Abstract:  Displays grouping retrieved bibliographic record sets into categories or clusters may communicate search results more quickly and effectively to users than current catalogs providing long alphabetical lists of records.  In this research, automatic clustering based on types of relationships, including translation, presence of illustrations, etc., is proposed as a model for clustering.  Bibliographic records associated with three large fiction works (Kidnapped by Robert Louis Stevenson, Bleak House by Charles Dickens, and Three Musketeers by Alexandre Dumas) are analyzed for the presence of relationship-type indicators, in order to determine the extent to which an automatic clustering program would succeed in clustering work records.  Preliminary results show that 94 percent of the records in this study contained indicators of cluster type that would allow them to be correctly identified automatically.

 

 

1.         Introduction

 

Works associated with large numbers of bibliographic records, such as Hamlet, by William Shakespeare, or the Koran, may cause significant problems for library catalog users seeking those works because of the long-list display problem.  Evidence from the first major study of online catalog use shows that users rank scanning long displays as the fifth most problematic aspect of using online catalogs (Matthews, Lawrence and Ferguson, 1983).  A more recent study shows that online catalog users report overload when approximately 100 to 200 or more records are retrieved (Wiberley, Daugherty and Danowski, 1995).  Many current catalog searches result in displays composed of lists of hundreds or even thousands of records.  These lists do little to shed light on the nature and characteristics of the records retrieved.  In addition, it is likely they inhibit a user’s ability to identify relevant records.  Displays that organize retrieved record sets into intelligible categories may communicate search results more quickly and effectively to users than current catalog displays that consist of long lists of brief record summaries. 

The research question addressed in this paper is:  To what extent can record clusters for a small selection of fiction works be created automatically to condense and better organize long catalog displays, making retrieval sets more intelligible to users?   To answer this question, MARC bibliographic records for three fiction works that are associated with large numbers of manifestations are analyzed to discover the extent to which they can be clustered automatically for display.  The results reported here must be considered preliminary, as the final study will analyze five works in total.

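Works studied, selected from a list of the largest fiction works held in the OCLC database identified in a research project by Edward T. O’Neill (1994), are Bleak House by Charles Dickens, Kidnapped by Robert Louis Stevenson, and Three Musketeers by Alexandre Dumas.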

Steps in the study include:


2.         Cluster Indicators and Operational Definitions

 

            Clusters used in this study are based on relationships among items and are derived from the results of two earlier research projects.  In the first project (Carlyle, 1997), relationship-based clusters were formulated following an analysis of (a) codes of Anglo-American filing rules and (b) Barbara Tillett's taxonomy of bibliographic relationships (1991).  In the second project, user clustering of manifestations of a particular work was investigated (Carlyle, 1999 and Carlyle, In review).  Because most of the relationships employed by users in their clustering task correspond to the clusters identified in the 1997 research, the clusters identified for this project are largely taken from the 1997 research.  Relationship-based clusters employed in this study include:

•  illustrated editions

•  eds. with amplifications, e.g., introductions, prefaces

•  large print, Braille, etc. editions

•  editions in collections of 2 or more works

•  parts, selections, only

•  abridgements

•  non-English language editions

•  nonbook format editions

•  English language editions without amplifications, illustrations, etc.

            Clusters are defined operationally through identification of specific MARC fields, subfields, or the content of these fields or subfields, and are discovered through a record-by-record analysis.[1]  Operational definitions for each of the clusters identified above follow.  If specific subfields are not mentioned, then the presence of an indicator may appear in any subfield.  In the following discussion, individual parts of the operational definitions will be called “cluster indicators.”  Abbreviations used below are based on the OCLC Bibliographic Formats and Standards and include:  “FF” (Fixed field); “$” (subfield); and “+” (in addition to).

 

 

3.         Results

 

            MARC records representing editions of Bleak House, Kidnapped, and Three Musketeers were each analyzed by at least one of the authors.  Operational definitions were compiled as the analysis progressed.  Although we believe the definitions to be largely complete, it may be that analysis of records for other works would reveal other cluster indicators.  We did, however, attempt to state each indicator as generally as possible, so that the definitions would cover every variant we could anticipate.

            One of the dilemmas we faced was whether or not to include in our operational definitions cluster indicators that represented incorrect cataloging.  For example, in one record for a microform, the term “microform” appeared not in a $h of a 245 field, but in a $b.  In another example, several records included illustration cluster indicators such as “ill.” in 300 $a as opposed to $b.  Each of these instances of incorrect cataloging was judged according to the extent to which it would have the potential to incorrectly cluster items if used as a cluster indicator.  In the “microform” example, we decided not to include the presence of “microform” in a 245 $b because of the potential for incorrect clustering.  Fortunately, most microform records contain more than one cluster indicator.  In the “ill.” example, we included the presence of “ill.” in a 300 $a because we did not believe it would incorrectly cluster records as illustrated editions.
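The two decisions described above can be sketched in code.  This is a minimal illustration only, not the operational definitions used in the study:  the record model (a list of tag and subfield pairs), the helper names, and the sample records are all hypothetical.

```python
import re

# Hypothetical minimal record model: a list of (tag, subfields) pairs, where
# subfields is a list of (code, value) pairs.  Not a real MARC library.
def subfield_values(record, tag, codes):
    """Yield values of the given subfield codes in every field with this tag."""
    for field_tag, subfields in record:
        if field_tag == tag:
            for code, value in subfields:
                if code in codes:
                    yield value

def is_illustrated(record):
    # "ill." is accepted in 300 $b and also in the miscataloged 300 $a,
    # because accepting it there was judged unlikely to mis-cluster records.
    values = subfield_values(record, "300", {"a", "b"})
    return any(re.search(r"\bill\.", v, re.IGNORECASE) for v in values)

def is_microform(record):
    # "microform" is accepted only in 245 $h; a hit in 245 $b is ignored
    # because of the potential for incorrect clustering.
    values = subfield_values(record, "245", {"h"})
    return any("microform" in v.lower() for v in values)

# Invented sample records, for illustration only.
ill_in_a = [("300", [("a", "xii, 880 p., ill. ;"), ("c", "20 cm.")])]
micro_in_b = [("245", [("a", "Kidnapped :"), ("b", "microform edition /")])]
micro_in_h = [("245", [("a", "Kidnapped"), ("h", "[microform] /")])]
```

With these sample records, `is_illustrated(ill_in_a)` and `is_microform(micro_in_h)` are true, while `is_microform(micro_in_b)` is false, mirroring the choice to tolerate one cataloging error but not the other.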

            Analysis reveals that records often contain more than one indicator of a single cluster type.  For example, a record for an illustrated edition may contain illustration indicators in the fixed field, statement of responsibility, and physical description areas.  In the results presented below, such a record is counted only once in that cluster.  A single record could also belong to more than one cluster; for example, an item might be both illustrated and abridged.  In that case the record is counted in each appropriate cluster, here once in the illustrations cluster and once in the abridgements cluster.  Because of this, the percents given in the tables below will not add up to 100 percent.
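The counting rule can be illustrated with a small sketch.  The three records and the locations of their indicator hits are invented toy data, not records from the study; only the rule itself (once per cluster, possibly several clusters per record) comes from the text.

```python
from collections import Counter

# Hypothetical toy data: each record lists its indicator hits as
# (cluster, where_found) pairs.  The locations are illustrative labels only.
records = {
    "rec1": [("illustrations", "fixed field"), ("illustrations", "300 $b")],
    "rec2": [("illustrations", "245 $c"), ("abridgements", "note")],
    "rec3": [("amplifications", "245 $c")],
}

counts = Counter()
for hits in records.values():
    # A set collapses repeated indicators of one cluster type into one count,
    # while still letting the record appear in several different clusters.
    for cluster in {cluster for cluster, _ in hits}:
        counts[cluster] += 1

# Percent of all records falling in each cluster.
percents = {c: 100 * n / len(records) for c, n in counts.items()}
```

Here three records produce four cluster memberships, so the per-cluster percents sum to more than 100, which is the effect described for the tables.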

            Table 1 shows the number of records analyzed for each work, as well as the percent of the total records each represents.  In addition, it breaks down these totals into two groups, English and non-English language records.  Because one of the selected works was originally written in French (Three Musketeers), it is highly likely that more than the usual number of non-English records appears here.  Note that if Bleak House and Kidnapped only were considered, the percent of non-English records would drop from twenty-eight to eight percent.

 

 

 

                     English   Non-English   Total Records   % of Total
Bleak House              359            27             386          25%
Kidnapped                424            39             463          30%
Three Musketeers         336           367             703          45%
Totals:                 1119           433            1552         100%
Percents:                72%           28%            100%

 

 

Table 1.  Distribution of Records Analyzed by Work and Language

            A very large number of records, 293 or 76% of the total number of records representing editions of Bleak House, would be automatically clustered into an illustrations cluster (see Table 2).  Eighteen percent of the Bleak House records would cluster into an amplifications cluster, consisting of editions that contain introductions, prefaces, afterwords, commentaries, annotations, etc.  None of the Bleak House records represent large print, Braille, or other orthographically variant editions, and very few records represent the work in a collection, a part of the work published separately, or an abridgement.  Ten percent of the records represented translations.  Nonbook editions accounted for five percent of all records;  33 percent of these (seven records) represented microforms, and the rest (67 percent, or fourteen records) represented sound recordings.  English language editions, without amplifications, illustrations, etc., accounted for ten percent of the total records.

 

Bleak House          English   Non-English   Totals   % of BH recs.
Illustrations            280            13      293             76%
Amplifications            67             1       68             18%
Large print, etc.          0             0        0              0%
Collections                3             1        4              1%
Parts                      5             2        7              2%
Abridgements               4             0        4              1%
Non-English Lang.         NA            NA       39             10%
Nonbook                   21             0       21              5%
English eds. only         37            NA       37             10%

 

Table 2.  Distribution of Bleak House Records Clusters

 

            The distributions for Kidnapped show a smaller proportion of illustrated editions than Bleak House, although these editions still make up a very large cluster, including 265 or 57 percent of the total Kidnapped records analyzed.  Eleven percent of the Kidnapped records contain some kind of introduction, preface or afterword.  A small percent (three) of the records represented some kind of orthographically variant edition.  All but one of these were large print editions, while the other was Braille.  Three percent of the records represented editions that included the work in a collection, while only a single record, a translation, represented a part or parts of the work.  Five percent of the records represented abridgements, while six percent represented non-English language versions.  Kidnapped had almost twice the proportion of nonbook editions found for the other two works, at ten percent.  English language editions without amplifications, illustrations, etc., made up 20 percent of the total records for Kidnapped.

 

Kidnapped            English   Non-English   Totals   % of Kidn. recs.
Illustrations            242            23      265                57%
Amplifications            49             0       49                11%
Large print, etc.         13             0       13                 3%
Collections               15             0       15                 3%
Parts                      0             1        1               0.2%
Abridgements              21             0       21                 5%
Non-English Lang.         NA            NA       27                 6%
Nonbook                   48             0       48                10%
English eds. only         92            NA       92                20%

 

Table 3.  Distribution of Kidnapped Records Clusters

 

            The third and largest work, Three Musketeers, shows somewhat similar cluster patterns to the previous two works.  Somewhat less than half of the records represent items that are illustrated.  Amplifications are represented in twelve percent of the records.  The large print cluster is empty, and the collections and parts clusters contain only one to two percent of the records.  Six percent of the records represented items that are abridged.  Non-English items represent a large proportion (52%) of the records for this work, much larger than for the other works, most likely because it was originally written in French.  The nonbook item cluster also contains six percent of the records for Three Musketeers.  English language editions not illustrated, amplified, etc., make up fourteen percent of the total items.

 

Three Musketeers     English   Non-English   Totals   % of 3M recs.
Illustrations            167           165      332             47%
Amplifications            40            47       87             12%
Large print, etc.          0             0        0              0%
Collections                3             8       11              2%
Parts                      5             2        7              1%
Abridgements              29            10       39              6%
Non-English Lang.         NA            NA      364             52%
Nonbook                   37             2       39              6%
English eds. only         98            NA       98             14%

 

Table 4.  Distribution of Three Musketeers Records Clusters

 

            Inevitably, some records would not be clustered correctly if automatic identification using the cluster indicators were used; in other words, they would not be identified as being members of their appropriate clusters (Table 5).  Ninety records, or six percent of the records analyzed, would not be identified as belonging to a cluster they should, in fact, belong to.  The biggest problem for automatic clustering of the records analyzed in this study was works published in collections.  Forty-two records represented items that were collections of two or more works, but could not be identified as such.  This represents 47 percent of the total number of records that would be clustered incorrectly.  Editions with amplifications ranked second, with 23 records, or 26 percent of those that would be clustered incorrectly.  Records for items representing parts or selections of works ranked third, with 13 records, or 14 percent of the total number that would be clustered incorrectly.  A relatively small number of materials appearing in non-English language editions, illustrated editions, and abridgements would cluster incorrectly.  None of the large print, Braille, or nonbook materials clustered incorrectly;  in other words, the cluster indicators were adequate to identify all editions that should have appeared in these clusters.

 

Incorrectly Clustered Records    Number   % Incorr. Clust.
Illustrations                         4                 4%
Amplifications                       23                26%
Large print, etc.                     0                 0%
Collections                          42                47%
Parts                                13                14%
Abridgements                          5                 6%
Non-English Lang.                     3                 3%
Nonbook                               0                 0%
English eds. only                    NA                 NA
Total Incorrectly Clustered          90               100%

 

Table 5.  Distribution of Incorrectly Clustered Records
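The headline figures can be re-derived from the counts reported in Tables 1 and 5.  The short check below uses only numbers printed in those tables; the variable names are ours.

```python
total_records = 1552  # Table 1, total records analyzed

# Table 5, number of incorrectly clustered records per cluster.
missed = {
    "Illustrations": 4,
    "Amplifications": 23,
    "Large print, etc.": 0,
    "Collections": 42,
    "Parts": 13,
    "Abridgements": 5,
    "Non-English Lang.": 3,
    "Nonbook": 0,
}

total_missed = sum(missed.values())                 # 90 records
error_rate = 100 * total_missed / total_records     # about 6 percent
success_rate = 100 - error_rate                     # about 94 percent

# Share of the incorrectly clustered records falling in each cluster.
shares = {name: round(100 * n / total_missed) for name, n in missed.items()}
```

Ninety of 1,552 records is roughly six percent, which leaves the 94 percent success rate reported for automatic clustering.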

 

 

4.         Discussion

 

            For all three works, illustrated editions form by far the largest clusters, ranging from 47 to 76 percent of the records for an individual work.  All other clusters, except for the Three Musketeers non-English editions, are much smaller.  Because the illustrations clusters are so large, it would be interesting to discover whether or not users would find it useful to sub-cluster these records into groups such as abridged illustrated editions, illustrated editions with introductions or afterwords, etc.

            One question it seems appropriate to ask after having come to this point in the study is whether or not records representing non-English language items should appear in clusters other than the non-English cluster.  Here, non-English language items were counted in other clusters, for example the illustrations, abridgements, and amplifications clusters, if they contained appropriate indicators.  However, it should be noted that we could not necessarily identify appropriate non-English terms for amplifications in all of the languages represented, so it is possible that incorrect clustering occurred with translations.  In addition, because language may be a limitation for users, it may be more useful to simply group non-English items together and not mix them in with the English language editions in any other cluster.

            Several problems with automatic clustering presented themselves during the record analysis.  One of the important problems that appeared was the incorrect MARC tagging of works published in collections of two or more works.  A name-title added entry (700 field) with a second indicator of 2 for an analytic entry is frequently a requirement for identifying a work in a collection correctly.  Unfortunately, many of the incorrectly clustered records contained a 700 field with a blank second indicator.  Because a blank second indicator could stand for a related work, it would not be possible to correctly identify these records automatically.  Also, many records used title added entry fields, e.g., the 740 field, to identify the presence of works in the item.  These, also, would be inadequate to identify the presence of the work or additional works in the item.
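The dependence on the second indicator can be shown with a small sketch.  The `Field` structure, the function name, and the sample fields are hypothetical; the only rule taken from the text is that a 700 name-title added entry counts as evidence of a work in a collection when its second indicator is 2.

```python
from typing import NamedTuple

class Field(NamedTuple):
    tag: str
    ind2: str         # second indicator; " " means blank
    subfields: dict   # subfield code -> value

def has_analytic_entry(fields, title):
    """True if a 700 field with second indicator 2 names the title in $t."""
    return any(
        f.tag == "700"
        and f.ind2 == "2"
        and title.lower() in f.subfields.get("t", "").lower()
        for f in fields
    )

# Invented sample fields, for illustration only.
tagged = [Field("700", "2", {"a": "Dickens, Charles,", "t": "Bleak House."})]
blank_ind = [Field("700", " ", {"a": "Dickens, Charles,", "t": "Bleak House."})]
title_only = [Field("740", " ", {"a": "Bleak House."})]
```

Only `tagged` passes the check.  The blank-indicator 700 and the 740 title entry both name the work, but neither says unambiguously that the work is contained in the item, which is why such records would be missed.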

A limitation of the amplification indicators is that no codified method of identifying them exists either in AACR2 or MARC.  If information about an amplification appears on a title page, it is recorded in the record, usually in the 245 $c (statement of responsibility), sometimes in a note (500) field.  This means that to identify amplifications, one must look for words in these areas that signify amplifications, for example, the ones used here, “introduction”, “preface”, “afterword”, etc.  These words seem to indicate the presence of an amplification relatively well.  However, if more ambiguous words are used, such as “letter” in the phrase “including a ‘letter’ by …”, incorrect identification may occur.
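A keyword search of this kind might look like the following sketch.  The term list contains only the three words named above (a fuller list would be an assumption), and the function name and sample strings are invented for illustration.

```python
import re

# Terms named in the text; any longer list would be an assumption.
AMPLIFICATION_TERMS = ("introduction", "preface", "afterword")
PATTERN = re.compile(
    r"\b(" + "|".join(AMPLIFICATION_TERMS) + r")\b", re.IGNORECASE
)

def amplification_terms(statement_of_responsibility, notes=()):
    """Return amplification terms found in 245 $c text or 500 note text."""
    found = set()
    for text in (statement_of_responsibility, *notes):
        found.update(match.lower() for match in PATTERN.findall(text))
    return found

# Invented sample strings, for illustration only.
clear_case = amplification_terms("with an Introduction by the editor")
ambiguous_case = amplification_terms("including a 'letter' by the author")
```

The first string yields `{"introduction"}`, while the second yields an empty set:  a fixed word list cannot tell whether an ambiguous term such as “letter” signals an amplification, and adding it to the list would risk false matches.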

            Finally, important factors to consider in automatic record identification are the presence of misinformation and the complete lack of information in records.  Incorrect clustering due to either of these two situations would occur whether clustering is done automatically or manually.  A particularly problematic situation for Bleak House and Three Musketeers is the presence of records for books indicating approximately 200 or fewer pages without any indication of abridgement, condensation, or adaptation.  Most books for these works are between 400 and 800 pages long.  We are convinced that most of these records actually represent items that are abridgements or adaptations of some kind.  Similarly, sound recordings for the unabridged editions of these works usually contain ten or more cassettes; some sound recording records for these works contain many fewer cassettes, without indicating that they are abridgements or adaptations.
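One way such records could be surfaced for review is sketched below.  This is not part of the study's method:  the function names are hypothetical, the 200-page threshold is taken from the observation above, and the assumption that page counts appear as “N p.” in the 300 $a extent statement is ours.

```python
import re

def page_count(extent):
    """Return the largest 'N p.' number in an extent string, or None."""
    pages = [int(n) for n in re.findall(r"(\d+)\s*p\b", extent)]
    return max(pages) if pages else None

def possibly_unmarked_abridgement(extent, marked_abridged, threshold=200):
    # Flag short books that carry no abridgement indicator.  A flag is only
    # a prompt for human review, not proof that the item is abridged.
    pages = page_count(extent)
    return pages is not None and pages <= threshold and not marked_abridged
```

For example, an extent of “192 p. ;” with no abridgement indicator would be flagged, while “xii, 880 p.” or a multi-volume extent such as “2 v.” would not.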

 

 

5.         Conclusion

The research presented in this paper builds on recent efforts to deepen our understanding of the nature of works represented by large numbers of manifestations and the impact of MARC record structure on retrieval and access of records representing these works.  It contributes to our knowledge about the extent to which records representing large fiction works contain information necessary to manipulate them for organized retrieval and display.  Because large works are often well-known and popular (Smiraglia and Leazer, 1999), it is likely that they are frequently sought in online catalogs.  This preliminary research indicates that for some large works of fiction, as many as 94 percent of records associated with a particular work could be successfully clustered using automatic methods.

 

[Not printed in paper:  Acknowledgements:  This paper was supported largely by an OCLC/ALISE Library and Information Science Research Grant.  We would also like to acknowledge the assistance of Elizabeth S. Knight in the data collection process.]


 

REFERENCES:

 

Carlyle, Allyson.  (1997).  Fulfilling the second objective in the online catalog:  Schemes for organizing author and work records into usable displays.  Library Resources & Technical Services. 41 (2):  79-100.

 

Carlyle, Allyson.  (1999).  User categorisation of works:  Toward improved organisation of online catalogue displays.  Journal of Documentation.  55 (2):  184-208.

 

Carlyle, Allyson.  (In review.)  Developing organized information displays for complex works:  A study of user clustering behavior.

 

Matthews, J., G.S. Lawrence, and D.K. Ferguson.  (1983).  Using online catalogs.  New York, NY:  Neal-Schuman.

 

Bibliographic formats and standards.  (1996).  2nd ed.  Dublin, OH:  OCLC.

 

O'Neill, Edward T.  (1994).  Manifestations of fiction works.  Annual Review of OCLC Research 1994.  Dublin, OH:  11-15.

 

Smiraglia, Richard P. and Gregory H. Leazer.  (1999).  Derivative bibliographic relationships:  The work relationship in a global bibliographic database.  Journal of the American Society for Information Science.  50 (6):  493-504.

 

Tillett, Barbara B.  (1991).  A taxonomy of bibliographic relationships.  Library Resources & Technical Services. 35 (2):  150-158.

 

Wiberley, Stephen E. Jr., Robert Allen Daugherty, and James A. Danowski.  (1995).  User persistence in displaying online catalog postings:  LUIS.  Library Resources & Technical Services.  39 (3):  247-264.



[1] Records were identified as editions of the works studied based on a preliminary record identification process.  A second study will be conducted later focusing on record identification.

[2] The only nonbook formats identified here are formats that would still allow the item to be an edition of the work, e.g., sound recordings, computer files, and microforms; other nonbook formats, e.g., moving image materials or kits, would not be considered to be editions of the work (they would be related works) and therefore are not included here.