“Transforming Catalog Displays: Record Clustering for Works of Fiction.” Cataloging & Classification Quarterly. v. 33, no. 3/4 (2002): 13-25.
Transforming Catalog Displays: Record Clustering for Works of Fiction
Allyson Carlyle and Joel Summerlin
Contact Information:
Allyson Carlyle
Information School
University of Washington
Box 352840
Seattle, WA 98195-2840
USA
(206) 543-1887
Joel Summerlin
1136 N. 83rd St.
Seattle, WA 98103
USA
(425) 401-4279
Allyson Carlyle is Assistant Professor, Information School, University of Washington, Seattle, WA, USA (email: acarlyle@u.washington.edu). Joel Summerlin is Thesaurus Lead, Corbis, Bellevue, WA, USA (email: joelsumm@yahoo.com).
This paper is a slightly revised version of “Transforming Catalog Displays: Record Clustering for Works of Fiction” originally published in: Dynamism and Stability in Knowledge Organization. Proceedings of the Sixth International ISKO Conference, 10-13 July 2000, Toronto, Canada edited by Clare Beghtol, Lynne C. Howarth, Nancy J. Williamson, ©2000 by ERGON Verlag Dr. H.-J. Dietrich, Würzburg, Germany: pp. 320-326. The OCLC/ALISE Library and Information Science Research Grant Award supported much of the initial data collection and early stages of the research. We would also like to acknowledge Collette Davis, Sara L. Ranger, and Misha Stone for their assistance with data collection, and Harry Bruce, Lisa M. Fusco, Maurice Green, Joe Janes, and Karen Pettigrew for editorial assistance and advice. We would also like to thank Richard P. Smiraglia for his patience and support.
Keywords: online catalog design, clusters, automatic clustering, known items, works, bibliographic relationships
Abstract. Displays grouping retrieved bibliographic record sets into categories or clusters may communicate search results more quickly and effectively to users than current catalog displays providing long alphabetical lists of records. In this research, automatic clustering based on types of relationships, such as translation, presence of illustrations, etc., is proposed as a model for clustering. Bibliographic records associated with three large fiction works (Kidnapped by Robert Louis Stevenson, Bleak House by Charles Dickens, and Three Musketeers by Alexandre Dumas) are analyzed to discover the presence of relationship-type indicators to determine the extent to which an automatic clustering program would succeed in clustering work records. Preliminary results show that 94 percent of the records in this study contained indicators of cluster type that would allow them to be correctly identified automatically. However, the clusters formed by the relationship types used here are of unequal size. Because of this, it is suggested that alternative strategies be investigated for their potential to create more useful clustered displays.
INTRODUCTION AND RATIONALE FOR THE RESEARCH
Works associated with large numbers of bibliographic records, such as Hamlet, by William Shakespeare, or the Koran, may cause significant problems for library catalog users seeking those works because of the long-list display problem. Evidence from the first major study of online catalog use shows that users rank scanning long displays as the fifth most problematic aspect of using online catalogs (Matthews, Lawrence and Ferguson, 1983). A more recent study shows that online catalog users report overload when approximately 100 to 200 or more records are retrieved (Wiberley, Daugherty and Danowski, 1995). Many current catalog searches result in displays composed of lists of hundreds or even thousands of records. These lists do little to shed light on the nature and characteristics of the records retrieved. In addition, it is likely they inhibit a user’s ability to identify relevant records. Displays that organize retrieved record sets into intelligible categories may communicate search results more quickly and effectively to users than current catalog displays that consist of long lists of brief record summaries.
The research question addressed in this paper is: To what extent can record clusters for a small selection of fiction works be created automatically to condense and better organize long catalog displays, making retrieval sets more intelligible to users? To answer this question, MARC bibliographic records for three fiction works that are associated with large numbers of manifestations are analyzed to discover the extent to which they can be clustered automatically for display. The results reported here must be considered preliminary, as the final study will analyze five works in total.
The works studied were selected from a list of the largest fiction works held in the OCLC database identified in a research project by Edward T. O’Neill (1994). They include:
· Bleak House, Charles Dickens
· Kidnapped, Robert Louis Stevenson
· Three Musketeers, Alexandre Dumas.
Steps in the study include:
METHODOLOGY
The first step in the research project was to identify cluster types that could be used to group records representing editions of works. Cluster types identified in this study were based on relationships among items and were derived from the results of two earlier research projects. In the first project (Carlyle, 1997), relationship-based clusters were formulated following an analysis of (a) codes of Anglo-American filing rules and (b) Barbara Tillett's taxonomy of bibliographic relationships (1991). In the second project, user clustering of manifestations of a particular work was investigated (Carlyle, 1999 and Carlyle, 2001). Because most of the relationships employed by users in their clustering task corresponded to the clusters identified in the 1997 research, the cluster types identified for this project were largely taken from the 1997 research.
Relationship-based cluster types employed in this study include:
• illustrated editions
• editions with amplifications, e.g., introductions, prefaces
• large print, Braille, etc. editions
• editions in collections of 2 or more works
• parts, selections only
• abridgements
• non-English language editions
• nonbook format editions
• English language editions without amplifications, illustrations, etc.
The second step in the research project was to analyze the records provided by OCLC for the works included in the study in order to operationalize the cluster types identified above. In other words, records were analyzed to discover the variety of indicators of each cluster type present in the records assembled for the study. Thus, cluster types were defined through the identification of specific indicators – the presence of selected MARC fields, subfields, or the content of these fields or subfields – and were discovered through a record-by-record analysis.[1]
The analysis was performed manually for two reasons. First, potential cluster indicators could not be predicted, even given the extensive cataloging knowledge of the researchers. Second, it was considered essential to observe the extent to which errors or variations would impede automatic clustering.
Operational definitions for each of the cluster types identified in the research are summarized below. Complete operational definitions appear in Appendix 1. If specific subfields are not mentioned, then the presence of an indicator may appear in any subfield. In the following discussion, individual parts of the operational definitions will be called “cluster indicators.”
MARC records representing editions of Bleak House, Kidnapped, and Three Musketeers were each analyzed by one of the researchers. Operational definitions were compiled and refined as the analysis progressed. Although we believe the definitions to be largely complete, it may be that analysis of records for other works would reveal other cluster indicators. We did, however, attempt to make the indicators as general as possible so as to cover as many potential variations as possible.
One of the dilemmas we faced was whether to include in our operational definitions cluster indicators that represented incorrect cataloging. For example, in one record for a microform, the term “microform” appeared not in a $h of a 245 field, but in a $b. In another example, several records included illustration cluster indicators such as “ill.” in 300 $a as opposed to $b. Each of these instances of incorrect cataloging was judged according to the extent to which it would have the potential to incorrectly cluster items if used as a cluster indicator. In the “microform” example, we decided not to include the presence of “microform” in a 245 $b because of the potential for incorrect clustering. Fortunately, most microform records contain more than one cluster indicator. In the “ill.” example, we included the presence of “ill.” in a 300 $a because we did not believe it would incorrectly cluster records as illustrated editions.
Once the compilation of cluster indicators was complete, each record was again analyzed to discover the presence of cluster indicators. Records that could not be clustered correctly using the cluster indicators outlined in Appendix 1 were examined to determine why automatic clustering methods would not successfully identify them.
RESULTS
Analysis reveals that records often contain more than one indicator of a single cluster type. For example, a record for an illustrated edition may contain cluster indicators in the illustrations fixed field, the statement of responsibility (245 subfield c), and the physical description area (300 field). In the results presented below, such a record is counted only once in that cluster, no matter how many indicators of the cluster type it contains. A single record could also fall into more than one cluster; for example, an item might be both illustrated and abridged. In that case, the record was counted as belonging to all appropriate cluster types; thus, a record could be counted once in the illustrations cluster and once in the abridgements cluster. Because of this, the percents given in the tables below do not add up to 100 percent.
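The counting rule just described can be illustrated with a minimal Python sketch. The function name `cluster_percentages` and the four toy records are assumptions made for illustration; they are not data from the study.

```python
from collections import Counter

def cluster_percentages(assignments, total_records):
    """Count each record once per cluster it belongs to, then convert to percents.

    `assignments` holds one collection of cluster names per record (hypothetical layout).
    """
    counts = Counter()
    for clusters in assignments:
        # set() ensures several indicators of the same cluster count only once
        counts.update(set(clusters))
    return {name: round(100 * n / total_records) for name, n in counts.items()}

# Four toy records; the first is both illustrated and abridged.
toy = [
    ["illustrations", "illustrations", "abridgements"],
    ["illustrations"],
    ["amplifications"],
    ["english only"],
]
percents = cluster_percentages(toy, total_records=len(toy))
total_percent = sum(percents.values())  # exceeds 100 because of multiple membership
```

Because the first toy record is counted in two clusters, the percents sum to more than 100, which is the effect noted for the tables.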
In each table for an individual work presented below, results from English and non-English language records are presented separately. Results are divided into these two groups because we wanted to note any differences in record structure, content, and quality based on language of original text. Anecdotal evidence suggests that records for editions in languages other than English are frequently shorter than English-language records, in part because minimal level cataloging procedures are frequently applied to them. In addition, wide variations in quality have been noted in records for non-English language items.
Table 1 shows the number of records analyzed for each work, as well as the percent of the total records each represents. Because one of the selected works was originally written in French (Three Musketeers), it is highly likely that more than the usual number of non-English records appears here. Note that if Bleak House and Kidnapped only were considered, the percent of non-English records would drop from twenty-eight to eight percent.
TABLE 1 ABOUT HERE.
A very large number of records, 293 or 76% of the total number of records representing editions of Bleak House, would be automatically clustered into an illustrations cluster (see Table 2). Eighteen percent of the Bleak House records would cluster into an amplifications cluster, consisting of editions that contain introductions, prefaces, afterwords, commentaries, annotations, etc. None of the Bleak House records represent large print, Braille, or other orthographically variant editions, and very few records represented the work in a collection, a part of the work published separately, or an abridgement. Ten percent of the records represented translations. Nonbook editions accounted for five percent of all records; 33 percent of these (seven records) represented microforms, and the rest (67 percent, or fourteen records) represented sound recordings. English language editions, without amplifications, illustrations, etc., accounted for ten percent of the total records.
TABLE 2 ABOUT HERE.
The distributions for Kidnapped show proportionally fewer illustrated editions than Bleak House, although these editions still make up a very large cluster, including 265 or 57 percent of the total Kidnapped records analyzed (see Table 3). Eleven percent of these records contain some kind of introduction, preface or afterword. A small percent (three) of the records represented some kind of orthographically variant edition; all but one of these were large print editions, and the other was Braille. Three percent of the records represented editions that included the work in a collection, while only a single record, a translation, represented a part or parts of the work. Five percent of the records represented abridgements, while six percent represented non-English language versions. Kidnapped had almost twice as many nonbook editions as the other two works, at ten percent. English language editions without illustrations, amplifications, etc., made up 20 percent of the total records for Kidnapped.
TABLE 3 ABOUT HERE.
The third work, Three Musketeers, which is associated with the largest number of bibliographic records, shows similar cluster patterns to the previous two works (see Table 4). Somewhat less than half of the records represent items that are illustrated. Amplifications are represented in twelve percent of the records. Relatively small percentages or no records are represented in the large print, collections, and parts clusters. Six percent of the records represented items that are abridged. Non-English items represent a large proportion (52%) of the records for this work, much larger than for the other works, most likely because it was originally written in French. The nonbook item cluster also contains six percent of the records for Three Musketeers. English language editions not illustrated, amplified, etc., comprise fourteen percent of the total items.
TABLE 4 ABOUT HERE.
Inevitably, records were discovered that would not be clustered correctly if automatic identification using the cluster indicators were used; in other words, they would not be identified as being members of their appropriate clusters (see Table 5). Ninety records, or six percent of the records analyzed, would not be identified as belonging to a cluster to which they should, in fact, belong. The biggest problem for automatic clustering of the records analyzed in this study was works published in collections. Forty-two records represented items that were collections of two or more works but could not be identified as such; these account for 47 percent of the records that would be clustered incorrectly. Editions with amplifications ranked second, with 23 records, or 26 percent of those clustered incorrectly. Records for items representing parts or selections of works ranked third, with 13 records, or 14 percent. A relatively small number of records for non-English language editions, illustrated editions, and abridgements would cluster incorrectly. None of the large print, Braille, or nonbook materials clustered incorrectly; in other words, the cluster indicators were adequate to identify all editions that should have appeared in these clusters.
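The percentages in this paragraph follow directly from the counts reported in Tables 1 and 5. A short Python check of the arithmetic (variable names are ours) is:

```python
# Counts taken from Table 1 (total records) and Table 5 (incorrectly clustered records).
total_records = 1552
missed = {
    "Illustrations": 4,
    "Amplifications": 23,
    "Large print, etc.": 0,
    "Collections": 42,
    "Parts": 13,
    "Abridgements": 5,
    "Non-English Lang.": 3,
    "Nonbook": 0,
}

total_missed = sum(missed.values())
failure_pct = round(100 * total_missed / total_records)  # share of all records
success_pct = 100 - failure_pct

# Each cluster's share of the incorrectly clustered records.
shares = {name: round(100 * n / total_missed) for name, n in missed.items()}
```

The rounded failure rate of six percent is the complement of the 94 percent success rate cited in the abstract and conclusion.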
TABLE 5 ABOUT HERE.
DISCUSSION
Ideally, a display clustering a large number of items would present clusters that clarify the nature of items retrieved and would be composed of manageable numbers of items. For all three of the works studied, illustrated editions form notably larger clusters than any other cluster type, although the size of this cluster ranges from 47 to 76 percent of the records for the different works. With the exception of the non-English editions of the Three Musketeers, all of the other clusters created are much smaller. Because the illustrations clusters are so large, it would be highly desirable to discover alternative methods of clustering. For example, it would be useful to determine whether users would find it helpful to cluster the records representing illustrated editions that are also abridged, have introductions, etc. into those other clusters, leaving the illustrated editions cluster to be composed only of illustrated editions that have no other cluster attributes. Another possibility to investigate would be to sub-cluster very large clusters into groups such as abridged illustrated editions; illustrated editions with introductions, afterwords, etc.; and large print illustrated editions. Alternatively, we would suggest investigating entirely different methods of subclustering, such as that proposed by Elaine Svenonius (1988). Svenonius defined the following sets of equivalent bibliographic records: work, text, typesetting (or edition), subedition, imprint, and reprint (p. 7). These equivalence sets, excluding work, could also be used to investigate the potential of clustered displays to help users navigate large retrieval sets.
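One of the alternatives mentioned above, sub-clustering the illustrated editions by their other cluster attributes, could be sketched as follows. This is only an illustration under our own assumptions: the function name, the record identifiers, and the input layout (an identifier paired with a set of cluster names) are hypothetical and were not part of the study.

```python
def subcluster_illustrated(records):
    """Group illustrated records by the other clusters they also belong to.

    `records` is a list of (record_id, set_of_cluster_names) pairs (hypothetical layout).
    """
    groups = {}
    for record_id, clusters in records:
        if "illustrations" not in clusters:
            continue  # only the large illustrations cluster is subdivided
        others = sorted(clusters - {"illustrations"})
        label = " + ".join(others) if others else "illustrated only"
        groups.setdefault(label, []).append(record_id)
    return groups

sample = [
    ("rec1", {"illustrations"}),
    ("rec2", {"illustrations", "abridgements"}),
    ("rec3", {"illustrations", "amplifications"}),
    ("rec4", {"abridgements"}),
    ("rec5", {"illustrations", "abridgements"}),
]
subclusters = subcluster_illustrated(sample)
```

Whether such sub-clusters would actually be of manageable size, or useful to catalog users, remains a question for further research.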
Another issue introduced by this study is whether records representing non-English language items should appear in clusters other than the non-English cluster. Here, non-English language items were counted in other clusters if they contained appropriate indicators, for example, the illustrations, abridgements, and amplifications clusters. However, it should be noted that we could not necessarily identify appropriate non-English terms for amplifications in all of the languages represented, so it is possible that incorrect clustering occurred with translations. In addition, because language may be a limitation for users, it may be more useful to simply group non-English items together and not mix them in with the English language editions in any other cluster.
Several problems with automatic clustering proposed in this study presented themselves during the record analysis. One of the important problems that appeared was the incorrect MARC tagging of works published in collections of two or more works. A name-title added entry (700 field) with a second indicator of 2 for an analytic entry is frequently a requirement for identifying a work in a collection correctly. Unfortunately, many of the incorrectly clustered records contained a 700 field with a blank second indicator. Because a blank second indicator could stand for a related work, it would not be possible to correctly identify these records automatically. Also, many records used title added entry fields, e.g., the 740 field, to identify the presence of works in the item. These, also, would be inadequate to identify the presence of the work or additional works in the item.
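The tagging problem can be made concrete with a small Python sketch. It assumes a simplified, hypothetical representation of 700 fields as dictionaries and checks only the second indicator and the presence of a title subfield ($t); the fuller author and title comparisons in Appendix 1 are omitted.

```python
def collection_evidence(fields):
    """Classify the name-title added entries (700 fields with $t) in one record.

    Returns "analytic" if any has second indicator "2", "ambiguous" if all
    have a blank second indicator, and "none" if there are no such fields.
    `fields` is a list of dicts such as {"tag": "700", "ind2": "2", "t": "..."}.
    """
    evidence = "none"
    for field in fields:
        if field.get("tag") != "700" or "t" not in field:
            continue
        if field.get("ind2") == "2":
            return "analytic"
        # A blank second indicator could mean a related work, not a contained one.
        evidence = "ambiguous"
    return evidence

tagged = [{"tag": "700", "ind2": "2", "a": "Stevenson, Robert Louis,", "t": "Kidnapped."}]
untagged = [{"tag": "700", "ind2": " ", "a": "Stevenson, Robert Louis,", "t": "Kidnapped."}]
title_only = [{"tag": "740", "ind2": "2", "a": "Kidnapped."}]
```

Records in the "ambiguous" group correspond to the blank-indicator cases described above, and records carrying only 740 fields yield "none", so neither could be assigned to the collections cluster automatically.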
A limitation of the amplification indicators is that no codified method of identifying them exists either in AACR2 or MARC. If information about an amplification appears on a title page, it is recorded in the record, usually in the 245 $c (statement of responsibility), sometimes in a note (500) field. This means that to identify amplifications, one must look for words in these areas that signify amplifications; for example, the ones used here, “introduction”, “preface”, “afterword”, etc. These words seem to indicate the presence of an amplification relatively well. However, if more ambiguous words are used, such as “letter” in the phrase “including a ‘letter’ by …”, incorrect identification may occur.
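A minimal Python sketch of this kind of word matching is given below. The term list follows the 245 $c and 250 indicators in Appendix 1, with truncation written as a regular expression; the function name and the sample statements of responsibility are invented for illustration.

```python
import re

# Terms from Appendix 1; "\w*" stands in for the truncation symbol "*".
AMPLIFICATION = re.compile(
    r"\b(?:afterword\b|annot\w*|comment\w*|intro\w*|note\w*|prolog\w*|pref\w*)",
    re.IGNORECASE,
)

def amplification_terms(text):
    """Return every word in `text` that matches an amplification indicator."""
    return AMPLIFICATION.findall(text)

found = amplification_terms("with an introduction and notes by the editor")
missed = amplification_terms("including a letter by the author")
false_hit = amplification_terms("the author's preferred text")
```

The second sample shows an amplification that goes undetected because “letter” is not among the indicator terms, while the third shows that truncated terms can also match words that do not signal an amplification at all.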
Finally, important factors to consider in the automatic record identification used in this study are the presence of misinformation or the complete lack of information in records. Incorrect clustering due to either of these two situations would occur whether done automatically or manually. A particularly problematic situation for Bleak House and Three Musketeers is the presence of records for books indicating approximately 200 or fewer pages without any indication of abridgement, condensation, or adaptation. Most books for these works are between 400 and 800 pages long. We are convinced that most of these records actually represent items that are abridgements or adaptations of some kind. Similarly, sound recordings for the unabridged editions of these works usually contain ten or more cassettes; some sound recording records for these works contain many fewer cassettes, without indicating that they are abridgements or adaptations.
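Such records could at least be flagged for review. The Python sketch below is a hypothetical follow-up, not a method used in this study: it reads a page count from the text of a 300 $a and flags records at or below a threshold of 200 pages (taken from the observation above) that carry no abridgement indicator. The function names and the simple parsing rule are our assumptions.

```python
import re

# Matches extents such as "184 p." or "184 pages"; other patterns are ignored.
PAGES = re.compile(r"(\d+)\s*p(?:ages)?\b", re.IGNORECASE)

def page_count(extent):
    """Return the largest page number found in a 300 $a string, or None."""
    numbers = [int(n) for n in PAGES.findall(extent)]
    return max(numbers) if numbers else None

def possibly_abridged(extent, has_abridgement_indicator, threshold=200):
    """Flag short books that are not already identified as abridgements."""
    pages = page_count(extent)
    return pages is not None and pages <= threshold and not has_abridgement_indicator

short_unmarked = possibly_abridged("xii, 184 p. :", has_abridgement_indicator=False)
full_length = possibly_abridged("880 p. ;", has_abridgement_indicator=False)
multivolume = possibly_abridged("2 v. ;", has_abridgement_indicator=False)
```

A flag of this kind indicates only that a record deserves a closer look; it cannot by itself establish that an item is an abridgement or adaptation.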
CONCLUSION
The research presented in this paper builds on recent efforts to deepen our understanding of the nature of works represented by large numbers of manifestations and the impact of MARC record structure on retrieval and access of records representing these works. It contributes to our knowledge about the extent to which records representing large fiction works contain information necessary to manipulate them for organized retrieval and display. Because large works are often well-known and popular (Smiraglia and Leazer, 1999), it is likely that they are frequently sought in online catalogs. This preliminary research indicates that for some large works of fiction, as many as 94 percent of records associated with a particular work could be successfully clustered using automatic methods. While more research is necessary to explore the success of automatic clustering on other types of works, including non-fiction and non-book works, and to refine or replace the cluster types studied here, the high degree of success of the automatic clustering methods used in this study suggests that future research in this area would be fruitful.
NOTES
[1] Records were identified as editions of the works studied in a record identification process using the contents of author, title, and call number fields. See Allyson Carlyle and Sara Ranger, “Facilitating Retrieval of Fiction Works in Online Catalogs.” Proceedings of the 12th ASIS&T SIG/CR Classification Research Workshop, November 4, 2001, Held at the 64th ASIS&T Annual Meeting, November 2-8, 2001, Washington, D.C. Efthimis N. Efthimiadis, ed. Silver Spring, MD: American Society for Information Science and Technology, 2001: 1-11.
[2] The only nonbook formats identified here are formats that would still allow the item to be an edition of the work, e.g., sound recordings, computer files, and microforms; other nonbook formats, e.g., moving image materials or kits, would not be considered to be editions of the work (they would be related works) and therefore are not included here.
REFERENCES
Carlyle, Allyson. 1997. “Fulfilling the Second Objective in the Online Catalog: Schemes for Organizing Author and Work Records into Usable Displays.” Library Resources & Technical Services. 41 (2): 79-100.
Carlyle, Allyson. 1999. “User Categorisation of Works: Toward Improved Organisation of Online Catalogue Displays.” Journal of Documentation. 55 (2): 184-208.
Carlyle, Allyson. 2001. “Developing Organized Information Displays for Voluminous Works: A Study of User Clustering Behavior.” Information Processing & Management. 37 (5): 677-699.
Matthews, Joseph R., Gary S. Lawrence, and Douglas K. Ferguson. 1983. Using Online Catalogs. New York, NY: Neal-Schuman.
O'Neill, Edward T. 1994. “Manifestations of Fiction Works.” Annual Review of OCLC Research, 1994. Dublin, OH: OCLC: 11-15.
Pease, Sue and Mary Noel Gouke. 1982. “Patterns of Use in an Online Catalog and a Card Catalog.” College & Research Libraries. 43 (July): 279-291.
Smiraglia, Richard P. and Gregory H. Leazer. 1999. “Derivative Bibliographic Relationships: The Work Relationship in a Global Bibliographic Database.” Journal of the American Society for Information Science. 50 (6): 493-504.
Svenonius, Elaine. 1988. “Clustering Equivalent Bibliographic Records.” Annual Review of OCLC Research, July 1987-June 1988. Dublin, OH: OCLC: 6-8.
Tillett, Barbara B. 1991. “A Taxonomy of Bibliographic Relationships.” Library Resources & Technical Services. 35 (2): 150-158.
Wiberley, Stephen E. Jr., Robert Allen Daugherty, and James A. Danowski. 1995. “User Persistence in Displaying Online Catalog Postings: LUIS.” Library Resources & Technical Services. 39 (3): 247-264.
APPENDIX 1. Operational Definitions of Cluster Types Used in the Research
NOTE: If specific subfields are not mentioned, then the presence of an indicator may appear in any subfield. Abbreviations used below are based on the OCLC Bibliographic Formats and Standards and include: “FF” (Fixed field); “$” (subfield); and “+” (in addition to).
Cluster Types:
· illustrated editions: presence of any letter in an Ills. FF; presence of the truncated (truncation indicated by *) terms “ill*”, “port*”, “plate*”, or “map*” in either the 245 $c, the 300 field, or the 250 field; presence of “ill*” in a 440 or 490 field; presence of “illus*” in a 500 field; presence of “ill*” in a 700 $e.
· editions with amplifications: presence of the terms “afterword”, “annot*”, “comment*”, “intro*”, “note*”, “prolog*”, or “pref*” in a 245 $c or 250 field; presence of “introd*”, “commentary”, “notes”, “supplement” in a 500 field.
· large print, Braille, or other orthographic variation editions: presence of “d” or “f” in the Form FF; presence of the term “large type” in a 245 $c, 250, 440, 490, 500, 650 $a, or 830 field; presence of the term “large print” in a 245 $h; presence of the term “Braille” in a 245 $h or a 553 field.
· editions in collections: presence of uniform title of work in 505 field + 100 $a author of work; presence of 700 field with 2nd indicator “2” with correct author and uniform title for work + (100 $a is not author of work or 245 $a is not title of our work); presence of 700 field with 2nd indicator “2” with author and uniform title for another work; 500 $a or 501 $a beginning with “With” or “Bound with”.
· editions composed of parts, selections only: presence of 240 $a uniform title + ($k “selections” or presence of $p); presence of LC call number for work indicating parts; presence of 700 field with 2nd indicator “2” with correct author and uniform title for work + presence of ($k “selections” or presence of $p).
· abridgements: presence of “abridge*” in a 245, 250 $a, 511, or 520 field; presence of “condens*” in a 245, 250 $a, 511, or 520 field.
· non-English editions: presence of any text other than “eng” in Lang FF; presence of 041 1_; presence of 240 $l if not “English”; presence of “translat*” in 245 $c, 500, or 520 field.
· nonbook format editions: Sound Recordings: presence of Type FF “i” or “j”; presence of 007 $a “s”; presence of 245 $h “sound*”; presence of 300 field $a with (“sound*” or “cassett*”); presence of 305 or 362 fields. Computer files: presence of Type FF “m”; presence of 245 $h “computer file”; 256, 538, or 856 fields present. Microforms: presence of Form FF “a”, “b” or “c”; presence of 245 $h “microform”; presence of “micro*” in 533 field $a.
· English language editions without illustrations, amplifications, etc.: all of the records not clustered in any of the above clusters.
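To suggest how operational definitions of this kind might be applied by a program, the following Python sketch encodes the illustrated editions and abridgements definitions as rules. It is a sketch under our own assumptions, not software used in the study: the record layout (a dictionary with a "fixed" field map and a list of (tag, subfields) pairs), the rule tables, and the function names are all hypothetical, and no real MARC library is used.

```python
import re

# Truncated terms from the definitions above; "\w*" stands in for "*".
ILL_ANY = re.compile(r"\b(?:ill|port|plate|map)\w*", re.IGNORECASE)
ILL = re.compile(r"\bill\w*", re.IGNORECASE)
ILLUS = re.compile(r"\billus\w*", re.IGNORECASE)
ABRIDGED = re.compile(r"\b(?:abridge|condens)\w*", re.IGNORECASE)

# Each rule: (field tag, subfield code or None for any subfield, pattern).
ILLUSTRATED_RULES = [
    ("245", "c", ILL_ANY), ("300", None, ILL_ANY), ("250", None, ILL_ANY),
    ("440", None, ILL), ("490", None, ILL), ("500", None, ILLUS), ("700", "e", ILL),
]
ABRIDGEMENT_RULES = [
    ("245", None, ABRIDGED), ("250", "a", ABRIDGED),
    ("511", None, ABRIDGED), ("520", None, ABRIDGED),
]

def matches_any(record, rules):
    """Return True if any rule finds its pattern in a matching field or subfield."""
    for tag, code, pattern in rules:
        for field_tag, subfields in record["fields"]:
            if field_tag != tag:
                continue
            for sub_code, value in subfields:
                if (code is None or sub_code == code) and pattern.search(value):
                    return True
    return False

def clusters_for(record):
    """Return the subset of the two sketched cluster types the record belongs to."""
    found = set()
    has_ills_code = bool(record.get("fixed", {}).get("Ills", "").strip())
    if has_ills_code or matches_any(record, ILLUSTRATED_RULES):
        found.add("illustrated editions")
    if matches_any(record, ABRIDGEMENT_RULES):
        found.add("abridgements")
    return found

# Invented sample records for illustration only.
example = {"fixed": {"Ills": ""}, "fields": [
    ("245", [("a", "Kidnapped /"), ("c", "Robert Louis Stevenson ; abridged and illustrated.")]),
    ("300", [("a", "96 p. :"), ("b", "ill. ;")]),
]}
plain = {"fixed": {}, "fields": [
    ("245", [("a", "Bleak House /"), ("c", "Charles Dickens.")]),
    ("300", [("a", "880 p. ;")]),
]}
```

Treating truncation as a simple prefix match means that unrelated words beginning with the same letters would also match, so a production rule set would need testing against a wider range of records.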
|                  | English | Non-English | Total Records | % of Total |
| Bleak House      | 359     | 27          | 386           | 25%        |
| Kidnapped        | 424     | 39          | 463           | 30%        |
| Three Musketeers | 336     | 367         | 703           | 45%        |
| Totals:          | 1119    | 433         | 1552          | 100%       |
| Percents:        | 72%     | 28%         | 100%          |            |
Table 1. Distribution of Records Analyzed by Work and Language
| Bleak House       | English | Non-English | Totals | % of BH recs. |
| Illustrations     | 280     | 13          | 293    | 76%           |
| Amplifications    | 67      | 1           | 68     | 18%           |
| Large print, etc. | 0       | 0           | 0      | 0%            |
| Collections       | 3       | 1           | 4      | 1%            |
| Parts             | 5       | 2           | 7      | 2%            |
| Abridgements      | 4       | 0           | 4      | 1%            |
| Non-English Lang. | NA      | NA          | 39     | 10%           |
| Nonbook           | 21      | 0           | 21     | 5%            |
| English eds. only | 37      | NA          | 37     | 10%           |
Table 2. Distribution of Bleak House Records Clusters
| Kidnapped         | English | Non-English | Totals | % of Kidn. recs. |
| Illustrations     | 242     | 23          | 265    | 57%              |
| Amplifications    | 49      | 0           | 49     | 11%              |
| Large print, etc. | 13      | 0           | 13     | 3%               |
| Collections       | 15      | 0           | 15     | 3%               |
| Parts             | 0       | 1           | 1      | 0.2%             |
| Abridgements      | 21      | 0           | 21     | 5%               |
| Non-English Lang. | NA      | NA          | 27     | 6%               |
| Nonbook           | 48      | 0           | 48     | 10%              |
| English eds. only | 92      | NA          | 92     | 20%              |
Table 3. Distribution of Kidnapped Records Clusters
| Three Musketeers  | English | Non-English | Totals | % of 3M recs. |
| Illustrations     | 167     | 165         | 332    | 47%           |
| Amplifications    | 40      | 47          | 87     | 12%           |
| Large print, etc. | 0       | 0           | 0      | 0%            |
| Collections       | 3       | 8           | 11     | 2%            |
| Parts             | 5       | 2           | 7      | 1%            |
| Abridgements      | 29      | 10          | 39     | 6%            |
| Non-English Lang. | NA      | NA          | 364    | 52%           |
| Nonbook           | 37      | 2           | 39     | 6%            |
| English eds. only | 98      | NA          | 98     | 14%           |
Table 4. Distribution of Three Musketeers Records Clusters
| Incorrectly Clustered Records | Number | % Incorrectly Clustered |
| Illustrations                 | 4      | 4%                      |
| Amplifications                | 23     | 26%                     |
| Large print, etc.             | 0      | 0%                      |
| Collections                   | 42     | 47%                     |
| Parts                         | 13     | 14%                     |
| Abridgements                  | 5      | 6%                      |
| Non-English Lang.             | 3      | 3%                      |
| Nonbook                       | 0      | 0%                      |
| English eds. only             | NA     | NA                      |
| Total Incorrectly Clustered   | 90     | 100%                    |
Table 5. Distribution of Incorrectly Clustered Records