INFO 320
Pre-Web Information Systems
Fundamental Problems:
- Form: What is the appropriate (best? natural?) form of information?
Information has a form?
What's the relationship between an oral myth and its written form?
Is there a canonical version of King Lear?
Is there a canonical version of this html page?
Why does eXtensible Markup Language separate content from presentation?
- Meaning: What is the meaning of information?
Information has a meaning?
What is the meaning of "Moby Dick"?
What is the meaning of "42nite"? or "42 nite"? or "4 2 nite"? or "4 2nite"?
What is the meaning of this yellow box?
Our Legacy Information Technology Revolution: Print
Johannes Gutenberg - "Man of the Millennium"
Class discussion: Printing press; moveable type; book creation beyond control of elites; Ben Franklin can
publish political tracts; appearance of dictionaries; codification of language and grammar; sudden appearance
of problem of illiteracy;
competition to Church hagiography; threatened information elites, disruptive information technology, new ways
of reading and writing, etc.
Rapid Advances in Technology of Information
- 1800 - Stanhope tests iron printing press
- 1803 - Cylindrical paper-making machine produces cheap paper
- 1804 - First book printed by stereotype process
- 1811 - Steam-powered cylindrical printing press
- 1822 - Composing machine for setting type
- 1832 - Penny Weeklies build large circulations
- 1836 - Dickens's Pickwick Papers popularizes serial (parts) publication
For the first time in the world's history, there were a lot of books.
In his authoritative study, Charles Dickens and His Publishers (1978), Robert L. Patten points
to the interrelated effects that Pickwick Papers had upon author, publisher, and audience. According to
Patten, although Sketches by Boz
inaugurated Dickens's career, Pickwick made it. Dickens's first continuous
fiction -- many would deny that it is a novel -- ushered in the age of the novel, which critics looking
backward from the perspective of the eighties and nineties thought either the glory or the curse of the
Victorian era. The success of the flimsy shilling parts, issued in green wrappers once each month from April 1836
to November 1837, was unprecedented in the history of literature. The lion's share of credit for that success
has always, and properly, gone to the pseudonymous "Boz," a twenty-four-year-old shorthand writer with a quick
eye, a fluent pen, and an inexhaustible, buoyant, and loving imagination. Critics from 1836 onwards have tended
to slight the part played in the runaway reception of the book by its unusual format; yet subsequent to Dickens's
success with Pickwick, parts publication became for thirty years a chief means of democratizing and enormously
expanding the Victorian book-reading and book-buying public.
Dickens and his publishers discovered the potential of serial publication virtually by accident. Even
though in the half century after Pickwick most of the novels appeared "compact in three separate and
individual volumes" as Mr. Omer describes David Copperfield's maiden effort, and were not bought but
borrowed from the great circulating libraries like Mudie's and W. H. Smith's, serial publication opened
up a new reading and buying public that subsequent publishers and formats did then exploit in a variety
of ways. Furthermore, serial publication yielded profits hitherto thought impossible for any publisher
or author, and transformed Dickens, Chapman, and Hall from minor figures in Victorian letters to titans.
What forces made that format suddenly possible, and how the changes in publishing converged in 1836 and
were connected by two shrewd, courageous, and lucky booksellers with the one man who could write letterpress
for all the people, needs to be understood more fully than it has been so far. The prodigious success of
Pickwick in parts signals a revolution in publishing.
New information technologies produce new forms of information.
Development of Databases
- 1960 - National Library of Medicine (NLM) - MEDLARS system.
- 1960 - First public demonstration of online searching, SDC's Protosynthex.
- 1965 - Chemical Abstracts issued Chemical & Biological Abstracts, printed and magnetic tape formats.
- 1965 - Beginning of the CAS Chemical Registry System database funded by NSF, NIH and DoD.
- 1967 - Engineering Index.
- 1969 - BioSciences Information Service (BIOSIS)
- 1969 - LC MARC tapes for books available to subscribers.
- 1967 - First production search service, Lockheed's DIALOG serving NASA.
- 1967 - Data Corporation: Ohio Bar Automated Research (OBAR) full-text retrieval system.
  - 1970 - Data Corporation became Mead Data Central (MDC).
  - 1973 - MDC's LEXIS became operational.
- 1971-1972 - Lockheed offers DIALOG services to database producers.
  - 1974 - 18 databases offered.
- 1971 - NLM's MEDLINE (MEDLARS on-line) became operational.
- early 1970s - SDC's ORBIT developed with DoD contract.
- 1970-1975 - Transition to on-line searching, led by government; scientific numeric databases.
- 1970-1975 - For-profit companies enter the abstracting and indexing (A&I) market: Predicasts, Congressional Information Service.
Databases sold/leased/offered online through DIALOG or SDC's ORBIT.
Example database vendor company history:
Key dates of the Dialog Corporation
How is "information" constructed to form a bibliographic record?
Database suppliers gather information, index it, and create bibliographic records.
Philosophically, one might say that they "create" the information. Database suppliers may be
for-profit or non-profit organizations. An example would be ERIC.
Database suppliers lease their databases to vendors such as the Dialog Corporation. Dialog vends
access to hundreds of databases, just as a supermarket sells hundreds of food products.
How is "information" derived from a bibliographic record?
The Dialog example
Consider this group of data:
SHERWIN-WILLIAMS CO
408 E 16TH ST
CHEYENNE, WY 82001-4604
TELEPHONE: 307-638-8781
COUNTY: LARAMIE
INDUSTRY: RETAIL TRADE
PRIMARY SIC AND YELLOW PAGE PRODUCT LINE(S):
5231 (PAINT GLASS & WALLPAPER STORES)
523107 (PAINT-RETAIL)
SECONDARY SIC(S) AND YELLOW PAGE PRODUCT LINE(S):
2851 (PAINTS VARNISHES LACQUERS & ENAMELS)
285103 (PAINT-MANUFACTURERS)
5085 (INDUSTRIAL SUPPLIES)
508508 (SPRAYING EQUIPMENT-WHOLESALE)
5198 (PAINTS VARNISHES & SUPPLIES)
519803 (PAINT-WHOLESALE)
5231 (PAINT GLASS & WALLPAPER STORES)
523106 (WALLPAPERS & WALLCOVERINGS-RETAIL)
5713 (FLOOR COVERING STORES)
571305 (CARPET & RUG DEALERS-NEW)
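One way to see how "information" might be derived from such a group of data is to split it into labeled fields. The sketch below is only an illustration: `parse_record`, the two regular expressions, and the output keys are hypothetical names chosen here, not Dialog's actual record format or software. It uses a shortened copy of the record above.

```python
import re

RECORD = """\
SHERWIN-WILLIAMS CO
408 E 16TH ST
CHEYENNE, WY 82001-4604
TELEPHONE: 307-638-8781
COUNTY: LARAMIE
INDUSTRY: RETAIL TRADE
PRIMARY SIC AND YELLOW PAGE PRODUCT LINE(S):
5231 (PAINT GLASS & WALLPAPER STORES)
523107 (PAINT-RETAIL)
"""

# A code line looks like "5231 (PAINT GLASS & WALLPAPER STORES)".
CODE_LINE = re.compile(r"^(\d+) \((.+)\)$")
# A labeled line looks like "COUNTY: LARAMIE".
FIELD_LINE = re.compile(r"^([A-Z ]+): (.+)$")


def parse_record(text):
    """Split a flat record into labeled fields, code pairs, and unlabeled lines."""
    fields = {}
    codes = []
    unlabeled = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code = CODE_LINE.match(line)
        field = FIELD_LINE.match(line)
        if code:
            codes.append((code.group(1), code.group(2)))
        elif field:
            fields[field.group(1)] = field.group(2)
        elif line.endswith(":"):
            continue  # a section heading such as "PRIMARY SIC ...:"
        else:
            unlabeled.append(line)  # meaning is implied only by position
    return {"unlabeled": unlabeled, "fields": fields, "codes": codes}


record = parse_record(RECORD)
print(record["fields"]["COUNTY"])
print(record["codes"])
print(record["unlabeled"])
```

Note that the company name and address end up in `unlabeled`: nothing in the data says what they are, so a reader (or a program) has to infer their meaning from their position in the record.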
Methods
- Human indexing: Somebody comes along and says that Moby Dick is about
"fishing for a big whale" or maybe "whaling for a big fish" or something like that.
Class discussion: Is an indexer a 'privileged reader'? What is the
connection between indexer privilege and indexer correctness?
- Automatic indexing: consists of ripping the text into "words" based on white-space
normalization, tossing high-frequency "meaningless" words such as "the", and then placing
the index terms in an inverted index.
Class discussion: What's the relationship between syntax and semantics?
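The automatic indexing steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular retrieval system: the stop-word list is a tiny hand-made sample, and `tokenize` and `build_inverted_index` are names invented for this example.

```python
from collections import defaultdict

# Tiny illustrative stop list; real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text, split on white space, strip surrounding punctuation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            tokens.append(word)
    return tokens


def build_inverted_index(documents):
    """Map each non-stop word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


documents = {
    1: "Fishing for a big whale.",
    2: "Whaling for a big fish.",
    3: "The role of Vitamin A in the diet.",
}
index = build_inverted_index(documents)
print(index["big"])
print(index["vitamin"])
print("a" in index)
```

The index is built from syntax alone: "whale" and "whaling" are separate entries, "fishing" and "fish" are separate entries, and the "A" of "Vitamin A" has been thrown away with the other stop words.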
Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill. 1983.
(p.52) "Of all the operations required in information retrieval, the most crucial and probably the most difficult one consists in assigning appropriate terms and identifiers capable of representing the content of the collection items. This task, known as indexing, is normally performed manually by trained experts. In modern environments the indexing task can be performed automatically."
(p.71) "Such a process must start with the identification of all the individual words that constitute the documents." (Word problem)
(p.71) "For many practical purposes, it is sufficient to use document excerpts for analysis, such as the titles and abstracts. The available experimental evidence indicates that the use of abstracts in addition to titles brings substantial advantages in retrieval effectiveness." (Form of information problem)
(p.71) "...the high-frequency function words need to be eliminated. These comprise 40 to 50 percent of the text words, and as suggested earlier, these words are poor discriminators and cannot possibly be used by themselves to identify document content." (Word problem: How to retrieve "Vitamin A" if we disregard "A"?)
(p.71) "It is useful first to remove word suffixes (and possibly also prefixes), thereby reducing the original words to word stem form. This reduces a variety of different forms such as analysis, analyzing, analyzer, analyzed, and analysing to a common word stem 'analy.'" (Word problem: What's the difference between an analyst and an analysand?)
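The suffix-removal step in the last quotation can be sketched as follows. The suffix list is hand-picked here only so that the five quoted forms reduce to "analy"; it is an assumption for illustration, not the suffix list of SMART or of any published stemmer, and `stem` is a hypothetical helper.

```python
# Hand-picked suffixes (an assumption for this example, not a real stemmer's list).
BASE_SUFFIXES = ("sing", "zing", "sis", "zer", "zed")


def stem(word, suffixes=BASE_SUFFIXES, min_stem=3):
    """Strip the longest matching suffix, leaving at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


forms = ["analysis", "analyzing", "analyzer", "analyzed", "analysing"]
stems = {stem(w) for w in forms}
print(stems)

# With the base list, "analyst" and "analysand" are left alone.
print(stem("analyst"), stem("analysand"))

# A more aggressive list conflates them with everything else.
aggressive = BASE_SUFFIXES + ("st", "sand")
people = {w: stem(w, aggressive) for w in ["analyst", "analysand"]}
print(people)
```

The choice of suffix list is a decision about meaning made with purely syntactic tools: a cautious list keeps "analyst" and "analysand" apart from "analysis", while an aggressive one files the doctor, the patient, and the procedure under a single stem.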
Some advanced reading: Indexing Texts with SMART
Assignment 1
For class discussion
Reading One:
How important is orthography?
Strategies of the Construction of Information
Consider busy, messy reality:
Pets are registered in King County and are categorized as "domestic" (e.g., a dog) or
"exotic" (e.g., a cobra). Registered pets have a registration number, and most pets have a given name.
Each pet has an owner of record with an address. Pets are also given a general descriptive term such as "poodle."
Some questions about busy, messy reality:
- Would you want to record the zip code of a Seattle area in every pet description? Or, would you rather
design things so that Seattle zip codes are recorded once in their own document?
- What is the relationship between an element like "dog" and an attribute like "registration", or an
element like "pet" and an attribute like "type"? Could things be turned around so that "registration" was
an element and "dog" was its qualifier?
- Who is going to control the descriptive term "poodle"? Suppose that tomorrow someone registers another
poodle, but describes it as "Standard poodle." What's the relationship between "poodle," "standard poodle,"
"miniature poodle," etc.?
Strategy #1: Putting stuff in its place
Strategy #2: Managing qualifiers
Strategy #3: Managing descriptors
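The three strategies can be sketched together for the pet registry. Everything below is a hypothetical model made up for discussion (the class names, the sample owner, the registration number format, and the small vocabulary); it is one possible design, not the county's actual system.

```python
from dataclasses import dataclass

# Strategy #1: put stuff in its place. Zip-code facts are recorded once,
# in their own table, instead of being repeated in every pet description.
ZIP_AREAS = {"98105": "Seattle"}  # sample data

# Strategy #2: manage qualifiers. "type" may take only these values.
PET_TYPES = {"domestic", "exotic"}

# Strategy #3: manage descriptors. A controlled vocabulary lists each allowed
# term together with its broader term (None marks a top-level term).
BROADER = {
    "dog": None,
    "poodle": "dog",
    "standard poodle": "poodle",
    "miniature poodle": "poodle",
}


def normalize_descriptor(text):
    """Lowercase, collapse white space, and reject terms outside the vocabulary."""
    term = " ".join(text.lower().split())
    if term not in BROADER:
        raise ValueError(f"uncontrolled descriptor: {text!r}")
    return term


def is_a(term, ancestor):
    """Follow broader-term links upward to see whether term falls under ancestor."""
    while term is not None:
        if term == ancestor:
            return True
        term = BROADER.get(term)
    return False


@dataclass
class Owner:
    name: str
    street: str
    zip_code: str  # a reference into ZIP_AREAS, not a copy of its contents


@dataclass
class Pet:
    registration: str
    pet_type: str
    descriptor: str
    owner: Owner
    name: str = ""  # most, but not all, pets have a given name

    def __post_init__(self):
        if self.pet_type not in PET_TYPES:
            raise ValueError(f"unknown pet type: {self.pet_type!r}")
        self.descriptor = normalize_descriptor(self.descriptor)


owner = Owner("Pat Example", "1 Hypothetical Ave", "98105")
pet = Pet("K-0001", "domestic", "Standard  Poodle", owner, name="Rex")
print(pet.descriptor, is_a(pet.descriptor, "poodle"), ZIP_AREAS[pet.owner.zip_code])
```

With the vocabulary in place, a search for "poodle" can be widened to include "standard poodle" and "miniature poodle", and a registration described with an unlisted term is rejected until someone decides where that term belongs.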