erpaAdvisory

Answered questions

There are currently 21 answered questions on ErpaAdvisory:

Questions 16 to 20 shown below.

Submitted by pjm on 15 November 2002 at 12:59

We are a psychology institute and we conduct a considerable portion of our research on the Internet studying how people use information and how they present it on their Web-sites. For this reason we monitor a number of Web-sites and document how they evolve. However, the Internet is very dynamic and its contents volatile. In order to document our research, we have to find a means to ensure enduring availability of the web-material we employ. This external Web-material is not in our sphere of influence after all. For this reason we would like to collect it and store it in our own repository.
Can you advise us on how we could capture the material?

Answered by dutched on 15 November 2002 at 13:34

Scope of Question
You wish to preserve digital objects from external web-sites, and your question concerns the active collection of the material. The identification and selection of the material is performed as part of your research work, so it will not be addressed in this answer. Legal issues are outside the scope of your question, although they might arise and should be examined (see for example the Swedish National Library, and new Swedish law). After collecting the material you have to manage it in the archive and actively ensure its long-term preservation; both involve many tasks and decisions, and the issues surrounding them can only be touched on briefly here.

Answer
Collection
Since you already identified the material that you want to obtain, any solution should allow you to clearly define the material to be acquired.

There are free software packages available that are capable of collecting web-pages automatically: GNU wget for Linux (http://www.gnu.org/software/wget/wget.html) and HTTrack for Windows (http://www.httrack.com/) are tools for retrieving Web-pages; both can follow the links in hypertext recursively to mirror whole Web-sites.
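
As a rough illustration of how such a tool might be driven from a script, the following Python sketch assembles a GNU wget command line for a recursive download. The function names, the recursion depth and the target directory are assumptions made for this illustration; the wget options shown are standard ones, but they should be checked against the installed version.

```python
import subprocess


def build_wget_command(url, target_dir, depth=3):
    """Return a wget argument list that mirrors `url` into `target_dir`.

    The default depth of 3 is an arbitrary value chosen for illustration.
    """
    return [
        "wget",
        "--recursive",                        # follow links in the hypertext
        f"--level={depth}",                   # limit how deep the recursion goes
        "--page-requisites",                  # also fetch images, stylesheets, etc.
        "--convert-links",                    # rewrite links so the copy works offline
        "--no-parent",                        # stay below the starting directory
        f"--directory-prefix={target_dir}",   # where the downloaded files are stored
        url,
    ]


def mirror_site(url, target_dir, depth=3):
    """Run wget and return its exit code; wget must be installed and on the PATH."""
    completed = subprocess.run(build_wget_command(url, target_dir, depth), check=False)
    return completed.returncode
```

A call such as mirror_site("http://www.example.org/", "captures/example") would start the download; a non-zero return code indicates that wget reported a problem.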

You have to bear in mind, however, that any automatic solution is prone to technical flaws. This is due to the dynamic nature of the Web and the array of formats, plug-ins, and other software-widgets it accommodates. Sites that are generated dynamically are particularly difficult to capture automatically.

A software environment for collection
The solution needs to be more comprehensive than what the tools described above offer. It must enable you to: (1) customise the capturing of the material to your specific requirements; and (2) incorporate processes necessary for the management of the acquired material.

(1) The tools described above do not offer a feature for acquiring Web-pages at regular intervals. In order to acquire specific Web-sites repeatedly at a certain frequency, the tool has to be executed again each time (or an automatic script needs to be installed that activates the tool). A system suitable for your requirements should allow you to select the sites you wish to acquire, choose the method of acquisition, and define the frequency at which the material should be acquired.
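
The "automatic script" mentioned in (1) could, for instance, decide on each run which sites are due for a new capture. The sketch below is one possible way to do this in Python; the function name, the data layout (a dictionary of capture intervals and a dictionary of last capture times) and the example URLs are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def sites_due(schedule, last_captured, now):
    """Return the URLs whose capture interval has elapsed.

    schedule:      dict mapping URL -> timedelta between captures
    last_captured: dict mapping URL -> datetime of the previous capture
    now:           the current time as a datetime
    """
    due = []
    for url, interval in schedule.items():
        previous = last_captured.get(url)
        # A site that has never been captured is always due.
        if previous is None or now - previous >= interval:
            due.append(url)
    return due


# Hypothetical configuration: one site weekly, one daily.
schedule = {
    "http://www.example.org/": timedelta(days=7),
    "http://www.example.net/": timedelta(days=1),
}
```

Such a function would typically be called from a job that the operating system starts at fixed times, with the capture tool then being run for each URL returned.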

(2) The tools only download the files from the Web-servers and store them in the file-structure in which they are organised on the Web-site. The data still has to be transferred into an archive for long-term retention. To do this you will need to assign metadata to those files, to pack all the files from one Web-site at a specific point in time together into a single record, and to undertake any special processes required to incorporate the record in your archive.
There is no software environment freely available or offered by a vendor that has the ability to automate these tasks. If you need to collect a large number of different sites at a high frequency, you may opt to implement a software environment that automates these tasks as far as possible. One system that tackles these tasks, and goes beyond them, is PANDAS, designed by the Pandora Project at the National Library of Australia. The description of PANDAS at (http://pandora.nla.gov.au/manual/pandas/) gives an impression of such a system and its requirements.
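
To give an idea of what step (2) might look like when scripted, the following Python sketch packs one captured site into a single uncompressed tar file together with a small metadata manifest. It is only a minimal illustration: the function name, the manifest fields and the archive layout ("content/" plus "manifest.json") are assumptions, not an established standard, and a real archive will usually require a richer metadata scheme.

```python
import hashlib
import io
import json
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def package_snapshot(site_dir, source_url, archive_path, captured_at=None):
    """Pack a downloaded site and a metadata manifest into one tar record."""
    site_dir = Path(site_dir)
    captured_at = captured_at or datetime.now(timezone.utc)

    # Describe every file: relative path, size and checksum for later integrity checks.
    files = []
    for path in sorted(p for p in site_dir.rglob("*") if p.is_file()):
        files.append({
            "path": path.relative_to(site_dir).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })

    manifest = {
        "source_url": source_url,
        "captured_at": captured_at.isoformat(),
        "files": files,
    }
    manifest_bytes = json.dumps(manifest, indent=2).encode("utf-8")

    # The archive must be written outside site_dir, or it would include itself.
    with tarfile.open(archive_path, "w") as tar:
        tar.add(site_dir, arcname="content")       # the files as downloaded
        info = tarfile.TarInfo("manifest.json")    # the metadata, stored alongside
        info.size = len(manifest_bytes)
        tar.addfile(info, io.BytesIO(manifest_bytes))
    return manifest
```

The checksums recorded in the manifest make it possible to verify later that the stored files have not been altered or damaged.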


Long-term preservation
After collecting the material, annotating it with metadata, preparing it for ingest, and incorporating it into your archive, you have to ensure that it stays accessible. Hardware always runs the risk of deterioration, which endangers the data on the carrier. In addition, software formats are superseded constantly, thus rendering the data obsolete. There are no off-the-shelf solutions that solve these problems. Nevertheless, you should formulate a strategy for the long-term preservation of the web-material as soon as possible. In the long term it will be easier and more cost-effective if you prepare for digital preservation from the outset.

Before formulating a strategy you have to clearly identify the attributes of the digital objects that are significant to you and that you want to preserve. For example, if you only want to preserve the look of the web-page and not necessarily its functionality, taking snapshots of the web-page and storing them as image files in TIFF format is a practicable option. The TIFF format is deemed to be relatively durable. Retaining the functionality of a website is a far more complex issue, however, and has not been fully addressed.

The issues relating to the long-term preservation of digital material are still a matter of intensive research.

Summary and Recommendation
Collecting material from the web and preserving it over the long term involves many tasks and responsibilities that must not be underrated. You have to be clear about all the technical as well as organisational challenges you face and formulate a clear strategy, otherwise your objectives and decisions might prove intractable. The issues you should address concern largely three areas: the collection, the management, and the long-term preservation of the web material.

(1) Formulate a strategy for the management and the long-term preservation of the material in your archive.

(2) Identify all processes that must be executed before the material can be incorporated in the archive. Based on this specification you may wish to install a software environment that accelerates or even automates these tasks.

(3) Evaluate available tools for collecting material on the web. Keep in mind that no automatic tool can handle all the technical challenges to be faced on the Internet; it is therefore necessary to perform quality checks on the acquired material.


For further reading refer to:
William Y. Arms, Roger Adkins, Cassy Ammen, and Allene Hayes: Collecting and Preserving the Web: The Minerva Prototype. In: RLG DigiNews, April 15, 2001. http://www.rlg.org/preserv/diginews/diginews5-2.html#feature1
Kenneth Thibodeau: Overview of Technological Approaches to Digital Preservation and Challenges in Coming Years. In: The State of Digital Preservation: An International Perspective – Conference Proceedings. CLIR Reports (Council on Library and Information Resources), Washington D.C., July 2002. http://www.clir.org/pubs/reports/pub107/thibodeau.html


Submitted by pjm on 11 November 2002 at 16:24

I'm looking for reports concerning best practice/good practice on digital cultural content, esp. museums, libraries and archives. If you have any information concerning these issues I would be very pleased to receive some (Internet links would be fine).

Answered by dutched on 12 November 2002 at 8:58

Your question is very broad and embraces various fields within the area of digital cultural heritage. In order to structure your research it is best approached from a process-oriented perspective: practices with respect to digital cultural content deal with one of the following:
- Creating digital resources
- Appraising
- Managing
- Preserving
- Providing access to them

Below is a list of literature that is rather general but essentially focuses on managing digital resources. Also, you will find references to hubs that are a good starting point for further research.
----------------------------

DLM Forum'96: Guidelines on best practices for using electronic information. http://europa.eu.int/ISPO/dlm/documents/guidelines.html

Hodge, Gail: Best Practices for Digital Archiving: An Information Life Cycle Approach. D-Lib Magazine, January 2000. http://www.dlib.org/dlib/january00/01hodge.html

Pam Gatenby: Digital Archiving - Developing Policy And Best Practice Guidelines At The National Library Of Australia. ICSTI Forum No.33 (March 2000). http://www.icsti.org/forum/33/#Gatenby

Inter-university Consortium for Political and Social Research. Guide to Social Science Data Preparation and Archiving. 2002. http://www.icpsr.umich.edu/ACCESS/dpm.html

Minnesota Historical Society State Archives Department. Electronic Records Management Guidelines. August 2001. http://www.mnhs.org/preserve/records/electronicrecords/erguidelines.pdf

The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials. http://www.nyu.edu/its/humanities/ninchguide/

PRO: Management, Appraisal and Preservation of Electronic Records - Volume 2: Procedures. http://www.pro.gov.uk/recordsmanagement/eros/guidelines/default.htm



Gateways and Bibliographies

Arts and Humanities Data Service (AHDS) - Guides to Good Practice in the Creation and Use of Digital Resources: http://ahds.ac.uk/guides.htm

CLIR (Council on Library and Information Resources) Reports: http://www.clir.org/pubs/reports/reports.html

Cultivate, European Cultural Heritage Network; a DigiCult project:
http://www.cultivate-europe.org/

DigiCult, Digital Heritage and Cultural Content; Part of the IST Programme of the European Commission: http://www.cordis.lu/ist/ka3/digicult/home.html#What_is_DigiCult

PADI (Preserving Access to Digital Information) Gateway; National Library of Australia: http://www.nla.gov.au/padi/
(http://search.nla.gov.au/padisearch/query.html?col=padimeta&qp=&qs=&qc=&pw=100%25&ws=0&la=&fs=&qt=best+practices&qm=1&ql=&st=11&nh=10&lk=1&rf=0)

the projects and partners of the TRIS project: http://www.trisweb.org/


Submitted by pjm on 30 October 2002 at 15:26

Are there standard file formats I should use to preserve my organisation’s digital information?

Answered by britished on 30 October 2002 at 16:17

Scope of Question
This question relates generally to file formats and their role in the preservation of digital information.

Answer
A file format is a representation of the arrangement of the structural and data elements of a file in a unique and specific manner. The choice of file format for preservation very much depends on the type of information or data that is being preserved.
Despite the wide range of formats available, there are a few principles which should be observed when choosing a preservation file format. The two major problems with file formats are:
1. They are constantly evolving
2. New formats emerge and supersede others
Important issues to bear in mind for preservation therefore include the openness of the format and its expected longevity. Considerations of quality, flexibility and the support of programmes also need to be taken into account.
It is essential to make a decision about the nature of the information or data that you wish to preserve. If it is principally the informational value that is of concern, a different format may be suitable than one chosen to preserve the functionality of the data. It may be best to preserve the information in the format in which it was created. Or, as an example, if it is simply the information in a Microsoft Word document which needs to be preserved and the formatting is not considered important (tables, italics, colours, hyperlinks), then it could be converted to plain text such as ASCII (American Standard Code for Information Interchange). Conversion to a data format like ASCII has advantages because it is non-proprietary and most software packages will accept ASCII data.
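
The trade-off in converting to ASCII can be illustrated with a small Python sketch. It assumes the text has already been extracted from the original document (extracting text from a Word file is not shown), and the function name is hypothetical. Accented letters are reduced to their base letters, and any character with no ASCII equivalent is replaced by a question mark, so some information is lost.

```python
import unicodedata


def to_ascii(text):
    """Reduce text to plain ASCII, replacing characters that cannot be represented."""
    # Split accented characters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFKD", text)
    # ... and drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII becomes "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(to_ascii("naïve café"))  # prints: naive cafe
```

Whether such losses are acceptable depends on which properties of the document you have decided are significant.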
Image file formats have received more attention than other formats, which has led to greater discussion about the suitability of a variety of types and a degree of concurrence of opinion. (See Question: Image File Formats for Digital Preservation.)
For audio files there has also been some move towards consistency as the .WAV format (WAVeform sound format) has moved from being an industry standard to become a de facto standard. The fact that this format is so predominant for the preservation of audio files and that there are already millions of hours of audio material stored in this format adds to its potential longevity.
The Victorian Electronic Records Strategy (VERS) has recommended the use of Adobe’s Portable Document Format (PDF) as a long-term format for the preservation of document type records. They have made this decision based on the fact that PDF has a widely published specification and wide community acceptance. In addition, it renders documents accurately and stores text in a searchable form. It is the belief of VERS that, despite some misgivings, of all the formats currently available, PDF best meets their needs for long-term preservation of document type records.
Summary and Recommendation
By using file formats which are (or are as close as possible to being) well documented, tested, non-proprietary, non-compressed, and useable on a wide range of hardware and software platforms, it is hoped that there will be less risk of a need for frequent migration and an avoidance of the high cost involved in such preservation activities. This is currently the recommended approach to take. It is also important, however, that the chosen format accommodates the properties and functionalities needed for the information to be interpreted and understood into the future.
For further information refer to:
Gregory W. Lawrence, William R. Kehoe, Oya Y. Rieger, William H. Walters, and Anne R. Kenney: Risk Management of Digital Information: A File Format Investigation. June 2000.
John Mark Ockerbloom: Archiving and Preserving PDF Files. February 2001.
PADI: http://www.nla.gov.au/padi/topics/44.html
VERS: http://www.prov.vic.gov.au/vers/
ERPANET Toledo Workshop Report
Question: Image File Formats for Digital Preservation


Submitted by pjm on 23 October 2002 at 15:54

What is the difference between “Conversion” and “Migration”?

Answered by dutched on 24 October 2002 at 7:43

Scope of Question
Your question concerns strategies and techniques relating to digital preservation and is rooted in the different ways in which research projects understand these terms. Technical terms and concepts in digital preservation do not always have a clear definition. This is due to the fact that people contributing to research in digital preservation have different backgrounds and work in inherently different areas. Also, digital preservation is still a relatively young field of research.

Answer
Succinctly: ‘Conversion’ is a technique that is employed in the preservation strategy of ‘Migration’.

Explanation of the answer
Conversion is the transfer of a digital object from one software format to another. The term refers to the technical process of conversion and is normally associated with an automated software tool that conducts this process. One conversion tool could, for example, convert Microsoft Word 97 files into Rich Text Format, or another tool might convert PNG images into the JPEG format. With such a conversion tool a bulk of digital objects can be transferred from the respective source to the target format automatically.

The term Migration refers to a digital preservation strategy. Any Migration strategy consists of a bundle of activities. It concerns the transfer of digital objects from one technology generation to a new one. This allows the object to be accessible on a future computer system. There are two strands of Migration: in one strand the object is periodically transferred from one technology generation to the subsequent one, thereby creating a trail of successive Migration steps; the other strand involves only one step, namely the transfer of the original digital object to an object that is accessible at the time when access to the object is requested (‘Migration on Request’).
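
The relationship between the two terms can be sketched in Python: each conversion is a single function from one format to another, while migration is the procedure that chains conversions until the object reaches the target technology generation. The format names and the two text-encoding converters below are stand-ins chosen so that the example is runnable; they are assumptions for illustration and not the tools mentioned above.

```python
def latin1_to_utf8(data):
    """One conversion: re-encode Latin-1 text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def utf8_to_utf16(data):
    """Another conversion: re-encode UTF-8 text as UTF-16 (little endian)."""
    return data.decode("utf-8").encode("utf-16-le")


# Hypothetical registry: source format -> (next format, conversion function).
CONVERTERS = {
    "latin1-text": ("utf8-text", latin1_to_utf8),
    "utf8-text": ("utf16-text", utf8_to_utf16),
}


def migrate(data, source_format, target_format, converters):
    """Migrate an object by applying successive conversions.

    Returns the migrated data and the trail of formats it passed through.
    """
    trail = [source_format]
    current = source_format
    while current != target_format:
        if current not in converters:
            raise ValueError(f"no conversion path from {current!r} to {target_format!r}")
        next_format, convert = converters[current]
        if next_format in trail:
            raise ValueError("conversion registry contains a cycle")
        data = convert(data)
        current = next_format
        trail.append(current)
    return data, trail
```

In the first strand described above, each step would be run as a format is superseded and the result stored. In 'Migration on Request', the original object would be kept unchanged and a function like migrate would be called only when a user asks for access.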

For further reading refer to:
- Paul Wheatley: Migration – a CAMiLEON discussion paper. In: Ariadne Issue 29, September 2001; ISSN 1361-3200. http://www.ariadne.ac.uk/issue29/camileon/
- Digital Preservation Testbed: Migration – Context and Current Status. White Paper, December 2001; http://www.digitaleduurzaamheid.nl/bibliotheek/docs/Migration.pdf


Submitted by pjm on 23 October 2002 at 13:34

As an art gallery we present our holdings on our web-site for commercial reasons: customers are able to pre-view the artworks we offer for sale. For this reason, we create digital images of the objects and make them available on the Web in a JPG format. Our plan now is to preserve these images. What data format is the most suitable for preservation of images? Will JPG suffice?

Answered by dutched on 23 October 2002 at 13:50

Scope of Question
Your question addresses the properties of image file formats in general, and which format to use for long-term preservation.

Answer
Properties of File Formats
The JPG format (alias JPEG - Joint Photographic Experts Group) is a good choice for viewing photographic colour images on the Web. Due to the compression algorithm it uses (based on the Discrete Cosine Transform), JPEG is able to reduce file sizes considerably, thus enabling faster downloads via the Internet. However, it is a 'lossy compression', meaning that you lose information depending on the degree of compression defined; this results in 'compression artefacts', such as the blurring of sharp edges and irregularities in solid colour areas.
Another widely used image format on the Web, the GIF format (Graphics Interchange Format), makes use of a lossless compression algorithm. Thus, it is suitable for representing sharp edges and solid colour areas, yet the file size is usually bigger. Also, GIF is restricted to 256 colours, which makes it unsuitable for photographic images with smooth colour gradients.

In a nutshell, which format is suitable depends on the type of image: for presenting photographic images on the Web use JPG; for pencil drawings and sketches use GIF.
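
The difference between lossless and lossy compression can be demonstrated with a toy Python example. It is only an analogy: zlib's DEFLATE (the method PNG uses; GIF uses LZW) stands in for lossless compression, and a simple rounding of sample values stands in for the far more sophisticated quantisation step in JPEG. The function names and the step size are assumptions for illustration.

```python
import zlib


def lossless_round_trip(samples):
    """Compress and decompress; every byte comes back unchanged."""
    return zlib.decompress(zlib.compress(samples))


def lossy_round_trip(samples, step=16):
    """Round each 8-bit sample down to a multiple of `step`; detail is discarded."""
    return bytes((value // step) * step for value in samples)


# A smooth gradient with 256 distinct brightness values.
gradient = bytes(range(256))

print(lossless_round_trip(gradient) == gradient)  # prints: True
print(len(set(lossy_round_trip(gradient))))       # prints: 16
```

After the lossy step only 16 of the original 256 values remain, and the discarded detail cannot be recovered from the stored data.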

Long-term Preservation
Neither image format, however, is suitable for preservation. Due to the properties described above, both achieve relatively small file sizes at the expense of information loss. For preservation, a high-quality image is needed, from which copies for access can be derived.

For long-term preservation it is also important to stick to a standard file format, thereby ensuring that the image endures for as long as possible. Any file format you choose, however, will likely be replaced by other formats in the future. Therefore, you will have to actively ensure that your images are accessible beyond that point in time. From the current point of view, the most viable long-term preservation strategy for a homogeneous archive (i.e. one with few different file formats) containing only images is converting the files to new file formats as they are superseded by others.

Currently, a prevalent format for archival storage is TIFF, the Tagged Image File Format, to be precise 'Uncompressed Baseline TIFF Rev 6'; it is a standard format, does not lose information through compression, and is capable of storing images at high resolution and colour depth. Yet, there are several extensions of the format available that use different kinds of compression. Attention must be paid to the properties of the respective compression algorithm.

A format considered to be even more appropriate for the long-term preservation of images is the PNG (Portable Network Graphics) format. It provides an even greater colour depth than TIFF (up to 48 bit). As another advantage, the format is free of copyright. Yet, it is not as widely adopted as TIFF, and hence tools and support for PNG are still somewhat patchy.
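
When planning archival storage it may help to estimate how much space uncompressed images will need. The following Python sketch computes the size of the raw pixel data only (file headers and metadata add a little more); the function name and the image dimensions in the example are assumptions for illustration.

```python
def uncompressed_size_bytes(width_px, height_px, bits_per_pixel):
    """Size of the raw pixel data, rounded up to whole bytes."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


# A hypothetical 3000 x 2000 pixel scan at two colour depths.
print(uncompressed_size_bytes(3000, 2000, 24))  # prints: 18000000
print(uncompressed_size_bytes(3000, 2000, 48))  # prints: 36000000
```

Doubling the colour depth from 24 to 48 bit doubles the storage required for each master image.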

Summary and Recommendation
You should carefully define the properties for each image type (e.g. photo, pencil drawing) and the specific use (e.g. archival storage, Web presentation) and choose the image format accordingly. For presentation purposes JPEG or GIF are appropriate formats. For long-term preservation, however, you should employ TIFF or PNG. Transfer your images to archival storage directly after digitisation. (Do not use JPEG or GIF as an intermediate format, so as not to lose information.) Monitor technology and convert all your images to a new format if your previously chosen format is in danger of becoming obsolete.


For further reading refer to:
Report on the ERPANET Workshop in Toledo, June 23-25 2002.
Technical Advisory Service for Images: File Formats. http://www.tasi.ac.uk/advice/creating/fformat.html
Hewlett Packard: Conversion and Document Formats - Backfile conversion and format issues for information stored in digital archives. DLM Forum 2002 - Industry White Paper 2, ISBN 3-936534-02-0.
Adobe: TIFF Format Support and Development. http://www.adobe.com/support/salesdocs/2596.htm
Greg Roelofs: PNG Home Site. http://www.libpng.org/pub/png/

NOF-DIGI mailinglist, March 2003, thread: "version 5.1 Technical Standards and Guidelines".

