Archiving101.com: in-depth, no-nonsense information about archiving and related technologies.
11th January 2010

EDRM data set project


This was made available earlier this week. The EDRM Data Set project (http://edrm.net/activities/projects/data-set) has three different data sets available for free:

 

  • The EDRM Data Set Enron PST files: Enron e-mail messages and attachments organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.
  • The EDRM File Formats Data Set: 381 files covering 200 file formats.
  • The EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

 

These might be useful for anyone who would like to test discovery on real content instead of manually created data.

posted in Categorization, eDiscovery | 0 Comments

21st March 2008

The Practical Need for Proactive Email Categorization (PART 1 of 2)

Archiving email is a business necessity in an ever increasing number of organizations. There are several primary reasons for this:

  • Legal or other regulation
  • Mailbox size management
  • Litigation response/Federal Rules of Civil Procedure (FRCP)

This last reason is what gives many organizations heartburn, and many are attempting to get the “digital landfill” that is their current email archive under control. To most, this takes the form of archive categorization, a method for automatically:

  • Setting retention periods for email messages
  • Properly marking potentially privileged email
  • Categorizing and/or eliminating non-business and duplicate email
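
To make that concrete, here is a minimal sketch in Python of what rule-based categorization could look like. Everything in it is an assumption for illustration: the keyword lists, the retention periods, and the `categorize` function are hypothetical, not any product's API, and real privilege detection needs far more than keyword matching.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical rules: every keyword and period below is an illustrative assumption.
PRIVILEGE_MARKERS = ("attorney-client", "privileged and confidential")
NON_BUSINESS_MARKERS = ("fantasy football", "bake sale")
BUSINESS_RETENTION_DAYS = 3 * 365
NON_BUSINESS_RETENTION_DAYS = 30


@dataclass
class Category:
    retention_days: int
    potentially_privileged: bool
    non_business: bool
    duplicate: bool


def categorize(subject: str, body: str, seen: set) -> Category:
    """Categorize one message; `seen` holds digests of messages already processed."""
    text = f"{subject}\n{body}".lower()

    # Exact-duplicate detection: hash the normalized subject and body.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    duplicate = digest in seen
    seen.add(digest)

    privileged = any(marker in text for marker in PRIVILEGE_MARKERS)
    non_business = any(marker in text for marker in NON_BUSINESS_MARKERS)

    # Never shorten retention on anything flagged as potentially privileged.
    if non_business and not privileged:
        retention = NON_BUSINESS_RETENTION_DAYS
    else:
        retention = BUSINESS_RETENTION_DAYS
    return Category(retention, privileged, non_business, duplicate)
```

Calling `categorize` on each message as it enters the archive, with one shared `seen` set, yields a retention period and three flags per message. The point is only that these decisions are made up front, not during a discovery request.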

The best time to implement some form of categorization is before you are forced to respond to an FRCP request. To facilitate this, the legal department and the information technology groups need to come together. I’m not suggesting that holding hands and singing “Kum Bah Yah” is the solution to your archive headaches, but here is what I’ve seen:

IT: “We need an archive retention strategy.”
Legal: “Yes.”
IT: “We need to come to consensus on what items to keep for what periods.”

At this point the legal group typically answers one of two ways:

Legal: “That’s simple, keep everything forever.”

-or-

Legal: “That’s simple, keep everything for thirty days, and then nuke it.”

While both of these might seem like exaggerations on my part, I assure you they aren’t. Virtually every organization that I’ve worked with has taken one or the other of these naïve approaches at some point. The first time that they actually had to retrieve something from the archive, they realized one of two things:

  • The archive is so full of cruft that it’s impossible to find anything. We can’t differentiate anything that might be responsive from the non-business email, and we’re spending a lot of money examining these emails to determine whether they are privileged communications.

-or-

  • There isn’t much in the archive that matters (and, since the email was also deleted from the mailboxes and PST files on the users’ desktops, the users are up in arms).

Just for reference, it is possible to find an archiving middle ground where companies retain a manageable amount of information while still being responsive to current and future demands.

For example, some organizations have taken to targeted archiving: archiving only the users who might be involved as custodians in the future. For many organizations this represents only a small fraction of the total user base.

Remember that setting retention periods properly based on user role and content is essential to proper archive hygiene, and removing non-business email can cut your archive size by a significant percentage.
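
As a rough illustration of retention by role and content, the sketch below looks up a retention period from a small policy table and computes the date a message becomes eligible for removal. The roles, content types, and periods are all hypothetical placeholders; the real numbers have to come out of that conversation between legal and IT.

```python
from datetime import date, timedelta

# Hypothetical policy table: (user role, content type) -> retention in days.
RETENTION_DAYS = {
    ("executive", "business"): 7 * 365,
    ("staff", "business"): 3 * 365,
}
NON_BUSINESS_DAYS = 30  # assumed short period for non-business email


def purge_date(role: str, content_type: str, received: date) -> date:
    """Return the date on which a message becomes eligible for removal."""
    if content_type == "non-business":
        days = NON_BUSINESS_DAYS
    else:
        # Unknown combinations fall back to the longest period in the table,
        # which is the conservative choice.
        days = RETENTION_DAYS.get((role, content_type), max(RETENTION_DAYS.values()))
    return received + timedelta(days=days)
```

For example, `purge_date("staff", "non-business", date(2008, 3, 21))` returns `date(2008, 4, 20)`, while the same message classified as business email would be kept for three years.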

Bradley Young is vice president of services for MessageGate, an email controls company (http://www.MessageGate.com). Bradley blogs at http://randomdesiderata.blogspot.com/.

posted in Categorization, eDiscovery | 1 Comment