Last week, several publications including www.cbronline.com and www.zdnet.co.uk featured an article on IBM GPFS scanning a massive 10 billion files in 43 minutes (click here for the article). Instantly I thought ‘wow’, but then I started to think ‘but is this actually important, or just another vanity project from the technology industry?’ After some jovial, thought-provoking chats with colleagues on the subject, the answer is simple: everyone should care.
In many of our HPC installations, particularly research-driven projects at universities, metadata performance has become a key focus. By their nature, university research departments have many researchers working on multiple research projects and creating millions of files (or more). At a recent installation at the University of Edinburgh (read the story here), we did not want the number of users or the number of folders to impact the overall performance of the server and storage cluster, so we opted to store the metadata on separate SSDs, using GPFS as the file management system.
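As a sketch of how that separation is expressed in GPFS, individual disks can be designated metadata-only or data-only when they are defined as Network Shared Disks (NSDs). The device names, server name and pool names below are purely illustrative, and the stanza format shown applies to later GPFS releases:

```
# Hypothetical NSD stanza file: the SSD carries metadata only,
# the spinning disk carries data only (all names are illustrative).
%nsd: device=/dev/ssd0 nsd=meta_nsd1 servers=nsd-server1 usage=metadataOnly failureGroup=1 pool=system
%nsd: device=/dev/sdb  nsd=data_nsd1 servers=nsd-server1 usage=dataOnly     failureGroup=2 pool=data
```

Feeding a stanza file like this to mmcrnsd, then building the file system with mmcrfs, keeps every metadata operation on the flash devices while bulk data stays on cheaper spinning disk.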
So, putting your metadata on flash memory – separate from your mainstay disk storage – and using GPFS is nothing new. However, the speed of querying is clearly something new and quite unbelievable. Whilst IBM’s demonstration was very much flag waving – ‘look at how fast I am’ – below the surface there are very strong messages applicable in the real world – performance, cost, compliance and competitive advantage.
We have found that traditional “walk the tree” scans of a file system, which look for files that have been modified, hit their limit once the file system reaches hundreds of terabytes in size: it can take more than a day just to discover which files have changed, let alone back those changes up. Using GPFS with fast metadata storage, the metadata alone can be scanned for changes and the resulting list of changed files handed to the appropriate backup application.
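As a hedged sketch of what such a metadata-driven change scan can look like, the GPFS policy engine can build the changed-file list and pass it to an external program; the script path and list name below are hypothetical, and the rules would be run with mmapplypolicy:

```
/* Hand each batch of matching files to an external script
   (path is hypothetical) that feeds the backup application. */
RULE EXTERNAL LIST 'changed' EXEC '/usr/local/bin/feed_backup.sh'

/* Select files modified within roughly the last day.
   Only metadata is consulted -- no data blocks are read. */
RULE 'find_changed' LIST 'changed'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(MODIFICATION_TIME)) <= 1
```

Running mmapplypolicy against the file system with this policy file would scan the metadata and invoke the script with the list of changed files, rather than walking every directory.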
The same applies to GPFS policies used for Hierarchical Storage Management, allowing the file system to quickly scan the metadata for files that match certain criteria, for example to move all files that have not been accessed in the last 30 days off to slower, more cost-effective disk (or even tape) storage.
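A minimal sketch of such a tiering rule, assuming two storage pools named 'system' and 'nearline' (the pool names are illustrative), might look like this:

```
/* Migrate files untouched for 30 days from the fast pool
   to a slower, cheaper pool (pool names are assumptions). */
RULE 'age_out' MIGRATE FROM POOL 'system' TO POOL 'nearline'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
```

Because the WHERE clause is evaluated against metadata alone, the faster the metadata storage, the faster rules like this can sweep a multi-terabyte file system.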
According to Mark Blowers from Ovum, in 2011 and beyond, data management will be a key area due to the sheer volumes now passing through enterprises. “The management of data will come to a head for CIOs in 2011, who will realise that it is an issue that can no longer be ignored. The issue of hardware capacity and the drain on resources will see data management make it on to the investment agenda for IT departments in 2011. We believe they need to address both master data management and storage management to deal with the issue effectively.”
Poor information management creates a bottleneck for companies on multiple levels, and adds unnecessary cost burdens. Take the cost of storage, for example: failing to manage data results in the retention of redundant information and records, which occupy storage space within a system. Then there’s the time cost associated with chasing information needed to meet regulatory requirements or to make a business or strategic decision.
Although this is an extreme example of a metadata storage solution, this world-record-breaking benchmark proves that it is feasible, using SSD technology, to scan and then back up multi-petabyte file systems. Given current rates of data growth, it will not be too long before we get there!