This article is part of a series that illustrates how to use the EMC Centera SDK and its .NET wrapper, which is being developed as an open source project, to store "fixed content" on the EMC Centera storage appliance. Before I start, I want to explain what "fixed content" is and give an overview of the reasoning behind the emergence of this type of storage.
Fixed Content Definition
Fixed content is information that never changes after its creation. It is actively referenced, typically shared among users, and must be retained for a long time, often for a mandatory period. Examples include electronic documents, presentations, and e-books; rich media such as movies, videos, digital photographs, and audio files; check images and financial statements; bioinformatics data, X-rays, MRIs, and CAT scans; CAD/CAM diagrams and blueprints; and e-mail messages. To give a sense of scale:
- An average enterprise (a 250-person organization) generates approximately 1.5TB of e-mails per year
- A picture archive in a large hospital may generate more than 5TB per year in digital X-rays or MRIs
- Banks are scanning millions of check images per year, requiring multiple terabytes of storage
State of the Industry
A large portion of all digital information is fixed content. Fixed content is expected to be the largest portion of the digital content created in the next century, exceeding all dynamic content put together.
The information life cycle also drives toward more fixed content. Enterprises that embrace e-mail and electronic documents are rapidly increasing the need for fixed content storage. Finally, emerging regulations in the financial and healthcare industries that require retention (maintaining a copy of fixed content for a mandatory period of time) are creating a huge need for fixed content storage and fixed content solutions.
The EMC Centera appliance is one of the appliances available on the market today to satisfy that need. Other companies, such as NetApp, have solutions equivalent to the Centera, but this series of articles is specific to showing how to code using the Centera SDK.
What You Will Need to Develop Against the Appliance
- To start writing content to the Centera appliance, you will need the Centera SDK. You will need to register on the EMC site to download it. Several versions of the SDK are available for download; use version 3.1 SP1.
Note that the only way to save content on most "fixed content" storage devices is through the device-proprietary API(s) that the device manufacturer publishes. Some manufacturers do offer open standard interfaces (CIFS, NFS, HTTP, and WebDAV) for reading from and writing to their devices, but you usually end up losing a lot of the device's power. Features such as WORM (write-once-read-many) functionality and retention capabilities are usually lost with the open standards.
- You will also need the .NET wrapper for the Centera SDK. The latest version of this open source .NET project is on SourceForge at http://sourceforge.net/projects/cosi-dot-net.
- You need access to the "Public Centera" appliances. EMC recognized that the Centera device is not available everywhere and set up appliances on the Internet that developers can develop against. EMC purges the content of these appliances periodically. The latest IP addresses can be found on the EMC site. As of this writing, the valid IP addresses are:
- EMEA1 - 220.127.116.11, 18.104.22.168, 22.214.171.124, 126.96.36.199
- EMEA2 - 188.8.131.52, 184.108.40.206, 220.127.116.11, 18.104.22.168
- EMEA3 - 22.214.171.124, 126.96.36.199, 188.8.131.52, 184.108.40.206
- EMEA4 - 220.127.116.11, 18.104.22.168
- EMEA5 - 22.214.171.124, 126.96.36.199
- US1 - 188.8.131.52, 184.108.40.206, 220.127.116.11, 18.104.22.168
- US2 - 22.214.171.124, 126.96.36.199, 188.8.131.52, 184.108.40.206
- US3 - 220.127.116.11, 18.104.22.168, 22.214.171.124, 126.96.36.199
- US4 - 188.8.131.52, 184.108.40.206, 220.127.116.11, 18.104.22.168
- US5 - 22.214.171.124, 126.96.36.199, 188.8.131.52, 184.108.40.206
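The WORM and retention features mentioned in the note above are easier to picture with a toy model. The following Python sketch is not the Centera SDK; `ToyWormStore`, `RetentionError`, and the `retention_seconds` parameter are hypothetical names chosen for illustration. It only shows the general rules: content can be written once, read many times, and not deleted until its retention period has passed.

```python
import time


class RetentionError(Exception):
    """Raised when a write-once or retention rule would be violated."""


class ToyWormStore:
    """Toy write-once-read-many store (hypothetical, not the Centera API)."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable clock, so behaviour can be tested
        self._records = {}    # key -> (data, retain_until timestamp)

    def write(self, key: str, data: bytes, retention_seconds: float) -> None:
        if key in self._records:
            # Write-once: existing content can never be overwritten.
            raise RetentionError(f"{key!r} has already been written")
        self._records[key] = (data, self._clock() + retention_seconds)

    def read(self, key: str) -> bytes:
        # Read-many: reads are always allowed.
        return self._records[key][0]

    def delete(self, key: str) -> None:
        _, retain_until = self._records[key]
        if self._clock() < retain_until:
            # Retention: deletion is refused until the period has expired.
            raise RetentionError(f"{key!r} is still under retention")
        del self._records[key]
```

A generic file interface such as CIFS or NFS has no natural place to express these rules, which is one way to see why they tend to be available only through the vendor API.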
Special Architecture Knowledge You Need
- The Centera appliance stores content, and each piece of content is stored and retrieved using an address. This content/address combination is called CAS (content addressable storage), a term you will hear and read often in the industry these days.
- The smallest unit of data that can be stored must be housed inside a structure the SDK calls a "C-Clip." In other words, you first create a C-Clip and place your content inside it, and then you send the C-Clip to Centera to be saved. The C-Clip itself is made of two components: the Content Descriptor File (CDF for short) and the BLOB.
- The Content Descriptor File (CDF) is an XML file that holds metadata. The CDF contains tags and attributes.
- An XML Tag in the CDF
- A user-defined name
- Example: <Application_Name>ImageStore2004</Application_Name>
- An XML attribute in the CDF
- A user-defined value
- Example: <My_App name="ImageStoreServer"/>
- BLOBs hold the objects stored on Centera.
- Each BLOB is represented as a distinct bit sequence of the object you are trying to store.
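To make these terms concrete, here is a minimal Python sketch of the idea. It is not the Centera SDK or its .NET wrapper. It models a BLOB stored under an address derived from its bytes, and a CDF-like XML document that uses one tag and one attribute as in the examples above. The names `ToyCasStore` and `build_cdf`, the `CDF` and `Blob` element names, and the choice of SHA-256 are all assumptions made for illustration; the real appliance computes addresses and lays out the CDF in its own way.

```python
import hashlib
import xml.etree.ElementTree as ET


class ToyCasStore:
    """In-memory stand-in for content addressable storage (not the Centera API)."""

    def __init__(self):
        self._objects = {}

    def write(self, data: bytes) -> str:
        # Assumption: SHA-256 stands in for the appliance's own address scheme.
        address = hashlib.sha256(data).hexdigest()
        # Identical content always maps to the same address, so it is kept once.
        self._objects.setdefault(address, data)
        return address

    def read(self, address: str) -> bytes:
        return self._objects[address]


def build_cdf(app_name: str, server_name: str, blob_address: str) -> str:
    """Build a CDF-like XML string: metadata plus a reference to the BLOB."""
    root = ET.Element("CDF")  # hypothetical root element name
    # A tag with a user-defined name and text content.
    ET.SubElement(root, "Application_Name").text = app_name
    # A tag carrying an attribute with a user-defined value.
    ET.SubElement(root, "My_App", {"name": server_name})
    # Hypothetical element that links the metadata to the stored BLOB.
    ET.SubElement(root, "Blob", {"address": blob_address})
    return ET.tostring(root, encoding="unicode")


store = ToyCasStore()
# The BLOB: the raw bit sequence of the object being stored.
blob_address = store.write(b"scanned check image bytes")
# The CDF: XML metadata that points at the BLOB.
cdf_xml = build_cdf("ImageStore2004", "ImageStoreServer", blob_address)
# In this sketch the descriptor is itself stored by content address,
# and that address is what the application would keep.
clip_address = store.write(cdf_xml.encode("utf-8"))
```

In this sketch the application keeps only `clip_address`: reading it back yields the descriptor, and the descriptor in turn yields the address of the BLOB.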