Building the Right Environment to Support AI, Machine Learning and Deep Learning
- Homogeneous data arrays containing elements of the same type
- Multimedia - audio, video and graphics files
- Interim data for internal use (logs of various types, caches)
- Streams of calculated data of various types (e.g. recorded video stream or massive computation results)
- Documents (simple or compound)
Such data can be stored in the following ways:
- Files in file system
- Structured storages
- Archives (as a specific form of structured storage)
- Remote (distributed, cloud) storages
Homogeneous data arrays
Homogeneous data arrays contain elements of the same type. Examples of a homogeneous data array are a simple table, temperature data over time, or last year's stock values.
- For homogeneous data arrays, regular files do not support convenient or fast searching. You have to create, maintain and constantly update special index files. Modifying the data structure is almost impossible. Meta-information is limited. There is no built-in run-time compression or encryption of data.
- Relational databases are well suited for homogeneous data. They comprise a set of predefined records with a rigid internal format. The main advantages of relational databases are the ability to locate data quickly according to specified criteria and transactional support for data integrity. Their significant shortcoming is that they do not work well for large data of variable length (BLOB fields are usually stored separately from the rest of the record). Moreover, keeping data in a relational database requires: a) the use of a specific DBMS, which severely limits the portability of the data and of the application itself; b) pre-planning of the database structure, including relationships between tables and the indexing policy; c) research into peak loads for efficient database development, which may also be a serious overhead.
- Structured storages are somewhat analogous to a file system: a storage is a set of named streams (files) wrapped in a single container. Such a storage can be kept at any location: in a single file on a disk, in a database record, or even in RAM. The main advantage of this approach is that it allows data to be added to or deleted from an existing storage efficiently and handles data of various sizes (from small to huge) effectively. Each storage is a separate unit (file) and can therefore be easily relocated, copied, duplicated or backed up. There is no need to track all the files generated by an application. Moreover, journaling makes it possible to restore the content completely or partially after an accident or failure. The disadvantage may be relatively slow search inside such huge data arrays.
- ZIP archives, as a specific form of structured storage, can be used for storing homogeneous data arrays, but only when most access is read-only. The standardized nature of the ZIP format makes it easy to use, especially in cross-platform applications, but the format is not suitable for data that will be modified after packing, because adding and deleting data are time-consuming operations.
- Remote and distributed storages are the next level of storage, in which data location and data access are handled by a dedicated layer that encapsulates the access mechanics. In such storages the data may actually be kept in databases or distributed among different file systems, but the actual storage organization does not matter to the end user. The user sees only a set of objects accessed through an API or, alternatively, through file system calls. Cloud storage is a good example. These types of storage are intended for large software systems. Their advantages include unified data access with no need to think about how the data are actually stored. Their disadvantages are that they cannot be efficiently managed and controlled, and that backup or migration of data is complicated.
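To make the relational option more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It stores a small temperature series, indexes it by timestamp so that rows can be located by criterion, and loads the rows inside a single transaction. The table name, the column names and the sample readings are illustrative assumptions, not taken from any particular system.

```python
import sqlite3

# Made-up sample readings, for illustration only: (ISO timestamp, degrees Celsius).
SAMPLE = [
    ("2023-01-01T00:00", -2.5),
    ("2023-01-01T06:00", -4.0),
    ("2023-01-01T12:00", 1.5),
    ("2023-01-01T18:00", 0.0),
]


def build_temperature_db(readings):
    """Create an in-memory table of (ts, celsius) rows with an index on ts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE temperature (ts TEXT NOT NULL, celsius REAL NOT NULL)"
    )
    # The index lets the engine locate rows by timestamp without a full scan.
    conn.execute("CREATE INDEX idx_temperature_ts ON temperature (ts)")
    # 'with conn' wraps the inserts in one transaction: all rows commit or none.
    with conn:
        conn.executemany(
            "INSERT INTO temperature (ts, celsius) VALUES (?, ?)", readings
        )
    return conn


def readings_between(conn, start, end):
    """Return rows whose timestamp lies in [start, end], ordered by time."""
    cur = conn.execute(
        "SELECT ts, celsius FROM temperature "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = build_temperature_db(SAMPLE)
    print(readings_between(db, "2023-01-01T06:00", "2023-01-01T12:00"))
```

The rigid record format mentioned above shows up here as the fixed two-column schema: changing it later would require a migration, which is part of the pre-planning cost.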
Audio, video and graphic files
Storing one or several multimedia files is simple. Complexities appear when you need to maintain a large number of files and want to search across the multimedia collection.
- Only very small and simple multimedia collections can be stored as regular files. Even for an average home collection, simple file-based multimedia storage becomes unmanageable very quickly. This is mostly due to the size of these files, the inability to handle annotations, tags or metadata, and the low speed of copying or relocation.
- Relational databases are a dubious way of storing audio, video or similar types of data. RDBMS are not well suited for keeping large BLOBs, especially large video files. Also, each type of data requires its own table (because a different set of metadata needs to be stored for each). On the other hand, RDBMS can be handy because they offer powerful search capabilities, which suits read-only collections well.
- Structured storages work very well for storing multimedia files when the storage supports metadata and fast search through it. If such search is not supported, structured storage becomes just a variant of the file system.
- Remote and distributed storages are among the best solutions for storing video, music or similar data. The storage represents a single unit where all elements of a multimedia product or video game can be safely stored. There is no risk of losing a single but important file. Searches are fast and efficient if the storage supports tags and metadata.
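The role of tags and metadata in search can be illustrated with a small sketch. The class below is a hypothetical in-memory catalogue (the name `MediaCatalog` and its methods are assumptions made for this example, not the API of any real storage product). It keeps only metadata and a tag index, while the media content itself is assumed to live elsewhere.

```python
from collections import defaultdict


class MediaCatalog:
    """Hypothetical tag index: item names and metadata only, no media bytes."""

    def __init__(self):
        self._metadata = {}              # item name -> metadata dict
        self._by_tag = defaultdict(set)  # tag -> names of items carrying it

    def remove(self, name):
        """Drop an item and its tag index entries, if the item is present."""
        old = self._metadata.pop(name, None)
        if old is not None:
            for tag in old["tags"]:
                self._by_tag[tag].discard(name)

    def add(self, name, tags, **metadata):
        """Register (or re-register) an item with its tags and metadata."""
        self.remove(name)  # avoid stale tag entries when an item is replaced
        self._metadata[name] = dict(metadata, tags=frozenset(tags))
        for tag in tags:
            self._by_tag[tag].add(name)

    def find(self, *tags):
        """Return the sorted names of items that carry every requested tag."""
        if not tags:
            return sorted(self._metadata)
        # .get avoids creating empty index entries for unknown tags.
        matches = [self._by_tag.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*matches))


if __name__ == "__main__":
    catalog = MediaCatalog()
    catalog.add("holiday.mp4", ["video", "family"], duration_s=95)
    catalog.add("theme.ogg", ["audio", "game"])
    catalog.add("intro.mp4", ["video", "game"])
    print(catalog.find("video", "game"))
```

A lookup touches only the tag index, never the media files, which is why a storage with this kind of support can search a large collection quickly while a plain directory of files cannot.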
Interim data
Temporary data are generated by software on the fly and usually have a limited validity period. Updates are typically very frequent. In addition, such interim information should stay easily accessible and intact and, in many cases, encrypted and secured. It is still possible to use regular files for these purposes, but this approach results in high resource consumption, offers no reliable way to control and enforce data integrity, and leaves encryption to be implemented by your own software.
- Files have long been used for interim data storage. They are quite suitable for storing low-priority, unsecured temporary data of insignificant size. Meanwhile, modern legislation in several countries demands more careful and responsible treatment of interim data. As a result, a regular file system becomes less suitable when data security, vulnerability and protection from tampering become paramount.
- Relational databases are not usually used for interim data storage because such data, as a rule, lack a clearly defined structure and interrelated elements. Slow updates and issues with compression and security add to this unsuitability. At the same time, a relational database can contain interim data related to the database itself and its operation. A database can also be used as a kind of data cache or for storing activity logs (journal files). An RDBMS is not well suited if the data must be stored long term (for years) and be signed or encrypted.
- Structured storages may be considered an optimal solution when a large volume of interim data needs to be stored, accessed, indexed, searched, compressed and encrypted on the fly. Structured storages may be built with anti-tampering functions or, should the requirements call for it, provide an easy way to remove or replace data. As always, such storages can be easily copied or moved with no need to take special care to preserve data integrity.
- ZIP archives are rarely used for interim data storage. The fast (as a rule) turnaround of interim data makes them impractical in most situations. An encrypted archive may be suitable for this type of data only when snapshots are to be stored for a long time and need to be protected from loss or tampering.
- Remote and distributed storages are used for interim data streams mainly because of space considerations. They do not provide the speed or the easy management and backup often required for interim data.
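The snapshot case mentioned for ZIP archives can be sketched as follows: interim files are packed into a compressed archive and an HMAC-SHA256 tag is computed over the archive bytes, so that later modification can be detected. This is only one possible approach, shown under stated assumptions: the function names and the key handling are hypothetical, and the sketch covers tamper detection only, not encryption, because Python's standard zipfile module cannot create encrypted archives.

```python
import hashlib
import hmac
import io
import zipfile


def snapshot(entries, key):
    """Pack {name: bytes} into a compressed ZIP; return (archive_bytes, tag)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(entries):
            zf.writestr(name, entries[name])
    data = buf.getvalue()
    # The tag covers the whole archive, so changing any byte invalidates it.
    tag = hmac.new(key, data, hashlib.sha256).hexdigest()
    return data, tag


def verify(data, tag, key):
    """Return True only if the archive bytes still match the stored tag."""
    expected = hmac.new(key, data, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, tag)


def read_entry(data, name):
    """Extract one named entry from the archive bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name)


if __name__ == "__main__":
    key = b"demo-key"  # assumption: a real key would come from secure storage
    logs = {"app.log": b"started\nstopped\n", "cache.bin": b"\x00\x01"}
    archive, tag = snapshot(logs, key)
    print(verify(archive, tag, key))
    print(read_entry(archive, "app.log"))
```

Because the whole archive has to be rebuilt and re-tagged whenever an entry changes, this fits long-lived snapshots rather than data with fast turnaround, which matches the limitation described above.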