Table of Contents

MongoDB

Intro

Definition: A database management system (DBMS) that uses a document-oriented data model. It is designed for high availability, easy scalability, and high performance, and it supports dynamic schema design.

Key features:

ACID

Historically, MongoDB did not support multi-document ACID transactions (multi-document updates that can be rolled back and are ACID-compliant), although it has always provided atomic operations on a single document. MongoDB 4.0 added support for multi-document transactions, combining the speed and flexibility of the document model with ACID data-integrity guarantees.

Normalization vs. Denormalization

Normalization: Dividing up data into multiple collections with references between collections. Denormalization: embedding all of the data in a single document.

Normalization gives an update-efficient data representation. Denormalization makes reads efficient.

Use embedded (denorm) when:

Use normalized data models:

Avoid normalization when it forces you to do lookups, since lookups are slow, especially on sharded collections.
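The trade-off above can be sketched with plain Python dicts (no MongoDB driver; the collection and field names are hypothetical). The embedded form answers "a user and their orders" with a single document read; the normalized form needs a join-like pass over a second collection:

```python
# Embedded (denormalized): everything in one document -- one read fetches all.
user_embedded = {
    "_id": "u1",
    "name": "Alice",
    "orders": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# Normalized: orders live in their own collection and reference the user.
user_normalized = {"_id": "u1", "name": "Alice"}
orders = [
    {"_id": "o1", "user_id": "u1", "sku": "A-100", "qty": 2},
    {"_id": "o2", "user_id": "u1", "sku": "B-200", "qty": 1},
]

# Reading the normalized form requires an application-side "join".
joined = [o for o in orders if o["user_id"] == user_normalized["_id"]]
```

Updating a single order is cheaper in the normalized form (one small document changes); reading the whole user is cheaper in the embedded form.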

Miscellanea

Cardinality: one-to-one, one-to-many, or many-to-many.

Namespace: The concatenation of the database name and the collection name => instasent-admin.sms

Documents: Data is stored in BSON documents. Documents that tend to share a similar structure are organized as collections. Advantages of documents:

-   Documents correspond to native data types in many programming languages.
-   Embedded documents and arrays reduce need for expensive joins.

Creating a schema:

-   Combine objects into one document if you use them together. Otherwise, separate them
-   Do joins on write, not on read
-   Optimize your schema for the most frequent use cases
-   Do complex aggregations in the schema

Profiler: MongoDB includes a database profiler which shows performance characteristics of each operation against the database. With this profiler you can find queries (and write operations) which are slower than they should be and use this information for determining when an index is needed.

ObjectID:

-   a 4-byte value representing the seconds since the Unix epoch,
-   a 5-byte random value, and
-   a 3-byte counter, starting with a random value.
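The 4/5/3-byte layout above can be decoded with the standard library alone; this is a minimal sketch that splits a 24-hex-character ObjectId string into its three parts (the sample id is made up):

```python
import datetime

def parse_object_id(oid_hex: str) -> dict:
    """Split a 12-byte ObjectId into timestamp, random value and counter."""
    raw = bytes.fromhex(oid_hex)
    assert len(raw) == 12, "an ObjectId is always 12 bytes"
    return {
        # first 4 bytes: seconds since the Unix epoch, big-endian
        "timestamp": datetime.datetime.fromtimestamp(
            int.from_bytes(raw[0:4], "big"), datetime.timezone.utc),
        # next 5 bytes: per-process random value
        "random": raw[4:9].hex(),
        # last 3 bytes: incrementing counter, random initial value
        "counter": int.from_bytes(raw[9:12], "big"),
    }

parts = parse_object_id("5f4e7a3b2c1d0e0f10111213")
```

Because the timestamp comes first, sorting on _id roughly orders documents by creation time.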

Covered query:

-   the fields used in the query are part of an index, and
-   the fields returned in the results are in that same index
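A covered query can be answered from the index alone, without touching the documents. This toy Python check (not MongoDB internals; names are illustrative) captures the two conditions, plus the detail that _id must be excluded from the projection unless it is in the index:

```python
def is_covered(index_fields, query_fields, projected_fields):
    """True when every queried field and every returned field is in the
    index, and _id is not projected (it would force a document fetch)."""
    idx = set(index_fields)
    return (set(query_fields) <= idx
            and set(projected_fields) <= idx
            and "_id" not in projected_fields)

# Index on (status, user); query by status, return only user -> covered.
covered = is_covered(["status", "user"], ["status"], ["user"])
# email is not in the index -> the query is not covered.
not_covered = is_covered(["status"], ["status"], ["user", "email"])
```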

Transaction: A logical, atomic unit of work that contains one or more statements. MongoDB (prior to 4.0) does not use traditional locking or complex transactions with rollback, as it is designed to be lightweight, fast, and predictable in its performance. By keeping transaction support extremely simple, performance is enhanced, especially in a system that may run across many servers.

Why are data files so large?: MongoDB does aggressive preallocation of reserved space to avoid file system fragmentation.

How does MongoDB provide consistency? MongoDB uses reader-writer locks: many readers can share access to a resource, such as a database or a collection, while writes take exclusive access.

Dot notation: used to access the elements of an array and the fields of an embedded document.
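The path-resolution rule behind dot notation can be sketched in plain Python: split on ".", treat numeric parts as array indexes, and everything else as field names (the sample document is made up):

```python
def get_path(doc, path):
    """Resolve a MongoDB-style dotted path against nested dicts/lists."""
    current = doc
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric part: index into an array
        else:
            current = current[part]       # otherwise: field of a sub-document
    return current

doc = {"name": "Ada", "tags": ["db", "nosql"], "address": {"city": "Madrid"}}
get_path(doc, "address.city")  # "Madrid"  (embedded document field)
get_path(doc, "tags.1")        # "nosql"   (array element by position)
```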

Indexes

Indexes are special data structures that store a small portion of the collection’s data set in an easy-to-traverse form. MongoDB automatically creates a unique index on the _id field. Index properties:

Indexing an array: an array field can be indexed in MongoDB (a multikey index). MongoDB indexes each value of the array, so you can query for individual items.
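The "one index entry per array element" idea can be mimicked with a toy inverted index in Python (the documents and field names are hypothetical, and real multikey indexes are B-trees, not hash maps):

```python
from collections import defaultdict

docs = [
    {"_id": 1, "tags": ["red", "blue"]},
    {"_id": 2, "tags": ["blue", "green"]},
]

index = defaultdict(set)
for d in docs:
    for value in d["tags"]:   # each array element gets its own index entry
        index[value].add(d["_id"])

# Querying for a single item finds every document whose array contains it.
index["blue"]  # {1, 2}
```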

Aggregation

Aggregations are operations that process data records and return computed results. Types:

What are the disadvantages of MongoDB?

Tips

Sharding

The procedure of splitting the data set into discrete parts. By putting a subset of the data on each machine, it becomes possible to store more data and handle more load without requiring larger or more powerful machines, just a larger quantity of less-powerful machines.
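How a subset of data ends up on each machine can be sketched with hash-based routing, a simplified stand-in for MongoDB's hashed shard keys (the key format and shard count are made up):

```python
import hashlib

def shard_for(shard_key: str, n_shards: int) -> int:
    """Route a document to a shard by hashing its shard key.
    The same key always lands on the same shard; different keys spread out."""
    digest = hashlib.md5(shard_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

# Each document is placed on exactly one of 4 shards.
placements = {k: shard_for(k, 4) for k in ["user:1", "user:2", "user:3"]}
```

Real MongoDB routes by chunks of shard-key ranges via mongos, but the core property is the same: the shard key deterministically decides where a document lives.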

Database systems with large data sets:

To address these issues of scale, database systems have two basic approaches:


Replication

The process of duplicating the data-set. Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication protects a database from the loss of a single server.

Replica set: a group of servers with one primary and multiple secondaries, servers that keep copies of the primary’s data. If the primary crashes, the secondaries elect a new primary from among themselves.


Oplog: a capped collection that keeps a rolling record of all operations that modify the data stored in your databases.

Journal: a feature of the underlying storage engine. Without a journal, if mongod exits unexpectedly, you must assume your data is in an inconsistent state. With journaling enabled, if mongod stops unexpectedly, the program can recover everything written to the journal, and the data remains in a consistent state.

GridFs

A mechanism for storing large binary files in MongoDB. GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB. GridFS does not have issues with storing large numbers of files in the same directory. It is used for storing and retrieving large files such as images, video files, and audio files. By default, it uses two collections, fs.files and fs.chunks, to store the file’s metadata and its chunks.
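The fs.files/fs.chunks split can be sketched without a driver: GridFS stores the file body as ordered chunks (255 KB each by default), keyed back to the metadata document by files_id. A minimal stand-alone version of the chunking step:

```python
CHUNK_SIZE = 255 * 1024  # GridFS default chunk size

def to_chunks(files_id, data: bytes):
    """Split raw bytes into ordered chunk documents, GridFS-style.
    Each chunk records its parent file id and its sequence number n."""
    return [
        {"files_id": files_id, "n": i, "data": data[off:off + CHUNK_SIZE]}
        for i, off in enumerate(range(0, len(data), CHUNK_SIZE))
    ]

# A payload of two full chunks plus 100 trailing bytes -> 3 chunk documents.
chunks = to_chunks("f1", b"x" * (CHUNK_SIZE * 2 + 100))
```

Reading a file back means fetching the chunks by files_id, sorting on n, and concatenating the data fields.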

Rollback: rollback can fail if there is more than 300 MB of data or about 30 minutes of operations to roll back. In those cases, you must re-sync the node that is stuck in rollback.