Big Data: Basic concepts
1. Why today is the age of big data
Simply put: massive data combined with cloud computing has brought about the age of big data.
2. Where does big data come from?
There are three main sources:
• Structured data generated by machines.
• Unstructured data generated by humans.
• Mixed data generated by organizations.
Example of machine-generated data: cash register receipts, which have a fixed format.
Examples of unstructured data generated by humans: comments on social platforms, uploaded images, videos, and so on.
Example of data generated by an organization: a supermarket holds all of its purchase and sales data, its customer shopping data, and the reviews on its official website, so it has both structured and unstructured data.
3. How big data generates value
Value comes from combining different types of data sources. Take the supermarket example: by combining purchase and sales data, customer shopping data, and public opinion data monitored on social networks, the supermarket can forecast sales for the next few days and then develop marketing strategies to increase sales.
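As a rough illustration of combining sources, the sketch below joins structured sales figures with a sentiment score derived from unstructured reviews, then produces a toy forecast. All numbers, names (`combine_sources`, `naive_forecast`) and the `sensitivity` parameter are made up for illustration; a real forecast would use a proper statistical model.

```python
# Hypothetical, made-up daily figures for one supermarket (illustration only).
sales = {"day-1": 120, "day-2": 135, "day-3": 150}      # units sold (structured data)
sentiment = {"day-1": 0.2, "day-2": 0.5, "day-3": 0.7}  # review sentiment in [-1, 1] (from unstructured text)


def combine_sources(sales, sentiment):
    """Join the two sources on their shared day key."""
    shared_days = sorted(set(sales) & set(sentiment))
    return [(day, sales[day], sentiment[day]) for day in shared_days]


def naive_forecast(rows, sensitivity=0.1):
    """Toy forecast: average past sales, nudged by the latest sentiment.

    'sensitivity' is an assumed tuning knob, not a value from any real model.
    """
    avg_sales = sum(units for _, units, _ in rows) / len(rows)
    latest_sentiment = rows[-1][2]
    return avg_sales * (1 + sensitivity * latest_sentiment)


rows = combine_sources(sales, sentiment)
forecast = naive_forecast(rows)
print(f"Forecast for the next day: {forecast:.1f} units")
```

Neither source alone supports this forecast: the sales data gives the baseline, and the sentiment data adjusts it.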
4. Definition of "big data" - 6 "Vs"
Big data is commonly characterized along six dimensions. The name of each dimension begins with the letter "V", so together they are known as the six "Vs": Volume (scale), Velocity (speed), Variety (diversity), Veracity (quality), Valence (connectedness), and Value.
• Volume (scale): The huge amount of data generated every day
• Velocity (speed): Data is being generated faster and faster
• Variety (diversity): The range of data formats, such as text, speech, pictures, and so on
• Veracity (quality): The quality of the data can vary greatly
• Valence (connectedness): How the items of big data are linked to one another
• Value: Processing the data can yield unexpected insights that generate value
5. The five "Ps" of data science
The discipline of using big data to generate value is called "data science". It can be described by five "Ps": Purpose, People, Process, Platforms, and Programmability.
• Purpose: The problem or challenge that you want to solve with big data
• People: Data scientists often have skills in a variety of fields, including science or business knowledge, statistics, machine learning and mathematics, data management, programming, and computing. In practice, data science is usually done by teams of scientists with complementary skills.
• Process: How the team communicates, which techniques are used, which workflows are followed, and so on
• Platforms: Which computing and storage platforms are used
• Programmability: Data science relies on programming languages such as R and on programming patterns such as MapReduce
6. Ask questions
It's important to ask the right questions before you start analyzing your data. As the saying goes: a problem well defined is a problem half solved.
7. The workflow for data analysis
The workflow for data analysis consists of five main steps:
• Acquire data
• Prepare data: Includes data exploration and pre-processing
• Analyze data: The process of building a model
• Report results: Visualize the conclusions drawn from the data
• Act on conclusions: Present the insights and turn them into actions
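The five steps can be sketched as a chain of small functions. This is a minimal sketch with made-up data, in which the "model" is just an average; the function names and the `threshold` parameter are hypothetical and chosen only to mirror the steps above.

```python
from statistics import mean


def acquire():
    # Step 1: acquire data. Hypothetical raw records; None marks a missing value.
    return [12.0, 15.5, None, 14.0, 13.5]


def prepare(raw):
    # Step 2: prepare data. Pre-processing here means dropping missing values.
    return [x for x in raw if x is not None]


def analyze(clean):
    # Step 3: analyze data. The "model" in this sketch is simply the mean.
    return {"n": len(clean), "mean": mean(clean)}


def report(summary):
    # Step 4: report results. A text summary stands in for a visualization.
    return f"{summary['n']} valid records, mean = {summary['mean']:.2f}"


def act(summary, threshold=14.0):
    # Step 5: act on conclusions. Turn the result into a concrete decision.
    return "restock" if summary["mean"] >= threshold else "hold"


summary = analyze(prepare(acquire()))
print(report(summary))
print(act(summary))
```

Each step consumes only the output of the previous one, which is what makes the workflow easy to rerun when the data or the question changes.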
8. What is a distributed file system
Physically, a distributed file system is a set of racks (cabinets) full of hosts. It stores a file by first splitting it into n blocks (five in the figure, for example), then replicating each block and storing the replicas on different hosts in different racks.
Why would you do that?
There are three main benefits:
• Data scalability: When storage runs short, capacity can be expanded by adding more disks or hosts
• Fault tolerance: If a host or an entire rack (cabinet) goes down, data loss or a system outage is unlikely, because other replicas remain available
• High concurrency: The data can be processed in parallel
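The storage scheme and its fault-tolerance benefit can be illustrated with a small simulation. This is a simplified sketch, not the placement policy of HDFS or any real system: the rack and host names, the round-robin placement rule, and the helper functions are all assumptions made for illustration.

```python
# Hypothetical cluster layout: rack (cabinet) name -> hosts in that rack.
RACKS = {
    "rack-1": ["host-1a", "host-1b"],
    "rack-2": ["host-2a", "host-2b"],
    "rack-3": ["host-3a", "host-3b"],
}


def split_into_blocks(data, block_size):
    """Cut the file contents into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, racks, replicas=2):
    """Assign each block to `replicas` hosts, each on a different rack (round-robin)."""
    rack_names = sorted(racks)
    if replicas > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for block_id in range(num_blocks):
        locations = []
        for r in range(replicas):
            rack = rack_names[(block_id + r) % len(rack_names)]
            hosts = racks[rack]
            locations.append((rack, hosts[block_id % len(hosts)]))
        placement[block_id] = locations
    return placement


def survives_rack_failure(placement, failed_rack):
    """True if every block still has a replica outside the failed rack."""
    return all(
        any(rack != failed_rack for rack, _ in locations)
        for locations in placement.values()
    )


data = b"0123456789"
blocks = split_into_blocks(data, block_size=2)  # five blocks, as in the text's example
placement = place_replicas(len(blocks), RACKS, replicas=2)
for block_id, locations in placement.items():
    print(block_id, blocks[block_id], locations)
print(all(survives_rack_failure(placement, rack) for rack in RACKS))
```

Because every block lives on two different racks, losing any single rack still leaves one readable replica of each block, and different blocks can be read from different hosts at the same time.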
9. The Hadoop ecosystem
Hadoop is a collection of software frameworks for distributed storage, cloud computing, big data processing, and more. We explain Hadoop using the "cascade structure" below. In a cascade structure, each layer depends on the resources provided by the layer beneath it. In the figure below, b and c depend on the resources provided by a, and there is no dependency between b and c.
The Hadoop ecosystem can be represented by the following "cascade structure" diagram:
Here is a brief introduction to a few of the components; the rest are left for the reader to explore.
• HDFS: Distributed file system for storage, the foundation of almost all upper-layer applications
• YARN: Resource manager that allocates the underlying cluster resources and schedules jobs
• MapReduce: A simple programming model whose jobs run on resources provisioned by YARN
• Hive: High-level programming model based on SQL-like queries
• Pig: High-level programming model based on data flow scripts
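To give a feel for the MapReduce programming model, here is a word count written in plain Python. It is a single-process sketch of the map, shuffle, and reduce phases and does not use Hadoop's actual API; the function names are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big value", "data moves fast"])
print(counts)
```

In a real cluster the map calls would run in parallel on the hosts that hold each block of the input, and the reduce calls would run in parallel per key; here everything runs sequentially to keep the idea visible.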