Many people have done data analysis, but the first step, obtaining the data, is easy to overlook. Data extraction is the basis of data analysis: without data, analysis is meaningless. How many data sources we have, how much data we have, and how good its quality is will often determine the quality of our output.
The trend in any data set is influenced by multiple dimensions. We need to collect as many of those dimensions as possible through multi-source data collection, while ensuring data quality, so that we can obtain high-quality analysis results.
There are many tools for data analysis, such as FineReport, Tableau, Power BI, etc.
So, from the data collection point of view, what are the data sources? I divide them into the following four categories.
1. Open data sources (government, universities, and enterprises)
2. Crawler scraping (web and applications)
3. Log collection (front-end capture, back-end scripts)
4. Sensors (image, speed, thermal)
The four types of data sources are open data sources, crawler scraping, log collection, and sensors. They all have their own characteristics.
Open data sources are generally industry-specific databases. For example, the US Census Bureau has opened up data on population, regional distribution, and education in the United States. In addition to governments, enterprises and universities also publish datasets. Many studies are based on open data sources, because you need the same dataset to compare the quality of different algorithms.
The third type of data source is log collection, which records user operations for statistical purposes. We can track events on the front end, collect them with scripts and compute statistics on the back end, and analyze how the website is accessed.
The last is the sensor, which basically collects physical information, such as images, videos, or the speed, heat, pressure, etc. of an object. Since the main emphasis of this article is data collection, this method will not be described.
Now that we know there are four types of data sources, how do we collect data from them?
How to Use Open Data Sources
The following table shows some authoritative open data sources.
If you are looking for a data source in a certain field, such as the financial sector, you can check whether governments, universities, or enterprises have published open data for it.
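Many open data portals publish their files as CSV. As a rough sketch of what working with such a file can look like, the snippet below parses CSV text with Python's standard `csv` module. The column names and values are made up for illustration and do not come from any real dataset; in practice you would read the file you downloaded instead of the inline sample.

```python
import csv
import io

# Made-up sample standing in for a downloaded open-data CSV file.
SAMPLE_CSV = """region,population,median_age
North,120000,34.1
South,98000,38.7
West,143500,31.9
"""


def load_rows(text):
    """Parse CSV text into dicts, converting the numeric columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "region": row["region"],
            "population": int(row["population"]),
            "median_age": float(row["median_age"]),
        }
        for row in reader
    ]


rows = load_rows(SAMPLE_CSV)
# Two simple questions we can already answer from the loaded data.
largest = max(rows, key=lambda row: row["population"])
total_population = sum(row["population"] for row in rows)
print(largest["region"], total_population)
```

Converting the numeric columns right after loading keeps later analysis steps free of string handling.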
How to Use Crawlers to Scrape Data
Crawler scraping should be the most common way to collect data, for example when you want review data for restaurants. Of course, we must pay attention to copyright here, and many websites also have anti-crawling mechanisms.
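One common courtesy before crawling is to check a site's robots.txt file, which lists the paths that crawlers are asked to stay away from. Here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the crawler name `my-crawler`, and the example.com URLs are all made up for illustration; in practice you would load the site's real robots.txt. Passing this check does not settle copyright questions, and it does not get you around anti-crawling mechanisms.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a robots.txt parser from text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# "my-crawler" is a hypothetical user agent name for our crawler.
allowed = parser.can_fetch("my-crawler", "https://www.example.com/reviews")
blocked = parser.can_fetch("my-crawler", "https://www.example.com/private/data")
print(allowed, blocked)
```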
The most straightforward way is to write crawler code in Python, which of course requires learning the basic syntax of Python. PHP can also be used to write a crawler, but it is not as well suited as Python, especially when it comes to multi-threaded operations.
With a Python crawler, you basically go through three steps.
1. Crawl content using Requests. Requests is an HTTP library for Python and an excellent tool for crawlers. It makes it very convenient to fetch the data in a web page, which can save us a lot of time.
2. Parse the content using XPath. XPath stands for XML Path Language. It is a language for locating parts of an XML document and is often used as a small query language in development. XPath can select nodes by element and by attribute.
3. Save your data with Pandas. Pandas is a library that provides advanced data structures to make data analysis easier, and we can use it to hold the crawled data. Finally, the data is written through Pandas to a file such as XLS or to a database such as MySQL.
Requests, XPath, and Pandas are three useful tools for Python crawlers. Of course, there are many other tools for writing crawlers, such as Selenium, PhantomJS, or Puppeteer.
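The fetch, parse, and save flow can be sketched in a few lines. To keep the sketch free of third-party dependencies, it swaps in standard-library stand-ins: `urllib.request` instead of Requests, the limited XPath subset of `xml.etree.ElementTree` instead of a full XPath engine, and `csv` instead of Pandas. The page fragment, its class names, and the restaurant names are made up. Real web pages are often not well-formed XML, so a real crawler would usually need a more forgiving HTML parser.

```python
import csv
import io
import urllib.request
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment standing in for a downloaded page.
SAMPLE_PAGE = """
<html><body>
  <div class="review"><span class="name">Noodle House</span><span class="score">4.5</span></div>
  <div class="review"><span class="name">Cafe Blue</span><span class="score">3.8</span></div>
</body></html>
"""


def fetch(url):
    """Step 1: download a page (not called here, to keep the sketch offline)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def parse_reviews(page):
    """Step 2: select nodes by element and attribute with XPath-style paths."""
    root = ET.fromstring(page.strip())
    rows = []
    for node in root.findall(".//div[@class='review']"):
        name = node.find("span[@class='name']").text
        score = float(node.find("span[@class='score']").text)
        rows.append({"name": name, "score": score})
    return rows


def to_csv(rows):
    """Step 3: serialize the rows as CSV text, ready to write to a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "score"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


reviews = parse_reviews(SAMPLE_PAGE)
print(to_csv(reviews))
```

Keeping the three steps in separate functions makes it easy to replace any one of them, for example swapping `fetch` for a Requests call or `to_csv` for a Pandas export.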
In addition, we can also crawl the webpage information without programming. Here are three commonly used crawlers.
crawling, cookies, Sessions, etc. The application can analyze and retrieve data
from a website and convert it into meaningful data. It can also use machine
learning techniques to identify complex documents and export them to JSON, CSV,
Google Sheets, and more.
How to Use Log Collection Tools
The biggest role of log collection is to improve system performance by analyzing user access patterns, thereby increasing the load the system can carry. Discovering load bottlenecks in time also helps technical personnel optimize the system based on actual user access patterns.
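As a small illustration of spotting load peaks, the sketch below counts requests per minute and reports the busiest minute. The timestamps and their format are hypothetical; in a real system they would be extracted from the collected logs.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request timestamps, as they might be extracted from a log.
REQUEST_TIMES = [
    "2024-01-15 09:00:05",
    "2024-01-15 09:00:41",
    "2024-01-15 09:01:12",
    "2024-01-15 09:01:30",
    "2024-01-15 09:01:58",
    "2024-01-15 09:02:03",
]


def requests_per_minute(timestamps):
    """Count how many requests fall into each calendar minute."""
    counts = Counter()
    for stamp in timestamps:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        counts[moment.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


counts = requests_per_minute(REQUEST_TIMES)
peak_minute, peak_count = counts.most_common(1)[0]
print(peak_minute, peak_count)
```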
The log records the entire process of a user’s visit to the website: who visited at what time, through which channel (such as a search engine or direct URL input), and what operations they performed; whether the system generated errors; and even the user’s IP address, HTTP request time, user agent, etc.
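To make these fields concrete, here is a minimal sketch that pulls them out of a single access log line. It assumes the server writes the widely used "combined" log format; the sample line, IP address, and URLs are made up, and your own server's format may differ.

```python
import re

# Pattern for one line in the "combined" access log format (an assumption
# about the server configuration, not something every server uses).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Made-up log line for illustration.
SAMPLE_LINE = (
    '203.0.113.7 - - [15/Jan/2024:09:01:12 +0000] '
    '"GET /menu HTTP/1.1" 200 5120 '
    '"https://www.example.com/search?q=noodles" "Mozilla/5.0"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


entry = parse_line(SAMPLE_LINE)
print(entry["ip"], entry["time"], entry["path"], entry["status"], entry["referrer"])
```

The referrer field is what tells you the channel a visitor came through, such as a search engine results page.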
Log collection can be divided into two forms.
1. Collected through the web server. For example, httpd, Nginx, and Tomcat all have their own logging functions. In addition, many Internet companies have their own tools for massive data collection, mostly used for system logs, such as Hadoop’s Chukwa, Cloudera’s Flume, and Facebook’s Scribe. These tools use distributed architectures and can meet log collection and transmission requirements of hundreds of MB per second.
What is Event Tracking?
Event tracking is a key step in log collection.
Event tracking means collecting the relevant information at the locations you set and reporting it. For example, it can capture the access status of a page, including user information and device information, or the user’s behavior on the page, including how long they stay. Each tracking point is like a camera: it collects user behavior data that can then be analyzed along multiple dimensions to faithfully reconstruct usage scenarios and user needs.
So how do we track events?
Event tracking works by embedding statistical code wherever you need statistics. You can write the embedded code yourself, or you can use third-party statistical tools.
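If you write the tracking code yourself, the core of it is small: record what happened, who did it, and when, then report it. The sketch below is one possible shape for that, not a standard API. The class name `EventTracker`, the event names, and the property names are all hypothetical, and a real tracker would send the events to a collection server instead of keeping them in memory.

```python
import json
import time


class EventTracker:
    """Collects tracking events in memory and serializes them as JSON lines."""

    def __init__(self):
        self.events = []

    def track(self, name, user_id, **properties):
        """Record one event with its user, timestamp, and extra properties."""
        event = {
            "event": name,
            "user_id": user_id,
            "timestamp": time.time(),
            "properties": properties,
        }
        self.events.append(event)
        return event

    def dump(self):
        """Return all recorded events, one JSON document per line."""
        return "\n".join(json.dumps(event, sort_keys=True) for event in self.events)


tracker = EventTracker()
# Hypothetical events: a page visit and how long the user stayed.
tracker.track("page_view", user_id="u-1", page="/menu", device="mobile")
tracker.track("page_leave", user_id="u-1", page="/menu", stay_seconds=42)
print(tracker.dump())
```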
I have talked before about the principle of “do not reinvent the wheel”. The market for event tracking tools is quite mature, and I can recommend some third-party tools, such as Google Analytics and TalkingData. They all use front-end tracking, and the user’s behavior data can then be viewed in the third-party tool. But if we want to see deeper user behavior, we need to customize the event tracking settings.
To sum up, log collection helps us understand users’ operational data and is suitable for scenarios such as operation and maintenance monitoring, security auditing, and business data analysis. A typical web server has its own logging capabilities, or you can use Flume to collect, aggregate, and transfer large volumes of log data from different server clusters. Of course, we can also use third-party statistical tools or custom event tracking to get the statistics we want.
Data extraction is the key to data analysis. Sometimes we use Python web crawlers to solve the problem, but data collection methods are in fact very varied. Some data is available directly from open data sources: for example, historical Bitcoin price and transaction data can be downloaded directly from Kaggle, so you don’t need to crawl it yourself.
On the other hand, the data that needs to be collected differs according to our needs. In the transportation industry, for example, data collection involves cameras or speedometers. For operations personnel, log collection and analysis are the key points. So we need to choose the right collection tool for the specific business scenario.