Spark and Big Data Application

Digital data is growing rapidly in volume as a result of the innovations taking place in the digital world. These data quantities are huge and come from diverse sources, including smartphones and social networks, and relational databases and traditional analytical techniques are incapable of storing or processing them. It is from this characteristic that such data gets the name Big Data. Traditional machine learning cannot handle the volume and velocity of the data, so the right system and programming model are required to complete the analysis. Big data comes in different forms, and its distribution and analysis are made possible by data frameworks such as Hadoop and Spark. Big data analytics can also benefit from different types of Artificial Intelligence (AI), such as machine learning. The combined use of big data tools and AI techniques has changed the face of data analysis. This paper examines a big data computing model based on Spark and the applications of big data. Additionally, it highlights big data platforms.

Big Data Platforms

Big data platforms fall into several categories, including batch processing, real-time (stream) processing and interactive analytics. Batch processing involves extensive computations that also require time for data processing. The most common batch processing platform is Apache Hadoop; it is scalable, cost-effective, flexible and fault-tolerant for big data processing. The Hadoop ecosystem includes components such as the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN) and MapReduce (Rahmani et al., 2021). These components operate across the big data value chain, from aggregating and storing to processing and managing big data. Among stream processing platforms, Apache Spark and Storm are the most popular; such stream processing applications play a large part in weather and transport systems. Interactive analytics platforms make it possible to access datasets remotely and use them for various operations, and users can also access the system directly and interact with the data. An example of an interactive analytics platform is Apache Drill.

Apache Spark Big Data Computing

Apache Spark is a memory-based distributed computing framework whose origin traces back to the University of California, Berkeley in 2009, and it has since become a powerful tool capable of large-scale processing, unlike many other tools. Spark Core is the basis of the features found in Spark. Spark offers three main application programming interfaces (APIs), namely Java, Python and Scala, and the core API defines the resilient distributed dataset (RDD), the fundamental data abstraction of its original programming model. Apache Spark supports building and running fast, sophisticated high-level applications on Hadoop. The Spark engine supports tools for SQL queries and streaming data applications and manages complex analytics such as machine learning and graph algorithms. The big data ecosystem based on the Spark platform includes Spark Runtime, Spark SQL, GraphX, Spark Streaming and the MLlib algorithm library (Chen, 2021).
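To make the RDD abstraction concrete, the following minimal sketch uses the PySpark API (the application name and data are illustrative and not taken from the cited sources): a collection is distributed across the cluster, a lazy transformation is applied, and an action triggers the computation.

from pyspark import SparkContext

# Hypothetical example: distribute 1..1000 over 8 partitions and sum the squares.
sc = SparkContext(appName="rdd-example")
nums = sc.parallelize(range(1, 1001), 8)      # RDD partitioned across the cluster
squares = nums.map(lambda x: x * x)           # lazy transformation, nothing runs yet
print(squares.reduce(lambda a, b: a + b))     # action triggers the distributed job
sc.stop()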

Spark Runtime provides core functionality such as task scheduling and memory management. Data is transmitted inside Spark in the RDD structure, and the core processing logic of Spark can be described in three steps. The first step divides the data into many subsets and transmits these subsets to nodes in the cluster for processing. The second step protects the results after calculation: because a recomputation produces the same content as the original result, the file content is also easy to back up. The final step applies only when a calculation error occurs in one of the subsets, in which case that subset alone is recomputed to eliminate the error.
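A small sketch of these three steps, again in PySpark (the partition count and data are illustrative): the data is split into partitions, the derived result is persisted, and the recorded lineage is what Spark uses to recompute a single lost or faulty partition.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-lineage")
data = sc.parallelize(range(100), 4)           # step 1: split into 4 subsets (partitions)
result = data.map(lambda x: x * 2).persist()   # step 2: keep the computed result for reuse/backup
print(result.getNumPartitions())               # 4
print(result.toDebugString().decode())         # lineage used in step 3 to recompute a failed partition
sc.stop()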

Spark SQL was developed from Shark. Shark reuses Hive's HQL parsing, logical execution plan translation and execution plan optimization to achieve Hive compatibility; one could argue that it merely replaces the physical execution plan with a Spark job. It also remains dependent on the Hive Metastore and Hive SerDes. For Hive compatibility, by contrast, Spark SQL only reuses the HQL parser, the Hive Metastore and the Hive SerDes. Stated differently, Spark SQL takes over once HQL has been parsed into an abstract syntax tree (AST); the Catalyst optimizer is then in charge of creating and optimizing the execution plan. Catalyst is considerably easier to build in Scala thanks to its functional language capabilities, such as pattern matching.
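The plans Catalyst produces can be inspected directly. The sketch below (PySpark; the table name and data are made up for illustration) registers a temporary view, issues a SQL query and prints the parsed, analyzed and optimized logical plans together with the physical plan.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
spark.createDataFrame(
    [("eng", 100.0), ("eng", 120.0), ("hr", 90.0)],
    ["dept", "salary"],
).createOrReplaceTempView("employees")          # hypothetical table
df = spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept")
df.explain(True)   # shows the plans generated and optimized by Catalyst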

Among the many benefits of Spark SQL, integration is the most noteworthy: Spark SQL seamlessly combines Spark programs with SQL queries. Users may query structured data in a Spark program using either the familiar DataFrame API or SQL, and it is compatible with R, Python, Scala and Java. This benefit can increase the effectiveness of analysis work. The second benefit is Hive integration, which allows SQL or HiveQL queries to run on pre-existing warehouses. Unified data access is a third benefit, since every data source is connected to in an identical manner: DataFrames and SQL give access to a range of sources such as Hive, Avro, Parquet, ORC, JSON and JDBC, and data from different sources can even be combined. The final advantage is standard connectivity: server mode supports industry-standard JDBC and ODBC connections for business intelligence products.
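As a sketch of unified data access (PySpark; the file paths, column names and join key are hypothetical), the same query can be written over files of different formats either through SQL or through the DataFrame API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-access").getOrCreate()
patients = spark.read.json("patients.json")        # hypothetical JSON source
visits = spark.read.parquet("visits.parquet")      # hypothetical Parquet source
patients.createOrReplaceTempView("patients")
visits.createOrReplaceTempView("visits")

# SQL form ...
spark.sql("""
    SELECT p.name, COUNT(*) AS visit_count
    FROM patients p JOIN visits v ON p.id = v.patient_id
    GROUP BY p.name
""").show()

# ... and the equivalent DataFrame form
patients.join(visits, patients.id == visits.patient_id) \
        .groupBy("name").count().show()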

The algorithms in the MLlib library are computationally complex. In a disk-based processing model, iterative computation demands a great deal of CPU power and time, because all intermediate calculations must be written to disk while waiting for the next task to begin.

On the Spark platform, a portion of the work can be completed entirely in memory: the iterative part of a calculation task is moved straight into memory, which makes iterative calculation far more efficient, while disk and network operation remain available under specific conditions. In conclusion, Spark offers even more impressive benefits for iterative computing and can serve as a platform for distributed machine learning. A collaborative filtering algorithm is used to learn users' preferences before recommendations are shared with the users.
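A minimal sketch of why in-memory iteration helps, in PySpark (the data and the simple gradient update are invented for illustration): the input RDD is cached once and then reused by every iteration instead of being reread from disk.

from pyspark import SparkContext

sc = SparkContext(appName="iterative-cache")
data = sc.parallelize([1.0, 2.0, 4.0, 8.0, 16.0]).cache()   # kept in memory across iterations
n = data.count()
estimate = 0.0
for _ in range(20):                                          # simple iterative estimate of the mean
    grad = data.map(lambda x: estimate - x).sum() / n
    estimate -= 0.5 * grad
print(estimate)                                              # converges toward 6.2
sc.stop()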

The algorithm is applied in the following steps. The first is system filtering: after people with similar interests are selected, objects are chosen and categorized based on their preferences, creating a new set or sequence. The users in this process can be thought of as neighbours, and it is best to organize and make use of related users in order to determine the best execution strategy. The second is cooperative filtering: the integration of user preferences is the essential component of the final suggestion, completed through the phases of gathering user preferences, analyzing similarity and recommending based on the computed results. To gather behaviour data accurately, the first consideration is choosing a user system, then organizing it according to user behaviour, and subsequently processing the data before issuing recommendations on what users may like given their preferences.
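As a concrete sketch of collaborative filtering on Spark (using MLlib's ALS implementation in PySpark; the user IDs, item IDs and ratings below are invented), the model is trained on explicit ratings and then produces top-N recommendations per user.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("cf-als").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 2.0), (2, 11, 3.0)],
    ["userId", "itemId", "rating"],
)
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(3).show(truncate=False)   # top-3 items per user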

The Spark Streaming module extends Spark to streaming data: it divides the incoming stream into batches according to a time interval, and each batch eventually forms an RDD. Because the streaming data is processed in small intervals, there is only a small processing delay, which makes this a near-real-time processing system. The advantages of Spark Streaming include its response to faults, since its strong fault tolerance makes it capable of handling errors and recovering data. Another benefit is its ability to connect to the other relevant Spark modules. A further advantage, apart from handling the flow of data itself, is its ability to deal with complex processing.
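The sketch below shows the micro-batch model in PySpark's classic DStream API (the host, port and 5-second batch interval are arbitrary choices for illustration; newer Spark versions favour Structured Streaming): each batch of lines arriving on a socket becomes an RDD, and word counts are printed per batch.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-wordcount")
ssc = StreamingContext(sc, 5)                       # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)     # hypothetical text source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # one output per batch/RDD
ssc.start()
ssc.awaitTermination()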

Another key component of Spark is GraphX, which is built on large-scale graph computing and enables Spark to process large graph datasets using GraphX's properties. An analysis of GraphX shows that it provides a rich set of data operators, including core and optimization operators. GraphX can also meet the graph-operation needs of distributed clusters and exposes sufficient API interfaces. According to Chen (2021), GraphX plays a critical role in Spark as it improves its data absorption and scale.

Applications of Big Data

Medical Application

Healthcare facilities and medical organizations use big data to anticipate potential health risks and prevent them. The healthcare system derives large benefits from large-scale data analysis. For example, analytics has made it possible to diagnose diseases such as breast cancer in their early stages. It has also supported the processing of medical images and medical signals to provide high-quality diagnosis and monitoring of patient symptoms. Additionally, tracking chronic diseases such as diabetes has become easier. Other uses are preventing the incidence of contagious diseases, education through social networks, genetic data analysis and personalized medicine. The data types include biomedical data, web data, data from electronic health records (EHRs) and hospital information systems (HISs), and omics data, which covers genomics, transcriptomics, proteomics and pharmacogenomics. Rich data such as demographics, test results, diagnoses and personal information are included in the EHR and HIS data (Ismail Ebada et al., 2022). Consequently, the significance of health data analysis has received attention, and scholars have gained the means to control, manage and process big data courtesy of tools like Apache Spark.

Apache Spark's in-memory computation enables rapid processing of large healthcare datasets, both unstructured and structured. It can analyze data approximately 100 times faster than typical MapReduce approaches. Spark's lambda architecture enables both real-time and batch processing. The tool analyzes streaming healthcare data and includes a library for handling Twitter data streams. Many messages on social networks relate to health, and such networks serve as a communication channel. Twitter, for instance, provides real-time public health statistics that are inexpensive to obtain, since real-time disease surveillance depends entirely on mining information through the Twitter API (Nair et al., 2018). Spark's machine learning package, MLlib, can help build decision trees for illness prediction. Spark analyzes large amounts of data as they stream and outperforms Hadoop in streaming thanks to its memory-based processing capabilities: data and intermediate outcomes are kept in memory, which avoids the input-output delays caused by data moving back and forth on the hard disk.

Spark uses resilient distributed datasets (RDDs), which are immutable object collections stored in memory and partitioned among nodes for parallel calculations. Spark is a batch-processing engine that can handle streaming data from several sources, including Twitter: the incoming data streams are divided into small batches, which the Spark engine then processes. Discretized streams (DStreams) are a series of RDDs that represent a continuous data stream in Spark Streaming, and operations on DStreams are computed by the Spark engine as transformations on the underlying RDDs. MLlib provides popular learning methods for solving machine learning problems, including collaborative filtering, clustering and classification.
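As a sketch of the decision-tree approach to illness prediction mentioned above (PySpark MLlib; the patient features, values and risk labels are entirely fabricated for illustration), a classifier is trained on structured records and then applied to produce predictions.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("health-dt").getOrCreate()
# Hypothetical patient records: age, systolic blood pressure, glucose, label (1 = at risk)
records = spark.createDataFrame(
    [(63, 145, 180, 1), (41, 120, 95, 0), (55, 160, 210, 1), (33, 110, 88, 0)],
    ["age", "bp", "glucose", "label"],
)
assembler = VectorAssembler(inputCols=["age", "bp", "glucose"], outputCol="features")
train = assembler.transform(records)
model = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3).fit(train)
model.transform(train).select("age", "label", "prediction").show()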

Natural Disaster Prediction and Management

The use of big data is a crucial part of responding to natural disasters. Among the applications of these tools are preventing disasters, providing relief in times of disaster and managing disasters during emergencies. In the era of the Internet of Things (IoT), the complexity of natural disaster management has been reduced. The AI Spark Big Model has enabled a new way of thinking, a new technical system and an innovative capability for dealing with complex disaster information, resulting in an effective approach to comprehensive disaster prevention and reduction. Big data has strategic significance, and together with its core values it is mainly composed of three concepts. The first concerns the level of strategic thinking and applies when the disaster problem goes beyond the ability of contemporary scientific understanding; such problems are high-frequency, high-intensity disasters that exceed what the world defines as basic disasters.

Additionally, comprehensive disaster reduction goes beyond the national strategy. The second concept refers to the level of innovation and development of information science and technology, and describes a situation where disaster big data poses challenges to traditional information science and technology; at this point, the formation of technology systems and new theories, with complex information at their core, accelerates. The third concept defines the level of social innovation and development: here, disaster big data is the core capability for ensuring that the world has Internet access, is smartly secured and can reduce disasters using these smart strategies.

Disaster cloud computing combines the ability to efficiently monitor disaster information, standardize data, predict abnormalities and provide early warning of disaster signs with existing methods for assessing disaster risks and all kinds of disasters (Chen, 2021). Because it handles early warning, analyzes how to manage emergencies, provides emergency information services and computes other models, it can be built as a large distributed environment on shared infrastructure. Its major purpose is to provide safe and fast processing of disaster cloud data and to predict and issue early warnings. It also provides a disaster risk assessment model, an emergency management service model and other related services.

Face Recognition

Face recognition is another application of big data analytics tools like Apache Spark. Businesses can use big data analysis systems to detect and recognize faces and to identify the gender, age, facial expressions and gaze of people watching advertising machines and other digital signage equipped with face recognition technology. The information that big data processing technology collects becomes an instrumental part of analyzing customer preferences. In the end, it helps provide customer-friendly services according to gender, age and emotional state. It generally facilitates the connection between people and things, and can therefore recommend the advertisements and messages that particular individuals prefer. It is also useful for non-manual, automatic recognition of faces, eyes and any other information that could help predict subsequent operations.

Conclusion

The paper addresses various issues, starting with the definition of big data and an analysis of the types of platforms that make up big data. It also summarizes the components of Apache Spark by describing Spark SQL, GraphX, the MLlib algorithm library and Spark Streaming and how they operate. It still remains to be determined which platforms are stable, so it is advisable to use tools that have undergone testing and proven their scalability, flexibility and stability. Additionally, the essay introduces the reader to the areas where big data plays a crucial role. These include its use in the medical field for the prediction, analysis, prevention and management of various diseases. Other areas include the management and prediction of natural disasters and face recognition, among others. Technology is still evolving, and big data modelling could still use improvements such as enhanced real-time data processing, which tools like Apache Spark have already made possible with the help of Twitter APIs. Most industries are also moving toward big data processing systems that are accurate and precise in order to enhance their economic and social benefits. Batch processing, the most common form of big data processing, still needs to meet the data-processing frequency requirements that these industries have.

References

  1. Chen, S. (2021). Research on Big Data Computing Model based on Spark and Big Data Applications. Journal of Physics: Conference Series, 2082(1), 012017. https://doi.org/10.1088/1742-6596/2082/1/012017
  2. Ismail Ebada, A., Elhenawy, I., Jeong, C.-W., Nam, Y., Elbakry, H., & Abdelrazek, S. (2022). Applying Apache Spark on Streaming Big Data for Health Status Prediction. Computers, Materials & Continua, 70(2), 3511–3527. https://doi.org/10.32604/cmc.2022.019458
  3. Nair, L. R., Shetty, S. D., & Shetty, S. D. (2018). Applying Spark based machine learning model on streaming big data for health status prediction. Computers & Electrical Engineering, 65, 393–399. https://doi.org/10.1016/j.compeleceng.2017.03.009
  4. Rahmani, A. M., Azhir, E., Ali, S., Mohammadi, M., Ahmed, O. H., Yassin Ghafour, M., Hasan Ahmed, S., & Hosseinzadeh, M. (2021). Artificial intelligence approaches and mechanisms for big data analytics: a systematic study. PeerJ Computer Science, 7, e488. https://doi.org/10.7717/peerj-cs.488