Spark and Big Data Application

Introduction

Digital data is growing rapidly in volume as a result of ongoing innovation in the digital world. The quantities involved are enormous and come from many sources, including smartphones and social networks, and they exceed what relational databases and traditional analytical techniques can store or process. It is this characteristic that gives such data the name Big Data. Traditional machine learning cannot handle the volume and velocity of this data, so the right system and programming model are required to complete the analysis. Big data comes in different forms, and frameworks such as Hadoop and Spark make its distribution and analysis possible. Large-scale data analytics can also benefit from Artificial Intelligence (AI) techniques such as machine learning. Together, big data tools and AI techniques have changed the face of data analysis. This paper examines a big data computing model based on Spark and the applications of big data. Additionally, it surveys big data platforms.

Big Data Platforms

Big data platforms fall into several categories, including batch processing, real-time (stream) processing and interactive analytics. Batch processing involves extensive computations that take considerable time. The most common batch processing platform is Apache Hadoop; it is scalable, cost-effective, flexible and fault-tolerant in the processing of big data. Hadoop comprises components such as the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN) and MapReduce (Rahmani et al., 2021). These components operate across the big data value chain, from aggregating and storing to processing and managing big data. Among stream processing platforms, Apache Spark and Storm are the most popular, and stream-processing applications play a big part in systems such as weather and transport monitoring. In interactive analytics, datasets can be accessed and used for various operations remotely, and users can also access the system directly and interact with the data. An example of an interactive analytics platform is Apache Drill.

Apache Spark Big Data Computing

Apache Spark is an open-source framework whose origin traces back to the University of California, Berkeley in 2009, and it has since become a powerful tool capable of large-scale processing, unlike many earlier tools. Spark Core is the basis of the features found in Spark. Spark offers three application programming interfaces (APIs), namely Java, Python and Scala, and its core API defines the resilient distributed dataset (RDD), the data abstraction of its original programming model. Apache Spark supports building and running fast, sophisticated high-level applications on Hadoop. The Spark engine supports tools for SQL queries and streaming data applications and manages complex analytics such as machine learning and graph algorithms. Several big data ecosystem components are based on the Spark platform, including the Spark runtime, Spark SQL, GraphX, Spark Streaming and the MLlib algorithm library (Chen, 2021).
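
As a brief illustration of this core, the following minimal standalone application (a sketch, assuming a local Spark installation; the application name and data are arbitrary) builds an RDD and runs a distributed computation on it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Run locally on all cores; a real deployment would target a cluster manager.
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a collection as an RDD and compute over it in parallel.
    val numbers = sc.parallelize(1 to 1000000)
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}
```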

The Spark runtime provides core functionality, for instance the scheduling of tasks and the management of memory. Data is transmitted inside Spark using the RDD structure, and the core processing logic of Spark can be described in three steps. The first step divides the data into many subsets and transmits those subsets to nodes in the cluster for processing. The second step protects the results after the calculation: because the persisted file content matches the results of the calculation, it also becomes easier to back up. The final step applies only when a calculation error occurs in one of the subsets, in which case that subset is recomputed to remove the error.
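
The three steps can be made concrete with a small sketch, runnable in spark-shell (where sc is predefined); the partition count and data are illustrative only:

```scala
// Step 1: divide the data into subsets (here, 4 partitions) for the cluster nodes.
val data = sc.parallelize(1 to 100, 4)
val doubled = data.map(_ * 2)   // the transformation is recorded in the lineage graph

// Step 2: persist the computed results so they can be reused and backed up.
doubled.persist()
println(doubled.reduce(_ + _))

// Step 3 is automatic: if a subset is lost or fails, Spark recomputes just that
// partition from the lineage shown below.
println(doubled.toDebugString)
```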

Spark SQL developed out of Shark. Shark achieves Hive compatibility by reusing Hive's HQL parsing, logical execution plan translation and execution plan optimization; one could argue that only the physical execution plan is replaced by a Spark job. Shark also depends on the Hive Metastore and other Hive components. For Hive compatibility, Spark SQL itself reuses only the HQL parser and the Hive Metastore. Stated differently, Spark SQL takes over once HQL is parsed into an abstract syntax tree (AST). The Catalyst component is in charge of creating and optimizing the execution plan, and it is considerably easier to build in Scala thanks to its functional language capabilities, such as pattern matching.
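
Catalyst's work can be observed directly in spark-shell (where a SparkSession named spark is predefined): explain(true) prints the parsed, analyzed and optimized logical plans and the final physical plan for a query. The table below is a throwaway example:

```scala
// Register a trivial table and inspect how Catalyst plans a query against it.
val df = spark.range(0, 100).toDF("id")
df.createOrReplaceTempView("numbers")

// Prints the parsed logical plan (from the AST), the analyzed and optimized
// logical plans, and the physical execution plan.
spark.sql("SELECT id * 2 AS doubled FROM numbers WHERE id > 10").explain(true)
```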

Among the many benefits of Spark SQL, integration is the most noteworthy: it seamlessly combines Spark programs with SQL queries. Users may query structured data inside a Spark program using either the conventional DataFrame API or SQL, and the API is available from R, Python, Scala and Java, which can increase the effectiveness of research. The second benefit is Hive integration, which allows SQL or HiveQL queries to run on pre-existing warehouses. Unified data access is the third benefit, since DataFrames and SQL connect to any data source in an identical manner, including Hive, Avro, Parquet, ORC, JSON and JDBC, and data from different sources can even be combined. The final advantage is standard connectivity: server mode supports industry-standard JDBC and ODBC connections for business intelligence products.
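
The mix of SQL and the DataFrame API, and the uniform reader across formats, can be sketched as follows (spark-shell assumed; the file paths and column names are placeholders, not real data):

```scala
// The same reader interface handles different data sources.
val people = spark.read.json("people.json")
val events = spark.read.parquet("events.parquet")
events.printSchema()

// SQL queries and DataFrame calls express the same query interchangeably.
people.createOrReplaceTempView("people")
val viaSql = spark.sql("SELECT name FROM people WHERE age >= 18")
val viaApi = people.filter("age >= 18").select("name")
```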

When the algorithms in the MLlib algorithm library are analyzed, they show a high level of computational complexity. In traditional disk-based processing, iterative computations need a lot of CPU power and time because all intermediate calculations must be stored on disk while waiting for the next task to begin.

On the Spark platform, a portion of the work can be completed entirely in memory: the iterative part of a calculation task is moved straight into memory, which makes iterative calculation much more efficient, while disk and network operation remain available under specific conditions. In conclusion, Spark offers even more impressive benefits for iterative computing, and it can therefore be used as a platform for distributed machine learning. A collaborative filtering algorithm, for example, figures out users' preferences before sharing recommendations with those users.
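
The gain from in-memory iteration can be illustrated with a sketch (spark-shell assumed; the input path and the update rule are placeholders): cache() keeps the parsed input in memory, so each pass avoids re-reading and re-parsing from disk.

```scala
// Parse the input once and pin it in memory.
val points = sc.textFile("points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

// Every iteration reuses the cached partitions instead of touching the disk.
var weight = 0.0
for (_ <- 1 to 10) {
  weight += points.map(p => p(0) * 0.01).reduce(_ + _)
}
println(weight)
```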

The application of the algorithm involves the following steps. Filtering comes first: after choosing the people who share similar interests, objects are chosen and categorized based on their preferences, creating a new set or sequence. The users in this process can be thought of as neighbors, and it is best to organize and make use of related users in order to ascertain the best execution strategy. Collaborative filtering itself comes second: the integration of user preferences is the essential component of the final suggestion, completed through the phases of gathering user preferences, analyzing similarity, and recommending based on the computation results. To gather behavioral data accurately, the first consideration is choosing a user system, organizing it based on user behavior, and then processing the data before issuing recommendations on what users may like, given their preferences.
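
As a hedged sketch of these phases, MLlib's alternating least squares (ALS) implementation can stand in for the preference-gathering and recommendation steps; the (userId, itemId, rating) triples and the tuning values below are entirely hypothetical (spark-shell assumed):

```scala
import org.apache.spark.ml.recommendation.ALS

// Hypothetical collected preferences: (userId, itemId, rating).
val ratings = Seq(
  (0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 2.0)
).toDF("userId", "itemId", "rating")

// Factorize the user-item matrix; rank, iterations and regularization are illustrative.
val als = new ALS()
  .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
  .setRank(5).setMaxIter(10).setRegParam(0.1)
val model = als.fit(ratings)

// Issue the top-3 recommendations per user from the learned preferences.
model.recommendForAllUsers(3).show(truncate = false)
```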

Spark Streaming extends Spark to streaming data: it divides the stream into batches according to a time interval, and each batch eventually forms an RDD. When streaming data is processed at small intervals, some processing delay is likely, which makes this a near-real-time processing system. The advantages of Spark Streaming include its response to faults, since its strong fault tolerance makes it capable of handling errors and recovering data. Another benefit is its ability to connect to the other relevant Spark modules. An added advantage, apart from handling the flow of data, is its ability to deal with complexity.
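
A minimal sketch of the time-based batching (spark-shell assumed; the 5-second interval, host and port are placeholders) shows the stream being cut into micro-batches and counted:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cut the incoming stream into 5-second batches; each batch becomes an RDD.
val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint("checkpoint-dir")   // checkpointing supports fault recovery

// A hypothetical text source feeding the stream.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```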

Another key component of Spark is GraphX, which is built on large-scale graph computing and enables Spark to process large graph datasets through GraphX's properties. Analysis of GraphX shows that it provides a rich set of data operators, including core and optimization operators. GraphX can also serve graph operations across several distributed clusters and offers sufficient API interfaces. According to Chen (2021), GraphX plays a critical role in Spark, as it improves Spark's data absorption and scale.
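
A minimal GraphX sketch (spark-shell assumed; the three-vertex cycle is made up) shows the property-graph construction and one built-in operator, PageRank:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// A tiny made-up graph: three vertices connected in a cycle.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

// Structural queries and a built-in graph algorithm.
println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
graph.pageRank(0.001).vertices.collect().foreach(println)
```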

Applications of Big Data

Medical Application

Healthcare facilities and medical organizations use big data to anticipate potential health risks and prevent them. The healthcare system derives large benefits from large-scale data analysis. For example, analytics has made it possible to diagnose diseases such as breast cancer in their early stages. It has also supported the processing of medical images and medical signals to provide high-quality diagnosis and monitoring of patient symptoms. Additionally, tracking chronic diseases such as diabetes has become easier. Other uses include preventing the incidence of contagious diseases, education through social networks, genetic data analysis and personalized medicine. Data types include biomedical data, web data, data from electronic health records (EHRs) and hospital information systems (HISs), and omics data, which covers genomics, transcriptomics, proteomics and pharmacogenomics. Rich data, such as demographics, test results, diagnoses and personal information, are included in the EHR and HIS data (Ismail Ebada et al., 2022). Consequently, the significance of health data analysis has come to be recognized, and scholars have learned to control, manage and process big data with tools like Apache Spark.

Apache Spark's in-memory computation enables rapid processing of large healthcare datasets, both unstructured and structured. It can analyze data approximately 100 times faster than typical MapReduce approaches. Spark's lambda architecture enables both real-time and batch processing. The tool analyzes streaming healthcare data and includes a library for handling Twitter data streams. Many messages relate to health, and communication takes place through social networks; Twitter, for instance, provides real-time public health statistics that are inexpensive to obtain, since real-time disease surveillance depends entirely on mining information through the Twitter API (Nair et al., 2018). Spark's machine learning package, MLlib, can help create decision trees for illness prediction. Spark analyzes large amounts of data as it streams, and it outperforms Hadoop in streaming because of its memory-based processing: data and intermediate results are kept in memory, which avoids the delays of the input-output stage, where data must move back and forth on the hard disk.
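
The decision-tree idea can be reduced to a short sketch (spark-shell assumed): the patient features below (age, bmi, glucose) and the tiny inline dataset are entirely hypothetical, standing in for real EHR-derived columns.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical patient records: (age, bmi, glucose, label), label 1.0 = at risk.
val records = Seq(
  (34.0, 22.1, 90.0, 0.0), (51.0, 31.4, 160.0, 1.0),
  (46.0, 28.0, 130.0, 1.0), (29.0, 20.5, 85.0, 0.0)
).toDF("age", "bmi", "glucose", "label")

// MLlib expects the raw columns assembled into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "bmi", "glucose"))
  .setOutputCol("features")
val prepared = assembler.transform(records)

// Fit the tree and inspect predictions on the same toy data.
val model = new DecisionTreeClassifier().setLabelCol("label").fit(prepared)
model.transform(prepared).select("label", "prediction").show()
```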

Spark uses resilient distributed datasets (RDDs), which are immutable object collections stored in memory and partitioned among nodes for parallel calculations. Spark is a batch-processing engine that can handle streaming data from several sources, including Twitter: the incoming data streams are divided into small batches, which the Spark engine then processes. Discretized streams (DStreams) are a series of RDDs that represent a continuous data stream in Spark Streaming, and the Spark engine computes the underlying RDD transformations that result from operations on DStreams. MLlib provides popular learning methods, including collaborative filtering, clustering and classification, that aid in solving machine learning problems.
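
The point that a DStream is a series of RDDs can be seen directly: continuing the streaming sketch above (reusing the lines stream defined there), foreachRDD exposes each micro-batch as an ordinary RDD.

```scala
// Each micro-batch arrives as a plain RDD that any RDD operation can process.
lines.foreachRDD { (rdd, time) =>
  println(s"batch at $time: ${rdd.count()} records")
}
```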

Natural Disaster Prediction and Management

The use of big data is a crucial part of responding to natural disasters. Among the applications of these tools are preventing disasters, providing relief in times of disaster and managing disasters in emergencies. In the era of the Internet of Things (IoT), managing the complexity of natural disasters has become simpler. The AI Spark big model has enabled a new way of thinking, a new technical system and an innovative capability for dealing with complex disaster information, resulting in an effective approach to comprehensive disaster prevention and reduction. The strategic significance and core values of big data are mainly composed of three concepts. The first concerns the level of strategic thinking and defines the point at which the disaster problem is beyond contemporary scientific understanding; such problems are high-frequency, high-intensity disasters and go beyond those the world defines as basic disasters.

Additionally, comprehensive disaster reduction goes beyond national strategy. The second concept concerns the level of innovation and development in information science and technology: disaster big data poses challenges to traditional information science and technology, accelerating the formation of new technology systems and theories that have complex information at their core. The third concept concerns the level of social innovation and development: here, disaster big data is the core capability for ensuring an Internet-connected, smartly secured world that can reduce disasters through smart strategies.

Disaster cloud computing combines the ability to efficiently monitor disaster information, standardize data, predict abnormalities and provide early warnings of signs of disaster with existing methods for assessing the risks of all kinds of disasters (Chen, 2021). Since it deals with early warning, analyzes how to manage emergencies, provides emergency information services and supports the calculation of other models, it can be built as a large distributed ring over shared infrastructure. Its major purpose is to provide safe and fast disaster cloud data processing and to predict disasters and issue early warnings. It also provides a disaster risk assessment model, an emergency management service model and other related services.

Face Recognition

Face recognition is another application of big data analytics tools like Apache Spark. Businesses can use big data analysis systems to detect and recognize the faces, gender, age, facial expressions and gaze of people watching advertising machines and other digital signage equipped with face recognition technology. The information that big data processing technology collects becomes instrumental in analyzing customer preferences, ultimately helping provide customer-friendly services according to gender, age and emotional state. Face recognition generally facilitates the connection between people and things, making it possible to recommend the advertisements and messages that particular individuals prefer. It is also useful in non-manual, automatic applications aimed at recognizing the face, eyes and any other information that could help predict subsequent operations.

Conclusion

This paper has addressed various issues, starting with the definition of big data and an analysis of the types of platforms that make up big data. It also summarized the components of Apache Spark, defining Spark SQL, GraphX, the MLlib algorithm library and Spark Streaming and how they operate. Which platforms are stable, however, remains to be determined, so it is advisable to use tools that have been tested and have proven their scalability, flexibility and stability. Additionally, the essay introduced the reader to areas where big data plays crucial roles. These include the medical field, where it supports the prediction, analysis, prevention and management of various diseases; the management and prediction of natural disasters; and face recognition, among other areas. Technology is still evolving, and big data modelling could still be improved, for example with enhanced real-time data processing of the kind that tools like Apache Spark have already made possible through Twitter APIs. Most industries are also moving toward big data processing systems that are accurate and precise in order to enhance their economic and social benefits. Batch processing, the most common form of big data processing, still needs to meet the data-processing frequency that these industries require.

References

  1. Chen, S. (2021). Research on Big Data Computing Model based on Spark and Big Data Applications. Journal of Physics: Conference Series, 2082(1), 012017. https://doi.org/10.1088/1742-6596/2082/1/012017
  2. Ismail Ebada, A., Elhenawy, I., Jeong, C.-W., Nam, Y., Elbakry, H., & Abdelrazek, S. (2022). Applying Apache Spark on Streaming Big Data for Health Status Prediction. Computers, Materials & Continua, 70(2), 3511–3527. https://doi.org/10.32604/cmc.2022.019458
  3. Nair, L. R., Shetty, S. D., & Shetty, S. D. (2018). Applying Spark based machine learning model on streaming big data for health status prediction. Computers & Electrical Engineering, 65, 393–399. https://doi.org/10.1016/j.compeleceng.2017.03.009
  4. Rahmani, A. M., Azhir, E., Ali, S., Mohammadi, M., Ahmed, O. H., Yassin Ghafour, M., Hasan Ahmed, S., & Hosseinzadeh, M. (2021). Artificial intelligence approaches and mechanisms for big data analytics: a systematic study. PeerJ Computer Science, 7, e488. https://doi.org/10.7717/peerj-cs.488