
Real Time Fraud Detection using Apache Spark

Ketan Navanath Karande
x17100062
School of Computing, National College of Ireland
Dublin, Ireland
Email: [email protected]

Abstract—The era of physical credit cards seems to be vanishing as we witness the booming of the e-commerce industry. Today, almost every country in the world is trying to bring digital currency into the picture with virtual cards and Bitcoin. Social media has become a way for many to express their emotions and concerns on a public platform. But with all this encrypted currency and public information comes a grave danger of fraud and cyber crime that threatens the very concepts of safety and personal space. This paper stresses the concept of real-time fraud detection using various Stream Processing Engines (SPEs) such as Apache Spark, Apache Storm, and Apache Flink. It also highlights ingestion platforms such as Kafka and MapR that are used to stream data and perform analytics on the fly by incorporating historical data from Big Data storage technologies like HDFS along with NoSQL databases like HBase and Cassandra.

I. INTRODUCTION

There was a time when fraud detection was limited to data that was collected and used after a batch of time. This detection was based on and derived from the historical data gathered by a company over a period of time. Data was taken from the servers and processed through several statistical procedures.

With a leap in time, the amount of data collected by companies started increasing, which led to the use of Big Data storage technologies, amongst which Hadoop managed to provide a stable solution. This data accumulation was a result of an increase in online traffic. With the increase in online traffic, the number of online operations also grew exponentially, which created an added need to prevent online fraud and cyber crime. Due to this sudden increase in online operations, the concept of detecting fraud after a batch of time, also termed batch processing, seems a little too late to predict crimes.

Online operations therefore demand real-time analytics that can predict crimes and the possibility of fraud within milliseconds of the event. Suppose a man with a stolen credit card tries to buy something really expensive and tops up the card; real-time data analytics can stop him from doing so by cutting him off at the source, i.e. by rejecting the payment immediately after he swipes the card. This is only possible through stream processing that can work in an almost real-time environment.

This paper focuses on the solutions that have been used to stop such frauds in real time by implementing various stream processing engines. Section II discusses the difference between real-time and near real-time fraud detection, Section III compares various ingestion engines, stream processing engines, and storage technologies, Section IV describes the steps to build a fraud detection engine, Section V presents solutions for real-time fraud detection, and Section VI covers real-time data analytics in social media platforms.

II. FRAUD DETECTION IN REAL/NEAR REAL TIME [1]

In order to prevent fraud in the banking and finance sectors, it is very important for the solution to operate in real time, exactly during the occurrence of the event. This kind of prevention can only be achieved through an algorithmic approach that the streaming engines execute.
This algorithm can be implemented in four specific ways, which fall under real-time and near real-time operations. Algorithms in real-time operations work before the payment is authorized, while those in near real-time operations work after the payment is made.

The two real-time approaches are Terminal Control and Terminal Block Rules (TBR). Terminal Control checks for correct passwords, PIN codes, and sufficient balance, while TBR sets if-else conditions to ensure that payments are not made through untrusted gateways and black markets. This method helps authenticate payments before they are completed.

The two near real-time approaches are Scoring Rules (SR) and Data Driven Models (DDM). In SR, every transaction is given a score on a scale of 0 to 1, where a higher score indicates a higher chance that the event is fraudulent. In DDM, a predictive model is designed using machine learning algorithms and is used to predict the probability of fraudulent events.

This paper presents solutions for real-time fraud detection that are based on DDM. The machine learning models are deployed inside streaming processes, and real-time warnings are given to the users during the transactions.

III. COMPONENTS FOR REAL-TIME DATA ANALYTICS SYSTEMS

A. Ingestion Engines [2]

Ingestion engines stream data to the streaming engines and handle the data exchange in real time while the event is occurring. Choosing the correct ingestion engine can help improve the scalability and latency of the process. The following ingestion engines are used in the remaining components of the paper.

1) Kafka: Kafka is a messaging tool that carries producer and consumer messages back and forth and works on a parallel processing concept. In Kafka, consumers are subscribed to specific topics through which they receive their messages. It consists of clusters comprising one or more servers, and these servers are termed brokers. Whenever a producer sends a message, the message first travels to a broker and is then distributed amongst the consumers in the same cluster who are subscribed to the topic. Kafka is often used with streaming engines like Apache Spark, Flink, and Storm.

2) MapR: [3] Similar to Kafka, MapR is also a messaging tool that carries messages from producers to subscribed consumers. Unlike Kafka, however, MapR does not work in clusters: every message published on a topic is received by the subscribed consumers irrespective of the files and tables present. MapR uses Streams, which are essentially bunches of topics clubbed together on the basis of their shared policies and their different subscribed consumers. MapR can be integrated with streaming engines such as Apache Spark and Flink, along with its own database, MapR-DB, which can be connected with MapR-FS, an event storage system with an HBase API.
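To make the ingestion step concrete, the following sketch shows how a producer might publish a single card transaction to a topic using the plain Kafka producer API, which MapR also exposes. The broker address, topic name, and message layout are assumptions made purely for illustration, not details taken from the cited solutions.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TransactionProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Assumed broker address; a real deployment would list the cluster endpoints.
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Hypothetical topic and payload: the card id is the key and the
    // comma-separated transaction details (timestamp, amount, merchant risk,
    // transactions per hour) form the value.
    val record = new ProducerRecord[String, String](
      "transactions", "card-4821", "2018-06-01T10:15:00,349.99,0.7,12")
    producer.send(record)
    producer.close()
  }
}

Listing 1: Publishing a transaction event to a Kafka topic (illustrative sketch)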
B. Stream Processing Engines

The banking, finance, and e-commerce sectors receive gigabytes and terabytes of data on an everyday basis. This data needs to be processed in real time to avoid and prevent fraud. Working with such huge amounts of data in real time requires stream processing, which is not a strong forte of Apache Hadoop. Because of this, many Apache streaming projects have been undertaken by various organizations to develop platforms that can help with stream processing, some of which are discussed in this paper. These streaming engines are responsible for carrying out data analytics on the data on the fly, in real time.

1) Apache Spark: [4], [5] Apache Spark is well known for its combined approach to batch processing and streaming, using a micro-batch processing technique. Spark works with the RDD (Resilient Distributed Dataset), a read-only set of data distributed across a cluster of different nodes, which makes Spark more fault tolerant. Spark is written in Scala and can work with various data sources such as HDFS, Cassandra, Amazon S3, and HBase. Spark has its own Spark Streaming system that connects Spark to different system APIs, and it uses MLlib to run various machine learning algorithms on the data for real-time analytics [5].

2) Apache Flink: [5]-[7] Apache Flink is a dedicated stream processing engine written in Java and Scala [7]. It does not work with micro-batches like Spark; instead, it works on a fine-grained event architecture. Apache Flink can be connected to data stores such as HDFS and HBase and works with ingestion engines like Kafka. Similar to Spark, Flink also provides APIs for batch and stream processing and is an effective streaming engine with low latency and good fault tolerance [6]. Flink uses FlinkML to run various machine learning algorithms on the data [5].

3) Apache Storm: [6], [8], [9] Apache Storm is written in Clojure and is a highly scalable and fault-tolerant system designed to connect streams [6]. Storm works in the form of topologies that distribute loads across the clusters and processes tuples in such a way that the delay is reduced to milliseconds. Because of this, Storm is mostly used as a stream detection layer with Kafka and Spark, together with HBase or HDFS [8]. Storm can perform basic statistical procedures, but real-time data analytics remains one of its weaker sides [9].

C. Data Storage

The models used to analyze real-time Big Data are built from historical data stored in different cloud infrastructures. The latency with which these databases recover requested data and make it available for analysis is crucial to the performance of the streaming engines. This paper focuses on two such databases.

1) HBase: [8] HBase is a Java-based key-value NoSQL database that works on the principle of HDFS, i.e. master-slave relations. Its key-value storage supports fast and scalable data sharing, and it offers an API through which it can be connected to multiple stream processing engines.

2) Cassandra: [1], [10] Cassandra is a distributed database designed to be highly scalable and fault tolerant. It does not have a master-slave architecture and hence carries no risk of a single point of failure. This masterless architecture increases the availability of Cassandra as compared to HBase.
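As a concrete illustration of how a streaming engine might pull historical transactions out of such a storage layer for model building, the sketch below loads a Cassandra table into a Spark DataFrame. It assumes the DataStax spark-cassandra-connector is on the classpath and uses hypothetical keyspace and table names; the cited solutions may organize their schemas differently.

import org.apache.spark.sql.SparkSession

object HistoricalTransactionLoader {
  def main(args: Array[String]): Unit = {
    // Assumes a reachable Cassandra node and the spark-cassandra-connector package.
    val spark = SparkSession.builder()
      .appName("historical-transaction-loader")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Hypothetical keyspace and table holding the labeled historical transactions.
    val history = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "fraud", "table" -> "transactions"))
      .load()

    history.printSchema()
    println(s"Loaded ${history.count()} historical transactions")
    spark.stop()
  }
}

Listing 2: Loading historical transactions from Cassandra into Spark (illustrative sketch)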
3) HBase vs Cassandra: [11] A general comparison between the two databases can be made by examining their performance on different types of workloads. There are six workload types in total:

1. Workload A: an equal proportion of read and write operations.
2. Workload B: heavy weight is given to read operations.
3. Workload C: all operations are read-only.
4. Workload D: operations read only the latest changes made to the data.
5. Workload E: only queries over records with short ranges are made.
6. Workload F: an interactive process in which a read operation is followed by modify and write operations.

It can be clearly seen from Figure 1 that Cassandra outperforms HBase in all the workloads except Workload F. This signifies that any business relying on operations corresponding to Workloads A to E can use Cassandra.

Figure 1: HBase vs Cassandra [11]

For real-time data analytics, data storage can be crucial for building new models, and this comparison helps in understanding how Cassandra can be the optimum data storage and management system for the task.

IV. STEPS TO BUILD A FRAUD DETECTION ENGINE [3]

In the process of real-time fraud detection, along with the ingestion engines, stream processing engines, and storage systems, one other important aspect stands out: the process in which the actual frauds are predicted using different types of machine learning algorithms. This paper discusses the most basic classification model used to predict fraud.

Logistic regression is a method for predicting events with binary outcomes. In the case of fraud detection, these outcomes are:
1. The transaction event is not fraudulent.
2. The transaction event is fraudulent.

Logistic regression can help predict fraud, but it also has certain prerequisites. It is a statistically derived algorithm that takes multiple independent variables (X1, X2, X3, ..., Xn) and one categorical dependent variable (Y). The independent variables are termed features and the dependent variable is termed the label. The model estimates the probability of fraud as P(Y = 1 | X) = 1 / (1 + e^-(b0 + b1*X1 + ... + bn*Xn)), and applying a threshold to this probability yields the binary decision. In order to build a logistic regression model and deploy it in a stream processing engine, the business must know which features are important for predicting the outcome. To find these features, feature engineering is carried out. Figure 2 shows the features that can be used in predicting credit card fraud [3].

Figure 2: Features in feature engineering [3]

Feature engineering is therefore the first part of any predictive model that is to be deployed in a stream processing engine.

In the logistic regression model, the entire historical dataset is divided into two parts with a split of 60:40 or 70:30, where, for example, 60 percent of the total data is termed training data and the remaining 40 percent is termed testing data. The training data is used to fit the logistic model, which is then applied to the testing data to check its accuracy. The process can be clearly seen in Figure 3, and a minimal code sketch of this training and testing step is given at the end of this section.

Figure 3: Training and testing model [3]

Along with classification techniques, other techniques such as regression, clustering, and dimensionality reduction are also used to predict fraudulent events in stream processing engines.
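The following sketch shows how the training and testing step described above could look in Spark MLlib. It assumes a historical dataset with three already engineered feature columns and a 0/1 fraud label; the file path, column names, and the 70:30 split are illustrative choices rather than the exact setup of the cited solutions.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object FraudModelTraining {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fraud-model-training").getOrCreate()

    // Hypothetical labeled historical data produced by the feature engineering step.
    val historical = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/transactions_labeled.csv")

    // Combine the assumed feature columns into the single vector column MLlib expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("amount", "merchant_risk", "txn_per_hour"))
      .setOutputCol("features")
    val prepared = assembler.transform(historical)

    // 70:30 split into training and testing data, as described in this section.
    val Array(train, test) = prepared.randomSplit(Array(0.7, 0.3), seed = 42L)

    val lr = new LogisticRegression()
      .setLabelCol("is_fraud")
      .setFeaturesCol("features")
    val model = lr.fit(train)

    // The model outputs a fraud probability per event; accuracy is checked on the test set.
    val predictions = model.transform(test)
    predictions.select("is_fraud", "probability", "prediction").show(5)

    // Persist the model so that a streaming job can load and apply it (path is hypothetical).
    model.write.overwrite().save("hdfs:///models/fraud_lr")
    spark.stop()
  }
}

Listing 3: Training and testing a logistic regression fraud model with Spark MLlib (illustrative sketch)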
V. SOLUTIONS FOR REAL-TIME FRAUD DETECTION

This paper focuses on two solutions that have been implemented to reduce fraudulent events before they happen. The solutions use the following components of real-time data analytics.

Solution 1 [3]
Ingestion Engine: MapR via the Kafka API
Stream Processing System: Apache Spark
Data Storage: MapR-DB and HBase

This solution uses a distributed messaging system named MapR that uses the Apache Kafka API to process events in real time. MapR is responsible for carrying the messages between producers and consumers in a distributed fashion. The messages are categorized into three types:
a. Raw Trans, where the transaction made with the card is recorded;
b. Enriched Events, when the machine learning model predicts the transaction event as genuine;
c. Fraud Alerts, when the event is predicted as a fraud in process.

After this step, the messages are transferred with the underlying feature information into the Apache Spark streaming engine, where the model is deployed on the features. The model uses similar features, fetched from the historical data, to train the model before testing it on the messages received from MapR.

The historical data with similar features is supplied to Apache Spark by MapR-DB, which uses the HBase API to convert the data into key-value pairs for faster transfer to Spark. Apache Spark then converts the training and testing data into RDDs for faster, parallel processing and deploys the models, which send one of the three categories of messages back to the device via MapR messaging.

During this process, MapR-FS temporarily stores the newly acquired data, which can be made available for training the machine learning model when required. In case this data remains unused, it is forwarded to HDFS, where it is stored in distributed files that are processed in parallel under the control of a single master node server. This solution is illustrated in Figure 4.

Figure 4: Solution workflow [3]

Solution 2 [1]
Ingestion Engine: Kafka
Stream Processing System: Apache Spark
Data Storage: Cassandra

Figure 5: Solution workflow [1]

This solution, as portrayed in Figure 5, describes a scalable framework that streams data in real time using Spark as the streaming engine and Cassandra as the storage system. Cassandra does not work on the principle of master-slave architecture and hence is faster than HBase and more reliable, since there is no single point of failure that can hinder data retrieval during real-time analytics.

In this system, a web server running a bash script at the back end fetches data for training in Apache Spark. The machine learning library MLlib in Spark is used to deploy the machine learning models on the data. This makes the analytical process easier, since MLlib comes with preset ML algorithms for dimensionality reduction and classification that support the streaming process.

The solution uses Kafka as the messaging system that ingests the data into Apache Spark for model deployment. Although the combination of Kafka with Apache Flink is considered to be the best, the integration of Cassandra with Spark and Kafka in this system improves scalability and fault tolerance with higher precision. A minimal sketch of such a Kafka-to-Spark scoring pipeline is shown below.
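The sketch assumes Spark Structured Streaming (the cited solution may rely on the older DStream-based Spark Streaming API), the hypothetical transactions topic and message layout used in Listing 1, and the saved model from Listing 3. In the actual solution, the scored events would be written to Cassandra rather than to the console.

import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

object StreamingFraudScoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-fraud-scoring").getOrCreate()
    import spark.implicits._

    // Model trained and saved in Listing 3 (path is hypothetical).
    val model = LogisticRegressionModel.load("hdfs:///models/fraud_lr")

    // Subscribe to the assumed transactions topic on an assumed broker address.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "transactions")
      .load()

    // Assumed value layout: timestamp,amount,merchant_risk,txn_per_hour (see Listing 1).
    val parsed = raw.selectExpr("CAST(value AS STRING) AS csv")
      .select(split($"csv", ",").as("f"))
      .select(
        $"f".getItem(1).cast("double").as("amount"),
        $"f".getItem(2).cast("double").as("merchant_risk"),
        $"f".getItem(3).cast("double").as("txn_per_hour"))

    val features = new VectorAssembler()
      .setInputCols(Array("amount", "merchant_risk", "txn_per_hour"))
      .setOutputCol("features")
      .transform(parsed)

    // Score each micro-batch and keep only the events predicted as fraudulent.
    val alerts = model.transform(features).filter($"prediction" === 1.0)

    // The cited solution stores results in Cassandra; the console sink keeps the sketch self-contained.
    val query = alerts.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()
  }
}

Listing 4: Scoring a Kafka transaction stream with the trained model in Spark (illustrative sketch)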
This combination results in the processing of 200 transactions per second, compared to the generic approach of using HBase instead of Cassandra for data storage, which processed up to 2.4 transactions per second under the experimental conditions. This unique combination lowers the latency and improves the speed of the analytical process considerably, meeting the expectations and the ever-increasing demand for real-time data analytics in the Big Data environment.

VI. REAL-TIME DATA ANALYTICS IN SOCIAL MEDIA PLATFORMS [9]

Social media has become one of the biggest sources of data acquisition, and hence the problem of storage can be pretty critical for global platforms like Twitter. Terabytes of data are stored on Twitter on a daily basis and are processed in real time to extract valuable information. Twitter started by using HDFS along with different NoSQL databases like HBase to store the massive amount of data collected every day. This data is stored in the form of distributed files that are read in parallel when a data retrieval request is fed to the system.

However, this data can also be used to predict events in real time. As mentioned in [9], Twitter is used to run behavioral analyses during different events to predict the intensity and outcome of those events. In this paper, the aspect of integrating real-time analytics in social media platforms is discussed to stress the importance of various stream processing systems and their role in the world of real-time Big Data analytics, which is not limited to fraud detection.

Twitter first started with Apache Storm as its stream processing engine. Storm processes streams one tuple at a time, unlike Spark, which uses micro-batches. One of the most crucial parts of real-time data processing is analytics, where the actual modeling of the data is done. Apache Storm, although good at real-time streaming, is limited to basic statistical operations and cannot run machine learning algorithms. For real-time analytics to work, it is important for the stream processing engine to process the data with different machine learning algorithms on the fly in order to detect uncommon events. Because of this, Twitter streaming was later shifted to Apache Spark, sacrificing the very low latency of Storm.

As an ingestion layer, Twitter uses its streaming APIs to feed the data to Kafka, which then distributes the data in batches and sends it further for processing in Spark. By integrating the YARN and HDFS methodologies, the historical data is fed to Spark for training the model, which is then tested on the data coming through the APIs. This process makes real-time data analytics feasible for such a huge Big Data storing giant.

VII. CONCLUSION

In this paper, different types of Big Data storage, data ingestion, and data streaming systems were discussed. There are many more solutions on the verge of implementation to solve the problems underlying the field of real-time analytics for fraud detection. However, the approach of using Apache Spark as a stream processing engine with HBase or Cassandra for data storage appears to be the most popular one. Many other engines, especially Apache Flink, which is designed specifically for stream processing, are taking root in the latest solutions to be implemented in the field. With all the work being put in by the Apache community, the future of real-time fraud detection looks promising.

REFERENCES
[1] F. Carcillo, A. D. Pozzolo, Y.-A. L. Borgne, O. Caelen, Y. Mazzer, and G. Bontempi, "Scarff: A scalable framework for streaming credit card fraud detection with Spark," Information Fusion, vol. 41, pp. 182–194, 2018.
[2] "Credit card fraud detection using Apache Spark streaming and Kafka." [Online]. Available: https://code.likeagirl.io/apache-spark-streamingand-kafka-use-case-2a8802472778
[3] C. McDonald, "Real time credit card fraud detection with Apache Spark and event streaming," 2016. [Online]. Available: https://mapr.com/blog/real-time-credit-card-fraud-detectionapache-spark-and-event-streaming/
[4] L. Affetti, R. Tommasini, A. Margara, G. Cugola, and E. D. Valle, "Defining the execution semantics of stream processing engines," Journal of Big Data, vol. 4, no. 1, 12, 2017.
[5] S. Ramírez-Gallego, A. Fernández, S. García, M. Chen, and F. Herrera, "Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce," Information Fusion, vol. 42, pp. 51–61, 2018.
[6] J. MSV, "All the Apache streaming projects: An exploratory guide," 2016. [Online]. Available: https://thenewstack.io/apache-streaming-projectsexploratory-guide/
[7] M. Sathyapriya and V. Thiagarasu, "Big data analytics techniques for credit card fraud detection: A review," International Journal of Science and Research (IJSR), 2015.
[8] Y. Dai, J. Yan, X. Tang, H. Zhao, and M. Guo, "Online credit card fraud detection: A hybrid framework with big data technologies," in Proc. 15th IEEE Int. Conf. on Trust, Security and Privacy in Computing and Communications (IEEE TrustCom/BigDataSE/ISPA 2016), pp. 1644–1651, 2016.
[9] B. Yadranjiaghdam, S. Yasrobi, and N. Tabrizi, "Developing a real-time data analytics framework for Twitter streaming data," in Proc. 2017 IEEE 6th Int. Congress on Big Data (BigData Congress 2017), pp. 329–336, 2017.
[10] G. Bello-Orgaz, J. J. Jung, and D. Camacho, "Social big data: Recent achievements and new challenges," Information Fusion, vol. 28, pp. 45–59, 2016.
[11] W. Xu, "Benchmark Apache HBase vs Apache Cassandra on SSD in a cloud environment," 2017. [Online]. Available: https://hortonworks.com/blog/hbase-cassandra-benchmark/