Fraud Detection with Cloudera Stream Processing Half 1

In a earlier weblog of this collection, Turning Streams Into Knowledge Merchandise, we talked in regards to the elevated want for decreasing the latency between knowledge era/ingestion and producing analytical outcomes and insights from this knowledge. We mentioned how Cloudera Stream Processing (CSP) with Apache Kafka and Apache Flink might be used to course of this knowledge in actual time and at scale. On this weblog we’ll present an actual instance of how that’s carried out, how we will use CSP to carry out real-time fraud detection.

Constructing real-time streaming analytics knowledge pipelines requires the flexibility to course of knowledge within the stream. A vital prerequisite for in-stream processing is having the aptitude to gather and transfer the information as it’s being generated on the level of origin. That is what we name the first-mile drawback. This weblog shall be printed in two components. Partially one we’ll look into how Cloudera DataFlow powered by Apache NiFi solves the first-mile drawback by making it straightforward and environment friendly to purchase, rework, and transfer knowledge in order that we will allow streaming analytics use instances with little or no effort. We may also briefly focus on some great benefits of operating this stream in a cloud-native Kubernetes deployment of Cloudera DataFlow.

Partially two we’ll discover how we will run real-time streaming analytics utilizing Apache Flink, and we’ll use Cloudera SQL Stream Builder GUI to simply create streaming jobs utilizing solely SQL language (no Java/Scala coding required). We may also use the data produced by the streaming analytics jobs to feed completely different downstream programs and dashboards. 

The use case

Fraud detection is a superb instance of a time-critical use case for us to discover. All of us have been via a scenario the place the small print of our bank card, or the cardboard of somebody we all know, has been compromised and illegitimate transactions had been charged to the cardboard. To reduce the harm in that scenario, the bank card firm should be capable to determine potential fraud instantly in order that it will possibly block the cardboard and get in touch with the person to confirm the transactions and presumably problem a brand new card to exchange the compromised one.

The cardboard transaction knowledge normally comes from event-driven sources, the place new knowledge arrives as card purchases occur in the actual world. Moreover the streaming knowledge although, we even have conventional knowledge shops (databases, key-value shops, object shops, and so forth.) containing knowledge which will have for use to complement the streaming knowledge. In our use case, the streaming knowledge doesn’t include account and person particulars, so we should be part of the streams with the reference knowledge to supply all the data we have to test towards every potential fraudulent transaction.

Relying on the downstream makes use of of the data produced we could must retailer the information in several codecs: produce the checklist of potential fraudulent transactions to a Kafka matter in order that notification programs can motion them immediately; save statistics in a relational or operational dashboard, for additional analytics or to feed dashboards; or persist the stream of uncooked transactions to a sturdy long-term storage for future reference and extra analytics.

Our instance on this weblog will use the performance inside Cloudera DataFlow and CDP to implement the next:

  1. Apache NiFi in Cloudera DataFlow will learn a stream of transactions despatched over the community.
  2. For every transaction, NiFi makes a name to a manufacturing mannequin in Cloudera Machine Studying (CML) to attain the fraud potential of the transaction.
  3. If the fraud rating is above a sure threshold, NiFi instantly routes the transaction to a Kafka matter that’s subscribed by notification programs that may set off the suitable actions.
  4. The scored transactions are written to the Kafka matter that may feed the real-time analytics course of that runs on Apache Flink.
  5. The transaction knowledge augmented with the rating can be continued to an Apache Kudu database for later querying and feed of the fraud dashboard.
  6. Utilizing SQL Stream Builder (SSB), we use steady streaming SQL to investigate the stream of transactions and detect potential fraud based mostly on the geographical location of the purchases.
  7. The recognized fraudulent transactions are written to a different Kafka matter that feeds the system that may take the required actions.
  8. The streaming SQL job additionally saves the fraud detections to the Kudu database.
  9. A dashboard feeds from the Kudu database to point out fraud abstract statistics.

Buying with Cloudera DataFlow

Apache NiFi is a part of Cloudera DataFlow that makes it straightforward to amass knowledge on your use instances and implement the required pipelines to cleanse, rework, and feed your stream processing workflows. With greater than 300 processors obtainable out of the field, it may be used to carry out common knowledge distribution, buying and processing any sort of information, from and to just about any sort of supply or sink.

On this use case we created a comparatively easy NiFi stream that implements all of the operations from steps one via 5 above, and we’ll describe these operations in additional element beneath.

In our use case, we’re processing monetary transaction knowledge from an exterior agent. This agent is sending every transaction because it occurs to a community deal with. Every transaction comprises the next info:

  • The transaction time stamp
  • The ID of the related account
  • A novel transaction ID
  • The transaction quantity
  • The geographical coordinates of the place the transaction occurred (latitude and longitude)

The transaction message is in JSON format as appears to be like like the instance beneath:


  "ts": "2022-06-21 11:17:26",

  "account_id": "716",

  "transaction_id": "e933787c-f0ff-11ec-8cad-acde48001122",

  "quantity": 1926,

  "lat": -35.40439536601375,

  "lon": 174.68080620053922


NiFi is ready to create community listeners to obtain knowledge coming over the community. For this instance we will merely drag and drop a ListenUDP processor into the NiFi canvas and configure it with the specified port. It’s doable to parameterize the configuration of processors to make flows reusable. On this case we outlined a parameter known as #{enter.udp.port}, which we will later set to the precise port we’d like.


Describing the information with a schema

A schema is a doc that describes the construction of the information. When sending and receiving knowledge throughout a number of purposes in your setting and even processors in a NiFi stream, it’s helpful to have a repository the place the schema for all various kinds of knowledge are centrally managed and saved. This makes it simpler for purposes to speak to one another.

Cloudera Knowledge Platform (CDP) comes with a Schema Registry service. For our pattern use case, we now have saved the schema for our transaction knowledge within the Schema Registry service and have configured our NiFi stream to make use of the right schema identify. NiFi is built-in with Schema Registry and it’ll mechanically hook up with it to retrieve the schema definition each time wanted all through the stream.

The trail that the information takes in a NiFi stream is set by visible connections between the completely different processors. Right here, for instance, the information obtained beforehand by the ListenUDP processor is “tagged” with the identify of the schema that we need to use: “transaction.”

Scoring and routing transactions

We skilled and constructed a machine studying (ML) mannequin utilizing Cloudera Machine Studying (CML) to attain every transaction in line with their potential to be fraudulent. CML gives a service with a REST endpoint that we will use to carry out scoring. As the information flows via the NiFi knowledge stream, we need to name the ML mannequin service for knowledge factors to get the fraud rating for every one in every of them.

We use the NiFi’s LookupRecord for this, which permits lookups towards a REST service. The response from the CML mannequin comprises a fraud rating, represented by an actual quantity between zero and one.

The output of the LookupRecord processor, which comprises the unique transaction knowledge merged with the response from the ML mannequin, was then linked to a really helpful processor in NiFi: the QueryRecord processor.

The QueryRecord processor lets you outline a number of outputs for the processor and affiliate a SQL question with every of them. It applies the SQL question to the information that’s streaming via the processor and sends the outcomes of every question to the related output.

On this stream we outlined three SQL queries to run concurrently on this processor:


Word that some processors additionally outline further outputs, like “failure,” “retry,” and so forth., so to outline your individual error-handling logic on your flows.

Feeding streams to different programs

At this level of the stream we now have already enriched our stream with the ML mannequin’s fraud rating and reworked the streams in line with what we’d like downstream. All that’s left to finish our knowledge ingestion is to ship the information to Kafka, which we’ll use to feed our real-time analytical course of, and save the transactions to a Kudu desk, which we’ll later use to feed our dashboard, in addition to for different non-real-time analytical processes down the road.

Apache Kafka and Apache Kudu are additionally a part of CDP, and it’s quite simple to configure the Kafka- and Kudu-specific processors to finish the duty for us.

Operating the information stream natively on the cloud

As soon as the NiFi stream is constructed it may be executed in any NiFi deployment you may need. Cloudera DataFlow for the Public Cloud (CDF-PC) gives a cloud-native elastic stream runtime that may run flows effectively.

In comparison with fixed-size NiFi clusters, the CDF’s cloud-native stream runtime has an a variety of benefits:

  • You don’t must handle NiFi clusters. You may merely hook up with the CDF console, add the stream definition, and execute it. The mandatory NiFi service is mechanically instantiated as a Kubernetes service to execute the stream, transparently to the person.
  • It gives higher useful resource isolation between flows.
  • Stream executions can auto-scale up and down to make sure the correct quantity of assets to deal with the present quantity of information being processed. This avoids useful resource hunger and in addition saves prices by deallocating pointless assets when they’re not used.
  • Constructed-in monitoring with user-defined KPIs that may be tailor-made to every particular stream are completely different granularities (system, stream, processor, connection, and so forth.).

Safe inbound connections

Along with the above, configuring safe community endpoints to behave as ingress gateways is a notoriously tough drawback to unravel within the cloud, and the steps fluctuate with every cloud supplier. 

It requires establishing load balancers, DNS information, certificates, and keystore administration. 

CDF-PC abstracts away these complexities with the inbound connections characteristic, which permits the person to create an inbound connection endpoint by simply offering the specified endpoint identify and port quantity.

Parameterized and customizable deployments

Upon the stream deployment you may outline parameters for the stream execution and in addition select the scale and auto-scaling traits of the stream:


Native monitoring and alerting

Customized KPIs could be outlined to observe the elements of the stream which can be essential to you. Alerts could be additionally outlined to generate notifications when the configured thresholds are crossed:

After the deployment the metrics collected for the outlined KPI could be monitored on the CDF dashboard:

Cloudera DataFlow additionally gives direct entry to the NiFi canvas for the stream so to test particulars of the execution or troubleshoot points, if essential. All of the performance from the GUI can be obtainable programmatically, both via the CDP CLI or the CDF API. The method of making and managing stream could be totally automated and built-in with CD/CI pipelines.


Gathering knowledge on the level of origination because it will get generated, and rapidly making it obtainable on the analytical platform, is vital for the success of any mission that requires knowledge streams to be processed in actual time. On this weblog we confirmed how Cloudera DataFlow makes it straightforward to create, check, and deploy knowledge pipelines within the cloud.

Apache NiFi’s graphical person interface and richness of processors permits customers to create easy and sophisticated knowledge flows with out having to put in writing code. The interactive expertise makes it very straightforward to check and troubleshoot flows through the improvement course of.

Cloudera DataFlow’s stream runtime provides robustness and effectivity to the execution of the flows in manufacturing in a cloud-native and elastic setting, which allows it to develop and shrink to accommodate the workload demand.

Within the half two of this weblog we’ll have a look at how Cloudera Stream Processing (CSP) can be utilized to finish the implementation of our fraud detection use case, performing real-time streaming analytics on the information that we now have simply ingested.

What’s the quickest technique to be taught extra about Cloudera DataFlow and take it for a spin? First, go to our new Cloudera DataFlow residence web page. Then, take our interactive product tour or join a free trial

Leave a Reply