Questions tagged [apache-spark]

1 vote
1 reply
Is there a way to add multiple columns to a dataframe, calculated as moving averages of different columns and/or over different durations?
I have a dataframe with time-series data and I am trying to add many moving average columns to it, with windows of various ranges. W...
asked 1 month ago
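A minimal sketch of one way to do this in Scala, assuming a long-typed timestamp column ts and value columns v1, v2 (all hypothetical names): build one Window per range with rangeBetween and add each moving average with withColumn.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // One window per desired range; add partitionBy(...) first if the series has a key.
    val w30 = Window.orderBy(col("ts")).rangeBetween(-30 * 60, 0) // last 30 minutes
    val w60 = Window.orderBy(col("ts")).rangeBetween(-60 * 60, 0) // last 60 minutes

    val withAverages = df
      .withColumn("v1_avg_30m", avg(col("v1")).over(w30))
      .withColumn("v2_avg_60m", avg(col("v2")).over(w60))

For many combinations, a Seq of (column, window, name) triples folded with withColumn avoids repeating the pattern by hand.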
1 vote
1 reply
How can I optimize schema inference on a remote CSV file with Spark?
I have a remote file in S3 (or other) and I need the schema of the file. I did not find an option to sample the data as for JSON (e.g. read.optio...
asked 1 month ago
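Spark's CSV reader historically lacked the JSON reader's samplingRatio option, but inference can still be bounded by sampling manually: read the first N lines as text and infer from that. A hedged sketch (path and sample size are assumptions); spark.read.csv(Dataset[String]) has existed since Spark 2.2.

    // Infer from a 1,000-line sample instead of scanning the whole S3 object.
    val path = "s3a://bucket/data.csv" // hypothetical location
    val sample = spark.read.textFile(path).limit(1000)
    val inferred = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(sample)
      .schema

    // Reuse the schema for the full read; no second inference pass.
    val df = spark.read.option("header", "true").schema(inferred).csv(path)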
-1 votes
1 reply
Pairwise comparison of DataFrame elements
How do I iterate pairwise over columns to find similarities? All the elements from all the columns of one DataFrame are to be compared with all t...
asked 1 month ago
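One hedged way to express an all-against-all comparison is a crossJoin, scoring each pair with a string-distance function; the single-column DataFrames dfA and dfB here are hypothetical.

    import org.apache.spark.sql.functions._

    // Every element of dfA("a") paired with every element of dfB("b").
    val pairs = dfA.crossJoin(dfB)
    val scored = pairs.withColumn("similar", levenshtein(col("a"), col("b")) <= 2)

Note that crossJoin output grows as |A| x |B|, so filter or block the candidates first on large inputs.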
-2 votes
1 reply
Sorting large datasets by any column/attribute
I have a MySQL database with ~20M entries (and growing) distributed in some tables. My system has a feature where this information is shown in pa...
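If Spark is the engine here, one hedged option is a partitioned JDBC read, so the ~20M rows are fetched in parallel and can then be sorted by any column; the connection details and bounds below are assumptions.

    import org.apache.spark.sql.functions.col

    val props = new java.util.Properties()
    props.setProperty("user", "app")         // hypothetical credentials
    props.setProperty("password", "secret")

    // Split the scan on a numeric key so 16 tasks read concurrently.
    val entries = spark.read.jdbc(
      "jdbc:mysql://host:3306/db", "entries",
      "id", 0L, 20000000L, 16, props)

    val page = entries.orderBy(col("any_column")).limit(50) // sort by any attribute, then page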
0 votes
0 replies
How to use spark.sql to select from two tables based on columns in every row
Basically, I have two tables, schemas given below: root |-- machine_id: string (nullable = true) |-- time_stamp: double (nullable = true) sca...
asked 1 month ago
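A minimal sketch of the join this seems to call for, assuming both tables are registered as temp views t1 and t2 and share machine_id/time_stamp (column names taken from the excerpt; the join keys are an assumption):

    val joined = spark.sql("""
      SELECT a.machine_id, a.time_stamp
      FROM t1 a
      JOIN t2 b
        ON a.machine_id = b.machine_id
       AND a.time_stamp = b.time_stamp
    """)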
2 votes
1 reply
How can I upgrade Apache Hive to version 3 on a GCP Apache Spark Dataproc cluster?
For one reason or another, I want to upgrade the version of Apache Hive from 2.3.4 to 3 on a Google Cloud Dataproc (1.4.3) Spark cluster. How can I...
-1 votes
0 replies
How to update node properties with the neo4j-spark-connector
I can create a node with the neo4j-spark-connector: val rows = sc.makeRDD(Seq(Row("Laurence", "Fishburne"))) val schema = StructType(Seq(StructFie...
asked 1 month ago
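With the newer official Neo4j Connector for Apache Spark (4.x), updating properties is a DataFrame write that merges on key properties. A hedged sketch, since the question's connector version may predate this API, and the label/column names are assumptions.

    import org.apache.spark.sql.SaveMode

    // MERGE on name/surname; remaining columns become (updated) node properties.
    df.write
      .format("org.neo4j.spark.DataSource")
      .mode(SaveMode.Overwrite)
      .option("labels", ":Person")
      .option("node.keys", "name,surname")
      .save()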
2 votes
2 replies
Custom sorting in Spark using the Java/Scala API
I have the following data: +-------------+ | card type| +-------------+ |ColonialVoice| | SuperiorCard| | Vista| | Distinguish| +------...
asked 1 month ago
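One hedged way to impose an arbitrary order: compute an explicit rank with when/otherwise and sort by it (the desired order below is invented for illustration).

    import org.apache.spark.sql.functions._

    val rank = when(col("card type") === "SuperiorCard", 1)
      .when(col("card type") === "Vista", 2)
      .when(col("card type") === "Distinguish", 3)
      .when(col("card type") === "ColonialVoice", 4)
      .otherwise(5) // unknown values last

    val sorted = df.orderBy(rank)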
-1 votes
0 replies
Sparklyr: split column into separate rows
I have a problem with Spark tables. My table is: # Source: spark<?> [?? x 4] AssetConnectDeviceKey CreateDate FaultStatus D...
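The underlying Spark operation sparklyr would need is explode over split; a minimal Scala sketch of that core (the delimiter and column are assumptions; from sparklyr the same Spark SQL functions are reachable).

    import org.apache.spark.sql.functions._

    // One output row per comma-separated value in FaultStatus.
    val exploded = df.withColumn("FaultStatus", explode(split(col("FaultStatus"), ",")))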
0 votes
0 replies
Error “Task attempt 0 is already registered” with RDD from Kafka Consumer in Spark-Streaming
The app runs in IntelliJ with Spark in local mode. In the loop, when consuming from a Kafka topic with Spark Streaming: if ((_rdd != null) && (_r...
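For comparison, the usual shape with the direct stream extracts offset ranges inside foreachRDD and commits after processing, rather than holding RDDs across batches; `stream` from KafkaUtils.createDirectStream is assumed.

    import org.apache.spark.streaming.kafka010._

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        // ... process this batch's rdd here ...
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    }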
1 vote
1 reply
How to update a Spark dataframe based on a column from another dataframe with many entries in Scala?
I am working with Spark dataframes and want to update a column column_to_be_updated in a hive-table using spark-sql in Scala. My code so far doe...
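A common pattern for this kind of update is a left join plus coalesce, taking the new value where one exists; the table and column names below are assumptions.

    import org.apache.spark.sql.functions._

    // updates(key, new_value) carries the replacement values.
    val updated = target
      .join(updates, Seq("key"), "left")
      .withColumn("column_to_be_updated",
        coalesce(col("new_value"), col("column_to_be_updated")))
      .drop("new_value")

The result can then be written back over the Hive table (via a temp table or checkpoint, since Spark cannot overwrite a table it is simultaneously reading).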
-1 votes
0 replies
What would be the right Spark configuration settings if I were to process a 500 MB gz file?
I am a newbie to Spark and I have a 500 MB .gz file that I want to analyse. I am trying out a filter algorithm using a 3-node cluster (4 vCores and...
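One configuration-independent point: a .gz file is not splittable, so however the cluster is sized, the initial read is a single task. Repartitioning immediately after the read spreads the work across the available cores; the factor below is a rule-of-thumb assumption.

    // One task decompresses the file; repartition before the expensive work.
    val lines = spark.read.textFile("data.gz").repartition(24) // ~2x total vCores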
0 votes
0 replies
Not getting messages in a Spark Streaming program
I have one Kafka instance running on a cluster, publishing messages to a topic. When I trigger the command ./bin/kafka-console-consumer.sh --bootst...
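A minimal Structured Streaming consumer can cross-check whether Spark sees the same messages as the console consumer; the broker address and topic are placeholders.

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical address
      .option("subscribe", "my-topic")                  // hypothetical topic
      .option("startingOffsets", "earliest")            // rule out offset-position issues
      .load()

    kafkaDf.selectExpr("CAST(value AS STRING)")
      .writeStream.format("console").start().awaitTermination()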
1 vote
0 replies
Spark SQL fails to recognize Hive partition columns
I have a partitioned table event_fact. The partition columns are dt and type. I then create a view on top of that table: create view event_fac...
asked 1 month ago
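One way to verify whether the dt/type predicates survive the view is to inspect the physical plan: with pruning intact they show up as PartitionFilters rather than post-scan Filters. The view name and literal values here are invented.

    // Look for "PartitionFilters: [dt = ..., type = ...]" in the scan node.
    spark.sql(
      "SELECT * FROM event_fact_view WHERE dt = '2019-05-01' AND type = 'click'"
    ).explain(true)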
0 votes
1 reply
Spark: how to print the query?
I'm using pyspark df = self.sqlContext.read.option( "es.resource", indexes ).format("org.elasticsearch.spark.sql").load()...
asked 1 month ago
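explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan, which for the Elasticsearch source includes any pushed-down filters. Shown in Scala; in PySpark the equivalent is df.explain(True). The filter is hypothetical.

    import org.apache.spark.sql.functions.col

    df.filter(col("status") === "active").explain(true)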
-1 votes
0 replies
How to handle a large number of count distinct aggregations in Spark SQL
I am using Spark 2.2.2. I have a table t1 with columns c0, c1, c2, c3, ..., cn, and SQL like: Select c0, count(distinct if(condition(c1_1),c0,n...
asked 1 month ago
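If approximate results are acceptable, each exact count(distinct ...) can become an approx_count_distinct over a conditional expression, which aggregates in one pass instead of triggering the expensive expand that many exact distincts do; conditions and names are assumptions.

    import org.apache.spark.sql.functions._

    // when(...) without otherwise yields null, which the distinct count ignores.
    val agg = df.groupBy(col("c0")).agg(
      approx_count_distinct(when(col("c1") > 0, col("c0"))).as("d1"),
      approx_count_distinct(when(col("c2") > 0, col("c0"))).as("d2"))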
0 votes
1 reply
Job failure with no further details; I used a simple rdd.map, converted to a DF, and called show()
I'm a complete beginner with pyspark, just trying some code to process my documents in Databricks Community. I have a lot of HTML pages in a Dataframe...
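For reference, a shape that runs cleanly, keeping the map function pure and serializable (failures inside it only surface when show() forces execution); a Scala sketch with invented data, assuming a SparkSession named spark.

    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq("<html>a</html>", "<html>b</html>"))
    val df = rdd.map(page => (page.length, page)).toDF("length", "html")
    df.show(5, truncate = false)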
0 votes
0 replies
Why does the memory specified for the master not correspond to the amount requested in the Slurm script?
I'm using the following Slurm script to run Spark 2.3.0. #!/bin/bash #SBATCH --account=def-hmcheick #SBATCH --nodes=2 #SBATCH --time=00:10:00 #...
asked 1 month ago
1 vote
1 reply
How to represent nulls in Datasets consisting of a list of case classes
I have a case class final case class FieldStateData( job_id: String = null,...
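The idiomatic representation is Option[_] instead of null defaults: Spark's encoders map Option fields to nullable columns, and None round-trips as SQL NULL. A sketch, with the second field invented.

    final case class FieldStateData(
      job_id: Option[String] = None,
      score: Option[Double] = None) // hypothetical second field

    import spark.implicits._
    val ds = Seq(FieldStateData(Some("j1"), None)).toDS()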
-2 votes
1 reply
Matching values of a column within a dataframe
I have a dataframe that looks like this: Market Price date outtime intime ttype ATLJFKJFKATL 150 20190403 0215 0600...
asked 1 month ago
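If the goal is pairing rows of the same Market whose times line up, a hedged self-join sketch (the exact matching condition is an assumption; column names are from the excerpt):

    import org.apache.spark.sql.functions.col

    val a = df.alias("a")
    val b = df.alias("b")
    val matched = a.join(b,
      col("a.Market") === col("b.Market") && col("a.outtime") === col("b.intime"))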
0 votes
2 replies
Scala object apply method never called in Spark job
I am trying to decouple the logic in my Spark app. I created separate classes for UDF definitions and UDF declarations: UDF declaration: import OPXUd...
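One thing worth checking in this setup: an object's body (and its apply) only runs when something on the driver actually references it. A hedged sketch of keeping a UDF in an object and invoking it explicitly; names and logic are invented.

    import org.apache.spark.sql.functions.{col, udf}

    object OPXUdfs {
      // Evaluated lazily, the first time OPXUdfs is touched on the driver.
      val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)
    }

    val out = df.withColumn("clean", OPXUdfs.normalize(col("raw"))) // hypothetical columns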
1 vote
1 reply
Pyspark: How to deal with null values in Python user defined functions
I want to use some string similarity functions that are not native to pyspark such as the jaro and jaro-winkler measures on dataframes. These are...
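The usual guard is to short-circuit nulls before the measure ever runs, returning null (None in Python) instead. A Scala sketch of the pattern using commons-text's JaroWinklerSimilarity, an assumed extra dependency; the same guard works in a PySpark udf.

    import org.apache.spark.sql.functions.udf
    import org.apache.commons.text.similarity.JaroWinklerSimilarity

    // Nulls never reach the similarity function; the result is null, not an exception.
    val jaroWinkler = udf((a: String, b: String) =>
      if (a == null || b == null) null
      else new JaroWinklerSimilarity().apply(a, b): java.lang.Double)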
1 vote
0 replies
Did I lose parquet files? Why isn't part-<file-number> incremental?
I have a large (data) job that wrote its output to HDFS. The parquet file output is not incremental. The cluster later (I think) lost an executor,...
asked 1 month ago
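Part-file numbers come from task/partition IDs, not a global counter, so gaps by themselves don't prove data loss (retries and speculative attempts leave holes). A hedged way to verify the output instead, with an invented path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // _SUCCESS exists only if the job committed all of its output.
    val committed = fs.exists(new Path("/out/parquet/_SUCCESS"))
    val rows = spark.read.parquet("/out/parquet").count() // sanity-check the row count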
1 vote
1 reply
Nested foreach loop over two DataFrames
A nested foreach-loop iteration over two DataFrames throws a NullPointerException: def nestedDataFrame(leftDF: DataFrame, riteDF: DataFrame): Unit =...
asked 1 month ago
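The NullPointerException is the classic symptom of touching one DataFrame inside another DataFrame's foreach closure: that closure runs on executors, where no SparkSession or DataFrame is available. A hedged reformulation as a join (the key column is an assumption):

    // Express the pairing as a join instead of nested iteration.
    val combined = leftDF.join(riteDF, Seq("key"), "inner")
    combined.foreach(row => println(row)) // per-row work happens on the joined result

If riteDF is small, collecting it to the driver and iterating there is the other standard escape hatch.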
-2 votes
1 reply
How to split the string in all fields of a Spark column and grab the length of the returned split-string list?
I'm currently attempting to grab the number of services a specific IP is running; the services are in a service column, stored as a StringTy...
asked 1 month ago
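size over split does this in one expression; the delimiter and column name below are assumptions.

    import org.apache.spark.sql.functions._

    // Number of entries in the delimited service column.
    val withCount = df.withColumn("service_count", size(split(col("service"), ",")))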