Spark Data Skew Problem
What is your audience?
Warehouse architects, data engineers, and Spark developers with an intermediate level of experience who want to improve the performance of data processing.
What are the requirements?
Spark, Hadoop, big data
What's your ETA?
Article draft at google docs: /DataLake/SparkChallenges/DataSkew/ArticleDraft (link)
The performance of big data systems is directly linked to the uniform distribution of the processed data across all of the workers. When you take data from a database table for processing, the rows should be distributed uniformly among all the workers. If some data slices contain more rows than others, the workers assigned to them must work harder and need more time and resources to complete their jobs. These data slices, and the workers that manage them, become a performance bottleneck for the whole data processing task. Such uneven distribution of data is called skew; an optimal data distribution has no skew.
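As a rough illustration of why skew creates a bottleneck, the sketch below simulates default hash partitioning in plain Python (not Spark itself; the dataset and worker count are made up). One "hot" key holds most of the rows, and because hash partitioning sends all rows with the same key to the same worker, one worker receives almost the entire dataset.

```python
from collections import Counter

# Hypothetical skewed dataset: one "hot" key dominates, as often
# happens with null/default values or a single very popular customer.
rows = ["hot_key"] * 9_000 + [f"key_{i}" for i in range(1_000)]

NUM_WORKERS = 4

# Hash partitioning: every row with the same key lands on the same
# worker, so the hot key's 9,000 rows all go to one partition.
partitions = Counter(hash(key) % NUM_WORKERS for key in rows)
sizes = [partitions[p] for p in range(NUM_WORKERS)]

# The stage finishes only when the single overloaded worker finishes.
print(sizes)
```

The job's wall-clock time is roughly the size of the largest partition, not the average, which is exactly the bottleneck described above.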
- Problem description
- What is data skew?
- How does it influence data processing in distributed systems?
- Map-reduce programming model and data skew
- Search algorithms on skewed data
- Problem solutions
- From skew to uniform distribution
- Storage layer solutions (bucketing, key salting, ...)
- Processing layer solutions (shuffling, custom partitioning, ...)
- Practical experience
- Original (skewed) data processing
- Modified data processing
- OpenData resources and data skew problem
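Since the outline mentions key salting, here is a minimal sketch of the idea in plain Python (the dataset, salt-bucket count, and worker count are illustrative assumptions, not taken from the draft). Appending a random salt suffix splits a hot key into many distinct shuffle keys, spreading its rows across workers:

```python
import random
from collections import Counter

# Same hypothetical skewed dataset as a plain hash partition would see.
rows = ["hot_key"] * 9_000 + [f"key_{i}" for i in range(1_000)]

NUM_WORKERS = 4
SALT_BUCKETS = 32  # how many pieces each key may be split into

random.seed(0)  # deterministic salting for this demo

def salted(key: str) -> str:
    # Append a random salt so rows of a single hot key are spread
    # over up to SALT_BUCKETS distinct shuffle keys.
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

partitions = Counter(hash(salted(key)) % NUM_WORKERS for key in rows)
sizes = [partitions[p] for p in range(NUM_WORKERS)]

# Partition sizes are now far more balanced than the 9,000-row spike.
print(sizes)
```

The trade-off is that aggregations then need two steps: a partial aggregate per salted key, followed by a final aggregate after stripping the salt suffix.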
Before creating a new issue, please search for existing ones to avoid creating duplicates.
The issue title is the title of the article you're proposing to write about.
Please visit the webpage for an overview of the Community Writers Program.
Once all the fields of this template are completed, please add a comment to the thread:
@vbrevus I would like to write about this subject and I accept the [terms and conditions](https://dataengi.com/terms/) of the Community Writers Program.