DE/spark
-
StandardScaler (DE/spark) 2022. 10. 31. 15:27
Official documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StandardScaler.html
- withMean=False is the default, so if you also want to center the data on the mean you have to set it to True.
- withStd=True is the default, so the standard deviation is scaled to 1.
- The values differ slightly from numpy and sklearn: Spark uses the sample (unbiased) variance, whereas plain numpy uses the population std. https://stackoverflow.com/questions/51753088/standardscaler-in-spark..
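A minimal sketch of the above in PySpark, assuming a DataFrame df whose numeric columns have already been assembled into a "features" vector column (column names here are made up for illustration):

from pyspark.ml.feature import StandardScaler

# withStd=True is the default; withMean=False is the default, so pass True to also center on the mean
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withMean=True, withStd=True)
scaled = scaler.fit(df).transform(df)
scaled.select("features", "scaled_features").show(truncate=False)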
-
clustering (DE/spark) 2022. 10. 31. 14:04
Official documentation:
- https://spark.apache.org/docs/latest/ml-clustering.html
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html#pyspark.ml.clustering.KMeans.weightCol
- evaluate: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.ClusteringEvaluator.html
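A minimal sketch of KMeans plus ClusteringEvaluator from the links above, assuming a DataFrame df with a "features" vector column (k=3 and seed=1 are arbitrary values for illustration):

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol="features", predictionCol="prediction", k=3, seed=1)
model = kmeans.fit(df)
predictions = model.transform(df)

# ClusteringEvaluator computes the silhouette score by default
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction")
print(evaluator.evaluate(predictions))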
-
libSVM format (DE/spark) 2022. 10. 31. 13:54
- Stores sparse data. Each observation is one line: the target (label) comes first, followed by index:value pairs for the non-zero features only (for observations 1 to n, features 1 to i):
{target} {feature index 1}:{feature value 1} ... {feature index k}:{feature value k}
https://stackoverflow.com/questions/44965186/how-to-understand-the-format-type-of-libsvm-of-spark-mllib
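A minimal sketch of reading this format in PySpark (the file path is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# example file contents, one observation per line:
#   1.0 1:0.5 3:2.1
#   0.0 2:1.7
df = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
df.show()  # two columns: "label" (double) and "features" (sparse vector)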
-
cache vs persist (DE/spark) 2022. 10. 31. 10:41
Why cache or persist is needed: so the same data is not recomputed every time an action runs; needed when you repeat actions. https://stackoverflow.com/questions/28981359/why-do-we-need-to-call-cache-or-persist-on-a-rdd
Difference between cache and persist: whether you can choose the storage level. https://jhleeeme.github.io/spark-caching/
- Checking caching status and storage level in PySpark:
print(df.is_cached)     # check whether it is cached
print(df.storageLevel)  # check the storage level
df.cache()              # only the preset (default) storage level is possible
print(df.is_cached)     # ..
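A minimal sketch of the difference, assuming a DataFrame df already exists:

from pyspark import StorageLevel

df.cache()                          # shorthand for persist() with the default storage level
print(df.is_cached)                 # True
print(df.storageLevel)              # shows the storage level in use

df.unpersist()                      # drop the cached data first
df.persist(StorageLevel.DISK_ONLY)  # persist() lets you pick the storage level explicitly
print(df.storageLevel)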