DE/spark
-
StandardScaler (DE/spark) 2022. 10. 31. 15:27
Official documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StandardScaler.html
- withMean=False is the default, so if you also want to center the data on the mean you have to set it to True.
- withStd=True is the default, so the standard deviation is scaled to 1.
- The values differ slightly from numpy and sklearn: Spark uses the sample (unbiased) variance, whereas plain numpy uses the population std. https://stackoverflow.com/questions/51753088/standardscaler-in-spark..
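A minimal sketch of the above in PySpark, assuming a DataFrame df whose numeric columns have already been assembled into a "features" vector column (column names here are made up for illustration):

from pyspark.ml.feature import StandardScaler

# withStd=True is the default; withMean=False is the default, so pass True to also center on the mean
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withMean=True, withStd=True)
scaled = scaler.fit(df).transform(df)
scaled.select("features", "scaled_features").show(truncate=False)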
-
clustering (DE/spark) 2022. 10. 31. 14:04
Official documentation:
- https://spark.apache.org/docs/latest/ml-clustering.html
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html#pyspark.ml.clustering.KMeans.weightCol
- evaluate: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.ClusteringEvaluator.html
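A minimal sketch of KMeans plus ClusteringEvaluator from the links above, assuming a DataFrame df with a "features" vector column (k=3 and seed=1 are arbitrary values for illustration):

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol="features", predictionCol="prediction", k=3, seed=1)
model = kmeans.fit(df)
predictions = model.transform(df)

# ClusteringEvaluator computes the silhouette score by default
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction")
print(evaluator.evaluate(predictions))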
-
libSVM format (DE/spark) 2022. 10. 31. 13:54
- Stores sparse data. Each observation is one line: the target (label) comes first, followed by index:value pairs for the non-zero features only (for observations 1 to n, features 1 to i):
{target} {feature index 1}:{feature value 1} ... {feature index k}:{feature value k}
https://stackoverflow.com/questions/44965186/how-to-understand-the-format-type-of-libsvm-of-spark-mllib
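A minimal sketch of reading this format in PySpark (the file path is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# example file contents, one observation per line:
#   1.0 1:0.5 3:2.1
#   0.0 2:1.7
df = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
df.show()  # two columns: "label" (double) and "features" (sparse vector)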
-
cache vs persist (DE/spark) 2022. 10. 31. 10:41
Why cache or persist is needed: so the same data is not recomputed every time an action runs; needed when you repeat actions. https://stackoverflow.com/questions/28981359/why-do-we-need-to-call-cache-or-persist-on-a-rdd
Difference between cache and persist: whether you can choose the storage level. https://jhleeeme.github.io/spark-caching/
- Checking caching status and storage level in PySpark:
print(df.is_cached)     # check whether it is cached
print(df.storageLevel)  # check the storage level
df.cache()              # only the preset (default) storage level is possible
print(df.is_cached)     # ..
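A minimal sketch of the difference, assuming a DataFrame df already exists:

from pyspark import StorageLevel

df.cache()                          # shorthand for persist() with the default storage level
print(df.is_cached)                 # True
print(df.storageLevel)              # shows the storage level in use

df.unpersist()                      # drop the cached data first
df.persist(StorageLevel.DISK_ONLY)  # persist() lets you pick the storage level explicitly
print(df.storageLevel)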