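[Python] Handling Large Data

▶ Handling Large Data

While recently running a dictionary attack against SHA-1 hashes, I realized I was not good at handling large data. To fill that gap, I am writing down some ways to work with large data.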

▶ Pandas

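[1] Loading CSV data in chunks

When you load data with more than a million rows at once, the data becomes too heavy and everything slows down (in my case, memory could not cope and I lost all of my data). The chunksize parameter of pandas.read_csv helps here.

It sets how many rows are read into a DataFrame at a time, so that each piece fits in local memory.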

import pandas as pd
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)  # an iterator of DataFrames, not a single DataFrame
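
Because read_csv with chunksize yields one DataFrame per chunk, the usual pattern is to loop over the chunks and reduce each one before moving on. The following is a minimal sketch of that idea, not the only way to do it. The in-memory CSV text, the column names user and amount, and the tiny chunksize of 4 are all made up for illustration; with a real file you would pass its path and a much larger chunksize.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "user,amount\n" + "\n".join(f"u{i % 3},{i}" for i in range(10))

total_rows = 0
partial_sums = []

# chunksize=4 makes read_csv yield DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
    # Reduce each chunk right away so only the small summary is kept in memory.
    partial_sums.append(chunk.groupby("user")["amount"].sum())

# Combine the per-chunk summaries into one final result.
amount_by_user = pd.concat(partial_sums).groupby(level=0).sum()
print(total_rows)
print(amount_by_user)
```

Only one chunk plus the small per-chunk summaries are held in memory at any time, which is what keeps this approach usable when the whole file would not fit.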

[2] Reducing data size with column types

import numpy as np
import pandas as pd


def check_dtypes(file_path):
    """Return a {column: dtype} mapping with the smallest dtype that fits each column."""
    print(file_path)
    # Read only the header row to get the column names.
    tmp = pd.read_csv(file_path, nrows=0)
    col_dtypes = {}
    for col in tmp.columns:
        # Load one column at a time to keep peak memory low.
        df = pd.read_csv(file_path, usecols=[col])
        dtype = str(df[col].dtype)

        # Default: keep what pandas inferred (covers bool and any other dtype).
        col_dtype = dtype

        if "int" in dtype:
            # Convert to Python ints so the range comparisons below are exact.
            c_min = int(df[col].min())
            c_max = int(df[col].max())
            # Try the narrowest integer types first; the bounds are inclusive.
            for candidate in ("int8", "uint8", "int16", "uint16",
                              "int32", "uint32", "int64", "uint64"):
                info = np.iinfo(candidate)
                if c_min >= info.min and c_max <= info.max:
                    col_dtype = candidate
                    break

        elif "float" in dtype:
            c_min = df[col].min()
            c_max = df[col].max()
            # Floats need np.finfo; np.iinfo only accepts integer types.
            # float32 keeps about 7 significant digits, so precision may be lost.
            info = np.finfo(np.float32)
            if c_min >= info.min and c_max <= info.max:
                col_dtype = "float32"
            else:
                col_dtype = "float64"

        elif dtype == "object":
            # Ratio of distinct values to rows: mostly-unique columns stay as object.
            threshold = df[col].nunique() / max(len(df), 1)
            col_dtype = "object" if threshold > 0.7 else "category"

        col_dtypes[col] = col_dtype

    return col_dtypes


file_path = r"../test.csv"
data_types = check_dtypes(file_path)
df = pd.read_csv(file_path, dtype=data_types)
df
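
To get a feel for how much narrower dtypes can save, it helps to compare memory usage before and after the conversion. The sketch below uses a small synthetic DataFrame rather than a real CSV; the column names age and city and their values are assumptions chosen for illustration. It relies on DataFrame.memory_usage(deep=True), which also counts the memory held by string objects.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real CSV (values chosen for illustration).
n = 10_000
df = pd.DataFrame({
    "age": np.arange(n, dtype="int64") % 100,                 # values 0..99 fit in int8
    "city": ["Seoul", "Busan", "Daegu", "Incheon"] * (n // 4),  # only 4 distinct values
})

# deep=True includes the memory used by the string objects themselves.
before = df.memory_usage(deep=True).sum()

# Apply the kind of mapping a dtype check would produce for these columns.
small = df.astype({"age": "int8", "city": "category"})
after = small.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

The exact byte counts depend on the pandas version and platform, so none are quoted here, but the converted frame should come out noticeably smaller while holding the same values.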

▶ Modin

[1] Modin import

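import modin.pandas as pd  # drop-in replacement for "import pandas as pd"
df = pd.read_csv("path/to/your/data.csv")  # placeholder path

// Usage is the same as pandas. Modin runs on top of an execution engine such as Ray or Dask, so one of them must be installed.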

References: https://dodonam.tistory.com/360

https://chancoding.tistory.com/204
