15 Converting Between Long and Wide Formats

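Earlier we loaded the World Bank data. Did you notice that its shape differs from the Gapminder and Polity data? In Gapminder and Polity, each row holds the information for one country in one year. As observations are added, the table grows downward, so this layout is called long data.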

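In particular, data that cover multiple units observed at multiple points in time (for example, many countries over many years) are called panel data or time-series cross-section data.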

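In the World Bank data, by contrast, each row holds the information for one country over the whole period, with one column per year. As observations are added, the table grows sideways, so this layout is called wide data.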
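To make the two shapes concrete, here is a minimal sketch in pandas. The country codes and values are made up for illustration and do not come from any of the datasets mentioned above.

```python
import pandas as pd

# Hypothetical toy data: two countries observed in two years.
long_df = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB", "BBB"],
    "year": [1960, 1961, 1960, 1961],
    "pop": [10, 11, 20, 21],
})

# Long form: one row per country-year, so new years add rows.
print(long_df)

# Wide form: one row per country, so new years add columns.
wide_df = long_df.pivot(index="country_code", columns="year", values="pop")
print(wide_df)
```

The long table has four rows and three columns, while the wide table has two rows and one column per year.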

Long data are usually better suited to analysis, so this chapter shows how to convert wide data to long data. For more on data formats suited to analysis, see the following reference.

15.1 tidyverse

First, load the data. As preparation, we also select the variables we need and rename one of them along the way.

R (tidyverse)

library(tidyverse)

df_pop_fem <- read_csv("data/wb_pop_fem.csv", skip = 4) |> 
  select(country_code = "Country Code", "1960":"2022")

head(df_pop_fem)

# A tibble: 6 × 64
  country_code   `1960`  `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968`
  <chr>           <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 ABW             27773  2.84e4 2.88e4 2.92e4 2.96e4 2.99e4 3.01e4 3.03e4 3.02e4
2 AFE          65853220  6.76e7 6.95e7 7.14e7 7.34e7 7.55e7 7.76e7 7.98e7 8.21e7
3 AFG           4145945  4.23e6 4.33e6 4.42e6 4.53e6 4.63e6 4.75e6 4.86e6 4.98e6
4 AFW          48802898  4.99e7 5.09e7 5.20e7 5.32e7 5.44e7 5.56e7 5.69e7 5.82e7
5 AGO           2670229  2.70e6 2.74e6 2.78e6 2.81e6 2.84e6 2.86e6 2.87e6 2.88e6
6 ALB            785048  8.09e5 8.34e5 8.58e5 8.82e5 9.06e5 9.29e5 9.53e5 9.80e5
# ℹ 54 more variables: `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
#   `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
#   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
#   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
#   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
#   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
#   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, …

The notation "1960":"2022" selects every variable from 1960 through 2022 at once.

The pivot_longer() function converts the data to long form. Inside the function, specify the names of the (wide-data) variables that should be made long.

R (tidyverse)

df_pop_fem <- df_pop_fem |> 
  pivot_longer("1960":"2022")

head(df_pop_fem)

# A tibble: 6 × 3
  country_code name  value
  <chr>        <chr> <dbl>
1 ABW          1960  27773
2 ABW          1961  28380
3 ABW          1962  28820
4 ABW          1963  29218
5 ABW          1964  29570
6 ABW          1965  29875

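What used to be the variable names are now stored in name, and what used to be the values are stored in value, so rename both to something suitable. (pivot_longer() can also set these names directly through its names_to and values_to arguments.)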

R (tidyverse)

df_pop_fem <- df_pop_fem |> 
  rename(year = name, pop_fem = value)

head(df_pop_fem)

# A tibble: 6 × 3
  country_code year  pop_fem
  <chr>        <chr>   <dbl>
1 ABW          1960    27773
2 ABW          1961    28380
3 ABW          1962    28820
4 ABW          1963    29218
5 ABW          1964    29570
6 ABW          1965    29875

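Look closely and you will also see <chr> under year. This is short for character, which in programming means a string. Years should be handled as numbers, so convert year from a string to a number.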

R (tidyverse)

df_pop_fem <- df_pop_fem |> 
  mutate(year = parse_number(year))

head(df_pop_fem)

# A tibble: 6 × 3
  country_code  year pop_fem
  <chr>        <dbl>   <dbl>
1 ABW           1960   27773
2 ABW           1961   28380
3 ABW           1962   28820
4 ABW           1963   29218
5 ABW           1964   29570
6 ABW           1965   29875

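Now <dbl> appears under year. This is short for double, that is, a double-precision floating-point number, which means the column is numeric.

That completes the conversion from wide data to long data.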

15.2 pandas

Let us do the same in pandas.

Python (pandas)

import pandas as pd

df_pop_fem = pd.read_csv("data/wb_pop_fem.csv", skiprows=4)
df_pop_fem = df_pop_fem.rename(columns={"Country Code": "country_code"})
df_pop_fem = df_pop_fem.drop(columns=df_pop_fem.columns[[0, 2, 3, -1, -2]])

df_pop_fem.head()

  country_code        1960        1961  ...         2020         2021         2022
0          ABW     27773.0     28380.0  ...      56373.0      56330.0      56272.0
1          AFE  65853220.0  67606287.0  ...  345889868.0  354855221.0  363834524.0
2          AFG   4145945.0   4233771.0  ...   19279930.0   19844584.0   20362329.0
3          AFW  48802898.0  49850088.0  ...  231877590.0  237813580.0  243821774.0
4          AGO   2670229.0   2704394.0  ...   16910989.0   17452283.0   17998220.0

[5 rows x 64 columns]

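df_pop_fem.columns[[0, 2, 3, -1, -2]] extracts the names of the first, third, and fourth variables of the data frame, together with the last and second-to-last ones. drop() then removes those columns.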
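The positional selection can be checked on a small example. The column names below are a hypothetical, shortened layout chosen for illustration; they are an assumption, not a listing of the actual file.

```python
import pandas as pd

# Hypothetical, shortened column layout (an assumption for illustration).
columns = pd.Index([
    "Country Name", "Country Code", "Indicator Name", "Indicator Code",
    "1960", "1961", "2023", "Unnamed: 7",
])

# Positions 0, 2, 3 count from the front; -1 and -2 count from the back.
to_drop = columns[[0, 2, 3, -1, -2]]
print(list(to_drop))

# What remains after removing those names.
kept = columns.drop(to_drop)
print(list(kept))
```

Negative positions are convenient here because the columns to discard sit at the end regardless of how many year columns the file contains.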

The pd.wide_to_long() function converts the data to long form.

Python (pandas)

df_pop_fem = pd.wide_to_long(df_pop_fem, stubnames="", i="country_code", j="year")

df_pop_fem.head()

                             
country_code year            
ABW          1960     27773.0
AFE          1960  65853220.0
AFG          1960   4145945.0
AFW          1960  48802898.0
AGO          1960   2670229.0

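i specifies the variable that identifies the unit, and j gives the name of the new variable that will hold time.
stubnames specifies the common prefix shared by the names of the wide-data variables to be made long. Here the column names are bare years with no such prefix, so we pass the empty string.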
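The role of stubnames is easier to see when the columns do share a prefix. The following sketch uses made-up columns named pop1960 and pop1961 (hypothetical names, not from the World Bank file) and passes "pop" as the stub.

```python
import pandas as pd

# Hypothetical wide data whose year columns share the prefix "pop".
wide = pd.DataFrame({
    "country_code": ["AAA", "BBB"],
    "pop1960": [10, 20],
    "pop1961": [11, 21],
})

# The stub "pop" is stripped from the column names; the remaining
# numeric suffix goes into the new index level named by j.
long = pd.wide_to_long(wide, stubnames="pop", i="country_code", j="year")

# Move the index levels back into ordinary columns and sort for display.
long = long.reset_index().sort_values(["country_code", "year"])
long = long.reset_index(drop=True)
print(long)
```

The value column takes the stub as its name, which is why an empty stub produces a column with an empty name.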

The new variable's name is blank, so rename it.

Python (pandas)

df_pop_fem = df_pop_fem.rename(columns={"": "pop_fem"})

df_pop_fem.head()

                      pop_fem
country_code year            
ABW          1960     27773.0
AFE          1960  65853220.0
AFG          1960   4145945.0
AFW          1960  48802898.0
AGO          1960   2670229.0

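It is hard to see, but year and country_code are not variables here; they form the index, so turn them into variables. While we are at it, store year explicitly as an integer.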

Python (pandas)

df_pop_fem = df_pop_fem.reset_index()
df_pop_fem["year"] = df_pop_fem["year"].astype("int64")

df_pop_fem.head()

  country_code  year     pop_fem
0          ABW  1960     27773.0
1          AFE  1960  65853220.0
2          AFG  1960   4145945.0
3          AFW  1960  48802898.0
4          AGO  1960   2670229.0
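As an alternative to pd.wide_to_long(), pandas also offers DataFrame.melt(), which sets the new column names in the same call and leaves no index to reset. This is a different technique from the one used above, shown here as a minimal sketch on made-up data rather than on the World Bank file.

```python
import pandas as pd

# Hypothetical wide data with bare year columns.
wide = pd.DataFrame({
    "country_code": ["AAA", "BBB"],
    "1960": [10.0, 20.0],
    "1961": [11.0, 21.0],
})

# id_vars stays as-is; every other column is stacked into two columns
# named by var_name (old column names) and value_name (old values).
long = wide.melt(id_vars="country_code", var_name="year", value_name="pop_fem")

# The former column names are strings, so convert them to integers.
long["year"] = long["year"].astype("int64")
print(long)
```

Which function to prefer is largely a matter of taste; wide_to_long() is handy when column names carry a prefix, while melt() is more direct when they do not.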

15.3 polars

Let us try it in polars as well.

Python (polars)

import polars as pl

df_pop_fem = pl.read_csv("data/wb_pop_fem.csv", skip_rows=4)
df_pop_fem = df_pop_fem.rename({"Country Code": "country_code"})
df_pop_fem = df_pop_fem.select(pl.col("country_code"), pl.col("^[0-9]+$").exclude("2023"))

df_pop_fem.head()

shape: (5, 64)

country_code	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
str	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
"ABW"	27773	28380	28820	29218	29570	29875	30135	30253	30232	30166	30063	29927	29953	30229	30595	30972	31245	31416	31584	31749	31909	32121	32389	32659	32886	33008	33007	32904	32788	32892	33480	34657	35941	37137	38437	39724	41014	42336	43688	45050	46269	47178	47831	48374	48877	49414	50016	50636	51272	51919	52484	52980	53480	53953	54403	54828	55224	55591	55935	56254	56373	56330	56272
"AFE"	65853220	67606287	69457112	71375643	73386167	75478396	77610073	79810945	82111287	84493601	86968714	89504801	92051419	94694181	97478670	100339888	103289004	106237590	109415983	112834021	116060576	119525759	123410049	127333314	131344567	135563206	139816011	144066893	148288335	152522362	156942214	161298074	165609524	170167926	174762745	179486372	184468529	189280003	194009070	198959676	204048614	209257664	214635664	220167814	225898442	231830259	237997868	244435307	251105628	257956460	265000967	272174714	279546577	287224924	295089133	303195897	311387401	319637365	328159112	336930970	345889868	354855221	363834524
"AFG"	4145945	4233771	4326881	4424511	4526691	4634341	4745981	4861918	4983086	5108507	5239568	5372747	5509781	5655304	5803603	5948268	6083166	6214979	6342838	6373547	6136856	5490160	4973968	4916351	5074600	5225679	5207273	5152650	5188060	5334609	5346409	5372208	6028939	7000119	7722096	8199445	8537421	8871958	9217591	9595036	9727541	9793166	10438055	11247647	11690825	12109086	12614497	12835340	13088192	13557331	13949295	14468875	15067373	15594637	16172321	16682054	17115346	17614722	18136922	18679089	19279930	19844584	20362329
"AFW"	48802898	49850088	50928609	52044907	53196730	54389295	55621877	56890201	58204276	59560501	60963620	62404746	63900687	65482730	67160750	68943269	70786681	72718234	74770813	76909670	79104037	81359426	83714354	85996392	88238093	90605997	93047700	95556172	98139171	100824162	103478502	106184462	109071980	111968903	114896750	117979287	121143186	124399328	127775360	131211380	134795501	138546839	142408066	146370538	150463219	154696476	159035017	163481052	168058833	172782717	177645233	182657978	187755307	192900081	198163527	203513873	208980433	214578994	220253839	226004857	231877590	237813580	243821774
"AGO"	2670229	2704394	2742689	2779473	2812590	2838939	2856740	2867926	2879001	2902120	2953347	3032948	3132441	3244749	3362438	3483416	3606782	3735823	3872130	4014347	4164145	4321167	4485276	4656894	4834820	5018620	5206761	5396035	5588733	5787505	5991207	6199060	6408303	6621767	6845622	7077381	7315200	7561436	7813123	8071413	8339311	8619083	8912191	9219638	9545020	9886765	10244381	10620174	11013001	11422969	11853530	12303450	12770743	13252938	13746371	14248799	14764575	15293335	15828040	16370553	16910989	17452283	17998220

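In pl.col("^[0-9]+$").exclude("2023"), "^[0-9]+$" is a regular expression. It matches strings that consist only of one or more digits from 0 to 9, from start to end, so it picks out the year columns.
exclude("2023") drops the 2023 column because it contains no data.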
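The behavior of the pattern can be checked with Python's standard re module. The list of column names below is hypothetical and only serves to show which kinds of names match.

```python
import re

# ^ anchors the start, [0-9]+ requires one or more digits, $ anchors the end.
pattern = re.compile(r"^[0-9]+$")

# Hypothetical column names for illustration.
columns = ["Country Code", "1960", "2022", "2023", "Unnamed: 68", "1960s"]

# Keep only the names made up entirely of digits.
year_like = [name for name in columns if pattern.search(name)]
print(year_like)  # ['1960', '2022', '2023']
```

Names such as "1960s" or "Unnamed: 68" contain digits but also other characters, so they do not match.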

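The melt() method converts the data to long form. In polars 1.0 and later, melt() is deprecated in favor of unpivot(), which takes index= in place of id_vars=.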

Python (polars)

df_pop_fem = df_pop_fem.melt(id_vars="country_code")

df_pop_fem.head()

shape: (5, 3)

country_code	variable	value
str	str	i64
"ABW"	"1960"	27773
"AFE"	"1960"	65853220
"AFG"	"1960"	4145945
"AFW"	"1960"	48802898
"AGO"	"1960"	2670229

Rename the variables and convert the year from a string to a number.

Python (polars)

df_pop_fem = df_pop_fem.rename({"variable": "year", "value": "pop_fem"})
df_pop_fem = df_pop_fem.with_columns(year=pl.col("year").str.to_integer())

df_pop_fem.head()

shape: (5, 3)

country_code	year	pop_fem
str	i64	i64
"ABW"	1960	27773
"AFE"	1960	65853220
"AFG"	1960	4145945
"AFW"	1960	48802898
"AGO"	1960	2670229

str.to_integer() converts a string to an integer.