whatstk.whatsapp
WhatsApp parser.
whatstk.whatsapp.objects
Library WhatsApp objects.
Classes:
WhatsAppChat(df) – Load and process a WhatsApp chat file.
- class whatstk.whatsapp.objects.WhatsAppChat(df: DataFrame)
Bases: BaseChat
Load and process a WhatsApp chat file.
- Parameters
df (pandas.DataFrame) – Chat.
Example
This simple example loads a chat using WhatsAppChat. Once loaded, we can access its attribute df, which contains the loaded chat as a DataFrame.
>>> from whatstk.whatsapp.objects import WhatsAppChat
>>> from whatstk.data import whatsapp_urls
>>> chat = WhatsAppChat.from_source(filepath=whatsapp_urls.POKEMON)
>>> chat.df.head(5)
                 date     username                                            message
0 2016-08-06 13:23:00  Ash Ketchum                                          Hey guys!
1 2016-08-06 13:25:00        Brock              Hey Ash, good to have a common group!
2 2016-08-06 13:30:00        Misty  Hey guys! Long time haven't heard anything fro...
3 2016-08-06 13:45:00  Ash Ketchum  Indeed. I think having a whatsapp group nowada...
4 2016-08-06 14:30:00        Misty                                          Definetly
Optionally, you can use the argument extra_metadata to add additional metadata to the chat:
>>> chat = WhatsAppChat.from_source(filepath=whatsapp_urls.POKEMON, extra_metadata=True)
>>> chat.name
'Pokemon Chat'
>>> chat.df_system
                 date                                            message
0 2016-04-15 15:04:00  Messages and calls are end-to-end encrypted. N...
>>> chat.df.head()
                 date     username                                            message
0 2016-08-06 13:23:00  Ash Ketchum                                          Hey guys!
1 2016-08-06 13:25:00        Brock              Hey Ash, good to have a common group!
2 2016-08-06 13:30:00        Misty  Hey guys! Long time haven't heard anything fro...
3 2016-08-06 13:45:00  Ash Ketchum  Indeed. I think having a whatsapp group nowada...
4 2016-08-06 14:30:00        Misty                                          Definetly
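Attributes:
df – Chat as DataFrame.
df_system – DataFrame with only the system messages of the chat.
end_date – Chat end date.
is_group – True if the chat is a group.
name – Name of the chat.
start_date – Chat starting date.
users – List with users.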
Methods:
from_source(filepath[, extra_metadata]) – Create an instance from a chat text file.
from_sources(filepaths[, auto_header, ...]) – Load a WhatsAppChat instance from multiple sources.
merge(chat[, rename_users]) – Merge current instance with chat.
rename_users(mapping) – Rename users.
to_csv(filepath) – Save chat as csv.
to_txt(filepath[, hformat, encoding]) – Export chat to a text file.
to_zip(filepath[, hformat, encoding]) – Export chat to a zip file.
- property df: DataFrame
Chat as DataFrame.
- Returns
pandas.DataFrame
- property df_system: DataFrame
DataFrame with only the system messages of the chat.
- Returns
pandas.DataFrame
- property end_date: Union[str, datetime]
Chat end date.
- Returns
datetime
- classmethod from_source(filepath: str, extra_metadata: Optional[bool] = None, **kwargs: Any) → WhatsAppChat
Create an instance from a chat text file.
- Parameters
filepath (str) –
Path to the file. Accepted sources are:
Local file, e.g. ‘path/to/file.txt’ or ‘path/to/file.zip’ (iOS).
URL to a remote hosted file, e.g. ‘http://www.url.to/file.txt’.
Link to Google Drive file, e.g. ‘gdrive://35gKKrNk-i3t05zPLyH4_P1rPdOmKW9NZ’. The format is expected to be ‘gdrive://[FILE-ID]’. Note that in order to load a file from Google Drive you first need to run gdrive_init.
**kwargs – Refer to the docs from df_from_whatsapp for details on additional arguments.
extra_metadata (bool) – This is experimental. If True, additional metadata will be added to the DataFrame. This includes class attributes such as chat.name, chat.df_system (DataFrame with only system messages). Note that this attribute only works on group chats.
- Returns
WhatsAppChat – Class instance with loaded and parsed chat.
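The three accepted source types can be told apart by their prefix. The following is only a minimal sketch of that idea; classify_source and gdrive_file_id are hypothetical helpers written for illustration and are not part of whatstk.

```python
from urllib.parse import urlparse


def classify_source(filepath: str) -> str:
    """Return 'gdrive', 'url' or 'local' for a chat source string.

    Hypothetical helper for illustration; not part of whatstk.
    """
    if filepath.startswith("gdrive://"):
        return "gdrive"
    if urlparse(filepath).scheme in ("http", "https"):
        return "url"
    # Anything else is treated as a local path (.txt or .zip).
    return "local"


def gdrive_file_id(filepath: str) -> str:
    """Extract FILE-ID from a string of the form 'gdrive://[FILE-ID]'."""
    prefix = "gdrive://"
    if not filepath.startswith(prefix):
        raise ValueError("expected a string of the form 'gdrive://[FILE-ID]'")
    return filepath[len(prefix):]
```

For instance, classify_source('path/to/file.zip') gives 'local', while the Google Drive link above is classified as 'gdrive'.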
- classmethod from_sources(filepaths: str, auto_header: Optional[bool] = None, hformat: Optional[str] = None, encoding: str = 'utf-8') → WhatsAppChat
Load a WhatsAppChat instance from multiple sources.
- Parameters
filepaths (list) – List with filepaths.
auto_header (bool, optional) – Detect header automatically (applies to all files). If None, attempts to perform automatic header detection for all files. If False, hformat is required.
hformat (list, optional) – List with the header format to be used for each file. The list must be of length equal to len(filepaths). A valid header format might be ‘[%y-%m-%d %H:%M:%S] - %name:’.
encoding (str) – Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.
- Returns
WhatsAppChat – Class instance with loaded and parsed chat.
Example
Load a chat using two text files. In this example, we use sample chats (available online, see URLs in the source code of whatstk.data).
>>> from whatstk.whatsapp.objects import WhatsAppChat
>>> from whatstk.data import whatsapp_urls
>>> filepath_1 = whatsapp_urls.LOREM1
>>> filepath_2 = whatsapp_urls.LOREM2
>>> chat = WhatsAppChat.from_sources(filepaths=[filepath_1, filepath_2])
>>> chat.df.head(5)
                 date        username                                            message
0 2019-10-20 10:16:00            John        Laborum sed excepteur id eu cillum sunt ut.
1 2019-10-20 11:15:00            Mary  Ad aliquip reprehenderit proident est irure mo...
2 2019-10-20 12:16:00  +1 123 456 789  Nostrud adipiscing ex enim reprehenderit minim...
3 2019-10-20 12:57:00  +1 123 456 789  Deserunt proident laborum exercitation ex temp...
4 2019-10-20 17:28:00            John                Do ex dolor consequat tempor et ex.
- property is_group: bool
True if the chat is a group.
A chat is detected as a group if it has more than 2 users (including the ‘system’). Groups with one person will not be detected as groups.
- Returns
bool
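The detection rule stated above can be written down in a couple of lines. This is a sketch of the documented rule only; looks_like_group is a hypothetical name and the actual implementation may differ.

```python
from typing import Iterable


def looks_like_group(usernames: Iterable[str]) -> bool:
    """Documented rule: more than 2 distinct users, the 'system' user included."""
    return len(set(usernames)) > 2
```

With this rule, a chat with the users 'Pokemon Chat' (system), 'Ash Ketchum' and 'Brock' counts as a group, whereas a group with a single person plus the system user does not.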
- merge(chat: BaseChat, rename_users: Optional[Dict[str, str]] = None) → BaseChat
Merge current instance with chat.
- Parameters
chat (WhatsAppChat) – Another chat.
rename_users (dict) – Dictionary mapping old names to new names. Example: {‘John’:[‘Jon’, ‘J’], ‘Ray’: [‘Raymond’]} will map ‘Jon’ and ‘J’ to ‘John’, and ‘Raymond’ to ‘Ray’. Note that old names must come as list (even if there is only one).
- Returns
BaseChat – Merged chat.
Example
Merging two chats is handy when you have exported the same chat at different times from your phone, so that each exported file may contain data that is unique to that file.
In this example, however, we merge files from different chats.
>>> from whatstk.whatsapp.objects import WhatsAppChat
>>> from whatstk.data import whatsapp_urls
>>> filepath_1 = whatsapp_urls.LOREM1
>>> filepath_2 = whatsapp_urls.LOREM2
>>> chat_1 = WhatsAppChat.from_source(filepath=filepath_1)
>>> chat_2 = WhatsAppChat.from_source(filepath=filepath_2)
>>> chat = chat_1.merge(chat_2)
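To illustrate why merging exports taken at different times is useful, here is a sketch that combines two overlapping message lists using plain tuples. Dropping exact duplicates and sorting by date are assumptions made for this illustration; the documentation above does not state how merge handles overlapping messages, and merge_exports is a hypothetical helper.

```python
from datetime import datetime
from typing import List, Tuple

Message = Tuple[datetime, str, str]  # (date, username, message)


def merge_exports(first: List[Message], second: List[Message]) -> List[Message]:
    """Combine two exports, drop exact duplicates and sort by date.

    Assumption for illustration only; not necessarily what merge does.
    """
    unique = set(first) | set(second)
    # Tuples sort by their first element, i.e. the message date.
    return sorted(unique)


export_1 = [
    (datetime(2019, 10, 20, 10, 16), "John", "Hi"),
    (datetime(2019, 10, 20, 11, 15), "Mary", "Hello"),
]
export_2 = [
    (datetime(2019, 10, 20, 11, 15), "Mary", "Hello"),  # also in export_1
    (datetime(2019, 10, 20, 12, 16), "John", "Bye"),
]
merged = merge_exports(export_1, export_2)
```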
- property name: Optional[str]
Name of the chat.
Returns None if no name could be found. The name is extracted from the username associated with the first system message in the chat.
- Returns
str
- rename_users(mapping: Dict[str, str]) → BaseChat
Rename users.
This might be needed on multiple occasions:
- Fix typos in user names stored in phone.
- If a user appears multiple times with different usernames, group these under the same name (this might happen when multiple chats are merged).
- Parameters
mapping (dict) – Dictionary mapping old names to new names, example: {‘John’: [‘Jon’, ‘J’], ‘Ray’: [‘Raymond’]} will map ‘Jon’ and ‘J’ to ‘John’, and ‘Raymond’ to ‘Ray’. Note that old names must come as list (even if there is only one).
- Returns
BaseChat – Chat with users renamed according to mapping.
- Raises
ValueError – Raised if mapping is not correct.
Examples
Load LOREM2 chat and rename users Maria and Maria2 to Mary.
>>> from whatstk.whatsapp.objects import WhatsAppChat
>>> from whatstk.data import whatsapp_urls
>>> chat = WhatsAppChat.from_source(filepath=whatsapp_urls.LOREM2)
>>> chat.users
['+1 123 456 789', 'Giuseppe', 'John', 'Maria', 'Maria2']
>>> chat = chat.rename_users(mapping={'Mary': ['Maria', 'Maria2']})
>>> chat.users
['+1 123 456 789', 'Giuseppe', 'John', 'Mary']
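The shape of mapping (new name as key, list of old names as value) can be easier to see when applied to a plain list of usernames. The sketch below is not the library code; invert_mapping and rename are hypothetical helpers, and raising ValueError for a non-list value is an assumption about what "mapping is not correct" means.

```python
from typing import Dict, List


def invert_mapping(mapping: Dict[str, List[str]]) -> Dict[str, str]:
    """Turn {new: [old, ...]} into {old: new}, validating the shape."""
    lookup: Dict[str, str] = {}
    for new_name, old_names in mapping.items():
        if not isinstance(old_names, list):
            # Assumption: old names must be a list, even if there is only one.
            raise ValueError(f"old names for {new_name!r} must be given as a list")
        for old_name in old_names:
            lookup[old_name] = new_name
    return lookup


def rename(usernames: List[str], mapping: Dict[str, List[str]]) -> List[str]:
    """Replace every old name by its new name; other names are kept."""
    lookup = invert_mapping(mapping)
    return [lookup.get(name, name) for name in usernames]


renamed = rename(["Maria", "John", "Maria2"], {"Mary": ["Maria", "Maria2"]})
```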
- property start_date: Union[str, datetime]
Chat starting date.
- Returns
datetime
- to_csv(filepath: str) → None
Save chat as csv.
- Parameters
filepath (str) – Name of file.
- to_txt(filepath: str, hformat: Optional[str] = None, encoding: str = 'utf8') → None
Export chat to a text file.
Useful to export the chat to different formats (i.e. using different hformats).
- Parameters
filepath (str) – Name of the file to export (must be a local path).
hformat (str, optional) – Header format. Defaults to ‘%y-%m-%d, %H:%M - %name:’.
encoding (str, optional) –
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.
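Exporting with a given hformat amounts to filling the header keywords with the fields of each message. The sketch below shows that substitution for a single message; render_header is a hypothetical helper, and rendering '%y' as a 4-digit year with zero-padded fields is an assumption, not confirmed behaviour of to_txt.

```python
import re
from datetime import datetime


def render_header(hformat: str, date: datetime, username: str) -> str:
    """Fill the hformat keywords with the fields of one message."""
    values = {
        "%y": f"{date.year:04d}",  # assumption: 4-digit year
        "%m": f"{date.month:02d}",
        "%d": f"{date.day:02d}",
        "%H": f"{date.hour:02d}",
        "%M": f"{date.minute:02d}",
        "%S": f"{date.second:02d}",
        "%name": username,
    }
    # A single pass avoids re-substituting keywords inside the username.
    return re.sub(r"%name|%[ymdHMS]", lambda m: values[m.group(0)], hformat)


header = render_header("%y-%m-%d, %H:%M - %name:", datetime(2016, 8, 6, 13, 23), "Ash Ketchum")
# header == '2016-08-06, 13:23 - Ash Ketchum:'
```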
- to_zip(filepath: str, hformat: Optional[str] = None, encoding: str = 'utf8') → None
Export chat to a zip file.
Useful to export the chat to different formats (i.e. using different hformats).
- Parameters
filepath (str) – Name of the file to export (must be a local path).
hformat (str, optional) – Header format. Defaults to ‘%y-%m-%d, %H:%M - %name:’.
encoding (str, optional) –
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.
- property users: List[str]
List with users.
- Returns
list
whatstk.whatsapp.parser
Parser utils.
Functions:
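df_from_txt_whatsapp(filepath, **kwargs) – Alias for df_from_whatsapp.
df_from_whatsapp(filepath[, auto_header, ...]) – Load chat as a DataFrame.
generate_regex(hformat) – Generate regular expression from hformat.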
- whatstk.whatsapp.parser.df_from_txt_whatsapp(filepath: str, **kwargs: Any) → DataFrame
Alias for df_from_whatsapp.
- whatstk.whatsapp.parser.df_from_whatsapp(filepath: str, auto_header: bool = True, hformat: Optional[str] = None, encoding: str = 'utf-8', message_type: Optional[bool] = None) → DataFrame
Load chat as a DataFrame.
- Parameters
filepath (str) –
Path to the file. Accepted sources are:
Local file, e.g. ‘path/to/file.txt’ OR ‘path/to/_chat.zip’ (e.g. iOS export).
URL to a remote hosted file, e.g. ‘http://www.url.to/file.txt’.
Link to Google Drive file, e.g. ‘gdrive://35gKKrNk-i3t05zPLyH4_P1rPdOmKW9NZ’. The format is expected to be ‘gdrive://[FILE-ID]’. Note that in order to load a file from Google Drive you first need to run gdrive_init.
auto_header (bool, optional) – Detect header automatically. If False, hformat is required.
hformat (str, optional) –
Format of the header, e.g. '[%y-%m-%d %H:%M:%S] - %name:'. Use the following keywords:
'%y': for year ('%Y' is equivalent).
'%m': for month.
'%d': for day.
'%H': for 24h-hour.
'%I': for 12h-hour.
'%M': for minutes.
'%S': for seconds.
'%P': for “PM”/“AM” or “p.m.”/“a.m.” characters.
'%name': for the username.
Example 1: For the header ‘12/08/2016, 16:20 - username:’ we have hformat='%d/%m/%y, %H:%M - %name:'.
Example 2: For the header ‘2016-08-12, 4:20 PM - username:’ we have hformat='%y-%m-%d, %I:%M %P - %name:'.
encoding (str, optional) –
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.
message_type (bool, optional) – If True, adds a message_type column labelling each message as ‘user’ or ‘system’, based on who sent the message.
- Returns
pandas.DataFrame – DataFrame with the loaded and parsed chat.
Example
Read a chat
>>> from whatstk import df_from_whatsapp
>>> from whatstk.data import whatsapp_urls
>>> df = df_from_whatsapp(filepath=whatsapp_urls.LOREM)
>>> df.head(5)
                 date        username                                            message  message_type
0 2020-01-15 02:22:56            Mary                     Nostrud exercitation magna id.        system
1 2020-01-15 03:33:01            Mary     Non elit irure irure pariatur exercitation. 🇩🇰          user
2 2020-01-15 04:18:42  +1 123 456 789  Exercitation esse lorem reprehenderit ut ex ve...          user
3 2020-01-15 06:05:14        Giuseppe  Aliquip dolor reprehenderit voluptate dolore e...          user
4 2020-01-15 06:56:00            Mary              Ullamco duis et commodo exercitation.          user
Read a chat, labelling each message as ‘user’ or ‘system’. ‘system’ messages are those sent by the chat itself (creation of chat, etc.)
>>> from whatstk import df_from_whatsapp
>>> from whatstk.data import whatsapp_urls
>>> df = df_from_whatsapp(filepath=whatsapp_urls.POKEMON, message_type=True)
>>> df.head()
                 date      username                                            message  message_type
0 2016-04-15 15:04:00  Pokemon Chat  Messages and calls are end-to-end encrypted. N...        system
1 2016-08-06 13:23:00   Ash Ketchum                                          Hey guys!          user
2 2016-08-06 13:25:00         Brock              Hey Ash, good to have a common group!          user
3 2016-08-06 13:30:00         Misty  Hey guys! Long time since heard anything from you          user
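Conceptually, loading a chat means finding each header and splitting the text into date, username and message. The sketch below does this with the standard library for the header of Example 1 ('%d/%m/%y, %H:%M - %name:'). The hand-written regular expression, the parse_chat helper and the handling of continuation lines are assumptions made for illustration, not the parser's actual code.

```python
import re
from datetime import datetime
from typing import Any, Dict, List

# Hand-written regex for headers such as '12/08/2016, 16:20 - username:'.
HEADER = re.compile(
    r"(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4}), "
    r"(?P<hour>\d{1,2}):(?P<minutes>\d{2}) - (?P<username>[^:]*): "
)


def parse_chat(text: str) -> List[Dict[str, Any]]:
    """Split raw chat text into one record per message."""
    records: List[Dict[str, Any]] = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            parts = {k: int(v) for k, v in match.groupdict().items() if k != "username"}
            records.append({
                "date": datetime(parts["year"], parts["month"], parts["day"],
                                 parts["hour"], parts["minutes"]),
                "username": match.group("username"),
                "message": line[match.end():],
            })
        elif records:
            # Assumption: a line without header continues the previous message.
            records[-1]["message"] += "\n" + line
    return records


sample = (
    "12/08/2016, 16:20 - Ash Ketchum: Hey guys!\n"
    "12/08/2016, 16:25 - Brock: Hey Ash,\n"
    "good to see you"
)
records = parse_chat(sample)
```

Each record corresponds to one row of the resulting DataFrame, with the same date, username and message columns.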
- whatstk.whatsapp.parser.generate_regex(hformat: str) → Tuple[str, str]
Generate regular expression from hformat.
- Parameters
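hformat (str) – Simplified syntax for the header, e.g. '%y-%m-%d, %H:%M:%S - %name:'.
- Returns
tuple – Two regular expressions (str) corresponding to the specified syntax: the first matches the full header, the second matches the header up to, but excluding, the username.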
Example
Generate regular expression corresponding to hformat='%y-%m-%d, %H:%M:%S - %name:'.
>>> from whatstk.whatsapp.parser import generate_regex
>>> generate_regex('%y-%m-%d, %H:%M:%S - %name:')
('(?P<year>\\d{2,4})-(?P<month>\\d{1,2})-(?P<day>\\d{1,2}), (?P<hour>\\d{1,2}):(?P<minutes>\\d{2}):(?P<seconds>\\d{2}) - (?P<username>[^:]*): ', '(?P<year>\\d{2,4})-(?P<month>\\d{1,2})-(?P<day>\\d{1,2}), (?P<hour>\\d{1,2}):(?P<minutes>\\d{2}):(?P<seconds>\\d{2}) - ')
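The translation from hformat keywords to named groups can be sketched as follows. The group patterns are copied from the output shown above; hformat_to_regex itself is a hypothetical re-implementation for illustration (it omits '%I' and '%P' and returns a single pattern), not the library's generate_regex.

```python
import re

# Named groups as they appear in the documented generate_regex output.
KEYWORD_PATTERNS = {
    "%y": r"(?P<year>\d{2,4})",
    "%m": r"(?P<month>\d{1,2})",
    "%d": r"(?P<day>\d{1,2})",
    "%H": r"(?P<hour>\d{1,2})",
    "%M": r"(?P<minutes>\d{2})",
    "%S": r"(?P<seconds>\d{2})",
    "%name": r"(?P<username>[^:]*)",
}


def hformat_to_regex(hformat: str) -> str:
    """Translate a header format into a regular expression (sketch)."""
    # Split into keywords and literal text; the capture group keeps the keywords.
    pieces = re.split(r"(%name|%[ymdHMS])", hformat)
    # Keywords become named groups, literal text is escaped.
    return "".join(KEYWORD_PATTERNS.get(piece, re.escape(piece)) for piece in pieces)


pattern = re.compile(hformat_to_regex("%y-%m-%d, %H:%M:%S - %name:"))
match = pattern.match("2016-08-12, 16:20:00 - Ash Ketchum: Hey guys!")
```

The named groups then give direct access to each field, e.g. match.group('username').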
whatstk.whatsapp.auto_header
Detect header from chat.
Functions:
extract_header_from_text(text[, encoding]) – Extract header from text.
- whatstk.whatsapp.auto_header.extract_header_from_text(text: str, encoding: str = 'utf-8') → Optional[str]
Extract header from text.
- Parameters
text (str) – Loaded chat as string (whole text).
encoding (str) –
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.
- Returns
str – Format extracted. None if no header was extracted.
Example
Extract the header format from a sample chat (available online, see URLs in the source code of whatstk.data).
>>> from whatstk.whatsapp.parser import extract_header_from_text
>>> from urllib.request import urlopen
>>> from whatstk.data import whatsapp_urls
>>> filepath_1 = whatsapp_urls.POKEMON
>>> with urlopen(filepath_1) as f:
...     text = f.read().decode('utf-8')
>>> extract_header_from_text(text)
'%d.%m.%y, %H:%M - %name:'
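One simple way to think about automatic header detection is to try a set of candidate formats and keep the one that matches the most lines. The sketch below follows that heuristic with two hand-written candidates. It is an assumption made for illustration and not the algorithm used by extract_header_from_text; guess_hformat and CANDIDATES are hypothetical names.

```python
import re
from typing import Dict, Optional

# Hand-written regexes for two candidate header formats (illustrative subset).
CANDIDATES: Dict[str, str] = {
    "%d.%m.%y, %H:%M - %name:": r"\d{1,2}\.\d{1,2}\.\d{2,4}, \d{1,2}:\d{2} - [^:]*: ",
    "%y-%m-%d, %H:%M - %name:": r"\d{2,4}-\d{1,2}-\d{1,2}, \d{1,2}:\d{2} - [^:]*: ",
}


def guess_hformat(text: str) -> Optional[str]:
    """Return the candidate format matching the most lines, or None."""
    lines = text.splitlines()
    best_format, best_hits = None, 0
    for hformat, regex in CANDIDATES.items():
        hits = sum(1 for line in lines if re.match(regex, line))
        if hits > best_hits:
            best_format, best_hits = hformat, hits
    return best_format


sample = "06.08.16, 13:23 - Ash Ketchum: Hey guys!\n06.08.16, 13:25 - Brock: Hey Ash"
guessed = guess_hformat(sample)
```

As with extract_header_from_text, the sketch returns None when no candidate matches any line.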
whatstk.whatsapp.generation
Automatic generation of chat using Lorem Ipsum text and time series statistics.
Classes:
ChatGenerator(size[, users, seed]) – Generate a chat.
Functions:
generate_chats_hformats(output_path[, size, ...]) – Generate a chat and export using given header format.
- class whatstk.whatsapp.generation.ChatGenerator(size: int, users: Optional[List[str]] = None, seed: int = 100)
Bases: object
Generate a chat.
- Parameters
size (int) – Number of messages to generate.
users (list, optional) – List with names of the users. Defaults to module variable USERS.
seed (int, optional) – Seed for random processes. Defaults to 100.
Examples
This simple example generates a chat using ChatGenerator. Once generated, we can access its attribute df, which contains the chat as a DataFrame.
>>> from whatstk.whatsapp.generation import ChatGenerator
>>> from datetime import datetime
>>> from whatstk.data import whatsapp_urls
>>> chat = ChatGenerator(size=10).generate(last_timestamp=datetime(2020, 1, 1, 0, 0))
>>> chat.df.head(5)
                        date  username                                            message
0 2019-12-31 09:43:04.000525  Giuseppe                               Nisi ad esse cillum.
1 2019-12-31 10:19:21.980039  Giuseppe      Tempor dolore sint in eu lorem veniam veniam.
2 2019-12-31 13:56:45.575426  Giuseppe  Do quis fugiat sint ut ut, do anim eu est qui ...
3 2019-12-31 15:47:29.995420  Giuseppe  Do qui qui elit ea in sed culpa, aliqua magna ...
4 2019-12-31 16:23:00.348542      Mary  Sunt excepteur mollit voluptate dolor sint occ...
Methods:
generate([filepath, hformat, last_timestamp]) – Generate random chat as WhatsAppChat.
- generate(filepath: Optional[str] = None, hformat: Optional[str] = None, last_timestamp: Optional[datetime] = None) → str
Generate random chat as WhatsAppChat.
- Parameters
filepath (str) – If given, generated chat is saved with name filepath (must be a local path).
hformat (str, optional) – Format of the header, e.g. '[%y-%m-%d %H:%M:%S] - %name:'.
last_timestamp (datetime, optional) – Datetime of last message. If None, defaults to current date.
- Returns
WhatsAppChat – Chat with random messages.
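The roles of size, users, seed and last_timestamp can be illustrated with a small standard-library sketch that produces reproducible random messages ending at a given timestamp. The word list, default users, gap distribution and the generate_messages helper are all assumptions for illustration; ChatGenerator relies on its own Lorem Ipsum text and time series statistics.

```python
import random
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]


def generate_messages(
    size: int,
    users: Optional[List[str]] = None,
    seed: int = 100,
    last_timestamp: Optional[datetime] = None,
) -> List[Tuple[datetime, str, str]]:
    """Generate `size` random (date, username, message) tuples (sketch)."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    users = users or ["John", "Mary", "Giuseppe"]  # assumed default users
    timestamp = last_timestamp or datetime.now()
    messages = []
    for _ in range(size):
        n_words = rng.randint(2, 8)
        text = " ".join(rng.choice(WORDS) for _ in range(n_words)).capitalize() + "."
        messages.append((timestamp, rng.choice(users), text))
        # Walk backwards in time by a random gap of up to four hours.
        timestamp -= timedelta(minutes=rng.randint(1, 240))
    messages.reverse()  # oldest message first
    return messages


chat = generate_messages(size=10, last_timestamp=datetime(2020, 1, 1, 0, 0))
```

Calling the function twice with the same seed and last_timestamp gives identical messages, which is the point of exposing a seed argument.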
- whatstk.whatsapp.generation.generate_chats_hformats(output_path: str, size: int = 2000, hformats: Optional[str] = None, filepaths: Optional[str] = None, last_timestamp: Optional[datetime] = None, seed: int = 100, verbose: bool = False, export_as_zip: bool = False) → None
Generate a chat and export using given header format.
If no hformat specified, chat is generated & exported using all supported header formats.
- Parameters
output_path (str) – Path to directory to export all generated chats as txt.
size (int, optional) – Number of messages of the chat. Defaults to 2000.
hformats (list, optional) – List of header formats to use when exporting chat. If None, defaults to all supported header formats.
filepaths (list, optional) – List with filepaths (only txt files). If None, defaults to whatstk.utils.utils._map_hformat_filename(filepath).
last_timestamp (datetime, optional) – Datetime of last message. If None, defaults to current date.
seed (int, optional) – Seed for random processes. Defaults to 100.
verbose (bool) – Set to True to print runtime messages.
export_as_zip (bool) – Set to True to export the chat(s) zipped, additionally.
whatstk.whatsapp.hformat
Header format utils.
Example: Check if a header format is supported.
>>> from whatstk.whatsapp.hformat import is_supported
>>> is_supported('%y-%m-%d, %H:%M:%S - %name:')
(True, True)
Functions:
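get_supported_hformats_as_dict([encoding]) – Get dictionary with supported formats and relevant info.
get_supported_hformats_as_list([encoding]) – Get list of supported formats.
is_supported(hformat[, encoding]) – Check if header hformat is currently supported.
is_supported_verbose(hformat) – Check if header hformat is currently supported (both manually and using auto_header).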
- whatstk.whatsapp.hformat.get_supported_hformats_as_dict(encoding: str = 'utf8') → Dict[str, int]
Get dictionary with supported formats and relevant info.
- Parameters
encoding (str, optional) –
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.
- Returns
dict – Dict with two elements:
format: Header format. All formats appearing are supported.
auto_header: 1 if auto_header is supported, 0 otherwise.
- whatstk.whatsapp.hformat.get_supported_hformats_as_list(encoding: str = 'utf8') → List[str]
Get list of supported formats.
- Parameters
encoding (str, optional) – Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
- Returns
list – List with supported formats (as str).
- whatstk.whatsapp.hformat.is_supported(hformat: str, encoding: str = 'utf8') → Tuple[bool, bool]
Check if header hformat is currently supported.
- Parameters
hformat (str) – Header format.
encoding (str, optional) –
Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.
- Returns
tuple –
bool: True if header is supported.
bool: True if header is supported with auto_header feature.
- whatstk.whatsapp.hformat.is_supported_verbose(hformat: str) → str
Check if header hformat is currently supported (both manually and using auto_header).
Result is shown as a string.
- Parameters
hformat (str) – Header format.
- Returns
str – Information message.
Example
Check if format '%y-%m-%d, %H:%M - %name:' is supported.
>>> from whatstk.whatsapp.hformat import is_supported_verbose
>>> is_supported_verbose('%y-%m-%d, %H:%M - %name:')
"The header '%y-%m-%d, %H:%M - %name:' is supported. `auto_header` for this header is supported."