Commit 52ee7522 authored by Himanshu Dabas, committed by GitHub

fix(profile): ported user profile to v2 API endpoint (#955)

* fix for deprecation of v1.1 endpoints

* fix for cashtags

* typo

* fix(datetime): _formatDateTime tries %d-%m-%y

* fix(pandas): use new str-format Tweet.datetime data rep

* fix(pandas datetime): use ms

* fix(cashtags unwind): undo PRs field removals

* Revert "fix(cashtags unwind): undo PRs field removals"

This reverts commit dfa57c20186a969aa2bf010fbe198f5e0bbbbd01.

* fix(pandas): remove broken fields

* fix(cash): use provided field as suggested by pr review

* fix(cashtags): re-enable cashtags in output

* fix(db): remove broken fields

* fix(datetime): Y-m-d and factored out

* fixes #947

* fix(get.py): json exception in User

* to-do: added to-do tasks

added to-do tasks for --profile-full feature

* chore(test): PEP8 formatting

* fix(profile): ported user profile to v2 API

fixed the user profile feature, which had been broken since the v1.1 endpoints were deprecated

* updated Readme

* fix: fixes #965 inconsistent timezones

* fix: handle tombstone tweets

Tombstone tweets are tweets that Twitter has flagged as inappropriate, misleading, graphic, etc.

* fixes #976: saving tweets to csv

This patch fixes the issue introduced by #967, which broke saving the retrieved data to a CSV file.

* feature: port Lookup to v2 endpoint

Fixes #970: Lookup is ported to the v2 endpoint and can now be used to look up a given profile.
Co-authored-by: SiegfriedWagner <mateus.chojnowski@gmail.com>
Co-authored-by: lmeyerov <leo@graphistry.com>
parent ae5e7e11
@@ -111,3 +111,5 @@ ENV/
*.csv
*.json
*.txt
test_twint.py
@@ -28,6 +28,7 @@ Twitter limits scrolls while browsing the user timeline. This means that with `.
- aiodns;
- beautifulsoup4;
- cchardet;
- dataclasses;
- elasticsearch;
- pysocks;
- pandas (>=0.23.0);
@@ -65,7 +66,7 @@ pipenv install git+https://github.com/twintproject/twint.git#egg=twint
## CLI Basic Examples and Combos
A few simple examples to help you understand the basics:
- `twint -u username` - Scrape all the Tweets from *user*'s timeline.
- `twint -u username` - Scrape all the Tweets of a *user* (doesn't include **retweets** but includes **replies**).
- `twint -u username -s pineapple` - Scrape all Tweets from the *user*'s timeline containing _pineapple_.
- `twint -s pineapple` - Collect every Tweet containing *pineapple* from everyone's Tweets.
- `twint -u username --year 2014` - Collect Tweets that were tweeted **before** 2014.
@@ -83,7 +84,7 @@ A few simple examples to help you understand the basics:
- `twint -u username --following` - Scrape who a Twitter user follows.
- `twint -u username --favorites` - Collect all the Tweets a user has favorited (gathers ~3200 Tweets).
- `twint -u username --following --user-full` - Collect full user information of the accounts a user follows.
- `twint -u username --profile-full` - Use a slow, but effective method to gather Tweets from a user's profile (Gathers ~3200 Tweets, Including Retweets).
- `twint -u username --timeline` - Use an effective method to gather Tweets from a user's profile (Gathers ~3200 Tweets, including **retweets** & **replies**).
- `twint -u username --retweets` - Use a quick method to gather the last 900 Tweets (that includes retweets) from a user's profile.
- `twint -u username --resume resume_file.txt` - Resume a search starting from the last saved scroll-id.
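For module users, the new `--timeline` flag corresponds to `twint.run.Profile`. A minimal sketch based on the test script in this commit; the username and limit are illustrative:

```python
import twint

# Module-level equivalent of `twint -u username --timeline`,
# following the usage in this commit's test script.
c = twint.Config()
c.Username = "verified"          # illustrative: any public username
c.Limit = 20                     # stop after roughly 20 Tweets
c.Store_object = True
c.Store_object_tweets_list = []  # output.checkData() appends Tweet objects here

twint.run.Profile(c)             # --timeline maps to run.Profile

for t in c.Store_object_tweets_list:
    print(t.datetime, t.username, t.tweet)
```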
@@ -2,7 +2,6 @@
from setuptools import setup
import io
import os
import sys
# Package meta-data
NAME = 'twint'
@@ -15,52 +14,52 @@ VERSION = None
# Packages required
REQUIRED = [
'aiohttp', 'aiodns', 'beautifulsoup4', 'cchardet',
'elasticsearch', 'pysocks', 'pandas', 'aiohttp_socks',
'schedule', 'geopy', 'fake-useragent', 'googletransx'
]
'aiohttp', 'aiodns', 'beautifulsoup4', 'cchardet', 'dataclasses',
'elasticsearch', 'pysocks', 'pandas', 'aiohttp_socks',
'schedule', 'geopy', 'fake-useragent', 'googletransx'
]
here = os.path.abspath(os.path.dirname(__file__))
with io.open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
long_description = '\n' + f.read()
long_description = '\n' + f.read()
# Load the package's __version__.py
about = {}
if not VERSION:
with open(os.path.join(here, NAME, '__version__.py')) as f:
exec(f.read(), about)
with open(os.path.join(here, NAME, '__version__.py')) as f:
exec(f.read(), about)
else:
about['__version__'] = VERSION
about['__version__'] = VERSION
setup(
name=NAME,
version=about['__version__'],
description=DESCRIPTION,
long_description=long_description,
long_description_content_type="text/markdown",
author=AUTHOR,
author_email=EMAIL,
python_requires=REQUIRES_PYTHON,
url=URL,
packages=['twint', 'twint.storage'],
entry_points={
'console_scripts':[
'twint = twint.cli:run_as_command',
],
},
install_requires=REQUIRED,
dependency_links=[
'git+https://github.com/x0rzkov/py-googletrans#egg=googletrans'
],
license='MIT',
classifiers=[
'License :: OSI Approved :: MIT License',
'Programming Language :: Python',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
],
name=NAME,
version=about['__version__'],
description=DESCRIPTION,
long_description=long_description,
long_description_content_type="text/markdown",
author=AUTHOR,
author_email=EMAIL,
python_requires=REQUIRES_PYTHON,
url=URL,
packages=['twint', 'twint.storage'],
entry_points={
'console_scripts': [
'twint = twint.cli:run_as_command',
],
},
install_requires=REQUIRED,
dependency_links=[
'git+https://github.com/x0rzkov/py-googletrans#egg=googletrans'
],
license='MIT',
classifiers=[
'License :: OSI Approved :: MIT License',
'Programming Language :: Python',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
],
)
@@ -5,21 +5,25 @@ import os
Test.py - Testing TWINT to make sure everything works.
'''
def test_reg(c, run):
print("[+] Beginning vanilla test in {}".format(str(run)))
run(c)
def test_db(c, run):
print("[+] Beginning DB test in {}".format(str(run)))
c.Database = "test_twint.db"
run(c)
def custom(c, run, _type):
print("[+] Beginning custom {} test in {}".format(_type, str(run)))
c.Custom['tweet'] = ["id", "username"]
c.Custom['user'] = ["id", "username"]
run(c)
def test_json(c, run):
c.Store_json = True
c.Output = "test_twint.json"
@@ -27,6 +31,7 @@ def test_json(c, run):
print("[+] Beginning JSON test in {}".format(str(run)))
run(c)
def test_csv(c, run):
c.Store_csv = True
c.Output = "test_twint.csv"
@@ -34,52 +39,54 @@ def test_csv(c, run):
print("[+] Beginning CSV test in {}".format(str(run)))
run(c)
def main():
c = twint.Config()
c.Username = "verified"
c.Limit = 20
c.Store_object = True
# Seperate objects are neccessary.
# Separate objects are necessary.
f = twint.Config()
f.Username = "verified"
f.Limit = 20
f.Store_object = True
f.User_full = True
runs = [twint.run.Following,
twint.run.Followers,
twint.run.Search,
twint.run.Profile,
twint.run.Favorites
]
runs = [
twint.run.Profile, # this doesn't work
twint.run.Search, # this works
twint.run.Following,
twint.run.Followers,
twint.run.Favorites,
]
tests = [test_reg, test_json, test_csv, test_db]
# Something breaks if we don't split these up
for run in runs[:2]:
for test in tests:
test(f, run)
# Something breaks if we don't split these up
for run in runs[2:]:
for run in runs[:3]:
if run == twint.run.Search:
c.Since = "2012-1-1 20:30:22"
c.Until = "2017-1-1"
else:
c.Since = ""
c.Until = ""
for test in tests:
test(c, run)
for run in runs[3:]:
for test in tests:
test(f, run)
files = ["test_twint.db", "test_twint.json", "test_twint.csv"]
for _file in files:
os.remove(_file)
os.remove(_file)
print("[+] Testing complete!")
if __name__ == '__main__':
main()
@@ -16,12 +16,14 @@ from . import run
from . import config
from . import storage
def error(_error, message):
""" Print errors to stdout
"""
print("[-] {}: {}".format(_error, message))
sys.exit(0)
def check(args):
""" Error checking
"""
@@ -34,7 +36,12 @@ def check(args):
"--userid and -u cannot be used together.")
if args.all:
error("Contradicting Args",
"--all and -u cannot be used together")
"--all and -u cannot be used together.")
elif args.search and args.timeline:
error("Contradicting Args",
"--s and --tl cannot be used together.")
elif args.timeline and not args.username:
error("Error", "-tl cannot be used without -u.")
elif args.search is None:
if args.custom_query is not None:
pass
@@ -53,6 +60,7 @@
if args.min_wait_time < 0:
error("Error", "Please specifiy a non negative value for min_wait_time")
def loadUserList(ul, _type):
""" Concatenate users
"""
@@ -67,6 +75,7 @@ def loadUserList(ul, _type):
return un[15:]
return userlist
def initialize(args):
""" Set default values for config from args
"""
@@ -100,7 +109,7 @@
c.Essid = args.essid
c.Format = args.format
c.User_full = args.user_full
c.Profile_full = args.profile_full
# c.Profile_full = args.profile_full
c.Pandas_type = args.pandas_type
c.Index_tweets = args.index_tweets
c.Index_follow = args.index_follow
@@ -119,7 +128,7 @@
c.Tor_control_password = args.tor_control_password
c.Retweets = args.retweets
c.Custom_query = args.custom_query
c.Popular_tweets = args.popular_tweets
c.Popular_tweets = args.popular_tweets
c.Skip_certs = args.skip_certs
c.Hide_output = args.hide_output
c.Native_retweets = args.native_retweets
@@ -136,6 +145,7 @@
c.Min_wait_time = args.min_wait_time
return c
def options():
""" Parse arguments
"""
@@ -180,7 +190,9 @@
ap.add_argument("--proxy-host", help="Proxy hostname or IP.")
ap.add_argument("--proxy-port", help="The port of the proxy server.")
ap.add_argument("--tor-control-port", help="If proxy-host is set to tor, this is the control port", default=9051)
ap.add_argument("--tor-control-password", help="If proxy-host is set to tor, this is the password for the control port", default="my_password")
ap.add_argument("--tor-control-password",
help="If proxy-host is set to tor, this is the password for the control port",
default="my_password")
ap.add_argument("--essid",
help="Elasticsearch Session ID, use this to differentiate scraping sessions.",
nargs="?", default="")
@@ -192,9 +204,16 @@
ap.add_argument("--user-full",
help="Collect all user information (Use with followers or following only).",
action="store_true")
ap.add_argument("--profile-full",
help="Slow, but effective method of collecting a user's Tweets and RT.",
action="store_true")
# Removing this feature for the time being, because it is no longer required; the default method will do this
# ap.add_argument("--profile-full",
# help="Slow, but effective method of collecting a user's Tweets and RT.",
# action="store_true")
ap.add_argument(
"-tl",
"--timeline",
help="Collects every tweet from a User's Timeline. (Tweets, RTs & Replies)",
action="store_true",
)
ap.add_argument("--translate",
help="Get tweets translated by Google Translate.",
action="store_true")
@@ -221,24 +240,28 @@
ap.add_argument("-pc", "--pandas-clean",
help="Automatically clean Pandas dataframe at every scrape.")
ap.add_argument("-cq", "--custom-query", help="Custom search query.")
ap.add_argument("-pt", "--popular-tweets", help="Scrape popular tweets instead of recent ones.", action="store_true")
ap.add_argument("-pt", "--popular-tweets", help="Scrape popular tweets instead of recent ones.",
action="store_true")
ap.add_argument("-sc", "--skip-certs", help="Skip certs verification, useful for SSC.", action="store_false")
ap.add_argument("-ho", "--hide-output", help="Hide output, no tweets will be displayed.", action="store_true")
ap.add_argument("-nr", "--native-retweets", help="Filter the results for retweets only.", action="store_true")
ap.add_argument("--min-likes", help="Filter the tweets by minimum number of likes.")
ap.add_argument("--min-retweets", help="Filter the tweets by minimum number of retweets.")
ap.add_argument("--min-replies", help="Filter the tweets by minimum number of replies.")
ap.add_argument("--links", help="Include or exclude tweets containing one o more links. If not specified"+
" you will get both tweets that might contain links or not.")
ap.add_argument("--links", help="Include or exclude tweets containing one o more links. If not specified" +
" you will get both tweets that might contain links or not.")
ap.add_argument("--source", help="Filter the tweets for specific source client.")
ap.add_argument("--members-list", help="Filter the tweets sent by users in a given list.")
ap.add_argument("-fr", "--filter-retweets", help="Exclude retweets from the results.", action="store_true")
ap.add_argument("--backoff-exponent", help="Specify a exponent for the polynomial backoff in case of errors.", type=float, default=3.0)
ap.add_argument("--min-wait-time", type=float, default=15, help="specifiy a minimum wait time in case of scraping limit error. This value will be adjusted by twint if the value provided does not satisfy the limits constraints")
ap.add_argument("--backoff-exponent", help="Specify a exponent for the polynomial backoff in case of errors.",
type=float, default=3.0)
ap.add_argument("--min-wait-time", type=float, default=15,
help="specifiy a minimum wait time in case of scraping limit error. This value will be adjusted by twint if the value provided does not satisfy the limits constraints")
args = ap.parse_args()
return args
def main():
""" Main
"""
@@ -283,7 +306,7 @@
run.Followers(c)
else:
run.Followers(c)
elif args.retweets or args.profile_full:
elif args.retweets: # or args.profile_full:
if args.userlist:
_userlist = loadUserList(args.userlist, "profile")
for _user in _userlist:
@@ -301,9 +324,12 @@
run.Lookup(c)
else:
run.Lookup(c)
elif args.timeline:
run.Profile(c)
else:
run.Search(c)
def run_as_command():
version = ".".join(str(v) for v in sys.version_info[:2])
if float(version) < 3.6:
@@ -312,5 +338,6 @@
main()
if __name__ == '__main__':
main()
@@ -6,6 +6,7 @@ class Config:
Username: Optional[str] = None
User_id: Optional[str] = None
Search: Optional[str] = None
Lookup: bool = False
Geo: str = ""
Location: bool = False
Near: str = None
@@ -38,7 +39,7 @@
Favorites: bool = False
TwitterSearch: bool = False
User_full: bool = False
Profile_full: bool = False
# Profile_full: bool = False
Store_object: bool = False
Store_object_tweets_list: list = None
Store_object_users_list: list = None
@@ -83,3 +84,4 @@ class Config:
Min_wait_time: int = 0
Bearer_token: str = None
Guest_token: str = None
deleted: list = None
@@ -2,10 +2,12 @@ import datetime
import logging as logme
from .tweet import utc_to_local
class Datelock:
_until = None
_since = None
until = None
since = None
_since_def_user = None
@@ -25,15 +27,18 @@ def Set(Until, Since):
d = Datelock()
if Until:
d._until = datetime.datetime.strptime(convertToDateTime(Until), "%Y-%m-%d %H:%M:%S")
d.until = datetime.datetime.strptime(convertToDateTime(Until), "%Y-%m-%d %H:%M:%S")
d.until = utc_to_local(d.until)
else:
d._until = datetime.datetime.today()
d.until = datetime.datetime.today()
if Since:
d._since = datetime.datetime.strptime(convertToDateTime(Since), "%Y-%m-%d %H:%M:%S")
d.since = datetime.datetime.strptime(convertToDateTime(Since), "%Y-%m-%d %H:%M:%S")
d.since = utc_to_local(d.since)
d._since_def_user = True
else:
d._since = datetime.datetime.strptime("2006-03-21 00:00:00", "%Y-%m-%d %H:%M:%S")
d.since = datetime.datetime.strptime("2006-03-21 00:00:00", "%Y-%m-%d %H:%M:%S")
d.since = utc_to_local(d.since)
d._since_def_user = False
return d
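The localized `since`/`until` values above come from `utc_to_local` in tweet.py, which re-tags a naive UTC datetime and renders it in the machine's local zone. A small sketch of the conversion:

```python
from datetime import datetime, timezone

def utc_to_local(utc_dt):
    # Same conversion as tweet.py: mark the naive datetime as UTC,
    # then convert it to the local timezone.
    return utc_dt.replace(tzinfo=timezone.utc).astimezone(tz=None)

since = datetime.strptime("2012-01-01 20:30:22", "%Y-%m-%d %H:%M:%S")
print(utc_to_local(since))  # e.g. 2012-01-01 21:30:22+01:00 on a CET machine
```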
import time
from datetime import datetime
from bs4 import BeautifulSoup
from re import findall
from json import loads
import logging as logme
from .tweet import utc_to_local, Tweet_formats
class NoMoreTweetsException(Exception):
def __init__(self, msg):
@@ -23,6 +28,7 @@ def Follow(response):
return follow, cursor
# TODO: this won't be used by --profile-full anymore. If it isn't used anywhere else, perhaps remove it in the future
def Mobile(response):
logme.debug(__name__ + ':Mobile')
soup = BeautifulSoup(response, "html.parser")
@@ -48,14 +54,15 @@ def MobileFav(response):
return tweets, max_id
def profile(response):
logme.debug(__name__ + ':profile')
json_response = loads(response)
html = json_response["items_html"]
soup = BeautifulSoup(html, "html.parser")
feed = soup.find_all("div", "tweet")
return feed, feed[-1]["data-item-id"]
def _get_cursor(response):
try:
next_cursor = response['timeline']['instructions'][0]['addEntries']['entries'][-1]['content'][
'operation']['cursor']['value']
except KeyError:
# this is needed because after the first request location of cursor is changed
next_cursor = response['timeline']['instructions'][-1]['replaceEntry']['entry']['content']['operation'][
'cursor']['value']
return next_cursor
def Json(response):
@@ -67,44 +74,49 @@
return feed, json_response["min_position"]
def search_v2(response):
# TODO: need to implement this
pass
def parse_tweets(config, response):
logme.debug(__name__ + ':parse_tweets')
response = loads(response)
if len(response['globalObjects']['tweets']) == 0:
msg = 'No more data. finished scraping!!'
msg = 'No more data!'
raise NoMoreTweetsException(msg)
# need to modify things at the function call end
# timeline = response['timeline']['instructions'][0]['addEntries']['entries']
feed = []
feed_set = set()
# here we need to remove the quoted and `to-reply` tweets from the list as they may or may not contain the
# for _id in response['globalObjects']['tweets']:
# if 'quoted_status_id_str' in response['globalObjects']['tweets'][_id] or \
# response['globalObjects']['tweets'][_id]['in_reply_to_status_id_str']:
# try:
# feed_set.add(response['globalObjects']['tweets'][_id]['quoted_status_id_str'])
# except KeyError:
# feed_set.add(response['globalObjects']['tweets'][_id]['in_reply_to_status_id_str'])
# i = 1
# for _id in response['globalObjects']['tweets']:
# if _id not in feed_set:
# temp_obj = response['globalObjects']['tweets'][_id]
# temp_obj['user_data'] = response['globalObjects']['users'][temp_obj['user_id_str']]
# feed.append(temp_obj)
for timeline_entry in response['timeline']['instructions'][0]['addEntries']['entries']:
# this will handle the cases when the timeline entry is a tweet
if timeline_entry['entryId'].find('sq-I-t-') == 0:
_id = timeline_entry['content']['item']['content']['tweet']['id']
temp_obj = response['globalObjects']['tweets'][_id]
if (config.TwitterSearch or config.Profile) and (timeline_entry['entryId'].startswith('sq-I-t-') or
timeline_entry['entryId'].startswith('tweet-')):
if 'tweet' in timeline_entry['content']['item']['content']:
_id = timeline_entry['content']['item']['content']['tweet']['id']
# skip the ads
if 'promotedMetadata' in timeline_entry['content']['item']['content']['tweet']:
continue
elif 'tombstone' in timeline_entry['content']['item']['content'] and 'tweet' in \
timeline_entry['content']['item']['content']['tombstone']:
_id = timeline_entry['content']['item']['content']['tombstone']['tweet']['id']
else:
_id = None
if _id is None:
raise ValueError('Unable to find ID of tweet in timeline.')
try:
temp_obj = response['globalObjects']['tweets'][_id]
except KeyError:
logme.info('encountered a deleted tweet with id {}'.format(_id))
config.deleted.append(_id)
continue
temp_obj['user_data'] = response['globalObjects']['users'][temp_obj['user_id_str']]
if 'retweeted_status_id_str' in temp_obj:
rt_id = temp_obj['retweeted_status_id_str']
_dt = response['globalObjects']['tweets'][rt_id]['created_at']
_dt = datetime.strptime(_dt, '%a %b %d %H:%M:%S %z %Y')
_dt = utc_to_local(_dt)
_dt = str(_dt.strftime(Tweet_formats['datetime']))
temp_obj['retweet_data'] = {
'user_rt_id': response['globalObjects']['tweets'][rt_id]['user_id_str'],
'user_rt': response['globalObjects']['tweets'][rt_id]['full_text'],
'retweet_id': rt_id,
'retweet_date': _dt,
}
feed.append(temp_obj)
try:
next_cursor = response['timeline']['instructions'][0]['addEntries']['entries'][-1]['content'][
'operation']['cursor']['value']
except KeyError:
# this is needed because after the first request location of cursor is changed
next_cursor = response['timeline']['instructions'][-1]['replaceEntry']['entry']['content']['operation'][
'cursor']['value']
next_cursor = _get_cursor(response)
return feed, next_cursor
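For orientation, `parse_tweets` consumes a payload shaped roughly like the skeleton below, inferred from the keys read above; real adaptive.json responses carry many more fields. Promoted entries are skipped, and a tombstoned tweet keeps its id under a `tombstone` wrapper:

```python
# Illustrative skeleton of the v2 adaptive.json payload -- only the keys
# parse_tweets reads above, not the full schema.
response = {
    "globalObjects": {
        "tweets": {
            "1337": {"user_id_str": "42", "full_text": "...",
                     "reply_count": 0, "retweet_count": 0, "favorite_count": 0},
        },
        "users": {"42": {"screen_name": "someone", "name": "Someone"}},
    },
    "timeline": {"instructions": [{"addEntries": {"entries": [
        # an ordinary tweet entry
        {"entryId": "sq-I-t-1337",
         "content": {"item": {"content": {"tweet": {"id": "1337"}}}}},
        # a tombstoned tweet: the id survives under the tombstone wrapper;
        # since "1338" is absent from globalObjects.tweets, parse_tweets
        # logs it and records it in config.deleted
        {"entryId": "sq-I-t-1338",
         "content": {"item": {"content": {"tombstone": {"tweet": {"id": "1338"}}}}}},
    ]}}]},
}
```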
@@ -105,27 +105,21 @@ def get_connector(config):
return _connector
async def RequestUrl(config, init, headers=[]):
async def RequestUrl(config, init):
logme.debug(__name__ + ':RequestUrl')
_connector = get_connector(config)
_serialQuery = ""
params = []
_url = ""
_headers = {}
_headers = [("authorization", config.Bearer_token), ("x-guest-token", config.Guest_token)]
# TODO : do this later
if config.Profile:
if config.Profile_full:
logme.debug(__name__ + ':RequestUrl:Profile_full')
_url = await url.MobileProfile(config.Username, init)
else:
logme.debug(__name__ + ':RequestUrl:notProfile_full')
_url = await url.Profile(config.Username, init)
_serialQuery = _url
logme.debug(__name__ + ':RequestUrl:Profile')
_url, params, _serialQuery = url.SearchProfile(config, init)
elif config.TwitterSearch:
logme.debug(__name__ + ':RequestUrl:TwitterSearch')
_url, params, _serialQuery = await url.Search(config, init)
_headers = [("authorization", config.Bearer_token), ("x-guest-token", config.Guest_token)]
else:
if config.Following:
logme.debug(__name__ + ':RequestUrl:Following')
@@ -212,21 +206,25 @@ async def Tweet(url, config, conn):
logme.critical(__name__ + ':Tweet:' + str(e))
async def User(username, config, conn, bearer_token, guest_token, user_id=False):
async def User(username, config, conn, user_id=False):
logme.debug(__name__ + ':User')
_dct = {'screen_name': username, 'withHighlightedLabel': False}
_url = 'https://api.twitter.com/graphql/jMaTS-_Ea8vh9rpKggJbCQ/UserByScreenName?variables={}'\
.format(dict_to_url(_dct))
_headers = {
'authorization': bearer_token,
'x-guest-token': guest_token,
'authorization': config.Bearer_token,
'x-guest-token': config.Guest_token,
}
try:
response = await Request(_url, headers=_headers)
j_r = loads(response)
if user_id:
_id = j_r['data']['user']['rest_id']
return _id
try:
_id = j_r['data']['user']['rest_id']
return _id
except KeyError as e:
logme.critical(__name__ + ':User:' + str(e))
return
await Users(j_r, config, conn)
except Exception as e:
logme.critical(__name__ + ':User:' + str(e))
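`dict_to_url` is not shown in this diff; presumably it JSON-encodes and URL-escapes the variables object. A self-contained sketch of the same lookup with that encoding inlined; the helper name and flow here are illustrative, not twint's actual implementation:

```python
import json
from urllib.parse import quote

import aiohttp

async def lookup_user_id(username, bearer_token, guest_token):
    # Standalone sketch of the UserByScreenName call made in get.User;
    # json.dumps + quote stands in for twint's dict_to_url helper.
    variables = json.dumps({'screen_name': username, 'withHighlightedLabel': False})
    _url = ('https://api.twitter.com/graphql/jMaTS-_Ea8vh9rpKggJbCQ/'
            'UserByScreenName?variables=' + quote(variables))
    _headers = {'authorization': bearer_token, 'x-guest-token': guest_token}
    async with aiohttp.ClientSession(headers=_headers) as session:
        async with session.get(_url) as response:
            j_r = await response.json()
    return j_r['data']['user']['rest_id']
```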
@@ -88,13 +88,13 @@ def _output(obj, output, config, **extra):
logme.debug(__name__ + ':_output:Lowercase:tweet')
obj.username = obj.username.lower()
author_list.update({obj.username})
for i in range(len(obj.mentions)):
obj.mentions[i] = obj.mentions[i].lower()
for dct in obj.mentions:
for key, val in dct.items():
dct[key] = val.lower()
for i in range(len(obj.hashtags)):
obj.hashtags[i] = obj.hashtags[i].lower()
# TODO : dont know what cashtags are, <also modify in tweet.py>
# for i in range(len(obj.cashtags)):
# obj.cashtags[i] = obj.cashtags[i].lower()
for i in range(len(obj.cashtags)):
obj.cashtags[i] = obj.cashtags[i].lower()
else:
logme.info('_output:Lowercase:hiddenTweetFound')
print("[x] Hidden tweet found, account suspended due to violation of TOS")
@@ -128,49 +128,40 @@ def _output(obj, output, config, **extra):
async def checkData(tweet, config, conn):
logme.debug(__name__ + ':checkData')
tweet = Tweet(tweet, config)
if not tweet.datestamp:
logme.critical(__name__ + ':checkData:hiddenTweetFound')
print("[x] Hidden tweet found, account suspended due to violation of TOS")
return
if datecheck(tweet.datestamp + " " + tweet.timestamp, config):
output = format.Tweet(config, tweet)
if config.Database:
logme.debug(__name__ + ':checkData:Database')
db.tweets(conn, tweet, config)
if config.Pandas:
logme.debug(__name__ + ':checkData:Pandas')
panda.update(tweet, config)
if config.Store_object:
logme.debug(__name__ + ':checkData:Store_object')
if hasattr(config.Store_object_tweets_list, 'append'):
config.Store_object_tweets_list.append(tweet)
else:
tweets_list.append(tweet)
if config.Elasticsearch:
logme.debug(__name__ + ':checkData:Elasticsearch')
elasticsearch.Tweet(tweet, config)
_output(tweet, output, config)
# else:
# logme.critical(__name__+':checkData:copyrightedTweet')
async def Tweets(tweets, config, conn, url=''):
async def Tweets(tweets, config, conn):
logme.debug(__name__ + ':Tweets')
if config.Favorites or config.Profile_full or config.Location:
if config.Favorites or config.Location:
logme.debug(__name__ + ':Tweets:fav+full+loc')
for tw in tweets:
if tw['data-item-id'] == url.split('?')[0].split('/')[-1]:
await checkData(tw, config, conn)
elif config.TwitterSearch:
await checkData(tw, config, conn)
elif config.TwitterSearch or config.Profile:
logme.debug(__name__ + ':Tweets:TwitterSearch')
await checkData(tweets, config, conn)
else:
This diff is collapsed.
@@ -91,7 +91,7 @@ def update(object, config):
"photos": Tweet.photos,
"video": Tweet.video,
"thumbnail": Tweet.thumbnail,
#"retweet": Tweet.retweet,
"retweet": Tweet.retweet,
"nlikes": int(Tweet.likes_count),
"nreplies": int(Tweet.replies_count),
"nretweets": int(Tweet.retweets_count),
@@ -100,11 +100,11 @@ def update(object, config):
"near": Tweet.near,
"geo": Tweet.geo,
"source": Tweet.source,
#"user_rt_id": Tweet.user_rt_id,
#"user_rt": Tweet.user_rt,
#"retweet_id": Tweet.retweet_id,
"user_rt_id": Tweet.user_rt_id,
"user_rt": Tweet.user_rt,
"retweet_id": Tweet.retweet_id,
"reply_to": Tweet.reply_to,
#"retweet_date": Tweet.retweet_date,
"retweet_date": Tweet.retweet_date,
"translate": Tweet.translate,
"trans_src": Tweet.trans_src,
"trans_dest": Tweet.trans_dest
@@ -53,7 +53,7 @@ def Csv(obj, config):
fieldnames, row = struct(obj, config.Custom[_obj_type], _obj_type)
base = addExt(config.Output, _obj_type, "csv")
dialect = 'excel-tab' if config.Tabs else 'excel'
dialect = 'excel-tab' if 'Tabs' in config.__dict__ else 'excel'
if not (os.path.exists(base)):
with open(base, "w", newline='', encoding="utf-8") as csv_file:
@@ -21,21 +21,21 @@ def tweetData(t):
"hashtags": t.hashtags,
"cashtags": t.cashtags,
"link": t.link,
# "retweet": t.retweet,
"retweet": t.retweet,
"quote_url": t.quote_url,
"video": t.video,
"thumbnail": t.thumbnail,
"near": t.near,
"geo": t.geo,
"source": t.source,
# "user_rt_id": t.user_rt_id,
# "user_rt": t.user_rt,
# "retweet_id": t.retweet_id,
"user_rt_id": t.user_rt_id,
"user_rt": t.user_rt,
"retweet_id": t.retweet_id,
"reply_to": t.reply_to,
# "retweet_date": t.retweet_date,
"retweet_date": t.retweet_date,
"translate": t.translate,
"trans_src": t.trans_src,
"trans_dest": t.trans_dest
"trans_dest": t.trans_dest,
}
return data
from time import strftime, localtime
from datetime import datetime, timezone
import json
import logging as logme
from googletransx import Translator
@@ -22,33 +21,44 @@ def utc_to_local(utc_dt):
return utc_dt.replace(tzinfo=timezone.utc).astimezone(tz=None)
def getMentions(tw):
Tweet_formats = {
'datetime': '%Y-%m-%d %H:%M:%S %Z',
'datestamp': '%Y-%m-%d',
'timestamp': '%H:%M:%S'
}
def _get_mentions(tw):
"""Extract mentions from tweet
"""
logme.debug(__name__ + ':getMentions')
mentions = []
logme.debug(__name__ + ':get_mentions')
try:
for mention in tw['entities']['user_mentions']:
mentions.append(mention['screen_name'])
mentions = [
{
'screen_name': _mention['screen_name'],
'name': _mention['name'],
'id': _mention['id_str'],
} for _mention in tw['entities']['user_mentions']
if tw['display_text_range'][0] < _mention['indices'][0]
]
except KeyError:
mentions = []
return mentions
def getQuoteURL(tw):
"""Extract quote from tweet
"""
logme.debug(__name__ + ':getQuoteURL')
base_twitter = "https://twitter.com"
quote_url = ""
def _get_reply_to(tw):
try:
quote = tw.find("div", "QuoteTweet-innerContainer")
quote_url = base_twitter + quote.get("href")
except:
quote_url = ""
return quote_url
reply_to = [
{
'screen_name': _mention['screen_name'],
'name': _mention['name'],
'id': _mention['id_str'],
} for _mention in tw['entities']['user_mentions']
if tw['display_text_range'][0] > _mention['indices'][1]
]
except KeyError:
reply_to = []
return reply_to
def getText(tw):
@@ -63,107 +73,6 @@ def getText(tw):
return text
def getStat(tw, _type):
"""Get stats about Tweet
"""
logme.debug(__name__ + ':getStat')
st = f"ProfileTweet-action--{_type} u-hiddenVisually"
return tw.find("span", st).find("span")["data-tweet-stat-count"]
def getRetweet(tw, _config):
"""Get Retweet
"""
logme.debug(__name__ + ':getRetweet')
if _config.Profile:
if int(tw["data-user-id"]) != _config.User_id:
return _config.User_id, _config.Username
else:
_rt_object = tw.find('span', 'js-retweet-text')
if _rt_object:
_rt_id = _rt_object.find('a')['data-user-id']
_rt_username = _rt_object.find('a')['href'][1:]
return _rt_id, _rt_username
return '', ''
# def getThumbnail(tw):
# """Get Thumbnail
# """
# divs = tw.find_all("div", "PlayableMedia-player")
# thumb = ""
# for div in divs:
# thumb = div.attrs["style"].split("url('")[-1]
# thumb = thumb.replace("')", "")
# return thumb
# def Tweet(tw, config):
# """Create Tweet object
# """
# logme.debug(__name__+':Tweet')
# t = tweet()
# t.id = int(tw["data-item-id"])
# t.id_str = tw["data-item-id"]
# t.conversation_id = tw["data-conversation-id"]
# t.datetime = int(tw.find("span", "_timestamp")["data-time-ms"])
# t.datestamp = strftime("%Y-%m-%d", localtime(t.datetime/1000.0))
# t.timestamp = strftime("%H:%M:%S", localtime(t.datetime/1000.0))
# t.user_id = int(tw["data-user-id"])
# t.user_id_str = tw["data-user-id"]
# t.username = tw["data-screen-name"]
# t.name = tw["data-name"]
# t.place = tw.find("a","js-geo-pivot-link").text.strip() if tw.find("a","js-geo-pivot-link") else ""
# t.timezone = strftime("%z", localtime())
# for img in tw.findAll("img", "Emoji Emoji--forText"):
# img.replaceWith(img["alt"])
# t.mentions = getMentions(tw)
# t.urls = [link.attrs["data-expanded-url"] for link in tw.find_all('a',{'class':'twitter-timeline-link'}) if link.has_attr("data-expanded-url")]
# t.photos = [photo_node.attrs['data-image-url'] for photo_node in tw.find_all("div", "AdaptiveMedia-photoContainer")]
# t.video = 1 if tw.find_all("div", "AdaptiveMedia-video") != [] else 0
# t.thumbnail = getThumbnail(tw)
# t.tweet = getText(tw)
# t.lang = tw.find('p', 'tweet-text')['lang']
# t.hashtags = [hashtag.text for hashtag in tw.find_all("a","twitter-hashtag")]
# t.cashtags = [cashtag.text for cashtag in tw.find_all("a", "twitter-cashtag")]
# t.replies_count = getStat(tw, "reply")
# t.retweets_count = getStat(tw, "retweet")
# t.likes_count = getStat(tw, "favorite")
# t.link = f"https://twitter.com/{t.username}/status/{t.id}"
# t.user_rt_id, t.user_rt = getRetweet(tw, config)
# t.retweet = True if t.user_rt else False
# t.retweet_id = ''
# t.retweet_date = ''
# if not config.Profile:
# t.retweet_id = tw['data-retweet-id'] if t.user_rt else ''
# t.retweet_date = datetime.fromtimestamp(((int(t.retweet_id) >> 22) + 1288834974657)/1000.0).strftime("%Y-%m-%d %H:%M:%S") if t.user_rt else ''
# t.quote_url = getQuoteURL(tw)
# t.near = config.Near if config.Near else ""
# t.geo = config.Geo if config.Geo else ""
# t.source = config.Source if config.Source else ""
# t.reply_to = [{'user_id': t['id_str'], 'username': t['screen_name']} for t in json.loads(tw["data-reply-to-users-json"])]
# t.translate = ''
# t.trans_src = ''
# t.trans_dest = ''
# if config.Translate == True:
# try:
# ts = translator.translate(text=t.tweet, dest=config.TranslateDest)
# t.translate = ts.text
# t.trans_src = ts.src
# t.trans_dest = ts.dest
# # ref. https://github.com/SuniTheFish/ChainTranslator/blob/master/ChainTranslator/__main__.py#L31
# except ValueError as e:
# raise Exception("Invalid destination language: {} / Tweet: {}".format(config.TranslateDest, t.tweet))
# logme.debug(__name__+':Tweet:translator.translate:'+str(e))
# return t
Tweet_formats = {
'datetime': '%Y-%m-%d %H:%M:%S %Z',
'datestamp': '%Y-%m-%d',
'timestamp': '%H:%M:%S'
}
def Tweet(tw, config):
"""Create Tweet object
"""
@@ -185,14 +94,10 @@ def Tweet(tw, config):
t.user_id_str = tw["user_id_str"]
t.username = tw["user_data"]['screen_name']
t.name = tw["user_data"]['name']
t.place = tw['geo'] if tw['geo'] else ""
t.place = tw['geo'] if 'geo' in tw and tw['geo'] else ""
t.timezone = strftime("%z", localtime())
# for img in tw.findAll("img", "Emoji Emoji--forText"):
# img.replaceWith(img["alt"])
try:
t.mentions = [_mention['screen_name'] for _mention in tw['entities']['user_mentions']]
except KeyError:
t.mentions = []
t.mentions = _get_mentions(tw)
t.reply_to = _get_reply_to(tw)
try:
t.urls = [_url['expanded_url'] for _url in tw['entities']['urls']]
except KeyError:
@@ -216,21 +121,27 @@ def Tweet(tw, config):
t.hashtags = [hashtag['text'] for hashtag in tw['entities']['hashtags']]
except KeyError:
t.hashtags = []
# don't know what this is
t.cashtags = [cashtag['text'] for cashtag in tw['entities']['symbols']]
try:
t.cashtags = [cashtag['text'] for cashtag in tw['entities']['symbols']]
except KeyError:
t.cashtags = []
t.replies_count = tw['reply_count']
t.retweets_count = tw['retweet_count']
t.likes_count = tw['favorite_count']
t.link = f"https://twitter.com/{t.username}/status/{t.id}"
# TODO: someone who is familiar with this code, needs to take a look at what this is
# t.user_rt_id, t.user_rt = getRetweet(tw, config)
# t.retweet = True if t.user_rt else False
# t.retweet_id = ''
# t.retweet_date = ''
# if not config.Profile:
# t.retweet_id = tw['data-retweet-id'] if t.user_rt else ''
# t.retweet_date = datetime.fromtimestamp(((int(t.retweet_id) >> 22) + 1288834974657) / 1000.0).strftime(
# "%Y-%m-%d %H:%M:%S") if t.user_rt else ''
try:
if 'user_rt_id' in tw['retweet_data']:
t.retweet = True
t.retweet_id = tw['retweet_data']['retweet_id']
t.retweet_date = tw['retweet_data']['retweet_date']
t.user_rt = tw['retweet_data']['user_rt']
t.user_rt_id = tw['retweet_data']['user_rt_id']
except KeyError:
t.retweet = False
t.retweet_id = ''
t.retweet_date = ''
t.user_rt = ''
t.user_rt_id = ''
try:
t.quote_url = tw['quoted_status_permalink']['expanded'] if tw['is_quote_status'] else ''
except KeyError:
@@ -239,13 +150,10 @@ def Tweet(tw, config):
t.near = config.Near if config.Near else ""
t.geo = config.Geo if config.Geo else ""
t.source = config.Source if config.Source else ""
# TODO: check this whether we need the list of all the users to whom this tweet is a reply or we only need
# the immediately above user id
t.reply_to = {'user_id': tw['in_reply_to_user_id_str'], 'username': tw['in_reply_to_screen_name']}
t.translate = ''
t.trans_src = ''
t.trans_dest = ''
if config.Translate == True:
if config.Translate:
try:
ts = translator.translate(text=t.tweet, dest=config.TranslateDest)
t.translate = ts.text
@@ -253,6 +161,6 @@ def Tweet(tw, config):
t.trans_dest = ts.dest
# ref. https://github.com/SuniTheFish/ChainTranslator/blob/master/ChainTranslator/__main__.py#L31
except ValueError as e:
raise Exception("Invalid destination language: {} / Tweet: {}".format(config.TranslateDest, t.tweet))
logme.debug(__name__ + ':Tweet:translator.translate:' + str(e))
raise Exception("Invalid destination language: {} / Tweet: {}".format(config.TranslateDest, t.tweet))
return t
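The commented-out legacy path above derived `retweet_date` from the ID itself: Twitter snowflake IDs embed a millisecond timestamp in their upper bits, offset from the Twitter epoch 1288834974657 (2010-11-04 UTC). A worked sketch of that formula:

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # 2010-11-04 01:42:54.657 UTC

def snowflake_to_datetime(tweet_id: int) -> datetime:
    # The bits above the 22-bit shift hold milliseconds since the Twitter
    # epoch -- the same arithmetic as the commented-out legacy code.
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

print(snowflake_to_datetime(1212092628029698048))  # prints the creation time in UTC
```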
@@ -5,7 +5,6 @@ from urllib.parse import urlencode
from urllib.parse import quote
mobile = "https://mobile.twitter.com"
# base = "https://twitter.com/i"
base = "https://api.twitter.com/2/search/adaptive.json"
@@ -65,18 +64,6 @@ async def MobileProfile(username, init):
return url
async def Profile(username, init):
logme.debug(__name__ + ':Profile')
url = f"{base}/profiles/show/{username}/timeline/tweets?include_"
url += "available_features=1&lang=en&include_entities=1"
url += "&include_new_items_bar=true"
if init != '-1':
url += f"&max_position={init}"
return url
async def Search(config, init):
logme.debug(__name__ + ':Search')
url = base
@@ -123,7 +110,7 @@ async def Search(config, init):
q += f" geocode:{config.Geo}"
if config.Search:
q += f"{config.Search}"
q += f" {config.Search}"
if config.Year:
q += f" until:{config.Year}-1-1"
if config.Since:
@@ -173,17 +160,18 @@ async def Search(config, init):
if config.Custom_query:
q = config.Custom_query
q = q.strip()
params.append(("q", q))
_serialQuery = _sanitizeQuery(url, params)
return url, params, _serialQuery
# maybe dont need this
async def SearchProfile(config, init=None):
def SearchProfile(config, init=None):
logme.debug(__name__ + ':SearchProfile')
_url = 'https://api.twitter.com/2/timeline/profile/{}.json?'
q = ""
_url = 'https://api.twitter.com/2/timeline/profile/{user_id}.json'.format(user_id=config.User_id)
tweet_count = 100
params = [
# some of these fields are probably not required; need to test which ones can be dropped
('include_profile_interstitial_type', '1'),
('include_blocking', '1'),
('include_blocked_by', '1'),
@@ -205,14 +193,12 @@ async def SearchProfile(config, init=None):
('include_ext_media_availability', 'true'),
('send_error_codes', 'true'),
('simple_quoted_tweet', 'true'),
('include_tweet_replies', 'false'),
('count', '50'),
('userId', '1934388686'),
('ext', 'mediaStats,ChighlightedLabel'),
('include_tweet_replies', 'true'),
('count', tweet_count),
('ext', 'mediaStats%2ChighlightedLabel'),
]
if init:
params.append(('cursor', init))
if type(init) == str:
params.append(('cursor', str(init)))
_serialQuery = _sanitizeQuery(_url, params)
return _url, params, _serialQuery
pass
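`_sanitizeQuery` is defined elsewhere in url.py; a plausible stand-in is a plain `urlencode` over the parameter list, shown here only to illustrate the final request shape:

```python
from urllib.parse import urlencode

def _sanitize_query(base_url, params):
    # Illustrative stand-in for url._sanitizeQuery: serialize the
    # (key, value) pairs onto the endpoint URL.
    return base_url + '?' + urlencode(params)

_url = 'https://api.twitter.com/2/timeline/profile/1934388686.json'
print(_sanitize_query(_url, [('include_tweet_replies', 'true'), ('count', 100)]))
# ...profile/1934388686.json?include_tweet_replies=true&count=100
```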
@@ -2,12 +2,13 @@ import datetime
import logging as logme
class User:
class user:
type = "user"
def __init__(self):
pass
User_formats = {
'join_date': '%Y-%m-%d',
'join_time': '%H:%M:%S %Z'
@@ -21,31 +22,31 @@ def User(ur):
msg = 'malformed json! cannot be parsed to get user data'
logme.fatal(msg)
raise KeyError(msg)
_usr = User()
_usr = user()
_usr.id = ur['data']['user']['rest_id']
_usr.name = ur['data']['user']['rest_id']['legacy']['name']
_usr.username = ur['data']['user']['rest_id']['legacy']['screen_name']
_usr.bio = ur['data']['user']['rest_id']['legacy']['description']
_usr.location = ur['data']['user']['rest_id']['legacy']['location']
_usr.url = ur['data']['user']['rest_id']['legacy']['screen_name']['url']
_usr.name = ur['data']['user']['legacy']['name']
_usr.username = ur['data']['user']['legacy']['screen_name']
_usr.bio = ur['data']['user']['legacy']['description']
_usr.location = ur['data']['user']['legacy']['location']
_usr.url = ur['data']['user']['legacy']['url']
# parsing date to user-friendly format
_dt = ur['data']['user']['rest_id']['legacy']['created_at']
_dt = ur['data']['user']['legacy']['created_at']
_dt = datetime.datetime.strptime(_dt, '%a %b %d %H:%M:%S %z %Y')
# date is of the format year,
_usr.join_date = _dt.strftime(User_formats['join_date'])
_usr.join_time = _dt.strftime(User_formats['join_time'])
# :type `int`
_usr.tweets = int(ur['data']['user']['rest_id']['legacy']['statuses_count'])
_usr.following = int(ur['data']['user']['rest_id']['legacy']['friends_count'])
_usr.followers = int(ur['data']['user']['rest_id']['legacy']['followers_count'])
_usr.likes = int(ur['data']['user']['rest_id']['legacy']['favourites_count'])
_usr.media_count = int(ur['data']['user']['rest_id']['legacy']['media_count'])
_usr.is_private = ur['data']['user']['rest_id']['legacy']['protected']
_usr.is_verified = ur['data']['user']['rest_id']['legacy']['verified']
_usr.avatar = ur['data']['user']['rest_id']['legacy']['profile_image_url_https']
_usr.background_image = ur['data']['user']['rest_id']['legacy']['profile_banner_url']
_usr.tweets = int(ur['data']['user']['legacy']['statuses_count'])
_usr.following = int(ur['data']['user']['legacy']['friends_count'])
_usr.followers = int(ur['data']['user']['legacy']['followers_count'])
_usr.likes = int(ur['data']['user']['legacy']['favourites_count'])
_usr.media_count = int(ur['data']['user']['legacy']['media_count'])
_usr.is_private = ur['data']['user']['legacy']['protected']
_usr.is_verified = ur['data']['user']['legacy']['verified']
_usr.avatar = ur['data']['user']['legacy']['profile_image_url_https']
_usr.background_image = ur['data']['user']['legacy']['profile_banner_url']
# TODO : future implementation
# legacy_extended_profile is also available in some cases which can be used to get DOB of user
return _usr
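The parser above expects the profile fields under `data.user.legacy`, with only `rest_id` at the top level. An illustrative fragment of the GraphQL response, limited to the fields read above (values are examples):

```python
# Illustrative fragment of the UserByScreenName response consumed by User();
# only the fields read above are shown, with example values.
ur = {
    "data": {"user": {
        "rest_id": "783214",
        "legacy": {
            "name": "Twitter",
            "screen_name": "Twitter",
            "description": "What's happening?!",
            "location": "everywhere",
            "url": "https://t.co/example",
            "created_at": "Tue Feb 20 14:35:54 +0000 2007",
            "statuses_count": 15047,
            "friends_count": 11,
            "followers_count": 61000000,
            "favourites_count": 6217,
            "media_count": 2290,
            "protected": False,
            "verified": True,
            "profile_image_url_https": "https://pbs.twimg.com/profile_images/example.jpg",
            "profile_banner_url": "https://pbs.twimg.com/profile_banners/example",
        },
    }}
}
```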